Introducing Datagrunt: Your New CSV Power Tool

March 6, 2025 by Martin Graham

No matter how many sophisticated data processing tools modern technology has produced, one thing is still true: CSV files remain the most popular format for sharing data throughout organizations and even in the public domain.

Despite their ubiquity, CSV files can provide some of the most frustrating data processing experiences you will ever encounter. Yes, a CSV standard (RFC 4180) exists and provides a baseline for formatting CSV files. However, as I'm sure many others will attest, it's common to encounter CSV files that only mostly follow the standard. That word "mostly" is critical, because the slightest deviation from the standard often causes countless headaches when reading and parsing CSV files with code. Ask any data engineer or data scientist about it, and be prepared to hang out for a while during the venting session that follows.

The Story and the Inspiration For Creating Datagrunt

Throughout my tenure at Google as a customer-facing data engineer, nearly every data project I've touched has involved CSV files in some capacity, and usually CSVs made up the bulk of the file count. During one engagement with a healthcare customer, we faced a scenario in which they received CSV files from over one thousand different partner facilities daily. All of the files contained the same type of information, but they arrived in different formats: the delimiter was not guaranteed to be consistent, even across files shared by the same facility.

Given that problem statement, the question became: how do we solve this problem at scale, using the tools the customer was already familiar with, so that data processing can be automated? The answer was to create a data processing module in Python that could automatically determine CSV delimiters on the fly. That project became the inspiration for Datagrunt.
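
Datagrunt's real inference logic is more robust than this, but a minimal sketch of the general idea, using only Python's standard library and a hypothetical helper name, looks something like the following:

import csv

def sniff_delimiter(csv_path, candidates=',;|\t'):
    """Guess a CSV file's delimiter from a sample of its contents.

    Illustrative only; Datagrunt's actual algorithm is more robust.
    """
    with open(csv_path, newline='') as f:
        sample = f.read(8192)  # sample the first few KB of the file
    try:
        # csv.Sniffer looks for a dialect used consistently in the sample
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # fall back to whichever candidate character appears most often
        return max(candidates, key=sample.count)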

One weekend during the summer of 2024, I fired up my personal MacBook Air at my home workstation, opened a blank text editor, and started typing. Two days later I had the core of Datagrunt. This version had brand-new, far more robust custom algorithms than anything I'd written before. The code was modular and decently well organized into classes, and, most importantly, it was deployable to PyPI and could be pip installed successfully.

Data Analysis for Everyone

Datagrunt isn’t just for data engineers and data scientists; it’s designed for anyone who wants to understand their data better. Whether you’re a seasoned analyst or a curious explorer, Datagrunt empowers you to dive into your data without the usual headaches. Forget tedious manual inspections and complicated setups — Datagrunt streamlines the entire process, allowing you to focus on uncovering valuable insights.

Key Features of Datagrunt

You can look forward to a follow-up post on the design decisions behind Datagrunt, including why I made certain choices for the library, particularly the decision to build on DuckDB and Polars. That said, here are the key features:

  • Intelligent Delimiter Inference: Datagrunt automatically detects and applies the correct delimiter for your CSV files.
  • Seamless Data Processing: Leverage the robust capabilities of DuckDB and Polars to perform advanced data processing tasks directly on your CSV data.
  • Flexible Transformation: Easily convert your processed CSV data into various formats to suit your needs.
  • Pythonic API: Enjoy a clean and intuitive API that integrates seamlessly into your existing Python workflows.

Usage Examples

For the most part, I use Datagrunt to automatically infer the delimiter of a CSV file so that I can perform minimal transformations and load the data into a database (usually Google Cloud BigQuery). Often I don't perform any transformations at all; I simply use Datagrunt as a preprocessor that surfaces the CSV's attributes, giving me the information required to load the file(s) into BigQuery. (To be clear, Datagrunt doesn't interface with any database other than DuckDB as of now; the assumption is that Datagrunt produces output, or CSV file attributes, used for downstream processing.) Here is an example of what that looks like:

from datagrunt import CSVReader

csv_file = 'path_to_my_file.csv'
dg = CSVReader(csv_file)
print(dg.delimiter)

# Convert to a dataframe
df = dg.to_dataframe()
print(df.head())
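
The inferred attributes can then feed the BigQuery load step described above. Here is a hedged sketch of that hand-off, assuming the google-cloud-bigquery client library is installed with credentials configured; the table name is a placeholder:

from datagrunt import CSVReader
from google.cloud import bigquery

csv_file = 'path_to_my_file.csv'
dg = CSVReader(csv_file)

client = bigquery.Client()  # assumes application default credentials
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=dg.delimiter,  # delimiter inferred by Datagrunt
    skip_leading_rows=1,           # skip the header row
    autodetect=True,               # let BigQuery infer the schema
)

with open(csv_file, 'rb') as f:
    load_job = client.load_table_from_file(
        f, 'my_project.my_dataset.my_table', job_config=job_config
    )
load_job.result()  # block until the load job completes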

There is also a built-in feature to normalize column names, making them friendly for loading into a database or simply easier to read:

from datagrunt import CSVReader

csv_file = 'path_to_my_file.csv'
dg = CSVReader(csv_file)
print(dg.delimiter) # many other attributes are available beyond the delimiter
print(dg.columns)
print(dg.columns_normalized)
print(dg.columns_to_normalized_mapping) # returns a dict of original to normalized column names

# Convert to a dataframe
df = dg.to_dataframe(normalize_columns=True) # optional param, False by default
print(df.head())

df_pandas = df.to_pandas() # converts the Polars DataFrame to pandas, assuming pandas is installed in your environment
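
I won't claim these are Datagrunt's exact normalization rules, but the general technique behind database-friendly column names is easy to illustrate with a hypothetical helper:

import re

def normalize_column(name):
    """Lowercase a column name and collapse runs of non-alphanumeric
    characters into underscores. Illustrative only; Datagrunt's
    actual rules may differ."""
    return re.sub(r'[^0-9a-z]+', '_', name.strip().lower()).strip('_')

print(normalize_column('VIN (1-10)'))  # vin_1_10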

You can also convert CSV files to other popular formats:

from datagrunt import CSVWriter

csv_file = 'path_to_my_file.csv'
dw = CSVWriter(csv_file)
print(dw.delimiter) # same attributes available as in the CSVReader class

dw.write_json()
dw.write_json(out_filename='my_new_file.json') # optionally set the name of the output file
dw.write_json(normalize_columns=True) # The normalize_columns=True param is available for any of the write methods
dw.write_parquet('my_file.parquet')
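
To spot-check a conversion, the Parquet output can be read back with Polars (Datagrunt already uses Polars, so it should be available in your environment):

import polars as pl

# read the Parquet output back to confirm the conversion round-trips
df_check = pl.read_parquet('my_file.parquet')
print(df_check.head())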

Finally, one of the most powerful features in Datagrunt is the ability to write SQL queries directly against the data from the CSV file, powered by DuckDB:

from datagrunt import CSVReader

csv_file = 'electric_vehicle_population_data.csv'
engine = 'duckdb'
dg = CSVReader(csv_file, engine=engine)

# Construct your SQL query
query = f"""
WITH core AS (
    SELECT
        City AS city,
        "VIN (1-10)" AS vin
    FROM {dg.db_table}
)
SELECT
    city,
    COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""

# Execute the query and get results as a Polars DataFrame
df = dg.query_data(query).pl()
print(df)
┌────────────────┬───────────────┐
│ city           ┆ vehicle_count │
│ ---            ┆ ---           │
│ str            ┆ i64           │
╞════════════════╪═══════════════╡
│ Seattle        ┆ 32602         │
│ Bellevue       ┆ 9960          │
│ Redmond        ┆ 7165          │
│ Vancouver      ┆ 7081          │
│ Bothell        ┆ 6602          │
│ …              ┆ …             │
│ Glenwood       ┆ 1             │
│ Walla Walla Co ┆ 1             │
│ Pittsburg      ┆ 1             │
│ Decatur        ┆ 1             │
│ Redwood City   ┆ 1             │
└────────────────┴───────────────┘
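
Under the hood this is standard DuckDB, so if you ever need to drop down a level, roughly the same query can be run against the raw file with DuckDB directly. A sketch, assuming only the duckdb package:

import duckdb

# read_csv_auto infers the delimiter and schema up front, similar in
# spirit to the preprocessing Datagrunt performs before querying
rel = duckdb.sql("""
    SELECT City AS city, COUNT("VIN (1-10)") AS vehicle_count
    FROM read_csv_auto('electric_vehicle_population_data.csv')
    GROUP BY 1
    ORDER BY 2 DESC
""")
print(rel.pl())  # .pl() yields a Polars DataFrame; .df() yields pandas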

Conclusion and Project URLs

Datagrunt is designed to simplify interacting with CSV files. It's intentionally lightweight and not meant to offer every kind of functionality under the sun. Awesome libraries like Pandas and Polars already provide a wealth of functionality that I see no need to replicate in Datagrunt, especially since Datagrunt can convert CSV files to DataFrames for both.

Project Links:

Contact: datagrunt@datagrunt.io

Please let me know if you have any feedback on Datagrunt by posting to the issue tracker or by sending an email to the address listed above. I'd also love to hear success stories if Datagrunt helps you solve a particular problem. Happy coding!