Datagrunt Documentation
This documentation provides comprehensive guidance for installing, using, and understanding Datagrunt - a Python library designed to simplify the way you work with CSV files.
Installation
We recommend UV, but you can get started with Datagrunt in seconds using either UV or pip.
Get started with UV:
uv pip install datagrunt
Get started with pip:
pip install datagrunt
Quick Start
Reading CSV Files with Multiple Engine Options
from datagrunt import CSVReader
from pathlib import Path
# Load your CSV file with different engines
# Accepts both string paths and Path objects
csv_file = 'electric_vehicle_population_data.csv'
csv_path = Path('electric_vehicle_population_data.csv')
# Choose your engine: 'polars' (default), 'duckdb', or 'pyarrow'
reader_polars = CSVReader(csv_file, engine='polars') # String path - fast DataFrame ops
reader_duckdb = CSVReader(csv_path, engine='duckdb') # Path object - best for SQL queries
reader_pyarrow = CSVReader(csv_file, engine='pyarrow') # Arrow ecosystem integration
# Get a sample of the data
reader_duckdb.get_sample()
DuckDB Integration for Performant SQL Queries
from datagrunt import CSVReader
# Set up DuckDB engine for SQL capabilities
dg = CSVReader('electric_vehicle_population_data.csv', engine='duckdb')
# Construct your SQL query using the auto-generated table name
query = f"""
WITH core AS (
SELECT
City AS city,
"VIN (1-10)" AS vin
FROM {dg.db_table}
)
SELECT
city,
COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""
# Execute the query and get results as a Polars DataFrame
df = dg.query_data(query).pl()
print(df)
AI-Powered Schema Analysis
from datagrunt import CSVSchemaReportAIGenerated
from pathlib import Path
import os
# Generate detailed schema reports with AI (accepts both strings and Path objects)
api_key = os.environ.get("GEMINI_API_KEY")
data_file = Path('your_data.csv')
schema_analyzer = CSVSchemaReportAIGenerated(
filepath=data_file, # Path object works seamlessly
engine='google',
api_key=api_key
)
# Get comprehensive schema analysis
report = schema_analyzer.generate_csv_schema_report(
model='gemini-2.5-flash',
return_json=True
)
print(report) # Detailed JSON schema with data types, classifications, and more
Primary Classes
Datagrunt provides three primary classes for interacting with data: CSVReader, CSVWriter, and CSVSchemaReportAIGenerated. These classes are designed to simplify the process of reading, writing, and analyzing CSV files.
CSVReader
The CSVReader class is used to read data from a CSV file. It accepts both string paths and pathlib.Path objects, providing a simple interface for reading data from a CSV file and converting it into various formats.
from datagrunt import CSVReader
from pathlib import Path
reader = CSVReader('path/to/file.csv') # String path
reader = CSVReader(Path('path/to/file.csv')) # Path object
You may optionally specify the engine to use for reading the CSV file. The three options are polars (default), duckdb, and pyarrow.
reader = CSVReader(Path('path/to/file.csv'), engine='duckdb') # DuckDB engine
reader = CSVReader('path/to/file.csv', engine='pyarrow') # PyArrow engine
reader = CSVReader(Path('path/to/file.csv')) # Default: Polars engine
Primary Methods
- get_sample(normalize_columns=False): Returns a sample of the data in the CSV file (20 rows).
- to_dataframe(normalize_columns=False): Converts the data in the CSV file into a Polars DataFrame.
- to_arrow_table(normalize_columns=False): Converts the data in the CSV file into a PyArrow Table.
- to_dicts(normalize_columns=False): Converts the data in the CSV file into a list of dictionaries.
- query_data(sql_query, normalize_columns=False): Executes a SQL query on the data in the CSV file.
CSVWriter
The CSVWriter class is used to convert and export CSV data to various file formats. It accepts both string paths and pathlib.Path objects and supports three engines: duckdb (default), polars, and pyarrow.
from datagrunt import CSVWriter
from pathlib import Path
writer = CSVWriter(Path('path/to/file.csv'), engine='duckdb') # Default: DuckDB engine
writer = CSVWriter('path/to/file.csv', engine='polars') # Polars engine
writer = CSVWriter(Path('path/to/file.csv'), engine='pyarrow') # PyArrow engine
Primary Methods
- write_csv(out_filename=None, normalize_columns=False): Writes the data to a CSV file.
- write_excel(out_filename=None, normalize_columns=False): Writes the data to an Excel file.
- write_json(out_filename=None, normalize_columns=False): Writes the data to a JSON file.
- write_json_newline_delimited(out_filename=None, normalize_columns=False): Writes the data to a newline-delimited JSON file.
- write_parquet(out_filename=None, normalize_columns=False): Writes the data to a Parquet file.
CSVSchemaReportAIGenerated
The CSVSchemaReportAIGenerated class generates detailed schema reports for CSV files using AI. It provides a simple interface for analyzing CSV file structure and data types.
It is currently configured to work only with Google's Gemini models. You can authenticate either with an api_key or through Vertex AI by passing the following parameters:
- vertexai=True
- gcp_project=my-gcp-project-id
- gcp_location=global or a supported Google Cloud region such as us-central1
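A sketch of the Vertex AI configuration described above (the project ID and file name are placeholders; substitute your own values):

```python
from datagrunt import CSVSchemaReportAIGenerated

# Authenticate through Vertex AI instead of an API key.
# 'my-gcp-project-id' is a placeholder -- use your own GCP project ID.
schema_analyzer = CSVSchemaReportAIGenerated(
    filepath='your_data.csv',
    engine='google',
    vertexai=True,
    gcp_project='my-gcp-project-id',
    gcp_location='global',  # or a supported region such as 'us-central1'
)
```

This configuration relies on your ambient Google Cloud credentials (for example, those set up with the gcloud CLI) rather than an explicit API key.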
Primary Method
generate_csv_schema_report(model, prompt=None, system_instructions=None, return_json=False): Generates a comprehensive schema report for a CSV file.
- model: Any supported Gemini model. No default model is set by design. See available Google Gemini models.
- prompt: Optional custom prompt. Uses a default prompt if none provided.
- system_instructions: Optional system instructions. Uses default instructions if none provided.
- return_json: Returns a Python dict by default (False). Set to True to return a formatted JSON string.