Datagrunt Documentation

This documentation provides comprehensive guidance for installing, using, and understanding Datagrunt - a Python library designed to simplify the way you work with CSV files.

Installation

You can get started with Datagrunt in seconds using either UV (recommended) or pip.

Get started with UV:

uv pip install datagrunt

Get started with pip:

pip install datagrunt

Quick Start

Reading CSV Files with Multiple Engine Options

from datagrunt import CSVReader
from pathlib import Path

# Load your CSV file with different engines
# Accepts both string paths and Path objects
csv_file = 'electric_vehicle_population_data.csv'
csv_path = Path('electric_vehicle_population_data.csv')

# Choose your engine: 'polars' (default), 'duckdb', or 'pyarrow'
reader_polars = CSVReader(csv_file, engine='polars')    # String path - fast DataFrame ops
reader_duckdb = CSVReader(csv_path, engine='duckdb')    # Path object - best for SQL queries
reader_pyarrow = CSVReader(csv_file, engine='pyarrow')  # Arrow ecosystem integration

# Get a sample of the data
reader_duckdb.get_sample()

DuckDB Integration for Performant SQL Queries

from datagrunt import CSVReader

# Set up DuckDB engine for SQL capabilities
dg = CSVReader('electric_vehicle_population_data.csv', engine='duckdb')

# Construct your SQL query using the auto-generated table name
query = f"""
WITH core AS (
    SELECT
        City AS city,
        "VIN (1-10)" AS vin
    FROM {dg.db_table}
)
SELECT
    city,
    COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""

# Execute the query and get results as a Polars DataFrame
df = dg.query_data(query).pl()
print(df)

AI-Powered Schema Analysis

from datagrunt import CSVSchemaReportAIGenerated
from pathlib import Path
import os

# Generate detailed schema reports with AI (accepts both strings and Path objects)
api_key = os.environ.get("GEMINI_API_KEY")
data_file = Path('your_data.csv')

schema_analyzer = CSVSchemaReportAIGenerated(
    filepath=data_file,  # Path object works seamlessly
    engine='google',
    api_key=api_key
)

# Get comprehensive schema analysis
report = schema_analyzer.generate_csv_schema_report(
    model='gemini-2.5-flash',
    return_json=True
)

print(report)  # Detailed JSON schema with data types, classifications, and more

Primary Classes

Datagrunt provides three primary classes for interacting with data: CSVReader, CSVWriter, and CSVSchemaReportAIGenerated. These classes are designed to simplify the process of reading, writing, and analyzing CSV files.

CSVReader

The CSVReader class is used to read data from a CSV file. It accepts both string paths and pathlib.Path objects, providing a simple interface for reading data from a CSV file and converting it into various formats.

from datagrunt import CSVReader
from pathlib import Path

reader = CSVReader('path/to/file.csv')          # String path
reader = CSVReader(Path('path/to/file.csv'))    # Path object

You may optionally specify the engine to use for reading the CSV file. The three options are polars (default), duckdb, and pyarrow.

reader = CSVReader(Path('path/to/file.csv'), engine='duckdb')   # DuckDB engine
reader = CSVReader('path/to/file.csv', engine='pyarrow')        # PyArrow engine
reader = CSVReader(Path('path/to/file.csv'))                    # Default: Polars engine

Primary Methods

  • get_sample(normalize_columns=False): Returns a sample of the data in the CSV file (20 rows).
  • to_dataframe(normalize_columns=False): Converts the data in the CSV file into a Polars DataFrame.
  • to_arrow_table(normalize_columns=False): Converts the data in the CSV file into a PyArrow Table.
  • to_dicts(normalize_columns=False): Converts the data in the CSV file into a list of dictionaries.
  • query_data(sql_query, normalize_columns=False): Executes a SQL query on the data in the CSV file.

CSVWriter

The CSVWriter class is used to convert and export CSV data to various file formats. It accepts both string paths and pathlib.Path objects and supports three engines: duckdb (default), polars, and pyarrow.

from datagrunt import CSVWriter
from pathlib import Path

writer = CSVWriter(Path('path/to/file.csv'), engine='duckdb')   # Default: DuckDB engine
writer = CSVWriter('path/to/file.csv', engine='polars')        # Polars engine
writer = CSVWriter(Path('path/to/file.csv'), engine='pyarrow') # PyArrow engine

Primary Methods

  • write_csv(out_filename=None, normalize_columns=False): Writes the data to a new CSV file.
  • write_excel(out_filename=None, normalize_columns=False): Writes the data to an Excel file.
  • write_json(out_filename=None, normalize_columns=False): Writes the data to a JSON file.
  • write_json_newline_delimited(out_filename=None, normalize_columns=False): Writes the data to a newline-delimited JSON file.
  • write_parquet(out_filename=None, normalize_columns=False): Writes the data to a Parquet file.

CSVSchemaReportAIGenerated

The CSVSchemaReportAIGenerated class generates detailed schema reports for CSV files using AI. It provides a simple interface for analyzing CSV file structure and data types.

It is currently configured to work only with Google's Gemini models. You can authenticate either with an api_key or through Vertex AI by passing the following parameters:

  • vertexai=True
  • gcp_project='my-gcp-project-id'
  • gcp_location='global' or a supported Google Cloud region such as 'us-central1'

Primary Method

  • generate_csv_schema_report(model, prompt=None, system_instructions=None, return_json=False): Generates a comprehensive schema report for a CSV file.
    • model: Any supported Gemini model. No default model is set by design; see Google's documentation for the list of available Gemini models.
    • prompt: Optional custom prompt. Uses a default prompt if none is provided.
    • system_instructions: Optional system instructions. Uses default instructions if none are provided.
    • return_json: Returns a Python dict by default (False). Set to True to return a formatted JSON string.