Welcome To Datagrunt
Streamline your CSV workflows with intelligent delimiter inference, multiple processing engines, and AI-powered analysis.
Datagrunt is a Python library designed to simplify the way you work with CSV files. It provides a streamlined approach to reading, processing, and transforming your data into various formats, making data manipulation efficient and intuitive.
Why Datagrunt?
Born out of real-world frustration, Datagrunt eliminates the need for repetitive coding when handling CSV files. Whether you’re a data analyst, data engineer, or data scientist, Datagrunt empowers you to focus on insights, not tedious data wrangling.
What Datagrunt Is Not
Datagrunt is not an extension of or a replacement for DuckDB or Polars, nor is it a comprehensive, one-stop shop for all of your CSV processing needs. Instead, it simplifies the way you work with CSV files and addresses the pain point of inferring delimiters when a file’s structure is unknown.
Key Features
Datagrunt automatically detects and applies the correct delimiter for your CSV files.
Full support for both string paths and pathlib.Path objects for modern, cross-platform file handling.
Choose from three engines (DuckDB, Polars, and PyArrow) to handle your data processing needs.
Easily convert your processed CSV data into various formats including CSV, Excel, JSON, JSONL, and Parquet.
Use Google’s Gemini models to automatically generate detailed schema reports for your CSV files.
Enjoy a clean and intuitive API that integrates seamlessly into your existing Python workflows.
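To make the idea of delimiter inference concrete, here is a minimal sketch built on the standard library's `csv.Sniffer`. It is not Datagrunt's implementation or API: the function names `infer_delimiter_from_text` and `infer_delimiter` and the candidate delimiter set are assumptions chosen for illustration. It also shows one way to accept both string paths and `pathlib.Path` objects.

```python
import csv
from pathlib import Path
from typing import Union

# Hypothetical candidate set for this sketch; real files may use others.
CANDIDATE_DELIMITERS = ",;|\t"


def infer_delimiter_from_text(sample: str) -> str:
    """Guess the delimiter from a text sample using csv.Sniffer.

    Raises csv.Error if the sniffer cannot settle on a delimiter.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS)
    return dialect.delimiter


def infer_delimiter(path: Union[str, Path], sample_chars: int = 4096) -> str:
    """Guess the delimiter of a file given as a str or pathlib.Path."""
    path = Path(path)  # normalizes both accepted input types
    with path.open(newline="", encoding="utf-8") as handle:
        sample = handle.read(sample_chars)  # only a leading sample is read
    return infer_delimiter_from_text(sample)
```

Reading only a leading sample keeps the check cheap on large files, at the cost of missing irregularities that appear later in the file.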
Powertools Under The Hood
DuckDB: A fast in-process analytical database with excellent SQL support, well suited to complex queries and analytics workloads.
Polars: A multi-threaded DataFrame library written in Rust, built for speed and memory efficiency.
PyArrow: Python bindings for Apache Arrow that provide efficient columnar data processing and integrate with the wider Arrow ecosystem.
Google Gemini: A family of generative AI models used for schema analysis and data type detection.
Engine Comparison
| Feature | Polars | DuckDB | PyArrow |
|---|---|---|---|
| Best for | DataFrame operations | SQL queries & analytics | Arrow ecosystem integration |
| Performance | Fast in-memory processing | Excellent for large datasets | Optimized columnar operations |
| Default for | CSVReader | CSVWriter | - |
| Export Quality | Good | Excellent (especially JSON) | Native Parquet support |
Datagrunt’s Role
Datagrunt’s Primary Functions:
- Accurately inferring CSV delimiters
- Providing helper methods for common data tasks
- Facilitating CSV file loading into Polars dataframes
- Enabling conversion to various output formats
- Generating AI-powered schema reports
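As a rough illustration of what converting delimited data to another output format involves, the sketch below turns delimited text into JSON Lines using only the standard library. It does not use Datagrunt or any of its engines; `delimited_to_jsonl` is a hypothetical helper, and it assumes the first row is a header.

```python
import csv
import io
import json


def delimited_to_jsonl(text: str, delimiter: str = ",") -> str:
    """Convert delimited text with a header row into JSON Lines.

    Every value stays a string; no type inference is attempted here.
    """
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    # One JSON object per input row, joined by newlines.
    return "\n".join(json.dumps(row) for row in reader)


if __name__ == "__main__":
    print(delimited_to_jsonl("id;name\n1;Ada\n2;Linus\n", delimiter=";"))
```

A dedicated engine would typically also infer column types and stream large files rather than building the whole output in memory.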
Flexibility and Integration
- Use the to_pandas method to convert dataframes when needed
- Integrate Datagrunt’s output in other contexts within your applications
- Leverage only specific features (e.g., delimiter inference) as needed
- Work alongside your existing data processing tools without restrictions
License
This project is licensed under the MIT License.
Acknowledgements
A HUGE thank you to the open source community and the creators of DuckDB, Polars, and PyArrow for their fantastic libraries that power Datagrunt.