GitHub - pplonski/dataprep: DataPrep — The easiest way to prepare data in Python

DataPrep lets you prepare your data using a single library with a few lines of code.

Currently, you can use DataPrep to:

Collect data from common data sources (through dataprep.connector)
Do your exploratory data analysis (through dataprep.eda)
Clean and standardize data (through dataprep.clean)
...more modules are coming

Releases

DataPrep is released on PyPI and conda-forge.

Installation

pip install -U dataprep

Connector

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter API, Yelp Fusion API, Spotify API, DBLP API), making web data collection easy and efficient, without requiring advanced programming skills.

Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!

Connector offers several benefits:

A unified API: You can fetch data from many websites with one or two lines of code.

Auto Pagination: Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument _count) without getting into unnecessary detail about a specific pagination scheme.

Smart API request strategy: Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the _concurrency argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.
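
The pagination and concurrency ideas above can be illustrated with a small, library-independent sketch. This is not Connector's implementation: `fetch_page`, `fetch_all`, and `PAGE_SIZE` are hypothetical names, a fake in-memory API stands in for a real web service, and rate limiting is left out. Only the `_count` and `_concurrency` argument names come from Connector itself.

```python
import asyncio

PAGE_SIZE = 10  # hypothetical per-request limit imposed by the API


async def fetch_page(offset, limit):
    """Stand-in for one HTTP request; returns `limit` fake records."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return list(range(offset, offset + limit))


async def fetch_all(count, concurrency=1):
    """Collect `count` records by splitting the request into pages."""
    semaphore = asyncio.Semaphore(concurrency)  # cap on in-flight requests

    async def bounded(offset):
        limit = min(PAGE_SIZE, count - offset)  # the last page may be shorter
        async with semaphore:
            return await fetch_page(offset, limit)

    pages = await asyncio.gather(
        *(bounded(offset) for offset in range(0, count, PAGE_SIZE))
    )
    # gather returns results in submission order, so pages concatenate in order
    return [record for page in pages for record in page]


records = asyncio.run(fetch_all(count=25, concurrency=3))
print(len(records))
```

The caller only states how many results are wanted and how many requests may run at once; the page arithmetic stays inside the helper, which is the division of labour the `_count` and `_concurrency` arguments are described as providing.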

Connector uses configuration files to specify how to connect to each Web API. To connect to any of the APIs listed in the table below, you can fetch the most up-to-date version of its config file from our codebase with one line of code and use it right away!

Many websites in different domains are currently supported. These are some examples:

Category	Web API	Auth Method	Connector Config File(s)	Jupyter Notebook / Tutorial	Description
Social Media	Twitter	`OAuth2`	Twitter config file(s)	Twitter Jupyter Notebook	API endpoint for Tweets information retrieval.
Music	Spotify	`OAuth2`	Spotify config file(s)	Spotify tutorial	Comprehensive API for retrieving albums, artists, and tracks metadata.
Restaurants	Yelp	`Bearer Token`	Yelp config file(s)	Yelp Jupyter Notebook	Leading API to access restaurant information by location.
Science	DBLP	None	DBLP config file(s)	DBLP Jupyter Notebook	Open bibliographic API for computer science publications.
Social Media	YouTube	`API Key`	YouTube config file(s)	YouTube Jupyter Notebook	API for retrieving YouTube's content information.
Finance	Finnhub	`API Key`	Finnhub config file(s)	Finnhub Jupyter Notebook	Comprehensive API for financial, market, and economic data.
Music	Musixmatch	`API Key`	Musixmatch config file(s)	Coming soon	Leading API for searching music lyrics.
Weather	OpenWeatherMap	`API Key`	OpenWeatherMap config file(s)	Coming soon	API for retrieving current and historical weather data.
Lifestyle	Spoonacular	`API Key`	Spoonacular config file(s)	Coming soon	Recipe, food, and nutritional information API.

If you want to connect with a different web API, Connector is designed to be easy to extend. You just have to write a simple configuration file to support the new web API. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.
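
To make the idea concrete, here is a hypothetical configuration expressed as a Python dictionary, together with a helper that turns it into a request URL. The field names (`url`, `params`, `auth`, `pagination`), the `build_url` helper, and the example endpoint are assumptions made for illustration; they are not Connector's actual config schema, which is defined by the config files in the repository.

```python
from urllib.parse import urlencode

# Hypothetical config; field names are illustrative, not Connector's schema.
example_config = {
    "url": "https://api.example.com/v1/search",
    "params": {"q": True, "lang": False},  # parameter name -> required?
    "auth": {"type": "Bearer"},
    "pagination": {
        "offset_key": "offset",
        "limit_key": "limit",
        "max_count": 50,  # largest page the API accepts
    },
}


def build_url(config, offset=0, limit=20, **query):
    """Validate the query against the config and build the request URL."""
    missing = [name for name, required in config["params"].items()
               if required and name not in query]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    unknown = [name for name in query if name not in config["params"]]
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    page = config["pagination"]
    params = dict(query)
    params[page["offset_key"]] = offset
    params[page["limit_key"]] = min(limit, page["max_count"])  # clamp to API limit
    return f"{config['url']}?{urlencode(params)}"


print(build_url(example_config, q="data cleaning", limit=100))
```

Keeping the API's details in data rather than in code is what makes this kind of design easy to extend: supporting another API means writing another dictionary, not another function.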

The Examples page shows in detail how to use Connector to retrieve data from DBLP, Spotify, Yelp, and other sites, without taking an in-depth look at each web API's documentation.

EDA

DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.

Create Profile Reports, Fast

You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:

10-100X Faster: DataPrep.EDA is 10-100X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
Interactive Visualization: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
Big Data Support: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

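from dataprep.datasets import load_dataset
from dataprep.eda import create_report

# Load the bundled Titanic dataset as a DataFrame.
df = load_dataset("titanic")
# Build the profile report and open it in the default browser.
create_report(df).show_browser()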

Click here to see the generated report of the above code.

Innovative System Design

DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.

Task-Centric API Design: You can declaratively specify a wide range of EDA tasks in different granularities with a single function call. All needed visualizations will be automatically and intelligently generated for you.
Auto-Insights: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
How-to Guide (available soon): A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.
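
As an illustration of the kind of check that could sit behind an auto-insight such as "a column has many outliers", the sketch below flags a numeric column using the common 1.5 × IQR rule. The rule, the 5% threshold, the sample values, and the function names are assumptions for illustration; the text above does not state which criteria DataPrep.EDA actually uses.

```python
import statistics


def outlier_fraction(values):
    """Fraction of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return len(outliers) / len(values)


def outlier_insight(name, values, threshold=0.05):
    """Return a short message if the column exceeds the threshold, else None."""
    fraction = outlier_fraction(values)
    if fraction > threshold:
        return f"{name} has {fraction:.0%} outliers"
    return None


# Made-up values: mostly small numbers with two extreme ones.
fares = [7, 8, 8, 9, 10, 11, 12, 13, 14, 15, 250, 500]
print(outlier_insight("Fare", fares))
```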

Understand the Titanic dataset with Task-Centric API:
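
A sketch of what task-centric calls might look like on the Titanic dataset, using the `plot`, `plot_correlation`, and `plot_missing` functions named in this section. The column names `Age` and `Fare` are assumptions about the bundled dataset, and the calls are wrapped in a function so that nothing runs until DataPrep is installed and the function is invoked.

```python
def explore_titanic():
    """Run several EDA tasks, from coarse to fine granularity."""
    # Imported lazily so the sketch can be loaded without DataPrep installed.
    from dataprep.datasets import load_dataset
    from dataprep.eda import plot, plot_correlation, plot_missing

    df = load_dataset("titanic")
    return {
        "overview": plot(df),                    # every column at a glance
        "one_column": plot(df, "Age"),           # assumed column name
        "two_columns": plot(df, "Age", "Fare"),  # assumed column names
        "correlation": plot_correlation(df),     # correlations between columns
        "missing": plot_missing(df),             # missing-value patterns
    }
```

Each call takes the DataFrame plus zero, one, or two column names, so the granularity of the task is expressed through the arguments rather than through a different function per chart type.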

Click here to check all the supported tasks.

Check plot, plot_correlation, plot_missing and create_report to see how each function works.

Clean

DataPrep.Clean contains simple functions designed for cleaning and standardizing a column in a DataFrame. It provides:

A unified API: each function follows the syntax clean_{type}(df, "column name") (see an example below)
Python Data Science Support: its design for cleaning pandas and Dask DataFrames enables seamless integration into the Python data science workflow
Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning

The following example shows how to clean a column containing messy emails:
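
A minimal plain-Python stand-in for this kind of cleaning, including a small report of the changes made. It is not DataPrep's implementation: `clean_emails`, the validity pattern, and the sample values are assumptions for illustration. With DataPrep itself, the call would follow the `clean_{type}(df, "column name")` pattern described above.

```python
import re

# Deliberately simple pattern for illustration; real email rules are stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_emails(values):
    """Trim and lowercase valid emails; replace everything else with None."""
    cleaned = []
    report = {"unchanged": 0, "standardized": 0, "nulled": 0}
    for raw in values:
        candidate = raw.strip().lower() if isinstance(raw, str) else ""
        if not EMAIL_PATTERN.match(candidate):
            cleaned.append(None)          # not recognisable as an email
            report["nulled"] += 1
        elif candidate == raw:
            cleaned.append(candidate)     # already in standard form
            report["unchanged"] += 1
        else:
            cleaned.append(candidate)     # whitespace or casing was fixed
            report["standardized"] += 1
    return cleaned, report


messy = ["yi@gmali.com", " Jane.Doe@Example.org ", "hello", None, "a@b"]
cleaned, report = clean_emails(messy)
print(cleaned)
print(report)
```

Returning the report next to the cleaned values mirrors the transparency goal: the caller can see how many entries were left alone, standardized, or discarded.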

Type validation is also supported:
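
Validation can be sketched in the same spirit: a function that returns one boolean per value instead of modifying the data. The `validate_emails` helper and its simplified pattern are hypothetical and only illustrate the idea; they do not reproduce DataPrep's validation rules.

```python
import re

# Simplified on purpose; this is not a complete email grammar.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_emails(values):
    """Return True for each value that looks like an email, False otherwise."""
    return [
        isinstance(value, str) and bool(EMAIL_PATTERN.match(value.strip()))
        for value in values
    ]


flags = validate_emails(["abc.example.com", "prettyandsimple@example.com", None])
print(flags)
```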

Below are the supported semantic types (more are currently being developed).

Semantic Types
longitude/latitude
country
email
url
phone

For more information, refer to the User Guide.

Documentation

The following documentation can give you an impression of what DataPrep can do:

Contribute

There are many ways to contribute to DataPrep.

Submit bugs and help us verify fixes as they are checked in.
Review the source code changes.
Engage with other DataPrep users and developers on StackOverflow.
Help each other in the DataPrep Community Discord and Mail list & Forum.
Contribute bug fixes.
Provide use cases and write down your user experience.

Please take a look at our wiki for development documentation!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

Pandas Profiling: Inspired the report functionality and insights provided in dataprep.eda.
missingno: Inspired the missing value analysis in dataprep.eda.
