Character File Repair Utility

# Motivation

This engine demonstrates how a learning component can solve a problem that would otherwise require writing a custom parser for every field and every error in a dataset.

1. Reducing data preparation tasks

In machine learning, most of the time is spent on [janitorial tasks](http://www.pcworld.com/article/3047665/hottest-job-data-scientists-say-theyre-still-mostly-digital-janitors.html). The file repair utility handles some of that work.

2. Integration of learning components in software engineering

Writing parsers for every case, field, error, and dataset is NEITHER practical NOR scalable. We therefore propose a learning component: code that is written once and can generalize to new data without new code being written.

Character File Repair Utility

This utility repairs character-delimited files that have a misplaced delimiter, an arbitrary newline where there should not be one, or both. The utility leverages machine-learning approaches and can be run in either of two modes:

- Filter mode, where the utility removes structurally problematic records
- Repair mode, where the utility attempts to repair problematic records

In general, issues in character-delimited files are rooted in arbitrary delimiters or unexpected newlines. We additionally address encoding issues and non-ASCII characters. The learning algorithm we designed focuses on these issues.
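To make the two structural problems concrete, here is a minimal sketch of how they show up in a file. It is not the learning algorithm used by this utility: it is a simple counting heuristic, and the function name `flag_structural_anomalies` and the sample data are made up for illustration. It also ignores quoted fields, which a real parser would have to handle.

```python
from collections import Counter


def flag_structural_anomalies(lines, delimiter=","):
    """Return the most common delimiter count and the indices of lines that deviate from it."""
    counts = [line.count(delimiter) for line in lines]
    # Assumption: the most frequent delimiter count is the "correct" one.
    expected = Counter(counts).most_common(1)[0][0]
    flagged = [i for i, c in enumerate(counts) if c != expected]
    return expected, flagged


sample = [
    "id,name,city",
    "1,Ann Lee,Paris",
    "2,Bob",                  # stray newline inside the name field...
    "Smith,Berlin",           # ...so the rest of the record lands here
    "3,Cy Young,Rome,Italy",  # extra delimiter inside a field
    "4,Dee Day,Oslo",
]

expected, flagged = flag_structural_anomalies(sample)
print(expected, flagged)  # 2 [2, 3, 4]
```

Lines 2 and 3 are two halves of one record, and line 4 has one delimiter too many. These are the kinds of records that filter mode would drop and repair mode would try to fix.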

# Description

This engine repairs any character-delimited file using an approach based on anomaly detection and ensemble learning across the n features of a dataset. The file repair engine also provides a quantitative assessment of the data to help determine what kind of processing the data lends itself to.

# Features

- Filter or repair broken records automatically
- Scrub the data for non-ASCII characters and extra whitespace
- Quantitative assessment of the data processed
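The scrubbing feature can be pictured with a short sketch. It only illustrates the idea and does not show the code in `repair.py`; the function name `scrub` is hypothetical, and dropping non-ASCII characters outright (rather than transliterating them) is an assumption.

```python
import re


def scrub(text):
    """Drop non-ASCII characters, then collapse runs of spaces and tabs into one space."""
    # errors="ignore" silently discards every character that has no ASCII encoding.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"[ \t]+", " ", ascii_only).strip()


cleaned = scrub("  caf\u00e9   au \t lait ")
print(cleaned)  # caf au lait
```

Note that this kind of scrubbing is lossy: the accented character is removed, not replaced with a plain `e`.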

This engine can be applied in either filter-mode or repair-mode:

- Filter mode is a passive mode (recommended if data loss is acceptable)
- Repair mode is designed to minimize data loss
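The trade-off between the two modes can be sketched with a rule-based stand-in. This is not the utility's learning-based implementation: the function names, the choice to rejoin split records with a single space, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def expected_field_count(lines, delimiter=","):
    """Assume the most common field count is the correct one."""
    counts = [len(line.split(delimiter)) for line in lines]
    return Counter(counts).most_common(1)[0][0]


def filter_mode(lines, delimiter=","):
    """Passive: keep only records that already have the expected field count."""
    n = expected_field_count(lines, delimiter)
    return [line for line in lines if len(line.split(delimiter)) == n]


def repair_mode(lines, delimiter=","):
    """Rejoin records split by a stray newline; drop records that still do not fit."""
    n = expected_field_count(lines, delimiter)
    repaired, buffer = [], ""
    for line in lines:
        candidate = f"{buffer} {line}" if buffer else line
        fields = len(candidate.split(delimiter))
        if fields > n and buffer:
            # Joining overshot: discard the fragment and judge this line alone.
            candidate, fields = line, len(line.split(delimiter))
        if fields == n:
            repaired.append(candidate)
            buffer = ""
        elif fields < n:
            buffer = candidate  # too short: may continue on the next line
        else:
            buffer = ""         # too long: joining cannot fix an extra delimiter
    return repaired


sample = [
    "id,name,city",
    "1,Ann Lee,Paris",
    "2,Bob",
    "Smith,Berlin",
    "3,Cy Young,Rome,Italy",
    "4,Dee Day,Oslo",
]

kept = filter_mode(sample)
fixed = repair_mode(sample)
print(len(kept), len(fixed))  # 3 4
```

On this sample, filter mode loses both damaged records, while the repair sketch recovers the one that was split by a newline. The record with an extra delimiter is still lost here, which is the kind of case a learned model is meant to handle better than fixed rules.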

Example Usage


    import repair

    # Create a repair job for the broken file
    repairThread = repair.Repair('sample-broken.csv')

    # Start processing
    repairThread.start()

The output is contained in a folder called tmp.
