Skip to content

weiyixia/CSV-file-repair

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

64 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

#Motivation

This engine demonstrates how a learning component can serve as a solution to a problem that would otherwise have been addressed by developing custom parsers to every field and error associated with a dataset.

1. Reducing data preparation tasks

In machine learning most of time is spent on [janitorial tasks] (http://www.pcworld.com/article/3047665/hottest-job-data-scientists-say-theyre-still-mostly-digital-janitors.html]). The file repair utility handles some of that.

2. Integration of learning components in software engineering

Writing parsers for every case/field/error and dataset is NEITHER practical NOR scalable. So we propose a learning algorithm/component because it is code written once that can generalize without new code being written

Character File Repair Utility

This utility repairs character delimited files that have either a misplaced delimiter and or an arbitrary new line where it shouldn't. The utility leverages machine-learning approaches and can be run in either

  • Filter mode, where the utility removes structurally problematic records
  • Repair mode, where the utility will attempt to repair problematic records.

In general character delimited files issues are rooted in arbitrary delimiter or unexpected new lines. Additionally we address the issue of encoding and non-ascii character. The learning algorithm we design focuses on these issues.

<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script>

#Description

This engine repairs any character delimited file using an approach based on anomaly detection & ensemble learning across n-features in any dataset. The file repair engine also provides a quantitative assessment of the data to determine what kind of processing this data would lend itself to.

#Features:

  • Filter or Repair broken records automatically
  • Include scrubbing the data for non-ascii characters and extra whitespaces.
  • Quantitative assessment of the data processed

This engine can be applied in either filter-mode or repair-mode:

  • Filter mode is a passive mode (recommended if data loss is acceptable)
  • Repair mode is designed to minimize data loss

Example Usage

import repair repairThread = repair.Repair('sample-broken.csv') repairThread.start()

The output is contained in a folder called tmp.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published