# pyDAN: TAKING DATA TO ALGORITHMS

Author (contact): Anil Pratap Singh (singh.ap79@gm*l.*o*)

## What is it?

pyDAN is an ongoing attempt to make the power of SQL analytics available to data scientists and statisticians, and to minimize their dependence on database engineers. The idea is to bring all the tools at our disposal, namely SQL, Python, Java and Cython, into a single framework and give the user an unprecedented ability to pull, clean, merge, munge and pre-analyze data from multiple sources stored across several platforms.

## How do we do it?

The SQL standard is a high-level layer that provides a uniform (okay, nearly uniform) interface to a large number of database management systems. We propose to add one more layer (mostly in Python) that relieves the user from writing SQL code (or most of it):
      ## Set the data sources
      csv = pycsv_reader("mycsvfile.csv")
      remote = sql_reader("myurl", "myuname", "mypasswd")

      ## Read into the sqlite scratch area
      dset1 = csv.to_database(tabName="mycsvtab", database="mysqlite.db")
      dset2 = remote.to_database(tabName="myremotetab", database="mysqlite.db")

      ## Let us do a join
      data = dset1.join(resultTab="mymergedtab", d2=dset2,
                        varNames1=['*'], varNames2=['var2'],
                        on=["csv_var1=sql_var2"])

      ## Apply my business logic to var1, var2 and add the result in var_X
      data.addColumn("var_X", "number", "business_logic1", bizLogicFunc, argNames=['var1', 'var2'])

      ## Dump the results into a csv file
      data.to_csv("myresultdata.csv")

For every action requested on the data, pyDAN generates the appropriate SQL under the hood and executes it using the pysqlite library. The result tables (if any) are tracked each time, and they too become available in the program for further manipulation.
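To give a flavour of what happens under the hood, here is a minimal, self-contained sketch of the idea: a join request is turned into a SQL string and executed against a scratch sqlite database. It uses Python's built-in sqlite3 module, and the table and column names (t1, t2, csv_var1, sql_var2, var1, var2) are invented for the example; this is not pyDAN's actual code.

    import sqlite3

    # Hypothetical sketch of generating SQL for a join request and executing it.
    def run_join(conn, result_tab, t1, t2, cols1, cols2, on_clause):
        select_list = ", ".join(["%s.%s" % (t1, c) for c in cols1] +
                                ["%s.%s" % (t2, c) for c in cols2])
        sql = ("CREATE TABLE %s AS SELECT %s FROM %s JOIN %s ON %s"
               % (result_tab, select_list, t1, t2, on_clause))
        conn.execute(sql)      # sqlite executes the generated statement
        return result_tab      # the result table is tracked for later manipulation

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (csv_var1 INTEGER, var1 REAL)")
    conn.execute("CREATE TABLE t2 (sql_var2 INTEGER, var2 REAL)")
    conn.execute("INSERT INTO t1 VALUES (1, 10.0)")
    conn.execute("INSERT INTO t2 VALUES (1, 20.0)")
    run_join(conn, "mymergedtab", "t1", "t2", ["var1"], ["var2"],
             "t1.csv_var1 = t2.sql_var2")
    print(conn.execute("SELECT * FROM mymergedtab").fetchall())   # [(10.0, 20.0)]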

## What do I need to learn?

Beginner-level skills in Python and SQL are all you need to get started. Just know that at the heart of pyDAN is the pydset class, which provides a pipe to an underlying table in the sqlite scratch area; a pydset lets the user request manipulations and queries on that data (a minimal sketch of this pattern follows Table 1).

Table 1: A part of the pydset interface.
pydset.subset
pydset.join
pydset.union
pydset.unpivot
pydset.apply
pydset.aggregate
pydset.describe
pydset.sum/total/count
pydset.head/tail/show
pydset.mean/std/quantile/median
pydset.get1/get2
pydset.skeleton
pydset.to_numpy
pydset.to_csv
pydset.to_database
pydset.hist1D
pydset.graph1D
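To make the "pipe to an underlying table" idea concrete, here is a minimal illustrative sketch of the pattern, not pyDAN's implementation; the TablePipe name and its methods are made up for the example. An object of this kind only remembers a connection and a table name and generates SQL on demand, much as pydset does with the far richer interface listed in Table 1.

    import sqlite3

    class TablePipe:
        """Toy stand-in for the pydset idea: wrap a connection and a table name."""
        def __init__(self, conn, table):
            self.conn = conn
            self.table = table

        def count(self):
            # COUNT(*) over the wrapped table
            return self.conn.execute("SELECT COUNT(*) FROM %s" % self.table).fetchone()[0]

        def head(self, n=5):
            # First n rows, in the spirit of pydset.head/show
            return self.conn.execute("SELECT * FROM %s LIMIT %d" % (self.table, n)).fetchall()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (day TEXT, close REAL)")
    conn.executemany("INSERT INTO prices VALUES (?, ?)",
                     [("2012-11-05", 101.5), ("2012-11-06", 103.2)])
    pipe = TablePipe(conn, "prices")
    print(pipe.count())   # 2
    print(pipe.head(1))   # [('2012-11-05', 101.5)]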
## How do I start?

The pyDAN project is still in its infancy. The devel version can be obtained from this repository, but there is no guarantee that it will work. A more stable and tested beta version can be obtained on request from the author. The beta version is supplied as a single Python source file and can be made available in your analysis by doing:
    import sys
    sys.path.insert(0, r"C:\Users\SingAn22\Desktop\PYDAN\Beta")
    from pydan import *

## What can I expect in the future?

We are working on several fronts:

  • Enhancement of the supported formats for data input.
  • Enhancement of analytics library.
  • Enrichment of the SQL engine to automate generation of complex statements.
  • Enrichment of remote data processing abilities.
  • Enrichment of matplotlib interface for visualization.

## How can I contribute?

Well, the first obvious step is to request a beta version and start using it. We can then direct you to one of the several open lines of development in pyDAN.

## Disclaimer

pyDAN is a private project and has not yet been released for general consumption. We cannot guarantee backward compatibility at the present moment, but we do encourage people to start using the package as it matures into a release version.




# Tutorial 1: Data from the Stock Market

## Introduction

We use a file from the Bombay Stock Exchange containing three years' worth of daily listings for a particular stock. Using pyDAN we will demonstrate how to build a simple analytic routine; the walkthrough below also produces a couple of simple visualizations.

## A Walkthrough

  1. Libraries
      from pydan.pycsv import pycsv_reader
      from pydan.pytools import month_enum, print_error, clean_string
      from pypaint.pygraph import TGraph

      from matplotlib import pyplot as plt
      from matplotlib import ticker
      from datetime import datetime

  2. Load data into SQLITE scratchpad (data ingestion)
      ##Register the source csv
      csv = pycsv_reader("fullPathTo\\533098.csv")
    

      ## Load the data into the sqlite database, obtain a pydset object.
      data = csv.to_database(tabName='pydanTable_prelim', database='fullPathTo\\dset.db')

  3. Change the Date format (String Level Manipulation)
      #weave business requirements into a python function
      def refine_date(element):
        try:
          myDate = element.split('-')
          myMnth = '{:02d}'.format(int(month_enum(myDate[1])))
          myDay  = '{:02d}'.format(int(myDate[0]))
          myYear = myDate[2]
          newDate = '-'.join([myYear,myMnth,myDay])
          return newDate
        except Exception as err:
          print_error(err,'pydan_test1.refine_date')
    

      #Apply the business requirement to one of the columns
      data.apply(colName='Dat', funcName='to_date', func=refine_date, argNames=['Dat'])

  4. Analyze data in each column and set types (Automated Data Recognition)
      #Analyze the data in columns for numericity
      colTypes = data.analyze_columns()
    

      #Transform columns to their proper datatypes.
      dset = data.transform(resultTab='pydanTable', colTypes=colTypes)

  5. Collapse the daily data to weekly level.
      #Create logic for a unique marker for the year-week combo
      #(a standalone check of this helper appears after the walkthrough)
      def unqWeekMarker(date):
         year = int(datetime.strptime(str(date),'%Y-%m-%d').date().strftime('%Y'))
         week = int(datetime.strptime(str(date),'%Y-%m-%d').date().strftime('%W'))
         return 100*year + week

      #Time to add this new column to the dset
      dset.addColumn("unqWeek","NUMBER","to_unwk",unqWeekMarker,argNames=['date'])

      #Now is the time to collapse (weekly averages)
      dt3 = dset.aggregate(resultTab='WEEKLY', aggOp=' AVG ',
                           varNames=['NoofShares','ClosePrice','TotalTurnoverRs',
                                     'LowPrice','HighPrice','SpreadCloseOpen'],
                           groupBy=['unqWeek'], orderBy=['unqWeek'],
                           conditions=['1=1'])

  6. Simple Visualization
      #Graph low price against week-id
      gr = dt3.graph1D(
            outfile='o.db',
            varNameX='unqWeek',
            varNameY='LowPrice', 
            plotName='lPrice_Graph',
            plotTitle='Low Price (Weekly)'
          )
    

      #Graph high price against week-id
      gr2 = dt3.graph1D(
            outfile='o.db',
            varNameX='unqWeek',
            varNameY='HighPrice',
            plotName='hPrice_Graph',
            plotTitle='High Price (Weekly)'
          )

      #Create the canvas and draw both graphs on it
      fig = plt.figure(facecolor='white')
      ax = fig.add_subplot(1,1,1)
      gr.draw(ax)
      gr2.draw(ax)

  7. Cosmetic Settings
      #Logic to turn the ugly weekId (201245) into a readable marker 2012-45
      def make_label(value,pos):
        return str(int(value/100))+"-"+str(int(value-int(value/100)*100))
    

      #Set the x-axis labels using this logic
      ax.xaxis.set_major_formatter(ticker.FuncFormatter(make_label))

      #Add a grid
      ax.grid(True)

  8. Finally
      plt.show()
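As a quick standalone sanity check of the two helpers used above (the week marker from step 5 and the label formatter from step 7), the sketch below runs them on a sample date; the date '2012-11-07' is chosen only for illustration.

    from datetime import datetime

    def unqWeekMarker(date):
        # year*100 + week-of-year (%W), as in step 5
        year = int(datetime.strptime(str(date), '%Y-%m-%d').strftime('%Y'))
        week = int(datetime.strptime(str(date), '%Y-%m-%d').strftime('%W'))
        return 100*year + week

    def make_label(value, pos):
        # turn 201245 back into '2012-45', as in step 7
        return str(int(value/100)) + "-" + str(int(value - int(value/100)*100))

    marker = unqWeekMarker('2012-11-07')
    print(marker)                    # 201245
    print(make_label(marker, None))  # 2012-45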
      
