Shows the content of the 25-day Data Scientist certification from Telecom-ParisTech.
One can find the course material for the 12 different themes and, in particular, all the work I had to produce.
Credit to Telecom-ParisTech; see CES Data Scientist - Telecom ParisTech for more details.
- cours-ces-1-intro.pdf (credit to Florence d'Alché).
keywords : feature classification, cost function, empirical risk, Bayesian classifier, bias/variance trade-off.
- boosting-ces.pdf (credit to Florence d'Alché).
- cours2-SD210-part2.pdf (credit to Florence d'Alché).
- cours-ces-1-introv0.pdf (credit to Joseph Salmon)
- Papers/ : some introductory papers on Machine Learning.
- Exercises/TP_intro_python_ml_digits.v3.ipynb : K-nearest neighbors on the MNIST digits dataset using scikit-learn, with cross-validation and a confusion matrix (see the sketch below).
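A minimal sketch of that pipeline (not the notebook itself; scikit-learn's bundled digits subset stands in for full MNIST):

    # k-nearest neighbors with cross-validation and a confusion matrix
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)                   # k is an assumption
    print(cross_val_score(knn, X_train, y_train, cv=5).mean())  # CV accuracy

    knn.fit(X_train, y_train)
    print(confusion_matrix(y_test, knn.predict(X_test)))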
- COURS2.pdf (credit to Stephan Clemençon)
keywords : basic classification algorithms (Logistic Regression, Linear Discriminant Analysis, Perceptron, partitioning algorithms, Decision Trees), model complexity, model selection (penalization, cross-validation, bootstrap).
- Papers/ : some introductory papers on Machine Learning.
- Exercises/
- TP_intro_Pandas_lin_model.ipynb : linear regression and ridge regression on the auto-mpg data using scikit-learn, polynomial complexity, cross-validation (see the sketch below).
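A hedged sketch of those ideas (the .csv path and column names are assumptions, not the notebook's exact code):

    # ridge regression with polynomial features of increasing degree
    import pandas as pd
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    df = pd.read_csv("auto-mpg.csv").dropna()     # path/columns are assumptions
    X = df[["horsepower", "weight"]].to_numpy()
    y = df["mpg"].to_numpy()

    for degree in (1, 2, 3):                      # polynomial complexity
        model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0))
        print(degree, cross_val_score(model, X, y, cv=5).mean())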
- TelecomParisTechCours1-NoSQL.pdf (credit to Raja Chiky)
- TelecomParisTechCours2-NoSQL.pdf
- TelecomParisTechCours3-streaming.pdf
- TP_MongoDB/
- TPMongoDB.pdf : exercises to get familiar with the document-oriented NoSQL database MongoDB (a minimal pymongo sketch follows this list).
- test.js : short JavaScript script to get familiar with MongoDB.
- lapins.js : creates a collection and runs basic queries.
- earthquakes.js : uses the earthquakes_big.geojson database, reformats the data and runs queries (regex, 2dsphere).
- populate_phones.js : creates a large collection of phone numbers.
- earthquakes_big.geojson : earthquakes database.
- run.sh : bash file to import the data, run the JavaScript files above, and test master/slave replication and database sharding with MongoDB.
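The labs use mongo-shell JavaScript; purely for orientation, here is an equivalent minimal sketch in Python with pymongo, assuming a local mongod on the default port (database and collection names are assumptions):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    coll = client["test"]["lapins"]               # names are assumptions
    coll.insert_one({"name": "Bugs", "age": 3})   # create a document
    for doc in coll.find({"age": {"$gt": 1}}):    # basic range query
        print(doc)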
- Project/
- Big Data Project with NoSQL DB.pdf : exercises for the MongoDB project.
- mongodb_1_data_integration.js : JavaScript to import .csv data and reformat it into MongoDB collections.
- mongodb_2_data_queries.js : runs queries using MongoDB's aggregate method.
- run_mongo.sh : bash file to run the JavaScript files for MongoDB.
- mysql_1_data_integration.sql : SQL file to load .csv data and reformat it into MySQL tables.
- mysql_2_data_queries.sql : runs queries using SQL joins and GROUP BY.
- run_mysql.sh : bash file to run the SQL files for MySQL.
- report.pdf : report that summarizes the results and compares execution times between MongoDB and MySQL.
- wikifirst.txt : 150,000 articles taken from the Simple English Wikipedia.
- ie_dates.py : Python script to extract dates from the 150,000 articles (see the sketch after this list).
- ie_dates_evaluation.txt : contains the first 20 dates found and the manually measured precision on those 20 dates.
- ie_types.py : Python script to extract the type of each article.
- ie_types_evaluation.txt : contains the first 20 types found and the manually measured precision on those 20 types.
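The actual scripts may differ, but the dates step amounts to regex extraction over the article text; a hypothetical minimal version:

    # extract date-like strings such as "12 January 1985" or "January 1985"
    import re

    DATE_RE = re.compile(
        r"\b(?:\d{1,2}\s+)?(?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December)\s+\d{4}\b")

    with open("wikifirst.txt", encoding="utf-8") as f:
        for line in f:
            for match in DATE_RE.finditer(line):
                print(match.group(0))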
- audio/
- audio-analysis-lecture_2016.pdf (credit to Slim Essid) : keywords : audio content analysis, short-term analysis, spectral analysis (DFT, DCT) and the spectrogram, MFCC feature extraction, temporal integration.
- test.py : small Python script to get hands-on experience with the scipy.fftpack module by analyzing short WAV audio files (see the sketch below).
- wav/ : short audio samples taken from www.spirit-science.fr
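A minimal sketch along the lines of test.py (the WAV file name is an assumption): read a file and locate the dominant frequency in its DFT.

    import numpy as np
    from scipy.io import wavfile
    from scipy.fftpack import fft

    rate, samples = wavfile.read("wav/example.wav")   # file name is an assumption
    if samples.ndim > 1:
        samples = samples[:, 0]                       # keep a single channel
    spectrum = np.abs(fft(samples))
    half = len(spectrum) // 2                         # keep positive frequencies
    freqs = np.arange(half) * rate / len(spectrum)
    print(freqs[np.argmax(spectrum[:half])])          # dominant frequency in Hz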
- image/ (credit to Michel Roux)
- bcontours-ds.pdf
- Donnees Multimedia - Images et Video.pdf
- introtdi-ds [Mode de compatibilité].pdf
- sift [Mode de compatibilité].pdf
- calibrage-ds.pdf
- formes-ds.pdf
- notebooks/ : some Python notebooks with advanced tutorials on optimization and algebra operations using scipy (credit to Alexandre Gramfort and Slim Essid)
- Project/
- run.sh : downloads an MP4 video (06-11-22.mp4) of a debate from Swiss TV and the related annotation file 06-11-22.trs. The video is then cut into JPEG images and each frame is labelled using the annotation file, producing labels.csv (this takes time; xmlstarlet is required).
- apprentissage_classique.pdf (credit to Joseph Salmon)
keywords : Linear Discriminant Analysis, Quadratic Discriminant Analysis, naive Bayesian classifier, Logistic Regression, K-Nearest Neighbors.
- tp_classifiers/
- ClassifieurNaif.ipynb : LDA, QDA, Gaussian Naive Bayes, logistic regression and KNN on the Iris dataset using scikit-learn (see the sketch after this list)
- module5_source.py : hand-crafted module imported for plotting
- module5_source.pyc : compiled module
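A hedged outline of the comparison in that notebook (exact hyperparameters and plotting are omitted):

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(),
                GaussianNB(), LogisticRegression(max_iter=1000),
                KNeighborsClassifier()):
        print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())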
- cours_arbres_selection_modele.pdf (credit to Aurélien Bellet)
keywords : decision tree, CART algorithm, entropy, regression tree, regularization, random forest
- tp_learning_curve/
- tp_learning_curve.pdf : exercises dealing with decision trees and hyperparameters, random forests, model selection and regularization parameters on the digits dataset using scikit-learn
- tp_learning_curve.ipynb : Python code for the exercises
- learning_curve.py : helper function using the learning_curve module from scikit-learn
- learning_curve.pyc : compiled module
- 1_cours_nnet_basics.pdf (credit to Alexandre Allauzen)
keywords : feed-forward neural networks, gradient back-propagation, activation functions, loss functions
- 2_cours_nnet_basics.pdf (credit to Alexandre Allauzen)
keywords : regularization and dropout, vanishing gradients and the Rectified Linear Unit
- lab_mnist/ (credit to Gaetan Marceau-Caron)
- telecom.odp : lab exercise questions
- train_mnist.lua : Lua file that trains a 400x400x10 neural network on the 60,000 28x28 handwritten MNIST images using Torch. Uses the 'nn' module for a sequential network, the 'optim' module for the gradient back-propagation solver, and 'nn.Dropout' for regularization (a rough PyTorch equivalent is sketched below).
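The lab itself is Lua/Torch; purely for orientation, here is a rough modern PyTorch equivalent of the same 784-400-400-10 network with dropout (learning rate, dropout rate and batch are assumptions, and random tensors stand in for MNIST):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 400), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(400, 400), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(400, 10),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(64, 1, 28, 28)     # stand-in batch; the lab uses MNIST
    y = torch.randint(0, 10, (64,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)        # forward pass and loss
    loss.backward()                    # gradient back-propagation
    optimizer.step()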
- cours-svm.pdf (credit to Aurélien Bellet)
keywords : margin maximization in the linearly separable case, slack variables in the linearly non-separable case, the non-linear case and the kernel trick, the regression case.
- clustering-2016.pdf (credit to Slim Essid and Florence d'Alché)
keywords : K-means algorithm, Gaussian Mixture Models, Expectation-Maximization algorithm for fitting GMMs.
- PCA_NMF.pdf (credit to Slim Essid and Alexey Ozerov)
- cours_ica.pdf (credit to Slim Essid and Cédric Févotte)
- TP_clustering/TP_ML_clustering.pdf : exercise questions
- TP_clustering/TP_clustering_kmeans.py : re-implements the k-means algorithm in Python and compares it with the scikit-learn implementation on a simple data example (see the sketch after this list)
- TP_clustering/TP_clustering_gmm.py : fits a Gaussian mixture model with scikit-learn on simple generated data
- TP_clustering/TP_clustering_image.py : performs k-means segmentation for different numbers of clusters on the Grey_scale_optical_illusion.png image
- TP_clustering/TP_document_clustering.py : performs k-means clustering on the 20 newsgroups text dataset
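A minimal re-implementation of the Lloyd iterations, in the spirit of TP_clustering_kmeans.py (no empty-cluster handling; the actual script's details are assumed):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # assign each point to its nearest center
            labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
            # move each center to the mean of its assigned points
            centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        return labels, centers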
- TP_nmf/TP_pca_nmf.pdf : exercise questions
- TP_nmf/pca_nmf_faces.py : performs PCA and NMF on the Olivetti faces dataset (credit to AT&T Laboratories Cambridge) for different numbers of reduction components. For each number of components, cross-validation is performed using Linear Discriminant Analysis, and the resulting scores are plotted.
- TP_nmf/topics_extraction_with_nmf_.py : performs NMF on the 20 newsgroups text dataset to extract topics
- TP_nmf/ica_audio.py : performs a two-source audio separation using ICA (see the sketch after this list)
- TP_nmf/snd/ : contains the WAV sound files used for ICA
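A hedged sketch of the two-source separation (file names and the mixing matrix are assumptions): mix two signals, then unmix them with scikit-learn's FastICA.

    import numpy as np
    from scipy.io import wavfile
    from sklearn.decomposition import FastICA

    rate, s1 = wavfile.read("snd/source1.wav")    # file names are assumptions
    _, s2 = wavfile.read("snd/source2.wav")
    n = min(len(s1), len(s2))
    S = np.c_[s1[:n], s2[:n]].astype(float)       # true sources, side by side

    A = np.array([[1.0, 0.5], [0.4, 1.0]])        # arbitrary mixing matrix
    X = S @ A.T                                    # observed mixtures
    S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)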
- Project/utils.py : Python utilities to cut an image into patches and to post-process the clustering (scikit-image)
- Project/functions.py : functions that extract HOG and HSV descriptors from patches (OpenCV, scikit-learn)
- Project/run.py : main file that downloads digitized Russian magazine and newspaper pages from the UCI archives and applies clustering to divide each image into 3 classes: background, text and image. From a Linux terminal :
python run.py
- Project/test.py : tests any image representing a scanned magazine or newspaper page and writes the clustering result to the same folder. From a Linux terminal :
python test.py image_name.[jpg|png|bmp|tif]
- test/ : contains three digitized documents in three other languages to test the generalization of the method
- Project/report_latex/report.pdf : LaTeX report that describes the whole clustering method
- BN-CES-2016.pdf (credit to Pierre-Henri Wuillemin)
keywords : graphical models, D-separation, inference, model selection, supervised learning, aGrUM library for Bayesian network simulation (a small pyAgrum sketch follows the list below)
- TP/01-Probabilités.ipynb
- TP/02-CPTdeterministe.ipynb
- TP/03-Modélisation1.ipynb
- TP/04-Modélisation2.ipynb
- TP/05-Modélisation3.ipynb
- TP/06-ModelSelection.ipynb
- TP/07-ClassificationSupervise.ipynb
- TP/08-dynamicBN.ipynb
- TP/fra_l1_app.csv
- TP/fra_l1_test.csv
- TP/livretA_10000.csv
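For orientation, a small inference example with pyAgrum (variable names and probabilities are invented; the notebooks' actual models differ):

    import pyAgrum as gum

    bn = gum.BayesNet("sketch")
    rain = bn.add(gum.LabelizedVariable("Rain", "Rain", 2))       # invented variables
    wet = bn.add(gum.LabelizedVariable("Wet", "Wet grass", 2))
    bn.addArc(rain, wet)
    bn.cpt(rain).fillWith([0.8, 0.2])                             # invented CPTs
    bn.cpt(wet)[{"Rain": 0}] = [0.9, 0.1]
    bn.cpt(wet)[{"Rain": 1}] = [0.2, 0.8]

    ie = gum.LazyPropagation(bn)                                  # exact inference
    ie.setEvidence({"Wet": 1})
    ie.makeInference()
    print(ie.posterior("Rain"))                                   # P(Rain | Wet = 1)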
- HMM_ces_2016_distri.pptx.pdf (credit to Laurence Likforman-Sulem)
keywords : discrete and continuous Markov models, Hidden Markov Models, Monte-Carlo generation, observation likelihood (Viterbi and Forward-Backward algorithms), prediction, learning (Baum-Welch algorithm)
- TP/texte_TP_chaines_2016.pdf : exercises in Python using Markov models for text and Hidden Markov Models for images
- TP/HMM_text.ipynb : IPython notebook dealing with Markov-model Monte-Carlo simulation for letter, word and phrase generation (see the sketch after this list)
- TP/bigramenglish.txt : Markov model for English letter bigrams
- TP/bigramfrench.txt : Markov model for French letter bigrams
- TP/dictionnaire.txt : dictionary giving the correspondence between letters and the Markov model matrix
- TP/data_txt_compact : data for the Hidden Markov Model on images
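An illustrative Monte-Carlo generation step from a letter-bigram Markov model (the lab's actual matrix format in bigramenglish.txt is assumed, so a random matrix stands in here):

    import numpy as np

    letters = list("abcdefghijklmnopqrstuvwxyz ")
    rng = np.random.default_rng(0)
    P = rng.random((len(letters), len(letters)))  # stand-in bigram matrix
    P /= P.sum(axis=1, keepdims=True)             # rows are P(next | current)

    state, text = 0, []
    for _ in range(50):                           # sample a 50-letter sequence
        state = rng.choice(len(letters), p=P[state])
        text.append(letters[state])
    print("".join(text))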
- distributed_storage.pdf (credit to Pierre Senellart)
- gfs.pdf (credit to Pierre Senellart)
- inverted_index.pdf (credit to Pierre Senellart)
- slides.pdf (credit to Pierre Senellart)
- search.pdf (credit to Quentin Lobbé)
- TP/tp.pdf : exercises to create a search engine for a local mini Wikipedia using HBase (credit to Pierre Senellart)
- TP/wiki_crawler.py : crawls a localhost Simple English Wikipedia using the Scrapy and HappyBase Python libraries to create an HBase table
- TP/wiki_indexation.py : creates an HBase table containing an inverted index from the previous Wikipedia crawl
- TP/wiki_request.py : implements a search engine on top of the previous inverted index (see the sketch after this list)
- stop_words.txt : a list of English stop words used to remove irrelevant words.
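A toy version of the index-and-search idea (the crawling and HBase storage are omitted): build an inverted index, then answer a conjunctive query.

    from collections import defaultdict

    docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat and a dog"}
    stop_words = {"the", "a", "and"}              # stand-in for stop_words.txt

    index = defaultdict(set)                      # word -> set of doc ids
    for doc_id, text in docs.items():
        for word in text.split():
            if word not in stop_words:
                index[word].add(doc_id)

    query = ["cat", "sat"]
    print(set.intersection(*(index[w] for w in query)))   # -> {1}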
- 1—Intro.pdf (credit to James Eagan)
- 2—Data, Marks, Implantations.pdf (credit to James Eagan)
- 3—Tasks & Interaction.pdf (credit to James Eagan)
- 4—Perception.pdf (credit to James Eagan)
- 7—Tufte's Principles.pdf, part 1 (credit to James Eagan)
- lab/France/data/france.tsv : file containing population information for the different French regions
- lab/France/js/hello-france.js : JavaScript that calls D3 functions to visualize French population data
- lab/France/js/D3 : D3 library for visualization
- lab/France/index.html : HTML file including hello-france.js for the visualization
- lab/France/README.txt : notes on which French population characteristics are visualized and how
- example_mapreduce/WordCount.py : Python script implementing the map and reduce classes for counting words
- example_mapreduce/WordCountDriver.py : puts Alice.txt (the "Alice in Wonderland" book) into HDFS, then runs a Hadoop MapReduce job via hadoopy with the map/reduce classes defined in WordCount.py in order to count word occurrences (results are written to HDFS)
- example_spark/spark.py : counts word occurrences in Alice.txt using PySpark (map and reduceByKey methods on an RDD; see the sketch after this list)
- alice.txt : text file containing the full text of "Alice in Wonderland"
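A sketch of the PySpark word count (Spark configuration details are assumed):

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("alice.txt")
                .flatMap(lambda line: line.split())   # map: one pair per word
                .map(lambda word: (word, 1))
                .reduceByKey(add))                    # reduce: sum the counts
    print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # 10 most frequent
    sc.stop()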
- TP/tp.pdf : exercises using MapReduce and Spark on the Simple Wikipedia HBase table to create an inverted index
- TP/bashrc : Linux config file that exports the necessary Hadoop JARs
- TP/TP_MapReduce/1_wikiFromHBaseToHdfs.py : transforms the Simple Wikipedia HBase table (see TP/wiki_crawler.py in the "Stockage Distribué" module) into an HDFS file using hadoopy
- TP/TP_MapReduce/2_wikiIndexMapReduce.py : runs Hadoop MapReduce via hadoopy to build an index from the HDFS file created just above, then stores this index in HDFS
- TP/TP_MapReduce/3_wikiIndexMapReduceToHBase.py : loads the index HDFS file into an HBase table using hadoopy
- TP/TP_MapReduce/WordCount.py : implements the map/reduce classes used to compute the index
- TP/TP_spark/1_wikiFromHBaseToHdfs.py : transforms the Simple Wikipedia HBase table (see TP/wiki_crawler.py in the "Stockage Distribué" module) into an HDFS file using hadoopy
- TP/TP_spark/2_spark.py : uses PySpark map/reduce operations to create the index, then puts the created index into HDFS
- BigData-DataScience.pdf : presentation of some Big Data frameworks (credit to Albert Bifet)
- giraph.pdf (credit to Pierre Senellart)
- SAMOA-CES.pdf (credit to Pierre Senellart)
- spark.pdf (credit to Pierre Senellart)
- storm.pdf (credit to Pierre Senellart)
- Lec3-PageRank.pdf : PageRank algorithm, matrix factorisation, random teleport (credit to Mauro Sozio)
- Densest.pdf : dense subgraphs (credit to Mauro Sozio)
- lec2-assoc-rules.pdf (credit to Mauro Sozio)
- CES_PageRank/pagerankPythonEnglish.pdf : PageRank lab exercises (credit to Mauro Sozio)
- CES_PageRank/labels : text file containing a label for each article of the Simple English Wikipedia
- CES_PageRank/edge_list.txt : text file containing all links between Simple English Wikipedia articles
- CES_PageRank/LoadIntoHDFS.py : Python file that puts the graph matrix into HDFS
- CES_PageRank/PageRankDriver.py : Python file that launches hadoopy to run the Map/Reduce classes defined in PageRank.py
- CES_PageRank/PageRank.py : Python file that implements the Map/Reduce classes to compute the importance vector (see the sketch after this list)
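A single-machine sketch of the computation that PageRank.py distributes with MapReduce: power iteration with random teleport (the damping factor 0.85 is an assumption).

    import numpy as np

    def pagerank(out_links, d=0.85, n_iter=50):
        n = len(out_links)
        r = np.full(n, 1.0 / n)                   # uniform initial importance
        for _ in range(n_iter):
            new_r = np.full(n, (1.0 - d) / n)     # teleport term
            for i, targets in enumerate(out_links):
                if targets:                       # spread rank along out-edges
                    new_r[list(targets)] += d * r[i] / len(targets)
                else:                             # dangling node: spread evenly
                    new_r += d * r[i] / n
            r = new_r
        return r

    print(pagerank([{1, 2}, {2}, {0}]))           # tiny 3-node example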
- Please follow the link for more details.