DataScience

This repository will contain systematization of my knowledge about Data Science.

Steps:
Of course these may vary slightly depending on the type of problem, data, tools available etc.

Problem definition The first step is to of course understand the business problem. What is the problem you are trying to solve, what is the business context? Very often however your client may also just give you a whole lot of data and ask you to do something with it. In such a case you would need to take a more exploratory look at the data. Nevertheless if the client has a specific problem that needs to be tackled, then then first step is to clearly define and understand the problem. You will then need to convert the business problem into an analytics problem. I other words you need to understand exactly what you are going to predict with the model you build. There is no point in building a fabulous model, only to realise later that what it is predicting is not exactly what the business needs.
Data Exploration Once you have the problem defined, the next step is to explore the data and become more familiar with it. This is especially important when dealing with a completely new data set.
Data Preparation Now that you have a good understanding of the data, you will need to prepare it for modelling. You will identify and treat missing values, detect outliers, transform variables, create binary variables if required and so on. This stage is very influenced by the modelling technique you will use at the next stage. For example, regression involves a fair amount of data preparation, but decision trees may need less prep whereas clustering requires a whole different kind of prep as compared to other techniques.
Modelling Once the data is prepared, you can begin modelling. This is usually an iterative process where you run a model, evaluate the results, tweak your approach, run another model, evaluate the results, re-tweak and so on. You go on doing this until you come up with a model you are satisfied with or what you feel is the best possible result with the given data.
Validation The final model (or maybe the best 2-3 models) should then be put through the validation process. In this process, you test the model using completely new data set i.e. data that was not used to build the model. This process ensures that your model is a good model in general and not just a very good model for the specific data earlier used (Technically, this is called avoiding over fitting)
Implementation and tracking The final model is chosen after the validation. Then you start implementing the model and tracking the results. You need to track results to see the performance of the model over time. In general, the accuracy of a model goes down over time. How much time will really depend on the variables: how dynamic or static they are, and the general environment; how static or dynamic that is.

Learning algorithms:

supervised learning - labeled output (regression, classification)
unsupervised learning - unlabeled output (clustering(organize computing clusters, social network analysis, market segmentation, astronomical data analysis))

training set
|
learning algorithm
|
input -> h -> output

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
1Statistics		1Statistics
2Getting and cleaning data		2Getting and cleaning data
3Data visualization		3Data visualization
4Linear regression		4Linear regression
5Classification		5Classification
6Clustering		6Clustering
7Languages		7Languages
8Examples		8Examples
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1Statistics

1Statistics

2Getting and cleaning data

2Getting and cleaning data

3Data visualization

3Data visualization

4Linear regression

4Linear regression

5Classification

5Classification

6Clustering

6Clustering

7Languages

7Languages

8Examples

8Examples

README.md

README.md

Repository files navigation

DataScience

About

Releases

Packages

Languages

strongSquirell/DataScience

Folders and files

Latest commit

History

Repository files navigation

DataScience

About

Resources

Stars

Watchers

Forks

Languages