
Caterpillar (spelled correctly, unlike the repo)

Competition result: 32/1333

Different ideas for how to handle the data

1. Build the data by taking summary statistics when multiple components of the same type match one assembly

For example, if an assembly has two nuts, this data build would aggregate all of the fields associated with those two nuts (e.g., average their weights, take the max of their weights, etc.).

Things still to do:

  • Encode categorical variables
  • Create field dictionary for component subtables
  • Extract information from the names in type_connection, then merge onto adaptor
  • Manual variable creation
  • Reshape the spec data to indicators (see the sketch after this list)
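
A minimal sketch of the spec reshape, assuming specs.csv has one row per tube_assembly_id with wide spec columns holding spec codes or NaN (the column names are my assumption):

```python
import pandas as pd

# Assumed layout of specs.csv: one row per assembly, spec columns with codes or NaN
specs = pd.read_csv("specs.csv")

# Melt the wide spec columns into long form, then pivot to 0/1 indicators
long_specs = specs.melt(id_vars="tube_assembly_id", value_name="spec").dropna(subset=["spec"])
indicators = (pd.crosstab(long_specs["tube_assembly_id"], long_specs["spec"])
                .clip(upper=1)
                .add_prefix("spec_")
                .reset_index())
```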

Summary statistics I plan to use (sketched in code after this list):

  • Average numeric and binary columns
  • Take the max and min
  • Count the number of instances of that component type matched to the assembly
  • Count instances of that component id in the whole data set and merge that on
  • Try a data build that takes only one moment (min, median, or max); first confirm that these moments are highly correlated
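
A rough sketch of this build for one component type (nuts), assuming a long-form bill_of_materials.csv mapping assemblies to component ids and a comp_nut.csv with one row per nut; the "weight" column in the correlation check is also an assumption:

```python
import pandas as pd

bom = pd.read_csv("bill_of_materials.csv")   # assumed long form: tube_assembly_id, component_id
nuts = pd.read_csv("comp_nut.csv")           # one row per nut component_id

matched = bom.merge(nuts, on="component_id", how="inner")

# One row per assembly: mean/max/min of every numeric nut field, plus an instance count
num_cols = matched.select_dtypes("number").columns
agg = matched.groupby("tube_assembly_id")[list(num_cols)].agg(["mean", "max", "min"])
agg.columns = ["nut_" + "_".join(col) for col in agg.columns]
agg["nut_count"] = matched.groupby("tube_assembly_id").size()

# Check whether the moments are highly correlated before keeping all three (last bullet)
print(agg.filter(like="weight").corr())
```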

2. Don't merge in the sub-component tables; just count instances of each component on the assembly

In this case, the data will have one feature per component, counting how many of that component are present on the assembly instance. Note that an assembly sometimes has multiples of the exact same component, recorded in component_quantity, so this approach incorporates that information in a way that approach 1 misses.
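A sketch of this build, assuming the bill of materials has already been melted into long form with a component_quantity column:

```python
import pandas as pd

# Assumed long-form bill of materials: tube_assembly_id, component_id, component_quantity
bom = pd.read_csv("bill_of_materials_long.csv")

# One feature per component id: total quantity of that component on each assembly,
# so duplicates recorded in component_quantity are counted rather than lost
counts = bom.pivot_table(index="tube_assembly_id",
                         columns="component_id",
                         values="component_quantity",
                         aggfunc="sum",
                         fill_value=0)
```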

3. Try different outcome variables

  • Model with the current data structure and include units as a field
  • Somehow account for the fact that some tube assemblies have 4-7 observations associated with them
  • Use the 1/16th power transform rather than log(1 + cost) (both sketched after this list)
  • Use the 1/16th power transform for stacking
  • Account for the fact that some tube assembly ids have multiple observations, which screws up weighting (how do you do the equivalent of clustering for prediction?)
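
A small sketch of the two outcome transforms being compared, on a stand-in cost vector:

```python
import numpy as np

y = np.array([10.0, 42.5, 3.2])  # stand-in for the cost column

# log transform: fit the model on log(1 + cost), invert predictions with expm1
y_log = np.log1p(y)
assert np.allclose(np.expm1(y_log), y)

# 1/16th power transform: fit on cost ** (1/16), invert by raising to the 16th power
y_pow = y ** (1.0 / 16.0)
assert np.allclose(y_pow ** 16.0, y)
```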

4. Misc data ideas (a few are sketched in code after this list)

  • Sum weight from all components
  • Total weight / tube length
  • Sum all tube-ish lengths
  • Wall thickness * length
  • Bends / length
  • Flag if end_a != end_x
  • Number of bends per bend radius (does this even make sense?)
  • Unique or rare parts should be interacted with quantity
  • Analyze bill of materials quantity more; somehow I need to capture that if a given component adds a lot of cost, then the quantity of that component is really important (it did turn out to be very important)
  • Maybe sum the number of tube assemblies associated with the supplier
  • Identify if any tubes are exactly identical to any others (I have done this and it doesn't seem to help as much as expected; needs further inquiry)
  • Interact year with other major variables
  • The 'other' component seems to give good results; look into it and manually extract variables
  • Create a variable for each of the major material types, with values of length_x_wall
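
A few of these ideas as pandas one-liners, on a tiny stand-in frame; all column names here are assumptions:

```python
import pandas as pd

# Tiny stand-in for the merged assembly-level frame; real column names are assumptions
df = pd.DataFrame({"length": [100.0, 250.0], "wall": [1.2, 2.0],
                   "num_bends": [3, 7],
                   "end_a": ["EF-003", "EF-008"], "end_x": ["EF-003", "EF-009"],
                   "nut_weight": [0.4, 0.9], "sleeve_weight": [0.1, 0.0]})

df["total_weight"] = df.filter(like="weight").sum(axis=1)     # sum weight from all components
df["weight_per_length"] = df["total_weight"] / df["length"]   # total weight / tube length
df["wall_x_length"] = df["wall"] * df["length"]               # wall thickness * length
df["bends_per_length"] = df["num_bends"] / df["length"]       # bends / length
df["ends_differ"] = (df["end_a"] != df["end_x"]).astype(int)  # flag end_a != end_x
```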

Modeling approaches

  • Gradient boosting with deep trees (need to use more of the available parameters)
  • Gradient boosting with stumps (this seems to be dominated by medium-depth trees)
  • SVM (the small data set means this might work okay)
  • Penalized regression
  • NN (sigh, finding a reasonable specification has been a nightmare)
  • KNN (this might actually be a reasonable application for it; remember to normalize first) (this was bad)
  • Try to get XGBoost running
  • Two-stage stacking approach (first stage: create predictions from a bunch of models using only two components at a time; second stage: fit a ridge to all variables plus the first-stage predictions; sketched below)
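
A minimal sketch of the two-stage stacking idea on synthetic data; everything here is a stand-in rather than the repo's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                   # stand-in feature matrix
y = X[:, 0] * X[:, 1] + rng.normal(size=200)    # stand-in target

# Stage 1: out-of-fold predictions from models that each see only two columns
pairs = [(0, 1), (2, 3), (4, 5)]
stage1 = np.column_stack([
    cross_val_predict(GradientBoostingRegressor(random_state=0), X[:, list(p)], y, cv=5)
    for p in pairs
])

# Stage 2: ridge fit on all original variables plus the first-stage predictions
final = Ridge(alpha=1.0).fit(np.hstack([X, stage1]), y)
```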

Blending approaches

My current plan is to blend at the submission level. So I will try to create many submissions that are as good as possible, using methods that are as different as possible, and then do some sort of semi-naive blending of the final submissions.
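
A semi-naive blend sketch, assuming each submission CSV has id and cost columns; the file names are hypothetical, and the geometric mean is one reasonable choice given the log-scale outcome mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical file names; assumed submission format is id,cost
subs = [pd.read_csv(f) for f in ["sub_xgb.csv", "sub_nn.csv", "sub_ridge.csv"]]

blend = subs[0][["id"]].copy()
# Geometric mean: averages predictions on the log scale rather than the raw scale
blend["cost"] = np.exp(np.mean([np.log(s["cost"]) for s in subs], axis=0))
blend.to_csv("blend.csv", index=False)
```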
