
Caterpillar (spelled correctly, unlike the repo)

Competition result: 32/1333

Different ideas for how to handle the data

1. Build the data by taking summary statistics when multiple components of the same type match one assembly

For example, if an assembly has two nuts, this data build would aggregate all of the fields associated with those two nuts (e.g., average their weights, take the max of their weights, etc.).

Things still to do:

  • Encode categorical variables
  • Create field dictionary for component subtables
  • Extract information from the names in type_connection, then merge onto adaptor
  • Manual variable creation
  • Reshape the spec data to indicators (see the sketch after this list)
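
A minimal sketch of the spec reshape, assuming specs.csv has one row per tube_assembly_id with wide spec columns holding spec codes or NaN (the column names are my assumption):

```python
import pandas as pd

# Assumed layout of specs.csv: one row per assembly, spec columns with codes or NaN
specs = pd.read_csv("specs.csv")

# Melt the wide spec columns into long form, then pivot to 0/1 indicators
long_specs = specs.melt(id_vars="tube_assembly_id", value_name="spec").dropna(subset=["spec"])
indicators = (pd.crosstab(long_specs["tube_assembly_id"], long_specs["spec"])
                .clip(upper=1)
                .add_prefix("spec_")
                .reset_index())
```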

Summary statistics I plan to use (sketched in code after this list):

  • Average numeric and binary columns
  • Take the max and min
  • Count the number of instances of that component type matched to the assembly
  • Count instances of that component id in the whole data set and merge that on
  • Try a data build that takes only one moment (min, median, or max); first confirm that these moments are highly correlated
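
A rough sketch of this build for one component type (nuts), assuming a long-form bill_of_materials.csv mapping assemblies to component ids and a comp_nut.csv with one row per nut; the "weight" column in the correlation check is also an assumption:

```python
import pandas as pd

bom = pd.read_csv("bill_of_materials.csv")   # assumed long form: tube_assembly_id, component_id
nuts = pd.read_csv("comp_nut.csv")           # one row per nut component_id

matched = bom.merge(nuts, on="component_id", how="inner")

# One row per assembly: mean/max/min of every numeric nut field, plus an instance count
num_cols = matched.select_dtypes("number").columns
agg = matched.groupby("tube_assembly_id")[list(num_cols)].agg(["mean", "max", "min"])
agg.columns = ["nut_" + "_".join(col) for col in agg.columns]
agg["nut_count"] = matched.groupby("tube_assembly_id").size()

# Check whether the moments are highly correlated before keeping all three (last bullet)
print(agg.filter(like="weight").corr())
```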

2. Don't merge in the sub-component tables; just count instances of each component on the assembly

In this case, the data will have one feature per component, counting how many of that component are present on the assembly instance. Note that an assembly sometimes has multiples of the exact same component, recorded in component_quantity, so this approach incorporates that information in a way that approach 1 misses.
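A sketch of this build, assuming the bill of materials has already been melted into long form with a component_quantity column:

```python
import pandas as pd

# Assumed long-form bill of materials: tube_assembly_id, component_id, component_quantity
bom = pd.read_csv("bill_of_materials_long.csv")

# One feature per component id: total quantity of that component on each assembly,
# so duplicates recorded in component_quantity are counted rather than lost
counts = bom.pivot_table(index="tube_assembly_id",
                         columns="component_id",
                         values="component_quantity",
                         aggfunc="sum",
                         fill_value=0)
```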

3. Try different outcome variables

  • Model with the current data structure and include units as a field
  • Somehow account for the fact that some tube assemblies have 4-7 observations associated with them
  • Use the 1/16th power transform rather than log(1 + cost) (both sketched after this list)
  • Use the 1/16th power transform for stacking
  • Account for the fact that some tube assembly ids have multiple observations, which screws up weighting (how do you do the equivalent of clustering for prediction?)
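
A small sketch of the two outcome transforms being compared, on a stand-in cost vector:

```python
import numpy as np

y = np.array([10.0, 42.5, 3.2])  # stand-in for the cost column

# log transform: fit the model on log(1 + cost), invert predictions with expm1
y_log = np.log1p(y)
assert np.allclose(np.expm1(y_log), y)

# 1/16th power transform: fit on cost ** (1/16), invert by raising to the 16th power
y_pow = y ** (1.0 / 16.0)
assert np.allclose(y_pow ** 16.0, y)
```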

4. Misc data ideas (a few are sketched in code after this list)

  • Sum weight from all components
  • Total weight / tube length
  • Sum all tube-ish lengths
  • Wall thickness * length
  • Bends / length
  • Flag if end_a != end_x
  • Number of bends per bend radius (does this even make sense?)
  • Unique or rare parts should be interacted with quantity
  • Analyze bill of materials quantity more; somehow I need to capture that if a given component adds a lot of cost, then the quantity of that component is really important (it did turn out to be very important)
  • Maybe sum the number of tube assemblies associated with the supplier
  • Identify if any tubes are exactly identical to any others (I have done this and it doesn't seem to help as much as expected; needs further inquiry)
  • Interact year with other major variables
  • The 'other' component seems to give good results; look into it and manually extract variables
  • Create a variable for each of the major material types, with values of length_x_wall
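
A few of these ideas as pandas one-liners, on a tiny stand-in frame; all column names here are assumptions:

```python
import pandas as pd

# Tiny stand-in for the merged assembly-level frame; real column names are assumptions
df = pd.DataFrame({"length": [100.0, 250.0], "wall": [1.2, 2.0],
                   "num_bends": [3, 7],
                   "end_a": ["EF-003", "EF-008"], "end_x": ["EF-003", "EF-009"],
                   "nut_weight": [0.4, 0.9], "sleeve_weight": [0.1, 0.0]})

df["total_weight"] = df.filter(like="weight").sum(axis=1)     # sum weight from all components
df["weight_per_length"] = df["total_weight"] / df["length"]   # total weight / tube length
df["wall_x_length"] = df["wall"] * df["length"]               # wall thickness * length
df["bends_per_length"] = df["num_bends"] / df["length"]       # bends / length
df["ends_differ"] = (df["end_a"] != df["end_x"]).astype(int)  # flag end_a != end_x
```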

Modeling approaches

  • Gradient boosting with deep trees (need to use more of the available parameters)
  • Gradient boosting with stumps (this seems to be dominated by medium-depth trees)
  • SVM (the small data set means this might work okay)
  • Penalized regression
  • NN (sigh, finding a reasonable specification has been a nightmare)
  • KNN (this might actually be a reasonable application for it; remember to normalize first) (this was bad)
  • Try to get XGBoost running
  • Two-stage stacking approach (first stage: create predictions from a bunch of models using only two components at a time; second stage: fit a ridge to all variables plus the first-stage predictions; sketched below)
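
A minimal sketch of the two-stage stacking idea on synthetic data; everything here is a stand-in rather than the repo's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                   # stand-in feature matrix
y = X[:, 0] * X[:, 1] + rng.normal(size=200)    # stand-in target

# Stage 1: out-of-fold predictions from models that each see only two columns
pairs = [(0, 1), (2, 3), (4, 5)]
stage1 = np.column_stack([
    cross_val_predict(GradientBoostingRegressor(random_state=0), X[:, list(p)], y, cv=5)
    for p in pairs
])

# Stage 2: ridge fit on all original variables plus the first-stage predictions
final = Ridge(alpha=1.0).fit(np.hstack([X, stage1]), y)
```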

Blending approaches

My current plan is to blend at the submission level. So I will try to create many submissions that are as good as possible, using methods that are as different as possible, and then do some sort of semi-naive blending of the final submissions.
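
A semi-naive blend sketch, assuming each submission CSV has id and cost columns; the file names are hypothetical, and the geometric mean is one reasonable choice given the log-scale outcome mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical file names; assumed submission format is id,cost
subs = [pd.read_csv(f) for f in ["sub_xgb.csv", "sub_nn.csv", "sub_ridge.csv"]]

blend = subs[0][["id"]].copy()
# Geometric mean: averages predictions on the log scale rather than the raw scale
blend["cost"] = np.exp(np.mean([np.log(s["cost"]) for s in subs], axis=0))
blend.to_csv("blend.csv", index=False)
```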
