Please put your username in this file.
source /data/groups/schools3/dssg_env/bin/activate # can add this in ~/.bashrc file
python3 schools3/main.py --exp SingleDatasetExperiment
In the call to main.py
, we can swap out the --exp
argument with any class in the experiments directory to run different workflows. We support the following options for this argument:
SingleDatasetExperiment
: trains models and reports metrics for a single gradeMultiDatasetExperiment
: trains models and reports metrics for multiple gradesLocalImportancesExperiment
: finds locally important features for models using LIMEHPTuningExperiment
: does hyperparameter tuningFeatureImportancesExperiment
: find most important features for models using permutation importanceFeaturePruningExperiment
: gets feature importances, then trains models and report metrics when only given some number of most important featuresCrossTabsExperiment
: finds the features with greatest disparity between the positive and negative labels
The other (optional) arguments for main are
--no-cache
: if not given, models may be read from a cache instead of being re-trained--no-good-values
: used for inHPTuningExperiment
. If not given, the hyperparameter tuning library will only iterate over grid values that we specified in the config.
The settings for all experiments are controlled by the main config. From this file, we can control
- years
- grade(s)
- features and feature processors
- labels
- metrics
- fairness attributes
- caching behavior
- hyperparameter tuning strategy
Similar to the experiments, most of these settings (such as features and models) have corresponding directory and are selected by just specifying a class. The Repository Structure section of the README gives an overview of all of the options available.
Below is the tree structure of the notable files in our repo. Descriptions of the directories are in bold
Note that the config
and gen
directories have internal directory structure that mirrors the schools3 directory. This means that a file schools3/foo/bar/file.py
would have its corresponding configuration placed at schools3/config/foo/bar/file_config.py
.
schools3 ├── main.py ├── config - where config variables are stored. The structure is a mirror of the schools3 directory │ ├── data │ │ ├── db_config.py │ │ └── db_tables.py │ ├── base_config.py │ ├── global_config.py │ └── main_config.py ├── data - data for machine learning │ ├── acs │ │ └── acs_table.py │ ├── base - base classes for the data/ dir │ │ ├── cohort.py │ │ ├── cohortable_table.py │ │ ├── processor.py │ │ └── schools_table.py │ ├── datasets - input class for a model. It wraps around a cohort, features, and labels │ │ ├── dataset.py │ │ └── datasets_generator.py │ ├── explorers - code to do data exploration for proposal │ │ ├── bivariate_explorer.py │ │ ├── explorer.py │ │ ├── labels_explorer.py │ │ └── temporal_explorer.py │ ├── features - tables of features │ │ ├── processors - classes that can transform the tables of features │ │ │ ├── acs_feature_processor.py │ │ │ ├── categorical_feature_processor.py │ │ │ ├── composite_feature_processor.py │ │ │ ├── feature_processor.py │ │ │ ├── impute_null_processor.py │ │ │ ├── inv_type_processor.py │ │ │ ├── oaaogt_processor.py │ │ │ ├── pivot_processor.py │ │ │ ├── replace_nullish_processor.py │ │ │ ├── standardize_processor.py │ │ │ └── transform_processor.py │ │ ├── absence_desc_features.py │ │ ├── absence_features.py │ │ ├── academic_features.py │ │ ├── acs_features.py │ │ ├── discipline_incident_rate_features.py │ │ ├── features_table.py │ │ ├── inv_features.py │ │ ├── inv_type_features.py │ │ ├── middle_school_features.py │ │ ├── oaaogt_features.py │ │ └── pivot_block_features.py │ └── labels - tables of labels │ ├── labels_table.py │ └── original_labels.py ├── gen - generated output. The structure is a mirror of the schools3 dir └── ml - everything related to learning ├── base - base classes for the ml/ dir │ ├── experiment.py │ ├── hyperparameters.py │ ├── metrics.py │ └── model.py ├── experiments - different top-level tasks to be called from main │ ├── cross_tabs_experiment.py │ ├── feat_importances_experiment.py │ ├── feat_pruning_experiment.py │ ├── hp_tuning_experiment.py │ ├── local_importances_experiment.py │ ├── models_experiment.py │ ├── multi_dataset_experiment.py │ └── single_dataset_experiment.py ├── metrics - metrics to evaluate a model with │ ├── composite_metrics.py │ ├── fairness_metrics.py │ ├── performance_metrics.py │ └── prk_curve_metrics.py ├── baselines - baselines to compare the machine learning model against │ ├── absenteeism_baseline.py │ ├── baseline.py │ ├── discipline_baseline.py │ ├── gpa_baseline.py │ └── ranked_baseline.py ├── hyperparameters - hyperparameter definitions for machine learning model │ ├── decision_tree_hyperparameters.py │ ├── gradient_boosting_hyperparameters.py │ ├── k_neighbors_hyperparameters.py │ ├── logistic_regression_hyperparameters.py │ ├── mlp_hyperparameters.py │ ├── random_forest_hyperparameters.py │ └── svm_hyperparameters.py └── models - the machine learning models ├── all_ones_model.py ├── decision_tree_model.py ├── ensemble_model.py ├── gradient_boosting_model.py ├── k_neighbors_model.py ├── logistic_regression_model.py ├── mlp_model.py ├── random_forest_model.py ├── sklearn_model.py ├── svm_model.py └── tfkeras_model.py
According to the US Department of Education, 1 in 5 high school students fail to graduate within four years, but at risk students are typically only identified late in high school after significant red flags. This project works with data from the Muskingum Valley Education Service Center (MVESC) with the goal of supplementing educators' intuitions to provider an earlier warning about which students are in need of targeted interventions. MVESC provides services to 16 school districts in 5 counties in rural southeastern Ohio, reaching more than 30,000 pre-K to high school students annually.
MVESC has provided longitudinal data for seven of the districts it works with dating back to the 2006-2007 school year (except Ridgewood county, which starts in 2007), including information about grades, absences, standardized tests, and withdrawl information. Raw data was provided as a set of SQL Server backups, CSV files, and Exel files and has been ingested into a postgres database. These raw data can be found in several tables in the public
schema, however, initial cleaning, consolidation, and recoding has been performed to generate the tables in the clean
schema with a student-level identifier, student_lookup
. These include:
- Grades: Information about courses and grades is available in
clean.all_grades
and derived from the underlying raw tables with "grade" in the table name. Note that the course grades are in themark
column while thegrade
column indicates the grade level of the student. These data have also been aggregated to an overall GPA for high school students inclean.high_school_gpa
on a 4-point scale. - Absences: Data about absences and tardiness is found in
clean.all_absences
and derived from underlying raw tables with "absence" in the table name. Note that theabsence_code
is not standardized across schools and theabsence_desc
has been consolidated from the underlying data. - Test Scores: Scores from standardized tests --- including Ohio Achievement Assessment (OAA), Ohio Graduation Test (OGT), Kindergarten Readiness Assessment for Literacy (KRA-L) --- are provided in
clean.oaaogt
. The OAA results are broken out by grade level and subject area and for all tests listed, columns ending with_ss
indicate a numeric score and those ending with_pl
indicate a placement level. On the OGT, scores of 400 or higher are considered passing in each subject area and scores fall into five performance levels. - Interventions: The
clean.interventions
table describes students who have received various types of interventions (e.g., special academic programs or instruction, extracurricular programs, etc.) in a given year. The underlying data tables are those beginning withINV
. Notes that the "membership codes" relate to the program description --- a lookup can be found inpublic."INV_MembershipCodes"
. - Student Information: The
clean.all_snapshots
table contains student-level information about demographics, special needs, discipline (including in-school suspensioniss
and out-of-school suspensionoss
), etc. In this table,section_504_plan
referrs to accomadations for students with disabilities under Section 504 of the Rehabilitation Act of 1973, andIRN
referrs to an "Information Retrevial Number" that is used by the Ohio Department of Education to identify entities that interact with it (schools, districts, etc) which can be searched here.
This Google document contains some preliminary information about data and some known problems. Please add any new insights/problems that you find out in the document, so the team is updated.