Skip to content

Tool to convert & load data from edX platform into BigQuery

License

Notifications You must be signed in to change notification settings

scmc/edx2bigquery

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

edx2bigquery

edx2bigquery is a tool for importing edX SQL and log data into Google BigQuery for research and analysis.

Read more about it in this writeup by the Harvard University Office of the Vice Provost for Advances in Learning, and the article "Google BigQuery for Education: Framework for Parsing and Analyzing edX MOOC Data", by Glenn Lopez, Daniel Seaton, Andrew Ang, Dustin Tingley, and Isaac Chuang.

Getting Started

To get started, install the Google Cloud SDK first:

https://cloud.google.com/sdk/

Then generate authentication tokens using "gcloud auth login". Make sure that "bq" and "gsutil" work properly.

Install edxcut dependency:

git clone https://github.com/mitodl/edxcut.git
pip install ./edxcut

To install:

python setup.py develop

Also, setup the file edx2bigquery_config.py in your current working directory. Example:

#-----------------------------------------------------------------------------
#
# sample edx2bigquery_config.py file
#
course_id_list = [
        "MITx/2.03x/3T2013",
]

courses = {
    'year2': course_id_list,
    'all_harvardx': [
        "HarvardX/AI12.1x/2013_SOND",
    ],
}

# google cloud project access
auth_key_file = "USE_GCLOUD_AUTH"
auth_service_acct = None

# google bigquery config
PROJECT_ID = "x-data"

# google cloud storage
GS_BUCKET = "gs://x-data"

# local file configuration
COURSE_SQL_BASE_DIR = "X-Year-1-data-sql"
COURSE_SQL_DATE_DIR = '2013-09-08'
TRACKING_LOGS_DIRECTORY = "TRACKING_LOGS"

#-----------------------------------------------------------------------------

To run:

edx2bigquery

Command line parameters and options

Command help message:

usage: edx2bigquery [-h] [--course-base-dir COURSE_BASE_DIR] [--course-date-dir COURSE_DATE_DIR] [--start-date START_DATE] [--end-date END_DATE]
                    [--tlfn TLFN] [-v] [--parallel] [--year2] [--clist CLIST] [--clist-from-missing-table CLIST_FROM_MISSING_TABLE]
                    [--force-recompute] [--dataset-latest] [--download-only] [--course-id-type COURSE_ID_TYPE] [--latest-sql-dir] [--skiprun]
                    [--external] [--extparam EXTPARAM] [--submit-condor] [--max-parallel MAX_PARALLEL] [--skip-geoip] [--skip-if-exists]
                    [--skip-log-loading] [--just-do-nightly] [--just-do-geoip] [--just-do-totals] [--just-get-schema] [--only-if-newer]
                    [--limit-query-size] [--table-max-size-mb TABLE_MAX_SIZE_MB] [--nskip NSKIP] [--only-step ONLY_STEP] [--logs-dir LOGS_DIR]
                    [--listings LISTINGS] [--dbname DBNAME] [--project-id PROJECT_ID] [--table TABLE] [--org ORG] [--combine-into COMBINE_INTO]
                    [--add-courseid] [--combine-into-table COMBINE_INTO_TABLE] [--skip-missing] [--output-format-json] [--collection COLLECTION]
                    [--output-project-id OUTPUT_PROJECT_ID] [--output-dataset-id OUTPUT_DATASET_ID] [--output-bucket OUTPUT_BUCKET] [--dynamic-dates]
                    [--logfn-keepdir] [--skip-last-day] [--gzip] [--time-on-task-config TIME_ON_TASK_CONFIG] [--subsection]
                    command [courses [courses ...]]

usage: %prog [command] [options] [arguments]

Examples of common commands:

edx2bigquery --clist=all_mitx logs2gs 
edx2bigquery setup_sql MITx/24.00x/2013_SOND
edx2bigquery --tlfn=DAILY/mitx-edx-events-2014-10-14.log.gz  --year2 daily_logs
edx2bigquery --year2 person_course
edx2bigquery --year2 report
edx2bigquery --year2 combinepc
edx2bigquery --year2 --output-bucket="gs://harvardx-data" --nskip=2 --output-project-id='harvardx-data' combinepc >& LOG.combinepc

Examples of not-so common commands:

edx2bigquery person_day MITx/2.03x/3T2013 >& LOG.person_day
edx2bigquery --force-recompute person_course --year2 >& LOG.person_course
edx2bigquery testbq
edx2bigquery make_uic --year2
edx2bigquery logs2bq MITx/24.00x/2013_SOND
edx2bigquery person_course MITx/24.00x/2013_SOND >& LOG.person_course
edx2bigquery split DAILY/mitx-edx-events-2014-10-14.log.gz 

positional arguments:
  command               A variety of commands are available, each with different arguments:
                        
                        --- TOP LEVEL COMMANDS
                        
                        setup_sql <course_id> ...   : Do all commands (make_uic, sql2bq, load_forum) to get edX SQL data into the right format, upload to
                                                      google storage, and import into BigQuery.  See more information about each of those commands, below.
                                                      This step is idempotent - it can be re-run multiple times, and the result should not change.
                                                      Returns when all uploads and imports are completed.
                        
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                                                      Accepts the --dataset-latest flag, to use the latest directory date in the SQL data directory.
                                                      Directories should be named YYYY-MM-DD.  When this flag is used, the course SQL dataset name has
                                                      "_latest" appended to it.
                        
                                                      Also accepts the "--clist=XXX" option, to specify which list of courses to act upon.
                        
                                                      Before running this command, make sure your SQL files are converted and formatted according to the 
                                                      "Waldo" convention established by Harvard.  Use the "waldofy" command (see below) for this, 
                                                      if necessary.
                        
                        daily_logs --tlfn=<path>    : Do all commands (split, logs2gs, logs2bq) to get one day's edX tracking logs into google storage 
                                   <course_id>        and import into BigQuery.  See more information about each of those commands, below.
                                   ...                This step is idempotent - it can be re-run multiple times, and the result should not change.
                                                      Returns when all uploads and imports are completed.
                        
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        doall <course_id> ...       : run setup_sql, analyze_problems, logs2gs, logs2bq, axis2bq, person_day, enrollment_day,
                                                      person_course, and problem_check for each of the specified courses.  This is idempotent, and can be run
                                                      weekly when new SQL dumps come in.
                        
                        nightly <course_id> ...     : Run sequence of commands for common nightly update (based on having new tracking logs available).
                                                      This includes logs2gs, logs2bq, person_day, enrollment_day, person_course (forced recompute),
                                                      and problem_check.
                        
                        --external command <cid>... : Run external command on data from one or more course_id's.  Also uses the --extparam settings.
                                                      External commands are defined in edx2bigquery_config.  Use --skiprun to create the external script
                                                      without running.  Use --submit-condor to submit command as a condor job.
                        
                        --- SQL DATA RELATED COMMANDS
                        
                        waldofy <sql_data_dir>      : Apply HarvardX Jim Waldo conventions to SQL data as received from edX, which renames files to be
                                <course_id> ...       more user friendly (e.g. courseware_studentmodule -> studentmodule) and converts the tab-separated
                                                      values form (*.sql) to comma-separated values (*.csv).  Also compresses the resulting csv files.
                                                      Does this only for the specified course's, because the edX SQL dump may contain a bunch of
                                                      uknown courses, or scratch courses from the edge site, which should not be co-mingled with
                                                      course data from the main production site.  
                                                      
                                                      It is assumed that <sql_data_dir> has a name which contains a date, e.g. xorg-2014-05-11 ;
                                                      the resulting data are put into the course SQL base directory, into a subdirectory with
                                                      name given by the course_id and date, YYYY-MM-DD.
                        
                                                      The SQL files from edX must already be decrypted (not *.gpg), before running this command.
                        
                                                      Be sure to specify which course_id's to act upon.  Courses which are not explicitly specified
                                                      are put in a subdirectory named "UNKNOWN".
                        
                        make_uic <course_id> ...    : make the "user_info_combo" file for the specified course_id, from edX's SQL dumps, and upload to google storage.
                                                      Does not import into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        make_roles <course_id> ...  : make the "roles.csv" file for the specified course_id, from edX's SQL dumps
                        
                        
                        sql2bq <course_id> ...      : load specified course_id SQL files into google storage, and import the user_info_combo and studentmodule
                                                      data into BigQuery.
                                                      Also upload course_image.jpg images from the course SQL directories (if image exists) to
                                                      google storage, and make them public-read.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        load_forum <course_id> ...  : Rephrase the forum.mongo data from the edX SQL dump, to fit the schema used for forum
                                                      data in the course BigQuery tables.  Saves this to google storage, and imports into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        makegeoip                   : Creates table of geoip information for IP addresses in person_course table missing country codes.
                                                      Accepts the --table argument to specify the person_course table to use.
                                                      Alternatively, provide the --org argument to specify the course_report_ORG dataset to look
                                                      in for the latest person_course dataset.
                        
                        tsv2csv                     : filter, which takes lines of tab separated values and outputs lines of comma separated values.
                                                      Useful when processing the *.sql files from edX dumps.
                        
                        analyze_problems <c_id> ... : Analyze capa problem data in studentmodule table, generating the problem_analysis table as a result.  
                                                      Uploads the result to google cloud storage and to BigQuery.
                                                      This table is necessary for the analytics dashboard.
                                                      Accepts --only-step=grades | show_answer | analysis
                        
                        analyze_videos <course_id>  : Analyze videos viewed and videos watched, generating tables video_axis (based on course axis),
                                                      and video_stats_day and video_stats, based on daily tracking logs, for specified course.
                        
                        analyze_forum <course_id>   : Analyze forum events, generating the forum_events table from the daily tracking logs, for specified course.
                        
                        analyze_ora <course_id> ... : Analyze openassessment response problem data in tracking logs, generating the ora_events table as a result.  
                                                      Uploads the result to google cloud storage and to BigQuery.
                        
                        analyze_idv <course_id> ... : Analyze engagement of IDV enrollees (and non-IDV enrollees) before end of the course, including
                                                      forum, video, and problem activity, up to the point of IDV enrollment (or last ever IDV enrollment
                                                      in the course, for non-IDVers).  Also include context about prior activity in other courses,
                                                      if course_report_latest.person_course_viewed is available.  Produces idv_analysis table, in each course.
                        
                        problem_events <course_id>  : Extract capa problem events from the tracking logs, generating the problem_events table as a result.
                        
                        time_task <course_id> ...   : Update time_task table of data on time on task, based on daily tracking logs, for specified course.
                        
                        time_asset <course_id> ...   : Update time_on_asset_daily and time_on_asset_totals tables of data on time on asset (ie module_id,
                                                       aka url_name), based on daily tracking logs, for specified course.
                        
                        item_tables <course_id> ... : Make course_item and person_item tables, used for IRT analyses.
                        
                        irt_report <coure_id> ...   : Compute the item_response_theory_report table, which extracts data from item_irt_grm[_R], course_item,
                                                      course_problem, and item_reliabilities.  This table is used in the XAnalytics reporting on IRT.
                                                      Note that item_reliabilities and item_irt_grm[_R] are computed using external commands, using
                                                      Stata and/or R.  If item_irt_grm (produced by Stata) is available, that is used, in preference to
                                                      item_irt_grm_R (produced by mirt in R).  This report summarizes, in one place, statistics about
                                                      problem difficulty and discrimination, Cronbach's alpha, item-test and item-rest correlations,
                                                      average problem raw scores, average problem percent scores, number of unique users attempted.
                                                      Standard errors are also provided for difficulty and discrimination.  Requires the IRT tables to
                                                      already have been computed.  
                        
                        staff2bq <staff.csv>        : load staff.csv file into BigQuery; put it in the "courses" dataset.
                        
                        mongo2user_info <course_id> : dump users, profiles, enrollment, certificates CSV files from mongodb for specified course_id's.
                                                      Use this to address missing users issue in pre-early-2014 edX course dumps, which had the problem
                                                      that learners who un-enrolled were removed from the enrollment table.
                        
                        --- TRACKING LOG DATA RELATED COMMANDS
                        
                        split <daily_log_file> ...  : split single-day tracking log files (should be named something like mitx-edx-events-2014-10-17.log.gz),
                                                      which have aleady been decrypted, into DIR/<course>/tracklog-YYYY-MM-DD.json.gz for each course_id.
                                                      The course_id is determined by parsing each line of the tracking log.  Each line is also
                                                      rephrased such that it is consistent with the tracking log schema defined for import
                                                      into BigQuery.  For example, "event" is turned into a string, and "event_struct" is created
                                                      as a parsed JSON dict for certain event_type values.  Also, key names cannot contain
                                                      dashes or periods.  Uses --logs-dir option, or, failing that, TRACKING_LOGS_DIRECTORY in the
                                                      edx2bigquery_config file.  Employs DIR/META/* files to keep track of which log files have been
                                                      split and rephrased, such that this command's actions are idempotent.
                        
                        logs2gs <course_id> ...     : transfer compressed daily tracking log files for the specified course_id's to Google cloud storage.
                                                      Does NOT import the log data into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        logs2bq <course_id> ...     : import daily tracking log files for the specified course_id's to BigQuery, from Google cloud storage.
                                                      The import jobs are queued; this does not wait for the jobs to complete,
                                                      before exiting.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        mongo2gs <course_id> ...    : extract tracking logs from mongodb (using mongoexport) for the specified course_id and upload to google storage.
                                                      uses the --start-date and --end-date options.  Skips dates for which the correspnding file in google storage
                                                      already exists.  
                                                      Rephrases log file entries to be consistent with the schema used for tracking log file data in BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        course_key_version <cid>    : print out what version of course key (standard, v1) is being used for a given course_id.  Needed for opaque key 
                                                      handling.  The "standard" course_id format is "org/course/semester".  The opaque-key v1 format is
                                                      "course-v1:org+course+semester".  Opaque keys also mangle what is known as a the "module_id", and
                                                      use something like "block-v1:MITx+8.MechCx_2+2T2015+type@problem+block@Blocks_on_Ramp_randxyzBILNKOA0".
                                                      The opaque key ID's are changed back into the standard format, in most of the parsing of the SQL
                                                      and the tracking logs (see addmoduleid.py), but it's still necessary to know which kind of key
                                                      is being used, e.g. do that jump_to_id will work properly when linking back to a live course page.
                                                      course_id is always stored in tradition format.  course_key is kept in either standard or v1 opaque key
                                                      format.  module_id is always standard, and kept as "org/course/<module_type>/<module_id>".  In
                                                      the future, there may be a module_key, but that idea isn't currently used.  This particular command
                                                      scans some tracking log entries, and if the count of "block-v1" is high, it assigns "v1"
                                                      to the course; otherwise, it assigns "standard" as the course key version.
                        
                        --- COURSE CONTENT DATA RELATED COMMANDS
                        
                        axis2bq <course_id> ...     : construct "course_axis" table, upload to gs, and generate table in BigQuery dataset for the
                                                      specified course_id's.  
                                                      Accepts the "--clist" flag, to process specified list of courses in the config file's "courses" dict.
                        
                        grading_policy <course_id>  : construct the "grading_policy" table, upload to gs, and generate table in BigQuery dataset, for
                                                      the specified course_id's.  Uses pin dates, just as does axis2bq.  Requires course.tar.gz file,
                                                      from the weekly SQL dumps.
                        
                        grades_persistent <course_id> : construct "grades_persistent", upload to gs, and 
                                                        generate table in BigQuery dataset for the specified course_ids. If the option "--subsection" is used, construct "grades_persistent_subsection" instead.
                        
                        grade_reports <course_id>   : request and download the latest edx instructor grade report from the edx instructor dashboards.
                                                      For courses that are self-paced/on-demand, edX does not provide grades for non-verified ID users. 
                                                      To fix this issue, this function was created to grab the latest grades for all users from the grade report
                                                      in the edX instructor dashboard. This requires the following updates to the following: 
                                                      1) edx2bigquery_config.py: 
                                                         PRIVATE_PATH = Location to Private path containing new edx_private_config.py
                                                         ORG_LIST = (Optional, if there are more than one possible org name. ex: ['HarvardX', 'Harvardx', 'VJx'])
                                                      2) edx_private_config.py (create new file in PRIVATE_PATH): 
                                                         EDX_USER = edX login
                                                         EDX_PW = edX pw
                                                      **NOTE: In order to download grade reports, the edx account must have instructor level access
                        
                                                      Optional command line args:
                                                      --download-only=Do not make a new grade report request, but just download the latest available
                                                      --course-id-type=specify either 'opaque' or 'transparent' (default) for the original raw course id
                        
                                                      sample commands:
                                                      a) Request for new grade reports and then download based on list of self-paced/on-demand courses
                                                         After request is made, check if grade report is ready every 5 minutes.
                                                         MAXIMUM_PARALLEL_PROCESSES can be increased to make edX grade report requests in parallel
                                                         Recommend using --parallel command
                                                         self_paced_courses = [ 'ORG/COURSECODE1/TERM', 'ORG/COURSECODE2/TERM', ...] 
                                                         NOTE: Listed courses should be in 'transparent' course id format (shown above)
                                                               This process may take anywhere between 0.5 - 3 hours per course 
                                                      edx2bigquery --dataset-latest --course-base-dir=HarvardX-SQL --end-date="2019-01-01" --clist=self_paced_courses --parallel  --course-id-type opaque grade_reports >& LOGS/LOG.gradereports
                        
                                                      b) Download latest existing grade reports for a list of self-paced/on-demand courses (fast download)
                                                         This command may be used if there is an issue with the request
                                                         for instance, after issuing grade_reports command
                                                         self_paced_courses = [ 'ORG/COURSECODE1/TERM', 'ORG/COURSECODE2/TERM', ...] 
                                                         NOTE: Listed courses should be in 'transparent' course id format (shown above)
                                                               Since only download is requested and no new request is made to edx, this process should be quick
                                                      edx2bigquery --dataset-latest --course-base-dir=HarvardX-SQL --end-date="2019-01-01" --clist=self_paced_courses --parallel --download-only --course-id-type opaque grade_reports >& LOGS/LOG.gradereports_download
                        
                        recommend_pin_dates         : produce a list of recommended "pin dates" for the course axis, based on the specified --listings.  Example:
                                                      -->  edx2bigquery --clist=all --listings="course_listings.json" --course-base-dir=DATA-SQL recommend_pin_dates
                                                      These "pin dates" can be defined in edx2bigquery_config in the course_axis_pin_dates dict, to
                                                      specify the specific SQL dump dates to be used for course axis processing (by axis2bq) and for course
                                                      content analysis (analyze_course).  This is often needed when course authors change content after
                                                      a course ends, e.g. to remove or hide exams, and to change grading and due dates.
                        
                        make_cinfo listings.csv     : make the courses.listings table, which contains a listing of all the courses with metadata.
                                                      The listings.csv file should contain the columns Institution, Semester, New or Rerun, Course Number,
                                                      Short Title, Title, Instructors, Registration Open, Course Launch, Course Wrap, course_id
                        
                        analyze_content             : construct "course_content" table, which counts the number of xmodules of different categories,
                            --listings listings.csv   and also the length of the course in weeks (based on using the listings file).
                            <course_id> ...           Use this for categorizing a course according to the kind of content it has.
                        
                        --- REPORTING COMMANDS
                        
                        pcday_ip <course_id> ...    : Compute the pcday_ip_counts table for specified course_id's, based on ingesting
                                                      the tracking logs.  This is a single table stored in the course's main table,
                                                      which is incrementally updated (by appending) when new tracking logs are found.
                        
                                                      This table stores one line per (person, course_id, date, ip address) value from the
                                                      tracking logs.  Aggregation of these rows is then done to compute modal IP addresses,
                                                      for geolocation.
                        
                                                      The "report" command aggregates the individual course_id's pcday_ip tables to produce
                                                      the courses.global_modal_ip table, which is the modal IP address of each username across 
                                                      all the courses.
                        pcday_trlang <course_id> ...: Compute pcday_trlang_counts table for specified course_id's, based on ingesting
                                                      the tracking logs. This is a single table stored in the course's main table and
                                                      incrementally updated (by appending) when tracking logs are found. This command is run
                                                      as part of doall and nightly commands.
                        
                                                      This table stores one line per (person, course_id, date, resource_event_data,
                                                      resource_event_type) from the tracking logs, where 'resource_event_data' = video
                                                      transcript language, and 'resource_event_type' = transcript_language or 
                                                      transcript_download. Using this table, another table
                                                      language_multi_transcripts is created to capture counts per user per language
                                                      and then the modal language is computed per user.
                        
                                                      The "report" command aggregates the individual course_id's pcday_trlang_counts tables
                                                      to produce a courses.global_modal_lang table. This table represents the modal
                                                      language per user across all courses, based on transcript language events.
                        
                        person_day <course_id> ...  : Compute the person_course_day (pcday) for the specified course_id's, based on 
                                                      processing the course's daily tracking log table data.
                                                      The compute (query) jobs are queued; this does not wait for the jobs to complete,
                                                      before exiting.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        enrollment_day <c_id> ...   : Compute the enrollment_day (enrollday2_*) tables for the specified course_id's, based on 
                                                      processing the course's daily tracking log table data.
                                                      The compute (query) jobs are queued; this does not wait for the jobs to complete,
                                                      before exiting.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        person_course <course_id> ..: Compute the person-course table for the specified course_id's.
                                                      Needs person_day tables to be created first.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                                                      Accepts the --force-recompute flag, to force recomputation of all pc_* tables in BigQuery.
                                                      Accepts the --skip-if-exists flag, to skip computation of the table already exists in the course's dataset.
                        
                        report <course_id> ...      : Compute overall statistics, across all specified course_id's, based on the person_course tables.
                                                      Accepts the --nskip=XXX optional argument to determine how many report processing steps to skip.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        combinepc <course_id> ...   : Combine individual person_course tables from the specified course_id's, uploads CSV to
                                                      google storage.
                                                      Also imports the data into BigQuery.
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        problem_check <c_id> ...    : Create or update problem_check table, which has all the problem_check events from all the course's
                                                      tracking logs.
                        
                        attempts_correct <c_id> ... : Create or update stats_attempts_correct table, which records the percentages of attempts which were correct
                                                      for a given user in a specified course.
                        
                        ip_sybils <course_id> ...   : Create or update stats_ip_pair_sybils table, which records harvester-master pairs of users for which
                                                      the IP address is the same, and the pair have meaningful disparities in perfomance.  Requires that 
                                                      attempts_correct be run first.
                        
                        temporal_fingerprints <cid>  : Create or update the problem_check_temporal_fingerprint, show_answer_temporal_fingerprint, and
                                                      stats_temporal_fingerprint_correlations tables.  Part of problem analyses.  May be expensive.
                        
                        show_answer <course_id> ... : Create or update show_answer table, which has all the show_answer events from all the course's
                                                      tracking logs.
                        
                        enrollment_events <cid> ... : Create or update enrollment_events table, which has all the ed.course.enrollment events from all the course's
                                                      tracking logs.
                        
                        --- TESTING & DEBUGGING COMMANDS
                        
                        rephrase_logs               : process input tracking log lines one at a time from standard input, and rephrase them to fit the
                                                      schema used for tracking log file data in BigQuery.  Used for testing.
                        
                        testbq                      : test authentication to BigQuery, by listing accessible datasets.  This is a good command to start with,
                                                      to make sure your authentication is configured properly.
                        
                        get_course_tables <cid>     : dump list of tables in the course_id BigQuery dataset.  Good to use as a test case for parallel execution.
                        
                        get_tables <dataset>        : dump information about the tables in the specified BigQuery dataset.
                        
                        get_table_data <dataset>    : dump table data as JSON text to stdout
                                       <table_id>
                        
                        get_course_data <course_id> : retrieve course-specific table data as CSV file, saved as CID__tablename.csv, with CID being the course_id with slashes
                               --table <table_id>     replaced by double underscore ("__").  May specify --project-id and --gzip and --combine-into <output_filename>
                        
                        get_course_table_status     : retrieve date created, date modified, and size of specified table, for each course_id.  Outputs CSV by default.
                            <cid> --table <table_id>  use --output-format-json to output json instead.
                        
                        get_data p_id:d_id.t_id ... : retrieve project:dataset.table data as CSV file, saved as project__dataset__tablename.csv
                                                      May specify --gzip and --combine-into <output_filename>
                        
                        get_table_info <dataset>    : dump meta-data information about the specified dataset.table_id from BigQuery.
                                       <table_id>
                        
                        delete_empty_tables         : delete empty tables form the tracking logs dataset for the specified course_id's, from BigQuery.
                                    <course_id> ...   Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        delete_tables               : delete specified table form the datasets for the specified course_id's, from BigQuery.
                          --table <table_id> <cid>    Will skip if table doesn't exist.
                        
                        delete_stats_tables         : delete stats_activity_by_day tables 
                                    <course_id> ...   Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        check_for_duplicates        : check list of courses for duplicates
                              --clist <cid_list>...   
  courses               courses or course directories, depending on the command

optional arguments:
  -h, --help            show this help message and exit
  --course-base-dir COURSE_BASE_DIR
                        base directory where course SQL is stored, e.g. 'HarvardX-Year-2-data-sql'
  --course-date-dir COURSE_DATE_DIR
                        date sub directory where course SQL is stored, e.g. '2014-09-21'
  --start-date START_DATE
                        start date for person-course dataset generated, e.g. '2012-09-01'
  --end-date END_DATE   end date for person-course dataset generated, e.g. '2014-09-21'
  --tlfn TLFN           path to daily tracking log file to import, e.g. 'DAILY/mitx-edx-events-2014-10-14.log.gz'
  -v, --verbose         increase output verbosity
  --parallel            run separate course_id's in parallel
  --year2               increase output verbosity
  --clist CLIST         specify name of list of courses to iterate command over
  --clist-from-missing-table CLIST_FROM_MISSING_TABLE
                        iterate command over list of course_id's missing specified table
  --force-recompute     force recomputation
  --dataset-latest      use the *_latest SQL dataset
  --download-only       For grade_reports command when downloading grades through edxapi, get latest only without making a request
  --course-id-type COURSE_ID_TYPE
                        Specify 'transparent' or 'opaque' course id format type. When downloading grades through edxapi, specify that course id format is the old legacy format (e.g.: HarvardX/course/term) or new format (e.g.: course-v1:HarvardX+course+term
  --latest-sql-dir      use the most recent SQL data directory (for person_course)
  --skiprun             for external command, print, and skip running
  --external            run specified command as being an external command
  --extparam EXTPARAM   configure parameter for external command, e.g. --extparam irt_type=2pl
  --submit-condor       submit external command as a condor job (must be used with --external)
  --max-parallel MAX_PARALLEL
                        maximum number of parallel processes to run (overrides config) if --parallel is used
  --skip-geoip          skip geoip (and modal IP) processing in person_course
  --skip-if-exists      skip processing in person_course if table already exists
  --skip-log-loading    when processing a 'doall' command, skip loading of tracking logs
  --just-do-nightly     for person_course, just update activity stats for new logs
  --just-do-geoip       for person_course, just update geoip using local db
  --just-do-totals      for time_task or time_asset, just compute total sums
  --just-get-schema     for get_course_data and get_data, just return the table schema as a json file
  --only-if-newer       for get_course_data and get_data, only get if bq table newer than local file
  --limit-query-size    for time_task, limit query size to one day at a time and use hashing for large tables
  --table-max-size-mb TABLE_MAX_SIZE_MB
                        maximum log table size for query size limit processing, in MB (defaults to 800)
  --nskip NSKIP         number of steps to skip
  --only-step ONLY_STEP
                        specify single step to take in processing, e.g. for report
  --logs-dir LOGS_DIR   directory to output split tracking logs into
  --listings LISTINGS   path to the course listings.csv file (or, for some commands, table name for course info listings)
  --dbname DBNAME       mongodb db name to use for mongo2gs
  --project-id PROJECT_ID
                        project-id to use (overriding the default; used by get_course_data)
  --table TABLE         bigquery table to use, specified as dataset_id.table_id or just as table_id (for get_course_data)
  --org ORG             organization ID to use
  --combine-into COMBINE_INTO
                        combine outputs into the specified file as output (used by get_course_data, get_data)
  --add-courseid        adds course_id as a new field, when combining outputs into a file (used with get_course_data)
  --combine-into-table COMBINE_INTO_TABLE
                        combine outputs into specified table (may be project:dataset.table) for get_data, get_course_data
  --skip-missing        for get_data, get_course_data, skip course if missing table
  --output-format-json  output data in JSON format instead of CSV (the default); used by get_course_data, get_data
  --collection COLLECTION
                        mongodb collection name to use for mongo2gs
  --output-project-id OUTPUT_PROJECT_ID
                        project-id where the report output should go (used by the report and combinepc commands)
  --output-dataset-id OUTPUT_DATASET_ID
                        dataset-id where the report output should go (used by the report and combinepc commands)
  --output-bucket OUTPUT_BUCKET
                        gs bucket where the report output should go, e.g. gs://x-data (used by the report and combinepc commands)
  --dynamic-dates       split tracking logs using dates determined by each log line entry, and not filename
  --logfn-keepdir       keep directory name in tracking which tracking logs have been loaded already
  --skip-last-day       skip last day of tracking log data in processing pcday, to avoid partial-day data contamination
  --gzip                compress the output file (e.g. for get_course_data)
  --time-on-task-config TIME_ON_TASK_CONFIG
                        time-on-task computation parameters for overriding default config, as string of comma separated values
  --subsection          Add grades_persistent_subsection instead of grades_persistent

About

Tool to convert & load data from edX platform into BigQuery

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 98.4%
  • Stata 1.2%
  • Other 0.4%