... is located in the similarity
directory; refer to the README there.
For a script that extracts counts for a custom list of keywords from the corpus, see the README in the cooccurrence_and_counting directory instead. Otherwise:
-
Copy the repository to your personal workbench.
-
For most scripts, you will need to run inside a tmux window so that the program keeps running when your ssh session times out. If you use tmux, open the window before doing the next step; otherwise, the classpath changes made by that step won't persist for your script.
-
From the top-level directory of the repository, run the command
source add_dependencies_to_classpath.sh
(this will add the jar files in the lib directory to your classpath and pig_classpath)
-
From the top-level directory of the repository, run some variation on the command
pig -p I_PARSED_DATA=/dataset-derived/gov/parsed/arcs/bucket-2/ -p I_CHECKSUM_DATA=/dataset/gov/url-ts-checksum/ -p O_DATA_DIR=outputARC2/ -p O_DATA_DIR_2=outputARC2-2/ ExtractCounts_keywords.pig
providing the local path from this directory to your desired pig script, and making sure that the values you provide for O_DATA_DIR and O_DATA_DIR_2 do not already exist as directories in the hadoop file system at
hdfs://nn-ia.s3s.altiscale.com:8020/user/<your_workbench_username>/
(If, for example,
hdfs://nn-ia.s3s.altiscale.com:8020/user/<your_workbench_username>/outputARC2
already existed when the example command was run, the script would quit for reasons related to "output validation.") To see the contents of
hdfs://nn-ia.s3s.altiscale.com:8020/user/<your_workbench_username>/
run the command
hdfs dfs -ls hdfs://nn-ia.s3s.altiscale.com:8020/user/<your_workbench_username>/
from any directory. Other Unix-style file commands (for example, -cat or -rm -r) can be run on files located there by changing
-ls
to a different command.
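The output-directory check described above can be automated before launching pig. The following is a minimal sketch, not part of the repository: the function name check_output_dirs_free is hypothetical, and it assumes the hdfs command is on your PATH. It relies on `hdfs dfs -test -e`, which exits with status 0 when the given path exists.

```shell
# Sketch (hypothetical helper, not part of this repository):
# report which of the given output directories already exist in HDFS.
# Usage: check_output_dirs_free <hdfs base dir> <dir> [<dir> ...]
# Returns 0 if none of the directories exist, 1 otherwise.
check_output_dirs_free() {
    local base="$1"
    shift
    local rc=0
    local dir
    for dir in "$@"; do
        # Strip any trailing slash so the joined path has exactly one separator.
        # `hdfs dfs -test -e PATH` exits 0 when PATH exists.
        if hdfs dfs -test -e "${base%/}/${dir%/}"; then
            echo "already exists: ${base%/}/${dir%/}" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example (commented out so that sourcing this file has no side effects):
# check_output_dirs_free "hdfs://nn-ia.s3s.altiscale.com:8020/user/<your_workbench_username>" \
#     outputARC2/ outputARC2-2/ \
#   && pig -p I_PARSED_DATA=/dataset-derived/gov/parsed/arcs/bucket-2/ \
#          -p I_CHECKSUM_DATA=/dataset/gov/url-ts-checksum/ \
#          -p O_DATA_DIR=outputARC2/ -p O_DATA_DIR_2=outputARC2-2/ \
#          ExtractCounts_keywords.pig
```

With this pattern, pig is only started when both output directories are free, so the job does not fail later at output validation.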
-
If a pig script sets up successfully but then stays stuck at 0% completion for hours: kill the job, then check whether a phantom yarn application left over from a previous job is still running, and if so, kill it.
-
From the workbench, run the command
yarn application -list
and if an old application is listed there that shouldn't be running, run the command
yarn application -kill <Application ID>
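To make leftover jobs easier to spot, the application IDs can be pulled out of the listing. This is a rough sketch under stated assumptions: the function name running_app_ids is hypothetical, the yarn command is assumed to be on your PATH, and IDs are assumed to have the usual application_<digits>_<digits> form. It only lists IDs; killing is left as a manual step so that you can confirm each application really is yours and really is stale.

```shell
# Sketch (hypothetical helper): print the IDs of applications that
# YARN currently reports as RUNNING, one per line, without duplicates.
running_app_ids() {
    # -appStates RUNNING restricts the listing to running applications.
    # grep -o keeps only the ID-shaped tokens; sort -u removes repeats
    # (an ID can also appear inside the tracking URL on the same row).
    yarn application -list -appStates RUNNING 2>/dev/null \
        | grep -o 'application_[0-9]*_[0-9]*' \
        | sort -u
}

# Example (commented out so that sourcing this file has no side effects):
# running_app_ids
# yarn application -kill <Application ID>
```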
-