ye-kyaw-thu/tools

Preprocessing and postediting tools especially for NLP

Natural Language Processing research starts with a lot of groundwork: cleaning text files, converting encodings, changing an existing format into the one you need, extracting only the characters or sentences you want, and so on. Running experiments means writing programs in shell, perl (and, more recently, python) almost every day. Sometimes converting from one format to another takes a whole day spent writing a single program. For that reason, I will keep uploading bash, perl, and python programs that may be useful, whenever I have time. One request: use the programs uploaded here as a starting point and work toward being able to write shell, perl, and python scripts yourself.

For detailed usage, refer to the example-usages.md file in each folder (for bash, for perl, for python).

Ye Kyaw Thu (ရဲကျော်သူ)
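As an illustration of the kind of small cleaning script described above, here is a minimal Python sketch. It is not one of the programs in this repository; the function names `clean_line` and `clean_lines` are hypothetical, and the choice to strip U+200B to U+200D and collapse spaces and tabs is an assumption about what "cleaning" means for a given corpus. Zero-width joiners can be meaningful in some scripts, so adjust the character set before using it on real data.

```python
import re

# Assumption: zero-width space, non-joiner, and joiner (U+200B to U+200D)
# are unwanted in this corpus. This is illustrative, not a general rule.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d]")
SPACES = re.compile(r"[ \t]+")


def clean_line(line: str) -> str:
    """Remove zero-width characters and collapse runs of spaces/tabs."""
    line = ZERO_WIDTH.sub("", line)
    line = SPACES.sub(" ", line)
    return line.strip()


def clean_lines(lines):
    """Yield cleaned lines, skipping any that end up empty."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned:
            yield cleaned


# Small in-memory sample; in practice the lines would come from a file.
sample = ["မြန်မာ\u200b  စာ\t ", "   ", "hello   world"]
for out in clean_lines(sample):
    print(out)
```

To process a real file, the same generator can be fed an open file handle, for example `clean_lines(open("input.txt", encoding="utf-8"))`, where `input.txt` is a placeholder name.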

bash

  1. read-and-move.sh
  2. change-filenames.sh
  3. rm-date-sentences.sh
  4. print-classID-prediction-result.sh
  5. compare-img-or-pdf.sh
  6. chk-sort-by-columns.sh
  7. kill-all-detached.sh
  8. unzip-all-with-one-passwd.sh
  9. cut-filename.sh
  10. calc-avg.sh
  11. print-latex-section.sh
  12. list-mistake-5-suggestion.sh
  13. mytxt2pdf.sh
  14. prepare-open-test-data.sh
  15. print-CRLF.sh
  16. group-files.sh
  17. segmentation.sh
  18. split-even-odd-pdf.sh
  19. even-odd.sh
  20. rm-stopwords.sh
  21. rm-spaces-lineno.sh
  22. blowfish.sh
  23. replace-with-lineno.sh
  24. replace-with-lineno2.sh
  25. OOV-count.sh
  26. find-blank-lines.sh
  27. dot2png-pdf.sh
  28. add-start-end.sh
  29. get-words-with-position.sh
  30. count-string-length.sh
  31. strip-substring.sh
  32. chk_total_duration.sh
  33. print-sentenceID-count.sh
  34. mk-16KHz-mono.sh
  35. mk-spectrogram.sh
  36. group-UCF11.sh
  37. group-within-group-UCF11.sh
  38. dot2pic.sh
  39. 2mono-pdf.sh
  40. calc-bleu-all.sh
  41. calc-ribes-all.sh
  42. print-matched-x.sh
  43. split-train-dev-test.sh
  44. clean-space-all.sh
  45. mk-g2p-model.sh
  46. mk-syl-list.sh
  47. rm-200b-200d.sh
  48. print-char.sh
  49. prepare-10fold-smt-pair.sh
  50. rm-ctrl-m.sh
  51. x-letter-word.sh
  52. paste-column.sh
  53. lm-building-exec.sh
  54. print-most-common.sh
  55. calc-ppl-with-kenlm-query.sh
  56. mk-two-lm-and-merge.sh
  57. mk-class-lm.sh
  58. get-myPOS-tag.sh
  59. rm-myPOStags.sh
  60. print-same-col1.sh
  61. char-segmentation.sh
  62. chk-blank-fields.sh
  63. chk-field-length.sh
  64. crop-pdf.sh
  65. excel2csv-chk-fields.sh
  66. change-format.sh
  67. format-mecab-pos.sh
  68. cp-config.sh
  69. DELETE-ALL.sh
  70. trim-silence.sh
  71. wav2wavform.sh
  72. mytext2pic.sh
  73. formula-pic.sh
  74. rm-heading-tab-lineno.sh
  75. mk-10cross-data.sh
  76. align-GIZA++.sh
  77. date-time-info.sh
  78. mp4-to-wav.sh
  79. my-font-chk.sh
  80. rec-recorder.sh
  81. mp42gif.sh
  82. extract-target-text.sh
  83. txt2png.sh
  84. pic2histogram.sh
  85. tesseract-ocr.sh
  86. sylbreak-10fold-mt.sh
  87. syllable-break-multi-files.sh
  88. build-fastalign-pt.sh
  89. txt2ASL-BSL.sh
  90. mgiza-align.sh
  91. add-dummy-word-mk-csv.sh
  92. kidbright-burmese-transcription.sh
  93. count-csv-fields.sh
  94. sylbreak-gui.sh
  95. espeak-and-zenity.sh
  96. find-edit-gui.sh
  97. sqlite3-gui.sh
  98. mk-background-transparent.sh
  99. spelling-checker-with-dict.sh
  100. chop-by-silence.sh
  101. random-no.sh
  102. sort-capitalized-letter-first.sh
  103. chk-wavefile-duration-for-unicode-filename.sh
  104. calc-chrF.sh
  105. check-end-mark.sh
  106. word2pdf.sh

perl

  1. clean-space.pl
  2. rm-EnglishSentences.pl
  3. word-analysis.pl
  4. print-emojiSentences.pl
  5. dq-multilines.pl
  6. mk-abstract-para.pl
  7. print-mySentenceOnly.pl
  8. rm-symbol-and-myVowel-only-sentences.pl
  9. rm-space-btw-numbers.pl
  10. print-ngram.pl
  11. print-codepoint.pl
  12. wc.pl
  13. wordlimit.pl
  14. wordwrap.pl
  15. get-syl-potma.pl
  16. my-linebreak.pl
  17. rm-ne-tag.pl
  18. clean-v-without-c.pl
  19. x-x-to-x-comma-x-with-brackets.pl
  20. select-en-th-my.pl
  21. mk-speakers-json.pl
  22. string-distance.pl
  23. print-matched-char-seq.pl
  24. search-common.pl
  25. fixed-parallel-order.pl
  26. encode-input.pl
  27. decode.pl
  28. mk-one2one-freq.pl
  29. mk-one-syl-confusion.pl
  30. rm-onechar-line.pl
  31. replace-with-lineno.pl
  32. chk-pos-tags.pl
  33. count-string-length.pl
  34. print-diff-word.pl
  35. print-union-isect-diff.pl
  36. print-common-kachin.pl
  37. sylbreak.pm
  38. test.sylbreak.pm.pl
  39. tag-BI.pl
  40. bigram-similarity.pl
  41. chk-src-trg-words.pl
  42. print-my-numeric-sentence.pl
  43. number-punct-segmentation.pl
  44. tabpair-to-crfcol.pl
  45. print-blank-lines.pl
  46. add-spu_id.pl
  47. human-mt-eval-form.pl
  48. trainTuneScore_jamy.pl
  49. rm-blank-line.pl
  50. gizaA3-4human.pl
  51. print-fngram-format.pl
  52. print-myWordOnly.pl
  53. fastalign-4human.pl
  54. find-one-file-words-in-another.pl
  55. mypos2json.pl
  56. roman2myno.pl
  57. bracket-tree2sentence.pl
  58. clean-punctuation.pl
  59. mk-spelling-dict.pl
  60. remove-one-char-lines.pl
  61. clean-brackets-tags.pl
  62. check-empty-field.pl
  63. eng-sentence-split.pl

Python

  1. chk-token.py
  2. numpy-array-element-compare.py
  3. char-count-element-wise.py
  4. char-startswith-element-wise.py
  5. fuzzy-match.py
  6. hex2uni.py
  7. korean-breaks.py
  8. epitranscribe.py
  9. plot-unicode-char.py
  10. en-sentence-tokenizer.py
  11. en-word-tokenizer.py
  12. en-tokenization-on-punctuation.py
  13. filter-en-stopwords.py
  14. mk-QR-code.py
  15. wu-palmer-similarity.py
  16. nltk-en-pos-tagger.py
  17. folder-file-dict.py
  18. csv-str2mapping123.py
  19. str2mapping123.py
  20. str2my-edit-distances.py
  21. mypos2upos.py
  22. isolation-forest.py
  23. accuracy.py
  24. how-name-eq-main-work.py
  25. f1-score-calc.py
  26. multi-class-f1.py
  27. language-detect.py
  28. python-list-eg.py
  29. split-train-test.py
  30. split-train-valid-test.py
  31. add-sign.py
  32. add-sign-onepage-pdf.py
  33. print-img-resolution.py
  34. print-pixel-value.py
  35. RGB2grey.py
  36. image2npy.py
  37. syl2freq.py
  38. syl2tf.py
  39. syl2idf.py
  40. syl2tf-idf.py
  41. syl2onehot-sklearn-4teaching.py
  42. syl2onehot-sklearn.py
  43. zawgyi2unicode.py
  44. zawgyi2unicode-syl.py
  45. word2vec.py
  46. make-edit-error.py
  47. 8eval.py
  48. soundex-metaphone.py
  49. 7sim.py
  50. abugida.py
  51. tex-spellcheck.py
  52. video_augment.py
  53. mk-video-class.py
  54. mk-video-class-for-sentence.py
  55. m4v_to_mp4.py
  56. mov_to_mp4.py
  57. jiwer_wer_mer_wil.py
  58. passphrase_generator.py
  59. rule_based_password_gen.py
  60. MOS_eval.py
  61. spacy_pos_ner.py
  62. spacy_pos_dep_jp.py
  63. spacy_pos_ner_dep_zh.py
  64. nltk-lm.py
  65. nltk-lm-predict.py
  66. format_conversion.py
  67. format_conversion_with_error_check.py
  68. cut_columns.py
  69. bidirectional_maximum_matching.py
  70. extract_filename_parts.py
  71. sort_openslr_transcript.py
  72. speech_corpus_info.py
  73. dKNN.py
  74. dKNN-ver2.py
  75. change_sampling_rate.py
  76. check_silence.py
  77. graph_lm_spellchek.py
  78. detect_language_ver1.py
  79. detect_language_ver2.py
  80. embedder.py
  81. test_embedding.py
  82. convert_to_conllu.py
  83. convert_to_spacyNER_json.py
  84. split_parallel_data.py
  85. clean_text.py
  86. extract_emoji.py
  87. compare_characters.py
  88. word_length_analysis.py
  89. comma2tab_label2digit.py
  90. conv_delimiter_label2digit.py
  91. padsint_detection.py
  92. replace_pipe_with_space.py
  93. pos_pattern_checker.py
  94. sort_ngram.py
  95. analyze_NER_corpus.py
  96. compare_sentence_tag_distributions.py
  97. compare_word_tag_distributions.py
  98. print_codepoint.py
  99. syl_ngram_mi.py
  100. txt_dl.py
  101. markov_txt_gen.py
  102. tesseract_ocr.py
  103. NER_23to9_conv.py
  104. tf_event2txt.py
  105. hangul_syl_generator.py
  106. ngram_segmentation.py
  107. long_sentence_wrapper.py
  108. mm_proverb_parser.py
  109. grapheme_tokenizer.py
  110. icu_collation.py
  111. icu_transliteration.py
  112. my_transliteration.py
  113. kana2roman.py
  114. prefix_suffix_extract.py
  115. mk_only_my.py
  116. rm_my_two_symbols.py
  117. char_segmentation.py
  118. fasttext_format_converter.py
  119. run_sylbreak.py
  120. rm_zwnj_zwsp_hsp.py
  121. clean_non_burmese.py
  122. eval_ngram_lm.py
  123. parquet_extractor.py
  124. g2p-compare.py
  125. extract-ReMeDi.py
  126. split-sentences-by-pipe.py
  127. format-conv.py
  128. wtc-paste.py
  129. cv-split.py
  130. mk_hatespeech_dict.py
  131. train_embedding.py
  132. convert_to_two_words_dict.py
  133. emoji_count.py
  134. rm_blank_line.py
  135. my_no_spacing.py
  136. punc_emoji_spacing.py
