Split-words


Inspired by 分词打标 (word-segmentation tagging) and the Jieba tutorial, this repo presents a method for splitting Chinese words and counting their frequency. Since the original approach performed poorly, a stopword list is applied and English words are handled correctly.
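
The core idea, as a minimal sketch (the full script is listed under "source code" below; the sample sentence and the tiny stopword set here are only illustrative):

    import jieba.posseg as psg

    stopwords = {"的", "了", "the", "a"}   # illustrative stopword set
    # posseg.cut yields pairs carrying a .word and a POS .flag attribute.
    for w in psg.cut("我爱北京大学的图书馆"):
        if w.word not in stopwords:
            print(w.word, w.flag)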

  • P.S. 😂😂😂 Leetcode 192 (Word Frequency): if you only need to count word frequencies, this simple one-line shell solution does it (a rough Python equivalent follows):
    cat words.txt | xargs -n 1 | sort | uniq -c | sort -nr | awk '{print $2" "$1}'
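
For comparison, a rough Python equivalent of that one-liner using collections.Counter (words.txt is the input file from the Leetcode problem; ties may be ordered differently than in the shell version):

    from collections import Counter

    with open('words.txt', encoding='utf-8') as f:
        counts = Counter(f.read().split())
    # Print "word count" pairs, most frequent first.
    for word, n in counts.most_common():
        print(word, n)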

The project files and results are as follows:

- 🔥 1. input:

  • All.csv —— the text whose words are to be segmented.
  • stopwords.txt —— the stopword list used (both English and Chinese).

- 🔥 2. main file:

- 🔥 3. output:

- 🔥 4. source code:

# -*- coding: UTF-8 -*-
import csv
import jieba
import jieba.posseg as psg
import time
import argparse


ap = argparse.ArgumentParser()
ap.add_argument("-f", "--file-name", required=True,
    help="Input file name")
args = vars(ap.parse_args())

# Load the stopword list (one entry per line, both English and Chinese).
stopwords = set(line.strip() for line in open('stopwords.txt', encoding='UTF-8-sig'))
# POS flags to skip: numerals, onomatopoeia, classifiers, punctuation, non-morpheme characters, modal particles.
stopflags = ["m", "o", "q", "w", "x", "y"]
# Read the input CSV line by line.
lines = open(args["file_name"], 'r', encoding='UTF-8-sig').read().split('\n')

# Lists for the per-line segmentation results and the keyword-frequency rows.
Rs1 = []
Rs2 = []
# Frequency dictionary keyed by (word, POS flag).
dic = {}
# Chinese names of the jieba POS flags, used in the keyword output.
dic2 = {"Ag":"形语素","a":"形容词","ad":"副形词","an":"名形词","b":"区别词","c":"连词","dg":"副语素","d":"副词","e":"叹词","f":"方位词","g":"语素",
    "h":"前接成分","i":"成语","j":"简称略语","k":"后接成分","l":"习用语","m":"数词","Ng":"名语素","n":"名词","nr":"人名","ns":"地名","nt":"机构团体",
    "nz":"其他专名","o":"拟声词","p":"介词","q":"量词","r":"代词","s":"处所词","tg":"时语素","t":"时间词","u":"助词","vg":"动语素","v":"动词",
    "vd":"副动词","vn":"名动词","w":"标点符号","x":"非语素字","y":"语气词","z":"状态词","un":"未知词"}

for line in lines:
  result = []
  # Choose a cut mode; posseg.cut also returns a POS flag for each word.
  # seg_list = jieba.cut(line)
  seg_list = psg.cut(line)
  # seg_list = jieba.cut_for_search(line)
  # To keep the original line as the first column, uncomment:
  # result.append(line)
  # Collect the words of this line, skipping stopwords and stopped POS flags.
  for w in seg_list:
    if w.word not in stopwords and w.flag not in stopflags:
      result.append(w.word)
      # Count by (word, flag): posseg pairs are distinct objects, so using the
      # pair itself as the dictionary key would not accumulate repeated words.
      dic[(w.word, w.flag)] = dic.get((w.word, w.flag), 0) + 1
  # Append this line's words to the output list.
  Rs1.append(result)

# Write the segmented lines to a CSV; the file name carries a timestamp
# (e.g. 0fenci_190805_09_4902.csv) to avoid overwriting earlier runs.
timestamp = time.strftime("%y%m%d_%H_%M%S", time.localtime())

file = open('0fenci_' + timestamp + '.csv', 'w', newline='', encoding='utf-8-sig')
writer = csv.writer(file)
writer.writerows(Rs1)
file.close()

# Sort the frequency dictionary by count, descending, and build the output rows.
sorted_counts = sorted(dic.items(), key=lambda d: d[1], reverse=True)
for (word, flag), count in sorted_counts:
  Rs2.append((word, flag, dic2.get(flag), count))

# print(Rs2[:10])
file = open('1keyword_' + timestamp + '.csv', 'w', newline='', encoding='utf-8-sig')
writer = csv.writer(file)
writer.writerows(Rs2)
file.close()
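
To run the script, pass the input file with -f (the file name split_words.py below is only a placeholder; the README does not name the main file):

    python split_words.py -f All.csv

This produces two timestamped files: 0fenci_YYMMDD_HH_MMSS.csv with the segmented words of each input line, and 1keyword_YYMMDD_HH_MMSS.csv with one row per distinct word: the word, its POS flag, the flag's Chinese name, and its count.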

🔥 - 中国大学: word segmentation of the names of all Chinese universities

- Thoughts:

  • Word cloud?

  • A shell one-liner can handle the word-frequency problem, but reproducing the segmentation output for the original text, and handling Chinese, are the hard parts.

  • Group the key information together.

  • Keyword sorting: is it more efficient to sort dic inside the source code, or to export the results first and sort afterwards?

  • Once a word has been matched against the stopword list, it does not need to be checked again.

  • Public-opinion analysis.

  • Awesome!!!

  • Chinese knowledge graph: openkg.cn

  • Quickly split out university abbreviations as well (e.g. 北京大学→北大, 清华大学→清华, 上海交通大学→上交).

  • Render the graph with Baidu ECharts, following the idea from @zhouwei713.

  • After extracting high-frequency words, link them to one another to form a graph, as shown below:

  • Ucinet 6 NetDraw can draw a simple version, but the visualization quality is poor, as shown in the figure below:
