RASIA

Research Articles' Structure Identification and Applications in academic text mining, bibliometrics and scientometrics.

There are two main tasks in this project.

1. Identification of the general structure of scientific articles.
2. Applications in bibliometrics, scientometrics and text mining.

Data

1. Research articles from Computer Science, labeled as CS
2. Articles from PLOS (PLOS ONE is used), labeled as PLOS

Method

1. Section header based identification
2. Section content based identification
3. Hybrid identification

Tools

Usage

Preprocessing

Data will be saved to data/sec-header.json and data/sec-type.json. Log information is printed to the standard output stream, and data is written to the standard error stream.

python statistics/plos_xml_statistics.py [path directory] 1>plos_statstic.log 2>header_style.txt
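
The stream split used by the PLOS command above (log on standard output, data on standard error) can be sketched as follows. This is only an illustration of the idea, not the script's actual code: the function name `report` and the sample messages are assumptions.

```python
import io
import sys


def report(log_message, data_line, log_stream=None, data_stream=None):
    """Write a log message to one stream and a data line to another.

    By default the log goes to stdout and the data to stderr, so the two
    can be separated on the command line with `1>` and `2>`.
    """
    if log_stream is None:
        log_stream = sys.stdout
    if data_stream is None:
        data_stream = sys.stderr
    print(log_message, file=log_stream)
    print(data_line, file=data_stream)


# Demonstration with in-memory streams instead of the real ones
# (the messages are hypothetical).
log_buffer, data_buffer = io.StringIO(), io.StringIO()
report("processed 1 file", "Introduction", log_buffer, data_buffer)
```

Called without the last two arguments, `report` writes to the real streams, which the shell can then redirect into separate files.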

For ScienceDirect data:

python statistics/sc_xml_statistics.py [index file path] 1>headers.txt 2> sc_statistic.log

The statistics show that PLOS_XML contains only 205 unique section headers, and these account for 97% of all section header occurrences. PLOS_XML data therefore do not need a complicated classifier; a dictionary lookup alone could reach very high precision. For ScienceDirect files, however, the high-frequency section headers account for only 51%.

We therefore use ScienceDirect as our data.
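
The coverage figure and the dictionary approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the function names `normalize`, `header_coverage` and `lookup_section_type`, the normalization rule, and the sample headers and labels are all hypothetical.

```python
from collections import Counter
import re


def normalize(header):
    """Lowercase a header and strip leading numbering such as '2.' or '3.1'."""
    header = header.strip().lower()
    return re.sub(r"^\d+(\.\d+)*\.?\s+", "", header)


def header_coverage(headers, top_k):
    """Fraction of all header occurrences covered by the top_k most frequent headers."""
    counts = Counter(normalize(h) for h in headers)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_k))
    return covered / total


def lookup_section_type(header, dictionary, default="unknown"):
    """Dictionary-based identification: map a normalized header to a section type."""
    return dictionary.get(normalize(header), default)


# Hypothetical sample data, for illustration only.
sample_headers = ["1. Introduction", "Methods", "Results", "introduction",
                  "Methods", "A note on sampling"]
sample_dictionary = {"introduction": "introduction", "methods": "method",
                     "results": "result"}

coverage = header_coverage(sample_headers, top_k=2)
section_type = lookup_section_type("2. Methods", sample_dictionary)
```

When a small set of headers covers most occurrences, as reported for PLOS_XML, the lookup alone labels nearly every section; when coverage is low, as reported for ScienceDirect, many headers fall through to the default and a learned model is needed.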

Section header based identification

Randomly select 300 papers, and label the general structure of papers.

 python tools/random_selection.py rn paths.txt 300 > sc_selected_papers.txt

 python section_header_based/extract_headers_for_manually_labeling.py sc_selected_papers.txt > section_headers_for_labeling.txt
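
The random selection step above can be sketched as follows. This is an assumption about what such a step might look like, not the contents of tools/random_selection.py: the function name `select_random_papers`, the `seed` parameter, and the generated path list are hypothetical.

```python
import random


def select_random_papers(paths, n, seed=None):
    """Return n distinct paths chosen uniformly at random, without replacement."""
    if n > len(paths):
        raise ValueError("cannot select more papers than are available")
    rng = random.Random(seed)
    return rng.sample(paths, n)


# Hypothetical path list standing in for the contents of a paths file.
all_paths = ["papers/{:04d}.xml".format(i) for i in range(1000)]
selected = select_random_papers(all_paths, 300, seed=42)
```

Fixing the seed makes the sample reproducible, which helps when the labeled subset has to be regenerated later.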

The selected papers are manually labeled by two PhD students.
After checking the labels, build the section header based dataset.
We use three models, including SVM and CRF. The baseline is CRF with the features used in ParsCit.
