GitHub - nteetor/doc-integration: A document integration project. Given a paragraph taken from a text, can the program correctly insert it back into that text?

The goal of this project was to create a program that could, based on semantic analysis, properly integrate text back into the larger body of text from which it was pulled.

This project was built entirely offline and thus is not as GitHub-friendly as it could be.

A set of just under 17,000 Wikipedia articles was selected from a much larger set of articles. Each of the selected articles needed to have at least 600 words. This minimum was put in place to help ensure each article had multiple paragraphs and was content rich.
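The length filter described above could be sketched roughly as follows. This is only an illustration: the function names, the dictionary-of-articles input, and the whitespace-based word count are assumptions, not taken from dataExtraction.py.

```python
def word_count(text):
    # Assumption: a "word" is any whitespace-separated token.
    return len(text.split())


def select_articles(articles, min_words=600):
    """Keep only the articles with at least `min_words` words.

    `articles` is a hypothetical mapping of article title -> article text.
    """
    return {
        title: text
        for title, text in articles.items()
        if word_count(text) >= min_words
    }
```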

For each article/file the program would randomly select a paragraph, remove it from the text, and use each of the three similarity functions to place the paragraph back into the file text.
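A minimal sketch of the removal step, assuming paragraphs are separated by blank lines; the helper names are hypothetical and the project's own code may split and select paragraphs differently.

```python
import random


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one blank line.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def remove_random_paragraph(paragraphs, rng=random):
    """Pick one paragraph at random and return it with the rest of the text.

    Returns (removed, remaining, index), where `index` is the original
    position of the removed paragraph, kept so a later guess can be checked.
    """
    index = rng.randrange(len(paragraphs))
    removed = paragraphs[index]
    remaining = paragraphs[:index] + paragraphs[index + 1:]
    return removed, remaining, index
```

Keeping the original index alongside the removed paragraph is what makes it possible to judge afterwards whether a similarity function put the paragraph back in the right place.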

The similarity functions worked as follows: given the removed paragraph and two consecutive paragraphs from the article, each function calculated a similarity score. If the score was high enough, the program determined that the removed paragraph belonged between the two paragraphs, i.e. that it had originally sat between them. This result, along with the semantic composition of the article, was written to a CSV file for analysis.
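The placement logic can be illustrated with a stand-in similarity measure. The sketch below uses Jaccard word overlap, which is an assumption chosen for brevity and is not one of the project's three similarity functions (those are described in the writeup). The function names, the averaging of the two scores, and the threshold value are likewise hypothetical.

```python
import re


def tokens(text):
    # Lowercased set of word tokens; a deliberately crude stand-in for
    # real semantic analysis.
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def gap_score(removed, before, after):
    # Assumption: score a gap by averaging the similarity to each neighbour.
    return (jaccard(removed, before) + jaccard(removed, after)) / 2


def place_paragraph(removed, remaining, threshold=0.2):
    """Return the position at which to reinsert `removed`, or None.

    Position i means "between remaining[i - 1] and remaining[i]". The first
    gap whose score reaches the (hypothetical) threshold is accepted.
    """
    for i in range(1, len(remaining)):
        if gap_score(removed, remaining[i - 1], remaining[i]) >= threshold:
            return i
    return None
```

Note that this sketch only considers gaps between two consecutive paragraphs, so it never places a paragraph at the very start or end of a text.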

Analysis of the results was done in R; see the RStuff.R file. The writeup gives a much better description and explanation of the project and the similarity functions. It can be found publicly through my Google Drive by following this link: final writeup.


Files in the repository:

README.md
RStuff.R
SimilarityMethods.py
corpusData.csv
dataExtraction.py
mcsTest.py
methodData.csv
methodTester.py
pcp.py
pcpTester.py
