GitHub - suriyan/autosum: Summarize Publications Automatically

AutoSum: Summarize Publications Automatically

The tool exploits the labor scholars have already expended in summarizing articles. It scrapes the words next to citations across all openly available research citing a publication, and collates the output. The result is a summary of how the publication is cited, along with data in a format that makes potential miscitations easy to discover.

- Get the Data: Scrapes all openly accessible research citing a particular publication using links provided by Google Scholar. Note: Google monitors scraping on Google Scholar.
- Parse the Data: Iterates through a directory containing the articles that cite a particular research article and, using regular expressions, picks up sentences near a citation.
- Example from Social Science

Get the Data

To search Google Scholar for openly accessible PDFs citing the original research article, use scholar.py.

Input: URL to Google Scholar Page of an article.
What the script does:
- Goes to 'Cited By..'
- Downloads a user-specified number of publicly available papers (PDFs only for now) that cite the paper to a user-specified directory.
- Creates a CSV that tracks basic characteristics of each downloaded paper: title, URL, author names, journal, etc. It also records the relative path to each downloaded file.
Sample output
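
The tracking CSV described above could be produced along these lines. This is a minimal sketch, not the code in scholar.py: the column names (`title`, `url`, `authors`, `journal`, `pdf_path`) and the `write_tracking_csv` helper are assumptions made for illustration, and the record is made up.

```python
import csv
import io

# Hypothetical column names; the real scholar.py output may use different ones.
FIELDS = ["title", "url", "authors", "journal", "pdf_path"]


def write_tracking_csv(records, handle):
    """Write one row per downloaded paper to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing keys are written as empty cells.
        writer.writerow({field: record.get(field, "") for field in FIELDS})


# Made-up record for illustration only.
records = [
    {
        "title": "Example citing paper",
        "url": "https://example.org/paper.pdf",
        "authors": "A Author; B Author",
        "journal": "Example Journal",
        "pdf_path": "pdfs/example-citing-paper.pdf",
    }
]

buffer = io.StringIO()
write_tracking_csv(records, buffer)
tracking_csv = buffer.getvalue()
print(tracking_csv)
```

Storing the relative path next to the metadata is what lets a later parsing step find each PDF from the CSV alone.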

Usage

usage: scholar.py [-h] [-u USER] [-p PASSWORD] [-a AUTHOR] [-d DIR]
                  [-o OUTPUT] [-n N_CITES] [-v] [--version]
                  keyword [keyword ...]

positional arguments:
  keyword               Keyword to be searched

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  Google account e-mail
  -p PASSWORD, --password PASSWORD
                        Google account password
  -a AUTHOR, --author AUTHOR
                        Author to be filtered
  -d DIR, --dir DIR     Output directory for PDF files
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -n N_CITES, --n-cites N_CITES
                        Number of cites to be download
  -v, --verbose
  --version             show program's version number and exit

Example

python scholar.py -v -d pdfs -o output.csv -n 100 -a "A Einstein" \
"Can quantum-mechanical description of physical reality be considered complete?"

Parse the Data

To scrape the text next to the relevant citations within the pdfs, use searchpdf.py:

The script iterates through the PDFs listed in the CSV generated above.
For each PDF, it extracts the text matching the regular expressions and writes it to the same CSV. If more than one regex matches, the matches are concatenated, separated by a line break.
Sample output
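
The matching and concatenation step might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in searchpdf.py: the helper names, the `pdf_path` and `citation_text` column names, and the sample text are all hypothetical, the separator is assumed to be a single newline, and a plain dictionary of strings stands in for text extracted from the PDFs.

```python
import re


def extract_citation_text(text, patterns):
    """Collect the matches of every pattern and join them with newlines."""
    snippets = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            # Use the first capture group if the pattern has one and it matched.
            grouped = match.group(1) if match.re.groups else None
            snippets.append(grouped if grouped is not None else match.group(0))
    return "\n".join(snippets)


def annotate_rows(rows, texts, patterns):
    """Return copies of the CSV rows with a hypothetical 'citation_text' column."""
    annotated = []
    for row in rows:
        text = texts.get(row["pdf_path"], "")
        annotated.append({**row, "citation_text": extract_citation_text(text, patterns)})
    return annotated


# Made-up rows and texts; a real run would extract the text from each PDF.
rows = [
    {"title": "Paper A", "pdf_path": "pdfs/a.pdf"},
    {"title": "Paper B", "pdf_path": "pdfs/b.pdf"},
]
texts = {
    "pdfs/a.pdf": "Framing matters. Episodic frames shift blame (Smith 2001). More text.",
    "pdfs/b.pdf": "No relevant citation here.",
}

annotated = annotate_rows(rows, texts, [r"Smith \d{4}", r"frames \w+ blame"])
for row in annotated:
    print(row["title"], repr(row["citation_text"]))
```

Rows whose text matches nothing get an empty cell, so the output CSV keeps one row per input paper.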

usage: searchpdf.py [-h] [-i INPUT] [-o OUTPUT] [-v] [--version]
                    regex [regex ...]

positional arguments:
  regex                 Regex to be search

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        CSV input filename
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -v, --verbose
  --version             show program's version number and exit

Example

python searchpdf.py -v -i output.csv -o search-output.csv "\.\s(.{5,100}[\[\(]?Einstein.{2,30}\d+[\]\)])"

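The regular expression matches text that follows a period and a whitespace character (5 to 100 characters), followed by an optional opening bracket or parenthesis, the author name "Einstein", any text (2 to 30 characters), and a number with a closing bracket or parenthesis at the end.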
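
To see what this pattern captures, it can be tried on a short string with Python's `re` module. The sample text below is made up for illustration, and whether searchpdf.py applies the pattern in exactly this way is an assumption.

```python
import re

# The pattern from the searchpdf.py example above.
PATTERN = r"\.\s(.{5,100}[\[\(]?Einstein.{2,30}\d+[\]\)])"

# Made-up text standing in for the contents of a citing paper.
text = (
    "Prior work was limited. The completeness of quantum mechanics "
    "was questioned early on (Einstein et al., 1935). Later work followed."
)

# findall returns the contents of the single capture group for each match.
matches = re.findall(PATTERN, text)
for snippet in matches:
    print(snippet)
```

Because the pattern starts with `\.\s`, a citing sentence is only captured when it follows the end of a previous sentence. In Python, `.` does not match a newline by default, so a citation that spans a line break is missed unless the pattern is compiled with `re.DOTALL`.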

Example from Social Science

What to search for?

Example with Google Scholar
Download 500 articles from Google Scholar:

python scholar.py -v -d pdfs -o iyengar-output.csv -n 500 -a "S Iyengar" "Is anyone responsible?: How television frames political issues."

Searching in the Test Data
- Sample input data
- Use autosumpdf.py to filter citations to Iyengar et al. 2012 using the regular expression "Iyengar.{3,30}2012":
```
python autosumpdf.py -v -i testdata.csv -o search-testdata-new.csv "Iyengar.{3,30}2012"
```
- Output
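
As a quick check of what "Iyengar.{3,30}2012" does and does not pick up, the pattern can be run against a few strings. The citation strings below are made up for illustration and are not taken from the test data.

```python
import re

PATTERN = re.compile(r"Iyengar.{3,30}2012")

# Made-up citation strings for illustration.
samples = [
    "as shown by Iyengar et al. 2012, affect matters",
    "see Iyengar et al. (2012) for details",
    "Iyengar 2012",
    "Iyengar and others 2010",
]

# Keep only the strings in which the pattern is found.
matched = [sample for sample in samples if PATTERN.search(sample)]
for sample in matched:
    print(sample)
```

The first two strings match. "Iyengar 2012" does not, because `.{3,30}` requires at least three characters between the name and the year, so single-author citation styles would need a looser lower bound.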
Miscitations
Social scientists hold that few truths are self-evident. But some truths become obvious to all social scientists after some years of experience, including: a) peer review is a mess, b) faculty hiring is idiosyncratic, and c) research is often miscited. Here we quantify the last of these.

License

Released under the MIT License
