This project investigates the coverage and the role of the Semantic Scholar (S2) search engine in conducting secondary studies in software engineering.
To run the scripts, you need to:
- download the latest S2 corpus from http://api.semanticscholar.org/corpus/download/ and place it in data/sscholardump/.
- install all required packages listed in the requirements.txt file.
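Before launching the scripts, it can help to confirm that the corpus is where they expect it. The following is a minimal sketch, not part of the project: the helper name `corpus_files` is hypothetical, and the only detail taken from this README is the data/sscholardump/ path.

```python
from pathlib import Path


def corpus_files(corpus_dir="data/sscholardump"):
    """Return the files found in the corpus folder, sorted by name.

    Hypothetical helper: it only checks that the folder exists and lists
    its files; it makes no assumption about the dump's file names.
    """
    root = Path(corpus_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"S2 corpus folder not found: {root}")
    return sorted(p for p in root.iterdir() if p.is_file())


if __name__ == "__main__":
    try:
        files = corpus_files()
        print(f"{len(files)} file(s) found in the corpus folder")
    except FileNotFoundError as err:
        print(err)
```

An empty list means the folder exists but the dump has not been copied into it yet.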
The project contains 4 main folders:
This folder contains the data used in the project:
- cso: Computer Science Ontology, stored as a JSON file
- swebok: Software Engineering Body of Knowledge, stored as a JSON file
- sscholardump: where the S2 corpus dump must be saved for the scripts to run properly
This folder contains the results obtained:
- Findings.xlsx: includes the final and intermediate results of the project
- Studies.bib: includes the metadata of the studies included in the review
- Metadata.bib: includes the metadata of all the papers included in the selected studies (Studies.bib)
This folder contains the Python scripts used to automate the project:
- bibtexloader.py: loads BibTeX files and extracts the information needed to search the S2 dump
- onto_handler.py: cleans cso.owl and transforms it into an appropriate JSON file
- locate_papers_in_corpus.py: implements the preliminary searches that locate papers in the corpus
- semantic_scholar_search.py: implements the functions that search the corpus with the provided queries; it also implements the snowballing process
- query_analyzer.py: implements search query construction and expansion using ontology terms
- main.py: the main file, used to launch the scripts.
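As an illustration of the preliminary search step, the sketch below locates papers in the corpus by title. It is not the code in locate_papers_in_corpus.py: the function names are hypothetical, and it assumes that the dump is read line by line, with one JSON object per line and a "title" field.

```python
import json
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def locate_titles(corpus_lines, wanted_titles):
    """Map each wanted title to the corpus record with the same normalized title.

    corpus_lines: iterable of JSON strings, one record per line (assumed format).
    wanted_titles: titles taken from the BibTeX entries.
    """
    wanted = {normalize_title(t): t for t in wanted_titles}
    found = {}
    for line in corpus_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        key = normalize_title(record.get("title") or "")
        if key in wanted:
            found[wanted[key]] = record
    return found


if __name__ == "__main__":
    sample = [
        '{"id": "a1", "title": "A Survey of Software Testing."}',
        '{"id": "b2", "title": "Unrelated Paper"}',
    ]
    hits = locate_titles(sample, ["A survey of software testing"])
    print(sorted(hits))
```

Matching on a normalized title tolerates differences in case and punctuation, but it cannot tell apart two distinct papers that share a title; a DOI comparison would be stricter where DOIs are available.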
For each review selected in the study, this folder provides:
- All.bib: list of all the papers included by the review
- -Query.bib: list of papers not identified by the original query. References highlighted in red are missing from Semantic Scholar; those highlighted in yellow are found by the query but under a research field other than computer science; those highlighted in orange are also identified by the query but fall outside the year range specified in the review.
- -Snowballing.bib: list of papers not identified after snowballing
- -Ontology.bib: list of papers not identified after searching with refined queries
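To show how a list such as -Snowballing.bib could be derived, here is a minimal sketch of one snowballing round followed by a set difference. It is not the code in semantic_scholar_search.py: the function name is hypothetical, and the "inCitations" and "outCitations" fields are an assumption about the corpus records.

```python
def snowball_once(seed_ids, records):
    """Return the seed papers plus every paper they cite or are cited by.

    seed_ids: ids of the papers already found by the query.
    records: dict mapping a paper id to its corpus record; each record may
             carry 'outCitations' and 'inCitations' id lists (assumed fields).
    """
    reached = set(seed_ids)
    for paper_id in seed_ids:
        record = records.get(paper_id, {})
        reached.update(record.get("outCitations", []))  # backward snowballing
        reached.update(record.get("inCitations", []))   # forward snowballing
    return reached


if __name__ == "__main__":
    records = {
        "p1": {"outCitations": ["p2"], "inCitations": ["p3"]},
        "p2": {},
    }
    included_by_review = {"p1", "p2", "p4"}
    reached = snowball_once({"p1"}, records)
    # Papers the review includes that snowballing still did not reach.
    print(sorted(included_by_review - reached))
```

The round can be repeated by feeding the returned set back in as the new seeds.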
Each dataset contains the set of included studies for a specific systematic literature review (SLR), as stated by its authors, extracted and saved in a readable format (.bib). To obtain a reasonable set of excluded studies, we ran each SLR's query in Scopus and applied the same inclusion criteria as the original SLR: publication period, type, and language. Studies returned by Scopus but not included in the SLR are considered excluded studies.
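The construction of the excluded set can be sketched as a filter followed by a set difference. This is an illustration only: the function name and the entry fields ("title", "year", "language", "type") are assumptions, not the format of a Scopus export or of the project's scripts.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def excluded_studies(scopus_results, included_titles, year_range, languages, doc_types):
    """Keep Scopus results that meet the inclusion criteria but are not in the SLR.

    scopus_results: list of dicts with 'title', 'year', 'language', 'type' (assumed).
    included_titles: titles of the studies included by the SLR.
    year_range: (first_year, last_year), both inclusive.
    """
    included = {normalize_title(t) for t in included_titles}
    first_year, last_year = year_range
    excluded = []
    for entry in scopus_results:
        if not first_year <= entry["year"] <= last_year:
            continue  # outside the publication period
        if entry["language"] not in languages or entry["type"] not in doc_types:
            continue  # wrong language or publication type
        if normalize_title(entry["title"]) in included:
            continue  # included by the SLR, so not an excluded study
        excluded.append(entry)
    return excluded


if __name__ == "__main__":
    results = [
        {"title": "Paper A", "year": 2015, "language": "English", "type": "Article"},
        {"title": "Paper B", "year": 2016, "language": "English", "type": "Article"},
    ]
    kept = excluded_studies(results, ["paper a"], (2010, 2020), {"English"}, {"Article"})
    print([e["title"] for e in kept])
```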