Domain credibility

The method currently adopted is a gradient-boosted decision tree (XGBoost/LightGBM).
This works well with tabular datasets.
Features were orthogonal.
ANOVA was used instead of mutual information for feature reduction.
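As an illustration of ANOVA-based feature reduction, here is a minimal sketch that scores one feature column with the one-way ANOVA F statistic. The function name and the toy data are assumptions made for this example; the project itself may rely on a library routine instead.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for one feature, grouped by class label.

    Assumes at least two distinct labels. A larger F means the class means
    are further apart relative to the spread inside each class.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = mean(values)

    # Variation between class means, weighted by class size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of each sample around its own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical toy data: two features, two classes.
labels = [0, 0, 0, 1, 1, 1]
informative = [1, 2, 3, 7, 8, 9]
uninformative = [1, 2, 3, 1, 2, 3]
scores = {
    "informative": anova_f_score(informative, labels),
    "uninformative": anova_f_score(uninformative, labels),
}
```

Features would then be kept in descending order of score, dropping those near zero before training the GBDT.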

Dependencies:

NLTK and BeautifulSoup.
EasyList.
YSlow with PhantomJS
Merceine API
WebArchives API
StanfordCoreNLP Library
urllib

Work:

28/11/19

1.Siamese Net (pairwise comparison, how to design triplet loss)
2.Relevancy Sorting (used by Google, fine-tuned mix of a lot of stuff)
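For the triplet-loss question, the standard formulation is max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below computes it on plain vectors; the function names, the Euclidean distance, and the margin value are assumptions for illustration, and a real Siamese network would apply this to learned embeddings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the negative is at least `margin` further
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: an easy triplet and a hard one.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

Only the hard triplet contributes a non-zero loss, which is why triplet selection matters when designing the training loop.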

30/11/19

1.For a task specific to a security search engine, and since this is credibility assessment:
-- Inverse reinforcement learning (targeted recommendations)
-- BERT (used by Google for YouTube searches)
2.Check Zenserp API
3.What should be the search bias?
4.MultiLabel genres?

1/12/19

1.Setup the environment to run existing code.
2.Postgres DB
3.WebCred-dev up and running (always gives default values as output; timeouts for some URLs and exceptions as well).
4.Need to fix WebCred-dev.

2/12/19

Issues:

1.Fixing WebCred repository, updating docs.
2.Peer connection closing for some websites (timeout)
3.Database on remote machine, to be made usable on local machine.

3/12/19

Task:

1.Generate word similarity scores, search for the query word in the knowledge graph and return pages ranked by this score (another aspect of credibility for query words/sentences).
2.Gensim Word2Vec/Doc2Vec
3.Which searching algorithm to use? (or can we just brute force?)
4.TF-IDF used by SOLR
5.Can the knowledge graph be a KD-tree (get nearest neighbour words of a new vector)?
6.Reference ontology on security.
7.API for the above, plugin development.
8.Webcred
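On the brute-force versus KD-tree question, a brute-force nearest-neighbour lookup is short enough to serve as a baseline. The sketch below assumes word vectors are already available as a dictionary; the vectors, names, and the choice of Euclidean distance are hypothetical.

```python
import heapq
import math


def nearest_words(query_vec, word_vectors, k=2):
    """Return the k words whose vectors are closest to query_vec.

    Brute force: every vector is compared, so the cost is O(vocabulary size)
    per query. A KD-tree would return the same neighbours with fewer
    comparisons, mainly when the vectors have few dimensions.
    """
    return heapq.nsmallest(
        k, word_vectors, key=lambda word: math.dist(query_vec, word_vectors[word])
    )


# Hypothetical 2-D vectors standing in for Word2Vec/Doc2Vec output.
word_vectors = {
    "malware": (1.0, 0.0),
    "virus": (0.9, 0.1),
    "recipe": (0.0, 1.0),
}
closest = nearest_words((1.0, 0.05), word_vectors, k=2)
```

If the brute-force baseline is fast enough for the vocabulary size at hand, the KD-tree may not be needed.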

4/12/19

1.Doc2Vec or Word2Vec
2.WMD is better for words which have nothing in common.
3.Implemented BFS/DFS for finding similar words in the knowledge graph for a given word or list of words.
4.Research more on which approach would be better. Do we need to implement KD-trees?
5.GloVe/Gensim
6.Find all child urls of a given url.
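The BFS step for finding similar words can be sketched as a depth-limited traversal. The adjacency-dictionary representation, the function name, and the depth limit below are assumptions; the actual knowledge graph structure is not described here.

```python
from collections import deque


def related_terms(graph, start, max_depth=2):
    """Breadth-first search returning (term, depth) pairs near `start`.

    `graph` maps each term to an iterable of neighbouring terms.
    Terms closer to the start are returned first.
    """
    if start not in graph:
        return []

    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append((neighbour, depth + 1))
                queue.append((neighbour, depth + 1))
    return found


# Hypothetical fragment of a security knowledge graph.
graph = {
    "malware": ["virus", "trojan"],
    "virus": ["worm"],
    "worm": ["botnet"],
}
neighbours = related_terms(graph, "malware", max_depth=2)
```

Swapping the deque for a stack (pop from the right) would give the DFS variant, which visits the same terms in a different order.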

5/12/19

1.Looked up page ranking
2.Looked into Facebook FAISS
3.Find advanced ranking algorithms

6/12/19

1.Which model to use ?
2.Implemented sentence similarity using WordNet, based on a paper.
3.TO-DO: Update existing code to Python 3.x
4.Webcred-dev still has some issues to be resolved.

9/12/19

1.Implemented Textranking for keyword extraction
2.Understood POS tagging, lemmatization, stopword removal and vocabulary creation. Built a knowledge graph from scratch for the vocabulary created. Assigned scores to vertices.
3.Ranked keyphrases. The above implementation was completely based on a cited paper.
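The graph-scoring step above can be sketched in the style of TextRank: build a co-occurrence graph over the filtered vocabulary and run a PageRank-like update on the vertices. This is an illustration under assumptions (window size, damping factor, iteration count, and names are all chosen for the example) and is not the implementation from the cited paper.

```python
from collections import defaultdict


def rank_keywords(tokens, window=2, damping=0.85, iterations=30, top_n=3):
    """Rank words by a PageRank-style score on a co-occurrence graph.

    `tokens` is assumed to be already POS-filtered, lemmatized and
    stripped of stopwords.
    """
    # Undirected edge between words that appear within `window` positions.
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        updated = {}
        for word in neighbours:
            # Each neighbour shares its score equally among its own edges.
            incoming = sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated

    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical preprocessed tokens.
tokens = ["firewall", "network", "firewall", "rule", "firewall", "packet"]
top = rank_keywords(tokens, window=1)
```

Keyphrases could then be formed by joining adjacent top-ranked words in the original text.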

10/12/19

1.Implemented basic searching of a keyword in the security ontology.
2.Crawled contents of a website to get similar phrases for the input keyword.
3.Implementing BFS/DFS to search for the most similar URLs in the knowledge graph.
4.Adding keyword search functionality to webcred
5.Need to update check_genre function in Webcred-dev {currently nothing implemented}.
6.Working on WebCred-plugin

11/12/19

1.Working on Plugin.
2.Credibility score not being returned in existing code.
3.Need to add credibility score functionality and integrate with plugin.
4.reCAPTCHA not working; generated new keys (put aside for the moment).
5.Frontend of keyword searching implemented on current webcred server.

13/12/19

Code pushed

1.Keyword-based URL retrieval, ranked by cosine similarity between the query word and the list of URLs in the security.owl file.
2.BFS/DFS: results for both searching algorithms have been stored.
3.URLs are ranked in descending order and returned as a list with the relevance score for each.
4.Working on the generate score function for the plugin; working on the backend.
5.Need to figure out the check_genre function.
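The ranking described above can be sketched as follows: each URL is represented by a bag of terms, scored against the query with cosine similarity, and returned in descending order with its score. The term lists, URLs, and function names are hypothetical; how terms are obtained from security.owl is not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def rank_urls(query, url_terms):
    """Return [(url, score), ...] sorted by descending relevance."""
    query_counts = Counter(query.lower().split())
    scored = [
        (url, cosine(query_counts, Counter(terms))) for url, terms in url_terms.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical terms associated with each URL.
url_terms = {
    "https://example.org/a": ["malware", "analysis"],
    "https://example.org/b": ["cooking", "recipes"],
    "https://example.org/c": ["malware"],
}
ranked = rank_urls("malware", url_terms)
```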

16/12/19

1.WEBCred is assigning null to calculated values. Fixing that.
2.check_genre function needs a response from the ML model. Working on that.

17/12/19

1.ML model
2.Similarity scores still running (taking time)
3.Wrote backend for plugin
4.URL extractor and web crawler fixed
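A URL extractor of the kind mentioned here could look like the sketch below, which collects the child URLs of a page from its HTML. It uses the standard-library parser rather than BeautifulSoup to stay self-contained, and it assumes a "child" is a link on the same host whose path sits under the base path; both choices are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def child_urls(base_url, html):
    """Return absolute URLs on the same host that sit under base_url's path."""
    parser = LinkCollector()
    parser.feed(html)

    base = urlparse(base_url)
    prefix = base.path if base.path.endswith("/") else base.path + "/"

    children = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        same_site = parts.netloc == base.netloc
        if same_site and parts.path.startswith(prefix) and absolute not in children:
            children.append(absolute)
    return children


# Hypothetical page content.
page = (
    '<a href="intro.html">Intro</a>'
    '<a href="/docs/api/">API</a>'
    '<a href="https://other.org/x">External</a>'
    '<a href="/about">About</a>'
)
links = child_urls("https://example.org/docs/", page)
```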

19/12/19

1.Implementing ML model; dataset for training is being prepared
2.Similarity score updated and fixed; CSV files now make sense

20/12/19

1.WMD.py was producing redundant results; issue fixed. Code has been optimized and made more efficient.
2.Need to handle exceptions for URLs that return error status codes.
3.Plugin is now working (please run it on Google Chrome due to a CORS error in Firefox).
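One way to handle URLs that return error status codes is to catch the HTTP error separately from network failures, so the crawler can record the status and move on. This is a sketch with a hypothetical function name and return convention; the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10, opener=urllib.request.urlopen):
    """Return (status_code, body) for a URL without raising.

    - Success: (status, bytes)
    - HTTP error such as 404 or 500: (status, None)
    - Timeout, DNS failure, connection reset: (None, None)
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as exc:
        # Must be caught before OSError: HTTPError is a subclass of it.
        return exc.code, None
    except OSError:
        # Covers URLError, socket timeouts and dropped connections.
        return None, None
```

A caller can then skip or log any URL whose body is None instead of aborting the whole batch.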

23/12/19

1.Plugin needs to be linked to original Webcred score script.
2.Generated CSV files for 700 URLs in batches. It takes about 1 hour for 150 URLs (this can be sped up, since multiple tasks were running on the machine).
3.Complexity of WMD.py has been reduced by a factor of O(len(keywordlist)).

24/12/19

1.TF-IDF can be smoothed.
2.Apache SOLR (look up)
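As an example of TF-IDF smoothing, one common variant adds one to both the document count and the document frequency, so that idf(t) = ln((1 + N) / (1 + df(t))) + 1 and no term gets a zero or undefined weight. The sketch below applies that variant to tokenized documents; the function name and sample documents are assumptions, and other smoothing schemes exist.

```python
import math
from collections import Counter


def smoothed_tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts for tf and the add-one smoothed idf
    ln((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append(
            {
                term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                for term, count in term_counts.items()
            }
        )
    return weights


# Hypothetical tokenized documents.
docs = [["malware", "attack"], ["malware", "patch"]]
weights = smoothed_tfidf(docs)
```

A term that appears in every document still keeps a weight equal to its count, while rarer terms are weighted higher.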

25/12/19

1.WMD.py made more efficient: it only parses relevant webpage content and excludes blacklisted elements (style, headers and similar HTML tags).
2.GBDT to be run using our generated dataset.
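The blacklist filtering in WMD.py is not shown here, but the idea can be sketched with the standard-library HTML parser: text is collected only while the parser is outside any blacklisted element. The blacklist contents and names below are assumptions, and the project may do this with BeautifulSoup instead.

```python
from html.parser import HTMLParser

# Hypothetical blacklist. Only tags with a closing tag are listed,
# so the depth counter below always returns to zero.
BLACKLIST = {"script", "style", "head", "noscript"}


class VisibleText(HTMLParser):
    """Collects text that is not inside a blacklisted element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLACKLIST:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BLACKLIST and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the page text with blacklisted elements removed."""
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Malware report</p><script>var x = 1;</script></body></html>"
)
content = visible_text(sample)
```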

lalitsanagavarapu/Domain-Credibility
