ramanpreetSinghKhinda/CSE_535_Multilingual_Search_System

 
 

Information Retrieval
Multilingual Search System for Social Network
CSE 535 - Fall 2015


Goal

The goal of this project is to build a multilingual faceted search system, including a front end that allows users to search and browse multilingual data based on various criteria: topic, location, person, etc.

Projected Design

[Screenshots: Search_UI_Part_1, Search_UI_Part_2]

Highlights of our Search System

  • A pure multilingual faceted search system
  • Handles queries in five languages: English, Russian, German, French and Arabic
  • Built on a Twitter corpus of roughly 100,000 tweets
  • Data spans more than 120 countries

For the detailed design, see the project report:

https://github.com/ramanpreet1990/CSE_535_Multilingual_Search_System/blob/master/documents/report.pdf

Components that we implemented

1. Faceted Search

This option uses the faceted search capability provided by Solr to support several kinds of drill-down. Facets include people, topics, locations, etc.
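As a rough illustration of how a drill-down request to Solr can be assembled, here is a minimal Python sketch. The request parameters (`facet`, `facet.field`, `facet.mincount`, `fq`) are standard Solr parameters, but the core URL and the field names (`lang`, `location`) are assumptions made for this example and are not taken from the project code.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr /select URL with field faceting and optional drill-down filters."""
    params = [
        ("q", text),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),  # hide facet values with no matching documents
    ]
    # One facet.field parameter per facet shown in the UI.
    for field in facet_fields:
        params.append(("facet.field", field))
    # Each selected facet value becomes a filter query (fq) that narrows the results.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


def parse_facet_counts(flat):
    """Convert Solr's default flat [term, count, term, count, ...] list to a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical core URL and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/tweets",
    "refugees",
    facet_fields=["lang", "location"],
    filters={"lang": "de"},
)
counts = parse_facet_counts(["en", 5, "de", 2])
```

Selecting a facet value in the front end would then amount to adding one more entry to `filters` and re-running the query.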

2. Cross-Document Analytics

This option involves computing various analytics that provide insight into the data.

Examples include: volume of tweets by region/topic/hashtag, sentiment analysis, analytics illustrating cultural differences, etc.
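The volume-style analytics above can be sketched as a simple counting step. This is only an illustration: the tweet records and the field names (`lang`, `country`, `hashtags`) are made up for the example and do not reflect the project's actual schema.

```python
from collections import Counter


def volume_by(tweets, key):
    """Count tweets per value of `key`; list-valued fields count once per element."""
    counts = Counter()
    for tweet in tweets:
        value = tweet.get(key)
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item:  # skip missing or empty values
                counts[item] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    {"lang": "en", "country": "US", "hashtags": ["cop21", "climate"]},
    {"lang": "fr", "country": "FR", "hashtags": ["cop21"]},
    {"lang": "en", "country": "GB", "hashtags": []},
]

by_lang = volume_by(sample, "lang")
by_hashtag = volume_by(sample, "hashtags")
```

The same counts could also be obtained from Solr facet counts; the in-memory version is shown only because it is easy to follow.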

3. Cross-Lingual Retrieval/Analysis

In this option, we demonstrate cross-lingual capabilities. This can take many forms; one example is cross-lingual querying combined with automatic translation of the resulting foreign-language snippets.

For example, a search for a particular individual, place or organization should run simultaneously in multiple languages. This is achieved by automatically tagging and normalizing entities across languages.
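One way to picture entity normalization is a lookup table that maps every surface form of an entity to the same set of per-language query clauses. The sketch below is an assumption-laden illustration: the alias table is hand-written, and the per-language field names (`text_en`, `text_de`, ...) are hypothetical. The project itself may build this mapping differently, for example through its translation API.

```python
# Hypothetical alias table: canonical entity -> surface form per language code.
ENTITY_ALIASES = {
    "Germany": {
        "en": "Germany",
        "de": "Deutschland",
        "fr": "Allemagne",
        "ru": "Германия",
        "ar": "ألمانيا",
    },
}


def expand_query(term, aliases=ENTITY_ALIASES):
    """Expand a known entity into an OR query over per-language fields."""
    needle = term.casefold()
    for forms in aliases.values():
        if needle in {form.casefold() for form in forms.values()}:
            clauses = [
                f'text_{lang}:"{form}"'  # text_<lang> fields are assumed names
                for lang, form in sorted(forms.items())
            ]
            return " OR ".join(clauses)
    # Unknown terms fall back to a plain phrase query on the default field.
    return f'"{term}"'


expanded = expand_query("deutschland")
```

With this table, a query typed in any of the five languages expands to the same multilingual query string.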

4. Ranking tweets

This option involves devising a novel ranking algorithm for tweets that balances recency against the importance of the content when presenting results. It could also take into account the popularity of a tweet, the influence of its author, the location of the user, their interests, etc.
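To make the trade-off between recency and importance concrete, here is a minimal scoring sketch. It is not the ranking algorithm used in this project: the exponential half-life decay, the log-damped popularity and influence terms, and all weights are assumptions chosen for illustration.

```python
import math
from datetime import datetime, timedelta, timezone


def tweet_score(relevance, created_at, retweets, followers, now,
                half_life_hours=24.0, w_popularity=0.3, w_influence=0.1):
    """Combine text relevance with recency decay, popularity and author influence."""
    age_hours = max((now - created_at).total_seconds() / 3600.0, 0.0)
    # Recency halves every `half_life_hours` (assumed decay shape).
    recency = 0.5 ** (age_hours / half_life_hours)
    # log1p damps very large counts so viral tweets do not dominate completely.
    popularity = math.log1p(retweets)
    influence = math.log1p(followers)
    boost = 1.0 + w_popularity * popularity + w_influence * influence
    return relevance * recency * boost


now = datetime(2015, 12, 1, tzinfo=timezone.utc)
fresh = tweet_score(1.0, now, 0, 0, now)
day_old = tweet_score(1.0, now - timedelta(hours=24), 0, 0, now)
popular_day_old = tweet_score(1.0, now - timedelta(hours=24), 500, 10000, now)
```

Tuning `half_life_hours` and the two weights shifts the balance between showing the newest tweets and showing the most widely shared ones.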

5. Graphical Analysis

This option involves inferring a graph structure from the tweets, based on the entities mentioned, the topics discussed, etc. Graph structures (or relationships between tweets) could also be inferred by connecting the topics reflected in the tweets.
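A simple instance of such a structure is a co-occurrence graph, where two hashtags (or entities) are linked whenever they appear in the same tweet. The sketch below assumes each tweet has already been reduced to a list of tags; the sample data is invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(tagged_tweets):
    """Build an undirected weighted graph: edge weight = number of shared tweets."""
    graph = defaultdict(lambda: defaultdict(int))
    for tags in tagged_tweets:
        # De-duplicate tags within a tweet, then link every unordered pair.
        for a, b in combinations(sorted(set(tags)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


# Hypothetical tag lists, one per tweet.
tweets = [
    ["paris", "cop21", "climate"],
    ["cop21", "paris"],
    ["climate"],
]
graph = cooccurrence_graph(tweets)
```

Edge weights can then drive a visualization, with heavier edges drawn thicker or closer together.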

References

We drew on the following sources to design this search system:

  1. Introduction to Information Retrieval
  2. Course by Oresoft LWC
  3. Apache Solr Tutorials
  4. Apache Solr Wiki
  5. Apache Solr Reference Guide

Credits

This project uses the following third-party APIs. We are grateful for their contribution:

  1. Language Detection API of detectlanguage.com
  2. Microsoft Bing Language Translation API

We are also grateful to Professor Rohini K. Srihari and TAs James Clay, Nikhil Londhe, Chuishi Meng and Ruhan Sa for their continuous support throughout the course (CSE 535), which helped us learn the skills of Information Retrieval and build a Multilingual Search System.

Contributors

Ramanpreet Singh Khinda

Alexander Simeonov, Akash Desai, Riaz Munshi and Karanjeet Singh

License

Copyright {2016} {Ramanpreet Singh Khinda rkhinda@buffalo.edu, Alexander Simeonov agsimeon@buffalo.edu, Akash Desai akash101192@gmail.com, Riaz Munshi riazmuns@buffalo.edu and Karanjeet Singh karanjee@buffalo.edu}

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Languages

  • JavaScript 50.9%
  • Hack 14.0%
  • PHP 10.7%
  • CSS 10.4%
  • HTML 5.1%
  • Python 4.7%
  • Java 4.2%