Motivations

The codebase contained in this repo is part of a problem set specified by FanXchange.

Requirements

The code base was constrained to run using python 2.7 using the standard python libraries.

Directory structure

The directory structure of the project is as follows:

fanxchange_text_analysis/
├── README.md
└── app
    ├── __init__.py
    ├── base.py
    ├── data
    │   └── festival.txt
    ├── festival_names.py
    ├── tests.py
    └── word_counter.py

Code Logic

Unittest

To run the unit tests, cd into the app/ folder, and run:

$ python tests.py

Task 1

For the successful completion of the task, the code implemented should get the total number of words in a test file as well as the 10 most common words and the frequency of occurences.

To this end, the code for Task 1 is in the word_counter.py file. In order to fulfill the task requirements, the following code needs to be run:

$ python word_counter.py <file_path>

If the file path has not been specified, the .\data\festival.txt file will be used instead.

The word_counter module contains the WordCounter class which inherits the Base class. This is a shared class that contains a file reader and stores the text content in an instance variable. The reason for the Base class is to be in line with the DRY principle as both tasks requires a file to be read.

The WordCounter class contains two main functions:

The first one is the get_total_words function which processes the text content and gets the count of words. This is a straighforward function and is self explanatory.
The second function is get_most_common_words which takes in an optional limit variable that was added for testability. In order to find the most common words, the function uses the python inbuilt collection.Counter module. As it's a native, it's optimized as well as provides a convenience function to return the common words encountered.

Task 2

For the successful completion of the task, the code implemented should find the unique festival names based on certain conditions described in the ProblemDefinition.md file.

To this end, the code for Task 2 is in the festival_names file. In order to fulfill the task requirements, the following code needs to be run:

$ python festival_names.py <file_path>

If the file path has not been specified, the .\data\festival.txt file will be used instead.

The festival_names module contains the FestivalNames class which inherits the Base class. The purpose of the base class has been previously explained.

The FestivalNames contains several methods which has been explained below.

`convert_text_content_to_first_word_dict`

Several assumptions have been made about the festival names which includes:

They must contain at least the first word in a sentence, which is tagged as the pseudo festival name.
The psuedo festival name must occur as the first word in at least 2 sentences.

Based on this assumption, the method creates a festival dictionary containing the pseudo festival name which is the first word in the sentence as a key, and the sentences as a list of values.

`filter_non_repeating_names`

As a unique festival name must occur in at least 2 sentences, this method filters out pseudo festival names keys in the festival dictionary where the value containing the sentences only have one item in the list.

`build_festival_names`

Now that the festival dictionary only has pseudo names that have multiple occurences, the actual festival names need to be determined. The unique festival name is assumed to be the first n characters between 2 strings that are equal. Thus, this method calls the get_sentence_commonality method and passes in only 2 strings as its the minimum number of strings required to find the real festival name.

`get_sentence_commonality`

Given 2 sentences, returns the festival name which is taken to be the first set of common characters between 2 strings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

app

app

.gitignore

.gitignore

ProblemDefinition.md

ProblemDefinition.md

README.md

README.md

Repository files navigation

Motivations

Requirements

Directory structure

Code Logic

Unittest

Task 1

Task 2

`convert_text_content_to_first_word_dict`

`filter_non_repeating_names`

`build_festival_names`

`get_sentence_commonality`

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
app		app
.gitignore		.gitignore
ProblemDefinition.md		ProblemDefinition.md
README.md		README.md

gitumarkk/fanxchange_text_analysis

Folders and files

Latest commit

History

Repository files navigation

Motivations

Requirements

Directory structure

Code Logic

Unittest

Task 1

Task 2

convert_text_content_to_first_word_dict

filter_non_repeating_names

build_festival_names

get_sentence_commonality

About

Resources

Stars

Watchers

Forks

Languages

`convert_text_content_to_first_word_dict`

`filter_non_repeating_names`

`build_festival_names`

`get_sentence_commonality`