
simon-das/big-data-and-hadoop


Assignment-1

Problem Description:

You will be given a variable number of TXT files containing ASCII sentences. A Python program must be developed that accepts all of these file names as arguments and produces the combined word counts as a single output.

Output can be shown as:

{'am': 1, 'python': 1, 'affraid': 1, 'want': 1, 'learn': 1, 'of': 1, 'to': 1, 'I': 2, 'python.': 1, 'But': 1}

Instructions to be followed:

  1. Testing procedure: <yourpythoncode.py> …
  2. The Python binary shall be version 3.x
  3. Files can be in your current directory or in a different directory
  4. You need to show the running time of your program in seconds
  5. You can't use any external libraries; only the basic Python standard library is allowed.
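A minimal sketch of one possible shape for such a program, using only the standard library as required; the whitespace-based word splitting and the single merged dictionary output are assumptions consistent with the sample output above.

```python
import sys
import time
from collections import Counter

def count_words(paths):
    """Merge whitespace-separated word counts across all given files."""
    counts = Counter()
    for path in paths:
        with open(path, "r", encoding="ascii") as handle:
            for line in handle:
                counts.update(line.split())
    return counts

if __name__ == "__main__":
    start = time.perf_counter()
    result = count_words(sys.argv[1:])   # all file names passed as arguments
    print(dict(result))                  # e.g. {'I': 2, 'am': 1, ...}
    elapsed = time.perf_counter() - start
    print(f"Running time: {elapsed:.4f} seconds")
```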

Assignment-2

Problem Description:

You have been given a UNIX passwd file containing all the users' details. You need to develop a shell script that extracts only the duplicate usernames and the unique shells used by those users, and prints them as output. In addition, the passwd file must be passed to the shell script as input.

Output can be shown as:

Duplicate users are as follows:

adm
halt
games
[…continue printing usernames in same fashion above for more duplicate users]

Unique shell used among all the duplicate users above:

/bin/sh
/bin/bash
/bin/csh
[…continue printing shell names in same fashion above for more cases]

Instructions to be followed:

  1. Testing procedure: <yourpythoncode.py> …
  2. The Python binary shall be version 3.x
  3. Files can be in your current directory or in a different directory
  4. You need to show the running time of your program in seconds
  5. You can't use any external libraries; only the basic Python standard library is allowed.
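Although the assignment asks for a shell script, the testing instructions above reference a Python 3 program, so here is a minimal Python sketch of the required logic. It assumes the passwd path is passed as a command-line argument and the standard colon-separated passwd layout (username first, login shell last); the output wording mirrors the sample above.

```python
import sys
from collections import Counter

def analyse_passwd(path):
    """Return duplicate usernames and the unique shells those users use."""
    entries = []                              # (username, login shell) per record
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split(":")
            entries.append((fields[0], fields[-1]))

    counts = Counter(name for name, _ in entries)
    duplicates = {name for name, n in counts.items() if n > 1}
    shells = {shell for name, shell in entries if name in duplicates}
    return sorted(duplicates), sorted(shells)

if __name__ == "__main__":
    dupes, shells = analyse_passwd(sys.argv[1])   # passwd file passed as an argument
    print("Duplicate users are as follows:")
    print("\n".join(dupes))
    print()
    print("Unique shell used among all the duplicate users above:")
    print("\n".join(shells))
```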

Assignment-3

Problem Description:

The Covid-19 pandemic is ongoing worldwide, and almost every corner of the world has been impacted. You are given a Covid-19 dataset collected from the internet on which to perform certain analyses.
All the raw responses are kept in a file named Covid_Analysis_DataSet.csv.
The detailed schema is self-explanatory.

The following analytics shall be done:

For each Month, Year, and countryterritoryCode, derive the Infection Rate and Death Rate.

Note:

Infection Rate = (cases / TestPerformed) * 100%
Death Rate = (deaths / cases) * 100%

Output can be shown as:

Month;Year;CountryCode;InfectionRate;DeathRate
September;2020;AFG;3;10
October;2020;AFG;5;12

Instructions to be followed:

  1. Consider that the raw capture file shall be loaded into the HDFS /ASSIGNMENT directory hourly by an external program, so you will get the mentioned file in the designated location periodically.
  2. Process all the outputs in Apache Spark in standalone cluster mode (see the sketch after this list)
  3. Store all the results in HDFS /OUTPUT in the mentioned CSV format (;-separated file)
  4. Schedule the Spark job to run every hour at 15 minutes past the hour (see the crontab example after this list)
  5. You need to submit the following files in ZIP format, with the naming convention suggested by earlier assignments in this course:
    a. Python Program
    b. Shell Script
    c. Crontab Configuration Text File
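A short PySpark job can cover instructions 2 and 3. The sketch below is only illustrative: the input column names (month, year, countryterritoryCode, cases, deaths, TestPerformed) are assumptions read off the problem statement, the HDFS paths mirror the instructions, and converting numeric months to month names (as in the sample output) is left out.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("covid-rates").getOrCreate()

# Assumed HDFS location from the instructions; column names are guesses
# based on the problem statement.
raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("hdfs:///ASSIGNMENT/Covid_Analysis_DataSet.csv"))

rates = (raw.groupBy("month", "year", "countryterritoryCode")
            .agg(F.sum("cases").alias("cases"),
                 F.sum("deaths").alias("deaths"),
                 F.sum("TestPerformed").alias("tests"))
            .withColumn("InfectionRate", F.round(F.col("cases") / F.col("tests") * 100, 2))
            .withColumn("DeathRate", F.round(F.col("deaths") / F.col("cases") * 100, 2))
            .select(F.col("month").alias("Month"),
                    F.col("year").alias("Year"),
                    F.col("countryterritoryCode").alias("CountryCode"),
                    "InfectionRate", "DeathRate"))

# ;-separated CSV output in HDFS /OUTPUT, as required by instruction 3.
(rates.write
      .mode("overwrite")
      .option("header", True)
      .option("sep", ";")
      .csv("hdfs:///OUTPUT"))

spark.stop()
```

Instruction 4 can be expressed as a single crontab line; the wrapper script and log paths below are hypothetical placeholders for whatever shell script invokes spark-submit:

```
# Run the Spark job at 15 minutes past every hour
15 * * * * /path/to/spark_covid_job.sh >> /path/to/spark_covid_job.log 2>&1
```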
