Operation Restore Uber Account

Disclaimer: this crawler was operational as of 05/18/2020

Objective

The goal of this project is to create a crawler to scrap LinkedIn for all Uber employees in San Francisco and São Paulo and mass email them a letter hoping that someone, somewhere would be customer oriented enough to reactivate my account.

Background

The idea for this project emerged when my Uber account was disabled, and after a long time trying to restore it, I became very frustrated and finally gave up trying fix it via customer service. If you are interested in knowing the whole story you should read the email I sent them. Long story short, my account started to get blocked for no reason. Every time I would reach their customer service and they would always say that it was just a glitch and would promptly reactivate it. This went on and on until one day, without any explanation, they permanently disabled my account and stop responding me.

At this point, the smart move would have been to just change my phone number and get a new account. But after going through all the headache and having spent a lot of time emailing them back and forth, I refused to bow, and instead, had the "brilliant idea" to spend even more of my time on this issue and came up with this project.

Dataset

Uber doesn't really have a catalog or an API I could just pull their employees emails from so I had to get the data myself. I noticed by exchanging emails with Uber that their emails had two different formats, as shown in the example below:

Rafael Gama

rafael@uber.com

rafaelg@uber.com

Now I that knew Uber's email formats, all I needed was to get a list of employees first and last names and construct the emails. In order to do that, I developed a crawler that goes to LinkedIn's people search section, selects a city and a company, and scrape all results.

Crawler Development

LinkedIn proved to be a worthy adversary and a particularly hard site to crawl. After analyzing the task, I came up with the following set challenges and solutions:

LinkedIn requires JavaScript to render content on the page:

I am familiar with both selenium and requests but because of this peculiarity, I couldn't crawl the website using requests so I had to settle for selenium
LinkedIn server reject a large series of requests from the same IP address in a given time period and suspends your ability to search for a few days. Also, the page rendering time would greatly vary between requests:

To avoid using long sleep times, I tried to use Selenium's Wait module to dynamically wait for the page to load but had little success. The code was unstable and breaking all the time so I was forced to increase the sleep times to solve both problems.
LinkedIn is on the lookout for bot behavior patterns:

To reduce my chances of being flagged, I created and applied a set of functions that try to simulate human behavior by scrolling up and down and randomly clicking on profiles.

Data Preparation

Number of results in each city:

San Francisco: 940
São Paulo: 1000

However, not all results were relevant for my purposes as many of my results no longer work at Uber among other things. So I went on to the process of cleaning and exploring my dataset to come up with a more relevant employee list that would increase my chances of being helped while saving resources. Because the dataset is not complex, there wasn't much to do so I wrote a little script to execute the following actions:

Search and remove duplicates
Remove all employees not currently employed at Uber
Remove all employees showing "LinkedIn Member" as a name
Remove all employees working in the capacity of a driver/motorista (Portuguese)
Remove all employees working for Uber Eats and Uber Freight, UberAir, UberAIR / UberElevate and Uber Works

After cleaning the data, I was left with 764 employees (779 using pandas) from San Francisco and 585 employees for São Paulo (692 using pandas). The discrepancy between the results using my script and pandas got me a little intrigued but I decided to investigate that at a later time. Then, my next step was to develop an email constructor that returns 2 emails per name, one in each format.

Execution

Finally, all that was left to do was to integrate the email constructor in the email sender script and start bombarding Uber with emails, right?

Awwww, if it was only that easy...There is still one challenge left to overcome: Servers restrictions at both ends.

Both servers can flag my emails as spam
LinkedIn's server may perceive my emails as a security threat
My provider (Gmail) has email sending limits of their own

Because the content of my emails is exactly the same aside the person being addressed, they are in considerable risk of being flagged as spam. At the same time, if the emailing frequency is too high, it may trigger a security alert. The million dollar question is, how much is too much? Because I am not a expert in servers, I decided to be cautious and randomly space out the time between emails and send them in batches of 20 emails every 10 minutes, until I hit 360 sent emails, which was target I stipulated per day.

Results

The outcome of the project couldn't have been better. I ended up getting my account restored in the very first round of emails!!!

In the end, 168 emails were sent but only 31 reached its destination. I was expecting the conversion rate to be twice of that because I was only using 2 email formats per employee. After further investigation, I discovered that Uber uses a few other email formats so that explains why most of them were not delivered.

So that's it! Mission accomplished!!! My account is back in business so let's see if will remais this way now...

Potential Improvements

Since this is a pet project, the first and foremost goal was to have it up a running as fast as possible so I could actually try to fix my problem and restore my Uber account. However, while I and I was working on the project, I had many ideas on what can be done to improve the crawler's performance, scalability and robustness. These are a few of those ideas:

Find a way to reduce the number and duration of sleep times
Parallelize code in order to run multiple collections simultaneously
Rotate between different IPs to reduce the chance of getting flagged
Find out the reason for the discrepancy between the results generated by using my data cleaning function and pandas.
Refactor code to use a workflow management system, to become more robust to failures and be able to resume pipelines from failing points

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
crawler		crawler
data_cleaned		data_cleaned
data_cleaning		data_cleaning
data_raw		data_raw
email_factory		email_factory
email_sender		email_sender
images		images
notebooks		notebooks
utils		utils
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
env_variables.env		env_variables.env
main.py		main.py
requirements.txt		requirements.txt
uber.txt		uber.txt

rafaelcgama/restore-uber

Folders and files

Latest commit

History

Repository files navigation

Operation Restore Uber Account

Disclaimer: this crawler was operational as of 05/18/2020

Objective

Background

Dataset

Crawler Development

LinkedIn requires JavaScript to render content on the page:

LinkedIn server reject a large series of requests from the same IP address in a given time period and suspends your ability to search for a few days. Also, the page rendering time would greatly vary between requests:

LinkedIn is on the lookout for bot behavior patterns:

Data Preparation

Execution

Results

Potential Improvements

About

Resources

Stars

Watchers

Forks

Languages