def question_two():
    '''
    Question 2 (3 pts)

    In Homework 1, you saw Algorithm ER for generating random graphs and
    reasoned analytically about the properties of the ER graphs it generates.
    Consider the simple modification of the algorithm to generate random
    directed graphs: For every ordered pair of distinct nodes (i,j), the
    modified algorithm adds the directed edge from i to j with probability p.

    For this question, your task is to consider the shape of the in-degree
    distribution for an ER graph and compare its shape to that of the physics
    citation graph. In the homework, we considered the probability of a
    specific in-degree, k, for a single node. Now, we are interested in the
    in-degree distribution for the entire ER graph. To determine the shape of
    this distribution, you are welcome to compute several examples of
    in-degree distributions or determine the shape mathematically.

    Once you have determined the shape of the in-degree distributions for ER
    graphs, compare the shape of this distribution to the shape of the
    in-degree distribution for the citation graph. When answering this
    question, make sure to address the following points:

    - Is the expected in-degree the same for every node in an ER graph?
      Please answer yes or no and include a short explanation for your answer.
    - What does the in-degree distribution for an ER graph look like? You may
      either provide a plot (linear or log/log) of the degree distribution
      for a small value of n or a short written description of the shape of
      the distribution.
    - Does the shape of the in-degree distribution plot for ER look similar
      to the shape of the in-degree distribution for the citation graph?
      Provide a short explanation of the similarities or differences. Focus
      on comparing the shape of the two plots as discussed in the class page
      on "Creating, formatting, and comparing plots".
    '''

    '''
    Answer:

    1 - Yes, the expected in-degree is the same for every node. Each of the
        other n - 1 nodes links to a given node independently with the same
        probability p, so every node has expected in-degree (n - 1) * p.
        For n = 1000 and p = 0.5 this is 499.5, i.e. roughly 500 (about
        n * p).

    2 - See the plot in question2.png: it is a binomial distribution.

    3 - No, the shapes of the two plots look completely different. The ER
        distribution is binomial: frequency peaks at the expected in-degree
        and falls off at both higher and lower in-degrees, and the observed
        in-degrees stay within a narrow range around that peak. In contrast,
        the citation graph's distribution is a decaying curve: low
        in-degrees have a high frequency, and frequency falls as in-degree
        increases. It also spans a much wider range of in-degree values than
        the ER graph.
    '''
    # in-degree distribution of a directed ER graph with n = 1000, p = 0.5
    er_graph = degree_graphs.in_degree_distribution(
        degree_graphs.make_graph_prob(1000, 0.5))

    # normalize so the frequencies sum to 1
    factor = 1.0 / sum(er_graph.values())
    for k in er_graph:
        er_graph[k] *= factor

    x = list(er_graph.keys())
    y = list(er_graph.values())

    plt.plot(x, y, 'go')
    plt.yscale('log')
    plt.xlim(300, 700)
    plt.xlabel('In-degree')
    plt.ylabel('Frequency (log)')
    plt.title('ER normalized in-degree distribution graph')
    plt.tight_layout()
    plt.show()

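
# A minimal sketch, not part of the original assignment code: the answer to
# Question 2 states that the ER in-degree distribution is binomial. Assuming
# each of the other num_nodes - 1 nodes links to a given node independently
# with probability prob, the theoretical distribution can be computed
# directly and compared with the simulated one. The function name
# binomial_in_degree_pmf is hypothetical, chosen only for illustration.
import math


def binomial_in_degree_pmf(num_nodes, prob):
    '''
    Return a dictionary {k: P(in-degree == k)} for a single node of a
    directed ER graph, under the Binomial(num_nodes - 1, prob) assumption.
    Requires 0 < prob < 1.
    '''
    if not 0.0 < prob < 1.0:
        raise ValueError('prob must be strictly between 0 and 1')
    trials = num_nodes - 1
    pmf = {}
    for k in range(trials + 1):
        # work in log space so large binomial coefficients do not overflow
        log_term = (math.lgamma(trials + 1)
                    - math.lgamma(k + 1)
                    - math.lgamma(trials - k + 1)
                    + k * math.log(prob)
                    + (trials - k) * math.log(1.0 - prob))
        pmf[k] = math.exp(log_term)
    return pmf

# Possible usage: overlay binomial_in_degree_pmf(1000, 0.5) on the simulated
# points; under the stated assumption its mean is (1000 - 1) * 0.5 = 499.5.
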
def question_one():
    '''
    Question 1 (4 pts)

    For this question, your task is to load a provided citation graph for
    27,770 high energy physics theory papers. This graph has 352,768 edges.
    You should use the following code to load the citation graph as a
    dictionary. (For an extra challenge, you are welcome to write your own
    function to create the citation graph by parsing this text
    representation of the citation graph.)

    Your task for this question is to compute the in-degree distribution for
    this citation graph. Once you have computed this distribution, you
    should normalize the distribution (make the values in the dictionary sum
    to one) and then compute a log/log plot of the points in this normalized
    distribution.
    '''
    CITATION_URL = "http://storage.googleapis.com/codeskulptor-alg/alg_phys-cite.txt"

    # load the citation graph from the web using algo_load_graph,
    # then compute its in-degree distribution via degree_graphs
    citation_graph = algo_load_graph.load_graph(CITATION_URL)
    in_degree_dict = degree_graphs.in_degree_distribution(citation_graph)

    # normalize so the dictionary values sum to 1:
    # each count is divided by the sum of all counts
    factor = 1.0 / sum(in_degree_dict.values())
    for k in in_degree_dict:
        in_degree_dict[k] *= factor

    # store the normalized distribution in a csv file
    with open('loglog.csv', 'wb') as f:  # Just use 'w' mode in 3.x
        w = csv.DictWriter(f, in_degree_dict.keys())
        w.writeheader()
        w.writerow(in_degree_dict)

    x = np.array(list(in_degree_dict.keys()))
    y = np.array(list(in_degree_dict.values()))

    plt.loglog(x, y, 'ro')
    plt.xlim(1, 2**14)
    plt.xlabel('In-degree')
    plt.ylabel('Frequency')
    plt.title('Normalized in-degree distribution graph (log-log) \n'
              'for citations from scientific papers')
    plt.show()

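
# A minimal sketch, not part of the original assignment code: Question 1 asks
# for the distribution to be normalized so that its values sum to one, and
# the same scaling is repeated in each question_* function. The helper name
# normalize_distribution is hypothetical; it returns a new dictionary rather
# than modifying its argument in place.
def normalize_distribution(distribution):
    '''
    Return a copy of distribution (a dict mapping in-degree to count)
    whose values are scaled to sum to 1. An empty or all-zero input
    yields an empty dictionary.
    '''
    total = float(sum(distribution.values()))
    if total == 0.0:
        return {}
    return dict((degree, count / total)
                for degree, count in distribution.items())

# Possible usage:
#   normalized = normalize_distribution({0: 2, 1: 6})
#   normalized[0] is 0.25 and normalized[1] is 0.75
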
def question_four():
    '''
    Question 4 (3 pts)

    Your task for this question is to implement the DPA algorithm, compute a
    DPA graph using the values from Question 3, and then plot the in-degree
    distribution for this DPA graph. Creating an efficient implementation of
    the DPA algorithm from scratch is surprisingly tricky. The key issue in
    implementing the algorithm is to avoid iterating through every node in
    the graph when executing Line 6. Using a loop to implement Line 6 leads
    to implementations that require on the order of 30 minutes in desktop
    Python to create a DPA graph with 28000 nodes.

    To avoid this bottleneck, you are welcome to use this provided code that
    implements a DPATrial class. The class has two methods:

    - __init__(num_nodes): Create a DPATrial object corresponding to a
      complete graph with num_nodes nodes.
    - run_trial(num_nodes): Runs num_nodes number of DPA trials (lines 4-6).
      Returns a set of the nodes, computed with the correct probabilities,
      that are neighbors of the new node.

    In the provided code, the DPATrial class maintains a list of node numbers
    that contains multiple instances of the same node number. If the number
    of instances of each node number is maintained in the same ratio as the
    desired probabilities, a call to random.choice() produces a random node
    number with the desired probability.

    Using this provided code, implementing the DPA algorithm is fairly simple
    and leads to an efficient implementation of the algorithm. For a
    challenge, you are also welcome to develop your own implementation of the
    DPA algorithm that does not use this provided code.

    Once you have created a DPA graph of the appropriate size, compute a
    (normalized) log/log plot of the points in the graph's in-degree
    distribution.
    '''
    # DPA graph with n = 27,770 nodes and m = 13 (values from Question 3)
    dpa_graph = dpa_trial.make_dpa_graph(27770, 13)
    in_degree_dict = degree_graphs.in_degree_distribution(dpa_graph)

    # normalize so the dictionary values sum to 1
    factor = 1.0 / sum(in_degree_dict.values())
    for k in in_degree_dict:
        in_degree_dict[k] *= factor

    x = np.array(list(in_degree_dict.keys()))
    y = np.array(list(in_degree_dict.values()))

    plt.loglog(x, y, 'bo')
    plt.xlim(1, 2**14)
    plt.xlabel('In-degree')
    plt.ylabel('Frequency')
    plt.title('DPA normalized in-degree distribution graph (log-log) \n'
              'n = 27,770, m = 13')
    plt.show()
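
# A minimal sketch, not part of the original assignment code and not the
# provided DPATrial class or the dpa_trial.make_dpa_graph used above: it
# illustrates the idea described in the Question 4 docstring, where a list
# holds repeated node numbers so that random.choice() picks each node with
# the desired probability. The assumption here is that a node should be
# chosen with probability proportional to (in-degree + 1). The name
# sketch_dpa_graph and the seed parameter are hypothetical.
import random


def sketch_dpa_graph(total_nodes, initial_nodes, seed=None):
    '''
    Return a dict mapping each node to the set of nodes it points to.
    Starts from a complete directed graph on initial_nodes nodes, then adds
    the remaining nodes one at a time, each drawing initial_nodes neighbors
    (duplicates collapse, so a new node may get fewer distinct neighbors).
    '''
    rng = random.Random(seed)
    graph = dict((node, set(other for other in range(initial_nodes)
                            if other != node))
                 for node in range(initial_nodes))
    # each initial node has in-degree initial_nodes - 1, so it appears
    # (in-degree + 1) = initial_nodes times in the pool
    pool = [node for node in range(initial_nodes)
            for _ in range(initial_nodes)]
    for new_node in range(initial_nodes, total_nodes):
        neighbors = set(rng.choice(pool) for _ in range(initial_nodes))
        graph[new_node] = neighbors
        # one entry for the new node itself, plus one per new in-edge
        pool.append(new_node)
        pool.extend(neighbors)
    return graph

# Possible usage: sketch_dpa_graph(27770, 13) builds a graph of the size
# used in question_four; the result can be passed to any function that
# accepts a dict-of-sets digraph.
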