# def iterateExpertClusters(startingDay=datetime(2011, 3, 19), endingDay=datetime(2011, 4, 7)):
def iterateExpertClusters(startingDay=datetime(2011, 3, 19), endingDay=datetime(2011, 3, 30)):
    # Walk the LSH cluster files day by day, yielding (time, cluster) pairs.
    while startingDay <= endingDay:
        for line in FileIO.iterateJsonFromFile(experts_twitter_stream_settings.lsh_clusters_folder + FileIO.getFileByDay(startingDay)):
            currentTime = getDateTimeObjectFromTweetTimestamp(line['time_stamp'])
            for clusterMap in line['clusters']:
                yield (currentTime, TwitterCrowdsSpecificMethods.getClusterFromMapFormat(clusterMap))
        startingDay += timedelta(days=1)
def test_getClusterFromMapFormat(self):
    mapRepresentation = {'clusterId': 1,
                         'mergedClustersList': [self.cluster1.clusterId],
                         'lastStreamAddedTime': getStringRepresentationForTweetTimestamp(test_time),
                         'streams': [self.doc1.docId],
                         'dimensions': {'#tcot': 2, 'dsf': 2}}
    cluster = TwitterCrowdsSpecificMethods.getClusterFromMapFormat(mapRepresentation)
    self.assertEqual(1, cluster.clusterId)
    self.assertEqual([self.cluster1.clusterId], cluster.mergedClustersList)
    self.assertEqual([self.doc1.docId], cluster.documentsInCluster)
    self.assertEqual({'#tcot': 2, 'dsf': 2}, cluster)
    self.assertEqual(getStringRepresentationForTweetTimestamp(test_time),
                     getStringRepresentationForTweetTimestamp(cluster.lastStreamAddedTime))
def test_combineClusters(self):
    clustersMap = {self.cluster1.clusterId: self.cluster1, self.cluster2.clusterId: self.cluster2}
    clustersMap = TwitterCrowdsSpecificMethods.combineClusters(clustersMap, **twitter_stream_settings)
    self.assertEqual(1, len(clustersMap))
    mergedCluster = clustersMap.values()[0]
    self.assertEqual([self.doc1, self.doc2], list(mergedCluster.iterateDocumentsInCluster()))
    self.assertEqual(self.meanVectorForAllDocuments, mergedCluster)
    self.assertEqual([mergedCluster.docId, mergedCluster.docId],
                     list(doc.clusterId for doc in mergedCluster.iterateDocumentsInCluster()))
    self.assertEqual([self.cluster1.clusterId, self.cluster2.clusterId], mergedCluster.mergedClustersList)
def iterateUserDocuments(fileName):
    # Aggregate per-tweet phrase vectors into one vector per user,
    # remapping phrase strings to compact string ids along the way.
    dataForAggregation = defaultdict(Vector)
    textToIdMap = defaultdict(int)
    for tweet in FileIO.iterateJsonFromFile(fileName):
        textVector = TwitterCrowdsSpecificMethods.convertTweetJSONToMessage(tweet, **default_experts_twitter_stream_settings).vector
        textIdVector = Vector()
        for phrase in textVector:
            if phrase not in textToIdMap:
                textToIdMap[phrase] = str(len(textToIdMap))
            textIdVector[textToIdMap[phrase]] = textVector[phrase]
        dataForAggregation[tweet['user']['screen_name'].lower()] += textIdVector
    for k, v in dataForAggregation.iteritems():
        yield k, v
def iterateTweetUsersAfterCombiningTweets(fileName, **stream_settings):
    dataForAggregation = defaultdict(Vector)
    textToIdMap = defaultdict(int)
    for tweet in TweetFiles.iterateTweetsFromGzip(fileName):
        textVector = TwitterCrowdsSpecificMethods.convertTweetJSONToMessage(tweet, **stream_settings).vector
        textIdVector = Vector()
        for phrase in textVector:
            if phrase not in textToIdMap:
                textToIdMap[phrase] = str(len(textToIdMap))
            textIdVector[textToIdMap[phrase]] = textVector[phrase]
        dataForAggregation[tweet['user']['screen_name'].lower()] += textIdVector
    for k, v in dataForAggregation.iteritems():
        yield k, v
def _iterateUserDocuments(self):
    dataForAggregation = defaultdict(Vector)
    textToIdMap = defaultdict(int)
    for tweet in TweetFiles.iterateTweetsFromGzip(self.rawDataFileName):
        textVector = TwitterCrowdsSpecificMethods.convertTweetJSONToMessage(tweet, **self.stream_settings).vector
        textIdVector = Vector()
        for phrase in textVector:
            if phrase not in textToIdMap:
                textToIdMap[phrase] = str(len(textToIdMap))
            textIdVector[textToIdMap[phrase]] = textVector[phrase]
        dataForAggregation[tweet['user']['screen_name'].lower()] += textIdVector
    for k, v in dataForAggregation.iteritems():
        yield k, v
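The three functions above all follow the same phrase-to-id aggregation pattern. A minimal self-contained sketch of that pattern, assuming only a dict-backed stand-in for the project's `Vector` class (the real class has more behavior):

```python
from collections import defaultdict

class Vector(defaultdict):
    # Stand-in for the project's Vector: a sparse mapping from
    # dimension id to weight that supports in-place addition.
    def __init__(self):
        super(Vector, self).__init__(float)
    def __iadd__(self, other):
        for k, v in other.items():
            self[k] += v
        return self

def aggregateByUser(tweets, toVector):
    # Same pattern as iterateUserDocuments: map each phrase to a
    # compact string id, then sum the id-keyed vectors per user.
    dataForAggregation = defaultdict(Vector)
    textToIdMap = {}
    for tweet in tweets:
        textVector = toVector(tweet)
        textIdVector = Vector()
        for phrase in textVector:
            if phrase not in textToIdMap:
                textToIdMap[phrase] = str(len(textToIdMap))
            textIdVector[textToIdMap[phrase]] = textVector[phrase]
        dataForAggregation[tweet['user']['screen_name'].lower()] += textIdVector
    return dict(dataForAggregation)
```

The string-id remapping keeps dimension keys short and stable across all tweets in one pass; screen names are lowercased so the same user's tweets aggregate into one vector regardless of casing.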
def iteratePhrases():
    # Yield (phrase, epoch-bucket) pairs, bucketing tweet timestamps
    # into 60-second windows.
    for tweet in TweetFiles.iterateTweetsFromGzip('/mnt/chevron/kykamath/data/twitter/tweets_by_trends/2011_2_6.gz'):
        message = TwitterCrowdsSpecificMethods.convertTweetJSONToMessage(tweet, **settings)
        if message.vector:
            for phrase in message.vector:
                if phrase != '':
                    yield (phrase, GeneralMethods.approximateEpoch(GeneralMethods.getEpochFromDateTimeObject(message.timeStamp), 60))
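`iteratePhrases` relies on `GeneralMethods.approximateEpoch` to bucket each timestamp into a 60-second window. A sketch of that kind of bucketing, under the assumption that it is simple floor rounding (these helper names and bodies are illustrative, not the library's actual implementation):

```python
import calendar
from datetime import datetime

def approximate_epoch(epoch, granularity_seconds):
    # Floor the epoch to the start of its bucket, e.g. a 60-second window,
    # so all phrases seen within the same minute share one key.
    return int(epoch // granularity_seconds) * granularity_seconds

def epoch_from_datetime(dt):
    # Convert a (UTC) datetime to seconds since the Unix epoch.
    return calendar.timegm(dt.timetuple())
```

Bucketed epochs like this make it cheap to group phrase occurrences by time window with a plain dict keyed on `(phrase, bucket)`.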
def dataIterator(self):
    for currentTime, clusterMaps in sorted(self.clusterMaps.iteritems()):
        for clusterMap in clusterMaps:
            yield (currentTime, TwitterCrowdsSpecificMethods.getClusterFromMapFormat(clusterMap))
def test_getClusterInMapFormat(self):
    mergedCluster = StreamCluster.getClusterObjectToMergeFrom(self.cluster1)
    mergedCluster.mergedClustersList = [self.cluster1.clusterId]
    mergedCluster.lastStreamAddedTime = test_time
    mapRepresentation = {'clusterId': mergedCluster.clusterId,
                         'lastStreamAddedTime': getStringRepresentationForTweetTimestamp(mergedCluster.lastStreamAddedTime),
                         'mergedClustersList': [self.cluster1.clusterId],
                         'streams': [self.doc1.docId],
                         'dimensions': {'#tcot': 2, 'dsf': 2}}
    self.assertEqual(mapRepresentation, TwitterCrowdsSpecificMethods.getClusterInMapFormat(mergedCluster))
def test_convertTweetJSONToMessage(self):
    message = TwitterCrowdsSpecificMethods.convertTweetJSONToMessage(self.tweet, **twitter_stream_settings)
    self.assertEqual({'project': 1, 'cluster': 1, 'streams': 1, 'highdimensional': 1}, message.vector)