def main(targetWebsite, configFile):

    global unseenUrlList
    global BASELOGDIR
    global BASELOCKFILEDIR
    global BASEFILESTORAGEDIR
    global BASEERRORLOGDIR
    global BASECONTENTDIR
    global contentLogFile
    global mode

    # Extract the website name.
    sitename = KastGenericFunctionsLib.extractWebSiteName(targetWebsite)

    # First generate the folder structure if it does not exist.
    BASELOGDIR = KastGenericFunctionsLib.chkmkFolderStructure(BASELOGDIR)
    BASELOCKFILEDIR = KastGenericFunctionsLib.chkmkFolderStructure(BASELOCKFILEDIR)
    BASEFILESTORAGEDIR = KastGenericFunctionsLib.chkmkFolderStructure(BASEFILESTORAGEDIR + sitename + '/')
    BASEERRORLOGDIR = KastGenericFunctionsLib.chkmkFolderStructure(BASEERRORLOGDIR)
    BASECONTENTDIR = KastGenericFunctionsLib.chkmkFolderStructure(BASECONTENTDIR)

    # Now generate the task/target-specific filenames.
    lockFile = BASELOCKFILEDIR + sitename + '.lock'
    errorLog = BASEERRORLOGDIR + sitename + '.error'
    contentLogFile = BASECONTENTDIR + sitename + '-' + str(round(time.time(), 2))

    # If the lock file exists, a crawl is already in progress; exit.
    if os.path.exists(lockFile):
        KastGenericFunctionsLib.logException(sitename + ' crawl in progress - Exiting - ' + str(time.time()), BASELOGDIR + sitename + '.exit.log')
        sys.exit(-1)

    # Make a lock file.
    if mode == 'p':
        lf = open(lockFile, 'w')
        lf.close()

    # Read the config file into a dictionary/hash structure.
    targetWebsiteConfigs = KastParsersLib.kastConfigFileParser(configFile)
    if targetWebsiteConfigs == {}:
        KastGenericFunctionsLib.logException('Target website configs could not be extracted - ' + str(time.time()), errorLog)
        sys.exit(-1)

    # Obtain the list of sample URLs from the above data structure and generate
    # the time-domain signal representation of their HTML content.
    htmlSeries = [KastParsersLib.html2TagSignal(url) for url in targetWebsiteConfigs['SampleURLS']]

    # Calculate the average similarity measure.
    similarityMeasure = KastParsersLib.calculateThresholdDftDistanceScore(htmlSeries)

    # Populate the unseen URL list.
    unseenUrlList = KastParsersLib.populateUnseenUrlList(targetWebsite, unseenUrlList)
    if unseenUrlList == []:
        KastGenericFunctionsLib.logException('Seed URL list is malformed. Crawl engine is exiting - ' + str(time.time()), errorLog)
        sys.exit(-1)

    # Start crawling.
    crawl(targetWebsite)

    # Now apply the page-classification algorithm to preserve only the pages of interest.
    classify(htmlSeries, similarityMeasure)

    # Apply the CSS rules for scraping content; this serves as a simple rule-engine template.
    contentExtractionRules = targetWebsiteConfigs['ContentExtractionRules']
    extractContent(contentExtractionRules)

    # Convert the log file into an RDF N-Triples file.
    predicateList = targetWebsiteConfigs['PredicateList']
    nTriplesFile = table2RDFNTriplesConverter(contentLogFile, predicateList)

    # Now log all the information to the AllegroGraph DB store.
    store2db(nTriplesFile)
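The lock-file check above is a common single-instance guard: exit if the file exists, otherwise create it before crawling and remove it on exit. A minimal, self-contained sketch of that pattern (the helper `acquire_lock` and the `example.com` names are illustrative, not part of the crawler):

```python
import os
import tempfile

def acquire_lock(lock_file):
    """Return True and create the lock file if no crawl is in progress,
    or False if the lock file already exists (another crawl holds it)."""
    if os.path.exists(lock_file):
        return False
    open(lock_file, 'w').close()  # touch the lock file, as main() does
    return True

# Demonstration, with a temporary directory standing in for BASELOCKFILEDIR.
lock_dir = tempfile.mkdtemp()
lock_path = os.path.join(lock_dir, 'example.com.lock')

first = acquire_lock(lock_path)   # no lock yet: acquired
second = acquire_lock(lock_path)  # lock exists: refused
os.remove(lock_path)              # what the crawler would do when finished
```

Note that an exists-then-create check is racy if two crawls start at the same instant; `os.open` with `os.O_CREAT | os.O_EXCL` makes the creation atomic if that matters.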