def scrapeMain(): for r in conf["scraper"]["queries"]: name = r["name"] page = r["page"] rules = r["rules"] replace = r["replace"] txtFilename = name + ".txt" ignore = conf["scraper"]["ignore"] url = conf["scraper"]["url"] + page begin = conf["scraper"]["pageBegin"] end = conf["scraper"]["pageEnd"] search = conf["scraper"]["searchTerm"] delim = conf["scraper"]["delim"] squash = conf["scraper"]["squash"] founds, pages = spidey.crawl(url, begin, end, search, delim, squash, txtFilename) print("Found: " + str(founds) + " results on " + str(pages) + " pages") print("Cleaning " + txtFilename) total, first, last = spidey.cleanResults(txtFilename, rules, replace, ignore) print("Captured " + str(total) + " results from " + first + " to " + last) print("Done!")
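# spidey is the project's own crawler module. Its interface, as inferred from
# the call sites in this file (an assumption, not its documented API):
#   spidey.crawl(url, pageBegin, pageEnd, searchTerm, delim, squash, outFilename)
#       -> (resultsFound, pagesCrawled)
#   spidey.cleanResults(filename, rules, replace, ignore)
#       -> (totalKept, firstResult, lastResult)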
def scrapeDeps(): print("Sorting deprecations...") depFilename = "LSL_DEPRECATED.txt" depFilename2 = "LSL_DEPRECATED2.txt" try: os.remove(depFilename) os.remove(depFilename2) except: pass # First search, on LSL_FUNCTIONS page searchTerm = "<s>" pageBegin = 'title="LlAbs"' pageEnd = 'id="footnote_1"' delim = [">", "<"] ignore = ["\n", "(", "(previous 200) (", "next 200", "previous 200", "(previous 200) (next 200)\n"] url = "http://wiki.secondlife.com/w/index.php?title=Category:LSL_Functions" spidey.crawl(url, pageBegin, pageEnd, searchTerm, delim, [], depFilename) spidey.cleanResults(depFilename, ["firstLower"], [False], ignore) # Second search, on the ill-maintained LSL_DEPRECATED page url = "http://wiki.secondlife.com/w/index.php?title=Category:LSL_Deprecated" pageBegin = "Pages in category" pageEnd = 'class="printfooter"' searchTerm = '<li><a href="/wiki/' ignore = ["\n", "(", "(previous 200) (", "next 200", "previous 200", "(previous 200) (next 200)\n"] spidey.crawl(url, pageBegin, pageEnd, searchTerm, delim, [], depFilename2) spidey.cleanResults(depFilename, [False], [" ", "_"], ignore) # Merge the two dep files depFile1 = open(depFilename, "r") depFile2 = open(depFilename2, "r") deps1 = depFile1.read().splitlines() # debug- print(str(deps1)) deps2 = [] for line in depFile2: line = line.strip("\n") if line[0:2] == "Ll": line = "ll" + line[2:] if not line in deps1: if line != "\\n": deps2.append(line.strip("\n")) deps2Txt = "\n".join(deps2) depFile1.close depFile2.close depFile1 = open(depFilename, "a") depFile1.write(deps2Txt) depFile1.close # Remove deps from captured result files case insensitively depFile = open(depFilename, "r") for r in conf["scraper"]["queries"]: srcFilename = r["name"] + ".txt" srcFile = open(srcFilename, "r") src = [] for line in srcFile: src.append(line.strip("\n")) for dep in depFile: dep = dep.strip("\n") try: src.remove(dep) except: pass print("Removing deprecation from: " + srcFilename + ": " + dep) srcTxt = "\n".join(src) srcFile.close srcFile = open(srcFilename, "w") srcFile.write(srcTxt) srcFile.close depFile.close print("Done!")
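# How the two steps above might be driven, assuming conf has already been
# loaded and spidey imported earlier in this script; the entry-point guard is
# a sketch, not part of the original code.
if __name__ == "__main__":
    scrapeMain()   # crawl each configured query and clean its results
    scrapeDeps()   # then strip deprecated names from the captured files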