def __init__(self, path, domain, year, incyear=None, skip_memory_error=False):
    """Load all stored page texts for *domain* in *year* under *path*.

    The raw domain/URL is normalized through data_reader.clean_domain_url
    before the lookup, so callers may pass full URLs. The loaded texts are
    kept on ``self.texts`` (a list of page strings, or None if nothing is
    stored locally — see load_domain).
    """
    cleaned = data_reader.clean_domain_url(domain)
    self.texts = self.load_domain(
        path, cleaned, year, incyear, skip_memory_error=skip_memory_error
    )
def split_wayback_url(self, wayback_url):
    """Split a Wayback Machine snapshot URL into (domain, address).

    Strips the ``web.archive.org/web/<timestamp>/`` prefix and the archived
    URL's own scheme, then separates the host from the path. The domain is
    normalized via data_reader.clean_domain_url before returning.

    Returns a (domain, address) tuple; address is "" when the archived URL
    has no path component.
    """
    # Escape the dots (previously they matched ANY character) and accept
    # both http and https archive prefixes.
    original_url = re.sub(r'https?://web\.archive\.org/web/\d+/', "", wayback_url)
    # Drop the archived URL's own scheme. The old pattern used the invalid
    # escape "\:" (a DeprecationWarning on modern Python).
    website_piece = re.sub(r"https?://", "", original_url)
    try:
        (domain, address) = website_piece.split("/", 1)
    except ValueError:
        # No "/" present: the whole piece is the bare domain.
        domain = website_piece
        address = ""
    domain = data_reader.clean_domain_url(domain)
    return (domain, address)
def download_all(websites):
    """Crawl every company website in *websites* via the Wayback Machine.

    *websites* is a DataFrame with at least 'website' and 'founding_year'
    columns. Supports resuming: rows before the persisted last-company
    index are skipped, and already-downloaded domains are skipped using a
    periodically refreshed set. Progress stats are printed as it goes.

    Side effects: updates the module-level ``company_index_track``,
    persists progress via store_last_company, and downloads pages to disk
    through waybackmachine_crawler.
    """
    global company_index_track
    counter = 0
    count_downloaded = 0
    count_skipped = 0
    total_websites = websites.shape[0]
    print("\n\n\nStarting the scraping of all websites. A total of {0} websites\n\n".format(total_websites))
    last_company = get_last_company()
    for index, company in websites.iterrows():
        # Refresh the already-downloaded set every 100 rows (counter is 0
        # on the first iteration, so `downloaded` is always bound before use).
        if (counter % 100) == 0:
            downloaded = read_already_downloaded()
        counter += 1
        # Resume support: skip rows handled in a previous run.
        if counter < company_index_track or counter < last_company:
            continue
        print("\nStarting crawl number {0} of {1} : {2}".format(counter, total_websites, company['website']))
        if data_reader.clean_domain_url(company['website']) in downloaded:
            print(".Skipping {0}. Already downloaded".format(company['website']))
            count_skipped += 1
            continue
        crawler = waybackmachine_crawler(company['website'])
        # Crawl from Jan 1 of the year AFTER founding, when the site
        # presumably exists in the archive.
        year = company['founding_year'] + 1
        crawler.crawl_from_date(year, 1, 1)
        company_index_track = counter
        store_last_company(counter)
        count_downloaded += 1
        # tot >= 1 here (count_downloaded was just incremented), so the
        # percentage division is safe.
        tot = count_downloaded + count_skipped
        # Fixed: the original string carried a line-continuation artifact
        # ("\n\ \t") that printed a stray backslash-space; use implicit
        # string concatenation instead.
        print("\t. -- Download done.\n"
              "\t. STATS: {0} Downloaded ({1}%). {2} Skipped ({3}%)".format(
                  count_downloaded, round(count_downloaded * 100 / tot),
                  count_skipped, round(count_skipped * 100 / tot)))
def load_domain(self, path, domain, year=None, incyear=None, force_download=False, skip_memory_error=False):
    """Load every stored page text for *domain* (optionally one *year*).

    Looks under ``path/<clean_domain>[/<year>]``. If the folder is missing:
    returns None, unless force_download is True, in which case the pages
    are fetched first via self.force_download.

    Returns a list of whitespace-normalized page texts, or None when no
    local data exists and force_download is False.
    """
    clean_domain = data_reader.clean_domain_url(domain)
    root_folder = "{0}/{1}".format(path, clean_domain).replace("//", "/")
    if year is None:
        file_folder = root_folder
    else:
        file_folder = "{0}/{1}/{2}".format(path, clean_domain, year).replace("//", "/")
    # isdir(file_folder) already implies exists(file_folder), so the
    # original triple check reduces to these two.
    if not (os.path.exists(root_folder) and os.path.isdir(file_folder)):
        if force_download:
            # Removed a leftover pdb.set_trace() breakpoint that halted
            # execution on every forced download.
            # *year* takes precedence; NOTE(review): if both year and
            # incyear are None this raises TypeError on int(None) —
            # presumably callers always supply one; verify.
            download_year = year if year is not None else incyear
            download_year = int(download_year)
            self.force_download(root_folder, domain, download_year)
        else:
            return None
    files = []
    for file_name in os.listdir(file_folder):
        text = self.load_page(file_folder + "/" + file_name,
                              skip_memory_error=skip_memory_error)
        # Collapse all whitespace runs to single spaces for uniform text.
        text = re.sub(r"\s+", " ", text)
        files.append(text)
    return files