GitHub - inductiveload/pygrabber: Python based digital library archive assistant.

pyGrabber is a program to download books from the Internet as a sequence 
of images. These images can then be OCR'd, either by retrieval of OCR 
from the online source, or by Tesseract. They can also be collated to 
DjVu, with OCR text layer, or uploaded individually to Wikimedia Commons.

DISCLAIMER:
pyGrabber is to be used ONLY to download public domain books in a legal
fashion. The capability of pyGrabber to download from a specific resource
does not imply that you are allowed to, and you do so at your own risk.


1) REQUIREMENTS ============================================================

Due to pyGrabber's wide range of tasks, there are several dependencies:

    1)  python 2.x (2.6 is the development platform)
        Upgrade to Python 3 will be done if and when there is sufficient 
        demand, or when I migrate myself.
        
    2)  wxPython 2.8 for the GUI elements
    
    3)  LXML for the HTML parsing of webpages
    
    4)  pyWikipedia for the upload to Commons. Optional. You don't need
        this if you don't intend to upload using pyGrabber.
        
    5)  djvulibre for the DJVU construction. Optional. You don't need this
        if you won't convert to DJVU.
        
    6)  tesseract-ocr for the OCR (with lib-tiff support). Optional. You 
        don't need this if you will not perform OCR. You may need it even 
        if the source you are grabbing from provides OCR, as tesseract is 
        the fallback option.

2) USAGE ====================================================================

    2.1) Setting up pyGrabber

        Before you begin, you need to set USE_PYWIKIPEDIA to True or False
        as appropriate.
        
        If you set USE_PYWIKIPEDIA True, you also need to provide the 
        path to the PYWIKIPEDIA directory. Put this in PYWIKIPEDIA_PATH.
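
        As an illustration, the two settings might look like the sketch
        below. The README does not say which file holds them, the path
        shown is a placeholder, and pywikipedia_search_path() is a
        hypothetical helper, not part of pyGrabber. It only shows one
        common way to make a library in a non-standard directory
        importable.

```python
import sys

# Set to True only if you intend to upload to Commons from pyGrabber.
USE_PYWIKIPEDIA = True

# Placeholder path: replace with the location of your pywikipedia checkout.
PYWIKIPEDIA_PATH = "/home/user/pywikipedia"


def pywikipedia_search_path(use_pywikipedia, pywikipedia_path, search_path):
    """Return a copy of search_path, extended when uploads are enabled.

    Hypothetical helper for illustration only.
    """
    if use_pywikipedia and pywikipedia_path not in search_path:
        return list(search_path) + [pywikipedia_path]
    return list(search_path)


# Make the pywikipedia modules importable for the rest of the program.
sys.path[:] = pywikipedia_search_path(USE_PYWIKIPEDIA, PYWIKIPEDIA_PATH, sys.path)
```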

        
    2.2) Running pyGrabber

        Running pyGrabber is as simple as running pygrabber.py from 
        the terminal. If you wish to upload files, you will need to 
        respond to queries from pywikipedia in the terminal.
        
    2.3)  Setting up pyGrabber for a job

        When you wish to begin a job, or "grab", you need to set the 
        options in the Settings Panel on the left:
        
        Text ID:    The unique identifier for the work you are working on.
                    See the "Sources" section for details.
                    
        Text Source: The Source you wish to download the text from. For a
                    list of sources, see the "Sources" section.
                    
        Pages:      The first and last page of the range you wish to 
                    download, inclusive.
                    
        Guess from local files: Try to guess the first and last files 
                    based on which files are already available in the 
                    local book directory.
                    
        Use a proxy: Whether or not to use a proxy to download. Use this if
                    the source only delivers content to certain locations.
                    
        Proxy IP:   IP address (and, optionally, port) of the proxy server.
                    eg. 111.222.333.444:80
                    
        Inter-fetch delay:  The delay between sequential fetches. This is
                    for use on servers which don't have sufficient upload
                    bandwidth, or on servers which will prevent rapid
                    downloading by a single client.
                    
        Top directory:  The directory into which you wish to put the 
                    directory holding the files for this grab.
                    eg. C:\book-grabs
                    
        Custom book directory: If this is not selected, the book directory
                    is set automatically, based on the top directory, source
                    and text id. If this is selected, the directory is
                    whatever is in the book-directory text box.
                    
        Book directory: The directory to store the grab files. If Custom 
                    book directory is unset, you can't change this.
        
        Filename prefix: The prefix of the generated and uploaded files.
        
                    eg. Prefix = Filename here
                    
                    DJVU file:  Filename here.djvu
                    Uploaded images: Filename here - 0001.jpg
                    
        Upload images: Whether you wish to upload individual images to 
                    Wikimedia Commons. You need pyWikipedia if you select
                    this option.
                    
        Force upload: Upload over files with the same name, useful if you
                    made a mistake first time around. Not recommended
                    otherwise.
        
        Template: The page template to provide as the image upload data.
                    If the template given is "template name", the upload 
                    data for the first image will be:
                    
                    {{template name|0001}}
                    
                    It is up to you to make sure this template exists and
                    can handle the page number correctly. If you want
                    more control over the data, such as specific parameters
                    for different pages, pyGrabber is the wrong tool for
                    the upload.
                    
        Try to download missing images: If there are missing images in the
                    sequence, try to download them from the specified 
                    source. If this is not selected, missing pages will be
                    skipped.
                    
        Convert to DjVu: Convert the sequence of images to DjVu
        
        Bitonal DjVu: Make the DjVu black-and-white only. This is good for
                    images that are already bitonal, and very long works
                    which need to be drastically compressed to fit in
                    100MB.

        DjVu quality:   The quality of the DjVu image compression. This is 
                    a number from 16 to 50. Only applies to some image
                    file formats.
                    
        Perform OCR, add to DjVu: Perform OCR by either downloading from
                    the specified source (only some sources provide OCR),
                    or as a fallback option, Tesseract.
                    
        OCR language: Tesseract language. eg. eng for English.
        
        Use Tesseract if source page has no OCR: If the source has no
                    OCR for a page, select this option to generate it
                    with Tesseract.
        
        Perform all OCR locally with Tesseract: Do not find OCR from the
                    source, always use Tesseract. Useful if you are using
                    pyGrabber just to collate files, not download them.
                    
                    Overrides the previous option.
                    
        Use any available previously generated OCR: If you made OCR before,
                    don't bother fetching or generating new OCR.
                    
        Dump readable OCR: Provide a single concatenated OCR file at the 
                    end of the process, in addition to the page files.
                    
        Cleanup images before OCR: Clean the images with an Imagemagick 
                    script to try to improve OCR performance.
                    
        Cleanup command: This is the command you will use to perform the 
                    cleaning. You can use the following strings to
                    interpolate variables:
                    
                    %fin   the input file, from the source, or that you
                            saved to the directory yourself
                    %fout  the output file that will be used for OCR.
                            this will be removed automatically.
                    ;      split commands, if you need more than one 
                            step
                    
                    Use double quotes to surround arguments containing
                    spaces. Backslash-escaping will not work:
                    "This file" is right, This\ file is not.
                    
                    Environment variables (such as $HOME) and ~ can be
                    used.
                            
                    Unicode must not be used, as shlex.split() doesn't 
                    accept that in Python 2.x
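
        The interpolation rules for the cleanup command can be sketched
        as follows. This is an illustration under stated assumptions, not
        pyGrabber's own code: build_cleanup_commands() is a hypothetical
        name, the sketch splits on every ";" (even inside quotes), and it
        relies on shlex.split(), whose backslash handling may differ from
        what pyGrabber actually does. The file names are substituted after
        splitting, so a file name containing spaces stays one argument.

```python
import os
import shlex


def build_cleanup_commands(template, fin, fout):
    """Turn a cleanup command template into a list of argument lists.

    Hypothetical helper for illustration only.
    """
    commands = []
    for step in template.split(";"):
        step = step.strip()
        if not step:
            continue  # ignore empty steps such as a trailing ";"
        args = []
        for arg in shlex.split(step):
            # Expand $VARS and ~ first, then insert the file names, so the
            # file names themselves are never altered by the expansion.
            arg = os.path.expanduser(os.path.expandvars(arg))
            arg = arg.replace("%fin", fin).replace("%fout", fout)
            args.append(arg)
        commands.append(args)
    return commands


# Example template; the ImageMagick option is only an example.
steps = build_cleanup_commands(
    'convert %fin -despeckle "%fout"; echo done', "in page.jpg", "out.tif"
)
```

        Each inner list could then be handed to subprocess.call() one step
        at a time.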
                        
    2.4) Starting and ending a job
        
        To start the processing, click "Begin grab". This button will then
        be greyed out and the "Abort grab" button will be enabled. The
        book directory will be checked for existing local files, and the
        pages will then be downloaded and processed one at a time. The DjVu
        will be constructed one page at a time, as we go along.
        
        If you wish to abort a grab, click "Abort grab". The grab will be 
        aborted once the current task is complete. The "Begin grab" button
        will be re-enabled when this happens. Be aware that this could take 
        a few seconds if the current task is a long one (downloading and OCR
        especially).
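
        The reason an abort only takes effect after the current task can
        be shown with a small sketch. This is not pyGrabber's actual code:
        run_grab() and the use of threading.Event are assumptions chosen
        to illustrate a flag that is checked between tasks only.

```python
import threading


def run_grab(tasks, abort_requested):
    """Run tasks in order, stopping between tasks once an abort is requested.

    Returns the number of tasks that were completed.
    """
    done = 0
    for task in tasks:
        # The flag is only checked between tasks, so a task that has
        # already started always runs to completion.
        if abort_requested.is_set():
            break
        task()
        done += 1
    return done


abort = threading.Event()
log = []
tasks = [
    lambda: log.append("page 1"),
    lambda: (log.append("page 2"), abort.set()),  # "Abort grab" clicked here
    lambda: log.append("page 3"),
]
completed = run_grab(tasks, abort)
```

        In a GUI program the loop would normally run in a worker thread,
        with the "Abort grab" button calling abort.set().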
        
        If you wish to delete all the files in the directory and start
        again, click "Delete all files". You will be prompted before 
        deletion. This is useful for "do-overs".
            
    2.5) Using pyGrabber with local files only
        
        You can use pyGrabber to generate DjVu and OCR from local files
        without fetching the images from a remote site.
        
        1)  Download the images to a local directory. Name them 0001.ext 
            and so on.
        2)  Select "Custom book directory" and enter the directory name 
            in the "Book Directory" textbox
        3)  Uncheck "Try to download missing images"
        4)  Set other conversion options as normal
        5)  Click "Begin grab" - the files will appear in the file pane
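
        To check a local directory before starting, the page range and any
        gaps can be worked out from the file names. The sketch below is
        an assumption about the naming scheme (four zero-padded digits
        plus an extension, as in 0001.ext); the function names are
        hypothetical and this is not how pyGrabber itself is known to do
        it.

```python
import re

# Assumed naming scheme: four digits, a dot, then any extension.
PAGE_FILE = re.compile(r"^(\d{4})\.\w+$")


def page_numbers(filenames):
    """Return the sorted, de-duplicated page numbers found in filenames."""
    found = set()
    for name in filenames:
        match = PAGE_FILE.match(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


def guess_page_range(filenames):
    """Return (first, last) page numbers, or None if no page files exist."""
    pages = page_numbers(filenames)
    if not pages:
        return None
    return pages[0], pages[-1]


def missing_pages(filenames):
    """Return the page numbers absent between the first and last page."""
    pages = page_numbers(filenames)
    if not pages:
        return []
    present = set(pages)
    return [n for n in range(pages[0], pages[-1] + 1) if n not in present]
```

        For a real directory, pass os.listdir(book_directory) as the
        filenames argument.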
