Scrapy is a great framework for web crawling. This middleware provides a proxy rotation from many sources which define in the settings in settings.py, spider, request.
- Tests on Python 3.6
- Tests on Linux, but it's a pure python module - it should work on any other platforms with official python supported, e.g. Windows, Mac OSX, BSD
The quick way:
pip install scrapy-proxy-management
This middleware supports the following proxy storage ways:
- environment variables (compatible with the middleware provided by scrapy)
- settings.py
- MongoDB
The relative settings are followed:
This is the default setting in this middleware, which has the same behaviours and settings with the middleware provided by scrapy:
# --------------------------------------------------------------------------- # Proxy Management # --------------------------------------------------------------------------- HTTPPROXY_STORAGE = 'scrapy_proxy_management.extensions.environment_http_proxy.EnvironmentProxyStorage' # default HTTPPROXY_ENABLED = True # default False HTTPPROXY_AUTH_ENCODING = 'latin-1' # default latin-1
This way allows scrapy using the proxies defined in the settings.py. This middleware would use the proxies in an endless cycle:
# --------------------------------------------------------------------------- # Proxy Management # --------------------------------------------------------------------------- HTTPPROXY_STORAGE = 'scrapy_proxy_management.extensions.settings_http_proxy.SettingsProxyStorage' HTTPPROXY_ENABLED = True # default False HTTPPROXY_AUTH_ENCODING = 'latin-1' # default latin-1 HTTPPROXY_PROXIES = { 'http': [ 'http://username:password@proxy01.com', 'http://username:password@proxy02.com', ], 'https': [ 'https://username:password@proxy01.com', 'https://username:password@proxy02.com', ], 'no': [ 'noproxy01.com', 'noproxy02.com', ], }
This way allows scrapy using the proxies saved in MongoDB. This middleware would retrieve the proxies from MongoDB in a user-defined way:
# --------------------------------------------------------------------------- # Proxy Management # --------------------------------------------------------------------------- HTTPPROXY_STORAGE = 'scrapy_proxy_management.extensions.mongodb_http_proxy.MongoDBProxyStorage' HTTPPROXY_ENABLED = True # default False HTTPPROXY_AUTH_ENCODING = 'latin-1' # default latin-1 # HTTPPROXY_MONGODB_USERNAME = # HTTPPROXY_MONGODB_PASSWORD = HTTPPROXY_MONGODB_HOST = 'localhost' HTTPPROXY_MONGODB_PORT = 27017 # HTTPPROXY_MONGODB_OPTIONS_ = HTTPPROXY_MONGODB_DATABASE = 'scrapy_proxies' HTTPPROXY_MONGODB_COLLECTION = 'proxies' HTTPPROXY_MONGODB_AUTHSOURCE = HTTPPROXY_MONGODB_DATABASE # default same with the database contained proxies HTTPPROXY_MONGODB_NOT_MONGOCLIENT_PARAMETERS = { 'collection', 'database', 'get_proxy_from_doc', 'not_mongoclient_parameters', 'proxy_management_strategy', 'proxy_retriever', } # if any parameters added in settings.py but not belongs to mongoclient, add it here HTTPPROXY_MONGODB_PROXY_RETRIEVER = { 'name': 'find', 'filter': None, 'projection': { '_id': 1, 'scheme': 1, 'proxy': 1, 'username': 1, 'password': 1 }, 'skip': 0, 'limit': 0, 'sort': None } # the method used to retrieve the proxies from the collection HTTPPROXY_MONGODB_GET_PROXY_FROM_DOC = 'scrapy_proxy_management.extensions.mongodb_http_proxy.get_proxy_from_doc' # the method to extract proxy from each document in the collection HTTPPROXY_MONGODB_PROXY_MANAGEMENT_STRATEGY = 'scrapy_proxy_management.extensions.strategies.default_proxy_management_strategy.DefaultProxyManagementStrategy' # the strategy of the proxy management