
Create WebKit/Safari .webarchive files on any platform


David-Development/python-webarchive

 
 


python-webarchive

This is a quick hack demonstrating how to create WebKit/Safari .webarchive files, inspired by pocket-archive-stream.

Usage

TARGET_URL=http://foo.com python3 main.py

# With additional URLs:
TARGET_URL=http://localhost:8000/ios/index.ios.html CHANGE_DOMAIN_FROM="http://localhost:8000/" CHANGE_DOMAIN_TO="https://my-fancy-domain.de/" ADDITIONAL_URLS="http://myurl.de/img1.png;http://myurl.de/img2.png" python3 main.py
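The environment variables in the commands above could be handled roughly as in the following sketch. It is not taken from main.py: the function names `read_settings` and `rewrite_domain` are hypothetical. Splitting ADDITIONAL_URLS on ";" follows the example above, but treating CHANGE_DOMAIN_FROM/CHANGE_DOMAIN_TO as a simple URL prefix swap is an assumption.

```python
import os


def read_settings(env=os.environ):
    """Collect the settings shown in the usage commands from a mapping.

    Hypothetical helper: the variable names come from the usage examples,
    everything else is an assumption for illustration.
    """
    additional = env.get("ADDITIONAL_URLS", "")
    return {
        "target_url": env["TARGET_URL"],  # required
        "change_from": env.get("CHANGE_DOMAIN_FROM"),
        "change_to": env.get("CHANGE_DOMAIN_TO"),
        # Split on ';' and drop empty entries such as a trailing separator.
        "additional_urls": [u for u in additional.split(";") if u],
    }


def rewrite_domain(url, settings):
    """Swap the URL prefix when both CHANGE_DOMAIN_* values are set (assumed behavior)."""
    src, dst = settings["change_from"], settings["change_to"]
    if src and dst and url.startswith(src):
        return dst + url[len(src):]
    return url


settings = read_settings({
    "TARGET_URL": "http://localhost:8000/ios/index.ios.html",
    "CHANGE_DOMAIN_FROM": "http://localhost:8000/",
    "CHANGE_DOMAIN_TO": "https://my-fancy-domain.de/",
    "ADDITIONAL_URLS": "http://myurl.de/img1.png;http://myurl.de/img2.png",
})
rewritten = rewrite_domain(settings["target_url"], settings)
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to try without exporting anything in the shell.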

Why .webarchive?

.webarchive is the native web page archive format on the Mac, and is essentially a serialized snapshot of Safari/WebKit state. On a Mac, these files are Spotlight-indexable and can be opened by just about anything that takes a "webpage" as input.

Despite the rising prominence of WARC as the standard web archiving format (which to this day requires plug-ins to be viewable in a browser), I quite like .webarchive. I built this both to demonstrate how to use it and to have a minimally viable archive creator I can deploy as a service.

Anatomy of a .webarchive file

The file format is a nested binary .plist, with roughly the following structure:

{
    "WebMainResource": {
        "WebResourceURL": String(),
        "WebResourceMIMEType": String(),
        "WebResourceResponse": NSKeyedArchiver(NSObject),
        "WebResourceData": Bytes(),
        "WebResourceTextEncodingName": String(optional=True)
    },
    "WebSubresources": [
        {item, item, item...}
    ]
}

So creating a .webarchive turns out to be fairly straightforward if you simply build a dict with the right structure and then serialize it using biplist (which works on any platform).

The only hitch would be WebResourceResponse (which uses a rather more complex way to encode the HTTP response headers), but fortunately that appears not to be necessary at all.
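As a minimal sketch of the approach described above, the following builds the dict and serializes it as a binary plist. To keep it dependency-free it uses the standard library's plistlib instead of biplist, which is a substitution and not what this project uses. The function name `build_webarchive` and its parameters are hypothetical, the subresource entries are assumed to carry the same keys as the main resource, and WebResourceResponse is left out, as discussed.

```python
import plistlib


def build_webarchive(url, data, mime_type="text/html", encoding="UTF-8",
                     subresources=()):
    """Return the bytes of a binary-plist .webarchive (illustrative helper).

    `subresources` is an iterable of (url, mime_type, data) tuples.
    """
    archive = {
        "WebMainResource": {
            "WebResourceURL": url,
            "WebResourceMIMEType": mime_type,
            "WebResourceData": data,  # raw bytes of the page
            "WebResourceTextEncodingName": encoding,
        },
        # Assumption: each subresource uses the same keys as the main one.
        "WebSubresources": [
            {
                "WebResourceURL": sub_url,
                "WebResourceMIMEType": sub_mime,
                "WebResourceData": sub_data,
            }
            for sub_url, sub_mime, sub_data in subresources
        ],
    }
    # FMT_BINARY produces the binary .plist encoding rather than XML.
    return plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)


payload = build_webarchive(
    "http://foo.com/",
    b"<html><body><img src='http://foo.com/a.png'></body></html>",
    subresources=[("http://foo.com/a.png", "image/png", b"placeholder image bytes")],
)

# Reading the bytes back shows the nested structure survives the round trip.
loaded = plistlib.loads(payload)
```

Writing `payload` to a file opened in "wb" mode with a .webarchive extension should give a file Safari can open, though that step is not exercised here.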

Next Steps
