TheHeadlessSourceMan/docStructure

docStructure

Status: Stable
Python Version: 2.7
Release Version: 1.0
Webpage: see Links below

License - LGPL

Allows simple, sophisticated access to word processor text documents.

  • Plug-in based support for unlimited file formats.
  • You can get simple string access to a document regardless of whether it's marked up or not, e.g. myDoc[5]='x'
  • You can find document elements through the hierarchy, e.g. myDoc.chapter[1].word[19]
  • You can find document elements in a flat manner, e.g. 'the'==myDoc.word[19] or myDoc.sentence[5]
  • You can get linguistic info on elements, e.g. myDoc.word[19].partOfSpeech or myDoc.word[19].definition
  • All documents can be reduced to plain text or HTML for universal display/conversion, e.g. myDoc.text or myDoc.html

Uses:

  • Grammar/spelling checker (for instance, "grammarcheck").
  • Document viewer (possibly web-based).
  • Document type conversion.
  • Data mining.
  • Grep searching.
  • Natural language processing.
  • Machine learning.

Examples:

Can load complex file formats and get useful info.

> d=Document('myDoc.odf')
> d.author
William Shakespeare

It’s easy to get important stats.

> d.numWords
105

> d.numChapters
2

The document can be accessed flatly, or in a hierarchy.

> d.chapter[0].word[0]
The
> d.chapter[0].word[1]
I
> d.word[0]
The
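One way to picture flat versus hierarchical access is a flat view that simply walks the hierarchy in order. The following is a minimal sketch, not docStructure's implementation: `TinyDocument` and `TinyChapter` are hypothetical names invented for illustration, and words are split naively on whitespace.

```python
class TinyChapter(object):
    """Hypothetical chapter: holds its own list of words."""
    def __init__(self, text):
        # Naive whitespace tokenization, for illustration only
        self.word = text.split()


class TinyDocument(object):
    """Hypothetical document: a list of chapters plus a flat word view."""
    def __init__(self, chapter_texts):
        self.chapter = [TinyChapter(t) for t in chapter_texts]

    @property
    def word(self):
        # Flat view: the words of every chapter, concatenated in order
        return [w for c in self.chapter for w in c.word]


tiny = TinyDocument(["The quick fox", "I ran"])
first_hierarchical = tiny.chapter[0].word[0]  # first word of the first chapter
first_flat = tiny.word[0]                     # same word through the flat view
```

Both lookups reach the same word; only the path differs.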

It can attempt to deduce who a pronoun is referencing.

> d.word[7]
he
> d.word[7].partOfSpeech
pronoun
> d.word[7].refersTo
Fred

You can always access the document as a simple string (a useful trick for marked-up documents!)

> d.find('Fred')
27
> d[27]
F
> d[27:30]
Fre

Get any document as plain text or HTML.

> d.html='This is a document.'
> d.text
This is a document.

Combine concepts.

> d.word[7].refersTo[0:4]
Fred
> d.chapter[5].html='This is a document.'

Notes:

  • When passing a Document into other code that expects a string, beware of checks like "if type(x)==str:". In that case, the code will NOT treat a Document as a str!
  • Also beware of file.write(doc). This writes a flattened string, which may or may not be what you want. Alternatives: doc.save(filename) or file.write(doc.html)
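The first note can be reproduced with any string-like class. This is a minimal sketch that assumes a hypothetical `FakeDocument` subclassing `str`; whether docStructure's own `Document` subclasses `str` is not stated here, so treat the lenient check as something to verify rather than a guarantee.

```python
class FakeDocument(str):
    """Hypothetical stand-in for a string-like document (not docStructure's class)."""
    pass


def is_string_strict(x):
    # Exact type comparison: rejects every subclass of str
    return type(x) == str


def is_string_lenient(x):
    # isinstance accepts str and any subclass of it
    return isinstance(x, str)


doc = FakeDocument("happy bears love")
strict_result = is_string_strict(doc)
lenient_result = is_string_lenient(doc)
```

Code that receives documents is usually safer with the `isinstance` form, since an exact type comparison silently excludes subclasses.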

Status:

  • Has a basic implementation that more or less works.
  • Currently works under the assumption of flatten=>use=>replace_selection, where each document fragment stores its index into the flattened document for quick finding.
  • Storing these indices makes lookups fast, but makes replacements early in the file slow, because the index of every later fragment has to be updated.
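The trade-off described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `FlatIndex`, `piece_at`, and `replace_piece` are hypothetical names, and fragments are modeled as plain strings with absolute start offsets.

```python
import bisect


class FlatIndex(object):
    """Hypothetical model: each fragment stores an absolute start offset."""
    def __init__(self, pieces):
        self.pieces = list(pieces)
        self.starts = []
        pos = 0
        for p in self.pieces:
            self.starts.append(pos)
            pos += len(p)

    def piece_at(self, index):
        # Fast lookup: binary search over the stored absolute offsets
        i = bisect.bisect_right(self.starts, index) - 1
        return i, self.pieces[i]

    def replace_piece(self, i, new_text):
        # Slow for early fragments: every later offset must be shifted
        delta = len(new_text) - len(self.pieces[i])
        self.pieces[i] = new_text
        for j in range(i + 1, len(self.starts)):
            self.starts[j] += delta
        return len(self.starts) - (i + 1)  # number of offsets touched


flat = FlatIndex(["happy ", "bears ", "love"])
found = flat.piece_at(7)                  # character 7 falls inside "bears "
touched = flat.replace_piece(0, "sad ")   # edit the very first fragment
```

Replacing the first fragment touches every other offset, while replacing the last one touches none.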

TODO:

  • Insert+index could be sped up by storing each element's offset from its parent element. This slows down index reads, but could be a good compromise.
  • Inserting across elements is an unknown, e.g. replacing "bears love" in "happy bears love" when the words belong to different elements.
  • Still need to develop a concept for single character representation, such as html "<br>" = "\n" and "&gt;" = ">"
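The parent-offset idea from the first TODO item could look roughly like this. It is a sketch under assumptions rather than a planned design: `Node`, `absolute_start`, and `set_text` are hypothetical names, and lengths are recomputed on demand to keep the example short.

```python
class Node(object):
    """Hypothetical element that stores its offset relative to its parent."""
    def __init__(self, text="", children=None):
        self.text = text            # leaf text; unused when children exist
        self.parent = None
        self.offset = 0             # offset from the start of the parent
        self.children = children or []
        pos = 0
        for c in self.children:
            c.parent = self
            c.offset = pos
            pos += c.length()

    def length(self):
        if self.children:
            return sum(c.length() for c in self.children)
        return len(self.text)

    def absolute_start(self):
        # Slower read: sum the offsets on the path up to the root
        node, total = self, 0
        while node is not None:
            total += node.offset
            node = node.parent
        return total

    def set_text(self, new_text):
        # Cheaper write: only later siblings at each level are shifted
        delta = len(new_text) - len(self.text)
        self.text = new_text
        node = self
        while node.parent is not None:
            siblings = node.parent.children
            for s in siblings[siblings.index(node) + 1:]:
                s.offset += delta
            node = node.parent


w1, w2 = Node("happy "), Node("bears ")
w3, w4 = Node("love"), Node(" honey")
root = Node(children=[Node(children=[w1, w2]), Node(children=[w3, w4])])
before = w4.absolute_start()
w1.set_text("sad ")
after = w4.absolute_start()
```

Here the edit to the first word never touches the stored offset of the last word; its absolute position changes only because an ancestor's offset changed.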

Links:

Main webpage: https://theheadlesssourceman.wordpress.com/2018/08/02/docstructure/
