Skip to content

lamielle/simphtml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Simple HTML parser:

Usage: ./parse <file>.html

Notes:
-Whitespace in tags is dropped, but whitespace in text is preserved.
-A trailing Text tag with a single newline will always be present in the result.
-Line numbers for match errors are not yet printed.

Original problem spec:
Write a program to take as input a file, and determine whether it is a properly
formatted subset of html.  Here are the rules for what should be accepted:

1) A standalone tag is of the form <foo/>, where there is a '/' immediately
before the closing '>'.  It can appear anywhere that text would appear.

2) A tag is of the form <foo> and must be closed with </foo>.  Text and other
tags can appear in between.  Hitting '</bar>' in the document without having an
active '<bar>' tag is illegal.  Hitting the end of the document without finding
'</foo>' while processing a '<foo>' tag is illegal.  Tags of either form must be
valid C identifiers, with hyphens allowed as well.

3) Text can appear anywhere that a tag would start.  When parsing text, the
character '&' is an escape character.  To put a literal '&' the text should
contain '&amp;', and to put a literal '<' the text should contain '&lt;'.  All
other escape characters are invalid.

If the document is not formatted in a valid way, you should print the line
number where there is a problem, and what the problem is.

If the document is formatted in a valid way, you should return a data structure
that represents the document, including all tags and text (with all values
unescaped).

About

Simple HTML Parser Written in Python

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages