lamielle/simphtml
Simple HTML parser

Usage:
  ./parse <file>.html

Notes:
-Whitespace in tags is dropped, but whitespace in text is preserved.
-A trailing Text tag with a single newline will always be present in the result.
-Line numbers for match errors are not yet printed.

Original problem spec:

Write a program to take as input a file, and determine whether it is a properly formatted subset of html. Here are the rules for what should be accepted:

1) A standalone tag is of the form <foo/>, where there is a '/' immediately before the closing '>'. It can appear anywhere that text would appear.

2) A tag is of the form <foo> and must be closed with </foo>. Text and other tags can appear in between. Hitting '</bar>' in the document without having an active '<bar>' tag is illegal. Hitting the end of the document without finding '</foo>' while processing a '<foo>' tag is illegal. Tag names of either form must be valid C identifiers, with hyphens allowed as well.

3) Text can appear anywhere that a tag would start. When parsing text, the character '&' is an escape character. To put a literal '&' the text should contain '&amp;', and to put a literal '<' the text should contain '&lt;'. All other escape sequences are invalid.

If the document is not formatted in a valid way, you should print the line number where there is a problem, and what the problem is. If the document is formatted in a valid way, you should return a data structure that represents the document, including all tags and text (with all values unescaped).
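
The three rules in the problem spec can be sketched with a single stack of open tags. This is a minimal illustration, not the code in this repository: the names parse, unescape and ParseError, the tuple-based result, and the choice to forbid a leading hyphen in tag names are all assumptions made for the example. Unlike the parser described in the notes, this sketch attaches a line number to every error.

```python
import re

# Tag names: C identifiers with hyphens also allowed.
# Assumption: a hyphen may not be the first character.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")
# The only two escape sequences the spec accepts.
ESCAPES = {"&amp;": "&", "&lt;": "<"}


class ParseError(Exception):
    """Raised with a 'line N: problem' message for invalid documents."""


def unescape(text, line):
    """Return text with escapes decoded; line is where the text starts."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "&":
            for esc, literal in ESCAPES.items():
                if text.startswith(esc, i):
                    out.append(literal)
                    i += len(esc)
                    break
            else:
                raise ParseError(f"line {line}: invalid escape sequence")
        else:
            if text[i] == "\n":
                line += 1
            out.append(text[i])
            i += 1
    return "".join(out)


def parse(source):
    """Return a list of ("text", str) and ("tag", name, children) nodes.

    children is None for a standalone tag such as <foo/>.
    """
    root = []
    stack = [("", root, 0)]  # (tag name, child list, line where opened)
    pos, line = 0, 1
    while pos < len(source):
        if source[pos] == "<":
            end = source.find(">", pos)
            if end == -1:
                raise ParseError(f"line {line}: unterminated tag")
            body = source[pos + 1:end]
            closing = body.startswith("/")
            standalone = body.endswith("/") and not closing
            if closing:
                body = body[1:]
            elif standalone:
                body = body[:-1]
            name = "".join(body.split())  # whitespace inside tags is dropped
            if not NAME.fullmatch(name):
                raise ParseError(f"line {line}: invalid tag name {name!r}")
            if standalone:
                stack[-1][1].append(("tag", name, None))
            elif closing:
                if len(stack) == 1 or stack[-1][0] != name:
                    raise ParseError(f"line {line}: unexpected </{name}>")
                stack.pop()
            else:
                children = []
                stack[-1][1].append(("tag", name, children))
                stack.append((name, children, line))
            line += source.count("\n", pos, end + 1)
            pos = end + 1
        else:
            end = source.find("<", pos)
            if end == -1:
                end = len(source)
            stack[-1][1].append(("text", unescape(source[pos:end], line)))
            line += source.count("\n", pos, end)
            pos = end
    if len(stack) > 1:
        name, _, opened = stack[-1]
        raise ParseError(
            f"line {line}: missing </{name}> for <{name}> opened on line {opened}")
    return root
```

For example, parse("<a>x &amp; y<b/></a>\n") returns a single "a" tag whose children are the text "x & y" and the standalone tag "b", followed by a trailing text node holding the final newline. Calling parse("<a>\n</b>") raises ParseError with a message that starts with "line 2".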
About
Simple HTML Parser Written in Python