CSC 499 S09: XML Model
From CSWiki
Tag spectrum for enwik8
<id> : 34078                        </id> : 34078
<title> : 12347                     </title> : 12347
<timestamp> : 12347                 </timestamp> : 12347
<contributor> : 12347               </contributor> : 12347
<page> : 12347                      </page> : 12346
<revision> : 12347                  </revision> : 12346
<text> : 12345                      </text> : 12344
<comment> : 10043                   </comment> : 10043
<username> : 9384                   </username> : 9384
<minor/> : 5665
<ip> : 2963                         </ip> : 2963
<restrictions> : 162                </restrictions> : 162
<namespace> : 19                    </namespace> : 19
<text/> : 2
<mediawiki//////////////////> : 1
<namespaces> : 1                    </namespaces> : 1
<generator> : 1                     </generator> : 1
<namespace/> : 1
<sitename> : 1                      </sitename> : 1
<case> : 1                          </case> : 1
<base> : 1                          </base> : 1
<siteinfo> : 1                      </siteinfo> : 1
Thoughts on tag spectrum
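For tags with separate open and close forms, the close count is always less than or equal to the open count, which indicates correct but truncated nesting (i.e. the file ends before the remaining close tags appear). Among the paired tags, only <page>, <revision>, and <text> differ, each by exactly one.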
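The check described above can be sketched in a few lines of Python. The counts are copied from the spectrum for a handful of tags; the function name `unclosed_counts` is made up for this illustration and is not part of the project code.

```python
# Open and close counts for a few tags, copied from the spectrum above.
OPEN = {"id": 34078, "title": 12347, "page": 12347, "revision": 12347, "text": 12345}
CLOSE = {"id": 34078, "title": 12347, "page": 12346, "revision": 12346, "text": 12344}

def unclosed_counts(open_counts, close_counts):
    """Return {tag: opens - closes} for every tag whose counts differ."""
    diffs = {}
    for name, opened in open_counts.items():
        closed = close_counts.get(name, 0)
        if closed > opened:
            # More closes than opens would mean broken nesting, not truncation.
            raise ValueError("tag %r closes more often than it opens" % name)
        if closed < opened:
            diffs[name] = opened - closed
    return diffs

print(unclosed_counts(OPEN, CLOSE))  # {'page': 1, 'revision': 1, 'text': 1}
```

A difference of one for each of these three tags is what one would expect if the file stops inside a single <text> element nested in a <revision> inside a <page>.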
Back-of-envelope math shows that just by maintaining the tag stack (the current implementation), we can expect to save about 600 KB over plaintext.
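One way to read "maintaining the stack" is that a close tag's name never needs to be stored, because the top of the stack already says which tag must close next. The sketch below is only an illustration of that idea, not the project's implementation: the function names, the `</>` placeholder, and the simplified tag regex are all assumptions, and the regex does not handle a `>` inside an attribute value.

```python
import re

# Group 1 is "/" for a close tag, group 2 is the tag name,
# group 3 is "/" for a self-closing tag such as <minor/>.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")
RESTORE = re.compile(r"</>|" + TAG.pattern)

def strip_close_names(xml):
    """Replace each </name> with </> when the stack already predicts name."""
    stack = []
    def repl(m):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
                return "</>"
            return m.group(0)      # unexpected close: leave it untouched
        if not self_closing:
            stack.append(name)     # only real open tags go on the stack
        return m.group(0)
    return TAG.sub(repl, xml)

def restore_close_names(stripped):
    """Inverse of strip_close_names: refill each </> from the stack."""
    stack = []
    def repl(m):
        if m.group(0) == "</>":
            return "</" + stack.pop() + ">"
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if not closing and not self_closing:
            stack.append(name)
        return m.group(0)
    return RESTORE.sub(repl, stripped)

sample = ('<page><title>A</title><revision><id>1</id><minor/>'
          '<text xml:space="preserve">hi</text></revision></page>')
packed = strip_close_names(sample)
print(packed)
print(len(sample) - len(packed), "bytes of close-tag names removed")
```

Because the stack is rebuilt from the open tags alone, the transform also round-trips on a truncated file: tags that never close simply stay on the stack.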
I estimate that at least as much again could be saved by maintaining a dictionary of common tags of length > 2.
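A minimal sketch of such a dictionary follows. It assumes that "length" means the length of the tag name and that each dictionary entry would be written as a one-byte code; both are guesses, and the helper names are hypothetical. The figure it computes is raw bytes replaced before any compression, so it is not the estimate quoted above.

```python
def build_tag_dictionary(spectrum, min_len=3):
    """Map tag names of at least min_len characters to small integer codes,
    most frequent first (ties broken alphabetically)."""
    names = [name for name in spectrum if len(name) >= min_len]
    names.sort(key=lambda name: (-spectrum[name], name))
    return {name: code for code, name in enumerate(names)}

def raw_bytes_replaced(spectrum, dictionary, code_size=1):
    """Raw bytes removed if every listed name shrinks to code_size bytes."""
    return sum(count * (len(name) - code_size)
               for name, count in spectrum.items() if name in dictionary)

# Open-tag counts for a few names, copied from the spectrum above.
sample_spectrum = {"id": 34078, "title": 12347, "timestamp": 12347,
                   "comment": 10043, "ip": 2963}
tag_codes = build_tag_dictionary(sample_spectrum)
print(tag_codes)
print(raw_bytes_replaced(sample_spectrum, tag_codes))
```

Short names such as id and ip are left out because replacing a two-character name with a code saves little or nothing.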
The tag with too many slashes is an artifact of the xmlns declarations on the <mediawiki> root tag, whose quoted attribute values contain slashes; it shouldn't affect results. Ignoring slashes inside quotes when changing states may be a to-do item.
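The to-do item could look roughly like the following sketch, which reduces a raw tag to its name plus structural slashes and skips anything inside quotes. The function name `tag_key`, the `respect_quotes` switch, and the example URL are hypothetical; the real scanner may track state differently.

```python
def tag_key(raw_tag, respect_quotes=True):
    """Reduce a raw tag like '<a href="x/y">' to its name plus structural slashes."""
    body = raw_tag[1:-1]          # drop the surrounding < and >
    key = []
    quote = None                  # the quote character we are inside, if any
    in_name = True                # still reading the tag name
    for ch in body:
        if respect_quotes and quote:
            if ch == quote:
                quote = None      # closing quote ends the attribute value
            continue              # everything inside quotes is ignored
        if respect_quotes and ch in "\"'":
            quote = ch
            in_name = False
        elif ch == "/":
            key.append(ch)        # slash seen outside quotes is kept
        elif ch.isspace():
            in_name = False       # attributes start after the first whitespace
        elif in_name:
            key.append(ch)
    return "<" + "".join(key) + ">"

# Hypothetical root tag with a slash-heavy namespace URL.
root = '<mediawiki xmlns="http://example.org/a/" version="0.3">'
print(tag_key(root, respect_quotes=False))  # <mediawiki////>
print(tag_key(root))                        # <mediawiki>
```

With quote handling switched off, the slashes from the URL leak into the key, which matches the kind of entry seen in the spectrum. With it on, close and self-closing tags such as </page> and <minor/> still keep their slashes.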

