$Id: INTERNALS,v 1.3 1997/04/10 19:53:54 dps Exp $

Here is how the program works.


reader.cc (1.10)

read_character reads characters from a Word document, suitably
translated, including distinguishing between multiple and single ^Gs,
etc.

The output is fetched by chunk_reader::read_chunk_raw, which assembles
it into pieces, ignoring inclusions. chunk_reader::read_chunk gets these
chunks and parcels them out with the inclusions separated out.

tok_seq::rd_token adds start and end tags for rows, fields, paragraphs
and all the rest, storing the tokens in a table on a separate queue
before transferring them all onto the main queue. tok_seq::rd_token also
keeps track of the size and detects the probable end of the table.

tok_seq::feed_token takes a token off the queue and requests a refill at the
appropriate time. At the end of the document it tests a flag and, if the flag
is not set, adds a document end entry (and then feeds it to the caller).
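
As a rough illustration of the refill-and-end-marker behaviour, here is a
minimal sketch. Token, FeedSketch, the "DOC_END" kind and the bool return
value are all invented for this example; the real tok_seq interface is
not shown in these notes and will differ.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type; the real structure in reader.cc is not shown here.
struct Token {
    std::string kind;  // e.g. "PARAGRAPH" or "DOC_END" (names assumed)
    std::string text;
};

class FeedSketch {
public:
    explicit FeedSketch(std::vector<Token> source)
        : source_(std::move(source)), next_(0), end_sent_(false) {}

    // Hands out the next token; returns false once nothing is left.
    bool feed_token(Token &out) {
        if (queue_.empty()) refill();
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    // Refill from the source, or add the document end entry exactly once.
    void refill() {
        if (next_ < source_.size()) {
            queue_.push_back(source_[next_++]);
        } else if (!end_sent_) {
            queue_.push_back(Token{"DOC_END", ""});
            end_sent_ = true;  // the flag tested at the end of the document
        }
    }

    std::vector<Token> source_;
    std::size_t next_;
    bool end_sent_;
    std::deque<Token> queue_;
};
```

What happens on calls after the document end entry is not described
above; returning false is just a choice made for this sketch.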

OK, so far? Now the fun begins!

If you look at the output now you see horrific stuff like
<PARAGRAPH>550 *<SPEC>eq \F(foo, bar)</SPEC><PARAGRAPH>= 42</PARAGRAPH>
so the input is further processed by tok_seq::math_collect().
math_collect() uses saved_tok as a one-token push-back mechanism and
will use this token before asking feed_token() for one. Non-paragraphs
and non-equations go straight through.
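
The push-back idea can be sketched as follows. PushBackSketch, next()
and stash() are hypothetical names chosen for the example, and the
upstream deque merely stands in for feed_token(); only the "use the
saved token first" rule is taken from the description above.

```cpp
#include <deque>
#include <string>
#include <utility>

// Hypothetical token type for illustration only.
struct Token {
    std::string kind;
    std::string text;
};

class PushBackSketch {
public:
    explicit PushBackSketch(std::deque<Token> upstream)
        : upstream_(std::move(upstream)), has_saved_(false) {}

    // Use the saved token if there is one, otherwise ask upstream.
    bool next(Token &out) {
        if (has_saved_) {
            out = saved_;
            has_saved_ = false;
            return true;
        }
        if (upstream_.empty()) return false;
        out = upstream_.front();
        upstream_.pop_front();
        return true;
    }

    // Push back exactly one token; a second stash overwrites the first.
    void stash(const Token &t) {
        saved_ = t;
        has_saved_ = true;
    }

private:
    std::deque<Token> upstream_;  // stands in for feed_token()
    Token saved_;                 // plays the role of saved_tok
    bool has_saved_;
};
```

A caller can peek at the token after a paragraph by calling next() and
then stash() on the result, which leaves the stream unchanged for the
following read.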

When math_collect sees a paragraph it peers at the next item. If this
is not an equation it just forwards the token and stashes the item it
got in saved_tok (saved_tok is definitely free: either it was used
or feed_token supplied something). If it sees an equation it calls
math_reverse_scan to work out whether there is any equation in the
string (guesswork, but it works quite nicely). If math_reverse_scan
decides it is all real text the token is just forwarded (with the
extra token still stashed in saved_tok).

Assuming math_reverse_scan found something to move, that material is
moved into the equation, and ntok and the current token are
modified. saved_tok still points to ntok, so we use the same
structure but new strings. The reduced paragraph token is returned.

-----

When the code sees an equation special (quite possibly saved_tok from
the paragraph process above) it asks feed_token() for the next two
tokens. The next token is the end token for the special and the one
after that is the interesting one, and will be called T (the token
itself is *ntok in the code).

If T is an equation the end spec token is junked and the two equations
joined. One of the equations is then junked. The end special is
pushed onto the start of the output for feed_token to find there;
saved_tok is pointed to the expanded equations. The code then returns
to the original read-a-token state so further aggregation can take
place.

If T is a paragraph then the code uses math_forward_scan to see how
much of that is consumed as part of the equation. If none, then the end
special and paragraph tokens are pushed onto the front of the output
queue and saved_tok is invalidated. The code then returns the current
(equation special) token. The end special passes straight through and
then the accumulation can begin again.

If T (a paragraph) is partially consumed, the current equation and T are
adjusted and then the processing is the same as if the paragraph had no
formula contents.

If T (a paragraph) is entirely consumed, its contents are added onto
the text, the paragraph is junked and the end spec is pushed back.
saved_tok is pointed to the current, expanded equation. The code then
returns to the original read-a-token state so further aggregation can
take place.
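
To make the aggregation a little more concrete, here is a sketch of the
simplest case only: T is itself an equation. It works on a whole vector
rather than on the queues and saved_tok, and the "SPEC" and "/SPEC" kind
names, Token and join_equations are assumptions made for the example.
The paragraph cases depend on math_forward_scan, whose heuristics are
not described here, so they are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type for illustration only.
struct Token {
    std::string kind;  // "SPEC" starts an equation, "/SPEC" ends it (assumed)
    std::string text;
};

// Joins runs of equation specials that are separated only by an end tag.
std::vector<Token> join_equations(std::vector<Token> in) {
    std::vector<Token> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool joinable = in[i].kind == "SPEC" && i + 2 < in.size() &&
                        in[i + 1].kind == "/SPEC" && in[i + 2].kind == "SPEC";
        if (joinable) {
            // Fold this equation into the next one and junk the end tag
            // that sat between them.
            in[i + 2].text = in[i].text + in[i + 2].text;
            ++i;  // skip the end tag; the loop increment then lands on
                  // the merged equation so aggregation can continue
            continue;
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Because the merged equation is examined again on the next iteration, a
chain of three or more equations collapses into a single one, which is
the effect of returning to the read-a-token state.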

The output now contains nice stuff like

<PARAGRAPH><SPEC> 550 * \F(foo,bar) = 29</SPEC></PARAGRAPH> and even
horrors that Word viewer renders as displayed equations like
<PARAGRAPH><SPEC> 550 * \F(foo,bar) = 29</SPEC><PARAGRAPH>.</PARAGRAPH>




This output is requested by tok_seq::eqn_rd_token(), which is an
internal method. It is not devoid of tricks, however. Anything other
than the start of a paragraph passes straight through.


When it sees a paragraph it pushes it onto a separate queue and
accumulates totals of the characters and specials it sees in it. The
loop exits when any of the following applies:

	The paragraph character total exceeds the (small, currently 3)
	threshold.

	The end of the paragraph is spotted.

	A non-special, non-paragraph, non-other character is seen (if this
	happens we add the threshold to the count to be sure of being >= to it).


On exit from the loop, if the total is less than the critical value the
queue is reversed and inserted at the front of the output queue, minus
the paragraph items. Since each token is inserted as the first
item of the output they appear in reverse order of insertion (hence
the reverse makes the elements appear in the original order on the output
queue). This deletes that extraneous and wrong full stop, for example.

Otherwise the elements are transferred to the front of the
output queue in the existing order (this actually just sets a couple
of pointers).

Either way the temporary queue is now empty and is deleted. The first
item dequeued is returned. (This is what rtest2 shows you).
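
The two exit paths can be sketched like this. The sketch assumes that
the paragraph tokens carry the paragraph text (which is why dropping
them removes the stray full stop); Token, kThreshold, the kind names
and flush_collected are invented for the example, and the real code
moves pointers rather than copying tokens.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical token type for illustration only.
struct Token {
    std::string kind;  // "PARAGRAPH", "/PARAGRAPH", "SPEC", ... (assumed)
    std::string text;
};

const std::size_t kThreshold = 3;  // "small, currently 3"

bool is_paragraph_item(const Token &t) {
    return t.kind == "PARAGRAPH" || t.kind == "/PARAGRAPH";
}

// Moves the collected tokens to the front of the output queue.
void flush_collected(const std::vector<Token> &collected,
                     std::deque<Token> &out) {
    std::size_t chars = 0;
    for (std::size_t i = 0; i < collected.size(); ++i)
        if (is_paragraph_item(collected[i])) chars += collected[i].text.size();

    if (chars < kThreshold) {
        // Walk the collection backwards: each push_front puts a token
        // ahead of the ones already pushed, so the original order is
        // restored. Paragraph items are dropped on the way.
        for (auto it = collected.rbegin(); it != collected.rend(); ++it)
            if (!is_paragraph_item(*it)) out.push_front(*it);
    } else {
        // Long enough to be a real paragraph: keep everything, in order.
        out.insert(out.begin(), collected.begin(), collected.end());
    }
}
```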


-----

The output of eqn_rd_token is fed to the list-handling guesswork. A
list is started by a paragraph starting with a number 1 (enumerate) or
a bullet character (itemize). If a paragraph does not fit then it is
checked for a list start; if so it is assumed to be a sublist. The
end of the list is signalled by MAX_ITEM_SEP_PARS paragraphs which
cannot be part of a list (default value is 5). Only the last paragraph
with the appropriate lead-in text is included. Since this involves
delaying tokens and looking ahead the easiest place to do this is in the
reader.
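
A very rough sketch of the two guesses involved: spotting a list start
and counting paragraphs that do not fit. Several details are assumptions
made for the example and are not taken from reader.cc: the bullet is
taken to be '*', a leading "1" only counts when no further digit
follows, a fitting paragraph resets the count, and the list ends when
the count reaches MAX_ITEM_SEP_PARS. ListKind, list_start and
ListEndCounter are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class ListKind { None, Enumerate, Itemize };

const int MAX_ITEM_SEP_PARS = 5;  // default value quoted in the text
const char kBullet = '*';         // assumed bullet character

// Guess whether a paragraph starts a list.
ListKind list_start(const std::string &para) {
    std::size_t i = para.find_first_not_of(" \t");
    if (i == std::string::npos) return ListKind::None;
    if (para[i] == kBullet) return ListKind::Itemize;
    // Accept "1" but not "10", "12", ... (an assumption of this sketch).
    bool lone_one =
        para[i] == '1' &&
        (i + 1 == para.size() ||
         !std::isdigit(static_cast<unsigned char>(para[i + 1])));
    return lone_one ? ListKind::Enumerate : ListKind::None;
}

// Counts consecutive paragraphs that cannot be part of the list.
struct ListEndCounter {
    int misses;
    ListEndCounter() : misses(0) {}

    // Returns true once enough non-list paragraphs have been seen.
    bool see(bool fits_list) {
        misses = fits_list ? 0 : misses + 1;
        return misses >= MAX_ITEM_SEP_PARS;
    }
};
```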

Only the bottom level list is actually ended by enough non-list
paragraphs; the list is closed and the output is fed back in, giving
them the chance to kill the list the next level up.


The list queues are completely separate from anything eqn_rd_token
uses: each list builds the tokens in items. Note that the first item
includes the list start to make it easy to recover the original text
if required. When a level of list is popped these tokens are pointed
to by recycled, which is initially NULL.

If the list only has one item the list is transformed back into its
(listless) tokens, as received from eqn_rd_token. The first token is
appended to the list below the one popped, or to the output queue if
there is no such list.

The main loop of read_token grabs values from recycled if it is not
null, setting it to NULL if it becomes empty. If recycled is NULL then
the code asks eqn_rd_token for a token. As with eqn_rd_token the loop
waits for its own output queue (outqueue) to have something in it and
returns the first item in this queue.
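
The shape of that loop, with the list guesswork itself left out, might
look like the sketch below. ReadLoopSketch and Token are hypothetical;
an empty recycled deque plays the role of the NULL pointer, and the
upstream deque stands in for eqn_rd_token.

```cpp
#include <deque>
#include <string>

// Hypothetical token type for illustration only.
struct Token {
    std::string kind;
    std::string text;
};

struct ReadLoopSketch {
    std::deque<Token> recycled;  // empty here corresponds to NULL
    std::deque<Token> upstream;  // stands in for eqn_rd_token
    std::deque<Token> outqueue;

    // Loops until outqueue has something in it, then returns its first item.
    bool read_token(Token &out) {
        while (outqueue.empty()) {
            // Recycled tokens take priority over fresh ones.
            std::deque<Token> &src = recycled.empty() ? upstream : recycled;
            if (src.empty()) return false;
            // The real loop applies the list handling here; this sketch
            // simply forwards the token.
            outqueue.push_back(src.front());
            src.pop_front();
        }
        out = outqueue.front();
        outqueue.pop_front();
        return true;
    }
};
```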


The overall effect of the layered intelligence described above is that
the code in reader.cc is too complicated for comfort. The overall
performance is nice though...


----------------------------------------------------------------------

Oh, yes, and the *TeX output format also uses context cues. There is a
minimal amount of context cue usage in the ASCII format. Overall this
program tends towards my idea of a complex AI program using context
cues to do the right stuff with what Word throws at it!!

I hope this is now 100% clear.