% !Mode:: "TeX:DE:UTF-8:Main"
\PassOptionsToPackage{check-declarations,enable-debug}{expl3}
% Note on the compilation of the documentation:
% The documentation uses for the tagging sometimes code
% that is under development and/or not public yet.
% To compile an *untagged* documentation, comment the line with
% the testphase keys in the following \DocumentMetadata command.
\DocumentMetadata
  {
    % comment the following line to compile an untagged documentation:
    testphase={phase-III,title,table},
    pdfversion=2.0,lang=en-UK,pdfstandard=a-4,pdfstandard=ua-2
    %uncompress
  }
\DebugBlocksOff
\makeatletter
\def\UlrikeFischer@package@version{0.99n}
\def\UlrikeFischer@package@date{2025-02-23}
\makeatother
\documentclass[bibliography=totoc,a4paper]{article}
\usepackage{geometry}
\usepackage[english]{babel}
\usepackage{unicode-math}
\setmainfont{Heuristica}
\usepackage[nopatch]{microtype}
\usepackage[autostyle]{csquotes}
\usepackage[style=numeric]{biblatex}
\addbibresource{tagpdf.bib}
\reversemarginpar
\NewDocumentCommand\sidenote{m}{\marginpar{#1}}
\usepackage{booktabs}
\setlength\belowcaptionskip{10pt}
\usepackage{tcolorbox}
\usepackage{tikz}
\usetikzlibrary{positioning}
\usetikzlibrary{fit,tikzmark}
\usetikzlibrary{arrows.meta}
\tikzset{arg/.style = {font=\footnotesize\ttfamily, anchor=base,draw,
  rounded corners,node distance=2mm and 2mm}}
\tikzset{operator/.style = {font=\footnotesize\ttfamily, anchor=base,draw,
  rounded corners,node distance=4mm and 4mm}}
\usepackage{listings}
\lstset{basicstyle=\ttfamily, columns=fullflexible,language=[LaTeX]TeX,
  escapechar=*,
  commentstyle=\color{green!50!black}\bfseries}
% this allow to get real spaces in the code parts.
% This should perhaps be combined in a new listings key
\lstset{showspaces}
\makeatletter
\def\lst@visiblespace{\lst@ttfamily{\char32}{\char32}}\makeatother
\tagpdfsetup{tabsorder=structure}
\usepackage[pdfdisplaydoctitle=true]{hyperref}
\hypersetup{
  pdftitle={The tagpdf package, v\csname UlrikeFischer@package@version\endcsname},
  pdfauthor=Ulrike Fischer,
  colorlinks}
\tcbuselibrary{documentation}
\definecolor{Definition}{rgb}{0,0.2,0.6}
\newcommand\PrintKeyName[1]{\textsf{#1}}
\newcommand\pkg[1]{\texttt{#1}}
\newcommand\DescribeKey[1]{\texttt{#1}}
%tagging patches:
\usepackage{tagpdfdocu-patches}
\newcommand\PDF{PDF}
\title{The \pkg{tagpdf} package, v\csname UlrikeFischer@package@version\endcsname}
\date{\csname UlrikeFischer@package@date\endcsname}
\author{Ulrike Fischer\thanks{fischer@troubleshooting-tex.de}}
\usepackage{shortvrb}
\MakeShortVerb|
\begin{document}
\maketitle

\begin{tcolorbox}[colframe=red]
This package is not meant for direct use in (normal) documents.
It started in 2018 as a support tool to \emph{research} tagging.
It is now the base of the code developed in the \pkg{latex-lab} bundle
for the Tagged PDF project (i.e., loaded by that code)
\url{https://www.latex-project.org/publications/indexbytopic/pdf/}.

The package is developed and improved in parallel with the code in the
\pkg{latex-lab} bundle (part of the core \LaTeX{} distribution),
the \pkg{pdfmanagement-testphase} package (the \LaTeX{} PDF management bundle)
and the L3 programming layer (part of the \LaTeX{} format).
That means you must ensure that all these components are up-to-date
and in sync with each other.

This package quite probably still contains some bugs.
It is in some parts quite slow because the code currently prefers
readability over speed.
At some point in the future its code will be integrated into the \LaTeX{}
format and then this package will disappear.
Because of its function as a research and development tool it is important
to understand that this package can still change in incompatible ways
from one version to the next.
You need some knowledge about \TeX, \PDF{} and perhaps even lua to use it.

\medskip
Issues, comments and suggestions can be added as issues to one of these two
GitHub trackers:

\medskip
\centering
\url{https://github.com/latex3/tagging-project}\par
\leavevmode\llap{or\qquad\qquad}
\url{https://github.com/latex3/tagpdf}
\end{tcolorbox}

\tagtool{sec-add-grouping=false}
\tableofcontents
\tagtool{sec-add-grouping}

\section{Introduction}

For many years the creation of accessible, tagged \PDF{}-files with \LaTeX\
that conform to the PDF/UA standard has been on the agenda of \TeX-meetings.
Many people agree that this is important and Ross Moore has done quite some
work on it. There is also a TUG-mailing list and a web page
\parencite{tugaccess} dedicated to this topic.

What was missing, in my opinion, were means to \emph{experiment} with tagging
and accessibility: means to try out how difficult it is to tag some structures,
means to try out how much tagging is really needed
(standards and validators don't need to be right \ldots),
means to test what else is needed so that a \PDF{} works e.g. with a screen
reader, means to try out how core \LaTeX\ commands behave if tagging is used.
Without such experiments it is in my opinion quite difficult to get a feeling
for what has to be done, which kernel changes are needed,
and how packages should be adapted.

This package was developed to close this gap by offering \emph{core} commands
to tag a \PDF{}\footnote{In case you don't know what this means: there will be
some explanations later on.}.
My hope was that the knowledge gained by the use of this package would in the
end allow us to decide if and how code to do tagging should become part of the
\LaTeX\ kernel.
The code has been written so that it can be added as a module to the \LaTeX{}
kernel itself if it turned out to be usable.
It therefore avoids patching commands from other packages.
It was also not an aim of the package to develop patches to directly enable
tagging in other packages.
While in the end changes to various commands in many classes and packages will
be needed to automatically get tagged \PDF{} files, these changes should be
done by class, package and document writers themselves using a sensible API
provided by the kernel and not by some external package that adds patches
everywhere and would need constant maintenance -- one only needs to look at
packages like \pkg{tex4ht} or \pkg{bidi} or \pkg{hyperref} to see how difficult
and sometimes fragile this is.

The package is now a part of the Tagged PDF project and has already triggered
various changes in the \LaTeX\ kernel and the engines:
there is a new PDF management;
the new para hooks allow paragraphs to be tagged automatically;
after changes in the output routine, page breaks, headers and footers are
handled correctly;
and the engines now support structure destinations.
More changes are in the latex-lab bundle and can be loaded through
\texttt{testphase} keys.

I'm sure that tagpdf still has bugs.
Bug reports, suggestions and comments can be added to the issue tracker on
GitHub, either \url{https://github.com/latex3/tagpdf} or
\url{https://github.com/latex3/tagging-project}.

Please also check the GitHub site and latex-lab for new examples and
improvements.

\subsection{Tagging and accessibility}

While the package is named \pkg{tagpdf} the goal is also \emph{accessible}
\PDF{}-files. Tagging is \emph{one} (the most difficult) requirement for
accessibility but there are others.
I will mention some later on in this documentation,
and -- if sensible -- also add code, keys or tips for them.

So the name of the package is a bit wrong.
As an excuse I can only say that it is short and easy to pronounce
(and of course, it was always meant to be temporary).

\subsection{Engines and modes}

Theoretically, the package works with all engines, but the xelatex and the
latex-dvips routes are basically untested and they also don't support real
space glyphs, so I don't recommend them.
lualatex is the most powerful and safest option and should be used for new
documents; it is slower than pdflatex but requires fewer compilations.
pdflatex works ok and can be used for legacy documents; it needs more
compilations to resolve all cross references needed for the tagging.

The package has two modes: the \emph{generic mode}, which should work in theory
with every engine, and the \emph{lua mode}, which works only with lualatex and
(since version 0.98k) with dvilualatex.
Since version 0.99m, the lua mode is forced if luatex is detected,
otherwise generic mode is used.

I implemented the generic mode first, mostly because my \TeX\ skills are much
better than my lua skills and I wanted to get the \TeX\ side right before
starting to fight with attributes and node traversing.

While the generic mode is not bad and I spent quite some time to get it working,
I nevertheless think that the lua mode is the future and the only one that will
be usable for larger documents.
\PDF{} is a page orientated format and so the ability of luatex to manipulate
pages and nodes after the \TeX-processing has finished is really useful here.
Also with luatex characters are normally already given as Unicode.

The package uses quite a lot of labels (in generic mode more than in lua mode).
It is now based on the property module of the \LaTeX{} kernel.
This module provides expandable references but the drawback is that (right now)
they don't always give good rerun messages if they have changed.
I advise using the \pkg{rerunfilecheck} package as an intermediate work-around
and, when using pdflatex, compiling at least once or twice more often than
normal.
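
The work-around with \pkg{rerunfilecheck} mentioned above can be sketched as follows.
This is only a minimal illustration, not a tested recipe:
it assumes that the \texttt{mainaux} option of \pkg{rerunfilecheck},
which compares checksums of the main \texttt{.aux} file between runs,
is enough to notice changed labels;
the \texttt{testphase} and \texttt{pdfversion} values are merely examples.

\begin{taglstlisting}
\DocumentMetadata{testphase = phase-III, pdfversion = 2.0}
\documentclass{article}
% assumption: a changed main aux file indicates changed labels,
% so a rerun warning is given even if the kernel does not issue one
\usepackage[mainaux]{rerunfilecheck}
\begin{document}
\section{Introduction}
Some text.
\end{document}
\end{taglstlisting}

If a rerun warning shows up in the log, the document should be compiled again
until the warning disappears.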

\subsection{References and target PDF version}

My main reference for the first versions of this package was the free
reference for \PDF{} 1.7 \parencite{pdfreference}, and so they implemented
only support for \PDF{} 1.7.

In 2017 \PDF{} 2.0 was released.
The reference can now be obtained at no cost through the PDF Association.
\PDF{} 2.0 has a number of features that are really needed for good tagging:
it knows more structure types,
it allows associated files to be added to structures (these are small, embedded
files that can, for example, contain the MathML or source code of an equation),
it knows structure destinations, which allow linking to a structure,
and it knows the MathML namespace.

\LaTeX{} therefore targets \PDF{} 2.0 and tagpdf has support for associated
files, for name spaces and other \PDF{} 2.0 features.

\PDF{}~2.0 features are currently (beginning of 2025) still not well supported
by \PDF{} consumers, but some progress has been made.
Foxit can handle MathML associated files and to some extent MathML structure
elements, and together with development versions of NVDA and MathCat reading of
equations is already quite good.
The PDF Accessibility Checker (PAC) no longer crashes if one tries to load a
\PDF{} 2.0 file.

We recommend using \PDF{} 2.0 if possible and then complaining to the \PDF{}
consumer if something doesn't work.

The package doesn't try to suppress all 2.0 features if an older \PDF{} version
is produced.
It normally doesn't harm if a \PDF{} contains keys unknown in its version,
and it makes the code faster and easier to maintain if there aren't too many
tests and code paths; so for example associated files will always be added.
But tests could be added in case this leads to incompatibilities.

\subsection{Validation}

\PDF{}s created with the commands of this package must be validated:

\begin{itemize}
\item One must check that the \PDF{} is \emph{syntactically} correct.
It is rather easy to create a broken \PDF{}: e.g.
if a chunk is opened on one page but closed on the next page,
or if the document isn't compiled often enough.
\item One must check how well the PDF follows the requirements of standards
like PDF/UA \emph{formally}\footnote{The PDF/UA-2 standard for \PDF~2.0 was
released in 2024.}.
\item One must check how good the accessibility is \emph{practically}.
\end{itemize}

Syntax validation and formal standard validation can be done with the validator
veraPDF \parencite{verapdf}, which can also handle PDF 2.0 files.
Other options (only for PDF 1.7 and older) are the preflight of the (non-free)
Adobe Acrobat and the free \PDF{} Accessibility Checker (PAC~2024)
\parencite{pac2024}.
A quite useful tool is \enquote{Next Generation PDF} \parencite{ngpdf},
a browser application which converts a tagged PDF to html and allows one to
inspect and also to edit its structure.
For PDF~2.0 files there is also a checker based on the Arlington model from
veraPDF.
A tool developed by the \LaTeX{} team allows one to extract the structure as
XML and to validate it against a schema.
This can be tested at \url{https://texlive.net/showtags}.

Practical validation is naturally the more complicated part.
It needs screen readers and users who actually know how to handle them,
can test documents and can report where a \PDF{} has real accessibility
problems.

\minisec{Preflight woes}

Sadly validators cannot always be trusted.
As an example, for a reason that I don't understand the Adobe preflight doesn't
like the list structure \texttt{L}.
It is also possible that validators contradict each other:
one says everything is okay, while the other complains.
Generally, when in doubt, I recommend using and trusting veraPDF.

\subsection{Examples wanted!}

To make the package usable, examples are needed:
examples that demonstrate how various structures can be tagged and which
patches are needed, examples for the test suite,
examples that demonstrate problems.
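
A contributed example could start from a skeleton like the following.
It is only a sketch: the \texttt{testphase} value and the document body are
placeholders chosen for illustration,
and the \PrintKeyName{uncompress} key is added under the assumption that the
example should produce a \PDF{} whose objects can be read in a text editor.

\begin{taglstlisting}
\DocumentMetadata
  {
    testphase  = phase-III, % placeholder: use the phase the example needs
    pdfversion = 2.0,
    uncompress              % keep the PDF objects readable
  }
\documentclass{article}
\begin{document}
\section{A heading}
Some text that shows the structure or the problem.
\end{document}
\end{taglstlisting}

Keeping such an example as small as possible makes it easier to see which
structure is created and where a problem comes from.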

\begin{tcolorbox}
Feedback, contributions and corrections are welcome!
\end{tcolorbox}

All examples should use the \cs{DocumentMetadata} key
\PrintKeyName{uncompress} so that uncompressed \PDF{} files are created and the
internal objects and structures can be inspected and compared by the l3build
checks.%

\subsection{Proof of concept: the tagging of the documentation itself}

Starting with version 0.6 the documentation itself has been tagged.
The tagging was (and is) in no way perfect.
The validator from Adobe didn't complain, but PAC~3 wanted alternative text for
all links (no idea why) and so I put simple text like \enquote{link} and
\enquote{ref} everywhere.
The links to footnotes gave warnings, so I disabled them.
I used types from \PDF{} version 1.7, mostly as I had no idea what should
be used for code in 2.0.
Margin notes were simply wrong and there were tagging commands everywhere
\ldots

The tagging has been improved and automated over time in sync with improvements
and new features in the \LaTeX\ kernel, the latex-lab bundle and the \PDF\
management code and is now much better.
Only a few structures (mostly some from currently unsupported packages)
still need manual tagging.

But sadly the output of the validators doesn't quite reflect the improvements.
The documentation now uses \PDF~2.0 and while the newest PAC~2024 can at least
open the file, it cannot validate the file properly.
For example it complains about the tabular header cells as it doesn't follow
attribute classes.
The Adobe validator has a bug and doesn't like the (valid) use of the
\texttt{Lbl} tag for the section numbers (see figure~\ref{fig:adobe}).

But even if the documentation passed all the tests of the validators:
as mentioned above, passing a formal test doesn't mean that the content is
really good and usable.

The user commands used for the tagging and also some of the patches used are
still rather crude.
So there is a lot of room for improvement.

\begin{tcolorbox}[]
Be aware that to create the tagged version a current lualatex-dev and a current
version of the pdfmanagement-testphase package are needed.
\end{tcolorbox}

\includegraphics[alt=PAC 2024 complains about PDF version]{pac2024-version}

\includegraphics[alt=PAC 2024 complains about table header cells]{pac2024-report}

\begin{figure}
\includegraphics[alt={Screenshot of Adobe report}]{acrobat}
\caption{Adobe Acrobat complaining about the \texttt{Lbl} use}\label{fig:adobe}\par
\end{figure}

\section{Loading}

The package requires the new PDF management.
With a current \LaTeX{} (2022-06-01 or newer) the PDF management is loaded if
you use the \cs{DocumentMetadata} command before \cs{documentclass}.
The \pkg{tagpdf} package can then be loaded and activated by using the
\texttt{testphase} key.
The exact behavior of the \texttt{testphase} key is documented in
\texttt{documentmetadata-support-doc.pdf}, which is part of the
\pkg{latex-lab} bundle.

Various parts of the code differentiate between \PDF{} version 2.0 and lower
versions.
If \PDF{} 2.0 is wanted, the version must be set early in the
\cs{DocumentMetadata} command so that \pkg{tagpdf} can pick up the correct
code path.

\begin{taglstlisting}
\DocumentMetadata
  {
  % testphase = phase-I,   % tagging without paragraph tagging
  % testphase = phase-II,  % tagging with paragraph tagging
    testphase = phase-III, % tagging with paragraph sec, toc, blocks and more
    pdfversion = 2.0,      % pdfversion must be set here.
    pdfstandard=ua-2,      % pdfstandard can be set too
  }
\documentclass{article}
\begin{document}
some text
\end{document}
\end{taglstlisting}

\minisec{Deactivation}

When loading \pkg{tagpdf} through the \texttt{testphase} keys,
it is automatically activated.
To deactivate it while still retaining all the other new code from the
latex-lab testphase files, use |\tagpdfsetup{activate/all=false}| in the
preamble.
You can additionally deactivate the paratagging and the interword space code.
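
Such a deactivation could look like the following sketch.
The key names are the ones described in section~\ref{ssec:setup};
that \PrintKeyName{para/tagging} and \PrintKeyName{activate/spaces} can be
switched off together with \PrintKeyName{activate/all} in this way is an
assumption based on their description as Boolean keys,
and the \texttt{testphase} value is only an example.

\begin{taglstlisting}
\DocumentMetadata{testphase = phase-III, pdfversion = 2.0}
\documentclass{article}
\tagpdfsetup
  {
    activate/all    = false, % deactivate the tagging
    para/tagging    = false, % assumption: also switch off the paratagging
    activate/spaces = false  % assumption: also switch off the space glyphs
  }
\begin{document}
some text
\end{document}
\end{taglstlisting}
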
To suppress the loading of the package altogether you can try

\begin{taglstlisting}
\makeatletter
\disable@package@load{tagpdf}{}
\makeatother
\DocumentMetadata{...}
\end{taglstlisting}

\minisec{Loading as package needs activation!}

It is not recommended anymore, but the package can also be loaded normally with
|\usepackage| (it is still required to use \cs{DocumentMetadata} to load the
\PDF\ management).
It will then -- apart from loading more packages and defining a lot of
things -- not do much.
You will have to \emph{activate} it with \verb+\tagpdfsetup+.

The \PDF\ management loaded with \cs{DocumentMetadata} will in any case load
\pkg{tagpdf-base}, a small package that provides no-op versions of the main
tagging commands.

Most commands do nothing if tagging is not activated, but in case a test is
needed a command (with the usual p,T,F variants) is provided:

\begin{docCommand}{tag_if_active:TF}{}\end{docCommand}
The check is true only if \emph{everything} is activated.
In all other cases (including if tagging has been stopped locally)
it will be false.

\subsection{Modes and package options}
%TODO think about tagging of the keys. Aside? Header?
The package has two different modes:
The \textbf{generic mode} works (in theory, currently only fully tested with
pdflatex) probably with all engines,
the \textbf{lua mode} only with lualatex.
The differences between both modes will be described later.
Starting with version 0.99m the mode is set automatically
(lua mode for luatex, generic mode otherwise).
The package options do nothing anymore and will be removed in future versions.

\subsection{Setup and activation}\label{ssec:setup}

\begin{docCommand}{tagpdfsetup}{\marg{key-val-list}}\end{docCommand}
This command sets up the general behavior of the package.
The command should normally be used only in the preamble
(for a few keys it could also make sense to change them in the document).

The key-val list understands at least the following keys.
More keys are defined in some of the latex-lab modules,
see table~\ref{tab:setupkey} for an overview which also includes older,
now deprecated names.

\begin{table}
\caption{Overview over keys for \cs{tagpdfsetup}}\label{tab:setupkey}
\input{tagpdfsetup-keys}
\end{table}

\begin{description}
\item[\PrintKeyName{activate/all}] Boolean, initially false.
Activates everything, that's normally the sensible thing to do.

\item [\PrintKeyName{activate}] Like |activate/all|;
\emph{additionally} it opens a structure with |\tagstructbegin| at begin
document and closes it at end document.
The key accepts as value a tag name which is used as the tag of the structure.
The default value is |Document|.

\item[\PrintKeyName{activate/mc}] Boolean, initially false.
Activates the code related to marked content.

\item[\PrintKeyName{activate/struct}] Boolean, initially false.
Activates the code related to structures.
Should be used only if \PrintKeyName{activate/mc} has been used too.

\item[\PrintKeyName{activate/struct-dest}] Boolean, initially true.
Starting with version 0.93 \pkg{tagpdf} will automatically create structure
destinations (see section~\ref{sec:struct-dest}) if \pkg{hyperref} is used and
if the engine supports it.
With this key this can be suppressed.

\item[\PrintKeyName{activate/tree}] Boolean, initially false.
Activates the code related to trees.
Should be used only if the two other keys have been used too.

\item[\PrintKeyName{activate/spaces}] Boolean.
The key activates/deactivates the insertion of space glyphs,
see section~\ref{sec:spacechars}.
In the lua mode it only works if at least \PrintKeyName{activate/mc} has been
used.
The old name of the key, |interwordspace|, is still supported but deprecated.

\item[\PrintKeyName{activate/softhyphen}] Boolean. Lua mode only.
The key activates/deactivates the replacing of hard hyphens from hyphenation
by soft hyphens.
By default this is activated.

\item[\PrintKeyName{role/new-tag}] Allows one to define new tag names,
see section~\ref{sec:new-tag} for a description.

\item[\PrintKeyName{role/new-attribute}] This key takes two arguments and
declares an attribute. See section~\ref{sec:attributes}.

\item[\PrintKeyName{role/map-tags}] This key allows one to remap the structure
tags.
Currently it supports only two values: |false| (the default) and |pdf|, which
maps all tags to their standard PDF role, e.g. |itemize| will be mapped to |L|.

\item[\PrintKeyName{para/tagging}] Boolean.
This activates/deactivates the automatic tagging of paragraphs,
see section~\ref{sec:paratagging} for more background.
It uses the \texttt{para/begin} and \texttt{para/end} hooks.
As tagging support grows, conditions will be added, which means the code is
bound to change!
Paragraphs can appear in many unexpected places and the code can easily break,
so there is also a debug option to show where such paragraphs are
(the \texttt{para} keyword of \PrintKeyName{debug/show}).

\item[\PrintKeyName{para/tag}] String.
This key changes the second tag used by the paratagging code.
The default tag is \texttt{text}, a \LaTeX{} specific tag that is role mapped
to \texttt{P}.
A useful local setting here can be \texttt{NonStruct},
which creates a structure \enquote{without meaning}.
For local changes it is recommended to use the newer \cs{tagtool} command
described below instead of \cs{tagpdfsetup}.

\item[\PrintKeyName{para/maintag}] String.
This key changes the first tag used by the paratagging code.
The default tag is \texttt{text-unit}, a \LaTeX{} specific tag that is role
mapped to \texttt{Part}.
For local changes it is recommended to use the newer \cs{tagtool} command
described below instead of \cs{tagpdfsetup}.

\item[\PrintKeyName{page/tabsorder}] Choice key, possible values are
\PrintKeyName{row}, \PrintKeyName{column}, \PrintKeyName{structure},
\PrintKeyName{none}.
This decides if a \verb+/Tabs+ value is written to the dictionary of the page
objects.
It is not really needed for tagging itself, but it is one of the things you
probably need for accessibility checks. So I added it.
Currently the tabsorder is the same for all pages.
Perhaps this should be changed \ldots.

\item[\PrintKeyName{activate/tagunmarked}] Boolean,\sidenote{luamode}
initially true.
When this boolean is true, the lua code will try to mark everything that has
not been marked yet as an artifact.
The benefit is that one doesn't have to mark up every deco rule oneself.
The danger is that it perhaps marks things that shouldn't be marked -- it
hasn't been tested yet with complicated documents containing annotations etc.
See also section~\ref{sec:lazy} for a discussion about automatic tagging.

\item[\PrintKeyName{viewer/startstructure}] A structure number.
If an \texttt{OpenAction} is set in the PDF Catalog (which is normally the case
if hyperref is used) a structure destination pointing to the structure is
added.
The initial value is structure 1 (the \texttt{Document} structure),
the default value is the current structure.
The key can be used more than once, the last setting will win.

\item[\PrintKeyName{debug/uncompress}] Sets both the \PDF{} compresslevel and
the \PDF{} objcompresslevel to 0 and so allows one to inspect the \PDF{}.
Not really useful anymore as this can also be set in \cs{DocumentMetadata}.

\item[\PrintKeyName{debug}] This key knows a number of sub-keys to set various
debug options.

\begin{description}
\item[\PrintKeyName{debug/show}] This takes a comma list of keywords:

\texttt{spaces}/\texttt{spacesOff}: \sidenote{luamode}
That helps in lua mode to see where space glyphs will be inserted if
\PrintKeyName{activate/spaces} is activated.
This can also be activated with the now deprecated key |show-spaces|.

\texttt{para}/\texttt{paraOff}: This (locally) activates/deactivates small red
and green numbers in the places where the paratagging hook code is used.

\item[\PrintKeyName{debug/log}] Choice key, possible values
\PrintKeyName{none}, \PrintKeyName{v}, \PrintKeyName{vv}, \PrintKeyName{vvv},
\PrintKeyName{all}.
Sets the log level.
Changing the value currently affects mostly the lua mode:
\enquote{higher} values give more messages in the log.
The current levels and messages have been set up in a quite ad-hoc manner and
will need improvement.
\end{description}
\end{description}

\begin{docCommands}
  {
    {doc name=tagtool,doc parameter=\marg{key-val}},
    {doc name=tag_tool:n,doc parameter=\marg{key-val}}
  }
\end{docCommands}
The tagging of document elements requires a variety of small commands.
This command will unify them under a common interface.
This is work-in-progress and syntax and implementation can change!

While the argument looks like a key-val \emph{list} (and currently is actually
one), this should not be relied on.
Instead only one argument should be used, as the implementation will change to
improve the speed.
Currently the following arguments are supported:

\begin{description}
\item[\PrintKeyName{para/tagging}] Boolean.
It will replace the \cs{tagpdfparaOn} and \cs{tagpdfparaOff} commands.

\item[\PrintKeyName{para/maintag}] String.
It allows one to change the outer tag used in the following automatically
tagged paragraphs. The setting is local.

\item[\PrintKeyName{para/tag}] String.
It allows one to change the inner tag used in the following automatically
tagged paragraphs. The setting is local.

\item[\PrintKeyName{para/flattened}] Boolean.
If set it will suppress the outer structure in the automatic paratagging.
This should be applied to the start and end hook in the same way!
The setting is local.
\end{description}

\section{Tagging}

PDF is a page orientated graphic format.
It simply puts ink and glyphs at various coordinates on a page.
A simple stream of a page can look like this\footnote{The appendix contains
some remarks about the syntax of a \PDF{} file}:

\begin{taglstlisting}[columns=fixed]
stream
 BT
  /F27 14.3462 Tf                %select font
  89.291 746.742 Td              %move point
  [(1)-574(Intro)-32(duction)]TJ %print text
  /F24 10.9091 Tf                %select font
  0 -24.35 Td                    %move point
  [(Let's)-331(start)]TJ         %print text
  205.635 -605.688 Td            %move point
  [(1)]TJ                        %print text
 ET
endstream
\end{taglstlisting}

From this stream one can extract the characters and their placement on the page
but not their semantic meaning (the first line is actually a section heading,
the last the page number).
And while in the example the order is correct, there is actually no guarantee
that the stream contains the text in the order it should be read.

Tagging means to enrich the \PDF{} with information about the \emph{semantic}
meaning and the \emph{reading order}.
(Tagging can do more: one can also store all sorts of layout information like
font properties and indentation with tags.
But as I already wrote, this package concentrates on the part of tagging that
is needed to improve accessibility.)

\subsection{Three tasks}

To tag a \PDF{} three tasks must be carried out:

\begin{enumerate}
\item \textbf{The mark-content-task}:\sidenote{mc-task}
The document must add \enquote{labels} to the page stream which allow one to
identify and reference the various chunks of text and other content.
This is the most difficult part of tagging -- both for the document writer and
for the package code.
First, there can be quite many chunks as every one is a leaf node of the
structure and so often a rather small unit.
Second, the chunks must be defined page-wise -- and this is not easy when you
don't know where the page breaks are.
Also in a standard document a lot of text is created automatically,
e.g. the toc, references, citations, list numbers etc.,
and it is not always easy to mark them correctly.

\item \textbf{The structure-task}:\sidenote{struct-task}
The document must declare the structure.
This means marking the start and end of semantically connected portions of the
document (correctly nested as a tree).
This too means some work for the document writer, but less than for the
mc-task:
first, quite often the mc-task and the structure-task can be combined,
e.g. when you mark up a list number or a tabular cell or a section header;
second, one doesn't have to worry about page breaks, so quite often one can
patch standard environments to declare the structure.
On the other hand a number of structures end in \LaTeX\ only implicitly --
e.g. an item ends at the next item, so getting the \PDF{} structure right still
means that additional mark up must be added.

\item \textbf{The tree management}:\sidenote{tree-task}
At last the structure must be written into the \PDF{}.
For every structure an object of type \texttt{StructElem} must be created and
flushed with keys for the parents and the kids.
A parent tree must be created to get a reference from the mc-chunks to the
parent structure.
A role map must be written.
And a number of dictionary entries must be added.
All this is hopefully done automatically and correctly by the package \ldots.
\end{enumerate}

\begin{figure}[t!]
\begin{tcolorbox}[]
\minisec{Page stream with marked content}

\begin{tikzpicture}[baseline=(a.north),node distance=2pt,remember picture,
  alt={Illustration of page stream with marked content}]
\node(start){\ldots~\ldots~\ldots};
\node[draw,base right = of start](a) {mc-chunk 1};
\node[draw,base right = of a](b) {mc-chunk 2};
\node[draw,base right = of b](c) {mc-chunk 3};
\node[draw,base right = of c](d) {mc-chunk 4};
\node[base right = of d] {\ldots~\ldots};
\end{tikzpicture}

\minisec{Structure}

\newlength\ydistance\setlength\ydistance{-0.8cm}
\begin{tikzpicture}[remember picture,baseline=(root.north),
  alt={Illustration of structure}]
\node[draw,anchor=base west] (root) at (0,0) {Sect (start section)};
\node[draw,anchor=base west] at (0.3,\ydistance) {H (header section)};
\node[draw,anchor=base west](aref) at (0.6,2\ydistance){mc-chunk 1};
\node[draw,anchor=base west](bref) at (0.6,3\ydistance){mc-chunk 2};
\node[draw,anchor=base west] at (0.3,4\ydistance){/H (end header)};
\node[draw,anchor=base west] at (0.3,5\ydistance){P (start paragraph)};
\node[draw,anchor=base west](cref) at (0.6,6\ydistance){mc-chunk 3};
\node[draw,anchor=base west](dref) at (0.6,7\ydistance){mc-chunk 4};
\node[draw,anchor=base west] at (0.3,8\ydistance){/P (end paragraph)};
\node[draw,anchor=base west] at (0,9\ydistance){/Sect (end section)};
\end{tikzpicture}
\begin{tikzpicture}[remember picture, overlay]
\draw[->,red](aref)-|(a);
\draw[->,red](bref)-|(b);
\draw[->,red](cref)-|(c);
\draw[->,red](dref)-|(d);
\end{tikzpicture}
\end{tcolorbox}
\caption{Schematic description of the relation between marked content in the
page stream and the structure}
\end{figure}

\subsection{Task 1: Marking the chunks: the mark-content-step}

To be able to refer to parts of the text in the structure, the text in the page
stream must get \enquote{labels}.
In the \PDF{} reference they are called \enquote{marked content}.
The three main variants needed here are:

\begin{description}
\item[Artifacts] They are marked with a pair of keywords, \texttt{BMC} and
\texttt{EMC}, which surround the text.
\texttt{BMC} has a single prefix argument, the fixed tag name
\texttt{/Artifact}.
Artifacts should be used for irrelevant text and page content that should be
ignored in the structure.
Sadly it is often not possible to leave such text simply unmarked -- the
accessibility tests in Acrobat and other validators complain.

\begin{taglstlisting}
/Artifact BMC
 text to be marked
/EMC
\end{taglstlisting}

\item[Artifacts with a type] They are marked with a pair of keywords,
\texttt{BDC} and \texttt{EMC}, which surround the text.
\texttt{BDC} has two arguments: again the tag name \texttt{/Artifact} and a
following dictionary which allows one to specify the suppressed info.
Text in header and footer can e.g. be declared as pagination like this:

\begin{taglstlisting}
/Artifact <</Type /Pagination>> BDC
 text to be marked
/EMC
\end{taglstlisting}

\item[Content] Content is also marked with a pair of keywords, \texttt{BDC} and
\texttt{EMC}.
The first argument of \texttt{BDC} is a tag name which describes the structural
type of the text\footnote{There is quite some redundancy in the specification
here. The structural type is also set in the structure tree. One wonders if it
isn't enough to always use \texttt{/SPAN} here.}.
Examples are \texttt{/P} (paragraph), \texttt{/H2} (heading),
\texttt{/TD} (table cell).
The reference mentions a number of standard types but it is possible to add
more or to use different names.

In the second argument of \texttt{BDC} -- in the property dictionary -- more
data can be stored.
\emph{Required} is an \texttt{/MCID}-key which takes an integer as a value:

\begin{taglstlisting}
/H1 <</MCID 3>> BDC
 text to be marked
/EMC
\end{taglstlisting}

This integer is used to identify the chunk when building the structure tree.
The chunks are numbered by page starting with 0.
As the numbers are also used as an index in an array there shouldn't be \enquote{holes} in the numbering system. (It is perhaps possible to handle a numbering scheme not starting at 0 and having holes, but it will enlarge the \PDF{} as one would need dummy objects.) It is possible to add more entries to the property dictionary, e.g. a title, alternative text or a local language setting.
\end{description}

The needed markers can be added with low level code e.g. like this (in pdftex syntax):
\begin{taglstlisting}
\pdfliteral page {/H1 <</MCID 3>> BDC}%
text to be marked
\pdfliteral page {EMC}%
\end{taglstlisting}
This sounds easy. But there are quite a number of traps, mostly with pdfLaTeX:
\begin{enumerate}[beginpenalty=10000]
\item \PDF{} is a page oriented format. And this means that the start \texttt{BDC}/\texttt{BMC} and the corresponding end \texttt{EMC} must be on the same page. So marking e.g. a section title like in the following example won't always work as the literal before the section could end on the previous page:
\begin{taglstlisting}
\pdfliteral page {/H1 <</MCID 3>> BDC} %problem: possible pagebreak here
\section{mysection}
\pdfliteral page {EMC}%
\end{taglstlisting}
Using the literals \emph{inside} the section argument is better, but then one has to take care that they don't wander into the header and the toc.
\item Literals are \enquote{whatsit} nodes and can change spacing, page and line breaking. The literal \emph{behind} the section in the previous example could e.g. lead to a lonely section title at the end of the page.
\item The \texttt{/MCID} numbers must be unique on a page. So you can't use the literal in a saved box that you reuse in various places. This is e.\,g. a problem with \texttt{longtable} as it saves the table header and footer in a box.
\item The \texttt{/MCID}-chunks are leaf nodes in the structure tree, so they shouldn't be nested.
\item Often text in a document is created automatically or moved around: entries in the table of contents, index, bibliography and more. To mark these text chunks correctly one has to analyze the code creating such content to find suitable places to inject the literals.
\item There exist environments which process their content more than once -- examples are \texttt{align} and \texttt{tabularx}. So one has to check for duplicates and holes in the counting system.
\item \PDF{} is a page oriented format. And this means that the start and the end marker must be on the same page \ldots\ \emph{so what to do with normal paragraphs that split over pages??}. This question will be discussed in subsection~\ref{sec:splitpara}.
\end{enumerate}

\subsubsection{Generic mode versus lua mode in the mc-task}

While in generic mode the commands insert the literals directly and so have all the problems described above, the lua mode works quite differently: The tagging commands don't insert literals but set some (global) \emph{attributes} which are attached to all the following nodes. When the page is shipped out some lua code is called which wanders through the shipout box and injects the literals at the places where the attributes change. This means that quite a number of problems mentioned above are not relevant for the lua mode:
\begin{enumerate}
\item Page breaks between start and end of the marker are \emph{not} a problem. So you can mark a complete paragraph. If a page break occurs directly after a start marker or before an end marker this can lead to empty chunks in the \PDF{} and so bloat the \PDF{} a bit, but this is imho not really a problem (compared to the size increase by the rest of the tagging).
\item The commands don't insert literals directly and so affect line and page breaking much less.
\item The numbering of the MCIDs is done at shipout, so no label/ref system is needed.
\item The code can do some marking automatically.
Currently everything that has not been marked up by the document is marked as artifact.
\end{enumerate}

\subsubsection{Commands to mark content and chunks}

In generic mode\sidenote{Generic mode only} it is vital that the end command is executed on the same page as the begin command. So think carefully how to place them. For strategies on how to handle paragraphs that split over pages see subsection~\ref{sec:splitpara}.
\begin{docCommands}
 {
  {doc name=tagmcbegin,doc parameter={\marg{key-val-list}}},
  {doc name=tag_mc_begin:n,doc parameter={\marg{key-val-list}}}
 }
\end{docCommands}
These commands insert the start of the marked content code in the \PDF{}. They don't start a paragraph. \emph{They don't start a group}. Such markers should not be nested. The command will warn you if this happens. In the generic mode the commands insert literals. These are whatsits and so can affect spacing. In lua mode they set an attribute \emph{globally}. The key-val list understands the following keys:
\begin{description}
\item[\PrintKeyName{tag}] This key is optional. By default the tag name of the surrounding structure is used, which normally should be fine. But if needed the name can be set explicitly with this key. The value of the key is typically one of the standard types listed in section \ref{sec:new-tag} (without a slash at the beginning, this is added by the code). It is possible to set up new tags, see the same section. The value of the key is expanded, so it can be a command. The expansion is passed unchanged to the \PDF{}, so together with a starting slash it should give a valid \PDF{} name (some ascii with numbers like \texttt{H4} is fine).
\item[\PrintKeyName{artifact}] This will set up the marked content as an artifact. The key should be used for content that should be ignored.
The key can take one of the values \PrintKeyName{pagination}, \PrintKeyName{pagination/header}, \PrintKeyName{pagination/footer}, \PrintKeyName{layout}, \PrintKeyName{page}, \PrintKeyName{background} and \PrintKeyName{notype} (this is the default). Text in the header and footer should normally be marked with \PrintKeyName{artifact=pagination} or \PrintKeyName{pagination/header}, \PrintKeyName{pagination/footer} but simply artifact (as it is now done automatically) should be ok too. It is not quite clear if rules and other decorative graphical objects needs to be marked up as artifacts. Acrobat seems not to mind if not, but PAC~3 complained. The validators complain if some text is not marked up, but it is not quite clear if this is a serious problem. The\sidenote{lua mode} lua mode will mark up everything unmarked as \texttt{artifact=notype}. You can suppress this behavior by setting the tagpdfsetup key \texttt{activate/tagunmarked} to false. See section \ref{ssec:setup}. \item[\PrintKeyName{stash}] Normally marked content will be stored in the \enquote{current} structure. This may not be what you want. As an example you may perhaps want to put a marginnote behind or before the paragraph it is in the tex-code. With this boolean key the content is marked but not stored in the kid-key of the current structure. \item[\PrintKeyName{label}] This key sets a label by which you can call the marked content \emph{later} in another structure (if it has been stashed with the previous key). Internally the label name will start with \texttt{tagpdf-}. \item[\PrintKeyName{alt}] This key inserts an \texttt{/Alt} value in the property dictionary of the BDC operator. See section~\ref{sec:alt}. The value is handled as verbatim string, commands are not expanded but the value will be expanded first once (so works like the key \texttt{alttext-o} in previous versions which has been removed). If the value is empty, nothing will happen. 
That means that you can do something like in the following listing and it will insert \verb+\frac{a}{b}+ (hex encoded) in the \PDF{}. \begin{taglstlisting} \newcommand\myalttext{\frac{a}{b}} \tagmcbegin{tag=P,alt=\myalttext} \end{taglstlisting} \item[\PrintKeyName{actualtext}] This key inserts an \texttt{/ActualText} value in the property dictionary of the BDC operator. See section~\ref{sec:alt}. The value is handled as verbatim string, commands are not expanded but the value will be expanded first once (so works like the key \texttt{actualtext-o} in previous versions which has been removed). If the value is empty, nothing will happen. That means that you can do something like in the following listing and it will insert \verb+X+ (hex encoded) in the \PDF{}. \begin{taglstlisting} \newcommand\myactualtext{X} \tagmcbegin{tag=Span,actualtext=\myactualtext} \end{taglstlisting} According to the PDF reference, \texttt{/ActualText} should only be used on marked content sequence of type Span. This is not enforced by the code currently. There is also some discussion going on, if \texttt{/ActualText} can actually be used in a MC dictionary or if it should be in a separate BDC-operator. \item[\PrintKeyName{raw}] This key allows you to add more entries to the properties dictionary. The value must be correct, low-level \PDF{}. E.g. \verb+raw=/Alt (Hello)+ will insert an alternative Text. \end{description} \begin{docCommands} { {doc name=tagmcend}, {doc name=tag_mc_end:} } \end{docCommands} These commands insert the end code of the marked content. They don't end a group and it doesn't matter if they are in another group as the starting commands. In generic mode both commands check if there has been a begin marker and issue a warning if not. In luamode it is often possible to omit the command, as the effect of the begin command ends with a new \verb+\tagmcbegin+ anyway. 
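
As a minimal sketch of how the begin and end commands are typically paired, the following listing marks one paragraph and one decorative rule. The text, the rule and the surrounding \texttt{P} structure are placeholders chosen only for illustration; the keys are the ones described above.
\begin{taglstlisting}
\tagstructbegin{tag=P}
\tagmcbegin{tag=P}
Some paragraph text.
\tagmcend
\tagstructend
% decorative content that should be ignored in the structure
\tagmcbegin{artifact=layout}\rule{2cm}{1pt}\tagmcend
\end{taglstlisting}
In generic mode each begin and end pair in such code has to end up on the same page.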
\begin{docCommands}
 {
  {doc name=tagmcuse,doc parameter=\marg{label}},
  {doc name=tag_mc_use:n,doc parameter=\marg{label}}
 }
\end{docCommands}
These commands allow you to record a marked content that you stashed away into the current structure. Be aware that a marked content can be used only once -- the command will warn you if you try to use it a second time.
\begin{docCommands}
 {
  {doc name=tag_mc_end_push:},
  {doc name=tag_mc_begin_pop:n,doc parameter=\marg{key-val-list}}
 }\end{docCommands}
If there is an open mc chunk, the first command ends it and pushes its tag on a stack. If there is no open chunk, it puts $-1$ on the stack (for debugging). The second command removes a value from the stack. If it is different from $-1$ it opens a tag with it. The command is mainly meant to be used inside hooks and command definitions so there is only an expl3 version. Perhaps other content of the mc-dictionary (for example the Lang) needs to be saved on the stack too.
\begin{docCommands}
 {
  {doc name=tagmcifinTF,doc parameter=\marg{true code}\marg{false code}},
  {doc name=tag_mc_if_in:TF,doc parameter=\marg{true code}\marg{false code}}
 }\end{docCommands}
These commands check if a marked content is currently open and allow you to e.g. add the end marker if yes. In \emph{generic mode}, where marked content commands shouldn't be nested, it works with a global boolean. In \emph{lua mode} it tests if the mc-attribute is currently unset. You can't test the nesting level with it!
\begin{docCommand}{tag_mc_reset_box:N}{\marg{box}}\end{docCommand}
In lua mode this command will process the given box and reset all mc related attributes in the box to the current values. This means that if the box is used all its contents will be a kid of the current structure. This should (probably) only be used on boxes which don't contain tagging commands. See below section~\ref{sec:savebox} for more details.
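
The following sketch combines some of the commands described above. The label name \texttt{mynote}, the text and the chosen tags are only placeholders, and the example assumes that no other chunk is open when the stashed content is created.
\begin{taglstlisting}
% close a chunk only if one is currently open
\tagmcifinTF{\tagmcend}{}
% mark content, but do not record it in the current structure
\tagmcbegin{tag=P,stash,label=mynote}
Text of a note
\tagmcend
% later: record the stashed chunk in another structure
\tagstructbegin{tag=P}
\tagmcuse{mynote}
\tagstructend
\end{taglstlisting}
As a stashed chunk can be used only once, a second \verb+\tagmcuse{mynote}+ would trigger the warning mentioned above.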
\subsubsection{Retrieving data}
\label{sec:retrieve}

With more elaborate tagging the need arises to retrieve and store current data.
\begin{docCommand}{tag_get:n}{\marg{key word}}\end{docCommand}
This (expandable) command returns the values of some variables. Currently, the working key words are
\begin{itemize}
\item \verb+mc_tag+: the tag name of the current mc-chunk
\item \verb+struct_tag+: the tag name of the current structure
\item \verb+struct_id+: The ID of the current structure. This is a string and is returned including parentheses.
\item \verb+struct_num+: This returns a number and works also if only \pkg{tagpdf-base} has been loaded, but then doesn't give the same output: if \pkg{tagpdf} is loaded and tagging is active, \verb+struct_num+ gives the number of the currently active structure, so it reverts to the parent number if a structure is closed. If only \pkg{tagpdf-base} is loaded nesting of structures is not tracked and so the command gives back the number of the last structure that has been created. In luatex this number can also be retrieved with the lua function \verb+ltx.tag.get_struct_num()+.
\item \verb+struct_counter+: This returns a number and works also if only \pkg{tagpdf-base} has been loaded. It gives back the state of the absolute structure counter and so the number of the last structure that has been created. This can be used to detect if in a piece of code there are structure commands. Be aware that this is a \LaTeX{} counter and so is reset in some places. In luatex this number can also be retrieved with the lua function \verb+ltx.tag.get_struct_counter()+. The number of the next structure to be created (that is, \verb+ltx.tag.get_struct_counter()+ increased by one) can also be retrieved with the function \verb+ltx.tag.get_struct_num_next()+.
\item \verb+mc_counter+: This returns a number and works also if only \pkg{tagpdf-base} has been loaded.
It gives back the state of the absolute mc-counter and so number of the last mc-chunk that has been created. This can be used to detect if in a piece of code there are mc-commands. \end{itemize} \subsubsection{Luamode: global or not global -- that is the question}\label{sec:global-local} In\sidenote{lua mode} luamode the mc-commands set and unset an attribute to mark the nodes. One can view such an attribute like a font change or a color: they affect all following chars and glue nodes until stopped. From version 0.6 to 0.82 the attributes were set locally. This had the advantage that the attributes didn't spill over in area where they are not wanted like the header and footer or the background pictures. But it had the disadvantage that it was difficult for an inner structure to correctly interrupt the outer mc-chunk if it can't control the group level. For example this didn't work due to the grouping inserted by the user: \begin{taglstlisting} \tagstructbegin{tag=P} \tagmcbegin{tag=P} Start paragraph {% user grouping \tag_mc_end_push: \tagstructbegin{tag=Em} \tagmcbegin{tag=Em} \emph{Emphasized test} \tagmcend \tagstructend \tag_mc_begin_pop:n{} }% user grouping Continuation of paragraph \tagmcend \tagstructend \end{taglstlisting} The reading order was then wrong, and the \emph{emphasized text} moved in the structure at the end. So starting with version 0.9 this has been reverted. The attribute is now global again. This solves the \enquote{interruption} problem, but has its price: Material inserted by the output routine must be properly guarded. 
For example \begin{taglstlisting} \DocumentMetadata{uncompress} \documentclass{article} \pagestyle{headings} \begin{document} \sectionmark{HEADER} \AddToHook{shipout/background}{\put(5cm,-5cm){BACKGROUND}} \tagmcbegin{tag=P}Page 1\newpage Page 2\tagmcend \end{document} \end{taglstlisting} Here the header and the background code on the \emph{first} page will be marked up as paragraph and added as chunk to the document structure. The header and the background code on the \emph{second} page will be marked as artifact. The following figure shows how the tags looks like. \includegraphics[alt=Show tags of examples]{global-ex} It is therefore from now on important to correctly markup such code. Header and footer are now marked as artifacts (see below). If they contain code which needs a different markup it still must be added explicitly. With packages like \pkg{fancyhdr} or \pkg{scrlayer-scrpage} it is quite easy to add the needed code. \subsubsection{Tips} \begin{itemize} \item Mark commands inside floats should work fine (but need perhaps some compilation rounds in generic mode). \item In case you want to use it inside a \verb+\savebox+ (or some command that saves the text internally in a box): If the box is used directly, there is probably no problem. If the use is later, stash the marked content and add the needed \verb+\tagmcuse+ directly before or after the box when you use it. \item Don't use a saved box with markers twice. \item If boxes are unboxed you will have to analyze the \PDF{} to check if everything is ok. \item If you use complicated structures and commands (breakable boxes like the one from \pkg{tcolorbox}, \pkg{multicol}, many footnotes) you will have to check the \PDF{}. \end{itemize} \begin{figure} \input{link-figure-input} \caption{Structure needed for a link annotation}\label{fig:linkannot} \end{figure} \subsubsection{Header and Footer}\label{sec:header-footer} Tagging header and footer is not trivial. 
First, on the technical side, header and footer are typeset and attached to the page during the output routine and the exact timing is not really under control of the user. That means that when adding tagging there one has to be careful not to disturb the tagging of the main text; this is mostly important in luamode where the attributes are global and can easily spill over.

Second, one has to decide how to tag: in many cases header and footer can simply be ignored, as they only contain information which is meant to visually guide the reader and so is not relevant for the structure. This means that normally they should be tagged as artifacts. The PDF reference offers a rather large number of options here to describe different versions of \enquote{ignore this}. Typically the header and footer should get the type \texttt{Pagination} and this type has a number of subtypes like Header, Footer, PageNum. It is not yet known if any technology actually makes use of this info.

But they can also contain meaningful content, for example an address. In such cases the content should be added to the structure (where?), but at best only once, even if the address is repeated on every page.

All this needs some thought both from the users and from the packages and code providing support for header and footer. For now tagpdf has added some first support for automatic tagging: Starting with version 0.92 header and footer are by default automatically marked up as (simple) artifacts. With the key \PrintKeyName{exclude-header-footer} the behavior can be changed: The value \texttt{false} disables the automatic tagging, the value \texttt{pagination} additionally adds an \texttt{/Artifact} structure with the attribute \texttt{/Pagination}.
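
A sketch of how the behavior could be changed, assuming that the key is set through \verb+\tagpdfsetup+ under exactly the name given above (the key path may differ between versions, so check the key list of the installed version):
\begin{taglstlisting}
% disable the automatic tagging of header and footer
\tagpdfsetup{exclude-header-footer=false}
% or: additionally add an /Artifact structure with /Pagination
\tagpdfsetup{exclude-header-footer=pagination}
\end{taglstlisting}
Only one of the two settings would be used in a real document.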
If some additional markup (or even a structure) is wanted, something like this should be used (here with the syntax of the \pkg{fancyhdr} package) to close the open mc-chunk and restart it after the content:
\begin{taglstlisting}
\ExplSyntaxOn
\cfoot{\leavevmode
  \tag_mc_end_push:
  \tagmcbegin{artifact=pagination/footer}
  \thepage
  \tagmcend
  \tag_mc_begin_pop:n{artifact}}
\ExplSyntaxOff
\end{taglstlisting}

\subsubsection{Links and other annotations}\label{sec:link+annot}

Annotations (like links or form field annotations) are objects associated with a geometric region of the page rather than with a particular object in its content stream. Any connection between a link or a form field and the text is based solely on visual appearance (the link text is in the same region, or there is empty space for the form field annotation) rather than on an explicitly specified association. To connect such an annotation with the structure and so with surrounding or underlying text a specific structure has to be added, see figure~\ref{fig:linkannot}: The annotation is added to a structure element as an object reference. It is not referenced directly but through an intermediate object of type OBJR. To the dictionary of the annotation a \texttt{/StructParent} entry must be added, the value is a number which is then used in the ParentTree to define a relationship between the annotation and the parent structure element. To support this, \pkg{tagpdf} currently offers two commands:
\begin{docCommand}{tag_struct_parent_int:}{}\end{docCommand}
This inserts the current value of a global counter used to track such objects. It can be used to add the \texttt{/StructParent} value to the annotation dictionary.
\begin{docCommand}{tag_struct_insert_annot:nn}{\marg{object reference}\marg{struct parent number}}\end{docCommand}
This will insert the annotation described by the object reference into the current structure by creating the OBJR object.
It will also add the necessary entry to the parent tree and increase the global counter referred to by |\tag_struct_parent_int:|. It does nothing if (structure) tagging is not activated.

Attention! As the second command increases the global counter at the end it changes the value given back by the first. That means that if nesting is involved care must be taken that the correct number is used. This should be easy to fulfill for most annotations, as they are boxes. There the second command should at best be used directly behind the annotation and it can make use of |\tag_struct_parent_int:|. For links nesting is theoretically possible, and it could be that future versions need more sophisticated handling here.

In environments which process their content twice like tabularx or align it would be best to exclude the second command from the trial step, but this will need better support from these environments.

Typically these commands are not often needed: Since version 0.81 \pkg{tagpdf} already handles (unnested) links, and form fields created with the \pkg{l3pdffield-testphase} package will be handled by that package.

The following listing shows low-level code to create a link where the two commands are used:
\begin{taglstlisting}
\pdfextension startlink
  attr
   {
     /StructParent \tag_struct_parent_int: %<----
   }
  user {
   /Subtype/Link
    /A
    <<
     /Type/Action
     /S/URI
     /URI(http://www.dante.de)
    >>
   }
This is a link.
\pdfextension endlink
\tag_struct_insert_annot:xx {\pdfannot_link_ref_last:}{\tag_struct_parent_int:}
\end{taglstlisting}

\subsubsection{Math}

Math is still a problem but some progress has been made. To tag math you have to surround it with a \texttt{Formula} structure. But the content of such a structure is handled by readers as a black box so additional data is needed for accessibility.
There are a number of theoretical options here:
\begin{enumerate}
\item One can add an alternative text (\texttt{/Alt}) or an \texttt{/ActualText} to the structure element (either some text manually provided by the author or, with the math module in the latex-lab bundle, the \LaTeX-source).
\item One can add an alternative text (\texttt{/Alt} or \texttt{/ActualText}) to the MC-chunks.
\item One can build inside the \texttt{Formula} structure element a tree with MathML structure elements; with PDF 2.0 this does not require declaring new tags as the MathML name space is built-in.
\item One can in PDF 2.0 attach a MathML file and/or the \LaTeX-source as associated file to the \texttt{Formula} structure (or to one or more MC-chunks).
\end{enumerate}
The question is how these work in reality.

Options 1 and 2 give not too bad results with a screen reader, but can require manual work and if you are unlucky the reader drops important parts of the math (like punctuation symbols). Exploring the equation is not possible.

Option 3 creates many structure elements. E.g. I have seen an example where \emph{every single symbol} has been marked up with tags from MathML along with an \texttt{/ActualText} entry and an entry with alternate text which describes how to read the symbol. The \PDF{} then looked like this
\begin{taglstlisting}
/mn <<... /Alt( : open bracket: four )>>BDC
...
/mn <<... /Alt( third s )>>BDC
...
/mo <<... /Alt( times )>>BDC
\end{taglstlisting}
If this is really the way to go one would need some script to add the mark-up as doing it manually is too much work and would make the source unreadable -- at least with pdflatex and the generic mode. In lua mode it is possible to hook into the \texttt{mlist\_to\_hlist} callback and add markers automatically. Some first implementation in this direction has been done by Marcel Krüger in the luamml project.
But up to now it was not possible to test the usability of this approach: With the exception of the html derivation with ngpdf no PDF-viewer/screen reader combination seems to make use of such structures. I'm not sure anyway that this is the best way to do math. It looks rather odd that a document should have to tell a screen reader in such detail how to read an equation.

The last option 4 has been implemented in the math module in the \texttt{latex-lab} bundle. Here happily a proof of concept was possible: With development versions of foxit and the NVDA reader it was possible to access an attached MathML and get speech output from it \cite{todasoifferdeims2024,mittelbachfischerdeims2024}. See also \cite{mathexamples} for some examples and section~\ref{sec:alt} for some more remarks and tests.

\subsubsection{Split paragraphs}\label{sec:splitpara}
%TODO: think about marginnote! Aside?
A\sidenote{Generic mode only} problem in generic mode are paragraphs with page breaks. As already mentioned the end marker \texttt{EMC} must be added on the same page as the begin marker. But it is in pdflatex \emph{very} difficult to inject something at the page break automatically. One can manipulate the shipout box to some extent in the output routine, but this is not easy and it gets even more difficult if inserts like footnotes and floats are involved: the end of the paragraph is then somewhere in the middle of the box.

So with pdflatex in generic mode one until now had to do the splitting manually. The example \texttt{mc-manual-para-split} demonstrates how this can be done. The general idea was to use \verb+\vadjust+ in the right place:
\begin{taglstlisting}
\tagmcbegin{tag=P}
... fringilla, ligula wisi commodo felis, ut adipiscing felis dui in enim.
Suspendisse malesuada ultrices ante.% page break
\vadjust{\tagmcend\pagebreak\tagmcbegin{tag=P}}
Pellentesque scelerisque ... sit amet, lacus.\tagmcend
\end{taglstlisting}
Starting with version 0.92 there is code which resolves this problem. Basically it works like this: every mc-command issues a mark command (actually two slightly different ones). When the page is built in the output routine these mark commands are inspected and from them \LaTeX{} can deduce if there is a mc-chunk which must be closed or reopened. The method is described in Frank Mittelbach's talk at TUG~2021 \enquote{Taming the beast — Advances in paragraph tagging with pdfTeX and XeTeX} \url{https://youtu.be/SZHIeevyo3U?t=19551}.

Please note
\begin{itemize}
\item Typically you will need more compilations than previously. Don't rely on the rerun messages: if something looks wrong, rerun.
\item The code relies on related |\tagmcbegin| and |\tagmcend| being in the same boxing level. If one is in a box (which hides the marks) and the other in the main galley, things will go wrong (\texttt{longtable} is for example problematic).
\end{itemize}

\subsubsection{Automatic tagging of paragraphs}\label{sec:paratagging}

Another feature that emerged from the \LaTeX{} tagged PDF project are hooks at the begin and end of paragraphs. \pkg{tagpdf} makes use of these hooks to tag paragraphs. In the first version it added only one structure, but this proved to be not adequate: Paragraphs in \LaTeX{} can be nested, e.g., you can have a paragraph containing a display quote, which in turn consists of more than one (sub)paragraph, followed by some more text which all belongs to the same outer paragraph. In the \PDF{} model and in the HTML model that is not supported: the rules in the \PDF{} specification do not allow \texttt{P}-structures to be nested, a limitation that conflicts with real life, given that such constructs are quite normal in spoken and written language.
The approach we take (starting with March 2023, version 0.98e) to resolve this is to model such \enquote{big} paragraphs with a structure named \texttt{text-unit} and use \texttt{P} (under the name \texttt{text}) only for (portions of) the actual paragraph text in a way that the \texttt{P}s are not nested. As a result we have for a simple paragraph two structures:
\begin{taglstlisting}
% simple paragraph
<text-unit>
  <text>
   The paragraph text ...
  </text>
</text-unit>

% paragraph containing a display element
<text-unit>
  <text>
   The paragraph text before the display element ...
  </text>
  <display element structure>
   Content of the display structure possibly involving inner <text> tags
  </display element structure>
  <text>
   ... continuing the outer paragraph text
  </text>
</text-unit>
\end{taglstlisting}
In other words such a display block is always embedded in a |<text-unit>| structure, possibly preceded by a |<text>|\ldots|</text>| block and possibly followed by one, though both such blocks are optional. More information about this can be found in the documentation of \texttt{latex-lab-block-tagging}.

As a consequence \pkg{tagpdf} now adds two structures if paratagging is activated. The new code to tag display blocks extends this code to handle the nesting of lists and other display structures.

The automatic tagging requires that for every begin of a paragraph with the begin hook code there is a corresponding end with the closing hook code. This can fail, e.g.\ if a |vbox| doesn't correctly issue a |\par| at the end. If this happens the tagging structure can get very confused. At the end of the document \pkg{tagpdf} checks if the numbers of outer and inner start and end paragraph structures created with the automatic paratagging code are equal and it will error if not.

The automatic tagging of paragraphs can be deactivated completely or only at the outer level with the |\tagtool| keys |para| and |para-flattened| or with the (now deprecated) commands |\tagpdfparaOn| and |\tagpdfparaOff|.

Nesting the activation and deactivation of the tagging of paragraphs can be quite difficult. For example if it is unclear if the inner code issues a |\par| or not it is not trivial to exclude an end hook for every excluded begin hook. In such cases it can be easier to use the |paratag| key with the value |NonStruct| to convert some |P|-structures into |NonStruct|-structures without real meaning.
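
As a rough sketch of the two strategies, the following listing assumes that \verb+para+ accepts a boolean value and that \verb+paratag+ can be set through \verb+\tagtool+ as well; the dots stand for arbitrary user code.
\begin{taglstlisting}
% strategy 1: switch the automatic paragraph tagging off and on again
\tagtool{para=false}
... code which should not get paragraph structures ...
\tagtool{para=true}
% strategy 2: keep the hooks active, but use a structure without meaning
\tagtool{paratag=NonStruct}
... code with an unclear number of \par commands ...
\end{taglstlisting}
The second variant keeps begin and end hooks balanced, which is the reason why it can be the easier choice in the situation described above.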
\subsection{Task 2: Marking the structure}

The structure is represented in the \PDF{} with a number of objects of type \texttt{StructElem} which build a tree: each of these objects points back to its parent and normally has a number of kid elements, which are either again structure elements or -- as leaves of the tree -- the marked content chunks marked up with the \verb+tagmc+-commands. The root of the tree is the \texttt{StructTreeRoot}.

\subsubsection{Structure types}

The tree should reflect the \emph{semantic} meaning of the text. That means that the text should be marked as section, list, table head, table cell and so on. A number of standard structure types is predefined, see section \ref{sec:new-tag}, but it is allowed to create more. If you want to use types of your own you must declare them. E.g. this declares two new types \texttt{TAB} and \texttt{FIG} and bases them on \texttt{P}:
\begin{taglstlisting}
\tagpdfsetup
  {
    role/new-tag = TAB/P,
    role/new-tag = FIG/P,
  }
\end{taglstlisting}

\subsubsection{Sectioning}

The sectioning units can be structured in two ways: a flat, html-like version and a more xml-like version (in pdf/UA2 basically deprecated). The flat version creates a structure like this:
\begin{taglstlisting}
<H1>section header</H1>
<P> text</P>
<H2>subsection header</H2>
...
\end{taglstlisting}
So here the headings are marked according to their level with \texttt{H1}, \texttt{H2}, etc.

In the xml-like tree the complete text of a sectioning unit is surrounded with the \texttt{Sect} tag, and all headers with the tag \texttt{H}. Here the nesting defines the level of a sectioning heading.
\begin{taglstlisting}
<Sect>
 <H>section heading</H>
 <P> text</P>
 <Sect>
  <H>subsection heading</H>
  ...
 </Sect>
</Sect>
\end{taglstlisting}
The flat version is more \LaTeX-like and it is rather straightforward to patch \verb+\chapter+, \verb+\section+ and so on to insert the appropriate \texttt{H\ldots} start and end markers. The xml-like tree is more difficult to automate. It has been implemented in the sec module in latex-lab, but can break if sectioning commands are hidden inside boxes.

\subsubsection{Commands to define the structure}

The following commands can be used to define the tree structure:
\begin{docCommands}
 {
  {doc name=tagstructbegin,doc parameter=\marg{key-val-list}},
  {doc name=tag_struct_begin:n,doc parameter=\marg{key-val-list}}
 }\end{docCommands}
These commands start a new structure. They don't start a group. They set all their values globally. The key-val list understands the following keys:
\begin{description}
\item[\PrintKeyName{tag}] This is required. The value of the key is normally one of the standard types listed in section \ref{sec:new-tag}. It is possible to set up new tags/types, see the same section. The value can also be of the form |type/NS|, where |NS| is the shorthand of a declared name space. Currently the name spaces |pdf|, |pdf2|, |mathml| and |user| are defined. This allows one to use a different name space than the one connected by default to the tag. But normally this should not be needed.
\item[\PrintKeyName{stash}] Normally a new structure inserts itself as a kid into the currently active structure. This key prohibits this. The structure is nevertheless from now on \enquote{the current active structure} and parent for following marked content and structures.
\item[\PrintKeyName{label}] This key sets a label by which one can refer to the structure.
Currently the key writes a property whose name starts with \texttt{tagpdfstruct-} to the aux-file with the two attributes \texttt{tagstruct} (the structure number) and \texttt{tagstructobj} (the object reference). It also stores the name and the structure number in a prop for use in the current compilation. The label is used, e.g., by \cs{tag\_struct\_use:n} and by the |ref| key (which can refer to future structures).
\item[\PrintKeyName{parent}] With the parent key one can choose another parent. The value is a structure number which must refer to an already existing, previously created structure. Such a structure number can be retrieved and stored beforehand with \cs{tag\_get:n}, but one can also use a label on the parent structure and then use \cs{property\_ref:nn}\verb+{tagpdfstruct-label}{tagstruct}+ to retrieve it.
\item[\PrintKeyName{firstkid}] If this key is used the structure is added at the left of the kids of the parent structure (if the structure is not stashed). This means that it will be the first kid of the parent structure (unless some later structure uses the key too). This can be needed, e.g., for a caption, as the PDF reference requires it to be the first or last kid of its structure.
\item[\PrintKeyName{alt}] This key inserts an \texttt{/Alt} value in the dictionary of the structure object, see section~\ref{sec:alt}. The value is handled as a verbatim string and hex encoded. The value is first expanded once (so it works like the key \texttt{alttext-o} of previous versions, which has been removed). If the value is empty, nothing will happen. That means that you can do something like this:
\begin{taglstlisting}
\newcommand\myalttext{\frac{a}{b}}
\tagstructbegin{tag=P,alt=\myalttext}
\end{taglstlisting}
and it will insert \verb+\frac{a}{b}+ (hex encoded) in the \PDF{}. If the text begins with a command that should not be expanded, protect it, e.g., with a leading \verb+\empty+.
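As a minimal sketch of how some of the keys described so far can be combined, the following example tags a figure with an alternative text and uses \PrintKeyName{firstkid} for its caption. The image name, the texts and the choice of nesting \texttt{Caption} inside \texttt{Figure} are assumptions made for illustration only, and the example assumes that \pkg{graphicx} is loaded. It is not meant as the recommended way to tag a float.
\begin{taglstlisting}
% hypothetical figure; texts and file name are placeholders
\tagstructbegin{tag=Figure,alt={A bar chart of yearly sales}}
 \tagmcbegin{tag=Figure}
  \includegraphics{example-image}
 \tagmcend
 % opened after the image, but added at the left of the kids
 \tagstructbegin{tag=Caption,firstkid}
  \tagmcbegin{tag=Caption}
   Yearly sales
  \tagmcend
 \tagstructend
\tagstructend
\end{taglstlisting}
Although the caption structure is opened after the image has been marked, it should end up as the first kid of the \texttt{Figure} structure.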
\item[\PrintKeyName{actualtext}] This key inserts an \texttt{/ActualText} value in the dictionary of the structure object, see section~\ref{sec:alt}. The value is handled as a verbatim string. The value is first expanded once (so it works like the key \texttt{alttext-o} of previous versions, which has been removed). If the value is empty, nothing will happen. That means that you can do something like this:
\begin{taglstlisting}
\newcommand\myactualtext{X}
\tagstructbegin{tag=P,actualtext=\myactualtext}
\end{taglstlisting}
and it will insert \verb+X+ (hex encoded) in the \PDF{}. If the text begins with a command that should not be expanded, protect it, e.g., with a leading \verb+\empty+.
\item[\PrintKeyName{attribute}] This key takes as argument a comma list of attribute names (use braces to protect the commas from the external key-val parser) and allows adding one or more attribute dictionary entries to the structure object. As an example:
\begin{taglstlisting}
\tagstructbegin{tag=TH,attribute= TH-row}
\end{taglstlisting}
See also section~\ref{sec:attributes}.
\item[\PrintKeyName{attribute-class}] This key takes as argument a comma list of attribute names (use braces to protect the commas from the external key-val parser) and allows adding them as attribute classes to the structure object. As an example:
\begin{taglstlisting}
\tagstructbegin{tag=TH,attribute-class= TH-row}
\end{taglstlisting}
See also section~\ref{sec:attributes}.
\item[\PrintKeyName{title}] This key sets the dictionary entry \texttt{/T} (for a title) in the structure object. The value is handled as a verbatim string and hex encoded. Commands are not expanded.
\item[\PrintKeyName{title-o}] This key sets the dictionary entry \texttt{/T} in the structure object. The value is expanded once and then handled as a verbatim string like the \PrintKeyName{title} key.
\item[\PrintKeyName{AF}] This key allows referencing an associated file in the structure element.
The value should be the name of an object pointing to the \texttt{/Filespec} dictionary as expected by \verb+\pdf_object_ref:n+ from a current \texttt{l3kernel}. For example:
\begin{taglstlisting}
\group_begin:
\pdfdict_put:nnn {l_pdffile/Filespec} {AFRelationship}{/Supplement}
\pdffile_embed_file:nnn{example-input-file.tex}{}{tag/AFtest}
\group_end:
\tagstructbegin{tag=P,AF=tag/AFtest}
\end{taglstlisting}
As shown, the desired \texttt{AFRelationship} can be set by putting the value into the dictionary. The MIME type is detected automatically here, but for unknown types it can also be set explicitly. See the \texttt{l3pdffile} documentation for details. Associated files are a concept new in PDF 2.0, but the code currently doesn't check the PDF version; it is your responsibility to set it (this can be done with the \texttt{pdfversion} key in \verb+\DocumentMetadata+).
\item[\PrintKeyName{root-AF}] This key allows referencing an associated file in the root structure element. Using the root can be useful, e.g., to add a CSS file. When the PDF is converted to HTML with, e.g., ngpdf \cite{ngpdf}, this CSS file is then referenced in the head of the HTML.
\item[\PrintKeyName{root-supplemental-file}] This is a variant of the previous key. It takes as argument a file name. It then embeds this file with \texttt{/AFRelationship /Supplement} and appends it as associated file to the structure root. ngpdf \cite{ngpdf} will store a \texttt{.css} file attached in this way and reference it in the head of the HTML. If an \texttt{html} file is attached in this way, ngpdf will copy the content into the head of the derived HTML. This means that the content of such an HTML file should normally be some HTML snippet suitable for the head, e.g., some CSS code inside \texttt{