% !Mode:: "TeX:DE:UTF-8:Main"
\PassOptionsToPackage{check-declarations,enable-debug}{expl3}
% Note on the compilation of the documentation:
% The documentation uses for the tagging sometimes code
% that is under development and/or not public yet.
% To compile an *untagged* documentation, comment the line with
% the testphase keys in the following \DocumentMetadata command.
\DocumentMetadata
{
% comment the following line to compile an untagged documentation:
testphase={phase-III,title,table},
pdfversion=2.0,lang=en-UK,pdfstandard=a-4,pdfstandard=ua-2
%uncompress
}
\DebugBlocksOff
\makeatletter
\def\UlrikeFischer@package@version{0.99n}
\def\UlrikeFischer@package@date{2025-02-23}
\makeatother
\documentclass[bibliography=totoc,a4paper]{article}
\usepackage{geometry}
\usepackage[english]{babel}
\usepackage{unicode-math}
\setmainfont{Heuristica}
\usepackage[nopatch]{microtype}
\usepackage[autostyle]{csquotes}
\usepackage[style=numeric]{biblatex}
\addbibresource{tagpdf.bib}
\reversemarginpar
\NewDocumentCommand\sidenote{m}{\marginpar{#1}}
\usepackage{booktabs}
\setlength\belowcaptionskip{10pt}
\usepackage{tcolorbox}
\usepackage{tikz}
\usetikzlibrary{positioning}
\usetikzlibrary{fit,tikzmark}
\usetikzlibrary{arrows.meta}
\tikzset{arg/.style = {font=\footnotesize\ttfamily, anchor=base,draw, rounded corners,node distance=2mm and 2mm}}
\tikzset{operator/.style = {font=\footnotesize\ttfamily, anchor=base,draw, rounded corners,node distance=4mm and 4mm}}
\usepackage{listings}
\lstset{basicstyle=\ttfamily, columns=fullflexible,language=[LaTeX]TeX,
escapechar=*,
commentstyle=\color{green!50!black}\bfseries}
% this allow to get real spaces in the code parts.
% This should perhaps be combined in a new listings key
\lstset{showspaces}
\makeatletter \def\lst@visiblespace{\lst@ttfamily{\char32}{\char32}}\makeatother
\tagpdfsetup{tabsorder=structure}
\usepackage[pdfdisplaydoctitle=true]{hyperref}
\hypersetup{
pdftitle={The tagpdf package, v\csname UlrikeFischer@package@version\endcsname},
pdfauthor=Ulrike Fischer,
colorlinks}
\tcbuselibrary{documentation}
\definecolor{Definition}{rgb}{0,0.2,0.6}
\newcommand\PrintKeyName[1]{\textsf{#1}}
\newcommand\pkg[1]{\texttt{#1}}
\newcommand\DescribeKey[1]{\texttt{#1}}
%tagging patches:
\usepackage{tagpdfdocu-patches}
\newcommand\PDF{PDF}
\title{The \pkg{tagpdf} package, v\csname UlrikeFischer@package@version\endcsname}
\date{\csname UlrikeFischer@package@date\endcsname}
\author{Ulrike Fischer\thanks{fischer@troubleshooting-tex.de}}
\usepackage{shortvrb}
\MakeShortVerb|
\begin{document}
\maketitle
\begin{tcolorbox}[colframe=red]
This package is not meant for direct use in (normal) documents. It started in 2018 as
a support tool to \emph{research} tagging. It is now the base of the code developed
in the \pkg{latex-lab} bundle for the Tagged PDF project (i.e., loaded by that code)
\url{https://www.latex-project.org/publications/indexbytopic/pdf/}.
The package is developed and improved in parallel with the code in the \pkg{latex-lab}
bundle (part of the core \LaTeX{} distribution), the \pkg{pdfmanagement-testphase}
package (the \LaTeX{} PDF management bundle) and the L3 programming layer (part of the \LaTeX{} format).
That means you must ensure that all these components are up-to-date and in
sync with each other.
This package quite probably still contains some bugs. It is in some parts quite slow because
the code currently prefers readability over speed. At some point in the future its code will
be integrated into the \LaTeX{} format and then this package will disappear.
Because of its function as a research and development tool it is
important to understand that this package can still change in
incompatible ways from one version to the next.
You need some knowledge about \TeX, \PDF{} and perhaps even lua to use it.
\medskip
Issues, comments and suggestions can be added to one of these two GitHub trackers:
\medskip
\centering \url{https://github.com/latex3/tagging-project}\par
\leavevmode\llap{or\qquad\qquad} \url{https://github.com/latex3/tagpdf}
\end{tcolorbox}
\tagtool{sec-add-grouping=false}
\tableofcontents
\tagtool{sec-add-grouping}
\section{Introduction}
For many years the creation of accessible, tagged \PDF{}-files with \LaTeX\
that conform to the PDF/UA standard has been on the agenda of \TeX-meetings.
Many people agree that this is important and Ross Moore has done quite some
work on it. There is also a TUG-mailing list and a web page
\parencite{tugaccess} dedicated to this topic.
What was missing, in my opinion, were means to \emph{experiment} with tagging and
accessibility: means to try out how difficult it is to tag some structures,
means to try out how much tagging is really needed (standards and validators
don't need to be right \ldots), means to test what else is needed so that a
\PDF{} works e.g. with a screen reader, and means to try out how core \LaTeX\
commands behave if tagging is used. Without such experiments it is in my
opinion quite difficult to get a feeling for what has to be done, which
kernel changes are needed, and how packages should be adapted.
This package was developed to close this gap by offering \emph{core} commands
to tag a \PDF{}\footnote{In case you don't know what this means: there will
be some explanations later on.}. My hope was that the knowledge gained by the
use of this package would in the end make it possible to decide if and how code to do
tagging should become part of the \LaTeX\ kernel.
The code has been written so that it can be added as a module to the \LaTeX{}
kernel itself if it turned out to be usable. It therefore avoids patching
commands from other packages. It was also not an aim of the package to
develop patches to directly enable tagging in other packages. While in the end changes to various commands in many
classes and packages will be needed to automatically get tagged \PDF{} files, these changes
should be done by class, package and document writers themselves using a
sensible API provided by the kernel and not by some external package that
adds patches everywhere and would need constant maintenance. One only needs
to look at packages like \pkg{tex4ht} or \pkg{bidi} or \pkg{hyperref} to see how difficult and
sometimes fragile this is.
The package is now a part of the Tagged PDF project and has already triggered
various changes in the \LaTeX\ kernel and the engines: there is a new PDF
management, the new para hooks allow paragraphs to be tagged automatically, after
changes in the output routine page breaks and headers and footers are handled
correctly, and the engines now support structure destinations. More changes are
in the latex-lab bundle and can be loaded through \texttt{testphase} keys.
I'm sure that tagpdf still has bugs. Bug reports, suggestions and comments
can be added to the issue tracker on GitHub, either
\url{https://github.com/latex3/tagpdf} or
\url{https://github.com/latex3/tagging-project}.
Please also check the github site and latex-lab for new examples and improvements.
\subsection{Tagging and accessibility}
While the package is named \pkg{tagpdf} the goal is also \emph{accessible}
\PDF{}-files. Tagging is \emph{one} (the most difficult) requirement for
accessibility but there are others. I will mention some later on in this
documentation, and -- if sensible -- also add code, keys or
tips for them.
So the name of the package is a bit wrong. As an excuse I can only say that it
is short and easy to pronounce (and of course, it was always meant to be temporary).
\subsection{Engines and modes}
Theoretically, the package works with all engines, but the xelatex and the
latex-dvips routes are basically untested and they also don't support real
space glyphs, so I don't recommend them. lualatex is the most powerful and
safest mode and should be used for new documents; it is slower than pdflatex
but requires fewer compilations. pdflatex works ok and can be used for legacy
documents; it needs more compilations to resolve all cross references needed
for the tagging.
The package has two modes: the \emph{generic mode} which should work in
theory with every engine and the \emph{lua mode} which works only with
lualatex and (since version 0.98k) with dvilualatex. Since version 0.99m,
the lua mode is forced if luatex is detected, otherwise generic mode is used.
I implemented the generic mode first. Mostly because my \TeX\ skills are much
better than my lua skills and I wanted to get the \TeX\ side right before
starting to fight with attributes and node traversing.
While the generic mode is not bad and I spent quite some time to get it
working I nevertheless think that the lua mode is the future and the only one
that will be usable for larger documents. \PDF{} is a page orientated format
and so the ability of luatex to manipulate pages and nodes after the
\TeX-processing has finished is really useful here. Also with luatex characters are
normally already given as Unicode.
The package uses quite a lot of labels (in generic mode more than in lua mode).
It is now based on the property module of the \LaTeX{} kernel. This module
provides expandable references but the drawback is that (right now) they don't always give
good rerun messages if they have changed. I advise using the
\pkg{rerunfilecheck} package as an intermediate work-around and, when using
pdflatex, compiling at least once or twice more often than normal.
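As a minimal sketch of this advice, assuming a document that is otherwise set up as described in the section about loading: the \pkg{rerunfilecheck} package is simply added to the preamble. It is loaded here without options; which files it checks can be configured through its own options, which are not covered in this documentation.
\begin{taglstlisting}
\DocumentMetadata{testphase=phase-III}
\documentclass{article}
% warns if auxiliary files have changed and a rerun is needed
\usepackage{rerunfilecheck}
\begin{document}
some text
\end{document}
\end{taglstlisting}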
\subsection{References and target PDF version}
My main reference for the first versions of this package was the free
reference for \PDF{} 1.7 \parencite{pdfreference} and so they implemented
only support for \PDF{} 1.7.
In 2018 \PDF{} 2.0 was released. The reference can now be obtained at no
cost through the PDF association.
\PDF{} 2.0 has a number of features that are really needed for good tagging:
it knows more structure types, it allows associated files to be added to
structures (these are small, embedded files that can, for example, contain
the MathML or source code of an equation), it knows structure destinations,
which make it possible to link to a structure, and it knows the MathML namespace.
\LaTeX{} therefore targets \PDF{} 2.0 and tagpdf has support for
associated files, for name spaces and other \PDF{} 2.0 features.
\PDF{}~2.0 features are currently (beginning of 2025) still not well supported by
\PDF~consumers, but some progress has been made. Foxit can handle MathML associated files
and to some extent MathML structure elements, and
together with development versions of NVDA and MathCat reading of equations is already quite good. The PDF Accessibility Checker (PAC) no longer crashes
if one tries to load a \PDF{} 2.0 file. We recommend
using \PDF{} 2.0 if possible and then complaining to the \PDF{} consumer if
something doesn't work.
The package doesn't try to suppress all 2.0 features if an older \PDF{}
version is produced. It normally doesn't harm if a \PDF{} contains keys
unknown in its version and it makes the code faster and easier to maintain if
there aren't too many tests and code paths; so for example associated files
will always be added. But tests could be added in case this leads to
incompatibilities.
\subsection{Validation}
\PDF{}s created with the commands of this package must be validated:
\begin{itemize}
\item
One must check that the \PDF{} is \emph{syntactically} correct.
It is rather easy to create broken \PDF{}:
e.g. if a chunk is opened on one page but closed
on the next page or if the document isn't compiled often enough.
\item One must check how well the \PDF{} follows the requirements of standards
like PDF/UA \emph{formally}\footnote{The PDF/UA-2 standard for \PDF~2.0
was released in 2024.}.
\item
One must check how good the accessibility is \emph{practically}.
\end{itemize}
Syntax validation and formal standard validation can be done with the
validator veraPDF \parencite{verapdf} which can also handle PDF 2.0 files.
Other options (only for PDF 1.7 and older) are preflight of the (non-free) Adobe Acrobat and the free \PDF{} Accessibility Checker (PAC~2024) \parencite{pac2024}.
A quite useful tool
is \enquote{Next Generation PDF} \parencite{ngpdf}, a browser application
which converts a tagged PDF to html, allows to inspect its structure and also
to edit the structure. For PDF~2.0 files there is also a checker based on the
Arlington model from veraPDF.
A tool developed by the \LaTeX{} team makes it possible to extract the structure as XML and to validate it against a schema. This can be tested at \url{https://texlive.net/showtags}.
Practical validation is naturally the more complicated part.
It needs screen readers and users who actually know how to handle them,
can test documents and can report where a \PDF{} has real accessibility problems.
\minisec{Preflight woes}
Sadly validators cannot always be trusted.
As an example, for a reason that I don't understand the Adobe preflight
doesn't like the list structure \texttt{L}.
It is also possible that validators contradict each other: one says everything is okay
while the other complains. Generally, when in doubt I recommend using and trusting veraPDF.
\subsection{Examples wanted!}
To make the package usable examples are needed: examples that demonstrate how
various structures can be tagged and which patches are needed, examples for
the test suite, examples that demonstrate problems.
\begin{tcolorbox}
Feedback, contributions and corrections are welcome!
\end{tcolorbox}
All examples should use the \cs{DocumentMetadata} key \PrintKeyName{uncompress}
so that an uncompressed \PDF{} is created and the internal objects and
structures can be inspected and compared by the l3build checks.%
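\par A minimal sketch of such an example file, assuming that the \texttt{phase-III} test phase and \PDF{} 2.0 are wanted (any other phase would do as well):
\begin{taglstlisting}
\DocumentMetadata
 {
  uncompress,            % write an uncompressed PDF
  testphase = phase-III, % assumption: full test phase
  pdfversion = 2.0,
 }
\documentclass{article}
\begin{document}
some text
\end{document}
\end{taglstlisting}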
\subsection{Proof of concept: the tagging of the documentation itself}
Starting with version 0.6 the documentation itself has been tagged. The
tagging wasn't (and isn't) in any way perfect. The validator from Adobe didn't
complain, but PAC~3 wanted alternative text for all links (no idea why) and
so I put everywhere simple text like \enquote{link} and \enquote{ref}. The
links to footnotes gave warnings, so I disabled them. I used types from the
\PDF{} version 1.7, mostly as I had no idea what should be used for code in
2.0. Margin notes were simply wrong and there were tagging commands
everywhere \ldots
The tagging has been improved and automated over time in sync with
improvements and new features in the \LaTeX\ kernel, the latex-lab bundle and
the \PDF\ management code and is now much better. Only a few
structures---mostly some from currently unsupported packages--- still need
manual tagging. But sadly the output of the validators doesn't quite reflect
the improvements. The documentation now uses \PDF~2.0 and while the newest
PAC~2024 can at least open the file it cannot validate it properly. For example
it complains about the tabular header cells as it doesn't follow attribute classes.
The Adobe validator has a bug and
doesn't like the (valid) use of the \texttt{Lbl} tag for the section numbers
(see figure~\ref{fig:adobe}).
But even if the documentation passed all the tests of the validators: as
mentioned above, passing a formal test doesn't mean that the content is really
good and usable. The user commands used for the tagging and also some of the
patches used are still rather crude. So there is a lot of room for improvement.
\begin{tcolorbox}[]
Be aware that to create the tagged version a current lualatex-dev and a
current version of the pdfmanagement-testphase package are needed.
\end{tcolorbox}
\includegraphics[alt=PAC 2024 complains about PDF version]{pac2024-version}
\includegraphics[alt=PAC 2024 complains about table header cells]{pac2024-report}
\begin{figure}
\includegraphics[alt={Screenshot of Adobe report}]{acrobat}
\caption{Adobe Acrobat complaining
about the \texttt{Lbl} use}\label{fig:adobe}\par
\end{figure}
\section{Loading}
The package requires the new PDF management. With a current \LaTeX{} (2022-06-01 or newer)
the PDF management is loaded if you use the \cs{DocumentMetadata} command before \cs{documentclass}.
The \pkg{tagpdf} package can then be loaded and activated by using the \texttt{testphase} key. The exact behavior of
the \texttt{testphase} key is documented in \texttt{documentmetadata-support-doc.pdf} which
is part of the \pkg{latex-lab} bundle.
Various parts of the code differentiate between \PDF{} version 2.0 and lower versions. If
\PDF{} 2.0 is wanted it is required to set the version early in the \cs{DocumentMetadata}
command so that \pkg{tagpdf} can pick up the correct code path.
\begin{taglstlisting}
\DocumentMetadata
{
% testphase = phase-I, % tagging without paragraph tagging
% testphase = phase-II, % tagging with paragraph tagging
testphase = phase-III, % tagging with paragraph sec, toc, blocks and more
pdfversion = 2.0, % pdfversion must be set here.
pdfstandard=ua-2, % pdfstandard can be set too
}
\documentclass{article}
\begin{document}
some text
\end{document}
\end{taglstlisting}
\minisec{Deactivation}
When loading \pkg{tagpdf} through the \texttt{testphase} keys, it is automatically activated.
To deactivate it while still retaining all the other new code from the latex-lab testphase files,
use in the preamble |\tagpdfsetup{activate/all=false}|. You can additionally also deactivate the
paratagging and the interword space code.
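As a sketch, assuming that the keys \PrintKeyName{para/tagging} and \PrintKeyName{activate/spaces} described in section~\ref{ssec:setup} are the ones meant here, a preamble that switches all three parts off could look like this:
\begin{taglstlisting}
\tagpdfsetup
 {
  activate/all    = false, % no tagging at all
  para/tagging    = false, % no automatic paragraph structures
  activate/spaces = false, % no insertion of space glyphs
 }
\end{taglstlisting}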
To suppress the loading of the package altogether you can try
\begin{taglstlisting}
\makeatletter
\disable@package@load{tagpdf}{}
\makeatother
\DocumentMetadata{...}
\end{taglstlisting}
\minisec{Loading as package needs activation!}
It is not recommended anymore, but the package can also be loaded
normally with |\usepackage| (but it is still required to
use \cs{DocumentMetadata} to load the \PDF\ management) but it will
then -- apart from loading more packages and defining a lot of things
-- not do much. You will have to \emph{activate} it
with \verb+\tagpdfsetup+.
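A minimal sketch of this (no longer recommended) route, assuming that the \PrintKeyName{activate} key with its default value \texttt{Document} is wanted:
\begin{taglstlisting}
\DocumentMetadata{} % loads the PDF management
\documentclass{article}
\usepackage{tagpdf}
% activate everything and open a Document structure
\tagpdfsetup{activate}
\begin{document}
some text
\end{document}
\end{taglstlisting}
Note that in this setup paragraphs are not tagged automatically unless \PrintKeyName{para/tagging} is set too.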
The \PDF\ management loaded with \cs{DocumentMetadata} will in any
case load \pkg{tagpdf-base} a small package that provides no-op
versions of the main tagging commands.
Most commands do nothing if tagging is not activated, but in case a
test is needed a command (with the usual p,T,F variants) is provided:
\begin{docCommand}{tag_if_active:TF}{}\end{docCommand}
The check is true only if \emph{everything} is activated. In all other
cases (including if tagging has been stopped locally) it will be
false.
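The following sketch shows how the test could be used in package code. The command name \cs{ShowTaggingState} is hypothetical and only serves as an illustration; it typesets a short message depending on whether tagging is currently active.
\begin{taglstlisting}
\ExplSyntaxOn
% \ShowTaggingState is a made-up name for this example
\NewDocumentCommand \ShowTaggingState { }
  {
    \tag_if_active:TF
      { tagging~is~active }
      { tagging~is~not~active }
  }
\ExplSyntaxOff
\end{taglstlisting}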
\subsection{Modes and package options}
%TODO think about tagging of the keys. Aside? Header?
The package has two different modes: The \textbf{generic mode} works
(in theory, currently only fully tested with pdflatex) probably with
all engines, the \textbf{lua mode} only with lualatex. The differences
between both modes will be described later.
Starting with version 0.99m the mode is set automatically (lua mode for luatex, generic mode otherwise). The package options do nothing anymore and will be removed in future versions.
\subsection{Setup and activation}\label{ssec:setup}
\begin{docCommand}{tagpdfsetup}{\marg{key-val-list}}\end{docCommand}
This command sets up the general behavior of the package.
The command should normally be used only in the preamble
(for a few keys it could also make sense to change them in the document).
The key-val list understands at least the following keys. More keys are defined in some of the latex-lab modules; see table~\ref{tab:setupkey} for an overview which also includes older, now deprecated names.
\begin{table}
\caption{Overview over keys for \cs{tagpdfsetup}}\label{tab:setupkey}
\input{tagpdfsetup-keys}
\end{table}
\begin{description}
\item[\PrintKeyName{activate/all}] Boolean, initially false. Activates
everything, that's normally the sensible thing to do.
\item [\PrintKeyName{activate}] Like |activate/all|, but
\emph{additionally} it opens at begin document a structure with
|\tagstructbegin| and closes it at end document. The key accepts as
value a tag name which is used as the tag of the structure. The
default value is |Document|.
\item[\PrintKeyName{activate/mc}] Boolean, initially false. Activates
the code related to marked content.
\item[\PrintKeyName{activate/struct}] Boolean, initially
false. Activates the code related to structures. Should be used only
if \PrintKeyName{activate/mc} has been used too.
\item[\PrintKeyName{activate/struct-dest}] Boolean, initially true.
Starting with version 0.93
\pkg{tagpdf} will automatically create structure destinations (see
section~\ref{sec:struct-dest}) if \pkg{hyperref} is used and if the
engine supports it. With this key this
can be suppressed.
\item[\PrintKeyName{activate/tree}] Boolean, initially
false. Activates the code related to trees. Should be used only if
the two other keys have been used too.
\item[\PrintKeyName{activate/spaces}] Boolean. The key
activates/deactivates the insertion of space glyphs, see
section~\ref{sec:spacechars}. In the luamode it only works if at
least \PrintKeyName{activate/mc} has been used.
The old name of the key |interwordspace| is still supported but deprecated.
\item[\PrintKeyName{activate/softhyphen}] Boolean. luamode only.
The key activates/deactivates the replacing of hard hyphens from hyphenation
by soft hyphens. By default this is activated.
\item[\PrintKeyName{role/new-tag}] Allows to define new tag names, see
section \ref{sec:new-tag} for a description.
\item[\PrintKeyName{role/new-attribute}] This key takes two arguments and
declares an attribute. See \ref{sec:attributes}.
\item[\PrintKeyName{role/map-tags}] This key allows to remap the structure
tags. Currently it supports only two values: |false| (the default) and |pdf| which
maps all tags to their standard PDF role, e.g. |itemize| will be mapped to |L|.
\item[\PrintKeyName{para/tagging}] Boolean. This activates/deactivates
the automatic tagging of paragraphs, see \ref{sec:paratagging} for
more background. It uses the \texttt{para/begin} and
\texttt{para/end} hooks.
With more tagging support conditions will be added, that means the
code is bound to change! Paragraphs can appear in many unexpected
places and the code can easily break, so there is also a debug option to
see where such paragraphs are: the value \texttt{para} of the key \PrintKeyName{debug/show} described below.
\item[\PrintKeyName{para/tag}] String. This key changes the second tag
used by the paratagging code. The default tag is \texttt{text}, a
\LaTeX{} specific tag that is role mapped to \texttt{P}. A useful
local setting here can be \texttt{NonStruct}, which creates a
structure \enquote{without meaning}. For local changes it is
recommended to use the newer \cs{tagtool} command described below
instead of \cs{tagpdfsetup}.
\item[\PrintKeyName{para/maintag}] String. This key changes the first tag
used by the paratagging code. The default tag is \texttt{text-unit}, a
\LaTeX{} specific tag that is role mapped to \texttt{Part}.
For local changes it is
recommended to use the newer \cs{tagtool} command described below
instead of \cs{tagpdfsetup}.
\item[\PrintKeyName{page/tabsorder}] Choice key, possible values are
\PrintKeyName{row}, \PrintKeyName{column}, \PrintKeyName{structure},
\PrintKeyName{none}. This decides if a \verb+/Tabs+ value is
written to the dictionary of the page objects. Not really needed for
tagging itself, but one of the things you probably need for
accessibility checks. So I added it. Currently the tabsorder is the
same for all pages. Perhaps this should be changed \ldots.
\item[\PrintKeyName{activate/tagunmarked}] Boolean,\sidenote{luamode} initially
true. When this boolean is true, the lua code will try to mark
everything that has not been marked yet as an artifact. The benefit
is that one doesn't have to mark up every deco rule oneself. The
danger is that it perhaps marks things that shouldn't be marked --
it hasn't been tested yet with complicated documents containing
annotations etc. See also section~\ref{sec:lazy} for a discussion
about automatic tagging.
\item[\PrintKeyName{viewer/startstructure}] A structure number. If a \texttt{OpenAction}
is set in the PDF Catalog (which is normally the case if hyperref is used)
a structure destination pointing to the structure is added. The initial value is structure 1 (the \texttt{Document} structure), the default value is the current structure. The
key can be used more than once, the last setting will win.
\item[\PrintKeyName{debug/uncompress}] Sets both the \PDF{} compresslevel
and the \PDF{} objcompresslevel to 0 and so makes it possible to inspect the
\PDF{}. Not really useful anymore as this can also
be set in \cs{DocumentMetadata}.
\item[\PrintKeyName{debug}] This key knows a number of sub-keys to
set various debug options.
\begin{description}
\item[\PrintKeyName{debug/show}] This takes a comma list of keywords:
\texttt{spaces}/\texttt{spacesOff}: \sidenote{luamode}
That helps in lua mode to see where space glyphs will be
inserted if \PrintKeyName{activate/spaces} is activated.
This can also be activated with the now deprecated key |show-spaces|.
\texttt{para}/\texttt{paraOff}: This (locally)
activates/deactivates small red and green numbers in the places where
the paratagging hook code is used.
\item[\PrintKeyName{debug/log}] Choice key, possible values
\PrintKeyName{none}, \PrintKeyName{v}, \PrintKeyName{vv},
\PrintKeyName{vvv}, \PrintKeyName{all}. Sets the log level.
Changing the value currently affects mostly the luamode:
\enquote{higher} values give more messages in the log. The current
levels and messages have been set up in a quite ad-hoc manner and
will need improvement.
\end{description}
\end{description}
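As an illustration, the following sketch combines a few of the keys described above in one preamble call. The chosen values are only assumptions for a debugging session, not recommended settings.
\begin{taglstlisting}
\tagpdfsetup
 {
  page/tabsorder = structure,    % write a /Tabs entry to the page objects
  debug/log      = v,            % a few more messages in the log
  debug/show     = {para,spaces} % show paratagging hooks and space glyphs
 }
\end{taglstlisting}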
\begin{docCommands}
{
{doc name=tagtool,doc parameter=\marg{key-val}},
{doc name=tag_tool:n,doc parameter=\marg{key-val}}
}
\end{docCommands}
The tagging of document elements requires a variety of small
commands. This command will unify them under a common interface. This
is work-in-progress and syntax and implementation can change! While
the argument looks like a key-val \emph{list} (and currently is
actually one), this should not be relied on. Instead only one argument
should be used as the implementation will change to improve the
speed. Currently the following arguments are supported
\begin{description}
\item[\PrintKeyName{para/tagging}] Boolean. It will replace the
\cs{tagpdfparaOn} and \cs{tagpdfparaOff} commands.
\item[\PrintKeyName{para/maintag}] String. It allows to change the outer
tag used in the following automatically tagged paragraphs. The
setting is local.
\item[\PrintKeyName{para/tag}] String. It allows to change the inner
tag used in the following automatically tagged paragraphs. The
setting is local.
\item[\PrintKeyName{para/flattened}] Boolean. If set it will suppress
the outer structure in the automatic paratagging. This should be
applied to the start and end hook in the same way! The setting is
local.
\end{description}
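A small sketch of a local change with \cs{tagtool}, assuming that automatic paragraph tagging is active. In line with the advice above every call gets only one argument, and the group keeps the setting local.
\begin{taglstlisting}
\begingroup
% inner tag of the following paragraphs
\tagtool{para/tag=NonStruct}
This paragraph is put in a structure without meaning.\par
\endgroup
This paragraph uses the default tag again.
\end{taglstlisting}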
\section{Tagging}
PDF is a page orientated graphic format. It simply puts ink and glyphs
at various coordinates on a page. A simple stream of a page can look
like this\footnote{The appendix contains some remarks about the syntax
of a \PDF{} file}:
\begin{taglstlisting}[columns=fixed]
stream
BT
/F27 14.3462 Tf %select font
89.291 746.742 Td %move point
[(1)-574(Intro)-32(duction)]TJ %print text
/F24 10.9091 Tf %select font
0 -24.35 Td %move point
[(Let's)-331(start)]TJ %print text
205.635 -605.688 Td %move point
[(1)]TJ %print text
ET
endstream
\end{taglstlisting}
From this stream one can extract the characters and their placement on the page
but not their semantic meaning (the first line is actually a section heading,
the last the page number). And while in the example the order is correct
there is actually no guarantee that the stream contains the text in the order
it should be read.
Tagging means to enrich the \PDF{} with information about the \emph{semantic}
meaning and the \emph{reading order}. (Tagging can do more, one can also
store all sorts of layout information like font properties and indentation
with tags. But as I already wrote this package concentrates on the part of
tagging that is needed to improve accessibility.)
\subsection{Three tasks}
To tag a \PDF{} three tasks must be carried out:
\begin{enumerate}
\item
\textbf{The mark-content-task}:\sidenote{mc-task} The document must add
\enquote{labels} to the page stream which allows to identify and reference
the various chunks of text and other content.
This is the most difficult part of tagging -- both for the document writer
and for the package code. First, there can be quite a lot of
chunks as every one is a leaf node of the structure and so often a rather
small unit. Second, the chunks must be defined page-wise -- and
this is not easy when you don't know where the page breaks are.
Also in a standard document a lot of text is created automatically, e.g.
the toc, references, citations, list numbers etc., and it is not always
easy to mark them correctly.
\item \textbf{The structure-task}:\sidenote{struct-task} The document must
declare the structure. This means marking the start and end of
semantically connected portions of the document (correctly nested as a
tree). This too means some work for the document writer, but less than
for the mc-task: first, quite often the mc-task and the
structure-task can be combined, e.g. when you mark up a list number or
a tabular cell or a section header; second, one doesn't have to worry
about page breaks so quite often one can patch standard environments to
declare the structure. On the other hand a number of structures end in
\LaTeX\ only implicitly -- e.g. an item ends at the next item, so
getting the \PDF{} structure right still means that additional mark up
must be added.
\item \textbf{The tree management}:\sidenote{tree-task} Finally the
structure must be written into the \PDF{}. For every structure an
object of type \texttt{StructElem} must be created and flushed with
keys for the parents and the kids. A parent tree must be created to get
a reference from the mc-chunks to the parent structure. A role map must
be written, along with a number of dictionary entries. All this is hopefully
done automatically and correctly by the package \ldots.
\end{enumerate}
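The first two tasks can be sketched with the manual tagging commands of the package. The sketch assumes that automatic paragraph tagging is switched off and uses \cs{tagstructbegin}, \cs{tagstructend}, \cs{tagmcbegin} and \cs{tagmcend}; the tag names are only chosen for illustration. The third task needs no markup in the document.
\begin{taglstlisting}
\tagstructbegin{tag=Sect}            % struct-task: open the section
 \tagstructbegin{tag=H1}             % struct-task: open the heading
  \tagmcbegin{tag=H1}Introduction\tagmcend % mc-task: mark the chunk
 \tagstructend
 \tagstructbegin{tag=P}              % struct-task: open the paragraph
  \tagmcbegin{tag=P}Let's start\tagmcend   % mc-task: mark the chunk
 \tagstructend
\tagstructend                        % close the section
\end{taglstlisting}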
\begin{figure}[t!]
\begin{tcolorbox}[]
\minisec{Page stream with marked content}
\begin{tikzpicture}[baseline=(a.north),node distance=2pt,remember picture,
alt={Illustration of page stream with marked content}]
\node(start){\ldots~\ldots~\ldots};
\node[draw,base right = of start](a) {mc-chunk 1};
\node[draw,base right = of a](b) {mc-chunk 2};
\node[draw,base right = of b](c) {mc-chunk 3};
\node[draw,base right = of c](d) {mc-chunk 4};
\node[base right = of d] {\ldots~\ldots};
\end{tikzpicture}
\minisec{Structure}
\newlength\ydistance\setlength\ydistance{-0.8cm}
\begin{tikzpicture}[remember picture,baseline=(root.north),alt={Illustration of structure}]
\node[draw,anchor=base west] (root) at (0,0) {Sect (start section)};
\node[draw,anchor=base west] at (0.3,\ydistance) {H (header section)};
\node[draw,anchor=base west](aref) at (0.6,2\ydistance){mc-chunk 1};
\node[draw,anchor=base west](bref) at (0.6,3\ydistance){mc-chunk 2};
\node[draw,anchor=base west] at (0.3,4\ydistance){/H (end header)};
\node[draw,anchor=base west] at (0.3,5\ydistance){P (start paragraph)};
\node[draw,anchor=base west](cref) at (0.6,6\ydistance){mc-chunk 3};
\node[draw,anchor=base west](dref) at (0.6,7\ydistance){mc-chunk 4};
\node[draw,anchor=base west] at (0.3,8\ydistance){/P (end paragraph)};
\node[draw,anchor=base west] at (0,9\ydistance){/Sect (end section)};
\end{tikzpicture}
\begin{tikzpicture}[remember picture, overlay]
\draw[->,red](aref)-|(a);
\draw[->,red](bref)-|(b);
\draw[->,red](cref)-|(c);
\draw[->,red](dref)-|(d);
\end{tikzpicture}
\end{tcolorbox}
\caption{Schematic
description of the relation between marked content in the page stream and the
structure}
\end{figure}
\subsection{Task 1: Marking the chunks: the mark-content-step}
To be able to refer to parts of the text in the structure, the text in the
page stream must get \enquote{labels}. In the \PDF{} reference they are
called \enquote{marked content}. The three main variants needed here are:
\begin{description}
\item[Artifacts] They are marked with a pair of keywords, \texttt{BMC}
and \texttt{EMC}, which surround the text. \texttt{BMC} has a single
prefix argument, the fixed tag name \texttt{/Artifact}. Artifacts should
be used for irrelevant text and page content that should be ignored in
the structure. Sadly it is often not possible to leave such text simply
unmarked -- the accessibility tests in Acrobat and other validators
complain.
\begin{taglstlisting}
/Artifact BMC
text to be marked
/EMC
\end{taglstlisting}
\item[Artifacts with a type] They are marked with a pair of keywords,
\texttt{BDC} and \texttt{EMC}, which surround the text. \texttt{BDC}
has two arguments: again the tag name \texttt{/Artifact} and a
following dictionary which allows one to specify the type of the suppressed
info. Text in header and footer can e.g. be declared as pagination like this:
\begin{taglstlisting}
/Artifact <</Type /Pagination>> BDC
text to be marked
/EMC
\end{taglstlisting}
\item[Content] Content is also marked with a pair of keywords,
\texttt{BDC} and \texttt{EMC}. The first argument of \texttt{BDC} is a
tag name which describes the structural type of the text.\footnote{There
is quite some redundancy in the specification here. The structural type
is also set in the structure tree. One wonders if it isn't enough to
always use \texttt{/SPAN} here.} Examples are \texttt{/P} (paragraph),
\texttt{/H2} (heading), \texttt{/TD} (table cell). The reference
mentions a number of standard types but it is possible to add more or
to use different names.
In the second argument of \texttt{BDC} -- in the property dictionary -- more
data can be stored. \emph{Required} is an \texttt{/MCID}-key which takes an
integer as a value:
\begin{taglstlisting}
/H1 <</MCID 3>> BDC
text to be marked
/EMC
\end{taglstlisting}
This integer is used to identify the chunk when building the structure
tree. The chunks are numbered by page starting with 0. As the numbers are
also used as an index in an array there shouldn't be \enquote{holes} in the
numbering system. (It is perhaps possible to handle a numbering scheme not
starting at 0 and having holes, but it will enlarge the \PDF{} as one would
need dummy objects.)
It is possible to add more entries to the property dictionary, e.g. a
title, alternative text or a local language setting.
\end{description}
The needed markers can be added with low level code e.g. like this (in pdftex syntax):
\begin{taglstlisting}
\pdfliteral page {/H1 <</MCID 3>> BDC}%
text to be marked
\pdfliteral page {EMC}%
\end{taglstlisting}
This sounds easy. But there are quite a number of traps, mostly with pdfLaTeX:
\begin{enumerate}[beginpenalty=10000]
\item \PDF{} is a page oriented format. And this means that the start
\texttt{BDC}/\texttt{BMC} and the corresponding end \texttt{EMC}
must be on the same page. So marking e.g. a section title like in the
following example won't always work as the literal before the
section could end on the previous page:
\begin{taglstlisting}
\pdfliteral page {/H1 <</MCID 3>> BDC} %problem: possible pagebreak here
\section{mysection}
\pdfliteral page {EMC}%
\end{taglstlisting}
Using the literals \emph{inside} the section argument is better, but then
one has to take care that they don't wander into the header and the toc.
\item Literals are \enquote{whatsit} nodes and can change spacing, page
and line breaking. The literal \emph{after} the section in the
previous example could e.g. lead to a lonely section title at the end
of the page.
\item The \texttt{/MCID} numbers must be unique on a page. So you can't
use the literal in a saved box that you reuse in various places. This
is e.\,g. a problem with \texttt{longtable} as it saves the table
header and footer in a box.
\item The \texttt{/MCID}-chunks are leaf nodes in the structure tree, so
they shouldn't be nested.
\item Often text in a document is created automatically or moved around:
entries in the table of contents, index, bibliography and more. To
mark these text chunks correctly one has to analyze the code creating
such content to find suitable places to inject the literals.
\item There exist environments which process their content more than once
-- examples are \texttt{align} and \texttt{tabularx}.
So one has to check for duplicates and holes in the counting system.
\item \PDF{} is a page oriented format. And this means that the start and
the end marker must be on the same page \ldots\ \emph{so what to do
with normal paragraphs that split over pages??}. This question will
be discussed in subsection~\ref{sec:splitpara}.
\end{enumerate}
\subsubsection{Generic mode versus lua mode in the mc-task}
While in generic mode the commands insert the literals directly and so have
all the problems described above, the lua mode works quite differently: The
tagging commands don't insert literals but set some (global)
\emph{attributes} which are attached to all the following nodes. When the
page is shipped out some lua code is called which walks through the shipout
box and injects the literals at the places where the attributes change.
This means that quite a number of problems mentioned above are not relevant
for the lua mode:
\begin{enumerate}
\item Page breaks between start and end of the marker are
\emph{not} a problem. So you can mark a complete paragraph. If a page break
occurs directly after a start marker or before an end marker this can lead to
empty chunks in the \PDF{} and so bloat the \PDF{} a bit, but this is not
really a problem (compared to the size increase by the rest of the tagging).
\item The commands don't insert literals directly and so affect line and page
breaking much less.
\item The numbering of the MCIDs is done at shipout, so no label/ref system
is needed.
\item The code can do some marking automatically. Currently everything that
has not been marked up by the document is marked as artifact.
\end{enumerate}
\subsubsection{Commands to mark content and chunks}
In generic mode\sidenote{Generic mode only} it is vital that the end command is
executed on the same page as the begin command. So think carefully how to
place them. For strategies how to handle paragraphs that split over pages see
subsection~\ref{sec:splitpara}.
\begin{docCommands}
{
{doc name=tagmcbegin,doc parameter={\marg{key-val-list}}},
{doc name=tag_mc_begin:n,doc parameter={\marg{key-val-list}}}
}
\end{docCommands}
These commands insert the beginning of the marked content code in the \PDF{}.
They don't start a paragraph. \emph{They don't start a group}. Such markers
should not be nested. The command will warn you if this happens.
In the generic mode the commands insert literals. These are whatsits and so
can affect spacing. In lua mode they set an attribute \emph{globally}.
The key-val list understands the following keys:
\begin{description}
\item[\PrintKeyName{tag}] This key is optional. By default the tag name
of the surrounding structure is used, which normally should be fine.
But if needed the name can be set explicitly with this key. The value
of the key is typically one of the standard types listed in section
\ref{sec:new-tag} (without a slash at the beginning, this is added by the
code). It is possible to set up new tags, see the same section. The
value of the key is expanded, so it can be a command. The expansion
is passed unchanged to the \PDF{}, so it should with a starting slash
give a valid \PDF{} name (some ascii with numbers like \texttt{H4}
is fine).
\item[\PrintKeyName{artifact}] This will set up the marked content as an
artifact. The key should be used for content that should be ignored.
The key can take one of the values \PrintKeyName{pagination},
\PrintKeyName{pagination/header}, \PrintKeyName{pagination/footer},
\PrintKeyName{layout}, \PrintKeyName{page},
\PrintKeyName{background} and \PrintKeyName{notype} (this is the
default). Text in the header and footer should normally be marked
with \PrintKeyName{artifact=pagination} or
\PrintKeyName{pagination/header}, \PrintKeyName{pagination/footer}
but simply artifact (as it is now done automatically) should be ok
too.
It is not quite clear if rules and other decorative graphical objects
need to be marked up as artifacts. Acrobat seems not to mind if not, but
PAC~3 complained.
The validators complain if some text is not marked up, but it is not
quite clear if this is a serious problem.
The\sidenote{lua mode} lua mode will mark up everything unmarked as
\texttt{artifact=notype}. You can suppress this behavior by setting the
tagpdfsetup key \texttt{activate/tagunmarked} to false. See section
\ref{ssec:setup}.
\item[\PrintKeyName{stash}] Normally marked content will be stored in the
\enquote{current} structure. This may not be what you want. As an
example you may perhaps want to put a marginnote in the structure before
or after the paragraph in which it appears in the tex-code. With this boolean
key the content is marked but not stored in the kid-key of the current structure.
\item[\PrintKeyName{label}] This key sets a label by which you can call
the marked content \emph{later} in another structure (if it has been stashed
with the previous key). Internally the label name will start with
\texttt{tagpdf-}.
\item[\PrintKeyName{alt}]
This key inserts an \texttt{/Alt} value in the property dictionary
of the BDC operator. See section~\ref{sec:alt}.
The value is handled as a verbatim string: commands are
not expanded, but the value is expanded once first (so it works like
the key \texttt{alttext-o} in previous versions which has been
removed). If the value is empty, nothing will happen.
That means that you can do something like in the following listing
and it will insert \verb+\frac{a}{b}+ (hex encoded) in the \PDF{}.
\begin{taglstlisting}
\newcommand\myalttext{\frac{a}{b}}
\tagmcbegin{tag=P,alt=\myalttext}
\end{taglstlisting}
\item[\PrintKeyName{actualtext}] This key inserts an \texttt{/ActualText}
value in the property dictionary of the BDC operator. See
section~\ref{sec:alt}. The value is handled as a verbatim string:
commands are not expanded, but the value is expanded once first
(so it works like the key \texttt{actualtext-o} in previous versions
which has been removed). If the value is empty, nothing will happen.
That means that you can do something like in the following listing and
it will insert \verb+X+ (hex encoded) in the \PDF{}.
\begin{taglstlisting}
\newcommand\myactualtext{X}
\tagmcbegin{tag=Span,actualtext=\myactualtext}
\end{taglstlisting}
According to the PDF reference, \texttt{/ActualText} should only be used
on marked content sequences of type Span. This is not enforced by the code
currently. There is also some discussion going on, if
\texttt{/ActualText} can actually be used in a MC dictionary or if it
should be in a separate BDC-operator.
\item[\PrintKeyName{raw}] This key allows you to add more entries to the
properties dictionary. The value must be correct, low-level \PDF{}.
E.g. \verb+raw=/Alt (Hello)+ will insert an alternative text.
\end{description}
\begin{docCommands}
{
{doc name=tagmcend},
{doc name=tag_mc_end:}
}
\end{docCommands}
These commands insert the end code of the marked content. They don't end a
group and it doesn't matter if they are in a different group than the starting
commands. In generic mode both commands check if there has been a begin
marker and issue a warning if not. In luamode it is often possible to omit
the command, as the effect of the begin command ends with a new
\verb+\tagmcbegin+ anyway.
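As a minimal sketch of how the begin and end commands pair up, the following
fragment marks one chunk as content and a second one as a layout artifact. It
assumes that tagging is active, that the automatic paragraph tagging is not
interfering at this point, and that the surrounding \texttt{P} structure is
opened by hand; the text itself is only a placeholder.
\begin{taglstlisting}
\tagstructbegin{tag=P}
 \tagmcbegin{tag=P}% content chunk, kid of the P structure
  Some paragraph text.
 \tagmcend
 \tagmcbegin{artifact=layout}% ignored in the structure
  \hrulefill
 \tagmcend
\tagstructend
\end{taglstlisting}
In generic mode both pairs must end up on the same page as their begin
command.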
\begin{docCommands}
{
{doc name=tagmcuse,doc parameter=\marg{label}},
{doc name=tag_mc_use:n,doc parameter=\marg{label}}
}
\end{docCommands}
These commands allow you to record a marked content that you stashed away
into the current structure. Be aware that a marked content can be used only
once -- the command will warn you if you try to use it a second time.
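The interplay of the keys \PrintKeyName{stash} and \PrintKeyName{label} with
|\tagmcuse| can be sketched as follows. The label name \texttt{mynote} and
the choice of a second \texttt{P} structure as the new parent are only
assumptions for the illustration.
\begin{taglstlisting}
\tagstructbegin{tag=P}
 \tagmcbegin{tag=P}Main text \tagmcend
 \tagmcbegin{tag=P,stash,label=mynote}% marked, but not yet a kid
  text of the note
 \tagmcend
 \tagmcbegin{tag=P} more main text.\tagmcend
\tagstructend
\tagstructbegin{tag=P}
 \tagmcuse{mynote}% the stashed chunk becomes a kid here
\tagstructend
\end{taglstlisting}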
\begin{docCommands}
{
{doc name=tag_mc_end_push:},
{doc name=tag_mc_begin_pop:n,doc parameter=\marg{key-val-list}}
}\end{docCommands}
If there is an open mc chunk,
the first command ends it and pushes its tag on a stack. If there is no
open chunk, it puts $-1$ on the stack (for debugging).
The second command removes a value from the stack. If it is different from
$-1$ it opens a tag with it. The command is mainly meant to be used inside hooks and command
definitions so there is only an expl3 version. Perhaps other content of the mc-dictionary (for example the Lang) needs to be saved on the stack too.
\begin{docCommands}
{
{doc name=tagmcifinTF,doc parameter=\marg{true code}\marg{false code}},
{doc name=tag_mc_if_in:TF,doc parameter=\marg{true code}\marg{false code}}
}\end{docCommands}
These commands check if a marked content is currently open and allow you to e.g. add the end marker if yes.
In \emph{generic mode}, where marked content commands shouldn't be nested, it works with a global boolean.
In \emph{lua mode} it tests if the mc-attribute is currently unset. You can't test the nesting level with it!
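A typical use, sketched here under the assumption that the code runs at a
place where an open chunk may or may not exist (for example at the start of a
user command), is to close the chunk only if there is one:
\begin{taglstlisting}
\tagmcifinTF
 {\tagmcend}% a chunk is open: close it
 {}% nothing is open: do nothing
\end{taglstlisting}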
\begin{docCommand}{tag_mc_reset_box:N}{\marg{box}}\end{docCommand}
In lua mode this command will process the given box and reset all mc
related attributes in the box to the current values. This means that
if the box is used all its contents will be a kid of the current
structure. This should (probably) only be used on boxes which don't
contain tagging commands. See below section~\ref{sec:savebox} for
more details.
\subsubsection{Retrieving data} \label{sec:retrieve}
With more elaborate tagging the need arises to retrieve and store current data.
\begin{docCommand}{tag_get:n}{\marg{key word}}\end{docCommand}
This (expandable) command returns the values of some variables. Currently, the working key words are
\begin{itemize}
\item \verb+mc_tag+: the tag name of the current mc-chunk
\item \verb+struct_tag+: the tag name of the current structure
\item \verb+struct_id+: The ID of the current structure. This is a
string and is returned including parentheses.
\item \verb+struct_num+: This returns a number and works also if only
\pkg{tagpdf-base} has been loaded, but then doesn't give the same
output: if \pkg{tagpdf} is loaded and tagging is active,
\verb+struct_num+ gives the number of the currently active structure, so
it reverts to the parent number if a structure is closed. If only
\pkg{tagpdf-base} is loaded nesting of structures is not tracked and
so the command gives back the number of the last structure that has
been created. In luatex this number can also be retrieved with the
lua function \verb+ltx.tag.get_struct_num()+.
\item \verb+struct_counter+: This returns a number and works also if
only \pkg{tagpdf-base} has been loaded. It gives back the state of
the absolute structure counter and so the number of the last structure
that has been created. This can be used to detect if in a piece of
code there are structure commands. Be aware that this is a \LaTeX{}
counter and so is reset in some places.
In luatex this number can also be retrieved with the
lua function \verb+ltx.tag.get_struct_counter()+. (The number of the next
structure to be created is then \verb+ltx.tag.get_struct_counter()+ increased by
one; this can also be retrieved with the function \verb+ltx.tag.get_struct_num_next()+.)
\item \verb+mc_counter+: This returns a number and works also if only
\pkg{tagpdf-base} has been loaded. It gives back the state of the
absolute mc-counter and so the number of the last mc-chunk that has been
created. This can be used to detect if in a piece of code there are
mc-commands.
\end{itemize}
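Because the command is expandable it can be used inside an expanding
argument. The following sketch writes the tag and the number of the current
structure to the terminal; the wording of the message is arbitrary and the
fragment assumes it is placed somewhere inside the document body.
\begin{taglstlisting}
\ExplSyntaxOn
\iow_term:x
  {
    structure:~\tag_get:n{struct_tag},~
    number:~\tag_get:n{struct_num}
  }
\ExplSyntaxOff
\end{taglstlisting}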
\subsubsection{Luamode: global or not global -- that is the question}\label{sec:global-local}
In\sidenote{lua mode} luamode the mc-commands set and unset an
attribute to mark the nodes. One can view such an attribute like a font
change or a color: they affect all following chars and glue nodes until
stopped.
From version 0.6 to 0.82 the attributes were set locally. This had the
advantage that the attributes didn't spill over into areas where they are not
wanted like the header and footer or the background pictures. But it had the
disadvantage that it was difficult for an inner structure to correctly
interrupt the outer mc-chunk if it can't control the group level. For example
this didn't work due to the grouping inserted by the user:
\begin{taglstlisting}
\tagstructbegin{tag=P}
\tagmcbegin{tag=P}
Start paragraph
{% user grouping
\tag_mc_end_push:
\tagstructbegin{tag=Em}
\tagmcbegin{tag=Em}
\emph{Emphasized test}
\tagmcend
\tagstructend
\tag_mc_begin_pop:n{}
}% user grouping
Continuation of paragraph
\tagmcend
\tagstructend
\end{taglstlisting}
The reading order was then wrong, and the \emph{emphasized text} moved to the end of the structure.
So starting with version 0.9 this has been reverted. The attribute is now global again.
This solves the \enquote{interruption} problem, but has its price: Material inserted by the output routine
must be properly guarded. For example
\begin{taglstlisting}
\DocumentMetadata{uncompress}
\documentclass{article}
\pagestyle{headings}
\begin{document}
\sectionmark{HEADER}
\AddToHook{shipout/background}{\put(5cm,-5cm){BACKGROUND}}
\tagmcbegin{tag=P}Page 1\newpage Page 2\tagmcend
\end{document}
\end{taglstlisting}
Here the header and the background code on the \emph{first} page will be marked up as paragraph
and added as chunk to the document structure. The header and the background code on
the \emph{second} page will be marked as artifact. The following figure shows what the tags look
like.
\includegraphics[alt=Show tags of examples]{global-ex}
It is therefore from now on important to correctly markup such code. Header
and footer are now marked as artifacts (see below). If they contain code
which needs a different markup it still must be added explicitly. With
packages like \pkg{fancyhdr} or \pkg{scrlayer-scrpage} it is quite easy to
add the needed code.
\subsubsection{Tips}
\begin{itemize}
\item Mark commands inside floats should work fine (but need perhaps some compilation rounds in generic mode).
\item In case you want to use it inside a \verb+\savebox+ (or some
command that saves the text internally in a box): If the box is used
directly, there is probably no problem. If the use is later, stash
the marked content and add the needed \verb+\tagmcuse+ directly
before or after the box when you use it.
\item Don't use a saved box with markers twice.
\item If boxes are unboxed you will have to analyze the \PDF{} to
check if everything is ok.
\item If you use complicated structures and commands (breakable boxes
like the one from \pkg{tcolorbox}, \pkg{multicol}, many footnotes)
you will have to check the \PDF{}.
\end{itemize}
\begin{figure}
\input{link-figure-input}
\caption{Structure needed for a link annotation}\label{fig:linkannot}
\end{figure}
\subsubsection{Header and Footer}\label{sec:header-footer}
Tagging header and footer is not trivial. First, on the technical side, header and footer are
typeset and attached to the page during the output routine and the exact timing is not really under
control of the user. That means that when adding tagging there one has to be careful not to disturb
the tagging of the main text. This is mostly important in luamode where the attributes are global
and can easily spill over.
Second, one has to decide how to tag: in many cases header and footer can simply be ignored,
they only contain information which is meant to visually guide the reader and so is not relevant for
the structure. This means that normally they should be tagged as artifacts. The PDF reference offers
a rather large number of options here to describe different versions of \enquote{ignore this}.
Typically the header and footer should get the type \texttt{Pagination} and this type has a number of subtypes like
Header, Footer, PageNum. It is not yet known if any technology actually makes use of this info.
But they can also contain meaningful content, for example an address. In such cases the content
should be added to the structure (where?) but, even if the address is
repeated on every page, ideally only once. All this needs some thought both from the users and from the packages and code
providing support for header and footers.
For now tagpdf has added some first support for automatic tagging:
Starting with version 0.92 header and footer are by default automatically marked up as (simple) artifacts.
With the key \PrintKeyName{exclude-header-footer} the behavior can be
changed: The value \texttt{false} disables the automatic tagging, the
value \texttt{pagination} additionally adds an \texttt{/Artifact}
structure with the attribute \texttt{/Pagination}.
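As a small sketch, and assuming that the key is set in the preamble through
|\tagpdfsetup| like the other setup keys, switching to the pagination variant
would look like this:
\begin{taglstlisting}
\tagpdfsetup{exclude-header-footer=pagination}
\end{taglstlisting}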
If some additional markup (or even a structure) is wanted, something like this should be used (here with
the syntax of the \pkg{fancyhdr} package) to close the open mc-chunk and restart it after the content:
\begin{taglstlisting}
\ExplSyntaxOn
\cfoot{\leavevmode
\tag_mc_end_push:
\tagmcbegin{artifact=pagination/footer}
\thepage
\tagmcend
\tag_mc_begin_pop:n{artifact}}
\ExplSyntaxOff
\end{taglstlisting}
\subsubsection{Links and other annotations}\label{sec:link+annot}
Annotations (like links or form field annotations) are objects
associated with a geometric region of the page rather than with a
particular object in its content stream. Any connection between a link
or a form field and the text is based solely on visual appearance (the
link text is in the same region, or there is empty space for the form
field annotation) rather than on an explicitly specified association.
To connect such an annotation with the structure and so with
surrounding or underlying text a specific structure has to be added,
see figure~\ref{fig:linkannot}: The annotation is added to a structure
element as an object reference. It is not referenced directly but
through an intermediate object of type OBJR. To the dictionary of the
annotation a \texttt{/StructParent} entry must be added, the value is
a number which is then used in the ParentTree to define a relationship
between the annotation and the parent structure element.
To support this, \pkg{tagpdf} currently offers two commands:
\begin{docCommand}{tag_struct_parent_int:}{}\end{docCommand}
This inserts the current value of a global counter used to track such
objects. It can be used to add the \texttt{/StructParent} value to the
annotation dictionary.
\begin{docCommand}{tag_struct_insert_annot:nn}{\marg{object reference}\marg{struct parent number}}\end{docCommand}
This will insert the annotation described by the object reference into
the current structure by creating the OBJR object. It will also add
the necessary entry to the parent tree and increase the global counter
referred to by |\tag_struct_parent_int:|. It does nothing if
(structure) tagging is not activated.
Attention! As the second command increases the global counter at the
end it changes the value given back by the first. That means that if
nesting is involved care must be taken that the correct number is
used. This should be easy to fulfill for most annotations, as they
are boxes. There the second command should ideally be used directly
after the annotation and it can make use of
|\tag_struct_parent_int:|. For links nesting is theoretically
possible, and it could be that future versions need more sophisticated
handling here.
In environments which process their content twice like tabularx or
align it would be best to exclude the second command from the
trial step, but this will need better support from these environments.
Using these commands is rarely needed: Since version 0.81
\pkg{tagpdf} already handles (unnested) links, and form fields created
with the \pkg{l3pdffield-testphase} package will be handled by this
package.
The following listing shows low-level code to create a link where the two
commands are used:
\begin{taglstlisting}
\pdfextension startlink
attr
{
/StructParent \tag_struct_parent_int: %<----
}
user {
/Subtype/Link
/A
<<
/Type/Action
/S/URI
/URI(http://www.dante.de)
>>
}
This is a link.
\pdfextension endlink
\tag_struct_insert_annot:xx {\pdfannot_link_ref_last:}{\tag_struct_parent_int:}
\end{taglstlisting}
\subsubsection{Math}
Math is still a problem but some progress has been made.
To tag math you have to surround it with a \texttt{Formula} structure. But the content of such a structure is handled by readers as a black box so additional data is needed for accessibility.
There are a number of theoretical options here:
\begin{enumerate}
\item One can add an alternative text (\texttt{/Alt}) or an \texttt{/ActualText}
to the structure element: either some text manually provided by the author or (with
the math module in the latex-lab bundle) the \LaTeX-source.
\item One can add an alternative text (\texttt{/Alt} or \texttt{/ActualText})
to the MC-chunks.
\item One can build inside the \texttt{Formula} structure element a tree with MathML structure elements; with PDF 2.0 this does not require declaring new tags as the MathML namespace is built in.
\item One can in PDF 2.0 attach a MathML file and/or the \LaTeX-source as associated file to the \texttt{Formula} structure (or to one or more MC-chunks).
\end{enumerate}
The question is how these work in reality.
Options 1 and 2 give not too bad results
with a screen reader, but can
require manual work and if you are unlucky the reader drops
important parts of the math (like punctuation symbols).
Exploring the equation is not possible.
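Option 1 could be sketched as follows for a manually tagged inline formula.
This assumes that the structure command accepts an \PrintKeyName{alt} key in
the same way as the mc-commands described above, that no automatic math
tagging is active, and that the spoken description has been written by the
author.
\begin{taglstlisting}
\tagstructbegin{tag=Formula,alt={a squared plus b squared equals c squared}}
 \tagmcbegin{tag=Formula}
  $a^2+b^2=c^2$
 \tagmcend
\tagstructend
\end{taglstlisting}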
Option 3 creates many structure elements.
E.g. I have seen an example where \emph{every single
symbol} has been marked up with tags from MathML along with an
\texttt{/ActualText} entry and an entry with alternate text which
describes how to read the symbol. The \PDF{} then looked like this
\begin{taglstlisting}
/mn <</Alt( : open bracket: four )>>BDC
...
/mn <</Alt( third s )>>BDC
...
/mo <</Alt( times )>>BDC
\end{taglstlisting}
If this is really the way to go one would need some script to add the
mark-up as doing it manually is too much work and would make the
source unreadable -- at least with pdflatex and the generic mode. In
lua mode it is possible to hook into the \texttt{mlist\_to\_hlist}
callback and add markers automatically. A first implementation in this direction
has been done by Marcel Krüger in the luamml project. But up to now it has not been possible
to test the usability of this approach: With the exception of the html derivation
with ngpdf no PDF-viewer/screen reader combination
seems to make use of such structures.
I'm not sure anyway that this is the best way to do math. It looks rather
odd that a document should have to tell a screen reader in such detail
how to read an equation.
The last option 4 has been implemented in the math module in the \texttt{latex-lab}
bundle. Here happily a proof of
concept was possible: With development versions of Foxit and the NVDA reader
it was possible to access an attached MathML and get speech output from it \cite{todasoifferdeims2024,mittelbachfischerdeims2024}. See also \cite{mathexamples} for some
examples and section~\ref{sec:alt} for some more remarks and tests.
\subsubsection{Split paragraphs}\label{sec:splitpara}
%TODO: think about marginnote! Aside?
A\sidenote{Generic mode only} problem in generic mode are paragraphs
with page breaks. As already mentioned the end marker \texttt{EMC}
must be added on the same page as the begin marker. But it is in
pdflatex \emph{very} difficult to inject something at the page break
automatically. One can manipulate the shipout box to some extent in
the output routine, but this is not easy and it gets even more
difficult if inserts like footnotes and floats are involved: the end
of the paragraph is then somewhere in the middle of the box.
So with pdflatex in generic mode one previously had to do the splitting manually.
The example \texttt{mc-manual-para-split} demonstrates how this can be
done. The general idea was to use \verb+\vadjust+ in the right place:
\begin{taglstlisting}
\tagmcbegin{tag=P}
...
fringilla, ligula wisi commodo felis, ut adipiscing felis dui in
enim. Suspendisse malesuada ultrices ante.% page break
\vadjust{\tagmcend\pagebreak\tagmcbegin{tag=P}}
Pellentesque scelerisque
...
sit amet, lacus.\tagmcend
\end{taglstlisting}
Starting with version 0.92 there is code which resolves this
problem. Basically it works like this: every mc-command issues a mark
command (actually two slightly different ones). When the page is built in
the output routine these mark commands are inspected and from them
\LaTeX{} can deduce if there is an mc-chunk which must be closed or
reopened. The method is described in Frank Mittelbach's talk at
TUG~2021 \enquote{Taming the beast — Advances in paragraph tagging
with pdfTeX and XeTeX} \url{https://youtu.be/SZHIeevyo3U?t=19551}.
Please note
\begin{itemize}
\item Typically you will need more compilations than previously. Don't
rely on the rerun messages; if something looks wrong, rerun.
\item The code relies on related |\tagmcbegin| and |\tagmcend| commands
being at the same boxing level. If one is in a box (which hides the
marks) and the other in the main galley, things will go wrong (\texttt{longtable}
is for example problematic).
\end{itemize}
\subsubsection{Automatic tagging of paragraphs}\label{sec:paratagging}
Another feature that emerged from the \LaTeX{} tagged PDF project are hooks
at the begin and end of paragraphs. \pkg{tagpdf} makes use of these hooks to
tag paragraphs. In the first version it added only one structure, but this
proved to be not adequate:
Paragraphs in \LaTeX{} can be nested, e.g., you can have a paragraph
containing a display quote, which in turn consists of more than one
(sub)paragraph, followed by some more text which all belongs to the
same outer paragraph.
In the \PDF{} model and in the HTML model that is not supported: the rules in
\PDF{} specification do not allow \texttt{P}-structures to be nested --- a
limitation that conflicts with real life, given that such constructs are
quite normal in spoken and written language.
The approach we take (starting with March 2023, version 0.98e) to resolve
this is to model such \enquote{big} paragraphs with a structure named
\texttt{text-unit} and use \texttt{P} (under the name \texttt{text}) only for
(portions of) the actual paragraph text in a way that the \texttt{P}s are not
nested. As a result we have for a simple paragraph two structures:
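\begin{taglstlisting}
<text-unit>
  <text>
    The paragraph text ...
  </text>
</text-unit>
\end{taglstlisting}
and for a paragraph containing a display element:
\begin{taglstlisting}
<text-unit>
  <text>
    The paragraph text before the display element ...
  </text>
  <display element structure>
    Content of the display structure possibly involving inner <text> tags
  </display element structure>
  <text>
    ... continuing the outer paragraph text
  </text>
</text-unit>
\end{taglstlisting}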
In other words such a display block is always embedded in a |<text-unit>|
structure, possibly preceded by a |<text>|\ldots|</text>| block and possibly
followed by one, though both such blocks are optional. More information about
this can be found in the documentation of \texttt{latex-lab-block-tagging}.
As a consequence \pkg{tagpdf} now adds two structures if paratagging is
activated. The new code to tag display blocks extends this code to handle the
nesting of lists and other display structures.
The automatic tagging requires that for every begin of a paragraph with the
begin hook code there is a corresponding end with the closing hook code. This
can fail, e.g. if a |vbox| doesn't correctly issue a |\par| at the end. If
this happens the tagging structure can get very confused. At the end of the
document \pkg{tagpdf} checks if the number of outer and inner start and end
paragraph structures created with the automatic paratagging code are equal
and it will error if not.
The automatic tagging of paragraphs can be deactivated completely or only the
outer level with the |\tagtool| keys |para| and |para-flattened| or with the
(now deprecated) commands |\tagpdfparaOn| and |\tagpdfparaOff|.
Nesting the activation and deactivation of the tagging of paragraphs can be
quite difficult. For example if it is unclear if the inner code issues a
|\par| or not it is not trivial to exclude an end hook for every excluded
begin hook. In such cases it can be easier to use the |paratag| key with the
value |NonStruct| to convert some |P|-structures into |NonStruct|-structures
without real meaning.
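Both approaches can be sketched as follows. The fragment assumes that the
keys are boolean or take a tag name as described above and that they are set
with |\tagtool| inside a group; the content is only a placeholder.
\begin{taglstlisting}
% variant 1: switch off paragraph tagging locally
{\tagtool{para=false}%
 content that must not get automatic paragraph structures\par}
% variant 2: keep the hooks balanced, but neutralize the tag
{\tagtool{paratag=NonStruct}%
 content where it is unclear if a \par is issued\par}
\end{taglstlisting}
With the second variant every begin hook still gets its end hook, so the
check at the end of the document is not disturbed.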
\subsection{Task 2: Marking the structure}
The structure is represented in the \PDF{} with a number of objects of type
\texttt{StructElem} which build a tree: each of these objects points back to
its parent and normally has a number of kid elements, which are either again
structure elements or -- as leaves of the tree -- the marked content chunks
marked up with the \verb+tagmc+-commands. The root of the tree is the
\texttt{StructTreeRoot}.
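For illustration, a sketch of a small tree built by hand: one \texttt{Sect}
structure with a \texttt{P} kid whose leaf is a marked content chunk. It uses
the \verb+tagstruct+-commands described later in this section and assumes that
the automatic tagging of paragraphs is switched off.
\begin{taglstlisting}
\tagstructbegin{tag=Sect}  % structure element, kid of the current structure
 \tagstructbegin{tag=P}    % structure element, kid of Sect
  \tagmcbegin{tag=P}       % leaf: a marked content chunk
   Some paragraph text.
  \tagmcend
 \tagstructend
\tagstructend
\end{taglstlisting}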
\subsubsection{Structure types}
The tree should reflect the \emph{semantic} meaning of the text. That means
that the text should be marked as section, list, table head, table cell and
so on. A number of standard structure types are predefined, see section
\ref{sec:new-tag}, but it is allowed to create more. If you want to use types
of your own, you must declare them. E.g. this declares two new types
\texttt{TAB} and \texttt{FIG} and bases them on \texttt{P}:
\begin{taglstlisting}
\tagpdfsetup{
role/new-tag = TAB/P,
role/new-tag = FIG/P,
}
\end{taglstlisting}
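Once declared, such a type can be used like a standard one. A short sketch,
using the type \texttt{TAB} declared above:
\begin{taglstlisting}
\tagstructbegin{tag=TAB}
 \tagmcbegin{tag=TAB}
  content of the structure
 \tagmcend
\tagstructend
\end{taglstlisting}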
\subsubsection{Sectioning}
The sectioning units can be structured in two ways: a flat, html-like version
and a more xml-like version (basically deprecated in PDF/UA-2). The flat version
creates a structure like this:
\begin{taglstlisting}
<H1>section header</H1>
<P>text</P>
<H2>subsection header</H2>
...
\end{taglstlisting}
So here the headings are marked according to their level with \texttt{H1}, \texttt{H2}, etc.
In the xml-like tree the complete text of a sectioning unit is surrounded
with the \texttt{Sect} tag, and all headers with the tag \texttt{H}. Here the
nesting defines the level of a sectioning heading.
\begin{taglstlisting}
<Sect>
 <H>section heading</H>
 <P>text</P>
 <Sect>
  <H>subsection heading</H>
  ...
 </Sect>
</Sect>
\end{taglstlisting}
The flat version is more \LaTeX-like and it is rather straightforward to
patch \verb+\chapter+, \verb+\section+ and so on to insert the appropriate
\texttt{H\ldots} start and end markers. The xml-like tree is more difficult
to automate. It has been implemented in the \texttt{sec} module in \texttt{latex-lab}, but can break
if sectioning commands are hidden inside boxes.
\subsubsection{Commands to define the structure}
The following commands can be used to define the tree structure:
\begin{docCommands}
{
{doc name=tagstructbegin,doc parameter=\marg{key-val-list}},
{doc name=tag_struct_begin:n,doc parameter=\marg{key-val-list}}
}\end{docCommands}
These commands start a new structure. They don't start a group. They set all their values globally.
The key-val list understands the following keys:
\begin{description}
\item[\PrintKeyName{tag}] This is required. The value of the key is
normally one of the standard types listed in section
\ref{sec:new-tag}. It is possible to set up new tags/types, see the
same section. The value can also be of the form |type/NS|, where
|NS| is the shorthand of a declared name space. Currently the
name spaces |pdf|, |pdf2|, |mathml| and |user| are defined. This
makes it possible to use a different name space than the one connected by
default to the tag, but normally this should not be needed.
\item[\PrintKeyName{stash}] Normally a new structure inserts itself
as a kid into the currently active structure. This key prohibits
this. The structure is nevertheless from now on \enquote{the
current active structure} and parent for following marked
content and structures.
\item[\PrintKeyName{label}] This key sets a label by which one can
refer to the structure. Currently the key writes a property whose
name starts with \texttt{tagpdfstruct-} to the aux-file with the two
attributes \texttt{tagstruct} (the structure number) and
\texttt{tagstructobj} (the object reference) but also stores the
name and the structure number into a prop for use in the current compilation.
The label is e.g. used by \cs{tag\_struct\_use:n} and by the |ref|
key (which can refer to future structures).
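A sketch of how |stash| and |label| can be combined: the structure is
created without a parent and inserted later with \cs{tagstructuse}. The
label name \texttt{mystash} is only an example.
\begin{taglstlisting}
\tagstructbegin{tag=P,stash,label=mystash}
 \tagmcbegin{tag=P}
  text of the stashed structure
 \tagmcend
\tagstructend
% later, inside the structure that should become the parent:
\tagstructbegin{tag=Div}
 \tagstructuse{mystash}
\tagstructend
\end{taglstlisting}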
\item[\PrintKeyName{parent}] With the parent key one can choose another
parent. The value is a structure number which must refer to an
already existing, previously created structure. Such a structure
number can have been stored previously with \cs{tag\_get:n}, but one
can also use a label on the parent structure and then use
\cs{property\_ref:nn}\verb+{tagpdfstruct-label}{tagstruct}+ to retrieve
it.
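A sketch of the first variant in the expl3 layer. It assumes that
\cs{tag\_get:n} with the keyword \texttt{struct\_num} expands to the number of
the current structure; the variable name is hypothetical.
\begin{taglstlisting}
\ExplSyntaxOn
\tl_new:N \g_my_parent_tl % hypothetical variable
\tagstructbegin{tag=Div}
 % store the number of the Div structure
 \tl_gset:Ne \g_my_parent_tl { \tag_get:n {struct_num} }
 % content of the Div
\tagstructend
% later: add a new structure as kid of the stored Div
\tagstructbegin{tag=P,parent=\g_my_parent_tl}
 % content of the P
\tagstructend
\ExplSyntaxOff
\end{taglstlisting}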
\item[\PrintKeyName{firstkid}] If this key is used the structure is
added at the left of the kids of the parent structure (if the structure is not stashed). This means that it will be the first kid of the structure (unless some
later structure uses the key too). This can be needed e.g. for a caption as
the PDF reference requires it to be the first or last kid of its structure.
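A sketch for a caption that is created after the rows of a table but should
nevertheless be the first kid of the \texttt{Table} structure (the row
structures are only indicated by a comment):
\begin{taglstlisting}
\tagstructbegin{tag=Table}
 % ... structures for the rows of the table ...
 % the caption is created last but added in front of the rows:
 \tagstructbegin{tag=Caption,firstkid}
  \tagmcbegin{tag=Caption}
   Text of the caption
  \tagmcend
 \tagstructend
\tagstructend
\end{taglstlisting}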
\item[\PrintKeyName{alt}] This key inserts an \texttt{/Alt} value in the
dictionary of the structure object, see section~\ref{sec:alt}. The value
is handled as a verbatim string and hex encoded. The value will be
expanded once first (so it works like the key \texttt{alttext-o} in
previous versions, which has been removed). If the value is empty,
nothing will happen.
That means that you can do something like this:
\begin{taglstlisting}
\newcommand\myalttext{\frac{a}{b}}
\tagstructbegin{tag=P,alt=\myalttext}
\end{taglstlisting}
and it will insert \verb+\frac{a}{b}+ (hex encoded) in the
\PDF{}. If the text begins with a command that should not
be expanded, protect it, e.g. with a \verb+\empty+.
\item[\PrintKeyName{actualtext}] This key inserts an \texttt{/ActualText}
value in the dictionary of the structure object, see
section~\ref{sec:alt}. The value is handled as a verbatim string. The
value will be expanded once first (so it works like the key
\texttt{actualtext-o} in previous versions, which has been removed). If
the value is empty, nothing will happen.
That means that you can do something like this:
\begin{taglstlisting}
\newcommand\myactualtext{X}
\tagstructbegin{tag=P,actualtext=\myactualtext}
\end{taglstlisting}
and it will insert \verb+X+ (hex encoded) in the \PDF{}. If
the text begins with a command that should not be expanded,
protect it, e.g. with a \verb+\empty+.
\item[\PrintKeyName{attribute}] This key takes as argument a comma
list of attribute names (use braces to protect the commas from
the external key-val parser) and adds one or more
attribute dictionary entries to the structure object. As an
example:
\begin{taglstlisting}
\tagstructbegin{tag=TH,attribute= TH-row}
\end{taglstlisting}
See also section~\ref{sec:attributes}.
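A sketch with more than one attribute, where the braces hide the comma from
the key-val parser. It assumes that both attribute names have been declared
before with the setup key |role/new-attribute|; the name
\texttt{TH-colspan} and its content are only an example.
\begin{taglstlisting}
\tagpdfsetup{
 role/new-attribute = {TH-row}{/O /Table /Scope /Row},
 role/new-attribute = {TH-colspan}{/O /Table /ColSpan 2},
}
\tagstructbegin{tag=TH,attribute={TH-row,TH-colspan}}
\end{taglstlisting}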
\item[\PrintKeyName{attribute-class}] This key takes as argument a
comma list of attribute names (use braces to protect the commas
from the external key-val parser) and adds them as
attribute classes to the structure object. As an example:
\begin{taglstlisting}
\tagstructbegin{tag=TH,attribute-class= TH-row}
\end{taglstlisting}
See also section~\ref{sec:attributes}.
\item[\PrintKeyName{title}] This key allows to set the dictionary
entry \texttt{/T} (for a title) in the structure object. The value
is handled as verbatim string and hex encoded. Commands are not
expanded.
\item[\PrintKeyName{title-o}] This key allows to set the dictionary
entry \texttt{/T} in the structure object. The value is expanded
once and then handled as verbatim string like the
\PrintKeyName{title} key.
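A sketch of the difference between the two keys; the command
\verb+\mytitle+ is only an example.
\begin{taglstlisting}
\newcommand\mytitle{Introduction}
% stores the text Introduction (hex encoded):
\tagstructbegin{tag=Sect,title=Introduction}
% also stores Introduction, as \mytitle is expanded once:
\tagstructbegin{tag=Sect,title-o=\mytitle}
% stores the unexpanded command name, not its text:
\tagstructbegin{tag=Sect,title=\mytitle}
\end{taglstlisting}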
\item[\PrintKeyName{AF}] This key allows to reference an associated
file in the structure element. The value should be the name of
an object pointing to the \texttt{/Filespec} dictionary as
expected by \verb+\pdf_object_ref:n+ from a current
\texttt{l3kernel}. For example:
\begin{taglstlisting}
\group_begin:
\pdfdict_put:nnn {l_pdffile/Filespec} {AFRelationship}{/Supplement}
\pdffile_embed_file:nnn{example-input-file.tex}{}{tag/AFtest}
\group_end:
\tagstructbegin{tag=P,AF=tag/AFtest}
\end{taglstlisting}
As shown, the wanted AFRelationship can be set by filling the dictionary
with the value. The mime type is detected automatically here, but for
unknown types it can be set too. See the \texttt{l3pdffile}
documentation for details. Associated files are a concept new in PDF
2.0, but the code currently doesn't check the pdf version; it is your
responsibility to set it (this can be done with the \texttt{pdfversion}
key in \verb+\DocumentMetadata+).
\item[\PrintKeyName{root-AF}] This key allows to reference an
associated file in the root structure element. Using the root can
be useful e.g. to add a css-file. When converting the pdf to
html with e.g. ngpdf \cite{ngpdf} this css-file is then referenced in the head of the html.
\item[\PrintKeyName{root-supplemental-file}] This is a variant of the previous key. It takes as argument a file name. It then embeds this file with \texttt{/AFRelationship /Supplement} and appends it as associated file to the structure root. ngpdf \cite{ngpdf} will store a \texttt{.css} attached in this way and reference it in the head of the html. If a \texttt{html} is attached in this way, ngpdf will
copy the content into the head of the derived html. This means that the content of such an html file should normally be some html snippet suitable for the head, e.g. some css-code inside \texttt{