XML and the Elsevier DTD family
Elsevier’s book and journal content is based on XML (eXtensible Markup Language). XML documents are structured (“tagged”) independently of their presentation, and the set of tags can be extended by whoever develops the XML standard in use.
Elsevier bases its workflow for primary book and journal publications on the “XML-first” principle: all articles and books are converted to XML as they come in, and this XML is used to prepare all output, irrespective of the format.
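As a rough illustration of the “XML-first” idea, the following minimal sketch derives two output formats from one XML source. The element names (article, title, para) and the helper functions to_html and to_text are hypothetical and much simpler than the real Elsevier DTDs; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, simplified markup; NOT the actual Elsevier DTD structure.
SOURCE = """<article>
  <title>Sample title</title>
  <para>First paragraph.</para>
  <para>Second paragraph with a &lt; sign.</para>
</article>"""


def to_html(root):
    """Render the XML source as a small HTML fragment."""
    parts = [f"<h1>{escape(root.findtext('title', default=''))}</h1>"]
    parts += [f"<p>{escape(p.text or '')}</p>" for p in root.iter("para")]
    return "\n".join(parts)


def to_text(root):
    """Render the same XML source as plain text."""
    lines = [root.findtext("title", default="").upper(), ""]
    lines += [p.text or "" for p in root.iter("para")]
    return "\n".join(lines)


# One parsed source feeds every output format.
root = ET.fromstring(SOURCE)
html_output = to_html(root)
text_output = to_text(root)

print(html_output)
print(text_output)
```

Because both renderers read the same parsed tree, a correction made in the XML source shows up in every derived format.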
To define the XML it employs, Elsevier uses DTDs (Document Type Definitions); a DTD describes which elements may be used in tagging content and which rules apply to these elements. Elsevier has developed several related DTDs for, amongst others, scientific journal articles, book chapters and abstracted information. These DTDs are currently in version 5.5, and can be accessed via this site.
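To make the role of a DTD concrete, here is a minimal sketch that reads the declarations of a tiny DTD embedded in a document. The note, heading and body elements are invented for illustration and have nothing to do with the Elsevier DTDs. The sketch uses the expat parser from the Python standard library, which reports declarations but does not validate the document against them.

```python
import xml.parsers.expat

# Hypothetical toy DTD in the internal subset; not an Elsevier DTD.
DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (heading, body+)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ATTLIST note lang CDATA #IMPLIED>
]>
<note lang="en"><heading>Hi</heading><body>Text</body></note>"""

declared_elements = []
declared_attributes = []


def on_element_decl(name, model):
    # Called once per <!ELEMENT ...> declaration; model holds the content rule.
    declared_elements.append(name)


def on_attlist_decl(elname, attname, type_, default, required):
    # Called once per attribute declared in an <!ATTLIST ...>.
    declared_attributes.append((elname, attname))


parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.AttlistDeclHandler = on_attlist_decl
parser.Parse(DOC, True)

print(declared_elements)
print(declared_attributes)
```

Checking that a document actually obeys these declarations requires a validating parser, which the standard library does not provide.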
The DTDs for journal articles and book chapters describe only the highest-level structure of the product, most of which is filled with “common” elements. These are stored in a “Common Element Pool” (CEP).
The Elsevier DTDs adopt several industry standards:
- Unicode, the character set of XML
- CALS tables, enhancing interoperability of tables in journal articles and existing tools
- MathML, making mathematical formulae accessible to existing and newly developed tools for the publication and exchange of mathematical information
- XLink, used to link to documents and resources on the web.
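Two of these standards, XLink and MathML, are brought into a document through XML namespaces. The sketch below shows how a consumer might read them with the Python standard library. The namespace URIs are the ones defined by the W3C; the para and link element names and the example.org address are assumptions made for illustration, not Elsevier markup.

```python
import xml.etree.ElementTree as ET

# W3C namespace URIs for XLink and MathML.
NS = {
    "xlink": "http://www.w3.org/1999/xlink",
    "mml": "http://www.w3.org/1998/Math/MathML",
}

# Hypothetical fragment: "para" and "link" are illustrative element names.
FRAGMENT = """<para xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:mml="http://www.w3.org/1998/Math/MathML">
  See <link xlink:href="https://example.org/resource">this resource</link>.
  <mml:math><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math>
</para>"""

root = ET.fromstring(FRAGMENT)

# XLink: the link target lives in a namespaced attribute.
hrefs = [el.get(f"{{{NS['xlink']}}}href") for el in root.iter("link")]

# MathML: collect the tokens (identifier, operator, number) of the formula.
math_tokens = [el.text for el in root.find("mml:math", NS)]

print(hrefs)
print(math_tokens)
```

Since the namespaces are standard, generic tools can recognise the links and formulae without knowing anything about the surrounding DTD.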
Using content on different platforms and in different guises
Not only does Elsevier operate by the “XML-first” principle for its current journal articles and books; all legacy content dating back to Elsevier’s origins as a science publisher is also available in XML. From these XML sources, content is made available in various formats such as HTML, web and print PDF, and ePub. XML-based full-text content is published on a number of Elsevier web sites such as ScienceDirect and Clinical Key, and is also delivered to non-Elsevier platforms like PubMed Central. XML-derived content is used by numerous abstracting and indexing services and databases, both Elsevier-owned, such as Scopus and Embase, and external, such as PubMed.
Elsevier enriches its XML content by including relevant metadata; retrievability is improved by the attachment of taxonomy data. Moreover, all XML content can be made available for text and data mining.
Quality control: Documentation and validation
Developing a DTD alone is insufficient to allow an XML-based process; high-quality documentation helps in clarifying the interpretation of the tags and specifying the ways in which they are used. Elsevier has developed the so-called “Tag by Tag” format for its DTD documentation. The Tag by Tag documentation describes each element in the DTD family in detail in a uniform way.
Good documentation goes along with good validation, both to capture errors efficiently and consistently and to enforce quality requirements with business partners. Just parsing a document against the DTD is insufficient to achieve the quality level required. Elsevier has developed its own quality checking application, a configurable rules-based tool that checks many aspects beyond what a validating parser covers. The rules file is in XML format. The tool is able to check not only XML files, but any tag-based file. In addition, it contains libraries to create tag-based files from non-tag-based files, such as PDF and artwork files.
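The text does not describe how Elsevier’s tool or its rules file work, so the following is only a sketch of the general idea of rule-based checking driven by an XML rules file. The rule format (the rule element with id, element, check and value attributes), the two checks and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file; the real tool's rule format is not documented here.
RULES = """<rules>
  <rule id="R1" element="title" check="non-empty"/>
  <rule id="R2" element="para" check="max-length" value="40"/>
</rules>"""

# Hypothetical document that a DTD parser would accept structurally.
DOCUMENT = """<article>
  <title>   </title>
  <para>Short paragraph.</para>
  <para>This paragraph is deliberately longer than forty characters.</para>
</article>"""


def check(document_xml, rules_xml):
    """Apply every rule to every matching element and collect findings."""
    doc = ET.fromstring(document_xml)
    findings = []
    for rule in ET.fromstring(rules_xml).iter("rule"):
        kind = rule.get("check")
        for el in doc.iter(rule.get("element")):
            text = "".join(el.itertext()).strip()
            if kind == "non-empty" and not text:
                findings.append((rule.get("id"), el.tag, "empty content"))
            elif kind == "max-length" and len(text) > int(rule.get("value")):
                findings.append((rule.get("id"), el.tag, f"{len(text)} characters"))
    return findings


findings = check(DOCUMENT, RULES)
for finding in findings:
    print(finding)
```

Both problems flagged here (a title containing only whitespace and an overlong paragraph) concern content rather than structure, which is the kind of issue a DTD cannot express. Keeping the rules in a separate XML file means the checks can be changed without touching the checking code.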
Available DTDs and their documentation
All Elsevier XML DTDs, including older versions and together with accompanying documentation, are available on the Elsevier DTDs and transport schemas page.