Abstract
The Semantic Web and various ontologies have evolved using the power of the web, connecting documents through web-based inter-document linkages. In this paper, a set of simple ID/HREF markers for intra-document cross-referencing is proposed. These markers can be used for semantic enrichment of documents, conveying the intent and purpose of the document clearly. One of the main advantages is the ability to construct multiple graphical views of the document while the text remains embedded in the document. The approach requires only an anchor element with an ID attribute and an element with an xlink:href attribute, so it can be implemented in almost all book or journal article DTDs and schemas. A simple XUL-based authoring application, MuLTiFlow, which can reproduce various graphical displays, is also being developed as part of this work.
Language is produced in a seemingly independent stream of characters, but both writing and speaking involve many layers of non-local cross-references within the document, usually in three forms: (i) short forms (non-terminals) such as symbols, acronyms, abbreviations and short definitions; (ii) comparison of one item with another as a reference point for the narrative, which in extreme cases can turn into transclusions or direct quotes; and (iii) semantic ontologies that pertain to the deeper meaning of the paper.
The first category involves direct literary forms such as symbols, abbreviations, acronyms and short definitions. The second category has been well studied and usually involves citations, cross-references and transclusions with the CONREF attribute, as in DITA XML [DITA,DITA-DB]. Even in this case, however, generating cross-references is difficult. One extreme is to use transclusions with the CONREF attribute and auto-generated text (say, using XSLT as part of an XML pipeline [XPipe]); the other extreme is to treat the cross-reference as a plain hyperlink with duplicate text in the markup. It is possible to construct solutions to this problem that are more sophisticated than either extreme. The third and final category is the most important, because it reveals the inner meanings of the paper, which matters to both document writers and readers. We demonstrate here how both semantic and literal markup offer interesting possibilities to expose these intra-document non-local linkages and bring about a richer reading experience using graphical views.
Microformats incorporating RDF triples that extend XHTML formats [RDFa], and further extensions incorporating OWL, have been proposed [OWL-RDFs]. Our approach is simpler in that it proposes no additional attributes beyond the traditional ID/HREF mechanisms for semantic enrichment.
Consider a simple piece of text from Wikipedia recreated in HTML markup:
In statistics, <a id="oid..anova.expansion">analysis of variance</a> (<a id="oid..anova.acronym">ANOVA</a>) is a collection of statistical models, and their associated procedures, in which the observed variance is partitioned into components due to different explanatory variables. In its simplest form <a href="conref:#oid..anova.acronym"/> provides <a id="oid..anova.description">a statistical test of whether or not the means of several groups are all equal</a>, and therefore <a rel="owl:subClassOf" href="rel:+ #anova.Student_t-test #anova">generalizes</a> <a id="oid..anova_._Student_t-test" href="http://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test"> Student's two-sample t-test</a> to <a href="rel:? #anova_._Student_t-test #anova">more than two groups</a>...
xml:id Version 1.0 [xml:id] does not allow a colon in ID values, so a colon cannot be used as a namespace separator there. We use ".." (a horizontal colon) instead, in order to avoid validity errors (although they are not fatal). Three namespaces are used here:
xmlns:oid="http://www.tnq.co.in/MuLTiFlow/oid" | a mock-namespace prefix for micro IDs, to distinguish them from other ordinary IDs; these IDs use "." to indicate literal subforms and "_._" for semantic containment, so one has to avoid using "." in names, as it has a special meaning here. |
xmlns:conref="http://www.tnq.co.in/MuLTiFlow/conref" | used mostly in href attribute for transclusions as defined in DITA [DITA,DITA-DB]; |
xmlns:rel="http://www.tnq.co.in/MuLTiFlow/rel" | used mostly in href attribute to indicate both semantic and literal relationships between classes. The nature of the relationships is indicated using an operator that follows immediately after the "rel:" namespace prefix. A list of these operators and their proposed usage is shown in Appendix A. |
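The transclusion role of the "conref:" prefix can be illustrated with a short Python sketch. It is not part of MuLTiFlow: the function name resolve_conrefs is hypothetical, the input is assumed to be a well-formed XML fragment, and the fallback that prepends "oid.." to a bare name is an assumption made only to tolerate hrefs written without the prefix.

```python
import xml.etree.ElementTree as ET

OID_PREFIX = "oid.."
CONREF_PREFIX = "conref:#"

def resolve_conrefs(xml_text):
    """Fill every <a href="conref:#..."/> with the text of the anchor it points to."""
    root = ET.fromstring(xml_text)
    # Index all anchors that carry an id attribute.
    by_id = {el.get("id"): el for el in root.iter("a") if el.get("id")}
    for el in root.iter("a"):
        href = el.get("href", "")
        if not href.startswith(CONREF_PREFIX):
            continue
        name = href[len(CONREF_PREFIX):]
        target = by_id.get(name)
        if target is None:
            # Assumption: a bare name may omit the "oid.." prefix.
            target = by_id.get(OID_PREFIX + name)
        if target is not None:
            el.text = "".join(target.itertext())
    return ET.tostring(root, encoding="unicode")

sample = (
    '<p><a id="oid..anova.acronym">ANOVA</a> is a collection of models. '
    'In its simplest form <a href="conref:#oid..anova.acronym"/> provides a test.</p>'
)
print(resolve_conrefs(sample))
```

Unresolved references are left empty rather than raising an error, so a partially marked-up draft can still be rendered.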
Figure 1
We use a simple "." to indicate various literal or textual forms underlying the same class or object. To indicate deeper semantic ontological relationships one must use "_._", and for the case of xml:id="oid..anova_._Student_t-test" this can be expressed in the OWL [OWL] syntax as:
<owl:Class rdf:ID="Student_t-test"> <rdfs:subClassOf rdf:resource="#anova"/> </owl:Class>
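The mapping from an ID to OWL can be sketched mechanically. The Python below is only an illustration under stated assumptions: the function names parse_oid and to_owl are hypothetical, literal subforms are assumed to hang off the last class in a "_._" chain, and names are assumed to contain no "." of their own.

```python
OID_PREFIX = "oid.."
SEMANTIC_SEP = "_._"   # semantic containment (subClassOf)
LITERAL_SEP = "."      # literal subforms of the same object

def parse_oid(value):
    """Split an oid ID into (class chain, literal subforms)."""
    if not value.startswith(OID_PREFIX):
        raise ValueError("not an oid ID: " + value)
    body = value[len(OID_PREFIX):]
    # Split on "_._" first, because that separator itself contains a ".".
    classes = body.split(SEMANTIC_SEP)
    # Assumption: subforms are attached to the last class only.
    last, *subforms = classes[-1].split(LITERAL_SEP)
    classes[-1] = last
    return classes, subforms

def to_owl(classes):
    """Emit one owl:Class statement per parent/child pair in the chain."""
    return [
        f'<owl:Class rdf:ID="{child}"> '
        f'<rdfs:subClassOf rdf:resource="#{parent}"/> </owl:Class>'
        for parent, child in zip(classes, classes[1:])
    ]

classes, subforms = parse_oid("oid..anova_._Student_t-test")
for statement in to_owl(classes):
    print(statement)
```

An ID with only literal subforms, such as "oid..anova.acronym", yields a single class and therefore no OWL statements.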
Let us consider another piece of text from Wikipedia:
<p>In quantum mechanics, the <a id="oid..epr.acronym">EPR paradox</a> (or <a id="oid..epr.expansion">Einstein–Podolsky–Rosen paradox</a>) is <a id="oid..epr.description">a thought experiment which challenged long-held ideas about the relation between the observed values of physical quantities and the values that can be accounted for by a physical theory</a>. <a id="oid..epr.paper.authors">Einstein, Podolsky, and Rosen</a> introduced the thought experiment in a <a id="oid..epr.paper.citation" href="#oid..epr.paper">1935 paper</a> <a id="oid..epr.paper.purpose">to argue that quantum mechanics is not a complete physical theory</a>.</p> <p><a id="oid..epr.paper.reference">A. Einstein, B. Podolsky, and N. Rosen, Can Quantum-Mechanical Description of Physical Reality Be Considered Complete?, Phys. Rev. 47 (1935) 777–780</a></p>
This markup could be displayed graphically as shown below.
Figure 2
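As a rough idea of how such a view could be derived from the markup, the following Python sketch collects the oid IDs and nests their literal subforms into a tree. It is not the MuLTiFlow implementation; the names OidCollector, build_tree and literal_tree are hypothetical, and the sketch assumes IDs that use only "." (IDs containing "_._" would need to be split on that separator first).

```python
from html.parser import HTMLParser

OID_PREFIX = "oid.."

class OidCollector(HTMLParser):
    """Collect the oid-prefixed id values of <a> elements, in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        value = dict(attrs).get("id")
        if value and value.startswith(OID_PREFIX):
            self.names.append(value[len(OID_PREFIX):])

def build_tree(names):
    """Nest dotted names: 'epr.paper.authors' becomes epr -> paper -> authors."""
    tree = {}
    for name in names:
        node = tree
        for part in name.split("."):
            node = node.setdefault(part, {})
    return tree

def literal_tree(markup):
    collector = OidCollector()
    collector.feed(markup)
    collector.close()
    return build_tree(collector.names)

sample = (
    '<p>the <a id="oid..epr.acronym">EPR paradox</a> ... '
    '<a id="oid..epr.paper.authors">Einstein, Podolsky, and Rosen</a> ... '
    '<a id="oid..epr.paper.citation" href="#oid..epr.paper">1935 paper</a></p>'
)
print(literal_tree(sample))
```

Each key of the resulting dictionary corresponds to one node of the graph, and the nesting gives the edges from an object to its literal subforms.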
In this section we will consider some complex examples with ontological associations. The first one is again from Wikipedia:
<p>In physics, specifically quantum mechanics, the <a id="oid..scheqn">Schrödinger's equation</a> is an equation that describes how the quantum state of a physical system changes in time. It is as central to <a id="oid..qm">quantum mechanics</a> <a href="rel:~ #qm-#scheqn #cm-#nl">as</a> <a id="oid..nl">Newton's laws</a> are to <a id="oid..cm">classical mechanics</a>.</p> <p><a href="conref:#qm.scheqn"/> can be <a rel="owl:equivalentClass" href="rel:~ #scheqn #heieqn">mathematically transformed into</a> <a id="oid..heieqn">Werner Heisenberg's matrix mechanics</a>, and <a rel="owl:equivalentClass" href="rel:~ #heieqn #feyeqn">into</a> <a id="oid..feyeqn">Richard Feynman's path integral formulation</a>. The <a href="conref:#scheqn"/> describes time in a way that is <a href="rel:( #scheqn #rtt">inconvenient</a> for <a id="oid..rtt">relativistic theories</a>, a problem which is <a href="rel:| #heieqn #rtt">not as severe</a> in matrix mechanics and <a href="rel:) #feyeqn #rtt">completely absent</a> in the path integral formulation.</p>
The geometrical representation of this markup would be:
Figure 3
Two important concepts have been introduced in this markup:
<a href="rel:~ #qm-#scheqn #cm-#nl">as</a> creates a hyphenated object, so that the relationship between #qm and #scheqn can itself be accessed as an object.
The "rel:" namespace prefix is followed by a relationship operator (rel:~, rel:(, rel:| or rel:)), with the last three using a smiley notation to indicate the mood of the relationship.
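A minimal Python sketch of how such an href could be taken apart is given below. The function name parse_rel and the returned tuple shape are hypothetical choices for illustration; the sketch assumes the operator and the operands are separated by whitespace and that a hyphenated object is written as "#a-#b", as in the example above.

```python
REL_PREFIX = "rel:"

def parse_rel(href):
    """Split a rel: href into its operator and a list of operands.

    Each operand is a tuple of ID names: one name for a plain object,
    two names for a hyphenated object such as '#qm-#scheqn'.
    """
    if not href.startswith(REL_PREFIX):
        raise ValueError("not a rel: href: " + href)
    operator, *operands = href[len(REL_PREFIX):].split()
    parsed = [
        # Split on "-#" so that hyphens inside a name (e.g. "t-test") survive.
        tuple(part.lstrip("#") for part in operand.split("-#"))
        for operand in operands
    ]
    return operator, parsed

print(parse_rel("rel:~ #qm-#scheqn #cm-#nl"))
print(parse_rel("rel:( #scheqn #rtt"))
```

Splitting on "-#" rather than on "-" is a deliberate choice, since names such as Student_t-test already contain a hyphen.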
MuLTiFlow, a XUL-based HTML+MathML+SVG editor, is a demo implementation of the concepts introduced in this paper. A preliminary version of this editor is available as an open source Debian package and as a Firefox add-on. This WYSIWYG editor has JavaScript widgets that generate editable SVG markup from the ID/HREF microformat, and it could form a useful part of an academic article. The usefulness of this concept is being tested in academic papers and lecture notes for students.
In this section we will consider the application of these concepts to DocBook [DB], NLM [NLM], and a proprietary DTD [Elsevier].
The table below shows possible ways of implementing the technique in each of the DTDs.
Table I
HTML | DocBook | NLM | Elsevier |
---|---|---|---|
<a id="oid..foo_id1"> | <olink type="oid" xml:id="oid..foo_id1"> | <target target-type="oid" id="oid..foo_id1"> | <ce:anchor role="oid" id="oid..foo_id1"> |
<a href="rel:~ #foo_id1 #foo_id2"> | <olink type="rel" xlink:href="rel:~ #foo_id1 #foo_id2"> | <ext-link ext-link-type="rel" specific-use="oid" xlink:href="rel:~ #foo_id1 #foo_id2"> | <ce:intra-ref xlink:role="oid" xlink:href="rel:~ #foo_id1 #foo_id2"> or <ce:intra-refs>...<ce:intra-ref-end xlink:href="rel:~ #foo_id1"><ce:intra-ref-end xlink:href="rel:~ #foo_id2"> |
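The anchor row of the table can be read as a set of templates, one per vocabulary. The Python sketch below is only an illustration of that reading: the dictionary ANCHOR_TEMPLATES and the function anchor_start_tag are hypothetical names, and only the start tag of the ID-carrying element is generated.

```python
from xml.sax.saxutils import quoteattr

OID_PREFIX = "oid.."

# One start-tag template per vocabulary, following the anchor row of Table I.
ANCHOR_TEMPLATES = {
    "HTML": "<a id={id}>",
    "DocBook": '<olink type="oid" xml:id={id}>',
    "NLM": '<target target-type="oid" id={id}>',
    "Elsevier": '<ce:anchor role="oid" id={id}>',
}

def anchor_start_tag(vocabulary, name):
    """Return the start tag that carries the micro ID in the given vocabulary."""
    # quoteattr escapes the value and adds the surrounding quotes.
    return ANCHOR_TEMPLATES[vocabulary].format(id=quoteattr(OID_PREFIX + name))

for vocabulary in ANCHOR_TEMPLATES:
    print(anchor_start_tag(vocabulary, "foo_id1"))
```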
It is also possible to implement the technique using nested links but it is not clear at present if such requirements do arise in real situations.
The markup still has to be standardized. We also need to construct a range of examples to cater to a broad audience. Tutorials for both the MuLTiFlow editor and this microformat are planned. Future work will also include developing a mapping (and a text protocol) from this microformat to other microformats such as RDFa, and a direct mapping to full OWL syntax.
I would like to acknowledge the support and encouragement of Mariam Ram. Much of the development took place through interactions within the company, especially with Amartyo Bannerjee, Srikanth Vittal, M.V. Bhaskar, Shanthi Krishnamurthy and Palanichamy Arumugam. The MuLTiFlow editor was developed by the author in collaboration with B. Manoponni. I would also like to acknowledge useful interactions with Simon Pepping, especially with regard to XML specifications and standards, and with Sibasish Ghosh of the Institute of Mathematical Sciences, who provided enough academic material to test the usefulness of these concepts.
"." | indicates literal subforms of the underlying class or object. |
"_._" | indicates semantic "subClassOf" relationships. |
"-" | this used to access relationships between two classes as an object. |
"_*_" | where * is any letter; it is possible to create other compound classes using this notation. |
"rel:+" | indicates that the arrow is a generalization (hypernym). |
"rel:-" | indicates that the arrow is a specialization (hyponym). |
"rel:~" | indicates that it is a double arrow equivalence class relationship. |
"rel:(" | indicates that their relationship is bad. |
"rel:|" | indicates that their relationship is neutral. |
"rel:)" | indicates that their relationship is good. |
"rel:(n)" | "n" is a signed integer (between -100 and 100) indicating a fuzzy relationship with an weightage. |
"rel:*" | Complete list of unicode operator symbols can be used to indicate various relationships between objects. |
[DITA] DITA, http://docs.oasis-open.org/dita/v1.1/CD01/archspec/archspec.html#dtdorganization
[DITA-DB] DITA for DocBook, http://norman.walsh.name/2005/10/21/dita
[XPipe] XML Pipeline Definition Language Version 1.0, http://www.w3.org/TR/xml-pipeline
[RDFa] RDFa Primer, http://www.w3.org/TR/xhtml-rdfa-primer/
[OWL-RDFs] Embedding OWL-RDFS syntax in XHTML with RDFa, http://ontologyonline.blogspot.com/2007/11/embedding-owl-rdfs-syntax-in-xhtml-with.html
[xml:id] xml:id Version 1.0, http://www.w3.org/TR/xml-id/
[OWL] OWL Web Ontology Language, http://www.w3.org/TR/owl-features/
[DB] DocBook V5.x, http://www.docbook.org/schemas/5x
[NLM] Journal Publishing Tag Set Tag Library version 3.0, http://dtd.nlm.nih.gov/publishing/tag-library/
[Elsevier] Tag by Tag, The Elsevier DTD 5 Family of XML DTDs, http://www.elsevier.com/locate/xml