FIELD MARKS FOR WEBSTER 1913 and CIDE
=====================================
Tagset.web:
Explanations of the tags used to mark the Webster 1913 dictionary
and the CIDE (Collaborative International Dictionary of English).
Note that the list of tags used to mark the public domain version
of this dictionary is shorter than the full set described here.
If any tag is not listed here, it is either (1) one of the
"point" (font size) or "type" (font style) tags, which should be self-explanatory; or
(2) a functional field with no effect on the typography.
Last modified March 12, 1999.
For questions, contact:
Patrick Cassidy cassidy@micra.com
735 Belvidere Ave.
Plainfield, NJ 07062
(908) 561-3416 or (908) 668-5252
-------------------------------------------------------------
A separate file, webfont.asc, contains the list of the individual
non-ASCII characters represented by either higher-order hexadecimal
character marks (e.g., \'94, for o-umlaut) or by entity tags
(e.g., .
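The hexadecimal character marks can be decoded with a simple lookup. The sketch below is illustrative only: the function name and table are hypothetical, and the table holds just the three marks mentioned in this file (\'94, \'bd, \'b8). The complete list is in the separate character file and is not reproduced here.

```python
import re

# Only the three marks named in this document; the full table is in the
# separate character file and would have to be loaded from there.
HEX_MARKS = {
    "94": "\u00f6",  # o-umlaut
    "bd": "\u201c",  # open double quote
    "b8": "\u201d",  # close double quote
}

# A backslash, an apostrophe, then exactly two hexadecimal digits.
_MARK = re.compile(r"\\'([0-9a-fA-F]{2})")

def decode_hex_marks(text, table=HEX_MARKS):
    # Replace marks found in the table; leave unknown marks untouched.
    def repl(m):
        return table.get(m.group(1).lower(), m.group(0))
    return _MARK.sub(repl, text)
```

Unknown marks are passed through unchanged so that nothing is silently lost when the table is incomplete.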
Note: The tags on this list are similar in structure to SGML tags. Each
tag on this list marks a field; each field opens with a tagname between
angle brackets thus: , and closes with a similar tag containing
the forward slash thus: . No tags are used without closing
tags. Thus the HTML to indicate a line break is symbolized
here as an entity, has a corresponding
.
The absence of an end-field tag, or the presence of an end-field tag
without a prior begin-field tag constitutes a typographical error, of which
there may be a significant number. Any errors detected should be brought
to the attention of PJC or the appropriate editor.
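A check for the two kinds of error just described (a begin-field tag with no end-field tag, and an end-field tag with no prior begin-field tag) might look like the following sketch. It assumes tags have the form of a name between angle brackets, with a forward slash in the closing tag; comments and entity marks are not handled, and the function name is hypothetical.

```python
import re

# Assumption: a tag is <name> or </name>, name being letters and digits.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

def find_tag_errors(text):
    """Return a list of (kind, tagname, offset) for unbalanced tags."""
    errors = []
    stack = []  # open tags still awaiting their close: (name, offset)
    for m in TAG.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            stack.append((name, m.start()))
            continue
        # Look for the most recent open tag with the same name.
        for i in range(len(stack) - 1, -1, -1):
            if stack[i][0] == name:
                # Tags opened after it and never closed lack an end tag.
                for unclosed, pos in stack[i + 1:]:
                    errors.append(("missing-close", unclosed, pos))
                del stack[i:]
                break
        else:
            errors.append(("missing-open", name, m.start()))
    # Whatever is still open at the end of the text was never closed.
    errors.extend(("missing-close", name, pos) for name, pos in stack)
    return errors
```

The offsets make it easy to locate each suspect tag in the file when reporting it to an editor.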
Most of the tagged fields are presented in the text in italic type,
with a number of exceptions. Where a word is contained within more than
one field, the innermost field determines the font to be used. Wherever
recognizable functional fields were found, an attempt was made to tag the
field with a functional mark, but in many cases, words were italicised only
to represent the word itself as a discourse entity, and in some such cases,
the "italic" mark was used, implying nothing regarding functionality
of the word. The base font is considered "plain". Where an italic field
is indicated, parentheses or brackets within the field are not italicised.
Where no font is specified for a tag, the tag is merely a functional
division, and was printed in plain font unless otherwise tagged. This type
of segment is marked by an asterisk (*) where the font name would be.
The size of the "plain" font in the original text is about 1.6 mm for
the height of capitalized letters.
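The rule that the innermost field determines the font can be sketched as below. The tag names ("ital", "bold", "func") are hypothetical and used only for illustration. The sketch also assumes one reading of the rule for purely functional tags: they take the font of the nearest enclosing field that specifies one, and plain otherwise.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

# Hypothetical tag names. None marks a purely functional tag (the
# asterisk entries in the tables), which specifies no font of its own.
FONTS = {"ital": "italic", "bold": "bold", "func": None}

def font_runs(text, fonts=FONTS, base="plain"):
    """Split tagged text into (text, font) runs."""
    runs = []
    stack = []  # names of the currently open fields, outermost first
    pos = 0

    def current_font():
        # Innermost field with a font wins; functional tags are skipped.
        for name in reversed(stack):
            font = fonts.get(name)
            if font is not None:
                return font
        return base

    for m in TAG.finditer(text):
        if m.start() > pos:
            runs.append((text[pos:m.start()], current_font()))
        if m.group(1):  # closing tag
            if stack and stack[-1] == m.group(2):
                stack.pop()
        else:
            stack.append(m.group(2))
        pos = m.end()
    if pos < len(text):
        runs.append((text[pos:], current_font()))
    return runs
```

The sketch does not implement the exception for parentheses and brackets inside italic fields, which would need a further pass over each italic run.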
=============================================================
Explicit typographical tags:
These were used where the purpose of a different font was merely to
distinguish a word from the body of the text, and no explicit functional
tag seemed appropriate.
-----------------------------------
Tag Font
-----------------------------------
Explicit formatting tags:
. . . . . . . . . . . . . . . . . .
plain font (that used in the body of a definition) --
normally not marked, except within fields of
a different font.
italic (in master files)
italic (for use in HTML presentation)
bold (in master files)
bold (for use in HTML presentation)
bold, Collocation font. Same font as used in collocations.
smaller This is used only in the list of "un-" words not
by 1 point actually defined in the dictionary. Probably could be
replaced by a segment mark for the entire list!
The "un-" words should be indexed as headwords.
bold Same as , a font similar to that used in
collocations. However, this tag is used in a table
and could be set to a different font.
* HTML tag -- largest heading font.
* HTML tag -- second largest heading font.
* Marks a Row title in a table.
Font the same as the headword , though the field is
not a headword. Used only once.
* Multiple items, a set of items in a table.
A series of point size markers, many unique.
* One of the tags of the form where **
represents the typographic point size of the
enclosed text.
An HTML tag indicating that the enclosed text is
of teletype form, preformatted in a uniform-spaced
font.
small caps (used mostly for "a. d.", "b. c.")
This is the same font as , but has no functional
or semantic significance.
group of table data elements in a table
subscript, like subscript
superscript
superscript
Sans-serif font
Bold (collocation font) and also a subtype.
HTML tag -- teletype font
A squared bold font without serifs approximating the
"universe bold" font on the HP Laserjet4, slightly
larger than the capitals in a definition body. Used
in expositions describing shapes, such as
"Y", "T", "U", "X", "V", "F".
Vertically organized column.
Vertically organized column -- only part of a table
which needs to be completed. Used once.
<...type> A series of tags, many unique, designating certain
unusual fonts, such as "bourgeoistype" for
"bourgeois type", in the section on typography.
Most of these occur only once, in the section on fonts.
=============================================================
Tags with semantic content:
. . . . . . . . . . . . . . . . . . . . . . . . . . .
* Alternative spelling segment. Almost always
contained within square brackets after the main
definition segment. Expository words
such as "Spelled also" are in plain font;
the actual alternative spelling is marked by
... tags within this segment.
italic Antonym.
italic Alternative spelling. The actual word which is an
alternative spelling to the headword. These
are functionally synonyms of the headword. In
most cases these also occur as headwords, with
reference to the word where the actual definition
is found, but not all such words are listed
separately, particularly if the spelling is
close enough to the headword to be found at the
same point in the dictionary. Whether listed
separately or not, these words should
be indexed at this location, also.
italic Authority or author. Used where an authority is
(may be right- given for a definition, and also used for the
justified. See author, where a quotation within double quotes
in the section is given in the same paragraph as the
on formatting). definition. The double quotes are indicated
by the open-quote (\'bd) and close-quote
(\'b8). In both cases, it is typically
right-justified, almost always fitting on
the same line with the last line of the
definition or quotation.
Within collocation segments, it is usually
used only after quotations, and is not right-
justified, except occasionally where it
would be close to the right margin, and then
apparently is right-justified. We have
not explicitly marked those which are
right-justified, but they can be
recognized because they are on a line by
themselves, preceded by two carriage returns.
* Marks a biography. Should be longer than
a short mention of who a person was, which
is typically included as a definition.
* Same as
italic Marks the name of a book, pamphlet, or similar
document.
* A field of knowledge of which the headword
is a division.
* Caption of a figure or table.
* tags the CAS (Chemical Abstracts Service) registry
number for a chemical substance.
italic tags the infectious disease caused by the headword.
Implied type of the agent is a microorganism, and
the tag must mark a disease.
* Same as without the italic type.
* Same as without the italic type.
italic inverse of causes: tags the causative agent of an
infectious disease, which is the headword.
The tag must mark a microorganism, virus, or
prion, and the implied type of the headword is
a disease.
Used only for The single letter in the headers to each
letter of the alphabet.
* marks the proper name of a city. Used only
occasionally and not consistently at this stage.
italic Converted to: used to tag substances which are
products prepared by conversion from the
headword. Usually chemicals or complex
products from natural materials. Rarely used
up to 1998.
* List of heads for the columns of a table.
* Title of a column in a table.
* Comment -- differs from in being in-line with
the definition paragraph. Provides a little
additional information.
* Name of a company (commercial firm). Compare
italic Composed of. Tags a substance of which the
headword is at least partly composed. The
substance may be particulate, such as
diatoms composing diatomaceous earth.
* marks an object contained within the headword.
italic Contrasting word. Not exactly an antonym, which
is marked , but a contrasting word which is
often introduced as "opposite to" or "contrasts
with".
* Name of a country (nation) of the world.
italic Collocation reference. A reference to a collocation.
Each such collocation should have its own entry,
marked by
... tags, and these
references should function as hypertext buttons
to access that entry.
* A Date, of any type, e.g. Dec. 25.
* Date-with-year tags a date containing a year.
* definition. The definition may have subfields,
particularly (an illustrative phrase
starting with "as" or "thus" and containing
the headword (or a morphological derivative)).
The , \'bd...\'b8 quotations (left and
right double quotes) and fields may be
found within a definition field, but should be,
and usually are, located outside the definition
proper. The marking macro was
inconsistent in this placement, and the
exclusion of the , and quotations
needs to be completed by the proof-readers.
Certain definitions contain
fields within them, where the headword is
an irregular derivative of another headword.
In these cases, the field follows
immediately after the tag, and these
entries do not have a separate field.
In such cases, the field is italic, as
usual.
* Division of the headword, usually an organization.
E. g. a faculty or department of a university,
or a United Nations agency.
* Marks an education institution, a subtype of
organization.
* tags a physical object or form of radiation
emitted by the headword
Just a place-holder for illustrations, but seldom used.
italic Marks the name of a movie film.
italic Field of specialization. Most often used for
Zoology and Botany, but many "fields of
specialization" are marked for technical
terms. The parentheses are usually within this
field, but are not themselves in italics.
* Name of a geographical region of any size;
if applicable, the more specific ,
, or are preferred.
* Hypernym. Points to the hypernym from WordNet 1.5.
Initially, used only for entries extracted
from WordNet 1.5. Not present in the original
1913 version.
* Illustrative usage -- mostly from WordNet, and placed
outside the definition, in contrast to usage.
These should be converted to ... illustrative
usage format for consistency.
* Illustration place-holder. Seldom used.
* HTML usage -- points to an image file, usually
.gif or .jpg. These have no closing tag, and
will appear as errors in parsing.
* Points to a word whose meaning is an intensified
form of the headword. Taken from WordNet
tags, used with some adjectives from WordNet
* Designates one item in a row of a table. Used only when
intervening spaces do not serve properly as natural
field separators.
italic Translation into a foreign (non-English) language
of the previous word in the text -- italic font.
( is a translation into English)
italic Same as
* Title of a journal (periodical).
* Always a filled rectangular array.
* A 2x5 matrix (2 rows by 5 columns).
* Multiple synonymous subtypes -- used in
def. of "grass".
* Multiple table, encloses
figures.
* Music figure. Only in a note under the entry "Figure",
the two numbers of each such field
are bold, 20 point type, stacked as in a fraction with
a bar between them, but also having a horizontal stroke
midway through each numeral. Unique to this entry.
* paragraph tag, used always in pairs. Line breaks may
be embedded inside the paragraphs.
* marks the proper name of a person. Used only
occasionally, but should be used more frequently
for cases where first names are abbreviated,
to reduce ambiguity of the period for automatic
analysis. Where a title is given, prefixed
or postfixed, it is included in this tag.
* marks the name of a person, when only one name
(usually the last name) is given. Not used
consistently where it should be.
* Marks the name of a publication other than book,
which is marked by . It is often a
magazine or journal.
* Tags the name of a person who is speaking,
within a quotation.
Same as
* Collocation, plain text -- used to tag phrases that
should be parsed as a unit, but has no typographical
significance.
italic Always right-justified, as described for .
* A reference to a word in the vocabulary.
* Marks the set of references used for a longer article
such as a biography.
* Marks the name of a river -- a proper name
* Right justified
* Designates a row in a table.
* Name of a geopolitical state, the first subdivision of
a country. Includes, e.g. Canadian provinces.
* Lists subtypes of the headword.
* superscript
* Supra. The two parts of each such field
are stacked, one over the other, *without* a
horizontal bar between (as in a fraction).
Used only in one entry, for a musical notation.
* Always a filled rectangular array, having and
elements.
* Table datum - one cell in a table
* Table header
* Tags a commercial Trade name
* Table title (Larger than normal font)
====================================================================
Functional Tags
--------------------------------------------------------------------
Tag Font Meaning
(Comparatives are relative to the plain font.)
-----------------------------------------------------------------------
<-- --> * Comment, not a tag. These segments should be deleted
from the written or printed text.
Page numbers of the original text are indicated
within such comments; these may be left in, if
desired.
* HTML-style comment. Used to indicate page numbers
in the public domain version.
small caps Tags for the actual adjective or adverb
comparatives or superlatives. Should be
indexed. See also conjf (verbs) and
decf (nouns).
italic Alternative name. Usually for plants or animals,
but also used for other cases where words
are introduced by "also called", "called also",
"formerly called". These are functionally
*synonyms* for that word-sense.
italic Same as , but the marked word is a
plural form, whereas the headword is singular.
* Adjective morphological segment, primarily
the comparative and superlative forms.
The occasional adverb morphology is
also tagged this way.
* A segment occurring within the definitional
sentence, providing an example of usage of
the headword. Not conceptually a part of the
actual definition.
smaller spacing Collocation definition. Similar in structure
to headword definitions (the field). May
contain an field. Plain type, but with
closer spacing than main definitions.
bold, Collocation. A word combination containing the
smaller by headword (or a morphological derivative).
1 point The collocations do not have an explicitly
marked part of speech.
See also , tagging embedded collocations.
Collocation, no typographic significance.
Used to mark a word combination defined in
the dictionary without effect on font.
small caps The conjugated (non-infinitive) forms of
verbs. imp. & p. p. is common, as well as
p. pr. & vb. n. Irregular variants of
these are less common. Words in this
field perhaps should be indexed.
smaller Collocation segment. The font and size are
vertical normal in a cs, but the spacing between lines
spacing is smaller (0.9 mm between lower-case letters,
rather than 1.1 mm in the main body of the
definition). For an on-line dictionary,
reproducing this typography is probably
pointless.
small caps The actual morphological variants of nouns or
pronouns. Should be indexed.
* Embedded Collocation. A word combination
containing the headword (or a morphological
derivative), embedded within a definition
without a separate definition of its own.
These collocations should be defined
implicitly by the text of the definition in
which they are embedded.
See also
, tagging explicitly defined
collocations.
Small Caps Entry reference. References to headwords
within the "etymology" section are in small
caps. Such references also occur
in the body of definitions, and in "usage"
segments.
Such entry references should function as hypertext
buttons to access that entry.
* Etymology. Always contained within square
brackets. Normal type is used for explanatory
comments, and italics for the actual words
(marked ) considered as etymological
sources.
italic Etymological source. Words from which the
headword was derived, or to which it is related.
The Greek words within an etymology segment
are invariably etymology sources, and should
be marked as such, but are not so marked,
even in the rare cases where the Greek word
transliteration has been written in.
italic Etymological source, being the name of a person
or geographical location which is the eponym
for the concept. This is used to distinguish
eponymous etymologies from others, and can also
be found in the body of a definition or note,
not only in the etymology field. Very few
of the names that should be marked this way
have actually been so marked, as of version
0.42. In cases where such eponymous names
have not yet been thus marked, they will
usually be marked by , the non-semantic
italic-font marker, or, in etymologies, by
.
italic Example. An example of usage of the headword,
usually found within an or segment.
* Frequency of use, ordinal rank. This is used for
WordNet entries, in which the synonyms
were ranked in order of frequency of use.
1 indicates that the headword is the
first word on the list of synonyms.
* First use. A date at or around which the first
use of this word in writing is recorded.
Not in the original 1913 Webster, and usu.
taken from a recent dictionary. Only a few
such fields have been entered as of version
0.41
transliteration Greek. The Greek words have been transliterated
using the equivalents explained in the
file "webfonts.asc". In most cases, the
transliterations are typical for Greek
letters, except for theta (transl = q),
phi (transl. = f), eta (transl. = h), and
upsilon (transl. = y, whether pronounced
as y or u). This was to eliminate any
ambiguity. These words occur primarily
in etymologies, and to conform to the
usage of should also be marked
by , but as of version 0.41 they
are not usually thus marked.
bold, headword. Each main entry begins with the
larger by mark, and ends at the next mark. The
2 points main entries are not otherwise explicitly
marked as a distinctive field.
The same word may appear as a headword
several times, usually as different parts
of speech, but sometimes with different
entries as the same part of speech, presumably
to indicate a different etymology.
Within the hw field the heavy accent is
represented by double quote ("), the
light accent by open-single-quote (`),
and the short dash separating syllables by
an asterisk (*). A hyphen (-) is used to
represent the hyphen of hyphenated words.
italic, Usage mark. Almost always within square
brackets, occasionally in parentheses or
without any bracketing.
but The most common usage marks,
explanatory "Obs." = obsolete "R." = rare, "Colloq." =
may be plain. colloquial, "Prov. Eng." = Provincial England,
etc. are in italics. Some usage notes are also
marked with , but are in plain. For
simplicity, all words in this field may be
italic, until additional explicit marks are
added.
* A usage mark in plain type (not italic). Found
within a definition, when more than one
sense-number is listed. "Fig." at the head
of an entry is the most common case.
* Multiple collocation. Similar to multiple
headword, when two or more collocations share
one definition; however, the two collocations
are in-line, rather than stacked or justified.
There may be "or" or "and" words
(italicised), or an "etc." (plain type)
within this field. In many cases, the
* Multiple headword. This field is used where
more than one headword shares a single
definition. In the dictionary, the
(usually) two headwords are left-justified
one below the other in the column, and are
tied together on the right side of the
headwords by a long right curly brace.
This division is strictly functional,
for analytical purposes, and does not
affect the typography.
* Noun morphology section. Rarely used, mostly
for irregular personal pronouns.
* Explanatory note. No explicit font is indicated.
These segments may be separate, as in the
separate paragraphs starting
* Plural. The "plural" segment starts with a
"pl." which is italicised, but in this
segment is not otherwise marked as
italicised. Other words occurring in this
segment are plain type. The "pl." can be
easily explicitly marked if necessary.
italic Part of speech. Always an abbreviation: e.g.,
n.; v. i.; v. t.; a.; adv.; pron.; prep.
Combinations may occur, as "a. & n.".
small caps Plural word. The actual plural form of the word,
found within a segment.
* pronunciation. The default font is normal, but
many non-ASCII characters are used.
The pronunciation field may have more than
one pronunciation, separated by an "
smaller by Quotation. No bracketing quotation marks,
two points, though occasionally \'bd-\'b8 quotations occur
centered, within these quotations. These quotations
Separate tend to be more complete sentences, rather
paragraph than just phrases, such as are contained
within quotation marks within the definition
paragraph.
italic, Quotation author. Used only for the quotations
right justified marked with that are centered in their
own paragraphs.
italic Quotation example. An example of usage of
the headword, within quotations marked
by .. tags.
italic Subdefinition, marked (a), (b), (c), etc. These are
finer distinctions of word senses, used
within numbered word-sense (for main entries),
and also used for subdefinitions within
collocation segments, which have no numbering of
senses. The letter is italic, the parentheses
are not. This tag is also used to indicate the
lettered subdefinition when it is referred to
at another point in the text.
italic The name of a ship. Rarely used.
* Singular. Analogous to the segment, but more
rarely used, mostly for Indian tribes, which
are listed in the plural form.
small caps Singular word. The singular form of the
plural-form headword.
bold, Sense number. A headword may have over 20
larger by different sense numbers. Within each numbered
2 points sense there may be lettered sub-senses. See
the (sub-definition) field.
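As an illustration of the headword conventions in the table above (each main entry runs from one headword mark to the next; within the hw field the double quote and open-single-quote stand for accents and the asterisk for the syllable dash), the following sketch splits a text into entries and derives a plain index key from each headword. It assumes the headword field is written <hw>...</hw>. The function names and the sample text are hypothetical.

```python
import re

# Assumption: the headword field is written <hw>...</hw>.
HW = re.compile(r"<hw>(.*?)</hw>", re.S)

def split_entries(text):
    """Return entries, each running from one <hw> mark up to the next.

    Any text before the first <hw> mark is dropped.
    """
    starts = [m.start() for m in HW.finditer(text)]
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

def plain_headword(marked):
    # Strip accent marks (" and `) and syllable marks (*); keep hyphens.
    return re.sub(r'["`*]', "", marked)

def index_key(entry):
    """Return the plain headword of an entry produced by split_entries."""
    m = HW.search(entry)
    return plain_headword(m.group(1)) if m else None

# Made-up sample text, for illustration only.
sample = '<hw>Ab"a*cus</hw> n. A table. <hw>well"-be`ing</hw> n. Welfare.'
keys = [index_key(e) for e in split_entries(sample)]
```

Because the same word may appear as a headword several times, the keys are not unique; an index built this way would need to map each key to a list of entries. Entries grouped under a multiple-headword field would also need separate handling, since this sketch splits at every headword mark.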