
Behind
the scenes in the movie industry is where much of
the most important action takes place. Makeup artists,
wardrobe experts, voice experts, choreographers, and more
work together, so that in the final production, the actors
can shine.
Behind
the scenes of our thesaurus is the true workhorse, the
indexing language. How did we create the controlled vocabulary
that comprises our thesaurus? What rules did we follow?
What relationships did we discover?
Join
us for a tour behind the scenes. We might even share a
few of our outtakes!
METHOD
We
used a combination of deductive and inductive methods
in the development of the corset thesaurus, as per 8.3.3
in the ANSI/NISO Guidelines for the Construction, Format,
and Management of Monolingual Thesauri (herein called
the Guidelines). The terms were gathered from a wide
selection of historical costume books, general encyclopedias,
articles, and contemporary websites. Terms were vetted,
incorporated into an overall hierarchical structure,
and thesaural relationships and features were added.
A thesaurus expert (Susie) reviewed our work.
The
corset thesaurus is not exhaustive. It reflects only
that which pertains to the current movie production,
and will expand as further modules are developed in
harmony with the greater plans of MMM.
SUITABILITY
We
have tailor-made this thesaurus for the users and conditions
delineated in our contract with Modern Movie Megacorp
(MMM). In the end our thesaurus will be integrated
into the MMM Thesaurus.
First
of all, the corset thesaurus must reflect the fact that
it will be just one small component of a much larger
merchandise thesaurus. This means that careful consideration
must be given to whether or not homonyms require qualifiers
in the context of what other items are likely to be
sold. In most cases this meant including the qualifier.
Fortunately, the result is a much more adaptable module.
We
also took into account the fact that our employer has
plenty of money to invest in a top-notch thesaurus.
As such, they can afford to have professional indexers
deal with the long-term upkeep of the thesaurus, as well
as index the ready-to-wear corsets and pre-packaged
corset styles. While this will of course be more expensive
for the company, it will be more efficient in the long
run, as less direct indexing instruction need be applied.
It also supports a much higher level of thesaurus complexity.
The
ability to support a large and complex thesaurus is
a great boon, since that is precisely what is demanded
by the varied nature of our user groups. In addition
to the indexers mentioned above, we also have seamstresses,
who normally use the terminology native to their profession,
and will require a great amount of detail in the indexing
in order to understand precisely what the characteristics
are of the corset they are about to make. We also have
layperson web designers and telephone representatives
taking calls from corset consumers, who may have little
or no familiarity with corset terminology, and so will
require extensive scope notes to explain these matters
to the clients. We also have two classes of consumers
themselves - those who have seen Dressed to Kill:
The Merry Widow Project and are therefore familiar
with Victorian terms, and those consumers who have not
seen the movie but would like to buy a corset anyway,
and who will likely know only modern corset-related
terminology. In addition, consumers are unlikely to
know what options are available and may request an option
or search for a term that does not exist, and so these
must also be anticipated and included in the thesaurus.
The color 'pink' for example is not an available (historically
correct) corset color. Any request for 'pink' will redirect
the searcher to the 'basic colors' hierarchy which is
already included in the MMM Thesaurus (as it would be
a related term in the 'corset colors' hierarchy), where
they may select from the colors that are available.
Clearly,
a thesaurus is necessary here to translate each user
group's terminology into that of the others,
so that a corset, once indexed, means
the same thing to the telephone representative talking
to the consumer (who may or may not have seen the movie) as
it does to the seamstress who will make said corset
in the end!
All
this means that for each concept, possible user terms
must be gathered for each of these user groups and incorporated
into the final thesaurus. This will result in a more
expensive, more complex, larger thesaurus. Efficiency,
however, is also a concern. Just because our employers
have money, there is no call to make the thesaurus any
larger or more complex than absolutely necessary to
meet the needs of our users. To that end we threw out
any and all terms that did not fulfill at least one
of the above needs.
Our
users will also interface with the thesaurus in electronic
format only, which will make it much quicker to use.
This could mean an even larger list of terms, however.
Since users will be unable to browse for the next closest
alternative to a requested term, we would have to provide
every possible form of every term (e.g. corset and corsets).
One possible solution to this problem would be to give
a reminder to always use plural forms, but unfortunately,
between the different types of fabric and the body parts
measurement module there would also be plenty of singular
forms to make this unsatisfactory. We have instead solved
this problem by specifying to MMM that they must provide
truncation-capable search software, along with an ever-present
note to searchers saying what truncation is and how
best to employ it.
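The truncation idea can be sketched in a few lines. This is only an illustration of right-hand truncation, not the search software MMM will actually provide; the `TERMS` list, the `*` wildcard and the function name are assumptions made for the example.

```python
# Hypothetical vocabulary sample; the real thesaurus is much larger.
TERMS = ["corsets", "corsetry", "corset components", "busks", "sateen", "steel"]


def truncation_search(query: str, terms: list[str]) -> list[str]:
    """Return matching terms; a trailing '*' requests right-hand truncation."""
    q = query.strip().lower()
    if q.endswith("*"):
        stem = q[:-1]
        # Truncated query: every term that begins with the stem is a hit.
        return [t for t in terms if t.lower().startswith(stem)]
    # No wildcard: only an exact (case-insensitive) match counts.
    return [t for t in terms if t.lower() == q]


print(truncation_search("corset", TERMS))   # []
print(truncation_search("corset*", TERMS))  # ['corsets', 'corsetry', 'corset components']
```

Without the wildcard, the singular 'corset' finds nothing; with it, the searcher reaches the plural descriptor without our having to store every form of every term.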
Fortunately,
the one thing all our user groups have in common is
some involvement with corsets, and in that respect,
our thesaurus is quite suitable indeed.
TYPE OF INDEXING LANGUAGE
Free
language indexing would be just that, free, and TOB&B
would be unlikely to make much money in that business!
That leaves us with natural and controlled languages.
Natural language is derived from the item itself, and
only applies to verbal or textual documents. Corsets
are not textual documents. There are no words written
on them, except perhaps a 'care and cleaning' tag
which we have no reason to index whatsoever. MMM will
not be monogramming the corsets, nor will they be using
fabric that has written words in the pattern. The corset
module, then, does not reflect natural language indexing.
It may in the future, if any of these exceptions becomes
relevant. Also, the larger thesaurus in which the corset
module resides may itself incorporate natural language,
though this depends entirely on just what they decide
to sell. The corset module does, however, use a controlled language.
PRE- VS. POST-COORDINATION
"What
is this object? Why, it's a corset!"
Because
the objects in question need to be indexed to some depth
for the indexing to be of use, each will require the
application of more than one term. This brings up the
question of how those multiple terms will be applied
- as one great long string, or individually. That is,
will the indexing be pre- or post-coordinate? While
this seems like a simple question, it is quite a bit
more complex than it seems.
MMM
only carries a few actual pre-set corsets for sale.
It would be quite easy, if somewhat lengthy, to string
together a pre-coordinate description for each one and
be done ("Why, it's a scarlet-sateen-divorce-corset-with-steel-diagonal-boning-and-a-spoon-busk!").
However, for a user's search to be successful
it must match a string the indexer created. Since
MMM also makes custom corsets, there is an almost infinite
number of items that must be indexable before
they ever exist. This means that every possible combination
of the terms in the thesaurus would have to be strung together
in the absence of any actual request for the (not yet
existing) item, resulting in a ridiculous amount of
conceptual redundancy in the resulting catalogue.
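To get a feel for how quickly pre-coordinated strings multiply, here is a rough sketch. The facet names and most of the option values (for example 'coutil', 'vertical boning' and 'straight busk') are invented for illustration and are not taken from the actual thesaurus.

```python
from itertools import product

# Hypothetical facets with a handful of options each.
FACETS = {
    "color": ["scarlet", "black", "white"],
    "fabric": ["sateen", "coutil"],
    "type": ["divorce corset", "S-curve corset"],
    "boning": ["diagonal boning", "vertical boning"],
    "busk": ["spoon busk", "straight busk"],
}

# Pre-coordination: one ready-made string per possible combination.
strings = ["-".join(combo) for combo in product(*FACETS.values())]

# Post-coordination: only the individual terms need to exist.
all_terms = {term for options in FACETS.values() for term in options}

print(len(strings))    # 48 strings (3 * 2 * 2 * 2 * 2)
print(len(all_terms))  # 11 terms
```

Even this toy vocabulary of 11 terms needs 48 strings, and every new option multiplies the total rather than adding to it.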
Rules
for pre-coordination are intended to improve relevance
in retrieved items by making it clear what a combination
of terms means. Fortunately, an item with the following
descriptors - scarlet, sateen, divorce corset, steel,
diagonal boning, spoon busk - is unlikely to be misconstrued
as anything else! (Especially when laid out in a web-based
order form with labels indicating what each option
applies to.)
In
the end, pre-coordination is more time-consuming to
do, and there seems not to be any compelling reason
to do so. Post-coordination is easier to do, more flexible
and more efficient, especially when adding new modules
such as an evening dress module. Therefore, we post-coordinate.
And since the indexing is post-coordinate, retrieval
is post-coordinate as well, never a mix of pre- and
post-coordinate searching.
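Post-coordinate retrieval amounts to combining individual terms at search time. A minimal sketch, assuming a simple inverted index; the item ids, the second item and the function names are hypothetical:

```python
from collections import defaultdict

# Each item is indexed with separate descriptors, not one long string.
ITEMS = {
    "item-1": {"scarlet", "sateen", "divorce corset", "steel",
               "diagonal boning", "spoon busk"},
    "item-2": {"black", "sateen", "S-curve corset", "steel"},
}


def build_index(items):
    """Map every descriptor to the set of item ids indexed with it."""
    index = defaultdict(set)
    for item_id, terms in items.items():
        for term in terms:
            index[term].add(item_id)
    return index


def search(index, *terms):
    """Return the ids of items indexed with every requested term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


INDEX = build_index(ITEMS)
print(sorted(search(INDEX, "sateen", "steel")))       # ['item-1', 'item-2']
print(sorted(search(INDEX, "sateen", "spoon busk")))  # ['item-1']
```

The searcher does the coordinating: any combination of descriptors can be asked for, whether or not an indexer ever anticipated it.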
FORM OF TERMS
Single vs. Multiword Descriptors
The
corset thesaurus includes both single and multiword
descriptors. According to section 3.1 of the Guidelines,
each descriptor selected for inclusion in the thesaurus
must represent only one concept, though more than one
word may be necessary. The restriction to a single indexable
concept is essential to the idea of a
controlled language, and as such we have followed the
Guidelines here. While we also endeavored to keep multiword
descriptors to a minimum, several proved to be necessary.
Our treatment of those terms is discussed below under
the heading 'Compound Descriptors'.
Scope of the Descriptors
We
limited the scope of the descriptors to those meanings
within the domain of the thesaurus (keeping good company
with section 3.2 of the Guidelines). The potential domain
of the thesaurus as a whole (not just our module) however,
is great. As a result, several orphan terms entered
the thesaurus as alternative potential meanings of the
homographs we had to modify with parenthesized qualifiers.
This practice is also recommended in the Guidelines
in section 3.2.1. For example, 'amber' gained the qualifier
'(color)', as it was anticipated that 'amber (fossilized
resin)' would be in the larger MMM thesaurus, as they
may also market amber jewellery.
We
also included many scope notes, in particular where
we felt a descriptor was a sufficiently uncommon term
that one or more of our user groups would not know its
meaning. After all, most indexers are unlikely to also
be corseters! As section 3.2.2 of the Guidelines suggests
that scope notes can be used to give advice on term
usage, this appears to be an acceptable practice.
Types of Concept
Section
3.3 of the Guidelines discusses various types of concepts
that descriptors may represent, several of which are
included in the corset thesaurus. For example there
are 'things and their physical parts' ('corsets' and
'corset components'), materials ('steel') and disciplines
('corsetry'). There are also unique entities ('Nicolette
Kidman'), as discussed in section 3.3.1. Normally these
would be expressed as proper nouns, with the first letter
capitalized and the rest lower case. As discussed below
in the 'Capitalization' section, however, in the corset
thesaurus they are represented either in all capitals
or all lower case.
Grammatical Forms of Descriptors
As
represented in section 3.4 of the Guidelines, we have
endeavoured to format as many descriptors as possible
as nouns. Alternative formats, where applicable, were
included as entry terms.
Singular vs. Plural
The
'count vs. mass nouns' issue seems like such a simple
proposition when boiled down to the basic rule: how
much or how many? Section 3.5 indicates we should decide
which applies, pluralize the count nouns and leave the
mass nouns singular (and then immediately proceeds to
list exceptions).
Sometimes,
however, the answer is both! Take the word 'fish'. I
know there are no fish in our corset thesaurus, but
'bear' with me. When one is talking about different
types of fish, one says 'fishes', as in 'how many fishes
do you carry?' When one is talking only about one type,
one says 'fish', as in 'how much fish would you like?'
Suddenly our rule is of no help! Well, it appears the
same holds true for 'fabric' and 'color'. With the
Guidelines offering so many exceptions, and with so
many different user groups to consider, in the
end we simply took a vote within TOB&B to go with
the plural form, and included the alternate forms as
entry terms.
Preferred vs. Non-preferred Descriptors
Such
varied users mean that choosing
preferred terms based on user warrant, as recommended
in section 3.6.1 of the Guidelines, is impossible. It
is unlikely that all these groups will agree on any,
never mind most, terms. Fortunately, MMM's marketing
strategy solves this problem. Their goal of course is
to sell as many corsets and movie tickets as possible,
and part of their strategy is to inspire a Victorian
craze. What better way to enhance consumers' 'Victorian
experience' than using authentic Victorian terminology?
We
used American spellings since MMM is an American-based
company employing largely Americans and selling those
movie tickets and corsets to a largely American consumer
base. We did however include the alternate British and
Canadian spellings as entry terms in accordance with
section 3.6.2.
Though
the thesaurus uses American spelling, we have also included
several preferred terms of non-English origin. These
include 'corsets spécialité' and 'corsets
callisthenic', and were included to further enhance
the above-mentioned 'Victorian experience'. While section
3.6.7.1 suggests we use words from other languages when
they are commonly accepted, it says nothing about what
terms you may include if you are intending to manipulate
what is commonly accepted.
MMM
intends to make available a few pre-set corset styles
- those that are worn by the stars in the movie. As
such, the characteristic of the star association is
an important element to include in the indexing of those
items, as consumers will likely want to search by the
stars' names. That being so, it was necessary to include
the stars' names in the thesaurus. Fortunately, only
a select few of the stars in the movie were actually
wearing corsets (er, I mean, the rest of the stars were
men), so the number to include was limited. However,
in accordance with section 3.6.8 of the Guidelines,
we did see fit to include variant forms of the stars'
names as entry terms. This included both their real
names and the names of the roles they played in the
movie.
Capitalization
Section
3.7 in the Guidelines essentially recommends that one
use lower case throughout the thesaurus, except for
the first letter of proper names. You may have noticed
that this is simply not the case for the corset thesaurus.
The problem with this suggestion is that it does not
play nicely with the thesaurus management and creation
software we used, MultiTes.
Or,
more precisely, the display requirements in section 6.3.3
call on us to distinguish between preferred and non-preferred
terms. While the Guidelines suggest we do this via bolding
or italics, MultiTes does not support these. Our choices,
then, were to format everything later in a word processing
environment, or find another way to distinguish preferred
from non-preferred terms. The problems with formatting
later were that, first, it was going to cost MMM even
more (they do have a budget), and second, as new indexers
and thesaurus creators, we needed to be able to distinguish
the terms right from the beginning to understand what
we were doing.
MultiTes
does however support both upper and lower case,
and so we decided to use case to distinguish the preferred
terms (capitals, as in CORSETS) from the non-preferred
terms (lower case, as in 'stays'). While this works fine from
6.3.3's perspective, it doesn't work at all from 3.7's
perspective. We did however display the non-preferred
proper noun with proper orthography.
In
the end however, MMM is concerned with selling those
corsets and movie tickets, and sadly, indicating proper
orthography to its various user groups is not a major
concern.
To
appease the NISO Gods via section 3.7.2.2 we did however
remove as many hyphens as possible. This left only two
- the first was the 'S-curve corset', because we felt
it would be confusing without the hyphen unless we added
quotation marks, which of course would have thrown the
sorting out of whack. The second was a star's name,
which was a proper noun.
The
other difficulty we encountered was due to the fact
that MultiTes does not support italics, either. This
meant that instead of distinguishing nodes or facets
from descriptors using italics, we could rely only on
the application of angle brackets. However, since some
of our nodes required alternate forms as entry terms
('colours' for 'colors'), the preferred versions will
remain lowercase to further distinguish them from descriptors.
Because
MultiTes does not support diacritical marks (Section
3.7.2.4 & 3.6.7.1) we chose not to insert them into
our thesaurus. As well, the typical user would not insert
them in searching. Our search software supports our
terms with or without diacritical marks.
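Matching terms with or without diacritical marks is commonly done by normalizing both the query and the stored term before comparing them. The sketch below uses Python's standard `unicodedata` module; it shows the general technique only, and we make no claim that MMM's search software works this way. The function names are our own.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose accented characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, term: str) -> bool:
    """Compare two strings while ignoring case and diacritical marks."""
    return strip_diacritics(query).casefold() == strip_diacritics(term).casefold()


print(strip_diacritics("corsets spécialité"))               # corsets specialite
print(matches("Corsets Specialite", "corsets spécialité"))  # True
```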
Compound Descriptors
As
mentioned above under the heading 'Single vs. Multiword
Descriptors', Section 4.1 of the Guidelines permits
multiword or compound descriptors so long as each represents
a single indexable concept. We retained compound descriptors
when splitting the two words apart would change their
meaning, for example 'diagonal boning'. We also kept
compound descriptors that represented a 'type' as opposed
to a 'part', for example 'divorce corset'.
RELATIONSHIP STRUCTURES
The corset thesaurus features the syndetic structure of
the three relationships found in section 5 of the Guidelines.
Reciprocity is a feature of all three relationships. It
may be asymmetric, as in the case of equivalence
and hierarchical relationships, or it may
be symmetrical, as is the case with associative relationships.
EQUIVALENCE
Where
there is more than one term in the thesaurus that expresses
the same concept, a preferred term, or descriptor, had
to be determined. We chose our descriptors following
literary warrant where possible, and kept in mind user
warrant when choosing lexical variants and synonyms.
This is in accordance with section 5.2.
Synonyms
The corset thesaurus features a number of types of equivalence
relationships, as discussed in section 5.2.2 of the
Guidelines, including some synonyms based on current
or favoured terms, as well as common or slang nouns.
For example:
5.2.2.e: current vs outdated terms
  stays
    USE CORSETS
  CORSETS
    UF stays
5.2.2.f: common nouns and slang or jargon
  merry widows
    USE CORSETS
  CORSETS
    UF merry widows
Section
5.2.3 discusses lexical variants. The term 'corsets'
is found in many spellings in the thesaurus, as through
history the spelling has varied. As the thesaurus focuses
on nineteenth century corsets, we included some of the
historical spellings, but for user warrant chose the
most current spelling to aid in recall for searching.
We record lexical variants such as 'coursettes' through
a reciprocal USE/UF relationship.
  coursettes
    USE CORSETS
  CORSETS
    UF coursettes
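Because every USE reference must have a matching UF entry, the reciprocal half can be derived rather than typed by hand. A small sketch using the three entry terms shown above; the dictionary layout and function name are assumptions for illustration, not a description of how MultiTes stores its data.

```python
from collections import defaultdict

# Entry term -> preferred term (the USE reference).
USE = {
    "stays": "CORSETS",
    "merry widows": "CORSETS",
    "coursettes": "CORSETS",
}


def used_for(use_map):
    """Derive the reciprocal UF list for each preferred term."""
    uf = defaultdict(list)
    for entry_term, preferred in use_map.items():
        uf[preferred].append(entry_term)
    return {term: sorted(entries) for term, entries in uf.items()}


UF = used_for(USE)
print(UF["CORSETS"])  # ['coursettes', 'merry widows', 'stays']
```

The relationship is reciprocal but asymmetric: the entry term points to the descriptor with USE, and the descriptor points back with UF.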
HIERARCHICAL
The hierarchical relationship is discussed in section
5.3 of the Guidelines. This relationship is the one
that distinguishes a thesaurus from a glossary, or list
of words. There are two sides to this relationship
- superordinates or broader terms (BTs) and subordinates
or narrower terms (NTs). Of course in a corset thesaurus
Narrower Terms can have more than one meaning! The hierarchy
may extend to many levels.
As
discussed in sections 5.3.1 & 5.3.2, both generic
relationships and whole-part relationships are present
in the corset thesaurus. In fact, the term 'corsets'
includes both relationships in its narrower terms, with
'corset types' and 'corset components'. However, 'corsets'
are not one of the listed instances where a whole-part
relationship is recommended, and so we have used thesaural
licence here. Section 5.3.2 does however explicitly
state that "the four types enumerated below are
not intended to be exhaustive", so it is unlikely
that we will be dragged away by the thesaurus police
anytime soon.
As
discussed in section 5.3.5, we incorporated node labels
to bring together sibling terms, such as <corset types>.
We enclosed these in angle brackets, and
they will not be used as descriptors in indexing.
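The BT/NT structure and the role of node labels can be pictured with a small sketch. The fragment of hierarchy below is an assumption put together from terms mentioned in this document (the real display is lower or upper case according to our capitalization rule, and the real hierarchy is deeper); the function names are ours.

```python
# Hypothetical fragment: broader term -> narrower terms.
NT = {
    "corsets": ["<corset types>", "<corset components>"],
    "<corset types>": ["divorce corset", "S-curve corset"],
    "<corset components>": ["busks", "lacings"],
}


def broader_terms(nt_map):
    """Invert the NT map so each narrower term points back to its BT."""
    return {child: parent
            for parent, children in nt_map.items()
            for child in children}


def is_node_label(term):
    """Node labels are enclosed in angle brackets and are never indexed."""
    return term.startswith("<") and term.endswith(">")


def ancestors(term, bt_map):
    """Walk up the BT chain, skipping node labels."""
    chain = []
    while term in bt_map:
        term = bt_map[term]
        if not is_node_label(term):
            chain.append(term)
    return chain


BT = broader_terms(NT)
print(BT["busks"])             # <corset components>
print(ancestors("busks", BT))  # ['corsets']
```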
ASSOCIATIVE RELATIONSHIPS
This
symmetrical relationship, covered in section 5.4, links
terms that are neither equivalent nor hierarchically
related, yet are conceptually or semantically linked
in such a manner that the relationship should be noted.
For
example:
5.4.2.a: discipline and objects studied
  CORSETRY
    RT: CORSETS
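Since the associative relationship is symmetrical, recording it once should produce the link in both directions. A minimal sketch; the data structure and function names are hypothetical.

```python
def relate(rt, a, b):
    """Record an associative (RT) link in both directions."""
    rt.setdefault(a, set()).add(b)
    rt.setdefault(b, set()).add(a)


def is_symmetric(rt):
    """Check that every RT link has its reciprocal."""
    return all(a in rt.get(b, set())
               for a, related in rt.items()
               for b in related)


RT = {}
relate(RT, "CORSETRY", "CORSETS")
print(RT["CORSETS"])     # {'CORSETRY'}
print(is_symmetric(RT))  # True
```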
PRECISION AND RECALL
The
corset thesaurus uses a number of features to aid in
precision and recall.
We
used a controlled vocabulary rather than natural language
or free language to increase precision and recall.
Our hierarchical relationships extend to as many as
five levels, which aids precision.
We
have included many scope notes to clarify terminology.
Because we had such a heterogeneous user group we felt
it was important to clarify any terms that could be
unfamiliar to our users. This way indexing terms are
more likely to be applied correctly, increasing precision.
For example:
  BUSKS
    SN: Piece of wood, whalebone, ivory, horn, or steel
    slotted into the front of the corset to hold the torso
    erect.
Homographs
within the context of the thesaurus are differentiated
by qualifiers to aid in precision and to aid in the
integration of the thesaurus into the MMM thesaurus.
For
example:
  horn (animal)
For
the purchasers of corsets, precision and recall are
controlled by the organization of the order form. Drop-down
menus for each component and variable will specify choices.
There is an inverse relationship here. The more components
a purchaser specifies, the higher the precision and the lower
the number of hits, or recall. Conversely, purchasers
who explicitly choose only a few components will have
a higher recall or number of hits (corset choices),
but lower precision.
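The trade-off on the order form can be illustrated with a toy catalogue, using 'recall' in the loose sense above of the number of hits returned. The four catalogue entries and the `hits` function are invented for this sketch.

```python
# Hypothetical pre-set corsets, described by three drop-down choices.
CATALOGUE = [
    {"color": "scarlet", "fabric": "sateen", "busk": "spoon busk"},
    {"color": "scarlet", "fabric": "sateen", "busk": "straight busk"},
    {"color": "black", "fabric": "sateen", "busk": "spoon busk"},
    {"color": "black", "fabric": "coutil", "busk": "straight busk"},
]


def hits(catalogue, **choices):
    """Return the items that agree with every choice the purchaser made."""
    return [item for item in catalogue
            if all(item.get(key) == value for key, value in choices.items())]


print(len(hits(CATALOGUE)))                                     # 4
print(len(hits(CATALOGUE, fabric="sateen")))                    # 3
print(len(hits(CATALOGUE, fabric="sateen", color="scarlet")))   # 2
print(len(hits(CATALOGUE, fabric="sateen", color="scarlet",
               busk="spoon busk")))                             # 1
```

Each additional drop-down choice narrows the result list: fewer hits, but each one closer to what the purchaser asked for.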
A
link to the thesaurus will be available on the order
page for users to read scope notes. Access to these
scope notes should help increase precision from the
searcher's perspective.
SPECIFICITY AND EXHAUSTIVITY
These
two areas relate to the detail and depth of vocabulary
for the domain of our thesaurus. How precisely can the
user describe a concept or item? How many terms are
available for a specific concept?
The
corset thesaurus covers a very narrow domain, and within
that domain, specificity and exhaustivity currently
vary. For example, both bodice materials and colors
are offered in many varieties, including historical colors,
as we felt these very visible details would be of primary
importance to buyers. This held true for busk materials
as well, as this is a special feature that we felt users
would want control over. But lacings and lining are
not covered in any depth as users will have no choice
for these less visible components.
Color
was a thorny issue: settling on the level of specificity
and exhaustivity for color caused us some difficulty.
Since this module is predicted to be the first of many
modules for merchandise, all major colors in the color
wheel should be represented. However, the basic colors
are already included in MMM's Thesaurus, so only historically
accurate corset colors were included in our thesaurus.