Tuesday, July 21, 2009

Color and Thesauruses

by Nathan R. Moroney, HP Labs

At the recent MCSL 25th Anniversary Symposium, Nathan Moroney showed us a “book with 5000 authors” comprising color names printed in colors that denote their synonyms. The 5000 authors were contributors to his online color thesaurus. Here are more details from Nathan ...

Is the color “zaofulvin” synonymous with “orpiment” or “smalt”?
The 1911 edition of Roget’s Thesaurus (1), which early on attempted to cluster color names, has an answer.

As you can see from Table 1, “zaofulvin” and “orpiment” are synonyms.

Table 1

Color     Synonyms
White     Niveous, Canescent, Lactescent
Black     Atramentous, Fulginous
Gray      Favillous, Cinereous
Brown     Casteneous, Fuscous
Red       Anotto, Realgar, Minium
Green     Verdine, Copperas
Yellow    Orpiment, Zaofulvin, Luteous
Purple    Gridelin, Heliotrope
Blue      Bice, Zaffer, Smalt
Orange    Gild, Ocherous

But if a thesaurus is “a resource to group words according to similarity” (2), then how are we to judge the similarity of words, and particularly of color names? Kilgarriff (2) contrasts manual creation (e.g., for Roget’s Thesaurus) with automatic extraction from corpora, that is, large collections of machine-encoded text. He also emphasizes that, besides grouping words according to similarity, a thesaurus should indicate how frequently each word is used.

The ISCC-NBS color dictionary (3) significantly advanced the grouping of color names by providing about 300 name categories for over 7,500 color names taken from 13 earlier color dictionaries and vocabularies. This work partitions Munsell color space for the core vocabulary and maps that partition to the larger body of earlier color-name collections. This is a hybrid approach that merges multiple manual efforts into a single manual or expert framework.

Modern efforts at thesaurus creation through automatic extraction are making progress, but experiments with nine similarity metrics show (4) that much more work is needed. Part of the challenge is having a large enough collection of text for analysis. However, size alone is not the solution. For instance, the very large Moby Thesaurus (5) returns “pineapple” and “pear” as synonyms for “orange”, apparently mixing the fruit meaning with the color-name meaning.

An alternative approach is the direct construction of a specific color-naming corpus using the World Wide Web. Over the past eight years, I have collected over 35,000 color names from over 5,000 online volunteers. Each volunteer named seven randomly generated colored patches displayed on a white background. The colors were selected from a uniform red, green and blue sampling of what was at the time the “web-safe” palette for lower bit-depth displays. Many of these 35,000 color names are used repeatedly. Requiring that at least three participants provide a given color name yields over 600 color names. Using these names as an initial collection, a program computes synonyms by finding the closest color names in a corresponding color space. It also finds the color names that are closest to the inverses in hue and lightness as possible color antonyms. Finally, because the data were collected from thousands of participants, the program infers each name's relative frequency of use. In this way, we created a color thesaurus that comes close to an aggregate or collective clustering of colors across a large number of English speakers.
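The steps described above (filter names by number of participants, find nearest names as synonyms, find the name nearest the hue-and-lightness inverse as an antonym, and estimate frequency of use) can be sketched in a few lines of Python. This is only an illustration, not the program behind the thesaurus: the function name build_thesaurus, the input format, the use of RGB centroids with Euclidean distance, and the HLS-based inversion are all assumptions, since the article does not say which color space or distance was used.

```python
import colorsys
import math
from collections import defaultdict


def build_thesaurus(responses, min_participants=3, n_synonyms=2):
    """Hypothetical sketch. responses: iterable of
    (participant_id, color_name, (r, g, b)) with r, g, b in 0..255."""
    samples = defaultdict(list)   # name -> RGB triples given that name
    people = defaultdict(set)     # name -> distinct participants who used it
    for pid, name, rgb in responses:
        key = name.strip().lower()
        samples[key].append(rgb)
        people[key].add(pid)

    # Keep only names supplied by enough distinct participants.
    kept = [n for n in samples if len(people[n]) >= min_participants]

    # Assumption: each name is represented by its mean RGB value.
    centroid = {
        n: tuple(sum(c[i] for c in samples[n]) / len(samples[n]) for i in range(3))
        for n in kept
    }
    total = sum(len(samples[n]) for n in kept)

    def nearest(target, exclude, k):
        # Names whose centroids are closest to target (Euclidean distance).
        others = [n for n in kept if n != exclude]
        others.sort(key=lambda n: math.dist(centroid[n], target))
        return others[:k]

    def inverse(rgb):
        # Assumption: "inverse" means opposite hue and flipped lightness in HLS.
        h, l, s = colorsys.rgb_to_hls(*(c / 255 for c in rgb))
        r, g, b = colorsys.hls_to_rgb((h + 0.5) % 1.0, 1.0 - l, s)
        return (r * 255, g * 255, b * 255)

    thesaurus = {}
    for n in kept:
        antonyms = nearest(inverse(centroid[n]), n, 1)
        thesaurus[n] = {
            "synonyms": nearest(centroid[n], n, n_synonyms),
            "antonym": antonyms[0] if antonyms else None,
            "frequency": len(samples[n]) / total,
        }
    return thesaurus
```

Called on a list of (participant, name, RGB) triples, the function returns a dictionary keyed by normalized color name. A perceptually uniform color space would likely give more meaningful distances than raw RGB; RGB is used here only to keep the sketch within the standard library.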

This data is formatted as a web-based color thesaurus (6), which has been well received by online users. It has served almost 200,000 color names to date, although zaofulvin and orpiment are not included.

(1) Peter Mark Roget, Roget's Thesaurus (1911), Project Gutenberg edition, http://www.gutenberg.org/etext/10681, retrieved May 2009.

(2) Adam Kilgarriff, "Thesauruses for Natural Language Processing", Proc. Natural Language Processing and Knowledge Engineering, pp. 5-13, (2003).

(3) Kenneth L. Kelly and Deane B. Judd, The ISCC-NBS Method of Designating Colors and a Dictionary of Color Names, National Bureau of Standards Circular 553, (1965).

(4) James R. Curran and Marc Moens, "Improvements in Automatic Thesaurus Extraction", Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia, Association for Computational Linguistics, pp. 59-116, (July 2002).

(5) Moby Thesaurus, online interactive version from dict.org, http://www.dict.org/bin/Dict?Form=Dict3&Database=moby-thes, results retrieved May 2009.

(6) The Online Color Thesaurus - http://www.hpl.hp.com/personal/Nathan_Moroney/color-thesaurus.html.