Exploratory analysis of semantic categories: comparing data-driven and human similarity judgments

Lindh-Knuutila, Tiina; Honkela, Timo

doi:10.1186/s40469-015-0001-1

Research
Open access
Published: 07 July 2015

Computational Cognitive Science volume 1, Article number: 2 (2015)

Abstract

Background

In this article, automatically generated and manually crafted semantic representations are compared. The comparison takes place under the assumption that neither of these has a primary status over the other. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful in assessing the quality or improving the coverage of hand-created semantic resources.

Methods

We apply two unsupervised learning methods, Independent Component Analysis (ICA) and a probabilistic topic model at word level using Latent Dirichlet Allocation (LDA), to create semantic representations from a large text corpus. We further compare the obtained results to two semantically labeled dictionaries. In addition, we use the Self-Organizing Map to visualize the obtained representations.

Results

We show that both methods find a considerable amount of category information in an unsupervised way. Rather than only finding groups of similar words, they can automatically find a number of features that characterize words. The unsupervised methods are also used in exploration. They provide findings which go beyond the manually predefined label sets. In addition, we demonstrate how the Self-Organizing Map visualization can be used in exploration and further analysis.

Conclusion

This article compares unsupervised learning methods and semantically labeled dictionaries. We show that these methods are able to find categorical information. In addition, they can further be used in an exploratory analysis. In general, information-theoretically motivated and probabilistic methods provide results that are at a comparable level. Moreover, the automatic methods and human classifications give complementary access to semantic categorization. Data-driven methods can furthermore be cost-effective and adapt to a particular domain through appropriate choice of data sets.

Background

In this article, we explore the relationship between human and data-driven semantic similarity judgments. The general architecture of this work is presented in Figure 1. We aim a) to see whether the representations that are automatically generated in a data-driven manner coincide with manually constructed semantic categories, and b) to critically assess manually constructed semantic categories and semantically annotated data using statistical machine learning and visualization methods.

Challenge of semantics

Semantics is an intriguing but also a challenging area of linguistics. Linguists and researchers in nearby disciplines have created a number of theories related to semantics. These theories have been used as frameworks for semantic description or for labeling of lexica and corpora (Cruse 1986). On the other hand, availability of large text corpora and sophisticated statistical machine learning algorithms has made it possible to automatically conduct semantically oriented analysis of corpora and lexical items (Manning and Schütze 1999).

When the objective is to create linguistic models and theories, a traditional approach is to rely on linguists’ intuition and knowledge building in a community of professional linguists. In corpus linguistics, linguistic theories are usually the starting point and statistical analyses on corpus data are used to confirm, reject and refine these theories (McEnery 2001). In such a paradigm, basic linguistic categories like noun and verb are taken as given and may even be assumed to have an objective status. Similarly, when computer scientists work on some linguistic data, they very often use human-constructed categories and labels as a ground truth to evaluate the performance of the computational apparatus. For syntax, there is a large number of competing and mostly mutually incompatible theories and category systems (Rauh 2010). In recent years, some linguists have pointed out that there is no good evidence for pre-established syntactic categories that would be shared by all or a large number of languages (Haspelmath 2007).

A unidirectional view on knowledge formation within computational linguistics is problematic. There is no generally accepted theory of semantics at the level of semantic categories or primitives, even though the quest for universal primitives has been active (Goddard and Wierzbicka 2002). In general, any classification system is prone to subjective variation even among experts in the field (Johnston 1968). Some research has been conducted on modeling this subjective variation (Caramazza et al. 1976; Honkela et al. 2012). In information retrieval, it has been known for a long time that indexers are inconsistent from one to another or from one time to another (Bates 1986) and that two individuals often use different expressions to describe the same thing (Chen 1994). This kind of inherent human subjectivity should also influence semantic theories in linguistics. It is useful to view language as a complex adaptive socio-cognitive system, rather than a static system of abstract grammatical principles (Beckner et al. 2009).

Unsupervised learning of linguistic models

Based on what was discussed above, we must consider any semantic category system or a semantically labeled corpus as a representation which may have well motivated alternatives. Based on the availability of text and speech corpora as well as sophisticated computational tools, an increasingly popular approach is data-driven: linguistic models are created using statistical and machine learning methods.

We are particularly interested in methods that are applicable without strong linguistic assumptions. Therefore, we focus on the unsupervised learning approach rather than any supervised learning (classification) methods. More specifically, we first compare the use of Independent Component Analysis (ICA) (Hyvärinen et al. 2001) and generative topic models, in particular Latent Dirichlet Allocation (LDA) (Blei et al. 2003), in automatically extracting linguistic features in a data-driven manner. In comparison with clustering methods, which also belong to the unsupervised learning methods, ICA and LDA provide an important additional advantage: they find feature representations for words. That is, they do not simply assign words to different clusters but represent each word through a collection of features. In the ICA method, these emergent features are called components, whereas in the LDA model they are called topics. For example, the word ‘women’ could be associated with the emergent categories of living things, humans and females. Furthermore, the methods can also come up with a representation where the syntactic category plural is associated with the word ‘women’. In this, as in many other cases, syntactic categories are actually related to an abstract level of meaning. The difference between clustering and feature analysis is illustrated in Figure 2.
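The contrast between a single cluster label and a feature representation can be made concrete with a small sketch. The feature names follow the ‘women’ example above, but the activation values, the threshold and the use of the strongest feature as a stand-in for a hard cluster assignment are all illustrative assumptions; nothing here is output from ICA or LDA.

```python
# Hypothetical emergent features (components or topics); names are illustrative.
features = ["living", "human", "female", "plural"]

# Hypothetical feature activations per word. These are toy values chosen by hand,
# not results produced by ICA or LDA.
activations = {
    "women": [0.9, 0.8, 0.7, 0.6],
    "woman": [0.9, 0.8, 0.7, 0.0],
    "dogs":  [0.9, 0.0, 0.1, 0.6],
}

def hard_cluster(vec):
    """Clustering-style view: keep only the single strongest feature."""
    best = max(range(len(vec)), key=lambda k: vec[k])
    return features[best]

def active_features(vec, threshold=0.5):
    """Feature-style view: keep every feature whose activation reaches the threshold."""
    return [features[k] for k, a in enumerate(vec) if a >= threshold]

for word, vec in activations.items():
    print(word, hard_cluster(vec), active_features(vec))
```

In this toy setting all three words receive the same single label, while the feature view still separates ‘women’ from ‘woman’ (plural) and from ‘dogs’ (human, female).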

We further analyze and visualize the data using the Self-Organizing Map (SOM) (Kohonen 2001). The SOM is widely used as a visualization method and has proven to be a viable alternative even when compared with more recent developments (Venna and Kaski 2006). We use the SOM for an analysis of special cases highlighted by the ICA and LDA analysis to reveal additional structure and to consider potential problems and ambiguities related to manually constructed semantic models.

Earlier and related work

Here the basic building blocks for this research are described including methods for vector space modeling, semantic similarity calculations, and unsupervised learning algorithms for linguistic processing. Earlier work in these areas is also discussed.

Word vector space model

Word vector space models (VSM) are based on a well-known hypothesis on the relationship between semantic similarity and context data: “two words are semantically similar to the extent that their contextual representations are similar” (Miller and Charles 1991). They capture meaning through word usage and are widely used in computational linguistics (Honkela et al. 2010; Landauer and Dumais 1997; Sahlgren 2006; Schütze 1993; Turney and Pantel 2000). For example, Turney and Pantel (2000) and Erk (2012) provide extensive reviews of the state of the art in vector space models. In a vector space model, it is assumed that semantic relatedness equals proximity in the vector space: related words are close, and unrelated words are distant (Schütze 1993).

The model construction takes place in several steps. First, the text data is pre-processed and feature selection can be applied. The context word frequencies are calculated, and raw frequency counts are transformed by weighting. Dimensionality reduction can be applied to smooth the space. Finally, the similarities between word vectors are calculated by using a vector distance measure (Turney and Pantel 2000).
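The weighting and similarity steps can be sketched as follows. This is a minimal illustration only: the logarithmic weighting and cosine similarity are assumed here as common choices and are not claimed to be the ones used in this article, and the count vectors are hypothetical.

```python
import math

def log_weight(vec):
    """Dampen raw counts with log(1 + x); an assumed weighting, for illustration only."""
    return [math.log1p(x) for x in vec]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical raw co-occurrence counts of three target words over four context words.
raw = {
    "car":   [10, 8, 0, 1],
    "truck": [7, 9, 1, 0],
    "apple": [0, 1, 12, 9],
}

weighted = {word: log_weight(vec) for word, vec in raw.items()}
sim_car_truck = cosine(weighted["car"], weighted["truck"])
sim_car_apple = cosine(weighted["car"], weighted["apple"])
print(sim_car_truck > sim_car_apple)
```

Words that share contexts end up with a higher cosine value, which is the sense in which relatedness is read off as proximity in the vector space.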

To obtain the raw word co-occurrence count representation for N target words, the occurrences of each of the C context words inside a window of size l positioned around each occurrence of the target word are counted. Accumulating these counts creates a C × N word co-occurrence matrix X. The size of the context around the target word affects the results. The context used can be a document, or a more immediate context around the target word. Bullinaria and Levy (2007) provide a systematic analysis of different context sizes. Sahlgren (2006) concludes that a small context around a target word gives rise to paradigmatic relations between words, whereas a larger context allows syntagmatic relations to be more prominent. See also Rapp (2002) for comparisons of paradigmatic and syntagmatic relations. As the concepts in the categories are mostly in a paradigmatic relationship, we use a bag-of-words representation with a window of size l = 1 + 1, that is, one word to the left and one word to the right of the target word.
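The counting procedure can be sketched in a few lines. The function name, the toy sentence and the choice of target words are hypothetical; only the window of one word to the left and one word to the right follows the text.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, targets, left=1, right=1):
    """Count context words within `left`/`right` positions of each target occurrence."""
    targets = set(targets)
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if word not in targets:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:  # the target occurrence itself is not a context word
                counts[word][tokens[j]] += 1
    return counts

# Toy corpus and target words, for illustration only.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
target_list = ["cat", "dog", "sat"]
counts = cooccurrence_counts(tokens, target_list)

# Arrange the counts as a C x N matrix: one row per context word, one column per target.
context_vocab = sorted({c for ctr in counts.values() for c in ctr})
X = [[counts[t][c] for t in target_list] for c in context_vocab]

for context_word, row in zip(context_vocab, X):
    print(context_word, row)
```

Here ‘cat’ and ‘dog’ receive identical columns because they occur in the same immediate contexts, which is the kind of paradigmatic similarity that a narrow window favors.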

Semantic similarity judgments

Similarity judgment is considered to be one of the most central functions in human cognition (Goldstone 1994). Humans use similarity to store and retrieve information, and to compare new situations to similar experiences in the past. Category learning and concept formation also depend on similarity judgment (Schwering 2008). Research has been carried out to obtain information on human similarity judgments and different types of similarity have been identified, such as synonymy (automobile:car), antonymy (good:bad), hypernymy (vehicle:car) and meronymy (car:wheel) (Cruse 1986). A special case is family resemblance, in which the members of a category are perceived as possessing some similar characteristics (VEHICLE: car, bicycle). Based on similarity judgment research in psychology and related fields, data sets that list words that are judged to be similar have been used to evaluate vector space models, explored for example in Baroni and Lenci (2011) and Lindh-Knuutila et al. (2012), with an intuition that the similarity perceived by humans should be translated as proximity in a word vector space. Another approach is to use a taxonomy or ontology as a basis for the similarity calculations (Seco et al. 2004). A new prominent evaluation direction is comparing corpus-derived vector representations to brain imaging results obtained with functional Magnetic Resonance Imaging (fMRI) (Mitchell et al. 2008; Murphy et al. 2012) or magnetoencephalography (MEG) (Sudre et al. 2012).

Direct vector space model evaluation concentrates on VSM performance: it measures the similarities of given words in the VSM and requires human-annotated sources. For English, there are several such evaluation sets for analyzing the semantic similarity of vector space models. They use synonym or antonym pairs, categories and association data (Sahlgren 2006), or require separating a correct answer from incorrect ones, as in the TOEFL test set (Landauer and Dumais 1997).

General purpose algorithms for linguistic processing

In this article, we compare two methods, Independent Component Analysis (ICA) (Hyvärinen et al. 2001) and Latent Dirichlet Allocation (LDA) (Blei et al. 2003) in the analysis of vector spaces and contextual information. In particular, we are interested in how well these methods are able to extract meaningful linguistic information in an automated fashion. Latent semantic analysis (LSA) is a very popular method that is used to analyze linguistic vector spaces (Deerwester et al. 1990; Landauer and Dumais 1997). It has been shown, however, that even though LSA is useful in applications, it fails to provide explicit representations that would be comparable to linguists’ intuitions. In this task ICA is successful (Honkela et al. 2010). Now we wish to find out how the information-theoretically motivated ICA and the probabilistically motivated LDA succeed in this task. In other words, do these methods automatically find categorizations that would coincide with manually constructed semantic resources? Moreover, do these corpus based methods detect semantic similarities that have been neglected by linguists?

Terms that have been used to describe semantically related words or semantic categories that have been found using unsupervised learning methods include ‘emergent category’ (Honkela 1998), ‘latent class’ (Hofmann 1999), ‘topic’ (Blei et al. 2003; Steyvers and Griffiths 2007) and ‘sense’ (Brody and Lapata 2009). The first three can be considered to be synonymous. The term ‘sense’ is often used when multiple meanings of words are considered. Essentially, the phenomenon is still the same: What are the semantic distinctions that are made?

Methods

In this section, the corpus, the evaluation data sets and the computational methodology are described in more detail. We begin by describing the evaluation data sets, and continue with the details of the corpus and the methodological choices for building a vector space model. We then describe the unsupervised learning methods that are used in the analysis.

Data and pre-processing

In this article, we use evaluation sets that contain information on semantic categories, that is, groups of words that are judged similar in some sense. The two test sets used in this article are introduced in more detail in the following sections: the Battig set (Bullinaria 2012), based on 56 categories collected by Battig and Montague (1969), and BLESS (Baroni and Lenci 2011). Other category-based evaluation sets not used in this article include the ESSLLI 2008 set (Baroni et al. 2008), which contains 44 concrete nouns that belong to six classes, and 45 verbs that belong to nine semantic classes; Baroni’s category list of 83 concepts in 10 categories (Baroni et al. 2010), based on an updated version of the Battig-Montague list (Van Overschelde et al. 2004); and the Almuhareb list (Almuhareb 2006), which contains 402 concepts.

Battig set

The Battig evaluation set (Bullinaria 2012) has earlier been used, for example, in formulating and validating representations of word meanings from word co-occurrence statistics (Bullinaria and Levy 2007, 2012). The test set contains 53 categories with 10 words in each category. The total evaluation set size is 530 words, out of which 528 words are unique. The categories are listed in Table 1. Within each category, the words appear in the frequency order in which they are listed by Battig and Montague (1969). All words in the set are nouns, and only two word forms have more than one label: ‘orange’ is labeled with FRUIT and COLOR, and ‘bicycle’ with TOY and VEHICLE. For this article, the British English spelling of some words was changed back into American English (e.g., ‘millimetre’ to ‘millimeter’) to better conform to the English used in the Wikipedia corpus.

Table 1 The Battig categories used in this article
