First, VDGEC derives the set of candidate concepts from the multimedia corpus by extracting quality concepts as concept candidates.
A quality concept is defined as a sequence of words that appears contiguously in the text and forms a complete semantic unit in a certain context of the given documents, and is required to meet several quality criteria (representativeness, concordance, and informativeness, detailed below).
The above idea can be implemented in various ways; we implement it as follows.
Frequent pattern mining is first applied to collect aggregate counts for all concepts in the corpus that satisfy a minimum support threshold, according to the representativeness criterion.
To efficiently mine these frequent phrases (concept candidates), we draw upon two properties of frequent patterns.
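As an illustration, the frequent-pattern-mining step can be sketched as a brute-force n-gram counter with a support filter; the toy corpus, tokenization, and threshold value below are hypothetical, and the sketch omits the efficiency optimizations mentioned above:

```python
from collections import Counter

def mine_frequent_ngrams(docs, max_len=6, min_support=2):
    """Count all contiguous word n-grams up to max_len and keep those
    meeting the minimum support threshold (representativeness)."""
    counts = Counter()
    for doc in docs:
        for n in range(1, max_len + 1):
            for i in range(len(doc) - n + 1):
                counts[tuple(doc[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_support}

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["support", "vector", "machine", "training"],
    ["a", "support", "vector", "machine", "classifier"],
    ["vector", "space", "model"],
]
candidates = mine_frequent_ngrams(docs, min_support=2)
```

Infrequent n-grams such as "space model" are pruned, leaving only candidates that clear the support threshold.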
Multiple features designed to quantify informativeness and concordance are computed for each candidate. We first run the Aho-Corasick automaton algorithm, tailored to find all occurrences of the phrase candidates, and then extract the following features.
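A compact sketch of Aho-Corasick-style multi-pattern counting follows (character-level here for brevity; the same automaton runs unchanged over token sequences, and the example patterns are the textbook ones, not from the corpus):

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links (Aho-Corasick)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())   # breadth-first over the trie
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] = out[v] + out[fail[v]]  # inherit matches via failure link
    return goto, fail, out

def count_occurrences(text, patterns):
    """Count every occurrence of every pattern in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    counts = {p: 0 for p in patterns}
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            counts[p] += 1
    return counts
```

A single scan over the corpus thus yields the raw frequency of every candidate simultaneously, which the feature computations below rely on.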
Concordance Features
To make phrases with different lengths comparable, we partition each phrase candidate into two disjoint parts in all possible ways and derive effective features measuring their concordance. Suppose for each word or phrase $u \in \mathcal{U}$, we have its raw frequency $f_u$. Its probability $p(u)$ is defined as:
$$p(u) = \frac{f_u}{\sum_{u' \in \mathcal{U}} f_{u'}}$$
Given a phrase $v$, we split it into the two most likely sub-units $\langle u_l, u_r \rangle$ such that the pointwise mutual information is minimized. Pointwise mutual information quantifies the discrepancy between the probability of their true collocation and the presumed collocation under the independence assumption. Mathematically,
$$\langle u_l, u_r \rangle = \arg\min_{u_l \oplus u_r = v} \log \frac{p(v)}{p(u_l)\,p(u_r)}$$
With $\langle u_l, u_r \rangle$, we directly use the pointwise mutual information as one of the concordance features:
$$PMI(u_l, u_r) = \log \frac{p(v)}{p(u_l)\,p(u_r)}$$
Another feature, also from information theory, is the pointwise Kullback-Leibler divergence:
$$PKL(v \,\|\, \langle u_l, u_r \rangle) = p(v) \log \frac{p(v)}{p(u_l)\,p(u_r)}$$
The additional factor $p(v)$, multiplied with the pointwise mutual information, leads to less bias toward rarely occurring phrases.
Both features are supposed to be positively correlated with concordance.
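The two concordance features can be computed directly from the frequency table; a minimal sketch (the frequencies and normalizer below are hypothetical):

```python
import math

def concordance_features(phrase, freq, total):
    """PMI and PKL of a phrase, using the split <u_l, u_r> that
    minimizes pointwise mutual information.

    phrase: tuple of words; freq: raw frequencies of words/phrases
    (as tuples); total: normalizer so that p(u) = freq[u] / total."""
    p = lambda u: freq[u] / total
    # Try every split of the phrase into a left and right sub-unit.
    best_pmi = None
    for i in range(1, len(phrase)):
        left, right = phrase[:i], phrase[i:]
        pmi = math.log(p(phrase) / (p(left) * p(right)))
        if best_pmi is None or pmi < best_pmi:
            best_pmi = pmi
    pkl = p(phrase) * best_pmi  # extra factor p(v) de-emphasizes rare phrases
    return best_pmi, pkl

# Hypothetical counts for a two-word candidate.
freq = {("support", "vector"): 10, ("support",): 20, ("vector",): 15}
pmi, pkl = concordance_features(("support", "vector"), freq, total=1000)
```

A positive PMI, as here, indicates the two sub-units co-occur more often than independence would predict.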
Informativeness Features
Some candidates are unlikely to be informative because they consist of functional or stop words. We incorporate a stop-word-based feature into the classification process: whether the phrase begins or ends with a stop word, which requires a dictionary of stop words. Phrases that begin or end with stop words, such as “I am,” are often functional rather than informative.
A more generic feature measures informativeness based on corpus statistics, namely the average inverse document frequency (IDF) over the words of a phrase, where the IDF of a word $w$ in a corpus $C$ is computed as
$$IDF(w) = \log \frac{|C|}{|\{d \in C : w \in d\}|}$$
It is a traditional information retrieval measure of how much information a word provides for retrieving a small subset of documents from a corpus. In general, quality phrases are expected to have an average IDF that is not too small.
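The average-IDF feature follows directly from the document frequencies; a minimal sketch over a hypothetical tokenized corpus:

```python
import math
from collections import Counter

def avg_idf(phrase_words, docs):
    """Average IDF over the words of a phrase.
    IDF(w) = log(|C| / df(w)), with df(w) the number of docs containing w."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    idfs = [math.log(len(docs) / df[w]) for w in phrase_words]
    return sum(idfs) / len(idfs)

# Hypothetical toy corpus.
docs = [["support", "vector", "machine"],
        ["the", "vector", "space"],
        ["the", "support", "of", "a", "function"]]
```

A content word like "machine" receives a high IDF, while a stop word like "the" receives a low one, so phrases of content words score higher on average.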
The concept quality estimator can be an arbitrary classifier that is effectively trainable with a small amount of labeled data and outputs a probabilistic score between 0 and 1. For instance, we can adopt a random forest, which is efficient to train with a small number of labels. The ratio of positive predictions among all decision trees can be interpreted as a phrase’s quality estimate.
The classifier, also known as the quality estimator, is trained on the above features to predict the quality of all candidate concepts, using labels derived from knowledge bases. Finally, each concept is extracted together with its context units.
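As a sketch, such a quality estimator could be a scikit-learn random forest; the feature values and labels below are hypothetical stand-ins for the concordance/informativeness features and knowledge-base labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix: rows are candidate phrases, columns are
# features such as PMI, PKL, and average IDF.
X_labeled = np.array([
    [3.5, 0.04, 2.1],   # good phrases
    [3.2, 0.03, 1.9],
    [0.1, 0.001, 0.2],  # bad phrases
    [0.2, 0.002, 0.1],
])
y_labeled = np.array([1, 1, 0, 0])  # labels, e.g. from a knowledge base

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_labeled, y_labeled)

# predict_proba returns, for each candidate, the fraction of trees voting
# positive -- interpretable as the phrase's quality score in [0, 1].
X_candidates = np.array([[3.4, 0.035, 2.0], [0.15, 0.001, 0.15]])
scores = clf.predict_proba(X_candidates)[:, 1]
```

Only a handful of labeled rows are needed here, matching the claim that the estimator trains effectively on small labeled data.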
The Inflect and Stanford NLP tools are used to correctly generate plurals, singular nouns, ordinals, and indefinite articles, and to merge different surface forms of the same concept. Similarity is computed based on the context units in which concepts appear. A synonym list for concepts is built, and the most frequent form is chosen as the final concept.