First, VDGEC derives the set of candidate concepts from the multimedia corpus by extracting quality concepts as concept candidates.
A quality concept is defined as a sequence of words that appears contiguously in the text and forms a complete semantic unit in a certain context of the given documents, and is required to meet several quality criteria (representativeness, concordance, and informativeness, detailed below).
The above idea can be implemented in various ways; we implement it as follows.
Frequent pattern mining is first applied to collect aggregate counts for all concepts in the corpus that satisfy a minimum support threshold, according to the representativeness criterion.
To efficiently mine these frequent phrases (concept candidates), we draw upon two properties of frequent patterns.
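As an illustration, the frequent-pattern-mining step can be sketched as a brute-force n-gram counter with a support filter; the toy corpus, tokenization, and threshold value below are hypothetical, and the sketch omits the efficiency optimizations mentioned above:

```python
from collections import Counter

def mine_frequent_ngrams(docs, max_len=6, min_support=2):
    """Count all contiguous word n-grams up to max_len and keep those
    meeting the minimum support threshold (representativeness)."""
    counts = Counter()
    for doc in docs:
        for n in range(1, max_len + 1):
            for i in range(len(doc) - n + 1):
                counts[tuple(doc[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_support}

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["support", "vector", "machine", "training"],
    ["a", "support", "vector", "machine", "classifier"],
    ["vector", "space", "model"],
]
candidates = mine_frequent_ngrams(docs, min_support=2)
```

Infrequent n-grams such as "space model" are pruned, leaving only candidates that clear the support threshold.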
Multiple features designed to quantify informativeness and concordance are computed for each candidate. We first run the Aho-Corasick automaton algorithm, tailored to find all occurrences of the phrase candidates, and then extract the following features.
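A compact sketch of Aho-Corasick-style multi-pattern counting follows (character-level here for brevity; the same automaton runs unchanged over token sequences, and the example patterns are the textbook ones, not from the corpus):

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links (Aho-Corasick)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())   # breadth-first over the trie
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] = out[v] + out[fail[v]]  # inherit matches via failure link
    return goto, fail, out

def count_occurrences(text, patterns):
    """Count every occurrence of every pattern in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    counts = {p: 0 for p in patterns}
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            counts[p] += 1
    return counts
```

A single scan over the corpus thus yields the raw frequency of every candidate simultaneously, which the feature computations below rely on.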
Concordance Features
To make phrases with different lengths comparable, we partition each phrase candidate into two disjoint parts in all possible ways and derive effective features measuring their concordance. Suppose for each word or phrase $u \in \mathcal{U}$, we have its raw frequency $f_u$. Its probability $p(u)$ is defined as:
$$p(u) = \frac{f_u}{\sum_{u' \in \mathcal{U}} f_{u'}}$$
Given a phrase $v$, we split it into the two most likely sub-units $\langle u_l, u_r \rangle$ such that the pointwise mutual information is minimized. Pointwise mutual information quantifies the discrepancy between the probability of their true collocation and the presumed collocation under the independence assumption. Mathematically,
$$\langle u_l, u_r \rangle = \arg\min_{u_l \oplus u_r = v} \log \frac{p(v)}{p(u_l)\,p(u_r)}$$
With $\langle u_l, u_r \rangle$, we directly use the pointwise mutual information as one of the concordance features:
$$PMI(u_l, u_r) = \log \frac{p(v)}{p(u_l)\,p(u_r)}$$
Another feature, also from information theory, is the pointwise Kullback-Leibler divergence:
$$PKL(v \,\|\, \langle u_l, u_r \rangle) = p(v) \log \frac{p(v)}{p(u_l)\,p(u_r)}$$
The additional factor $p(v)$, multiplied with the pointwise mutual information, leads to less bias toward rarely occurring phrases.
Both features are supposed to be positively correlated with concordance.
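The two concordance features can be computed directly from the frequency table; a minimal sketch (the frequencies and normalizer below are hypothetical):

```python
import math

def concordance_features(phrase, freq, total):
    """PMI and PKL of a phrase, using the split <u_l, u_r> that
    minimizes pointwise mutual information.

    phrase: tuple of words; freq: raw frequencies of words/phrases
    (as tuples); total: normalizer so that p(u) = freq[u] / total."""
    p = lambda u: freq[u] / total
    # Try every split of the phrase into a left and right sub-unit.
    best_pmi = None
    for i in range(1, len(phrase)):
        left, right = phrase[:i], phrase[i:]
        pmi = math.log(p(phrase) / (p(left) * p(right)))
        if best_pmi is None or pmi < best_pmi:
            best_pmi = pmi
    pkl = p(phrase) * best_pmi  # extra factor p(v) de-emphasizes rare phrases
    return best_pmi, pkl

# Hypothetical counts for a two-word candidate.
freq = {("support", "vector"): 10, ("support",): 20, ("vector",): 15}
pmi, pkl = concordance_features(("support", "vector"), freq, total=1000)
```

A positive PMI, as here, indicates the two sub-units co-occur more often than independence would predict.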
Informativeness Features
Some candidates are unlikely to be informative because they consist of functional or stop words. We incorporate a stop-word-based feature into the classification process: whether the phrase begins or ends with a stop word, which requires a dictionary of stop words. Phrases that begin or end with stop words, such as “I am,” are often functional rather than informative.
A more generic feature measures informativeness based on corpus statistics, namely the average inverse document frequency (IDF) over the words of a phrase, where the IDF of a word $w$ in a corpus $C$ is computed as
$$IDF(w) = \log \frac{|C|}{|\{d \in C : w \in d\}|}$$
It is a traditional information retrieval measure of how much information a word provides for retrieving a small subset of documents from a corpus. In general, quality phrases are expected to have an average IDF that is not too small.
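The average-IDF feature follows directly from the document frequencies; a minimal sketch over a hypothetical tokenized corpus:

```python
import math
from collections import Counter

def avg_idf(phrase_words, docs):
    """Average IDF over the words of a phrase.
    IDF(w) = log(|C| / df(w)), with df(w) the number of docs containing w."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    idfs = [math.log(len(docs) / df[w]) for w in phrase_words]
    return sum(idfs) / len(idfs)

# Hypothetical toy corpus.
docs = [["support", "vector", "machine"],
        ["the", "vector", "space"],
        ["the", "support", "of", "a", "function"]]
```

A content word like "machine" receives a high IDF, while a stop word like "the" receives a low one, so phrases of content words score higher on average.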
The concept quality estimator can be an arbitrary classifier that is effectively trainable with a small amount of labeled data and outputs a probabilistic score between 0 and 1. For instance, we can adopt a random forest, which is efficient to train with a small number of labels. The ratio of positive predictions among all decision trees can be interpreted as a phrase’s quality estimate.
The classifier, also known as the quality estimator, is trained on the above features to predict the quality of all candidate concepts, using labels derived from knowledge bases. Finally, each concept is extracted together with its context units.
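As a sketch, such a quality estimator could be a scikit-learn random forest; the feature values and labels below are hypothetical stand-ins for the concordance/informativeness features and knowledge-base labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix: rows are candidate phrases, columns are
# features such as PMI, PKL, and average IDF.
X_labeled = np.array([
    [3.5, 0.04, 2.1],   # good phrases
    [3.2, 0.03, 1.9],
    [0.1, 0.001, 0.2],  # bad phrases
    [0.2, 0.002, 0.1],
])
y_labeled = np.array([1, 1, 0, 0])  # labels, e.g. from a knowledge base

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_labeled, y_labeled)

# predict_proba returns, for each candidate, the fraction of trees voting
# positive -- interpretable as the phrase's quality score in [0, 1].
X_candidates = np.array([[3.4, 0.035, 2.0], [0.15, 0.001, 0.15]])
scores = clf.predict_proba(X_candidates)[:, 1]
```

Only a handful of labeled rows are needed here, matching the claim that the estimator trains effectively on small labeled data.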
The Inflect and Stanford NLP tools are used to correctly generate plurals, singular nouns, ordinals, and indefinite articles, and to merge different surface forms of the same concept. Similarity is computed based on the context units in which concepts appear. A synonym list for concepts is built, and the most frequent form is chosen as the final concept.