| |
|
|
 |
|
Automatic tag recommendation. (SIGIR 08 paper) We proposed a real-time, automatic tag recommendation system for the Next Generation CiteSeer. Our algorithm uses Spectral Recursive Embedding (SRE) to partition the document-tag bipartite graph, and then builds a two-way Poisson Mixture Model (PMM) for each cluster. The system can make 10 tag recommendations for one document within 1 second. |
 |
|
Gaussian Process Text Classification. (CIKM 08 paper) A sparse multi-class Gaussian process framework is proposed to reduce the computational cost of applying Gaussian processes to multi-class classification problems. A novel prototype selection algorithm selects the best subset of points from the training set. The resulting algorithm achieves both scalability and high precision. |
 |
|
Multiclass Boosting Classification. (SDM07 paper) We extended the two-class Gentle AdaBoost algorithm to multiclass classification by using a multiclass exponential loss (GAMBLE). Unlike other multiclass algorithms, which reduce the K-class classification task to K binary classifications, GAMBLE handles the task directly and symmetrically with only one committee classifier. To scale up to large datasets, we use the generalized Query By Committee (QBC) active learning framework to focus learning on the most informative samples. |
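The sample-selection idea behind Query By Committee can be illustrated with a minimal sketch. It assumes vote entropy as the disagreement measure, which is a common choice but not necessarily the one used in the paper. The committee members, the `pool` values, and the function names are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes for one sample (0 means full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_most_informative(committee, pool, batch_size=1):
    """Return indices of the pool samples on which the committee disagrees most."""
    scored = []
    for idx, x in enumerate(pool):
        votes = [member(x) for member in committee]
        scored.append((vote_entropy(votes), idx))
    # Highest disagreement first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:batch_size]]


# Hypothetical committee: three threshold classifiers over 1-D inputs, 3 classes.
committee = [
    lambda x: 0 if x < 1.0 else (1 if x < 2.0 else 2),
    lambda x: 0 if x < 1.2 else (1 if x < 2.5 else 2),
    lambda x: 0 if x < 0.8 else (1 if x < 1.8 else 2),
]
pool = [0.1, 0.9, 1.5, 1.9, 3.0]

# Samples near the class boundaries draw split votes and are selected for labeling.
picked = select_most_informative(committee, pool, batch_size=2)
print(picked)
```

In an active learning loop, the selected samples would be labeled and added to the training set before the committee is retrained.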
 |
|
Text Classification. (ICDM06 paper) We used a two-level decision tree to extract noun phrases from text and used them as features, which significantly reduces the dimensionality of the feature space. We also proposed a novel collaborative filtering (CF) method that predicts missing features for small samples in order to augment the feature space. SVM and AdaBoost are applied to the feature space for classification with improved precision and recall.
|
 |
|
Text Classification. (PKDD07 paper) We proposed a new metric named informativeness and applied it as a distance metric for nearest neighbor classification. An instance is defined as informative if it is close to similar instances and far away from dissimilar ones. Two KNN extensions, Local-informative KNN (LI-KNN) and Global-informative KNN (GI-KNN), are implemented. |
 |
|
Non-parametric Topic Correlation Detection. (in review) A non-parametric method is proposed to discover dynamic topic correlation in text documents. The model extends the hierarchical Gaussian process latent variable model (GP-LVM). By marginalizing model parameters rather than the latent variables, the dynamic correlated topic model (DCTM) exhibits a non-parametric characteristic that is often desirable for large-scale text data. Unlike generative aspect models such as LDA, DCTM demonstrates a much faster convergence rate and a better model fit to the data. |
 |
|
Unsupervised name classification. (Name disambiguation. WWW07 paper) We focused on the issue of entity resolution on the web and in digital libraries. Two graphical models are extended from PLSA and LDA. Our models differ from previous ones by explicitly introducing a variable for the person name. We address scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset. |
 |
|
Unsupervised author classification. (Author name disambiguation. JCDL07 paper) We apply a hierarchical agglomerative clustering method to the pre-calculated author-topic matrix, assuming that each author has a unique (research) topic distribution that distinguishes them from others. Our model is capable of clustering name variants (of the same author) together, while disambiguating authors with EXACTLY the same name. The top 10 most ambiguous author names from CiteSeer are tested, with better precision and recall than previous approaches. |
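The clustering step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the average linkage, the cosine distance, the stopping `threshold`, and the toy `author_topic` matrix are all assumptions chosen for the example.

```python
import math


def cosine_distance(p, q):
    """1 - cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def agglomerative_clusters(rows, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [cosine_distance(rows[i], rows[j])
                         for i in clusters[a] for j in clusters[b]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical data: one row per name mention, one column per topic.
mentions = ["J. Smith", "John Smith", "J. Smith", "John Smith"]
author_topic = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
]

clusters = agglomerative_clusters(author_topic, threshold=0.2)
for cluster in clusters:
    print([mentions[i] for i in cluster])
```

In this toy input, mentions 0 and 1 are name variants with similar topic distributions and end up together, while mentions 0 and 2 share exactly the same name but are separated because their topic distributions differ.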
 |
|
Tag Classification. (ACM Group07 paper) We suggested several metrics for tag evaluation to improve the performance of social bookmarking systems. Specifically, six tag metrics were proposed: tag growth, tag reuse, tag non-obviousness, tag discrimination, tag frequency, and tag patterns. In this paper, we analyze over two years of data from CiteULike and suggest design heuristics for implementing a social bookmarking system for CiteSeer. |
 |
|
Tag Analysis. (IEEE Computing 08 paper) We investigate the relationship between tag growth and tag reuse in social bookmarking sites and propose methods for enhancing tag suggestion services. An empirical study was carried out on CiteSeer. |
 |
|
Next Generation CiteSeer. (CiteSeerX, Inforscale06 paper) We proposed a new architecture for a next generation CiteSeer application. The new architecture is based on modular web services and pluggable service components. Preliminary results from a prototype system show that the new architecture enhances flexibility, scalability, and performance for CiteSeer. In addition, new services in development for the next generation CiteSeer system are also discussed. |
 |
|
Social network analysis. (ICMLA06 paper) A two-phase framework is introduced to address the problem of leadership discovery in an organization based on the email communication history among its employees. Two heuristic metrics are proposed for evaluating pair-wise leadership factors among a group of employees. We also address several issues in discovering the organization's structure by mining the leadership graph constructed from the leadership factors. |