"Nothing is impossible if you try your best. Impossible = I'm possible."

 

 

My current research interests include developing and applying statistical data mining and machine learning methods for text classification, clustering, and collaborative filtering, with applications in information retrieval, search engines, and digital libraries.

Here is my resume.

 

 

 

IBM Research, TJ Watson Center, New York

With Anca Sailer and Hidayatullah Shaikh 
Hierarchical online text classification for real-time problem classification in large-scale distributed technical support services.

 

     

 

 

Microsoft Research, Redmond
Machine Learning and Applied Statistics Group &
Live Labs

With Dr. Aleksander Kolcz.
Text classification for web data and spam detection.

 

 


 

 

 

 

 

Research Group: Ask, Piscataway, NJ
Entity extraction for real-time queries on large-scale web applications.

With Dr. Eric Glover, Dr. Tomasz Imielinski

 

 

 


 

 

 


 

Advisor: Professor C. Lee Giles.

  • Personalized services for the Next Generation CiteSeer, including personalized search, automatic taxonomy generation, topic-based document classification, and a submission system.
  • Name disambiguation and entity resolution for metadata in digital libraries, using statistical machine learning methods.
  • Efficient document classification in large-scale digital libraries, introducing a novel dimension reduction technique based on entity extraction and collaborative filtering.
  • Co-designed and implemented a novel multi-class boosting algorithm.
  • Distributed event management for the Next Generation CiteSeer: designed new two-phase commit algorithms for distributed user events, including post-validation, propagation-validation, and failure recovery.
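
As a point of reference for the distributed event management item, here is a minimal sketch of the textbook two-phase commit protocol. It is not the new variants developed for CiteSeer (post-validation, propagation-validation, and failure recovery are not modeled), and the `Participant` class, its `healthy` flag, and the event string are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """A hypothetical replica that can vote on and apply a user event."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)

    def prepare(self, event: str) -> bool:
        # Phase 1: vote yes only if this replica is able to apply the event.
        return self.healthy

    def commit(self, event: str) -> None:
        # Phase 2 (all voted yes): record the event.
        self.log.append(event)

    def abort(self, event: str) -> None:
        # Phase 2 (some vote was no): nothing was applied, so nothing to undo.
        pass


def two_phase_commit(event: str, participants: list) -> bool:
    """Return True if every participant voted yes and the event was committed."""
    votes = [p.prepare(event) for p in participants]
    if all(votes):
        for p in participants:
            p.commit(event)
        return True
    for p in participants:
        p.abort(event)
    return False


replicas = [Participant("a"), Participant("b")]
committed = two_phase_commit("user:42 tagged doc:7", replicas)
```

The coordinator commits only when every vote is yes, so a single unhealthy replica causes the event to be aborted everywhere.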
     

 

 

 


 

Supervisor: Dr. Bo Zhou
State Street Technology Center, Zhejiang University, China

  • Oscar database platform, developed in cooperation with SSgA, a top US investment company. I was in charge of designing the query optimizer.

 



 

     
 
Automatic tag recommendation. (SIGIR 08 paper) A real-time, automatic tag recommendation system has been proposed for the Next Generation CiteSeer. Our algorithm uses Spectral Recursive Embedding (SRE) to partition the document-tag bipartite graph, and then builds a two-way Poisson Mixture Model (PMM) for each cluster. The system is capable of making 10 tag recommendations for one document within 1 second.
 
Gaussian Process Text Classification. (CIKM 08 Paper) A sparse multi-class Gaussian process framework is proposed to deal with the computational complexity of using Gaussian processes for multi-class classification problems. A novel prototype selection algorithm is proposed to select the best subset of points from the training set. The resulting algorithm shows both scalability and high precision.
 

Multiclass Boosting Classification. (SDM07 paper) We extended the two-class Gentle AdaBoost algorithm to multiclass classification by leveraging a multiclass exponential loss (GAMBLE). Unlike other multiclass algorithms that reduce the K-class classification task to K binary classifications, GAMBLE handles the task directly and symmetrically, with only one committee classifier.

To scale up to large datasets, we utilize the generalized Query By Committee (QBC) active learning framework to focus learning on the most informative samples.
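
The idea of focusing learning on the most informative samples can be illustrated with a small Query By Committee sketch. It uses vote entropy as the disagreement measure, which is one common choice and an assumption here rather than the measure used in the paper; the threshold classifiers and the function names are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes for one sample (0 means full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_most_informative(pool, committee, k):
    """Return the k pool samples on which the committee disagrees the most."""
    scored = [(vote_entropy([clf(x) for clf in committee]), x) for x in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]


def make_threshold(t):
    """Hypothetical committee member: a one-dimensional threshold classifier."""
    return lambda x: int(x > t)


committee = [make_threshold(0.3), make_threshold(0.5), make_threshold(0.7)]
pool = [0.1, 0.4, 0.6, 0.9]
selected = select_most_informative(pool, committee, k=2)
```

In this toy pool, the committee agrees on 0.1 and 0.9 and splits on 0.4 and 0.6, so the two middle samples are the ones that would be sent for labeling.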

 

Text Classification. (ICDM06 paper) We used a two-level decision tree to extract noun phrases from text and used them as features. In this way the dimensionality of the feature space is reduced significantly. We also proposed a novel collaborative filtering (CF) method to predict missing features for small samples, to augment the feature space. SVM and AdaBoost are applied to the feature space for classification with better precision and recall.

 
Text Classification. (PKDD07 paper) We proposed a new metric named informativeness and applied it as a distance metric for nearest neighbor classification. An instance is defined as informative if it is close to similar instances while far away from dissimilar ones. Two KNN extensions, Local-informative KNN (LI-KNN) and Global-informative KNN (GI-KNN), are implemented.
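
To make the intuition concrete, the sketch below scores a training point by a simple distance ratio: mean distance to points of other classes divided by mean distance to points of the same class. This ratio is a stand-in chosen for illustration, not the informativeness metric defined in the paper, and it is not LI-KNN or GI-KNN; the function name and the toy data are assumptions.

```python
import math


def informativeness(index, points, labels):
    """Mean distance to other-class points divided by mean distance to same-class points."""
    same, other = [], []
    for j, (p, y) in enumerate(zip(points, labels)):
        if j == index:
            continue
        d = math.dist(points[index], p)
        (same if y == labels[index] else other).append(d)
    if not same or not other:
        # The ratio is undefined without both kinds of neighbours; treat as uninformative.
        return 0.0
    mean_same = sum(same) / len(same)
    mean_other = sum(other) / len(other)
    return mean_other / (mean_same + 1e-12)


points = [(0, 0), (0, 1), (5, 5), (5, 6), (2.5, 3)]
labels = ["a", "a", "b", "b", "a"]
score_core = informativeness(0, points, labels)      # deep inside class "a"
score_boundary = informativeness(4, points, labels)  # halfway between the classes
```

A point deep inside its own class scores higher than one sitting between the two classes, which matches the "close to similar, far from dissimilar" description.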
 

Non-parametric Topic Correlation Detection. (in review) A non-parametric method is proposed to discover dynamic topic correlation in text documents. The model is extended from the hierarchical Gaussian process latent variable models (GP-LVM). By marginalizing model parameters rather than the latent variables, the dynamic correlated topic model (DCTM) exhibits a non-parametric characteristic which is often desirable for large-scale text data. Unlike generative aspect models such as LDA, DCTM demonstrates a much faster convergence rate with a better model fit to the data.

 

Unsupervised name classification. (Name disambiguation. WWW07 paper) We focused on the issue of entity resolution on the web and in digital libraries. Two graphical models are extended from PLSA and LDA. Our models differ from previous ones by explicitly introducing a variable for person name.

Scalability is addressed by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.

  Unsupervised author classification. (Author name disambiguation. JCDL07 paper) We apply hierarchical agglomerative clustering to the pre-calculated author-topic matrix, assuming that each author has a unique (research) topic distribution that distinguishes them from others. Our model is capable of clustering name variants (of the same author) together, while disambiguating authors with EXACTLY the same name. The top 10 most ambiguous author names from CiteSeer are tested, with better precision and recall than previous approaches.
  Tag Classification. (ACM Group07 paper) We suggested several metrics for tag evaluation to improve the performance of social bookmarking systems. Specifically, six tag metrics were proposed: tag growth, tag reuse, tag non-obviousness, tag discrimination, tag frequency, and tag patterns. We analyze over two years of data from CiteULike, and suggest design heuristics to implement a social bookmarking system for CiteSeer.
  Tag Analysis. (IEEE Computing 08 Paper) We investigate the relationship between tag growth and tag reuse in social bookmarking sites. We propose methods for enhancing tag suggestion services. An empirical study was carried out on CiteSeer.
  Next Generation CiteSeer. (CiteSeerX, Infoscale06 paper) We proposed a new architecture for a next generation CiteSeer application. The new architecture is based on modular web services and pluggable service components. Preliminary results based on a prototype system show the new architecture enhances flexibility, scalability, and performance for CiteSeer. In addition, new services in development for the next generation CiteSeer system are also discussed.
  Social network analysis. (ICMLA06 paper) A two-phase framework is introduced to address the problem of leadership discovery in an organization based on email communication history among the employees. Two heuristic metrics are proposed for evaluating pair-wise leadership factors among a group of employees. We also address several issues in discovering the organization's structure through mining the leadership graph constructed from the leadership factors.
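
For the author name disambiguation entry above, the clustering step can be sketched as average-linkage agglomerative clustering over rows of an author-topic matrix. The cosine distance, the average linkage, the stopping threshold, and the toy topic vectors are all assumptions made for illustration; the paper's actual distance and linkage choices may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerate(vectors, threshold):
    """Merge the closest pair of clusters until no pair is within the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between the two clusters.
        dists = [cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        dist, a, b = min(
            ((linkage(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical topic distributions for four name mentions.
topic_rows = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]
author_clusters = agglomerate(topic_rows, threshold=0.2)
```

Mentions with similar topic distributions end up in the same cluster, so the first two rows are grouped as one author and the last two as another.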

 

 
