Research

My resume can be downloaded from here.


Research Interests

I am a Ph.D. student in the Department of Computer Science and Engineering at PSU. My advisor is Dr. Wang-Chien Lee. My current research interests include information retrieval, statistical methods for textual data analysis, social networks, community study, system performance analysis and optimization, and spatial search in the context of search engines and digital libraries.

Education

  • The Pennsylvania State University, University Park, PA 16802 (Sept 2004 - Present).
    Ph.D. Candidate, Computer Science and Engineering, GPA 3.91/4.
  • Tsinghua University, Beijing, China (Sept 1997 - Jun 2004).
    M. S. in Computer Software and Theory, July 2004.
    B. Eng in Computer Science and Technology, July 2001.

Selected Projects

Spatial Search Support on Geographical Information Retrieval.

Unlike conventional information retrieval, geographic information retrieval is concerned with retrieving information with awareness of its spatial context. Existing work simply combines IR techniques with spatial database techniques and uses spatial information as a filter on the search results. So far, no system seamlessly supports geographic information retrieval over a large document corpus while also providing a rich location ontology. This project aims to develop a search system that offers users a convenient spatial search facility: it automatically extracts implicit location information from queries, explores the location ontology, and ranks documents using spatial impact as well as textual information. The system also covers the full working cycle of a spatial search engine, including spatial document acquisition, extraction, mapping, and indexing. The system prototype is under development.

Next Generation CiteSeer (CiteSeerX): Architecture and Indexing.

CiteSeerX is a project to rebuild the well-established computer science academic search engine, CiteSeer. Rising demand from system use and the increasing size of CiteSeer's archive have been observed to cause higher query latencies and a significant degradation of system stability. This project aims to solve these problems by introducing a new flexible and scalable architecture. In addition, I am involved in the index construction and maintenance of the new system, which is based on the open source full-text indexing library, Lucene. To improve the basic ranking of academic objects, we implement the indexer so that both textual information and auxiliary metadata are considered in ranking. To improve performance, we employ a hybrid caching system driven by workload analysis, which suggests that the entire trace, from the server's perspective, can be broken into several fine-grained categories, each of which follows similar access patterns. Based on these request classifications, different cache policies are used to make better use of limited cache capacity. We adopt an efficient classification mechanism that labels incoming user requests based on recent activity in the active window. An overview of the system design has been published in WWW 06. In addition, research results in this direction have been published in prestigious journals and conferences, such as ICWE, JCDL, and Infoscale.

Academic Community Mining on Document Sets with Relations.

In the social sciences, communities are studied to understand the causality of events and the roles of agents. Correspondingly, for information systems, communities can suggest implicit yet useful information, which can effectively improve the quality of services such as search and ranking if appropriately explored and used. We study community discovery from a service's point of view and propose a hierarchical community model, in which a community is constructed from a core set and affiliated members. Classical document clustering methods suffer from the following problems: 1) only flat-structured clusters are discovered; 2) their scalability, especially with respect to feature dimensions, is poor for large-scale document corpora; 3) substantial prior knowledge about the datasets is required. To overcome these problems, this project proposes a solution that uses both document attributes and relations to mine communities. We demonstrate the usability of the work by applying it to Libra, an academic document search engine. The research in this direction has been written up in a paper submitted to KDD 2008.

Spatial Query Support on High Dimension Datasets and P2P Networks.

Spatial queries are very useful for spatial databases, information retrieval, and location services. In this project, we support a recently proposed spatial query type, the skyline query, on high-dimensional datasets, which traditional approaches do not handle well because of their scalability issues. By thoroughly analyzing the properties of the skyline query, we uncover a perfect match between the skyline and the Z-order curve. Based on our findings, we propose, for the first time, to organize data points according to the Z-order curve and design a new structure, ZBtree, as an index. Based on ZBtree, we develop a suite of novel and efficient skyline algorithms that scale well in both dimensionality and cardinality, including (1) ZSearch, which processes skyline queries and supports progressive result delivery; (2) ZUpdate, which facilitates incremental skyline result maintenance; and (3) k-ZSearch, which answers the k-dominant skyline query (a new skyline variant that retrieves a representative subset of skyline results). In a separate project, we explore an efficient method to support skyline queries on P2P networks, in which each node may store incomplete data pieces. Heuristic-driven methods are proposed to find both precise skylines and approximate results. The research results in this direction have been published in VLDB and Infoscale.
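
The connection between skylines and the Z-order curve can be illustrated with a small sketch. This is not the ZBtree index or the ZSearch algorithm; it is a minimal in-memory example that assumes non-negative integer coordinates, that smaller values are better in every dimension, and that every coordinate fits in `bits` bits. The function names (`z_address`, `dominates`, `skyline_by_z_order`) are hypothetical and chosen only for illustration. The property it relies on is that if a point p dominates a point q, then the Z-address of p is smaller than that of q, so a scan in Z-order only has to compare each point against the skyline found so far.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def z_address(point: Sequence[int], bits: int = 8) -> int:
    """Interleave the bits of the coordinates (Morton code).

    Assumes every coordinate is a non-negative integer below 2**bits.
    """
    address = 0
    for bit in range(bits - 1, -1, -1):  # most significant bit first
        for coord in point:
            address = (address << 1) | ((coord >> bit) & 1)
    return address


def dominates(p: Sequence[int], q: Sequence[int]) -> bool:
    """p dominates q if p is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_by_z_order(points: List[Point], bits: int = 8) -> List[Point]:
    """Compute the skyline with one pass over the points sorted by Z-address.

    Because a dominating point always has a smaller Z-address, a point that
    survives the check against the current skyline can never be removed later.
    """
    skyline: List[Point] = []
    for candidate in sorted(points, key=lambda p: z_address(p, bits)):
        if not any(dominates(s, candidate) for s in skyline):
            skyline.append(candidate)
    return skyline


if __name__ == "__main__":
    data = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (2, 6)]
    print(skyline_by_z_order(data))
```

Since skyline points are confirmed as soon as they are visited, a scan of this kind can report results progressively, which is the behavior the project description attributes to ZSearch; the index structure that makes this efficient on large, high-dimensional data is not shown here.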

Personalized Search and Ranking.

Ranking, as the core of effective document retrieval, has been a continuing challenge since the first search engine. Given the rapid growth of online information and the diversity of user backgrounds, it is essential to provide a more efficient retrieval system by adopting personalization techniques. In this project, we extract user search preferences and browsing patterns by studying the logged user history of CiteSeer, and propose a regression-based model to capture the characteristics of user behavior in accessing CiteSeer. Moreover, we use temporal prediction techniques to further study the stability of user preferences. Research results of the project have been submitted to prestigious conferences such as SIGIR and JCDL for review.

Workload Analysis and Performance-Driven System Design.

Given the increasing challenges of scalability in data and user requests, it is important to investigate the patterns and characteristics of system workload. An in-depth understanding of workloads can benefit digital libraries in various ways. However, previous workload analysis for Web applications has typically focused on generic platforms, neglecting the unique characteristics exhibited in the various domains of these applications. Different application domains have been observed to have intrinsically heterogeneous characteristics, which have a direct impact on system performance. In this study, we present an extensive analysis of the workload of scientific literature digital libraries, unveiling their temporal and user interest patterns. Logs of a computer science literature digital library, CiteSeer, are collected and analyzed. To eliminate the bias caused by CiteSeer's unique services, we intentionally remove service details specific to CiteSeer so that this work applies to the research domain as a whole. Based on the characteristics we observed in the workload, we develop a synthetic workload generator that can be used to mimic the typical workload received by an academic search engine. The analysis results are also used by the system to improve its performance. We published our workload analysis results in selective conferences and journals, including IJDL and JCDL.

Social Network Analysis and Its Application in Academic Searching and Ranking.

The increasing availability of social information about Web users poses social challenges and attracts research interest. By studying social interaction in the form of academic collaboration between scholars, we try to reveal their implicit communities and impacts. This project analyzes the co-authorship social networks of CiteSeer. We correlate the topics discovered in CiteSeer with the latent social networks and seek to rank authors by their impact on the evolution of those topics. We published our research results in ICDM.

Awards

  • Student Travel Grant Award to JCDL'07, Vancouver, Canada, Jun 2007.
  • CiteSeer/CiteSeerX system co-administrator and developer, 2005-2007.
  • Paper "Learning Metadata from the Evidence in an On-line Citation Matching Scheme" was nominated for the Vannevar Bush Best Paper Award at JCDL 2006, Sept 2006.
  • College of Engineering Fellowship Award for three consecutive years, College of Engineering, the Pennsylvania State University, 2004-2006.
  • Research Assistant Award, College of Information Sciences and Technology, the Pennsylvania State University, 2005-2006.
  • Teaching Assistant Award, Department of Computer Science and Engineering, the Pennsylvania State University, 2004-2005.
  • Excellent Undergraduate Thesis Award, Jun 2001.
  • Ranked 2nd in the National College Entrance Exam, Shanghai, China, July 1997.

Academic Services

  • Peer reviewer for ICDE'08, ICML'07, SIGIR'07, JCDL'07, ICDE'07, ICDCS'07, WWW'07, SDM'07, CIKM'06, ICDE'06, SIGIR'06, JCDL'06, WWW'06, CIKM'05, ICDE'05, etc.
  • Session chair for P2PIM'06 (International Workshop on Peer-to-Peer Information Management), Hong Kong, 2006.
  • System developer and co-designer, CiteSeerX, the Pennsylvania State University, 2005-2007.

 
