Google_Summer_of_Code_2009/Jianjiong

GSoC09 project: ID mapper API and plugin for Cytoscape

Student: JianJiong Gao
Mentor: David States


Overview

This project develops an ID mapping plugin and publishes the corresponding ID mapping API for other plugin developers. Several sources of ID information will be supported, including local and remote files, databases and web services.


Implementation Plan

ID mapper API

  • Basic ID mapper interface
    • Currently defined as follows in ANM plugin
      // IDMapper v1
      public interface IDMapper {
          /**
           * Supports one-to-one mapping and one-to-many mapping.
           * @param srcIDs source IDs
           * @param srcType source ID type
           * @param tgtType target ID type
           * @return a map from src ID to its target IDs
           */
          public Map<String, Set<String>> mapID(Set<String> srcIDs, String srcType, String tgtType);
      
          // Check whether an ID exists in a specific type.
          public boolean idExistsInSrcIDType(String srcID, String srcType);
      
          // returns supported source ID types
          public Set<String> getSupportedSrcIDType();
          
          // returns supported target ID types
          public Set<String> getSupportedTgtIDType();
      }
    • Alternatively, we can use the Xref class in BridgeDb.
      // IDMapper v2
      public interface IDMapper {
          /** 
           * Supports one-to-one mapping and one-to-many mapping. 
           * @param srcXrefs source Xref, containing ID and ID type/data source
           * @param tgtDataSource target ID type/data source
           * @return a map from source Xref to target Xref's
           */
          public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXrefs, DataSource tgtDataSource);
      
          // Check whether an Xref exists.
          public boolean validateXref(Xref xref);
      
          // returns supported source ID types
          public Set<DataSource> getSupportedSrcDataSources();
          
          // returns supported target ID types
          public Set<DataSource> getSupportedTgtDataSources();
      }
    • Discussion:
      • mapID in v2 supports multiple source types (which is good because the user may not be sure of the exact ID types and may select several, or the selected node attribute may actually contain multiple ID types), whereas mapID in v1 supports only a single source type.
      • In mapID in v2, is it better to use a set of target data sources instead?
            public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXref, Set<DataSource> tgtDataSources);
      • srcXrefs come from user input, so not all of them necessarily exist in the real ID system. We therefore need the method validateXref to check whether an Xref is real. For instance, if mapID returns no target ID for a source ID, either there is no ID of the target type for that source ID, or the source ID does not actually exist in the source type. validateXref can be used to tell which case applies.
      • Do we need two methods, getSupportedSrcDataSources() and getSupportedTgtDataSources(), or would a single method getSupportedDataSources() be enough? Are there any cases in which the source DataSources and target DataSources differ?
      • equals() and hashCode() in class DataSource in BridgeDb may need to be overridden.
      • Therefore, another version of the IDMapper interface could be
        // IDMapper v3
        public interface IDMapper {
            /** 
             * Supports one-to-one mapping and one-to-many mapping. 
             * @param srcXrefs source Xref, containing ID and ID type/data source
             * @param tgtDataSources target ID types/data sources
             * @return a map from source Xref to target Xref's
             */
            public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXrefs, Set<DataSource> tgtDataSources);
        
            // Check whether an Xref exists.
            public boolean validateXref(Xref xref);
        
            // returns supported ID types
            public Set<DataSource> getSupportedDataSources();
        }
      • Combined with freeSearch() and getCapabilities(), suggested by Martijn:
        // IDMapper v4
        public interface IDMapper {
            /** 
             * Supports one-to-one mapping and one-to-many mapping. 
             * @param srcXrefs source Xref, containing ID and ID type/data source
             * @param tgtDataSources target ID types/data sources
             * @return a map from source Xref to target Xref's
             */
            public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXrefs, Set<DataSource> tgtDataSources);
        
            // Check whether an Xref exists.
            public boolean xrefExists(Xref xref);
        
            // Free text search
            public Set<Xref> freeSearch(String text, int limit);
        
            // returns capabilities of the ID mapper
            public IDMapperCapabilities getCapabilities();
        }
        
        public interface IDMapperCapabilities {
            // get all supported organisms
            // TODO: to modify--better to keep the bio concept Organism in a high level
            public Set<Organism> getSupportedOrganisms();
        
            public boolean isFreeSearchSupported();
        
            // returns supported source ID types
            public Set<DataSource> getSupportedSrcDataSources();
        
            // returns supported target ID types
            public Set<DataSource> getSupportedTgtDataSources();
        }
  • The basic IDMapper interface will be extended with additional methods for each source of ID mappings. Supported sources are:
    • Mapping file
    • Relational database
    • Web service
      public interface IDMapperFile extends IDMapper {}
      public interface IDMapperRDB extends IDMapper {}
      public interface IDMapperWebService extends IDMapper {}
  • Implementations for each source of ID mappings
    public class IDMapperText implements IDMapperFile {
        // Delimited text file mapper implementation
    }
    public class IDMapperExcel implements IDMapperFile {
        // Excel file mapper implementation
    }
    public class IDMapperDerby implements IDMapperRDB {
        // Apache Derby specific mapper implementation
    }
    public class IDMapperMySQL implements IDMapperRDB {
        // MySQL specific mapper implementation
    }
    public class IDMapperUniprotWS implements IDMapperWebService {
        // Uniprot web service specific mapper implementation
    }
    .
    .
    .
  • Some supporting interfaces
    • IDMappingContainer to store the ID mappings
      public interface IDMappingContainer {
          // xrefs map to each other, i.e., they represent the same biological entity
          public void addIDMapping(Set<Xref> xrefs);
      
          // return data sources added
          public Set<DataSource> getDataSources();
          
          // return all xrefs mapped from xref, i.e., all the returned xrefs represent the same biological entity as xref
          public Set<Xref> getIDMapping(Xref xref);
          
          // return the identifiers (for a particular data source) mapped from xref
          public Set<String> getIDMapping(Xref xref, DataSource tgtType);
      
      }
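
As an illustration of how the IDMappingContainer contract above might be satisfied, here is a minimal in-memory sketch. DataSource and Xref are simplified, hypothetical stand-ins for the BridgeDb classes (a name, and an ID plus its data source), and SimpleIDMappingContainer is a name invented for this example. The sketch assumes that mappings are transitive, so groups sharing an xref are merged; the interface itself does not require this.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class IDMappingContainerSketch {
    // Hypothetical stand-in for BridgeDb's DataSource: only a name.
    static final class DataSource {
        final String name;
        DataSource(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof DataSource && ((DataSource) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    // Hypothetical stand-in for BridgeDb's Xref: an ID plus its data source.
    static final class Xref {
        final String id;
        final DataSource dataSource;
        Xref(String id, DataSource dataSource) { this.id = id; this.dataSource = dataSource; }
        @Override public boolean equals(Object o) {
            return o instanceof Xref && ((Xref) o).id.equals(id)
                    && ((Xref) o).dataSource.equals(dataSource);
        }
        @Override public int hashCode() { return Objects.hash(id, dataSource); }
    }

    // In-memory container: every xref points at the shared group it belongs to.
    static final class SimpleIDMappingContainer {
        private final Map<Xref, Set<Xref>> groups = new HashMap<>();

        public void addIDMapping(Set<Xref> xrefs) {
            Set<Xref> merged = new HashSet<>(xrefs);
            for (Xref x : xrefs) {
                Set<Xref> existing = groups.get(x);
                // Assumption: mappings are transitive, so overlapping groups merge.
                if (existing != null) merged.addAll(existing);
            }
            for (Xref x : merged) groups.put(x, merged);
        }

        public Set<DataSource> getDataSources() {
            Set<DataSource> result = new HashSet<>();
            for (Xref x : groups.keySet()) result.add(x.dataSource);
            return result;
        }

        // Design choice of this sketch: the queried xref itself is not returned.
        public Set<Xref> getIDMapping(Xref xref) {
            Set<Xref> result = new HashSet<>(groups.getOrDefault(xref, new HashSet<>()));
            result.remove(xref);
            return result;
        }

        public Set<String> getIDMapping(Xref xref, DataSource tgtType) {
            Set<String> ids = new HashSet<>();
            for (Xref x : getIDMapping(xref)) {
                if (x.dataSource.equals(tgtType)) ids.add(x.id);
            }
            return ids;
        }
    }

    public static void main(String[] args) {
        // Toy identifiers, not real database IDs.
        DataSource gene = new DataSource("GeneType");
        DataSource protein = new DataSource("ProteinType");
        SimpleIDMappingContainer container = new SimpleIDMappingContainer();
        Set<Xref> group = new HashSet<>();
        group.add(new Xref("g1", gene));
        group.add(new Xref("p1", protein));
        container.addIDMapping(group);
        System.out.println(container.getIDMapping(new Xref("g1", gene), protein));
    }
}
```

Storing one shared set per group keeps lookups cheap at the cost of rewriting the group on every merge, which seems acceptable for the list sizes a plugin dialog would handle.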

Local file based ID mapping

  • Delimited-text and Excel file based mapping have been implemented in ANM. The corresponding code has been extracted, refactored and submitted to BridgeDb.
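
To make the file-based idea concrete, the following is a minimal sketch of a delimited-text mapper in the spirit of the v1 interface. It is not the ANM or BridgeDb code. The file layout is an assumption made for this example: a tab-delimited header naming the ID types, then one line per entity, with multiple IDs in a cell separated by semicolons. The class name TextIDMapperSketch and the toy IDs are hypothetical. The lines could come from java.nio.file.Files.readAllLines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TextIDMapperSketch {
    private final List<String> types = new ArrayList<>();
    // One entry per data row: ID type -> IDs of that type for one entity.
    private final List<Map<String, Set<String>>> rows = new ArrayList<>();

    // lines: header line first, then one line per entity (assumed format).
    TextIDMapperSketch(List<String> lines) {
        if (lines.isEmpty()) return;
        types.addAll(Arrays.asList(lines.get(0).split("\t")));
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split("\t", -1); // -1 keeps trailing empty cells
            Map<String, Set<String>> row = new HashMap<>();
            for (int i = 0; i < types.size() && i < cells.length; i++) {
                Set<String> ids = new HashSet<>();
                for (String id : cells[i].split(";")) {
                    if (!id.trim().isEmpty()) ids.add(id.trim());
                }
                row.put(types.get(i), ids);
            }
            rows.add(row);
        }
    }

    // Every requested source ID gets an entry; an empty set means "no target found".
    public Map<String, Set<String>> mapID(Set<String> srcIDs, String srcType, String tgtType) {
        Map<String, Set<String>> result = new HashMap<>();
        for (String src : srcIDs) result.put(src, new HashSet<>());
        for (Map<String, Set<String>> row : rows) {
            Set<String> targets = row.getOrDefault(tgtType, Collections.emptySet());
            for (String src : row.getOrDefault(srcType, Collections.emptySet())) {
                if (srcIDs.contains(src)) result.get(src).addAll(targets);
            }
        }
        return result;
    }

    public boolean idExistsInSrcIDType(String srcID, String srcType) {
        for (Map<String, Set<String>> row : rows) {
            if (row.getOrDefault(srcType, Collections.emptySet()).contains(srcID)) return true;
        }
        return false;
    }

    public Set<String> getSupportedSrcIDType() {
        return new HashSet<>(types);
    }

    public static void main(String[] args) {
        TextIDMapperSketch mapper = new TextIDMapperSketch(Arrays.asList(
                "GeneID\tProteinID", "g1\tp1", "g2\tp2;p3", "g3\t"));
        Set<String> query = new HashSet<>(Arrays.asList("g2", "g3", "g9"));
        System.out.println(mapper.mapID(query, "GeneID", "ProteinID"));
        // g3 and g9 both map to nothing, but only g3 exists in the source type.
        System.out.println(mapper.idExistsInSrcIDType("g3", "GeneID"));
        System.out.println(mapper.idExistsInSrcIDType("g9", "GeneID"));
    }
}
```

The last two calls show why an existence check is useful alongside mapID: an empty result alone cannot distinguish a known ID without a target from an ID that is not in the source type at all.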

RDB based ID mapping

Webservice based ID mapping

ID mapping plugin GUI

  • A GUI mockup according to the BridgeDb Specification
  • Selecting which ID mapping source to use. Below are options from which the user can choose.
    • Provide custom mapping with a local file. (done)
    • Get ID mappings from a web service.
    • Get ID mappings from a relational database.
  • Selecting which ID type(s) are used in the networks from a list of ID types.
    • Should we limit the supported ID types to the most commonly used ones (e.g. Entrez Gene, RefSeq, UniProt and some others) to minimize the chance of error?
  • For each ID mapping operation, ID mappings can be retrieved from multiple sources and merged into a final ID mapping list. This idea has been implemented in the current ID mapping UI in ANM. The figure below shows the ID mapping UI: clicking the 'Add ID mapping' button adds the retrieved ID mappings to the ID mapping list; ID mappings added multiple times from multiple sources (different files/databases/web services) are merged together; clicking the 'OK' button returns all the ID mappings.
  • After retrieving the data, report to the user how many IDs are found, how many are not, and how many are ambiguous (the same ID existing in different types, which should be rare). For the ambiguous ones, ask the user to decide.
    http://web.missouri.edu/~jg722/GSoC/idmapping1.PNG
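
The merge and report steps described above can be sketched roughly as follows. This is an illustration, not the ANM implementation. The class and method names are hypothetical, mappings are represented as plain string maps, and an ID is counted as ambiguous when it was found in more than one ID type (counted separately from the unambiguously found IDs, which is a choice made for this sketch).

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class MappingMergeSketch {
    // Union the target IDs reported for each source ID by every source.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perSource) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> mapping : perSource) {
            for (Map.Entry<String, Set<String>> e : mapping.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    // Classify each queried ID by the number of ID types it was found in.
    // Returns {found, notFound, ambiguous}.
    static int[] summarize(Set<String> queried, Map<String, Set<String>> typesFoundIn) {
        int found = 0;
        int notFound = 0;
        int ambiguous = 0;
        for (String id : queried) {
            int n = typesFoundIn.getOrDefault(id, Collections.emptySet()).size();
            if (n == 0) {
                notFound++;
            } else if (n == 1) {
                found++;
            } else {
                ambiguous++; // the user would be asked to pick the intended type
            }
        }
        return new int[] {found, notFound, ambiguous};
    }

    public static void main(String[] args) {
        // Toy data: two sources that partly overlap.
        Map<String, Set<String>> fromFile = new TreeMap<>();
        fromFile.put("a", new TreeSet<>(Collections.singleton("x1")));
        Map<String, Set<String>> fromWebService = new TreeMap<>();
        fromWebService.put("a", new TreeSet<>(Collections.singleton("x2")));
        fromWebService.put("b", new TreeSet<>(Collections.singleton("y1")));
        System.out.println(merge(java.util.Arrays.asList(fromFile, fromWebService)));
    }
}
```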

Caching DB

  • Store ID mappings locally. Data from remote sources (databases and web services) will be cached here.
  • Use the Derby embedded database.
  • Use the same database schema as PathVisio.
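
The caching behaviour can be illustrated with a small cache-aside sketch. To keep the example self-contained, an in-memory map stands in for the embedded Derby database, and the remote source is modelled as a plain function; both are assumptions of this sketch, and the class name is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class CachingLookupSketch {
    // Stand-in for a remote database or web service call.
    private final Function<String, Set<String>> remote;
    // Stand-in for the local cache database.
    private final Map<String, Set<String>> cache = new HashMap<>();
    private int remoteCalls = 0;

    CachingLookupSketch(Function<String, Set<String>> remote) {
        this.remote = remote;
    }

    // Returns cached target IDs if present; otherwise asks the remote source once.
    Set<String> lookup(String srcID) {
        Set<String> cached = cache.get(srcID);
        if (cached != null) return cached;
        remoteCalls++;
        Set<String> fetched = Collections.unmodifiableSet(new HashSet<>(remote.apply(srcID)));
        cache.put(srcID, fetched); // empty results are cached as well
        return fetched;
    }

    int getRemoteCalls() {
        return remoteCalls;
    }

    public static void main(String[] args) {
        CachingLookupSketch lookup =
                new CachingLookupSketch(id -> Collections.singleton(id + "-mapped"));
        lookup.lookup("x1");
        lookup.lookup("x1"); // served from the cache
        System.out.println(lookup.getRemoteCalls());
    }
}
```

A persistent cache would also need an invalidation policy for when the remote data changes, which this sketch leaves out.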

ID type guessing

ID mapping validation

Port to Cy3


Discussion

  1. Is it possible to implement an identification system for Cytoscape instead of just an ID mapping service? With this identification system, the 'universal' identity of each node (or other entity) would be determined from data sources selected by the user. A selected set of attributes would be used to determine the identities. With ID type guessing, the user does not need to specify the ID type for each attribute (but is required to confirm the guessed types). Once the identities are determined, they can serve as the basis of other related services such as ID mapping, network merge (merging nodes with the same universal identity) and link out (automatically using different types of IDs when linking out).
    • Note by Gary Bader: the confirmation by the user is very important, because you don't want to make mistakes that will then be propagated.
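
One way ID type guessing could work is pattern matching over the attribute values, with the result shown to the user for confirmation as the note above requires. The sketch below is only an illustration of that idea. The class name is hypothetical, and the regular expressions are simplified assumptions that do not cover every valid identifier form (the RefSeq pattern, for instance, handles only a few prefixes).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class IDTypeGuesserSketch {
    // Simplified, illustrative patterns; real identifier systems have more forms.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("Entrez Gene", Pattern.compile("\\d+"));
        PATTERNS.put("RefSeq", Pattern.compile("[NX][MPR]_\\d+(\\.\\d+)?"));
        PATTERNS.put("UniProt", Pattern.compile(
                "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"));
    }

    // For each candidate type, returns the fraction of values matching its pattern.
    // Types that match no value are omitted.
    static Map<String, Double> guess(Collection<String> values) {
        Map<String, Double> scores = new LinkedHashMap<>();
        if (values.isEmpty()) return scores;
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            int hits = 0;
            for (String v : values) {
                if (e.getValue().matcher(v.trim()).matches()) hits++;
            }
            if (hits > 0) scores.put(e.getKey(), (double) hits / values.size());
        }
        return scores;
    }

    public static void main(String[] args) {
        // The guesses are candidates only; the user still confirms the type.
        System.out.println(guess(Arrays.asList("NM_000546", "NM_007294.4")));
        System.out.println(guess(Arrays.asList("7157", "P04637")));
    }
}
```

Returning a score per type rather than a single answer lets the UI present mixed columns (several ID types in one attribute) to the user instead of silently picking one.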

Project Management

Timeline

  • Stage 1 (week 1-4)
    • Define the IDMapper.
    • Implement file-based ID mapper by extracting and refactoring the code in ANM.
    • RDB-based ID mapper has been defined as GDB in BridgeDb. Rename it as IDMapperRDB.
    • Implement webservice-based ID mapper.
  • Stage 2 (week 5-7)
    • Implement IDMapperTask in Cytoscape.
    • Implement ID mapping plugin for mapping node attributes to new attributes of the destination ID types. Note: After this stage, the basic ID mapping API and ID mapping plugin would be ready. An internal alpha version could be released.
  • Stage 3 (week 8-10)
    • Implement ID type guessing.
    • Implement Cache DB.
    • Improve the API and plugin according to feedback.
  • Stage 4 (week 11-12)
    • Port ID mapping plugin to Cy3.
    • Port Network Merge plugin to Cy3.
  • Stage 5 (week 13)
    • User manual and API documents.

Scheduled meetings

Status


Project Summary

The table below lists the related project goals. The most critical planned goals (and a few unplanned ones) have been completed. There are 6 unfinished goals. Among them, goal 6 was not in the original plan, and goals 12 and 14 were planned as optional.

# | Goal | Planned | Achieved
1 | IDMapper API | Y | Y
2 | IDMapperText | Y | Y
3 | IDMapperBiomart | Y | Y
4 | IDMapperPicrRest | N | Y
5 | IDMapperSynergizer | N | Y
6 | DataSource mapping | N | N
7 | Cy Thesaurus GUI | Y | Y
8 | Integrating file-, rdb- and webservice-based IDMappers | Y | Y
9 | Cy Thesaurus ID mapping services for other plugins | N | Y
10 | Type-guessing in Cy Thesaurus | Y | N
11 | Reimplement ID mapping in network merge using Cy Thesaurus services | Y | N
12 | Port to Cytoscape 3 | Y | N
13 | Cache DB in Cy Thesaurus | Y | N
14 | ID mapping data validation | Y | N

For future work, I would like to work on goals 6, 10 and 11 in the next stage, and I will also try to find time for goal 12 later; I will put goals 13 and 14 off the table for now. Please let me know if you have any comments on my plan. In any case, I will keep maintaining the CyThesaurus and Network Merge plugins and help develop BridgeDb. Any feedback on the plugins would be very welcome.


Comments
