GSoC09 project: ID mapper API and plugin for Cytoscape
Student: JianJiong Gao
Mentor: David States
Overview
The goal of this project is to develop an ID mapping plugin and to publish the corresponding ID mapping API for other plugin developers. Several sources of ID information will be supported, including local/remote files, databases and web services.
Implementation Plan
ID mapper API
- Basic ID mapper interface
- Currently defined as follows in ANM plugin
// IDMapper v1
public interface IDMapper {
    /**
     * Supports one-to-one mapping and one-to-many mapping.
     * @param srcIDs source IDs
     * @param srcType source ID type
     * @param tgtType target ID type
     * @return a map from src ID to its target IDs
     */
    public Map<String, Set<String>> mapID(Set<String> srcIDs, String srcType, String tgtType);

    // Check whether an ID exists in a specific type.
    public boolean idExistsInSrcIDType(String srcID, String srcType);

    // returns supported source ID types
    public Set<String> getSupportedSrcIDType();

    // returns supported target ID types
    public Set<String> getSupportedTgtIDType();
}
- Alternatively, we can use class Xref in BridgeDb.
// IDMapper v2
public interface IDMapper {
    /**
     * Supports one-to-one mapping and one-to-many mapping.
     * @param srcXrefs source Xref, containing ID and ID type/data source
     * @param tgtDataSource target ID type/data source
     * @return a map from source Xref to target Xref's
     */
    public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXrefs, DataSource tgtDataSource);

    // Check whether an Xref exists.
    public boolean validateXref(Xref xref);

    // returns supported source ID types
    public Set<DataSource> getSupportedSrcDataSources();

    // returns supported target ID types
    public Set<DataSource> getSupportedTgtDataSources();
}
- Discussion:
- mapID in v2 supports multiple source types (which is good because the user may not be sure of the exact ID types and may select several, or the selected node attribute may actually contain multiple ID types), whereas mapID in v1 supports only a single source type.
- In mapID in v2, is it better to use a set of target data sources instead?
public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXrefs, Set<DataSource> tgtDataSources);
- srcXrefs come from user input, so not all of them necessarily exist in the real ID system. We therefore need the method validateXref to check whether an Xref is real. For instance, if mapID returns no target ID for a source ID, either there is no ID of the target type for that source ID, or the source ID does not actually exist in the source type. validateXref would be used to tell the two cases apart.
- Do we need two methods, getSupportedSrcDataSources() and getSupportedTgtDataSources(), or would one method, getSupportedDataSources(), be enough? Are there any cases where the source DataSources and target DataSources differ?
- equals() and hashCode() in class DataSource in BridgeDb may need to be overridden.
- Therefore, another version of the IDMapper interface could be:
// IDMapper v3
public interface IDMapper {
    /**
     * Supports one-to-one mapping and one-to-many mapping.
     * @param srcXrefs source Xref, containing ID and ID type/data source
     * @param tgtDataSources target ID types/data sources
     * @return a map from source Xref to target Xref's
     */
    public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXrefs, Set<DataSource> tgtDataSources);

    // Check whether an Xref exists.
    public boolean validateXref(Xref xref);

    // returns supported ID types
    public Set<DataSource> getSupportedDataSources();
}
- Combined with freeSearch() and getCapabilities() suggested by Martijn.
// IDMapper v4
public interface IDMapper {
    /**
     * Supports one-to-one mapping and one-to-many mapping.
     * @param srcXrefs source Xref, containing ID and ID type/data source
     * @param tgtDataSources target ID types/data sources
     * @return a map from source Xref to target Xref's
     */
    public Map<Xref, Set<Xref>> mapID(Set<Xref> srcXrefs, Set<DataSource> tgtDataSources);

    // Check whether an Xref exists.
    public boolean xrefExists(Xref xref);

    // Free text search
    public Set<Xref> freeSearch(String text, int limit);

    // returns capabilities of the ID mapper
    public IDMapperCapabilities getCapabilities();
}

public interface IDMapperCapabilities {
    // get all supported organisms
    // TODO: to modify--better to keep the bio concept Organism in a high level
    public Set<Organism> getSupportedOrganisms();

    public boolean isFreeSearchSupported();

    // returns supported source ID types
    public Set<DataSource> getSupportedSrcDataSources();

    // returns supported target ID types
    public Set<DataSource> getSupportedTgtDataSources();
}
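The distinction discussed above, between "no ID in the target type" and "the source ID does not exist at all", can be illustrated with a small in-memory mapper. This is only a sketch in the style of IDMapper v1, using plain strings rather than BridgeDb's Xref and DataSource; the class InMemoryIDMapper and its addID/addMapping helpers are hypothetical and not part of ANM or BridgeDb.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory mapper (illustration only, v1-style string API).
class InMemoryIDMapper {
    // "type:id" -> (target type -> target IDs); a key with an empty inner
    // map means "this ID exists but has no known mappings".
    private final Map<String, Map<String, Set<String>>> table = new HashMap<>();

    private static String key(String type, String id) {
        return type + ":" + id;
    }

    // Registers an ID as existing in the given type, even without mappings.
    void addID(String type, String id) {
        table.computeIfAbsent(key(type, id), k -> new HashMap<>());
    }

    // Records a mapping in both directions.
    void addMapping(String typeA, String idA, String typeB, String idB) {
        addID(typeA, idA);
        addID(typeB, idB);
        table.get(key(typeA, idA)).computeIfAbsent(typeB, k -> new HashSet<>()).add(idB);
        table.get(key(typeB, idB)).computeIfAbsent(typeA, k -> new HashSet<>()).add(idA);
    }

    // Every source ID gets an entry; unknown or unmapped IDs map to an empty set.
    Map<String, Set<String>> mapID(Set<String> srcIDs, String srcType, String tgtType) {
        Map<String, Set<String>> result = new HashMap<>();
        for (String id : srcIDs) {
            Map<String, Set<String>> byType = table.get(key(srcType, id));
            Set<String> targets = (byType == null) ? null : byType.get(tgtType);
            result.put(id, targets == null
                    ? Collections.<String>emptySet()
                    : new HashSet<>(targets));
        }
        return result;
    }

    // Tells "exists but unmapped" apart from "does not exist".
    boolean idExistsInSrcIDType(String srcID, String srcType) {
        return table.containsKey(key(srcType, srcID));
    }
}
```

With this sketch, an empty result from mapID is ambiguous on its own; a follow-up call to idExistsInSrcIDType tells the caller which of the two cases applies.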
- The basic Mapper interface will be extended to add some more methods for each source of ID mappings. Supported sources are:
- Mapping file
- Relational database
- Web service
public interface IDMapperFile extends IDMapper {}
public interface IDMapperRDB extends IDMapper {}
public interface IDMapperWebService extends IDMapper {}
- Implementations for each source of ID mappings
public class IDMapperText implements IDMapperFile {
    // Delimited text file mapper implementation
}

public class IDMapperExcel implements IDMapperFile {
    // Excel file mapper implementation
}

public class IDMapperDerby implements IDMapperRDB {
    // Apache Derby specific mapper implementation
}

public class IDMapperMySQL implements IDMapperRDB {
    // MySQL specific mapper implementation
}

public class IDMapperUniprotWS implements IDMapperWebService {
    // Uniprot web service specific mapper implementation
}
. . .
- Some supporting interfaces
- IDMappingContainer to store the ID mappings
public interface IDMappingContainer {
    // xrefs map to each other, i.e., they represent the same biological entity
    public void addIDMapping(Set<Xref> xrefs);

    // return data sources added
    public Set<DataSource> getDataSources();

    // return all xrefs mapped from xref, i.e., all the returned xrefs
    // represent the same biological entity as xref
    public Set<Xref> getIDMapping(Xref xref);

    // return the identifiers (for a particular data source) mapped from xref
    public Set<String> getIDMapping(Xref xref, DataSource tgtType);
}
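One possible way to back such a container is to keep, for every xref, a reference to the group of xrefs it belongs to, and to merge groups whenever a new mapping touches existing ones. The sketch below assumes that mappings are transitive (if A maps to B and B maps to C, all three denote the same entity), which the interface does not state. SimpleXref is a hypothetical stand-in for BridgeDb's Xref with a plain string as data source; it also shows why equals() and hashCode() matter when such objects are used as map keys.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for BridgeDb's Xref (illustration only).
final class SimpleXref {
    final String id;
    final String dataSource;

    SimpleXref(String id, String dataSource) {
        this.id = id;
        this.dataSource = dataSource;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof SimpleXref)) return false;
        SimpleXref other = (SimpleXref) o;
        return id.equals(other.id) && dataSource.equals(other.dataSource);
    }

    @Override public int hashCode() {
        return Objects.hash(id, dataSource);
    }
}

// Sketch of a container; assumes mappings are transitive.
class SimpleIDMappingContainer {
    // Every xref points to the (shared) group of xrefs for the same entity.
    private final Map<SimpleXref, Set<SimpleXref>> groupOf = new HashMap<>();

    void addIDMapping(Set<SimpleXref> xrefs) {
        Set<SimpleXref> merged = new HashSet<>(xrefs);
        for (SimpleXref x : xrefs) {
            Set<SimpleXref> existing = groupOf.get(x);
            if (existing != null) merged.addAll(existing); // absorb old groups
        }
        for (SimpleXref x : merged) groupOf.put(x, merged);
    }

    Set<String> getDataSources() {
        Set<String> sources = new HashSet<>();
        for (SimpleXref x : groupOf.keySet()) sources.add(x.dataSource);
        return sources;
    }

    // All other xrefs in the same group (the query xref itself is excluded).
    Set<SimpleXref> getIDMapping(SimpleXref xref) {
        Set<SimpleXref> result = new HashSet<>();
        Set<SimpleXref> group = groupOf.get(xref);
        if (group != null) result.addAll(group);
        result.remove(xref);
        return result;
    }

    // Identifiers of the given target data source mapped from xref.
    Set<String> getIDMapping(SimpleXref xref, String tgtType) {
        Set<String> ids = new HashSet<>();
        for (SimpleXref x : getIDMapping(xref)) {
            if (x.dataSource.equals(tgtType)) ids.add(x.id);
        }
        return ids;
    }
}
```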
Local file based ID mapping
- Delimited-text and Excel file-based mappings have been implemented in ANM. The corresponding code has been extracted, refactored and submitted to BridgeDb.
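As a rough illustration of what a delimited-file mapper has to do, the sketch below parses lines whose first row names the ID types and whose remaining rows hold one ID per column. This layout is an assumption made for the example; the actual format handled by the ANM/BridgeDb code may differ, and DelimitedMappingParser is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical parser for a delimited mapping table (illustration only).
class DelimitedMappingParser {
    /**
     * @param lines          file content, header row first
     * @param delimiterRegex column delimiter as a regular expression, e.g. "\t"
     * @param srcType        header name of the source ID column
     * @param tgtType        header name of the target ID column
     * @return map from each source ID to the target IDs found for it
     */
    static Map<String, Set<String>> parse(List<String> lines, String delimiterRegex,
                                          String srcType, String tgtType) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        if (lines.isEmpty()) return result;

        // Locate the two columns by their (trimmed) header names.
        List<String> header = new ArrayList<>();
        for (String cell : lines.get(0).split(delimiterRegex, -1)) header.add(cell.trim());
        int srcCol = header.indexOf(srcType);
        int tgtCol = header.indexOf(tgtType);
        if (srcCol < 0 || tgtCol < 0) {
            throw new IllegalArgumentException("ID type not found in header");
        }

        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(delimiterRegex, -1);
            if (cells.length <= Math.max(srcCol, tgtCol)) continue; // short row
            String src = cells[srcCol].trim();
            String tgt = cells[tgtCol].trim();
            if (src.isEmpty() || tgt.isEmpty()) continue;           // missing ID
            result.computeIfAbsent(src, k -> new LinkedHashSet<>()).add(tgt);
        }
        return result;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines for a local file; repeated source IDs across rows naturally produce one-to-many mappings.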
RDB based ID mapping
Webservice based ID mapping
ID mapping plugin GUI
- A GUI mockup according to BridgeDb Specification
- Selecting which ID mapping source to use. Below are options from which the user can choose.
- Provide custom mapping with a local file. (done)
- Get ID mappings from a web service.
- Get ID mappings from a relational database.
- Selecting which ID type(s) are used in the networks from a list of ID types.
- Should we limit the supported ID types to the most commonly used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others) to minimize the chance of error?
- For each ID mapping operation, ID mappings can be retrieved from multiple sources and merged together for a final ID mapping list. This idea has been implemented in the current ID mapping UI in ANM. The figure below is the ID mapping UI: clicking 'Add ID mapping' button will add the retrieved ID mappings to the ID mapping list; ID mappings added multiple times from multiple sources (different files/databases/webservices) will be merged together; clicking 'OK' button will return all the ID mappings.
- After retrieving the data, report to the user how many IDs were found, how many were not, and how many are ambiguous (the same ID existing in different types, which should be rare). For the ambiguous ones, ask the user to decide.
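The merging of ID mappings from several sources and the found / not found / ambiguous report described in this list can be sketched as follows. The class MappingSummary and both method names are hypothetical; the sketch assumes each source returns a map from source ID to target IDs, and that for the report we already know, per user-supplied ID, the set of ID types in which it was found.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for merging and summarizing results (illustration only).
class MappingSummary {
    final Set<String> found = new TreeSet<>();      // exists in exactly one type
    final Set<String> notFound = new TreeSet<>();   // exists in no type
    final Set<String> ambiguous = new TreeSet<>();  // exists in several types

    // Unions the target IDs retrieved for each source ID across all sources.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> sources) {
        Map<String, Set<String>> merged = new HashMap<>();
        for (Map<String, Set<String>> source : sources) {
            for (Map.Entry<String, Set<String>> e : source.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new HashSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    // typesById: for each user-supplied ID, the ID types in which it was found.
    static MappingSummary summarize(Map<String, Set<String>> typesById) {
        MappingSummary summary = new MappingSummary();
        for (Map.Entry<String, Set<String>> e : typesById.entrySet()) {
            int n = e.getValue().size();
            if (n == 0) summary.notFound.add(e.getKey());
            else if (n == 1) summary.found.add(e.getKey());
            else summary.ambiguous.add(e.getKey());
        }
        return summary;
    }
}
```

The sizes of the three sets give the counts to report, and the ambiguous set is the list of IDs to hand back to the user for a decision.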
Caching DB
- Store ID mappings locally. Data from remote sources (databases and web services) will be cached here.
- Use the Derby embedded DB.
- Use the same DB schema as in PathVisio.
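The caching idea can be sketched independently of the storage backend. The example below uses an in-memory map as a stand-in for the planned Derby database and a generic function as a stand-in for the remote source; CachingLookup and its methods are hypothetical names, and no PathVisio schema is implied.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical cache in front of a remote lookup (illustration only;
// a HashMap stands in for the embedded Derby database).
class CachingLookup {
    private final Function<String, Set<String>> remote;
    private final Map<String, Set<String>> cache = new HashMap<>();
    private int remoteCalls = 0;

    CachingLookup(Function<String, Set<String>> remote) {
        this.remote = remote;
    }

    // Returns cached targets if present; otherwise queries the remote
    // source once and stores an unmodifiable copy of the answer.
    Set<String> lookup(String srcID) {
        Set<String> targets = cache.get(srcID);
        if (targets == null) {
            remoteCalls++;
            targets = Collections.unmodifiableSet(new HashSet<>(remote.apply(srcID)));
            cache.put(srcID, targets);
        }
        return targets;
    }

    int getRemoteCalls() {
        return remoteCalls;
    }
}
```

Empty answers are cached as well, so an ID that the remote source does not know is not queried again; whether that is desirable depends on how often the remote data changes.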
ID type guessing
ID mapping validation
Port to Cy3
Discussion
- Is it possible to implement an identification system for Cytoscape instead of just an ID mapping service? With this identification system, the 'universal' identity of each node (or other entity) would be determined according to data sources selected by the user. A selected set of attributes would be used to determine the identities. With ID type guessing, the user does not need to specify the ID type for each attribute (but is required to confirm the guessed types). Once the identities are determined, they can serve as the basis of other related services such as ID mapping, network merge (merging nodes with the same universal identity) and link out (automatically using different types of IDs when linking out).
- Note by Gary Bader: the confirmation by the user is very important, because you don't want to make mistakes that will then be propagated.
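ID type guessing with user confirmation could, for example, start from simple pattern matching that proposes candidate types rather than deciding on one. The sketch below is only an illustration: IDTypeGuesser is a hypothetical name, and the three patterns are rough, incomplete assumptions about what Entrez Gene, RefSeq and UniProt identifiers look like, not an authoritative specification.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical pattern-based guesser (illustration only; the patterns are
// rough assumptions and cover only a few common identifier shapes).
class IDTypeGuesser {
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("Entrez Gene", Pattern.compile("\\d+"));
        PATTERNS.put("RefSeq", Pattern.compile("[NX][MRP]_\\d+(\\.\\d+)?"));
        PATTERNS.put("UniProt", Pattern.compile(
                "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"));
    }

    // Returns every type whose pattern matches the whole ID. More than one
    // candidate, or none, means the user has to decide.
    static Set<String> guess(String id) {
        Set<String> candidates = new LinkedHashSet<>();
        String trimmed = id.trim();
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(trimmed).matches()) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }
}
```

Returning a set of candidates instead of a single answer keeps the confirmation step with the user, in line with the note above.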
Project Management
Timeline
- Stage 1 (week 1-4)
- Define the IDMapper.
- Implement file-based ID mapper by extracting and refactoring the code in ANM.
- RDB-based ID mapper has been defined as GDB in BridgeDb. Rename it as IDMapperRDB.
- Implement webservice-based ID mapper.
- Stage 2 (week 5-7)
- Implement IDMapperTask in Cytoscape.
- Implement ID mapping plugin for mapping node attributes to new attributes of the destination ID types. Note: After this stage, the basic ID mapping API and ID mapping plugin would be ready. An internal alpha version could be released.
- Stage 3 (week 8-10)
- Implement ID type guessing.
- Implement Cache DB.
- Improve the API and plugin according to feedback.
- Stage 4 (week 11-12)
- Port ID mapping plugin to Cy3.
- Port Network Merge plugin to Cy3.
- Stage 5 (week 13)
- User manual and API documents.
Scheduled meetings
Status
- 05/29/2009
- IDMapper interface
- IDMapperFile & IDMapperText: delimited-file-based IDMapper.
- 06/19/2009
- Cy Thesaurus UI framework complete (see: http://tinyurl.com/idmappingsvn/)
- 07/18/2009
- IDMapperRDB integrated in Cy Thesaurus
- IDMapperBiomart implemented in BridgeDb and integrated in Cy Thesaurus
- 08/05/2009
- Cy Thesaurus can provide ID mapping services to other Cytoscape plugins now.
Project Summary
The table below lists the related project goals. The most critical planned goals (along with a few unplanned ones) have been completed. There are 6 unfinished goals. Among them, goal 6 was not in the original plan, and goals 12 and 14 were planned as optional.
| # | Goal | Planned | Achieved |
| 1 | IDMapper API | Y | Y |
| 2 | IDMapperText | Y | Y |
| 3 | IDMapperBiomart | Y | Y |
| 4 | IDMapperPicrRest | N | Y |
| 5 | IDMapperSynergizer | N | Y |
| 6 | DataSource mapping | N | N |
| 7 | Cy Thesaurus GUI | Y | Y |
| 8 | Integrating file-, rdb- and webservice-based IDMappers | Y | Y |
| 9 | Cy Thesaurus ID mapping services for other plugins | N | Y |
| 10 | Type-guessing in Cy Thesaurus | Y | N |
| 11 | Reimplement ID mapping in network merge using Cy Thesaurus services | Y | N |
| 12 | Port to Cytoscape 3 | Y | N |
| 13 | Cache DB in Cy Thesaurus | Y | N |
| 14 | ID mapping data validation | Y | N |
For future work, I would like to work on goals 6, 10 and 11 in the next stage; I will also try to find some time to work on goal 12 later; I will put goals 13 and 14 off the table for now. Please let me know if you have any comments on my plan. In any case, I will keep maintaining the Cy Thesaurus and Network Merge plugins and help develop BridgeDb. Any feedback on the plugins would be highly welcome.
Comments
- Add your comments here...
