WP1: Definitions

Objective:
Define the standards and concepts needed to establish a common language between the disciplines involved in this project.

Description of work:
An agreement on the interconnection of genomic and GIS databases will be reached. For this purpose, open standards such as the OpenGIS specification for geographic systems will be used; their use in this project will also help to propagate them within the scientific community. These first agreements will prepare the cooperation within WP2, WP3 and WP5.

Secondly, genomic/environmental concepts such as a gene pattern will be defined by the project's biologists so that the computer scientists can formally describe the relationships among genes and between genes and environmental parameters. This will constitute a crucial bridge between biology and computer science (for WP4).

WP2: Acquisition

Objectives:
Implementation of an automated method for retrieving sequence information and habitat-specific information from scientific publications, web pages and databases.

Development of automated methods for environmental Information Extraction from heterogeneous data sources.

Description of work:
This work package is the basis for the further sequence and environmental data analysis performed by the tools prepared in the other work packages. It has two main goals, both concerning data retrieval and extraction, and a third important goal: to develop tools for automated data updating.

Tools for automated sequence extraction from public data sources will be developed and implemented. This is the first step in the data analysis process, whose results should be information on habitat-specific genes. The second main task is to develop algorithmic methods for automated environmental Information Extraction. In addition to the sequences, such information is necessary for drawing conclusions about habitat-specific genes. The methods should allow information to be extracted from unstructured texts, i.e. from various web pages and the literature. The development of suitable algorithms is an open scientific question in the areas of Data Mining and Artificial Intelligence, which METAFUNCTIONS will address. All these tools will be integrated into the database system developed in WP 3.
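To make the idea of extracting environmental information from unstructured text concrete, the following minimal sketch pulls three parameters out of a sentence with regular expressions. The parameter names, units and patterns are illustrative assumptions only; the methods to be developed in this work package are expected to go well beyond fixed patterns of this kind.

```python
import re

# Hypothetical patterns for three parameters (assumed names and units).
# Real publications and web pages will need far richer rules.
PATTERNS = {
    "temperature_c": re.compile(
        r"(-?\d+(?:\.\d+)?)\s*(?:°\s*C|degrees\s+Celsius)", re.IGNORECASE
    ),
    "salinity_psu": re.compile(
        r"salinity\s+(?:of\s+)?(\d+(?:\.\d+)?)", re.IGNORECASE
    ),
    "depth_m": re.compile(
        r"depth\s+of\s+(\d+(?:\.\d+)?)\s*m\b", re.IGNORECASE
    ),
}


def extract_parameters(text):
    """Return the first numeric match for each known parameter.

    Parameters that are not mentioned in the text are simply left out.
    """
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = float(match.group(1))
    return found


sentence = (
    "Samples were taken at a depth of 2500 m; water temperature "
    "was 2.5 °C with a salinity of 34.8 psu."
)
print(extract_parameters(sentence))
```

A rule-based extractor like this is easy to audit but brittle, which is one reason why the text above treats suitable algorithms as an open question.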

WP3: Storage and Evaluation

Objectives:
Several metagenome sampling sites lack the basic environmental parameters that are essential to detect habitat-specific gene patterns and assign functions to them. To fill in these missing data, GRID-Geneva will collect and process a considerable volume of GIS data in this work package: the so-called “Key” environmental data.

Description of work:
"Key" environmental data layers form the stratum representing the main environmental variables, chiefly for aquatic ecosystems, comprising physical, chemical, geological and biological parameters (e.g. ocean water temperature and salinity, concentrations of pollutants, nutrients and organic matter). They will serve as a basis for data analysis in the "Metagenomes Mapserver" where data at the individual site scale are missing or incomplete. These core GIS data sets will be prepared at different geographic extents (from global down to sampling-site size). Time series layers are also conceivable (annual and seasonal long-term averages and “real time” data). All environmental data will be compatible with OGC standards and will be documented with exhaustive meta-information consistent with the ISO 19115 standard of ISO/TC 211.

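The use of "Key" layers as a fallback for missing site data can be sketched as follows. This is a minimal illustration assuming a regular latitude/longitude grid held in memory; the class, field and parameter names are hypothetical, and a real implementation would read OGC-compatible layers rather than nested lists.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GridLayer:
    """A regular lat/lon raster (illustrative assumption, not a real format)."""
    lat_min: float    # latitude of the southern edge
    lon_min: float    # longitude of the western edge
    cell_size: float  # cell width and height in degrees
    values: Sequence[Sequence[Optional[float]]]  # values[row][col], row 0 = south

    def value_at(self, lat, lon):
        """Return the value of the cell containing (lat, lon), or None if outside."""
        row = int((lat - self.lat_min) // self.cell_size)
        col = int((lon - self.lon_min) // self.cell_size)
        if 0 <= row < len(self.values) and 0 <= col < len(self.values[row]):
            return self.values[row][col]
        return None


def complete_site(site, layers):
    """Fill parameters missing from a site record and record where each came from."""
    completed = dict(site)
    sources = {}
    for name, layer in layers.items():
        if completed.get(name) is not None:
            sources[name] = "measured"
            continue
        fallback = layer.value_at(site["lat"], site["lon"])
        completed[name] = fallback
        sources[name] = "key_layer" if fallback is not None else "missing"
    completed["sources"] = sources
    return completed
```

Keeping the source of each value alongside the value lets later analysis distinguish measured site data from values taken from a layer.
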
WP4: Analysis

Objectives:
Development of a data mining tool for the detection of habitat-specific gene patterns. A gene pattern is called habitat-specific if there exists a clear dependency on specific environmental parameters such as water depth, temperature, salinity or other physicochemical properties.

Description of work:
A gene pattern to be extracted from different genome sequences or metagenomes (WP 2) is a set of gene sequences expected to be functionally related and to have the following features:

for each gene sequence of the set there exists at least one corresponding gene sequence in another genome that is a putative ortholog,

all genes of a pattern are neighboured within a given distance,

most of these putative orthologs have the same order and orientation in different genomes, and

the genes in the set show a clear dependency on specific environmental parameters.
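Two of the features listed above, neighbourhood within a given distance and conserved order and orientation, can be checked mechanically once putative orthologs have been assigned. The sketch below assumes each gene carries a hypothetical ortholog-group label, coordinates and a strand; it treats a cluster read from the opposite strand as equivalent. It is an illustration of the criteria, not the detection method to be developed.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    group: str   # hypothetical ortholog-group label
    start: int   # start coordinate on the genome
    end: int     # end coordinate on the genome
    strand: int  # +1 or -1


def is_neighboured(genes, max_gap):
    """True if every gap between consecutive genes is at most max_gap."""
    ordered = sorted(genes, key=lambda g: g.start)
    return all(
        nxt.start - prev.end <= max_gap
        for prev, nxt in zip(ordered, ordered[1:])
    )


def signature(genes):
    """Ortholog groups and strands in genomic order."""
    return tuple((g.group, g.strand) for g in sorted(genes, key=lambda g: g.start))


def same_order_and_orientation(a, b):
    """Compare two clusters, also accepting b read from the opposite strand."""
    sig_a, sig_b = signature(a), signature(b)
    flipped = tuple((group, -strand) for group, strand in reversed(sig_b))
    return sig_a == sig_b or sig_a == flipped


genome_a = [Gene("og1", 100, 900, 1), Gene("og2", 950, 2400, 1), Gene("og3", 2450, 4000, 1)]
genome_b = [Gene("og3", 5000, 6500, -1), Gene("og2", 6550, 8000, -1), Gene("og1", 8100, 8900, -1)]
print(is_neighboured(genome_a, max_gap=300), same_order_and_orientation(genome_a, genome_b))
```

The feature list asks only that most orthologs share order and orientation, so a practical tool would need a tolerant comparison rather than the exact match used here.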

To find these habitat-specific gene patterns, novel software has to be developed that can deal with the vast amount of data and the different kinds of features. The software should include efficient sequence comparison methods for detecting orthologous genes, and data mining methods capable of spatial reasoning and of handling uncertainty, in order to infer gene patterns from gene order and orientation. Applying this tool to completely sequenced and annotated genomes, as well as to metagenomes compiled and stored in WP 3, will produce a list of potential gene patterns.

In the next step, these gene patterns have to be linked to environmental features of the organisms, which will also be compiled in WP 3. To describe all relevant features explicitly, an additional feature extraction step may be necessary. Relational knowledge representation techniques and inductive logic programming algorithms are well suited to representing background knowledge about the domain and including it in the data mining procedure; they have to be adapted to this problem. Using this data mining tool, a list of potential habitat-specific gene patterns will be produced, which will then be interpreted and evaluated by domain experts.
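As a rough illustration of what linking a gene pattern to an environmental feature could mean, the sketch below compares one parameter between sites where a pattern was found and sites where it was not. It deliberately swaps in a plain permutation test on the difference of means, not the relational and inductive logic programming techniques named above; the function name and the sample values are assumptions made for the example.

```python
import random
from statistics import mean


def permutation_p_value(with_pattern, without_pattern, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    A small p-value suggests that the parameter differs between sites
    with and without the gene pattern; it does not prove causation.
    """
    rng = random.Random(seed)
    observed = abs(mean(with_pattern) - mean(without_pattern))
    pooled = list(with_pattern) + list(without_pattern)
    k = len(with_pattern)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical water temperatures (degrees Celsius) at sampling sites.
temps_with = [2.1, 2.5, 3.0, 2.8, 2.2]
temps_without = [15.0, 18.2, 16.5, 17.1, 14.9]
p = permutation_p_value(temps_with, temps_without)
print(f"p = {p:.4f}")
```

A test like this handles one parameter at a time and ignores background knowledge, which is the gap the relational techniques mentioned above are meant to fill.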

WP5: Access

Objectives:
Refine and update the "environmental/genomic" database (from WP 3).

Provide access to information from both the genomic and the environmental side through a spatial (GIS) approach based on a map server application, the "Metagenomes Mapserver", to be developed in this work package.

Description of work:
The focal point of this work package will be the development of the "Metagenomes Mapserver", an Internet-based GIS application for retrieving information from the "environmental/genomic" database. The Metagenomes Mapserver will be a platform-independent application, developed in an open-source environment and based on a modular approach: the modules respond to the needs of our partners and will allow future developments that follow the evolution of the technology.

Several specialized modules will be created and added during the development of the “Metagenomes Mapserver”:

Base module: provides basic functionality such as map visualization, zoom in, zoom out, panning, simple queries, adding and visualizing a new layer, displaying legends, etc.
Complex queries module: permits more sophisticated search functions, such as showing maps and/or raw data where several conditions (x, y, w, z, ...) are satisfied. Access to data from both the environmental and the metagenomic side can also be implemented.
Graph module: produces graphs for selected variables; it can generate trends and perform analyses over time and space.
Case study module: depending on the availability of environmental data for a sampling site, it allows the scale to be changed and works at sampling-site level (e.g. 10 km to 1 m).
Download data module: distributes chosen data sets and metadata in different formats (e.g. CSV, XML, PDF, HTML, TXT).
Administrator module: allows selected users to add and manage data through a web interface.
Optional modules: these may be developed in the future, for example a connection to external databases or new functionality created by the Open GIS community during the development of the “Metagenomes Mapserver”.
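The kind of search the complex queries module is meant to support, returning data where several conditions hold at once, can be sketched as a simple filter over site records. The record fields and the `(field, operator, value)` condition format are assumptions for illustration; the actual module would query the "environmental/genomic" database through the map server.

```python
import operator

# Comparison operators accepted in a condition (assumed condition syntax).
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def query_sites(sites, conditions):
    """Return the sites for which every (field, op, value) condition holds.

    Sites that lack a queried field are excluded rather than guessed at.
    """
    def matches(site):
        for field, op, value in conditions:
            actual = site.get(field)
            if actual is None or not OPERATORS[op](actual, value):
                return False
        return True

    return [site for site in sites if matches(site)]


# Hypothetical sampling-site records.
sites = [
    {"id": "S1", "depth_m": 2500, "temperature_c": 2.5},
    {"id": "S2", "depth_m": 10, "temperature_c": 18.0},
    {"id": "S3", "depth_m": 1200},
]
deep_and_cold = query_sites(sites, [("depth_m", ">", 1000), ("temperature_c", "<", 5)])
print([site["id"] for site in deep_and_cold])
```

Excluding sites with missing fields is one possible policy; a site completed from "Key" environmental layers could instead be included and flagged.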

WP6: Management

Objectives:
Ensuring tight integration of the different areas of expertise into the “Metagenomes Mapserver”, including links to the Networks of Excellence “Marine Genomics” and “MARBEF” mainly as test users.

Organizing management committee and other meetings (kick-off, mid-term, bilateral).

Developing a tailored consortium agreement as well as plans and strategies for:

– Knowledge management,

– Intellectual property rights,

– Exploitation and dissemination of the METAFUNCTIONS results.

Description of Work:
This work package will develop effective information exchange between the partners, with the Max Planck Institute Bremen (MPIMM) at the centre. Together with WP 3, it thus constitutes the backbone of the project by centralizing and organizing the data collection for full integration of the different areas of expertise into the “Metagenomes Mapserver”.

The MPIMM will dedicate an experienced manager to lead this work package and to implement the management of knowledge and intellectual property as well as the exploitation and dissemination plans for the results. The plan for exploitation and dissemination will be developed by this manager and implemented jointly by all partners. A consortium agreement will regulate IPR and knowledge management issues. It will be developed specifically for this NEST Adventure project, which uses open access databases and GIS standards as a basis.

There will be a small management committee, which meets every six months. Every partner will be represented on the management committee by one senior scientist and will have one vote. The manager will prepare the agenda for each meeting, minute the results and decisions, and communicate them to all scientists involved. After the kick-off meeting with all staff involved, which starts the definition of user requirements and specifications (see WP 1), a set of bilateral meetings will complete the detailed WP 1 definitions so that the details of Work Packages 2, 4 and 5 can be developed efficiently in relation to WP 3 (Storage and Evaluation). Thus all work packages will run in parallel from month 3 onwards. Yearly meetings of all partners and staff involved are planned, plus one mid-term review meeting.

As the MPIMM is a member of two Networks of Excellence, “Marine Biodiversity” and “Marine Genomics”, close contacts with two related communities of marine biologists and molecular ecologists are available, giving easy access to test users. Furthermore, these networks will expand existing databases of marine environmental and biodiversity information, which in turn can be put to good use by METAFUNCTIONS. Finally, access to unpublished data is important for the full success of METAFUNCTIONS. A major source for this will be the emerging NoE knowledge, which, thanks to the trust and integration developing in these Networks of Excellence, can be used by the three overlapping communities for mutual benefit.