Google Summer of Code


  • 2012 : Access and visualization of biodiversity data in R
    Organization: R project for statistical computing
    Assigned mentors: Scott Chamberlain
    Abstract: rOpenSci is a collaborative effort to develop R-based tools for facilitating Open Science. Biodiversity occurrence records can be accessed using rOpenSci tools, but there is a need to strengthen the functionality and to develop visualizations and tutorials so that users can learn and make use of these tools. In this proposal, I address this need and also propose back-end database support to get around R's memory limits when working with larger data sets through rOpenSci.

  • 2013 : Biodiversity data visualization in R
    Organization: R project for statistical computing
    Assigned mentors: Javier Otegui, Virgilio Gómez-Rubio
    Abstract: R is increasingly being used in biodiversity information analysis. Several R packages, such as rgbif and rvertnet, can query, download and to some extent analyze the data within an R workflow, and packages such as dismo and SDMTools support modeling the data. The proposed visualizations would help users understand the completeness of a biodiversity inventory, the extent of its geographical, taxonomic and temporal coverage, and the gaps and biases in the data. We propose to develop a package that provides these visualizations.

  • 2014 : bdvis: Biodiversity data visualizations
    Organization: R project for statistical computing
    Assigned mentors: Virgilio Gómez-Rubio, Jorge Soberon
    Abstract: The bdvis package is already under development and was part of GSoC 2013. The package currently has basic functionality for biodiversity data visualization, but as its user base grows, feature requests are coming in. We propose to add the user-requested functionality and implement some new visualization functions to take bdvis to the next level. We also plan to prepare a detailed vignette and submit the package to CRAN.

  • 2015 : Biodiversity Data from Social Networking Sites
    Organization: R project for statistical computing
    Assigned mentors: Jorge Soberon, Javier Otegui, Robert Guralnick
    Abstract: Research in ecology, biodiversity, climate change, invasive species and related fields uses primary biodiversity occurrence data as its basic unit. Even the Global Biodiversity Information Facility, the largest repository of this data, which serves ~530M records, has gaps and biases in terms of taxonomy, geography and seasonality. We propose to build an R package that wraps the APIs of social networking sites (SNS) such as Flickr and Picasa and makes the data available to users in prevailing international standard formats such as Darwin Core and Audubon Core.

  • Mentor

  • 2016: NicheToolbox: from getting biodiversity data to evaluating species distribution models in a friendly GUI environment.
    Student: luismurao
    Organization: R project for statistical computing
    Co-mentors: Jorge Soberon, Narayani Barve
    Abstract: The NicheToolBox project will be an R package with a friendly Graphical User Interface (GUI), developed using the shiny framework, that aims to facilitate the process of building niche models and estimating species distributions. To that end, it will incorporate functions to curate species occurrence data (clean duplicated records) and to build models that estimate species niches (Bioclim, MaxEnt, Ellipsoid model) and distributions. After building a model, the user will be able to evaluate its performance using partial ROC, confusion matrices and their associated metrics. Finally, to make the process of niche modeling transparent, the application will have an option to download a workflow (in html, pdf and .doc) with the code that reproduces all the analyses the user has made inside the application; this workflow can be shared with users interested in learning how to build a niche model using the R language.

  • 2016: Visualization of powerful boundary detection tools
    Student: nasyrin
    Organization: R project for statistical computing
    Co-mentors: Meng Li
    Abstract: This project will add significant functionality to the BayesBD package and increase its efficiency by optimizing code in C++.

  • 2017: Biodiversity Data Cleaning
    Student: Ashwin Agrawal
    Organization: R project for statistical computing
    Co-mentors: Tomer Gueta, Yohay Carmel
    Abstract: Biodiversity data cleaning is an essential step in using biodiversity occurrence data for any meaningful analysis or model building. The R environment already has several functions to address this, but some crucial functionality is still missing for completing the whole workflow within R. This project is an attempt to fill in some of those gaps and take the workflow to the next level.

  • 2017: Integrating biodiversity data curation functionality
    Student: Thiloshon Nagarajah
    Organization: R project for statistical computing
    Co-mentors: Tomer Gueta, Yohay Carmel
    Abstract: Biodiversity research is evolving rapidly, progressively changing into a more collaborative and data-intensive science. The integration and analysis of large amounts of data is inevitable, as researchers increasingly address questions at broader spatial, taxonomic and temporal scales than before. Until recently, biodiversity data was scattered in different formats in natural history collections, survey reports, and the literature. In the last fifteen years, considerable effort has gone into establishing standards for biodiversity database structure (the Darwin Core standard, DwC). However, none of the hundreds of DwC fields is mandatory or imposes strong rules on the content associated with a record; thus, data vary in precision and in quality. To date, there are several centralized portals that aggregate large volumes of biodiversity records from around the world and publish them in a DwC format. These aggregators are prone to numerous data errors, due to incomplete or erroneous information at the publisher level, errors during the publishing processes (e.g. formatting of date information), as well as errors during the central harvesting and indexing procedures.

  • 2017: Parser for Biodiversity checklists
    Student: Qingyue Xue
    Organization: R project for statistical computing
    Co-mentors: Thomas V., Rohit George, Narayani Barve
    Abstract: Compiling taxonomic checklists from varied sources of data is a common task that biodiversity informaticians encounter. Data for checklists usually occur in textual formats, and significant manual effort is required to extract taxon names from the text into a tabular format. Textual data in sources such as research publications and websites frequently also contain additional attributes like synonyms, common names, higher taxonomy and distribution. A facility to quickly extract textual data into tabular lists will make it easy to aggregate biodiversity data in a structured format that can be used for further processing and for upload to data aggregation initiatives, and will help in compiling biodiversity data.