Google Summer of Code

Participation

  • 2012 : Access and visualization of biodiversity data in R
    Organization: R project for statistical computing
    Assigned mentors: Scott Chamberlain
    Abstract: rOpenSci is a collaborative effort to develop R-based tools for facilitating Open Science. Biodiversity occurrence records can already be accessed using rOpenSci tools, but there is a need to strengthen the functionality and to develop visualizations and tutorials so that users can learn and make use of these tools. In this proposal I address this need and also propose back-end database support to work around R's memory limits when handling larger data sets with rOpenSci.

  • 2013 : Biodiversity data visualization in R
    Organization: R project for statistical computing
    Assigned mentors: Javier Otegui, Virgilio Gómez-Rubio
    Abstract: R is increasingly being used in biodiversity information analysis. Several R packages, like rgbif and rvertnet, can query, download and to some extent analyze the data within an R workflow. We also have packages like dismo and SDMTools for modeling the data. What is missing is visualization: the proposed visualizations would help users understand the completeness of a biodiversity inventory, the extent of its geographical, taxonomic and temporal coverage, and the gaps and biases in the data. We propose to develop a package to fill this gap.

  • 2014 : bdvis: Biodiversity data visualizations
    Organization: R project for statistical computing
    Assigned mentors: Virgilio Gómez-Rubio, Jorge Soberon
    Abstract: The bdvis package is already under development and was part of GSoC 2013. The package currently has basic functionality for biodiversity data visualization, but with a growing user base, feature requests are coming in. We propose to add the user-requested functionality and implement some new visualization functions to take bdvis to the next level. We also plan to prepare a detailed vignette and submit the package to CRAN.

  • 2015 : Biodiversity Data from Social Networking Sites
    Organization: R project for statistical computing
    Assigned mentors: Jorge Soberon, Javier Otegui, Robert Guralnick
    Abstract: Research in ecology, biodiversity, climate change, invasive species and related fields uses primary biodiversity occurrence data as its basic unit. Even the Global Biodiversity Information Facility, the largest repository of such data, which serves ~530M records, has gaps and biases in terms of taxonomy, geography and seasonality. We propose to build an R package that wraps the APIs of social networking sites (SNS) like Flickr and Picasa and makes the data available to users in prevailing international standard formats like Darwin Core and Audubon Core.

Mentor

  • 2016: NicheToolbox: from getting biodiversity data to evaluating species distribution models in a friendly GUI environment.
    Student: luismurao
    Organization: R project for statistical computing
    Co mentors: Jorge Soberon, Narayani Barve
    Abstract: The NicheToolbox project will be an R package with a friendly Graphical User Interface (GUI), developed using the shiny framework, that aims to facilitate the process of building niche models and estimating species distributions. To do so, it will incorporate functions to curate species occurrence data (clean duplicated records) and to build models that estimate species niches (Bioclim, MaxEnt, Ellipsoid model) and distributions. After building a model, the user will be able to evaluate its performance using partial ROC, confusion matrices and their associated metrics. Finally, in order to make the process of niche modeling transparent, the application will have an option to download a workflow (in HTML, PDF and .doc) with the code that reproduces all the analyses the user has made inside the application; this workflow can be shared with users interested in learning how to build a niche model using the R language.

  • 2016: Visualization of powerful boundary detection tools
    Student: nasyrin
    Organization: R project for statistical computing
    Co mentors: Meng Li
    Abstract: This project will add significant functionality to the BayesBD package and increase its efficiency by optimizing code in C++.

  • 2017: Biodiversity Data Cleaning
    Student: Ashwin Agrawal
    Organization: R project for statistical computing
    Co mentors: Tomer Gueta, Yohay Carmel
    Abstract: Biodiversity data cleaning is an essential step in using biodiversity occurrence data for any meaningful analysis or model building. The R environment already has several functions to address this, but some crucial functionality is still missing for completing the whole workflow within R. This project is an attempt to fill some of those gaps and take the workflow to the next level.

  • 2017: Integrating biodiversity data curation functionality
    Student: Thiloshon Nagarajah
    Organization: R project for statistical computing
    Co mentors: Tomer Gueta, Yohay Carmel
    Abstract: Biodiversity research is evolving rapidly, progressively changing into a more collaborative and data-intensive science. The integration and analysis of large amounts of data is inevitable, as researchers increasingly address questions at broader spatial, taxonomic and temporal scales than before. Until recently, biodiversity data was scattered in different formats in natural history collections, survey reports, and the literature. In the last fifteen years, considerable effort has gone into establishing standards for biodiversity database structure (the Darwin Core standard, DwC). However, none of the hundreds of DwC fields is mandatory, and none imposes strong rules on the content associated with a record; thus, data vary in precision and in quality. To date, several centralized portals aggregate large volumes of biodiversity records from around the world and publish them in a DwC format. These aggregators are prone to numerous data errors, due to incomplete or erroneous information at the publisher level, errors during the publishing processes (e.g. formatting of date information) as well as errors during the central harvesting and indexing procedures.

  • 2017: Parser for Biodiversity checklists
    Student: Qingyue Xue
    Organization: R project for statistical computing
    Co mentors: Thomas V., Rohit George, Narayani Barve
    Abstract: Compiling taxonomic checklists from varied sources of data is a common task for biodiversity informaticians. Data for checklists usually occur in textual formats, and significant manual effort is required to extract taxon names from the text into a tabular format. Textual data in sources such as research publications and websites frequently also contain additional attributes like synonyms, common names, higher taxonomy and distribution. A facility to quickly extract textual data into tabular lists will ease the aggregation of biodiversity data in a structured format that can be used for further processing and for upload to data aggregation initiatives, and will help in compiling biodiversity data.

  • 2018: bdclean: User friendly biodiversity data cleaning pipeline
    Student: Thiloshon Nagarajah
    Organization: R project for statistical computing
    Co mentors: Yohay Carmel, Tomer Gueta
    Abstract: Until recently, biodiversity data was scattered in different formats in natural history collections, survey reports, and the literature. In the last fifteen years, considerable effort has gone into establishing standards for biodiversity database structure and into centralizing the data for better accessibility. But the entities that gather this data do not enforce strong data quality standards, so these sources tend to be prone to many flaws. Data retrieved from centralized sources therefore needs to go through a well-formed quality-control process before it can be used in research. bdclean was created for that purpose. So far we have been able to create numerous quality checks, workflows, analyses and visualization functionalities covering the taxonomic, spatial and temporal aspects. But all of these remain standalone components without much synchronization or connectivity. We propose to refine the overall data cleaning pipeline of bdclean, bring synergy to all the developed components, and develop important new functionality. At the end of this project, users will be able to go through the quality control process in a structured, intuitive and effective way.

  • 2018: Biodiversity Data Utilities
    Student: Ashwin Agrawal
    Organization: R project for statistical computing
    Co mentors: Yohay Carmel, Tomer Gueta
    Abstract: The aim of the project is to improve the current functionality of existing data management and cleaning packages for biodiversity in R and to integrate some new features that would facilitate easier biodiversity data analysis. The project revolves around building some key functionalities, like tools for detecting outliers and building robust taxonomic workflows with the help of parallel computing in R.

  • 2018: Darwinazing biodiversity data in R
    Student: Povilas Gibas
    Organization: R project for statistical computing
    Co mentors: Yohay Carmel, Tomer Gueta
    Abstract: Darwin Core (DwC) is a standard maintained by the Darwin Core maintenance group. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information. The Darwinizer Kurator workflow standardizes field names to the DwC standard names. By implementing this workflow in R, we will be able to easily take in a wider range of data from different publishers.

  • 2018: Species range maps in R
    Student: Marlon E. Cobos
    Organization: R project for statistical computing
    Co mentors: Narayani Barve, Alberto Jiménez Valverde
    Abstract: The species range maps project is motivated by the importance of information about species distribution for conservation planning and the study of spatial patterns of biodiversity. In the face of multiple threats related to Global Change, protection and mitigation actions are crucial for maintaining the health of the planet, and knowing where species are located constitutes primary information for starting these efforts. Currently, generating species range maps may take several steps and require specialized software. Thanks to the recent development of specialized packages, R is rapidly becoming an excellent alternative for analyzing the spatial patterns of biodiversity. Taking advantage of these packages and the versatility of R, the aim of this project is to offer handy and robust open source tools to obtain reliable proposals of species distribution ranges and to analyze their geographical patterns. A large community of students, researchers, and conservation managers will benefit from this project, since these tools will be freely available and will improve the way in which studies of species distributions are developed.

  • 2019: Enhancing Visualizations for Biodiversity Data
    Student: Rahul Chauhan
    Organization: R project for statistical computing
    Co mentors: Thiloshon Nagarajah, Tomer Gueta
    Abstract: We plan to incorporate into bdvis two state-of-the-art elements: interactive plotting and dashboards. We plan to develop and test an interface that enables graphics interactivity with ‘drilling down’ capabilities. Diagnostic visualization can unveil hidden patterns and anomalies in the data, and allow quick exploration of massive datasets. Developing novel interactive visualizations, coupled with a modular dashboard system for biodiversity data that can easily be employed by R experts and novices alike, will promote biodiversity research.

  • 2019: Implementing biodiversity data checks for the bdchecks package
    Student: Povilas Gibas
    Organization: R project for statistical computing
    Co mentors: Thiloshon Nagarajah, Tomer Gueta
    Abstract: bdchecks is an infrastructure for performing, filtering and managing various biodiversity data checks using R. Data checks are key to promoting biodiversity data quality. bdchecks offers features for different types of R users: an interactive and user-friendly Shiny app for inexperienced R users, and full command line functionality for more experienced R users. Advanced R users can easily edit, add and manage their own collection of data checks, using a single YAML file and only two supporting R functions. Improving the quality of biodiversity research depends, in some measure, on improving user-level data cleaning tools and skills. Adopting a more comprehensive approach that incorporates data cleaning as part of data analysis will not only improve the quality of biodiversity data, but will also encourage more appropriate usage of such data.

  • 2019: Grinnellian ecological niches and ellipsoids in R
    Student: Marlon E. Cobos
    Organization: R project for statistical computing
    Co mentors: Luis Osorio, Narayani Barve
    Abstract: Distributional ecology is a growing field of science dedicated to characterizing species distributions based on their ecological niches. Based on early work by Joseph Grinnell and G. Evelyn Hutchinson, most tools in this field use the environmental characteristics of the places where species are found to model their niches via correlative approaches. Currently, these methods are widely used, and their applications include disease risk mapping, climate change risk predictions, and conservation biology, among others. Physiological data suggest that Grinnellian niches are convex in nature and may have an ellipsoidal form when multiple dimensions are considered. However, among the available software in the field, algorithms to model ecological niches as ellipsoids in environmental space are scarce. Several analyses, not currently available, can be performed assuming ellipsoidal niches, especially if recent literature is considered. This project aims to develop an R package of specialized tools to perform multiple analyses of ecological niches using ellipsoids. A broad community of researchers and students will find these open source tools useful in performing their analyses.
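
As a rough illustration of the ellipsoid idea in the last abstract, the sketch below fits a two-dimensional ellipsoid to occurrence points in environmental space and scores new conditions by their Mahalanobis distance from the centroid. This is one common way to express an ellipsoidal niche and is not necessarily the method the project implemented. The data values, the function names and the exponential suitability score are all assumptions made for this example, and it is written in Python rather than R only to keep it dependency-free.

```python
import math

def fit_ellipsoid(points):
    """Return (centroid, inverse covariance matrix) for 2-D points.

    The centroid and covariance define the ellipsoid; the inverse
    covariance is what the Mahalanobis distance needs.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance terms (n - 1 in the denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular; need more varied points")
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def mahalanobis_sq(point, centroid, inv):
    """Squared Mahalanobis distance of a point from the ellipsoid centroid."""
    dx = point[0] - centroid[0]
    dy = point[1] - centroid[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

def suitability(point, centroid, inv):
    """Score in (0, 1]: 1 at the centroid, decaying with distance (assumed form)."""
    return math.exp(-0.5 * mahalanobis_sq(point, centroid, inv))

# Hypothetical occurrences: (mean temperature in deg C, annual precipitation in mm).
occurrences = [(18.0, 900.0), (20.0, 1100.0), (22.0, 1000.0), (19.0, 950.0), (21.0, 1050.0)]
centroid, inv_cov = fit_ellipsoid(occurrences)

near = suitability((20.5, 1010.0), centroid, inv_cov)
far = suitability((30.0, 400.0), centroid, inv_cov)
print(near > far)
```

Conditions close to the centroid score near 1, and conditions far outside the observed range score near 0. A real analysis would use more dimensions and a matrix library for the covariance inverse.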