PyBDA: a command line tool for automated analysis of big biological data sets.

Dirmeier S; Emmenlauer M; Dehio C; Beerenwinkel N

doi:10.1186/s12859-019-3087-8

Journal article

Dirmeier S Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland.
Emmenlauer M Biozentrum, University of Basel, Basel, Switzerland.
Dehio C Biozentrum, University of Basel, Basel, Switzerland.
Beerenwinkel N Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland. niko.beerenwinkel@bsse.ethz.ch.

2019-11-14

Published in:

BMC bioinformatics. - 2019

BACKGROUND
Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians because accessible tools that scale to hundreds of millions of data points are lacking.

RESULTS
We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs on a high-performance computing cluster. We demonstrate the utility of the software by analysing image-based RNA interference data of 150 million single cells.

CONCLUSION
PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.

Language

English

Open access status

gold

Identifiers

DOI 10.1186/s12859-019-3087-8
PMID 31718539

Persistent URL

https://sonar.ch/global/documents/118245

Big data

Command line

Computing cluster

Data analysis

Grid engine

Machine learning

Pipeline

Algorithms

Automation

Computational Biology

Computing Methodologies

HeLa Cells

Humans

Image Processing, Computer-Assisted

Machine Learning

Statistics