Machine Learning using Big Data: How Apache Spark Can Help

Cleverly designed software makes applications running in clusters more fault-tolerant

Machine learning is the process of automatically building models from data. In the past two decades, researchers in many fields of study have been generating these models from progressively more data. Because this has led to higher-quality models, researchers are turning to still greater quantities of data, which in turn require increasingly complex distributed computing systems.
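To give a flavor of the kind of distributed model-fitting the article discusses, here is a minimal PySpark sketch, not taken from the article; the file name and column names are hypothetical. If a worker machine fails mid-job, Spark re-runs the lost tasks on other nodes, which is the sort of fault tolerance the subtitle refers to.

# Minimal sketch of fitting a model with Apache Spark's MLlib.
# Assumes PySpark is installed; the CSV file and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Read a (hypothetical) table of patient records, distributed across the cluster.
df = spark.read.csv("patients.csv", header=True, inferSchema=True)

# MLlib expects the features gathered into a single vector column.
assembler = VectorAssembler(inputCols=["age", "lab1", "lab2"], outputCol="features")
train = assembler.transform(df)

# Fit the model; Spark schedules the work across the cluster and recomputes
# any partitions lost to machine failures.
model = LogisticRegression(featuresCol="features", labelCol="outcome").fit(train)
print(model.coefficients)

spark.stop()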

Beyond Principal Components Analysis (PCA)

Using low rank models to understand big data

In many application areas, researchers seek to understand large collections of tabular data, for example, patient lab test results. The values in the table might be numerical (3.14), Boolean (yes, no), ordinal (never, sometimes, always), or categorical (A, B, O). As a practical matter, some entries in the table might also be missing.
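To make the connection to PCA concrete, here is a minimal sketch, not from the article and using synthetic data, of a rank-k approximation computed with the singular value decomposition, which is what PCA amounts to for a fully numerical, fully observed table. Low rank models of the kind the subtitle mentions generalize this idea to Boolean, ordinal, categorical, and missing entries by changing the loss applied to each column.

# Minimal sketch: best rank-k approximation of a numeric data matrix via the SVD.
# The data here are synthetic; real tables would first need encoding of
# non-numeric columns and handling of missing entries.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # rows = patients, columns = lab tests
X = X - X.mean(axis=0)                # center each column

k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k reconstruction

# Fraction of variance captured by the top k components,
# and the relative reconstruction error.
print((s[:k] ** 2).sum() / (s ** 2).sum())
print(np.linalg.norm(X - X_k) / np.linalg.norm(X))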


Mutual Information: A Universal Measure of Statistical Dependence

And how mutual information is useful in Big Data settings

A deluge of data is transforming science and industry. Many hope that this massive flux of information will reveal new vistas of insight and understanding, but extracting knowledge from Big Data requires appropriate statistical tools. Often, very little can be assumed about the types of patterns lurking in large data sets. In these cases it is important to use statistical methods that do not make strong assumptions about the relationships one hopes to identify and measure.
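For readers who want a feel for the quantity itself, here is a minimal sketch, not from the article, that estimates the mutual information between two samples with a simple two-dimensional histogram; the bin count and the synthetic data are illustrative only, and more careful estimators exist.

# Minimal sketch: histogram-based estimate of mutual information (in bits).
import numpy as np

def mutual_information(x, y, bins=20):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                    # joint distribution
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y
    nonzero = pxy > 0
    return float((pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x ** 2 + 0.5 * rng.normal(size=5000)     # nonlinear dependence that correlation misses
print(mutual_information(x, y))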


Getting Started with Cloud Services for Biomedical Computation

How to tap into this cost-effective and flexible solution

Biomedical researchers who work with large data sets may run out of both disk space and patience while waiting for a computation to finish. Though buying more hard drives and faster computers may seem tempting, the cloud is now a realistic option.


In 2008, when cloud computing was relatively new, this magazine published a column by Alain Laederach predicting that scientists would be won over to cloud computing, despite some people’s concerns about a loss in performance with the added layer of virtualization.


Prototype to Release: Software Engineering for Scientific Software

Set objectives and follow through

Having engineered several scientific software applications for public consumption, the authors know from experience that the process offers unique challenges. Typically, the algorithms being implemented are complex; the process involves numerous developers with various backgrounds and skill sets; and it all takes place in a fast-paced environment where new methods must be prototyped and tested regularly.

