August 12, 2015

Algorithm clarifies ‘big data’ clusters


Rice University researchers used their progeny clustering technique to analyze a test data set with 41 characteristics drawn from 440 cells that fell, roughly, into four shapes. Using those characteristics, the program accurately split the samples into the proper clusters, matching their shapes. The researchers expect their data analysis tool to help clinicians obtain meaningful patient groupings prior to treatment. (Credit: Qutub Systems Biology Lab/Rice University)

Rice University bioengineers advance computing technique for health care and more

(August 12, 2015) Rice University scientists have developed a big data technique that could have a significant impact on health care.

The Rice lab of bioengineer Amina Qutub designed an algorithm called “progeny clustering” that is being used in a hospital study to identify which treatments should be given to children with leukemia.

Details of the work appear today in Nature’s online journal Scientific Reports.

Clustering is important for its ability to reveal information in complex sets of data like medical records. The technique is used in bioinformatics — a topic of interest to Rice scientists who work closely with fellow Texas Medical Center institutions.

“Doctors who design clinical trials need to know how to group patients so they receive the most appropriate treatment,” Qutub said. “First, they need to estimate the optimal number of clusters in their data.” The more accurate the clusters, the more personalized the treatment can be, she said.

Rice University graduate student Wendy Hu is leading the development of a new technique to help clinicians obtain meaningful patient groupings when designing trials for treatment of disease. Progeny clustering could have a significant impact on health care and even non-health-related “big data” problems. (Credit: Jeff Fitlow/Rice University)

Separating groups by a single data point, like eye color, would be easy, she said. But when separating people by the types of proteins in their bloodstreams, it becomes more difficult.

“That’s the kind of data that’s become prevalent everywhere in biology, and it’s good to have,” Qutub said. “We want to know hundreds of features about a single person. The problem is identifying how to use all that data.”

The Rice algorithm provides a way to ensure the number of clusters is as accurate as possible, she said. The algorithm extracts characteristics about patients from a data set, mixing and matching them randomly to create artificial populations: the “progeny,” or descendants, of the parent data. The characteristics appear in roughly the same ratios in descendants as they do among the parents.
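The mixing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published progeny clustering implementation: the function name make_progeny, the sample patient records, and the choice to draw each characteristic independently (with replacement) from the full parent set are all assumptions made here for illustration. The sketch covers only the creation of artificial records, not how the number of clusters is then evaluated.

```python
import random


def make_progeny(parents, n_progeny, seed=None):
    """Create artificial records by mixing parent characteristics.

    Hypothetical helper: each characteristic of a progeny record is copied
    from a randomly chosen parent, independently of the other
    characteristics, so values appear in roughly the same ratios as in the
    parent data.
    """
    rng = random.Random(seed)
    dimensions = list(parents[0].keys())
    progeny = []
    for _ in range(n_progeny):
        # Pick a (possibly different) parent for every dimension.
        child = {dim: rng.choice(parents)[dim] for dim in dimensions}
        progeny.append(child)
    return progeny


# Made-up parent records, for illustration only.
parents = [
    {"hair": "brown", "birthplace": "Houston", "cell_count": 4.1},
    {"hair": "black", "birthplace": "Austin", "cell_count": 5.3},
    {"hair": "brown", "birthplace": "Dallas", "cell_count": 6.0},
    {"hair": "blond", "birthplace": "Houston", "cell_count": 4.8},
]

if __name__ == "__main__":
    for record in make_progeny(parents, n_progeny=3, seed=0):
        print(record)
```

Because every value is drawn from the parent records, a common value such as "brown" hair shows up in the progeny about as often as it does among the parents, while the combinations of values are new.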

These characteristics, called dimensions, can be anything: as simple as hair color or place of birth, or as detailed as one’s blood cell count or the proteins expressed by tumor cells. For even a small population, each individual may have hundreds or even thousands of dimensions.
