Understanding evolution through big-data RNA-seq processing.

About the project

As gene family experts within the OneKP consortium, we were tasked with finding members of the hydroxyproline-rich glycoprotein (HRGP) gene family. This is no easy task with a dataset the size of OneKP, which comprises over 1200 distinct RNA-seq samples from over 1000 plant and algal species sampled throughout evolution. HRGPs are a diverse and heterogeneous family of glycosylated proteins that includes arabinogalactan-proteins (AGPs), extensins (EXTs) and proline-rich proteins (PRPs), each of which is itself a continuum of structures. The common thread that defines this diverse family is the O-glycosylation of hydroxyproline (Hyp, O), a phenomenon that is widespread in higher plants but absent in animals. Plant HRGPs are rich in Pro (P), Ala (A), Ser (S) and Thr (T) and are glycosylated on their Hyp (O) residues.

An initial investigation of the proteins predicted by the OneKP consortium detected few HRGP family members, perhaps because of assembly and pipeline thresholds. To address this, we reassembled all available samples (over 1200) with Oases at four large k-mer sizes (39, 49, 59 and 69) in an attempt to span the tandem repeats that are common to several types of HRGPs (EXTs and PRPs).
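The effect of k on tandem repeats can be illustrated with a toy example. The sketch below is not Oases and does not use OneKP data: the sequence, the repeat unit and the function name `repeated_kmers` are all made up for illustration. It lists the k-mers that occur more than once in a sequence containing two tandem copies of a 30-nt proline-rich unit. Any k-mer that recurs is a point where a de Bruijn graph cannot tell the copies apart, so fewer recurring k-mers means less ambiguity for the assembler.

```python
from collections import Counter

def repeated_kmers(seq: str, k: int) -> set[str]:
    """Return the set of k-mers that occur more than once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}

# Hypothetical 30-nt repeat unit (ten codons, mostly Pro/Ser/Thr).
unit = "".join(["CCA", "CCT", "CCA", "TCA", "CCA",
                "CCC", "ACT", "CCT", "TCA", "CCA"])

# Hypothetical unique flanking sequence on either side of two tandem copies.
left_flank = "ATGGCTAAGTTCGCATTGAG"
right_flank = "GATTACGTTGCAGGCTTTAA"
toy_transcript = left_flank + unit * 2 + right_flank

# Compare the consortium's k of 25 with the four larger values used here.
for k in (25, 39, 49, 59, 69):
    n_repeated = len(repeated_kmers(toy_transcript, k))
    print(f"k={k}: {n_repeated} k-mer(s) occur more than once")
```

In this toy case a 25-mer fits inside a single 30-nt unit and therefore recurs in the second copy, whereas a 69-mer is longer than the whole 60-nt repeat region and always includes unique flanking sequence. Real transcripts can carry many more copies, which is one reason for trying several values of k rather than a single one.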

The HRGP family comprises numerous sub-families, both with recognised domains (chimeric HRGPs) and without (non-chimeric HRGPs). Our intention was to investigate non-chimeric HRGPs in the first instance, with follow-up studies to investigate the chimeric sub-families and AG-peptides.

Construction of the dataset

The OneKP consortium provides two existing assembled datasets covering the more than 1200 samples:

  • SOAPdenovo-assembled k25 contigs (designated the k25 dataset). This dataset does not employ scaffolding.
  • SOAPdenovo-assembled k25 scaffolds (designated the k25s dataset). This dataset scaffolds the contigs where sufficient pairs exist to define the distance separating the contigs.

Unfortunately, neither the k25 contigs nor the k25 scaffolds use a large enough k to ensure that the tandem repeats present in HRGPs are correctly reconstructed by de Bruijn graph assembly methods. To address this, we used larger k-mers and a multiple-k assembly methodology. Values of 39, 49, 59 and 69 were chosen for k, and all available samples were assembled at each value. Predicted proteins were then filtered for the compositional bias indicative of HRGPs and classified by a hand-crafted decision tree pipeline (motif and amino-acid bias, MAAB) into one of twenty-four pre-defined MAAB classes.
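The compositional filtering step can be sketched as follows. This is a minimal illustration rather than the MAAB pipeline itself: it scores each predicted protein by the fraction of its residues that are Pro, Ala, Ser or Thr (the residues HRGPs are rich in) and keeps those above a cutoff. The function names, the example sequences and the default cutoff of 0.45 are assumptions made for this example and are not taken from the text above.

```python
PAST_RESIDUES = "PAST"  # Pro, Ala, Ser, Thr

def past_fraction(protein: str) -> float:
    """Fraction of residues that are P, A, S or T (a trailing stop '*' is ignored)."""
    seq = protein.upper().rstrip("*")
    if not seq:
        return 0.0
    return sum(seq.count(aa) for aa in PAST_RESIDUES) / len(seq)

def filter_by_past(proteins: dict[str, str], min_fraction: float = 0.45) -> dict[str, str]:
    """Keep proteins whose PAST fraction is at least min_fraction.

    The default of 0.45 is an illustrative assumption, not a documented
    pipeline setting.
    """
    return {name: seq for name, seq in proteins.items()
            if past_fraction(seq) >= min_fraction}

# Made-up example sequences (not OneKP data).
candidates = {
    "toy_ext_like": "SPPPPSPPPPYVYKSPPPPSPPPP",
    "toy_globular": "MKVLDEGWRNCHQFIYLKDEGWRN",
}

kept = filter_by_past(candidates)
for name in kept:
    print(name, round(past_fraction(candidates[name]), 2))
```

A filter of this kind only narrows the candidate set by composition; assigning each retained protein to one of the twenty-four MAAB classes requires the motif-based decision tree, which is not reproduced here.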

Other 1KP sites

  1. List of samples
  2. OneKP wiki
  3. OneKP