Parsing human-readable InParanoid output
While my love for InParanoid is somewhat lacking, the application nevertheless remains for the moment the most convenient method for determining orthologous groups between two protein sets. InParanoid provides its output in four formats:
Human-readable text (e.g.,
An HTML file serving as a slightly stylized representation of the previous (e.g.,
A tab-separated list giving each orthologous group on its own line (e.g.,
A tab-separated list spreading a given orthologous group across multiple lines (e.g.,
The two tab-separated lists provide the same information, differing only in whether a given orthologous group spans one or multiple lines. While these tab-separated lists are the easiest to parse by machine, they regrettably provide only a paucity of data.
To enjoy the full range of data InParanoid provides (such as bootstrap support values), or even merely to delineate clearly which orthologues came from the first input set and which came from the second, one must instead turn to the human-readable output. Parsing this, unfortunately, is not terribly pleasant. For this task, I wrote a parser for InParanoid’s human-readable output. To cope with a format not intended for machine consumption, the script uses a simple state machine, transitioning from state to state when it comes across various headers, separators, and other sorts of records. In this regard, it is similar to the Ruby script I wrote eight years ago for recording Shoutcast MP3 streams.
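To make the state-machine idea concrete, here is a minimal Python sketch. It is not the parser from the Gist. The record shapes it recognises (a line of underscores as a separator, a "Group of orthologs #N" header, "Bootstrap support for ..." lines, and tab-separated member rows of the form A name, A score, B name, B score) are assumptions about the human-readable layout, and the names `State`, `parse_groups`, and `SAMPLE` are made up for illustration. Check them against a real output file before relying on any of this.

```python
import re
from enum import Enum, auto


class State(Enum):
    OUTSIDE = auto()   # in the preamble, or just past a separator line
    IN_GROUP = auto()  # after a "Group of orthologs" header


# Assumed record shapes; adjust these to match the real file.
GROUP_HEADER = re.compile(r"^Group of orthologs #(\d+)")
BOOTSTRAP = re.compile(r"^Bootstrap support for (\S+) as seed ortholog is (\d+)%")


def parse_groups(lines):
    """Return one dict per orthologous group found in `lines`."""
    state = State.OUTSIDE
    groups = []
    for raw in lines:
        line = raw.rstrip("\n")  # keep tabs: empty columns are meaningful
        if line.startswith("____"):
            state = State.OUTSIDE  # a separator closes the current group
            continue
        header = GROUP_HEADER.match(line)
        if header:
            groups.append(
                {"id": int(header.group(1)), "a": [], "b": [], "bootstrap": {}}
            )
            state = State.IN_GROUP
            continue
        if state is State.OUTSIDE:
            continue  # preamble or other text this sketch does not interpret
        group = groups[-1]
        support = BOOTSTRAP.match(line)
        if support:
            group["bootstrap"][support.group(1)] = int(support.group(2))
            continue
        fields = line.split("\t")
        if len(fields) == 4:  # assumed member row: A name, A score, B name, B score
            a_name, _a_score, b_name, _b_score = fields
            if a_name.strip():
                group["a"].append(a_name.strip())
            if b_name.strip():
                group["b"].append(b_name.strip())
        # Any other line inside a group (e.g. score summaries) is ignored.
    return groups


# Invented sample in the assumed layout, not real InParanoid output.
SAMPLE = """\
Some run summary that this sketch ignores
___________________________________
Group of orthologs #1. Best score 500 bits
Score difference with first non-orthologous sequence - A:500 B:480
Bootstrap support for a1 as seed ortholog is 100%.
Bootstrap support for b1 as seed ortholog is 98%.
a1\t100.00%\tb1\t100.00%
\t\tb2\t45.10%
___________________________________
Group of orthologs #2. Best score 300 bits
a7\t100.00%\tb9\t100.00%
"""

groups = parse_groups(SAMPLE.splitlines())
for g in groups:
    print(g["id"], g["a"], g["b"], g["bootstrap"])
```

Keeping the A and B members in separate lists is what preserves the information the tab-separated outputs lose, namely which input set each protein came from.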
You can play with my InParanoid parser by cloning the corresponding Gist, then running this command:
Doing so will produce the following output:
Output.PRJEB506.munged.fa-PRJNA205202.munged.fa: 1_to_1=4494 1_to_n=387 n_to_1=1951 n_to_n=176
To understand the output, you must first know that InParanoid was run on two protein sets, which we will refer to as A and B. It determined orthologous relationships between the two sets. For example, if proteins A.γ and A.δ are placed in an orthologous group with protein B.λ, this indicates that all three are derived from a gene that was present in A and B’s last common ancestor, and that A.γ and A.δ are inparalogues (meaning that one was duplicated from the other after A and B diverged).
With the above established, we can state the following:
1_to_1 refers to orthologous groups composed of a single protein from each of A and B.
1_to_n refers to orthologous groups composed of one protein from A and multiple from B.
n_to_1 refers to orthologous groups composed of multiple proteins from A and one from B.
n_to_n refers to orthologous groups composed of multiple proteins from each of A and B.
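The four categories above reduce to asking whether each side of a group holds one protein or more than one. The following Python sketch shows that tallying step under the assumption that each group is already available as a pair of lists (proteins from A, proteins from B). The function names `classify` and `tally` and the protein names are hypothetical and do not come from the Gist.

```python
from collections import Counter

CATEGORIES = ("1_to_1", "1_to_n", "n_to_1", "n_to_n")


def classify(n_a, n_b):
    """Name the category for a group with n_a proteins from A and n_b from B."""
    if n_a < 1 or n_b < 1:
        raise ValueError("an orthologous group needs a protein from each set")
    left = "1" if n_a == 1 else "n"
    right = "1" if n_b == 1 else "n"
    return f"{left}_to_{right}"


def tally(groups):
    """Count groups per category and format them like the summary line."""
    counts = Counter(classify(len(a), len(b)) for a, b in groups)
    return " ".join(f"{name}={counts[name]}" for name in CATEGORIES)


# Invented example groups: (proteins from A, proteins from B).
example_groups = [
    (["A.alpha"], ["B.kappa"]),                  # 1_to_1
    (["A.gamma", "A.delta"], ["B.lambda"]),      # n_to_1
    (["A.beta"], ["B.mu", "B.nu"]),              # 1_to_n
    (["A.eps", "A.zeta"], ["B.xi", "B.pi"]),     # n_to_n
]

summary = tally(example_groups)
print(summary)  # prints "1_to_1=1 1_to_n=1 n_to_1=1 n_to_n=1"
```

The second example group mirrors the A.γ, A.δ, and B.λ case described earlier: two inparalogues from A paired with one protein from B, so it counts towards n_to_1.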
This InParanoid parser can be extended without undue difficulty to recover other information from the InParanoid results, such as data provided in the header regarding the number of orthologous groups. For the moment, however, I am happy simply to recover what data I have. Hooray!