parsing the xml file using pandas

I'm trying to extract information from this XML:
<bdb:getTargetByCompoundResponse xmlns:bdb="http://ws.bindingdb.org/xsd">
<bdb:smile>C[C#H]1[C#H](C)CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C#](C)(C(=O)O)[C##H]5CC[C#]43C)[C#H]12</bdb:smile>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1 AuxInfo=1/1/N:4,1,8,19,12,33,25,5,30,21,6,20,31,9,10,14,3,2,13,15,29,22,34,17,26,7,18,11,32,24,16,23,27,28/E:(33,34)/it:im/rA:34CC.oC.eCCCC.oCCCC.oCCCCOC.eC.eCCCC.oOC.oCCOOC.eCCC.eCC.e/rB:s1;s2;s3;s3;s5;s6;s7;s7;s9;s10;s11;s11;d-13;s14;d15;s15;s17;s18;s18;s20;s21;s22;s22;s24;s24;d26;s26;s18s24;s29;s30;s11s17s31;s32;s2s7s13;/rC:2.8737,-5.8026,0;2.1037,-7.1363,0;.5637,-7.1363,0;-.2063,-5.8026,0;-.2063,-8.47,0;.5637,-9.8037,0;2.1037,-9.8037,0;1.3337,-11.1374,0;2.8737,-11.1374,0;4.4137,-11.1374,0;5.1837,-9.8037,0;5.9537,-11.1374,0;4.4137,-8.47,0;5.1837,-7.1363,0;6.7237,-7.1363,0;7.4937,-5.8026,0;7.4937,-8.47,0;9.0337,-8.47,0;8.2637,-7.1363,0;9.8037,-7.1363,0;11.3437,-7.1363,0;12.1137,-8.47,0;13.6537,-8.47,0;11.3437,-9.8037,0;11.0763,-11.3203,0;12.7908,-10.3304,0;13.9705,-9.3405,0;13.0582,-11.847,0;9.8037,-9.8037,0;9.0337,-11.1374,0;7.4937,-11.1374,0;6.7237,-9.8037,0;6.8847,-11.3352,0;2.8737,-8.47,0;</bdb:inchi>
<bdb:hit>7</bdb:hit>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Polyunsaturated fatty acid 5-lipoxygenase</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>3000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prolyl endopeptidase</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>36320</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prostaglandin E synthase</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>3000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prostaglandin G/H synthase 1</bdb:target>
<bdb:species>Ovis aries (Sheep)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>>40000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prostaglandin G/H synthase 2</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>>40000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Tyrosine-protein phosphatase non-receptor type 1</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>8040</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Tyrosine-protein phosphatase non-receptor type 2</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>9450</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
</bdb:getTargetByCompoundResponse>
But I'm getting the following error:
xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.
I tried this code:
smile = 'C[C#H]1[C#H](C)CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C#](C)(C(=O)O)[C##H]5CC[C#]43C)[C#H]12'
api_binding = requests.get(f'https://bindingdb.org/axis2/services/BDBService/getTargetByCompound?smiles={smile}&cutoff=1')
df = pd.read_xml(api_binding.text, xpath = ".//bdb", namespaces = {"bdb":"https://ws.bindingdb.org/xsd"})
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]

You can build the DataFrame from all the elements and then filter the result. For example:
df = pd.read_xml(str_xml, xpath = ".//bdb:*", namespaces = {"bdb":"http://ws.bindingdb.org/xsd"})
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]
result
result would be:
3 Polyunsaturated fatty acid 5-lipoxygenase
7 None
13 Prolyl endopeptidase
17 None

If you remove xpath, you'll read it in almost as expected. We know from the output that there should be 7 rows (i.e. <bdb:hit> shows 7). The first three tags seem to contain only some metadata about what it returned:
<bdb:smile>
<bdb:inchi>
<bdb:hit>
These make up rows 0-2 of the dataframe.
Your actual data starts on row 3, or wherever we see the <bdb:affinities> tag. Based on this pattern, we can keep only the data where target is not missing. You can choose any other column where you know there will always be data, but in this case, we're going to stick with target.
df = pd.read_xml(api_binding.text, namespaces = {"bdb":"http://ws.bindingdb.org/xsd"}).dropna(subset=['target'])
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]
Output:
3 Polyunsaturated fatty acid 5-lipoxygenase
4 Prolyl endopeptidase
5 Prostaglandin E synthase
7 Prostaglandin G/H synthase 2
8 Tyrosine-protein phosphatase non-receptor type 1
9 Tyrosine-protein phosphatase non-receptor type 2

There are a couple things going on here that I think the other answers missed, so I will try to give a complete overview and explanation.
In the sample code you've tried, you are querying xpath=".//bdb", which will look for any element beneath the root node with a tag of "bdb". There are no such elements (whose tag is just bdb), so you get ValueError: xpath does not return any nodes. simpleApp's answer correctly identifies this issue, and suggests adding an asterisk (i.e. xpath=".//bdb:*") to tell xpath to look for any elements whose tag starts with bdb:. However, this returns all of the various nested elements in the XML, since in your case everything is prefixed with bdb:, so we lose the structural information contained in the hierarchy of the XML content, leading to a mess of a DataFrame.
Stu Sztukowski's answer suggests dropping the xpath kwarg completely, which will make pandas more or less do what you want, except then you get unwanted rows resulting from the <bdb:smile/>, <bdb:inchi/>, and <bdb:hit/> elements at the top that you have to clean up yourself after you create the DataFrame.
Also, you need to use http in your namespaces dictionary, not https, since the namespace mapping for bdb (shown in first line/root of your XML content) is http://ws.bindingdb.org/xsd, with no s.
A clean solution that will get you exactly the DataFrame that you want is
df = pd.read_xml(api_binding.text, xpath="./bdb:affinities",
                 namespaces={"bdb": "http://ws.bindingdb.org/xsd"})
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]
You can use either xpath = "./bdb:affinities" or xpath = ".//bdb:affinities" (single or double slash) since in this case all of the bdb:affinities elements are right below the root anyway, so the two syntaxes do the same thing.
This solution yields a df that looks like:
monomerid inhibitor target species affinity_type affinity smiles inchi tanimoto
0 50241261 BDBM50241261 Polyunsaturated fatty acid 5-lipoxygenase Homo sapiens (Human) IC50 3000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
1 50241261 BDBM50241261 Prolyl endopeptidase Homo sapiens (Human) IC50 36320 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
2 50241261 BDBM50241261 Prostaglandin E synthase Homo sapiens (Human) IC50 3000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
3 50241261 BDBM50241261 Prostaglandin G/H synthase 1 Ovis aries (Sheep) IC50 >40000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
4 50241261 BDBM50241261 Prostaglandin G/H synthase 2 Homo sapiens (Human) IC50 >40000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
5 50241261 BDBM50241261 Tyrosine-protein phosphatase non-receptor type 1 Homo sapiens (Human) IC50 8040 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
6 50241261 BDBM50241261 Tyrosine-protein phosphatase non-receptor type 2 Homo sapiens (Human) IC50 9450 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
And a result that looks like:
0 Polyunsaturated fatty acid 5-lipoxygenase
1 Prolyl endopeptidase
2 Prostaglandin E synthase
4 Prostaglandin G/H synthase 2
5 Tyrosine-protein phosphatase non-receptor type 1
6 Tyrosine-protein phosphatase non-receptor type 2
Name: target, dtype: object
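The http vs. https point can be checked without pandas or a network call. The following is only a sketch using the standard library's xml.etree.ElementTree instead of pd.read_xml; the trimmed two-row XML and the helper name human_targets are made up for illustration. The prefix bdb is just a label, and what has to match is the namespace URI it maps to:

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical stand-in for the BindingDB response (two rows only).
XML = """<bdb:getTargetByCompoundResponse xmlns:bdb="http://ws.bindingdb.org/xsd">
  <bdb:hit>2</bdb:hit>
  <bdb:affinities>
    <bdb:target>Prolyl endopeptidase</bdb:target>
    <bdb:species>Homo sapiens (Human)</bdb:species>
  </bdb:affinities>
  <bdb:affinities>
    <bdb:target>Prostaglandin G/H synthase 1</bdb:target>
    <bdb:species>Ovis aries (Sheep)</bdb:species>
  </bdb:affinities>
</bdb:getTargetByCompoundResponse>"""

root = ET.fromstring(XML)


def human_targets(namespace_uri):
    """Return the targets of human rows, mapping the bdb prefix to namespace_uri."""
    ns = {"bdb": namespace_uri}
    # Row-level nodes: the <bdb:affinities> elements directly below the root.
    rows = root.findall("./bdb:affinities", ns)
    return [
        row.findtext("bdb:target", namespaces=ns)
        for row in rows
        if row.findtext("bdb:species", namespaces=ns) == "Homo sapiens (Human)"
    ]


good = human_targets("http://ws.bindingdb.org/xsd")   # URI as declared in the XML
bad = human_targets("https://ws.bindingdb.org/xsd")   # one extra "s": nothing matches
print(good)
print(bad)
```

With the https URI the query finds no row-level nodes at all, which is the same situation pandas reports as "xpath does not return any nodes".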

Related

Retrieve many nested statements at the same time for a single object with lxml Python

I am working with a big XML file from which I am retrieving many different properties, and now I am trying to retrieve the comment category property and connect it to the text between the tags. However, there are 3 different situations that I need to handle. XML example:
<comment-list>
<comment category="Derived from sampling site"> Peripheral blood </comment>
<comment category="Transformant">
<cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
</comment>
<comment category="Sequence variation"> Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
</comment>
<comment category="Monoclonal antibody target">
<xref-list>
<xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
<property-list>
<property name="gene/protein designation" value="Human BEND3"/>
</property-list>
<url><![CDATA[https://www.uniprot.org/uniprot/Q5T5X7]]></url>
</xref>
</xref-list>
</comment>
</comment-list>
1. When <comment> has no child tags, I need to retrieve the comment category property and connect it with the text between the tags.
2. When <comment> has a <cv-term> tag nested underneath, I need to retrieve the comment category, the cv-term terminology, the cv-term accession, and the text between the cv-term tags.
3. When <comment> has several tags nested underneath (<xref-list>, <xref>, <property-list>, <property>, <url>), I need to retrieve the comment category, the xref database property, the xref accession property, and the property value property.
I am using lxml to parse this XML, and I am struggling to wrap my head around how to solve case 2. Case 1 and 3 work but when an object has all three cases then the output gets messed up.
I would like to receive following output:
Derived from sampling site: Peripheral blood
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation: Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
Here is my very messy code, which outputs the elements in the wrong order. It worked fine for cases 1 and 3, but when case 2 comes into play the output is ordered wrong:
comment_cat = att.xpath('.//comment-list/comment/@category')
comment_text = att.xpath('.//comment-list/comment/text()')
cv_term = att.xpath('.//comment-list/comment/cv-term/text()')
xref = [a + ', ' + b for a, b in zip(att.xpath('.//comment-list/comment/xref-list/xref/@database'),
                                     att.xpath('.//comment-list/comment/xref-list/xref/@accession'))]
property_list = att.xpath('.//comment-list/comment/xref-list/xref/property-list/property/@value')
xref_property_list = [a + ', ' + b for a, b in zip(xref, property_list)]
empty_str_in_text = ['\n ', '\n ', '\n ', '\n ']
comment_texts_all = cv_term + comment_text + xref_property_list
for e in empty_str_in_text:
    if e in comment_texts_all:
        comment_texts_all.remove(e)
key_values['Comments'] = ';; '.join([i + ': ' + j for i, j in zip(comment_cat, comment_texts_all)])
Output:
Derived from sampling site: Epstein-Barr virus (EBV);;
Transformant: Peripheral blood ;;
Sequence variation: Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194) ;;
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3

Here is a slightly alternative approach:
xml = '''<comment-list>
<comment category="Derived from sampling site"> Peripheral blood </comment>
<comment category="Transformant">
<cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
</comment>
<comment category="Sequence variation"> Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)</comment>
<comment category="Monoclonal antibody target">
<xref-list>
<xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
<property-list>
<property name="gene/protein designation" value="Human BEND3"/>
</property-list>
<url><![CDATA[https://www.uniprot.org/uniprot/Q5T5X7]]></url>
</xref>
</xref-list>
</comment>
<comment category="Knockout cell">
<method>KO mouse</method>
<xref-list>
<xref database="MGI" category="Organism-specific " accession="MGI:97740">
<property-list>
<property name="gene/protein designation" value="Polb"/>
</property-list>
<url><![CDATA[http://www.informatics.jax.org//MGI:97740]]></url>
</xref>
</xref-list>
</comment>
</comment-list>'''
from lxml import etree as ET

tree = ET.fromstring(xml)
result = ''
for comment in tree.iter('comment'):
    result += f"{comment.get('category')}: "
    cv_term = comment.find('cv-term')
    xref_list = comment.find('xref-list')
    method = comment.find('method')
    if len(list(comment)) == 0:
        # Case 1: no child elements, so use the comment's own text
        result += comment.text.strip()
    elif cv_term is not None:
        # Case 2: a nested <cv-term>
        result += ', '.join([cv_term.get('terminology'), cv_term.get('accession'), cv_term.text])
    elif xref_list is not None and method is None:
        # Case 3: an <xref-list> without a <method>
        result += ', '.join([xref_list.xpath('./xref/@database')[0],
                             xref_list.xpath('./xref/@accession')[0],
                             xref_list.xpath('./xref/property-list/property/@value')[0]])
    elif method is not None:
        # Extra case: a <method> element, whose text is used
        result += method.text
    result += '\n'
print(result)
Output:
Derived from sampling site: Peripheral blood
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation: Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
Knockout cell: KO mouse

Remove duplicated sequences according to their ID names (bash or python)

I'm looking for a Python or bash script to remove duplicated sequences by their ID.
Here is the input file:
>AJZ73152.1 hypothetical protein [Venturia canescens]
MAGLSTDKTDKTTVLLQYEVSHENYLARCIPGTRLHAKIHGSLPVLASSILTHNLDVKRADVFYLSGSSD
GSYYCDLPVSPQASKRVGDQTRETLRAQC
>AJZ73158.1 hypothetical protein [Venturia canescens]
MAGLSTDKTDKTTVLLQYEVSHENYLARCIPGTRLHAKIHGSLPVLASSILTHNLDVKRADVFYLSGSSD
GSYYCDLPVSPQASKRVGDQTRETLRAQC
>AKH40348.1 putative gp75-like protein [Kallithea virus]
MDQTDLLYTPQFEDYILEFCRAVSTDTTITAISPIIEVLKQSEYLRYLMKDPSNDSAKTCVRNFIVSKSH
LPQDFLYKFLAIVTMKISLAPSNVGFIHQSYNAKVIANNLQPTSRITNLTIAARQDQLRAESKNAITYVK
QTRMPPQILRMKFNDDLLPRCINAIGDLNQVIIEGNRSNGRDVGDFVRTVLK
>AKH40367.1 putative gp75-like protein [Kallithea virus]
MDQTDLLYTPQFEDYILEFCRAVSTDTTITAISPIIEVLKQSEYLRYLMKDPSNDSAKTCVRNFIVSKSH
LPQDFLYKFLAIVTMKISLAPSNVGFIHQSYNAKVIANNLQPTSRITNLTIAARQDQLRAESKNAITYVK
QTRMPPQILRMKFNDDLLPRCINAIGDLNQVIIEGNRSNGRDVGDFVRTVLK
>AZH40350.1 putative gp75-like protein [Kallithea virus]
MDQTDLLYTPQFEDYILEFCRAVSTDTTITAISPIIEVLKQSEYLRYLMKDPSNDSAKTCVRNFIVSKSH
LPQDFLYKFLAIVTMKISLAPSNVGFIHQSYNAKVIANNLQPTSRITNLTIAARQDQLRAESKNAITYVK
QTRMPPQILRMKFNDDLLPRCINAIGDLNQVIIEGNRSNGRDVGDFVRTVLK
>AKH40359.1 putative lef-4 [Kallithea virus]
MEEDNDNIPSTSLKVLNLLNIQVDTQQQQTVISNVANEHVANEYVSSEHVAIANEHVDNVNQSTTNAEFV
QKMPQTEVSMPTPTNPIYDEWESTIAIPITEEQYNIYKQKSHKSDVIFLFKNGTRLSCRTMQKKTTTYCR
NLISFYRNHWYPIRRTTAVESIEQLPPLYACDKVIFRLVVYHQNNIRISYNMEECAQGVKYNVEYEIEYK
RGISYREILIYERRLIRTVLQDNYEIKRQILSLVDLFSYVMTKVQMWHCFDPNKDYIWAYKWNGIKAKFL
ITDKLSDNGSNLTYIWPDANNITIEECHGNNISALVNFCFLVEIMDDCIVLIEAIGASIDQDIYTTEPAT
NSYVLKYLKDQNTSLKVGNKPVIIQEYYPPPLPNSYNREKFDGMIIVQDDMIIKWKIPTIDVKCIAPFKY
KIADDVLDFDFEGIPGKIYEISYKNEILRQRNDRIVASSPQEYAIFLESAKHLQ
>AKH40361.1 putative gp93-like protein [Kallithea virus]
MDLTLEHVTSWSCHLHSKETCVMKYYNGSYYHVIPVKNISVLAQTYNSQKIPDEFWEDLGPTPYMTAIYY
SDCVANVDMFRIILELFRNLDDSFLKFSTSNTPSDFIKRHIITDGIKRITLCNKHLLKSCKTKSNRPQTF
YTKDQWIKAILKGLFPKIDSSDKSGIPTNTPDWAIKLYPRGATSISAVTTPSSQMTHLAN
>AKH40366.1 putative gp93-like protein [Kallithea virus]
MDLTLEHVTSWSCHLHSKETCVMKYYNGSYYHVIPVKNISVLAQTYNSQKIPDEFWEDLGPTPYMTAIYY
SDCVANVDMFRIILELFRNLDDSFLKFSTSNTPSDFIKRHIITDGIKRITLCNKHLLKSCKTKSNRPQTF
YTKDQWIKAILKGLFPKIDSSDKSGIPTNTPDWAIKLYPRGATSISAVTTPSSQMTHLAN
>AKH40367.1 putative gp19-like protein [Kallithea virus]
MGVIRMTWNILSILITVIFVIALIWFVLYPTPIKYVLQCFVPKTEYEPNANYTTVKNYILYTNAKSNHTK
LIVIIPGGAGLLNSIANIYGFMNKLNETLGDDYDILTFSYPVRFKHTIRDSMLRVNEVLSDFTHYEEIHG
IGLSFGSLLLGAFNNKESNILSSQQMQVPQIGIKFKTFTGICGMYQPFFNVKLLTWLFDFYIMRGTPGIK
LYSCYGMPIPKLIITSNSDFLVSQSTKFLQSENAESLSYPTANLPHTFPQYINLPEAQQSIVKIVDFIKQ
NSN
>AKH40369.1 putative gp83-like protein [Kallithea virus]
MSESKLQHLHPEIINYYKSIKANGLKSPKMENNEEFITTLDRVEDDFKIPFISTYVLINNAYRHELSSNR
AKSIKQNIHAIREAKDVKIRTEVTAKVNKFEFIPSHFYTCSSKAIKVAVALFLRPAYTETLKRDFIFSLL
NHHSKTHTVSDVIDLCQKTIGDVRAFIKTVGNLNTTEKQRKQLICGLIECSELLRDRLCSKLAISVSLNG
YISLISLYLKHGHLKNVIPFEPLINLYVKESIAKCTQEEERVKILNQFKVDPVATIDDVIKGLPPAPNKV
SNSSTKSCVFKPDQNYQYYKGAPNYTRDIITTYHIEHGRRYRIQTYNDCLYDVLGYTLEAPNFLEATHSP
TTNGISAIEHEIYDRMSWSDRLNLIRFRTKIRIEDAKGSELNDYHGNSTDITISWFDDNEISCSKTISLK
KSDNKK
So the basic idea is that I want to remove a sequence if its second name is duplicated.
For instance:
>AJZ73152.1 hypothetical protein [Venturia canescens]
MAGLSTDKTDKTTVLLQYEVSHENYLARCIPGTRLHAKIHGSLPVLASSILTHNLDVKRADVFYLSGSSD
GSYYCDLPVSPQASKRVGDQTRETLRAQC
First name= >AJZ73152.1
Second name= hypothetical protein [Venturia canescens]
So for the above example:
>AJZ73152.1 and >AJZ73158.1 both have the same second name "hypothetical protein [Venturia canescens]"
So I keep only one of them
>AKH40348.1, >AKH40367.1 and >AZH40350.1 all have the same second name "putative gp75-like protein [Kallithea virus]"
So I keep only one of them
>AKH40359.1, >AKH40367.1 and >AKH40369.1 have no duplicate second name, so I keep them all.
>AKH40366.1 and >AKH40361.1 both have the same second name "putative gp93-like protein [Kallithea virus]"
So I keep only one of them
Here is what I should get as output:
>AJZ73152.1 hypothetical protein [Venturia canescens]
MAGLSTDKTDKTTVLLQYEVSHENYLARCIPGTRLHAKIHGSLPVLASSILTHNLDVKRADVFYLSGSSD
GSYYCDLPVSPQASKRVGDQTRETLRAQC
>AKH40348.1 putative gp75-like protein [Kallithea virus]
MDQTDLLYTPQFEDYILEFCRAVSTDTTITAISPIIEVLKQSEYLRYLMKDPSNDSAKTCVRNFIVSKSH
LPQDFLYKFLAIVTMKISLAPSNVGFIHQSYNAKVIANNLQPTSRITNLTIAARQDQLRAESKNAITYVK
QTRMPPQILRMKFNDDLLPRCINAIGDLNQVIIEGNRSNGRDVGDFVRTVLK
>AKH40359.1 putative lef-4 [Kallithea virus]
MEEDNDNIPSTSLKVLNLLNIQVDTQQQQTVISNVANEHVANEYVSSEHVAIANEHVDNVNQSTTNAEFV
QKMPQTEVSMPTPTNPIYDEWESTIAIPITEEQYNIYKQKSHKSDVIFLFKNGTRLSCRTMQKKTTTYCR
NLISFYRNHWYPIRRTTAVESIEQLPPLYACDKVIFRLVVYHQNNIRISYNMEECAQGVKYNVEYEIEYK
RGISYREILIYERRLIRTVLQDNYEIKRQILSLVDLFSYVMTKVQMWHCFDPNKDYIWAYKWNGIKAKFL
ITDKLSDNGSNLTYIWPDANNITIEECHGNNISALVNFCFLVEIMDDCIVLIEAIGASIDQDIYTTEPAT
NSYVLKYLKDQNTSLKVGNKPVIIQEYYPPPLPNSYNREKFDGMIIVQDDMIIKWKIPTIDVKCIAPFKY
KIADDVLDFDFEGIPGKIYEISYKNEILRQRNDRIVASSPQEYAIFLESAKHLQ
>AKH40361.1 putative gp93-like protein [Kallithea virus]
MDLTLEHVTSWSCHLHSKETCVMKYYNGSYYHVIPVKNISVLAQTYNSQKIPDEFWEDLGPTPYMTAIYY
SDCVANVDMFRIILELFRNLDDSFLKFSTSNTPSDFIKRHIITDGIKRITLCNKHLLKSCKTKSNRPQTF
YTKDQWIKAILKGLFPKIDSSDKSGIPTNTPDWAIKLYPRGATSISAVTTPSSQMTHLAN
>AKH40367.1 putative gp19-like protein [Kallithea virus]
MGVIRMTWNILSILITVIFVIALIWFVLYPTPIKYVLQCFVPKTEYEPNANYTTVKNYILYTNAKSNHTK
LIVIIPGGAGLLNSIANIYGFMNKLNETLGDDYDILTFSYPVRFKHTIRDSMLRVNEVLSDFTHYEEIHG
IGLSFGSLLLGAFNNKESNILSSQQMQVPQIGIKFKTFTGICGMYQPFFNVKLLTWLFDFYIMRGTPGIK
LYSCYGMPIPKLIITSNSDFLVSQSTKFLQSENAESLSYPTANLPHTFPQYINLPEAQQSIVKIVDFIKQ
NSN
>AKH40369.1 putative gp83-like protein [Kallithea virus]
MSESKLQHLHPEIINYYKSIKANGLKSPKMENNEEFITTLDRVEDDFKIPFISTYVLINNAYRHELSSNR
AKSIKQNIHAIREAKDVKIRTEVTAKVNKFEFIPSHFYTCSSKAIKVAVALFLRPAYTETLKRDFIFSLL
NHHSKTHTVSDVIDLCQKTIGDVRAFIKTVGNLNTTEKQRKQLICGLIECSELLRDRLCSKLAISVSLNG
YISLISLYLKHGHLKNVIPFEPLINLYVKESIAKCTQEEERVKILNQFKVDPVATIDDVIKGLPPAPNKV
SNSSTKSCVFKPDQNYQYYKGAPNYTRDIITTYHIEHGRRYRIQTYNDCLYDVLGYTLEAPNFLEATHSP
TTNGISAIEHEIYDRMSWSDRLNLIRFRTKIRIEDAKGSELNDYHGNSTDITISWFDDNEISCSKTISLK
KSDNKK

This code should do what you need, assuming the records in input.txt are separated by blank lines (RS="" puts awk in paragraph mode). Please let me know if it works for you:
awk 'BEGIN{RS="";FS="\n";}
{
    split($1,descs," ");
    for(i=2; i <= length(descs); i++){
        second_names[NR] = second_names[NR] " " descs[i]
    }
    already_seen = 0
    for(c=1; c <= length(seen_names); c++ ) {
        if (second_names[NR] == seen_names[c]) {
            already_seen = 1
        }
    }
    if(already_seen == 0) {
        print $0
        print "\n"
        seen_names[length(seen_names) + 1] = second_names[NR];
    }
}
' input.txt
Regards!

Simple method:
If you don't care about ordering, you can simply place all of the records in a dictionary keyed by the ID (second name). That way duplicates are impossible, but for every duplicated name only the last entry is kept, and on Python versions before 3.7 the output order is not guaranteed. This is as simple as parsing each record into 3 parts and adding them to the dictionary.
More complex method:
Create an empty list, go over the records one by one, and check for each whether it needs to be added to the list. This preserves the input order and lets you decide which duplicate to keep.
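A minimal Python sketch of the second method, keeping the first record seen for each second name. The function name dedupe_fasta and the tiny sample are made up for illustration, and it assumes the second name is everything after the first space on the header line:

```python
def dedupe_fasta(lines):
    """Keep only the first record for each description (header minus the ID)."""
    seen = set()
    keep = False
    kept_lines = []
    for line in lines:
        if line.startswith(">"):
            # Everything after the first space is the "second name".
            description = line.rstrip("\n").partition(" ")[2]
            keep = description not in seen
            seen.add(description)
        if keep:
            # Header and sequence lines of a kept record are copied unchanged.
            kept_lines.append(line)
    return kept_lines


# Hypothetical miniature input with one duplicated description.
sample = [
    ">A1 hypothetical protein [X]\n",
    "MAG\n",
    "GSY\n",
    ">A2 hypothetical protein [X]\n",
    "MAG\n",
    ">B1 putative lef-4 [Y]\n",
    "MEE\n",
]
result = dedupe_fasta(sample)
print("".join(result), end="")
```

For a real file, the same function can be fed an open file object, for example dedupe_fasta(open("input.txt")), since it only iterates over lines. Keeping the last duplicate instead of the first would need a dictionary keyed by description, as in the simple method.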

extracting strings using regular expression

I have the following strings:
LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]
How to extract the below strings from the above strings?
cysteine proteinase 5-like
uncharacterized protein LOC107059219
peroxidase 40-like
Retrovirus-related Pol polyprotein from transposon TNT 1-94
hypothetical protein VITISV_035070

I think this problem doesn't need a regex. I would prefer the following solution because it is easy to understand:
st = "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]"
st.split(":")[-1].split("[")[0].strip()
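Note that the split-based approach leaves the accession in place for a string like "RVW92024.1 Retrovirus-related ..." because it contains no colon. If that matters, a regex can cover all five examples. This is only a sketch: the accession pattern (one to three capital letters, an optional underscore, digits, a dot, a version number) is an assumption based on the two accessions shown in the question, and extract_name is a made-up helper name.

```python
import re

# Optional accession, then any number of upper-case "LABEL:" prefixes,
# then the name, then the trailing "[species]" part.
PATTERN = re.compile(
    r"^(?:[A-Z]{1,3}_?\d+\.\d+\s+)?"   # optional leading accession (assumed format)
    r"(?:[A-Z ]+:\s*)*"                # "PREDICTED:", "LOW QUALITY PROTEIN:", ...
    r"(.+?)\s*\[[^\]]*\]\s*$"          # the name, followed by the bracketed species
)


def extract_name(text):
    """Return the protein name, or None if the string does not fit the pattern."""
    match = PATTERN.match(text)
    return match.group(1) if match else None


strings = [
    "LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]",
    "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]",
    "XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]",
    "RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]",
    "hypothetical protein VITISV_035070 [Vitis vinifera]",
]
names = [extract_name(s) for s in strings]
for name in names:
    print(name)
```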

splitting strings after certain characters

I would like to split my string after certain characters are found.
identifier = filecontent_id[0].split("SV=")[0]
I have this, but this "deletes" everything before "SV=" and I would like for it to "delete" everything 1 character after it. For example, it would "delete" everything after "SV=1" but I did not put 1 there because it doesn't always equal 1. The string is:
>tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ
and I am trying to only get:
>tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1

A regex might be better, but the below works
SPLIT="SV="
line=">tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ"
print(line.split(SPLIT)[0] + SPLIT + line.split(SPLIT)[1][0])
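The split version keeps exactly one character after "SV=", so it would cut a two-digit version such as SV=12 short. A regex sketch that keeps all digits following "SV=" could look like this; it assumes the version is made of digits only and that the sequence after it starts with a letter:

```python
import re

line = (">tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b "
        "OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ")

# Lazily match up to the first "SV=", then take every digit that follows it.
match = re.match(r".*?SV=\d+", line)
# Fall back to the whole line if there is no "SV=<digits>" in it.
identifier = match.group(0) if match else line
print(identifier)
```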

How to write multiple lines in a single line?

How can I write multiple lines in a single line? My inputs are like this:
HOXC11
HOXC11, HOX3H, MGC4906
human, Homo sapiens
HOXB6
HOXB6, HOX2, HU-2, HOX2B, Hox-2.2
human, Homo sapiens
HOXB13
HOXB13
human, Homo sapiens
PAX5
PAX5, BSAP
human, Homo sapiens
I need to make it into a single line like this:
HOXC11 HOXC11, HOX3H, MGC4906 human, Homo sapiens
HOXB6 HOXB6, HOX2, HU-2, HOX2B, Hox-2.2 human, Homo sapiens
HOXB13 HOXB13 human, Homo sapiens

Assuming your input is from a file, let's call it homosapiens.txt, you can go from the specified input to the desired output as follows:
with open('homosapiens.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the trailing newline so the comparison below can match
        if line == 'human, Homo sapiens':
            print(line)  # last field of a record: print it and go to a new line
        elif line:
            print(line, end=' ')  # other non-empty fields stay on the same output line

textInput = textInput.rstrip('\n')
