Delete everything in a file after the last appearance of a string - Python

I want to write a program that looks through files, finds every incomplete file (one without </Module> at the end), prints the last abnumber found in that file, and deletes every line after it (including the line with that abnumber).
My file looks like this:
<Module bs="Mainfile_1">
<object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object>
<object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object>
<object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
<object id="1238" name="name2" abnumber="4">
<item name="item8" value="something12:
<item name="item9" value="233" />
and at the end it should look like this:
<Module bs="Mainfile_1">
<object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object>
<object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object>
<object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
with printed: 4
I started by doing something like that but I feel like I am doing everything wrong:
import os
Mainfile = 'path'
for filename in os.listdir(Mainfile):
    lines = filename.readlines()
    if not "</Module>" in lines:
        with open(filename, 'r+', encoding="utf-8") as file:
            line_list = list(file)
            line_list.reverse()
            for line in line_list:
                if line.find('absno') != -1:
                    print(line)

You can use the re module to get your result:
<object([\s\S]*?)<\/object> matches each complete <object ...> ... </object> block
abnumber=\"([0-9.]+) captures the abnumber of the incomplete tag
<Module.*|<object(?:[\s\S]*?)<\/object> matches the parts that make up the correctly formed XML data
import re
data = """<Module bs="Mainfile_1">
<object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object>
<object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object>
<object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
<object id="1238" name="name2" abnumber="4">
<item name="item8" value="something12:
<item name="item9" value="233" />"""
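# Remove every complete <object>...</object> block; what remains holds the incomplete tag.
invalid_XML_Tag = re.sub(r"<object([\s\S]*?)<\/object>", '', data)
abnnumber_value = re.findall(r'abnumber="([0-9.]+)', invalid_XML_Tag)
print("abnumber of invalid tag => {0}".format(abnnumber_value))
# Keep only the <Module ...> line and the complete <object> blocks.
correct_xml_format = re.findall(r"<Module.*|<object(?:[\s\S]*?)<\/object>", data)
print("".join(correct_xml_format))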
Output:
abnumber of invalid tag => ['4']
<Module bs="Mainfile_1"><object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object><object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object><object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
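The regex answer above works on a string held in memory. A minimal sketch of applying the same idea to files on disk might look like the following. The function names truncate_incomplete and fix_file are hypothetical, and the sketch assumes that a file is incomplete exactly when it lacks </Module> and that the unfinished block always starts with <object.

```python
import re


def truncate_incomplete(text):
    """Return (cleaned_text, dropped_abnumber) for the contents of one file.

    Hypothetical helper: a file counts as complete if it contains </Module>.
    """
    if "</Module>" in text:
        return text, None  # complete file, nothing to cut

    # Everything up to the last closing </object> is known to be complete.
    last_close = text.rfind("</object>")
    keep_until = last_close + len("</object>") if last_close != -1 else 0

    # The unfinished block is the first <object that opens after that point.
    tail = text[keep_until:]
    start = re.search(r"<object\b", tail)
    if start is None:
        return text, None  # no unfinished <object> block found

    number = re.search(r'abnumber="([^"]*)"', tail[start.start():])
    cleaned = text[:keep_until + start.start()].rstrip() + "\n"
    return cleaned, number.group(1) if number else None


def fix_file(path):
    """Print the dropped abnumber and rewrite the file without the broken block."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    cleaned, number = truncate_incomplete(text)
    if number is not None:  # files without a recoverable abnumber are left untouched
        print(number)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(cleaned)
```

To process a directory, fix_file could be called with os.path.join(Mainfile, filename) for each name returned by os.listdir(Mainfile), since os.listdir returns bare file names rather than open file objects.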

Related

How can I get an attribute value

I use BS4 to parse an .xml file. I want to get the value of the res attribute, but I get None.
How can I do it?
source xml
`<digitizer id="1" integrated="true" csrmusttouch="falsehardprox="true"
physidcsrs="false" pnpid="49154" kind="MULTI_TOUCH" maxcsrs="10">
<monitor left="0" top="0" right="1920" bottom="1080" />`
<properties>
<property name="x" logmin="0" logmax="16383" res="621.7457275" unit="cm" hidusage="0x00010030" guid="{598A6A8F-52C0-4BA0-93AF-AF357411A561}" />
<property name="y" logmin="0" logmax="16383" res="983.9639893" unit="cm" hidusage="0x00010031" guid="{B53F9F75-04E0-4498-A7EE-C30DBB5A9011}" />
<property name="status" logmin="0" logmax="15" res="0" unit="DEFAULT" hidusage="0x000d0042, 0x000d003c, 0x000d0044" guid="{6E0E07BF-AFE7-4CF7-87D1-AF6446208418}" />
<property name="time" logmin="0" logmax="2147483647" res="1" unit="DEFAULT" guid="{436510C5-FED3-45D1-8B76-71D3EA7A829D}" />
<property name="contactid" logmin="0" logmax="31" res="1.861861944" unit="cm" hidusage="0x000d0051" guid="{02585B91-049B-4750-9615-DF8948AB3C9C}" />`
Python Code
a = data_xml.find('digitizer',id="1")
b = a.find('properties')
print(b.get('res'))
Result
None
I have taken your data as html
html="""<digitizer id="1" integrated="true" csrmusttouch="falsehardprox="true"
physidcsrs="false" pnpid="49154" kind="MULTI_TOUCH" maxcsrs="10">
<monitor left="0" top="0" right="1920" bottom="1080" />`
<properties>
<property name="x" logmin="0" logmax="16383" res="621.7457275" unit="cm" hidusage="0x00010030" guid="{598A6A8F-52C0-4BA0-93AF-AF357411A561}" />
<property name="y" logmin="0" logmax="16383" res="983.9639893" unit="cm" hidusage="0x00010031" guid="{B53F9F75-04E0-4498-A7EE-C30DBB5A9011}" />
<property name="status" logmin="0" logmax="15" res="0" unit="DEFAULT" hidusage="0x000d0042, 0x000d003c, 0x000d0044" guid="{6E0E07BF-AFE7-4CF7-87D1-AF6446208418}" />
<property name="time" logmin="0" logmax="2147483647" res="1" unit="DEFAULT" guid="{436510C5-FED3-45D1-8B76-71D3EA7A829D}" />
<property name="contactid" logmin="0" logmax="31" res="1.861861944" unit="cm" hidusage="0x000d0051" guid="{02585B91-049B-4750-9615-DF8948AB3C9C}" />"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"html.parser")
Code:
You can find all property tags and then read the res value associated with each one:
a = soup.find('digitizer',attrs={"id":"1"})
properties=a.find_all("property")
res_lst=[i['res'] for i in properties]
print(res_lst)
Output:
['621.7457275', '983.9639893', '0', '1', '1.861861944']
Your xml seems poorly formatted, after reformatting it:
<digitizer id="1" integrated="true" csrmusttouch="" falsehardprox="true" physidcsrs="false" pnpid="49154" kind="MULTI_TOUCH" maxcsrs="10">
<monitor left="0" top="0" right="1920" bottom="1080"/>
<properties>
<property name="x" logmin="0" logmax="16383" res="621.7457275" unit="cm" hidusage="0x00010030" guid="{598A6A8F-52C0-4BA0-93AF-AF357411A561}" />
<property name="y" logmin="0" logmax="16383" res="983.9639893" unit="cm" hidusage="0x00010031" guid="{B53F9F75-04E0-4498-A7EE-C30DBB5A9011}" />
<property name="status" logmin="0" logmax="15" res="0" unit="DEFAULT" hidusage="0x000d0042, 0x000d003c, 0x000d0044" guid="{6E0E07BF-AFE7-4CF7-87D1-AF6446208418}" />
<property name="time" logmin="0" logmax="2147483647" res="1" unit="DEFAULT" guid="{436510C5-FED3-45D1-8B76-71D3EA7A829D}" />
<property name="contactid" logmin="0" logmax="31" res="1.861861944" unit="cm" hidusage="0x000d0051" guid="{02585B91-049B-4750-9615-DF8948AB3C9C}" />
You can easily parse it like this:
from bs4 import BeautifulSoup

with open('data.xml') as raw_results:
    results = BeautifulSoup(raw_results, 'lxml')

# res is an attribute of each <property>, not of <properties>
for element in results.find_all("properties"):
    for property_tag in element.find_all("property"):
        print(property_tag['res'])
Output:
621.7457275
983.9639893
0
1
1.861861944
You can find more info about parsing attribute values from xml in the tutorial where the code is from.
Edit: Note that I slightly modified the code to fit your question.
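The original code printed None because res is an attribute of each <property> child, while .get('res') was called on the <properties> container. Here is a minimal sketch of that difference using the standard library's xml.etree.ElementTree instead of BS4. It assumes a well-formed, shortened version of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the XML from the question (an assumption).
xml_text = """<digitizer id="1" kind="MULTI_TOUCH">
  <properties>
    <property name="x" res="621.7457275" unit="cm" />
    <property name="y" res="983.9639893" unit="cm" />
    <property name="status" res="0" unit="DEFAULT" />
  </properties>
</digitizer>"""

root = ET.fromstring(xml_text)
properties = root.find("properties")

# The container itself has no res attribute, so this is None.
missing = properties.get("res")

# Each <property> child carries its own res attribute.
res_by_name = {p.get("name"): p.get("res") for p in properties.findall("property")}
print(missing)
print(res_by_name)
```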

Change an attribute in an XML tree in Python

I have a problem changing an attribute in an XML file.
My tree looks like this:
<Objects>
<BigObj Version="2.2" Name="Something">
<ItemList>
<Item Name="s_1" Selected="false"/>
<Item Name="s_2" Selected="false"/>
<Item Name="s_3" Selected="true"/>
<Item Name="s_4" Selected="false"/>
</ItemList>
</BigObj >
</Objects>
I need to check whether each "s_x" name is in a list of names; if it is, change the value of Selected to true, and if it is not, set it to false (or keep it false).
I've tried to do that with this code:
lslist = ["s_1","s_4"]
for child in root.findall("./Objects/BigObj/ItemList/Item"):
    for idx in lslist:
        if idx in child.find("Name").text:
            child.set('Selected', "true")
        else:
            child.set('Selected', "false")
But I get an AttributeError: 'NoneType' object has no attribute 'text'
The below works
import xml.etree.ElementTree as ET
lslist = ["s_1", "s_4"]
xml = '''<Objects>
<BigObj Version="2.2" Name="Something">
<ItemList>
<Item Name="s_1" Selected="false"/>
<Item Name="s_2" Selected="false"/>
<Item Name="s_3" Selected="true"/>
<Item Name="s_4" Selected="false"/>
</ItemList>
</BigObj ></Objects>'''
root = ET.fromstring(xml)
items = root.findall('.//Item')
for item in items:
    item.attrib['Selected'] = str(item.attrib['Name'] in lslist)
ET.dump(root)
output
<Objects>
<BigObj Version="2.2" Name="Something">
<ItemList>
<Item Name="s_1" Selected="True" />
<Item Name="s_2" Selected="False" />
<Item Name="s_3" Selected="False" />
<Item Name="s_4" Selected="True" />
</ItemList>
</BigObj></Objects>
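Two details are worth spelling out. The AttributeError arises because find("Name") looks for a child element called Name, whereas Name here is an attribute, which is read with get("Name"). Also, str(True) produces "True" rather than the lowercase "true" used in the original file. A minimal sketch that keeps the lowercase spelling, assuming the root element is <Objects> as in the question:

```python
import xml.etree.ElementTree as ET

xml_text = """<Objects>
  <BigObj Version="2.2" Name="Something">
    <ItemList>
      <Item Name="s_1" Selected="false"/>
      <Item Name="s_2" Selected="false"/>
      <Item Name="s_3" Selected="true"/>
      <Item Name="s_4" Selected="false"/>
    </ItemList>
  </BigObj>
</Objects>"""

wanted = {"s_1", "s_4"}
root = ET.fromstring(xml_text)

# The path is relative to the root element <Objects>, so it starts at BigObj.
for item in root.findall("./BigObj/ItemList/Item"):
    selected = item.get("Name") in wanted  # Name is an attribute, not a child tag
    item.set("Selected", "true" if selected else "false")

result = {item.get("Name"): item.get("Selected") for item in root.iter("Item")}
print(result)
```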

Extract info based on name tag from XML file by beautifulsoup python

In Python 3.5, I'm using Biopython's Entrez module to extract some information from the pmc database on the PubMed biomedical website. Now, from this XML file:
<DocSum>
<Id>5412469</Id>
<Item Name="PubDate" Type="Date">2017 Apr 22</Item>
<Item Name="EPubDate" Type="Date">2017 Apr 22</Item>
<Item Name="Source" Type="String">Int J Mol Sci</Item>
<Item Name="AuthorList" Type="List">
<Item Name="Author" Type="String">Guo Y</Item>
<Item Name="Author" Type="String">Bao Y</Item>
<Item Name="Author" Type="String">Yang W</Item>
</Item>
<Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>
<Item Name="Volume" Type="String">18</Item>
<Item Name="Issue" Type="String">4</Item>
<Item Name="Pages" Type="String">890</Item>
<Item Name="ArticleIds" Type="List">
<Item Name="pmid" Type="String">28441730</Item>
<Item Name="doi" Type="String">10.3390/ijms18040890</Item>
<Item Name="pmcid" Type="String">PMC5412469</Item>
</Item>
<Item Name="DOI" Type="String">10.3390/ijms18040890</Item>
<Item Name="FullJournalName" Type="String">International Journal of Molecular Sciences</Item>
<Item Name="SO" Type="String">2017 Apr 22;18(4):890</Item>
I want to extract the item with Name=Title (exactly the line below):
<Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>
How can I do this?
I have been using this code:
for tag in soup.findAll("docsum"): # I'm working with multiple articles in one file
    for a_tag in tag.findAll("item"):
        a_recs.append(a_tag.text)
return a_recs
But it returns all the values in one list, while I want just the title. For example:
['2017 Apr 22', '2017 Apr 22', 'Int J Mol Sci', '\nGuo Y\nBao Y\nYang W\n', 'Guo Y', 'Bao Y', 'Yang W', 'Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis', '18', '4', '890', '\n28441730\n10.3390/ijms18040890\nPMC5412469\n', '28441730', '10.3390/ijms18040890', 'PMC5412469', '10.3390/ijms18040890', 'International Journal of Molecular Sciences', '2017 Apr 22;18(4):890']
Try:
>>> data = '''
... <DocSum>
... <Id>5412469</Id>
... <Item Name="PubDate" Type="Date">2017 Apr 22</Item>
... <Item Name="EPubDate" Type="Date">2017 Apr 22</Item>
... <Item Name="Source" Type="String">Int J Mol Sci</Item>
... <Item Name="AuthorList" Type="List">
... <Item Name="Author" Type="String">Guo Y</Item>
... <Item Name="Author" Type="String">Bao Y</Item>
... <Item Name="Author" Type="String">Yang W</Item>
... </Item>
... <Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>
... <Item Name="Volume" Type="String">18</Item>
... <Item Name="Issue" Type="String">4</Item>
... <Item Name="Pages" Type="String">890</Item>
... <Item Name="ArticleIds" Type="List">
... <Item Name="pmid" Type="String">28441730</Item>
... <Item Name="doi" Type="String">10.3390/ijms18040890</Item>
... <Item Name="pmcid" Type="String">PMC5412469</Item>
... </Item>
... <Item Name="DOI" Type="String">10.3390/ijms18040890</Item>
... <Item Name="FullJournalName" Type="String">International Journal of Molecular Sciences</Item>
... <Item Name="SO" Type="String">2017 Apr 22;18(4):890</Item>'''
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(data, 'xml')
>>> a_recs = []
>>> for tag in soup.findAll("DocSum"):
...     title = tag.find("Item", {"Name" : "Title"})  # first Item whose Name is "Title"
...     if title is not None:
...         a_recs.append(title.text)
...
>>> a_recs
['Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis']
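The same attribute filter can also be written with the standard library alone. The following is a minimal sketch using xml.etree.ElementTree on a shortened copy of the data. The wrapping <Result> root element is a hypothetical name added only so the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <Result> wrapper is an assumption, not from the question.
xml_text = """<Result>
<DocSum>
  <Id>5412469</Id>
  <Item Name="Source" Type="String">Int J Mol Sci</Item>
  <Item Name="AuthorList" Type="List">
    <Item Name="Author" Type="String">Guo Y</Item>
  </Item>
  <Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>
</DocSum>
</Result>"""

root = ET.fromstring(xml_text)

# Item[@Name='Title'] selects only direct Item children whose Name attribute is "Title".
titles = [
    item.text
    for doc in root.iter("DocSum")
    for item in doc.findall("Item[@Name='Title']")
]
print(titles)
```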

ElementTree better way to search out nodes (XPATH) using AND and 'parent'

I need to find ITEM tags that match two criteria, and then get the name attribute of the parent NODE (NODE/@name) based on that match.
Two issues:
I can't find a way for XPath to do an 'and', for example
item = node.findall('./ITEM[@name="toppas_type" and @value="output file list"]')
Getting the parent NODE info without having to explicitly search and save it in advance of finding the ITEM, for example something like
parent_name = item.parent.attrib['name']
This is the code I have now:
node_names = []
for node in tree.findall('NODE[@name="vertices"]/NODE'):
    for item in node.findall('./ITEM[@name="toppas_type"]'):
        if item.attrib['name'] == 'toppas_type' and item.attrib['value'] == 'output file list':
            node_names.append(node.attrib['name'])
...to parse a file like this (snippet only) ...
<?xml version="1.0" encoding="ISO-8859-1"?>
<PARAMETERS version="1.6.2" xsi:noNamespaceSchemaLocation="http://open-ms.sourceforge.net/schemas/Param_1_6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<NODE name="vertices" description="">
<NODE name="23" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
<ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
<ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
</NODE>
<NODE name="24" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
<ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
</NODE>
<NODE name="33" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
<ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
</NODE>
<!--(snip)-->
</NODE>
</PARAMETERS>
UPDATE:
@Mathias Müller
Great suggestion - unfortunately when I try to load the XML file, I get an error. I'm not familiar with lxml...so I'm not sure if I'm using it right.
from lxml import etree
root = etree.DTD("/Users/mikes/Documents/Eclipseworkspace/Bioproximity/Assay-Workflows-Mikes/protein_lfq/protein_lfq-1.1.2.toppas")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/lxml/dtd.pxi", line 294, in lxml.etree.DTD.__init__ (src/lxml/lxml.etree.c:187024)
lxml.etree.DTDParseError: Content error in the external subset, line 2, column 1
Unfortunately, ElementTree will not accept that xpath in its tree.find(xpath) or tree.findall(xpath)
Perhaps you do not need nested loops at all; a single XPath expression would suffice. I am not exactly sure what you would like the final result to be, but here is an example with lxml:
>>> import lxml.etree
>>> s = '''<NODE name="vertices" description="">
...
... <NODE name="23" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
... <ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
... <ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
... </NODE>
...
... <NODE name="24" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
... <ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
... </NODE>
...
... <NODE name="33" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
... <ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
... </NODE>
... <!--(snip)-->
... </NODE>'''
>>> root = lxml.etree.fromstring(s)
>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x102b5f788>]
And if you actually need the name of the parent element, you can move to the parent node with ..:
>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]/../@name')
['24']
Parsing an XML document from a file
The function etree.DTD is the wrong choice if you would like to parse an XML document from a file. A DTD is not an XML document. Here is how you can do it with lxml:
>>> import lxml.etree
>>> root = lxml.etree.parse("example.xml")
>>> root
<lxml.etree._ElementTree object at 0x106593b00>
Second Update
If the outermost element is PARAMETERS, you need to search like this:
>>> root.xpath('/PARAMETERS/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x106593e18>]
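If staying with the standard library is preferred, ElementTree's limited XPath subset has no "and" operator, but predicates can be chained, which has the same effect for attribute tests. A minimal sketch on a shortened copy of the file from the question follows. The namespace attributes on PARAMETERS are left out here for brevity, which is an assumption about what matters for the search.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question's file.
xml_text = """<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="23" description="">
      <ITEM name="toppas_type" value="tool" type="string" />
    </NODE>
    <NODE name="24" description="">
      <ITEM name="toppas_type" value="output file list" type="string" />
    </NODE>
    <NODE name="33" description="">
      <ITEM name="toppas_type" value="merger" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>"""

root = ET.fromstring(xml_text)

# Two chained predicates act like a logical "and" on the attributes.
item_path = "ITEM[@name='toppas_type'][@value='output file list']"

# Iterating over the parent NODEs keeps the parent at hand without a second lookup.
node_names = [
    node.get("name")
    for node in root.findall("./NODE[@name='vertices']/NODE")
    if node.find(item_path) is not None
]
print(node_names)
```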

Element is no longer valid

I have a problem with Python and Selenium.
I want to click a link. This is my Python code:
toubao_luru_xpath='//div[87]/xml/items/item[2]/item[@path=policynewbiz/inputapplication/chooseproduct.jsp]'
#url=policynewbiz/inputapplication/chooseproduct.jsp
print WebDriverWait(browser,10).until(EC.presence_of_element_located((By.XPATH,toubao_luru_xpath)))
print browser.find_element_by_xpath(toubao_luru_xpath)
#print browser.find_element_by_xpath(toubao_luru_xpath).click()
The error is:
File
"C:\Python27\lib\site-packages\selenium\webdriver\support\wait.py",
line 80, in until
raise TimeoutException(message, screen, stacktrace) selenium.common.exceptions.TimeoutException: Message: Yes,
And this is the HTML code:
<html>
<DIV style="DISPLAY: none"><xml id=__menu>
<items>
<item name="quotation" label="散单报价" >
<item name="input" label="录入" path="policyquotation/createquotation/chooseproduct.jsp" icon="../image/icon/1.gif" visible="false" ></item>
<item name="quotationinput" label="录入" icon="../image/icon/1.gif" visible="false" command="commandCreateQuotationTemplate" ></item>
<item name="quotationinput2014" label="录入" path="policyquotation_v2/chooseproduct2014.jsp" icon="../image/icon/1.gif" ></item>
<item name="enterquotation" label="enterquotation" visible="false" ></item>
<item name="queryQuotation" label="查询" path="policyquotation/qryquotationlist.jsp" icon="../image/icon/2.gif" visible="false" ></item>
<item name="queryQuotation2" label="查询" path="policyquotation_v2/qryquotationlist.jsp" icon="../image/icon/2.gif" ></item>
<item name="packageWork" label="套餐指定" path="policyquotation/package-manage-work.jsp" icon="../image/icon/3.gif" visible="false" ></item>
<item name="querypackage" label="套餐管理" path="policyquotation/query-package-list.jsp" icon="../image/icon/4.gif" visible="false" ></item>
<item name="quotationfollow" label="报价跟进" path="policyquotation/followquotation.jsp" icon="../image/icon/5.gif" visible="false" ></item>
<item name="entererror" label="entererror" path="error.jsp" visible="false" ></item>
<item name="quotationfollownew" label="报价跟进" path="policyquotation_v2/followquotation.jsp" icon="../image/icon/5.gif" visible="false" ></item>
</item>
<item name="application" label="投保" >
<item name="input" label="录入" path="policynewbiz/inputapplication/chooseproduct.jsp" icon="../image/icon/1.gif" ></item>
</html>
I want to click the last <item>.
This should work. It's unique given the HTML you provided and I'm assuming that there aren't two links to the same URL so it should be good.
driver.find_element_by_css_selector("item[path='policynewbiz/inputapplication/chooseproduct.jsp']").click()
