Extracting Data from a .txt file using Python

I have many, many .xml files and I need to extract some co-ordinates from them.
Extracting data straight from the .xml files seems to be very, very complicated, so I am working on saving the .xml files as .txt files and extracting the data that way. However, when I open the .txt file, my data is all bunched together on about 6 lines, and all the scripts I have found so far select the data by reading the first word on each line, which obviously won't work for me!
I need to extract the numbers between these elements:
<gml:lowerCorner>137796 483752</gml:lowerCorner> <gml:upperCorner>138178 484222</gml:upperCorner>
In the text file they are all grouped together! Does anyone know how to extract this data? Thank you!

This is absolutely the wrong approach. Leave it alone and improve your ways :-)
Seriously, if the file is XML, then just use an XML parser to read it. Learning how to do it in Python isn't hard and will make your life easier now and much easier in the future, when you may find yourself facing more complex parsing needs, and you won't have to re-learn it.
Look at xml.etree.ElementTree.ElementTree. Here's some sample code:
>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("your_xml_file.xml")
Now just read the documentation of the module and see what you can do with tree. You'll be surprised to find out how simple it is to get at information this way. If you have specific questions about extracting data, I suggest you open another question in which you specify the format of the XML file you have to parse and what data you have to take out of it. I'm sure working code will be suggested to you in a matter of minutes.
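For the coordinates in the question, a minimal sketch might look like this (it assumes the file declares the standard GML namespace, http://www.opengis.net/gml; check the xmlns:gml declaration in your own files):
import xml.etree.ElementTree as ET

# The namespace URI is an assumption -- copy it from the xmlns:gml declaration in your file.
NS = {"gml": "http://www.opengis.net/gml"}

tree = ET.parse("your_xml_file.xml")
root = tree.getroot()

for tag in ("lowerCorner", "upperCorner"):
    for corner in root.findall(".//gml:" + tag, NS):
        x, y = corner.text.split()
        print(tag, x, y)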

You can also open an .xml file from your Python script the same way you open a .txt file.
data = open("file.xml")
xml = data.read()
Then you can use regular expressions to find those numbers you want so badly.

The top answer is still the top answer. However, I've been doing just this with HTML and have found lxml and XPath ideal.
Open your browser to the site (or data) of interest. In Chrome, right-click and choose 'Inspect Element'. In the Developer window, right-click the highlighted element again and choose 'Copy XPath'. For google.com, clicking on the main search box, I get the following XPath.
//*[@id="lst-ib"]
You can use lxml to grab various data from this item. See what you get when you append 'text()' or '@value' or '@href' on the end.
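For instance, a tiny self-contained sketch with lxml (the HTML fragment is made up just to show what text() and @href return):
from lxml import html

# An illustrative fragment; in practice you would parse the page source you downloaded.
doc = html.fromstring('<div><a id="lst-ib" href="https://example.com">hello</a></div>')

print(doc.xpath('//*[@id="lst-ib"]/text()'))   # ['hello']
print(doc.xpath('//*[@id="lst-ib"]/@href'))    # ['https://example.com']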

For really simple XML I just use a regex; I can't be bothered to start up a slow XML parser for a simple XML packet.
In [1]: data = open("file.txt","r").read()
In [2]: import re
In [3]: re.compile("([\d]+)").findall(data)
Out[3]: ['137796', '483752', '138178', '484222']

Related

Saving boards in a file

I would love to have records of my Literaki games on my computer, but unfortunately this is not an option. I decided to write a program that records, line by line in a text file, the layout of the letters on the board and the available letters. My idea is to extract this from the HTML with Python, maybe with Selenium. I know Python pretty well, but I can't say the same about working with web pages. I need an idea of how to do this in the most efficient way (i.e. how to extract this data from an HTML page into an array or a list). The website is https://www.kurnik.pl/literaki.
I tried static methods, but it turned out to be a completely wrong approach and nothing came of it.
Thank you in advance.
If you have the algorithms and logic to convert the page into a list, I may have a solution. Try making a CSV file (comma-separated values), which gives a table sort-of layout and can also be converted back to a list with ease. E.g.
with open('board.csv', 'w') as f:
    for row in board:                      # board: a list of lists of letters
        f.write(','.join(row) + '\n')
# no f.close() needed -- the with block closes the file for you
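Reading it back into a list of lists is just as short (a sketch, assuming the board.csv written above):
import csv

with open('board.csv', newline='') as f:
    board = [row for row in csv.reader(f)]   # each row is a list of letters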

Can you modify only a text string in an XML file and still maintain integrity and functionality of .docx encasement?

I want to enter data into a Microsoft Excel Spreadsheet, and for that data to interact and write itself to other documents and webforms.
With success, I am pulling data from an Excel spreadsheet using xlwings. Right now, I’m stuck working with .docx files. The goal here is to write the Excel data into specific parts of a Microsoft Word .docx file template and create a new file.
My specific question is:
Can you modify just a text string(s) in a word/document.xml file and still maintain the integrity and functionality of its .docx encasement? It seems that there are numerous things that can change in the XML code when making even the slightest change to a Word document. I've been working with python-docx and lxml, but I'm not sure if what I seek to do is possible via this route.
Any suggestions or experiences to share would be greatly appreciated. I feel I've read every article that is easily discoverable through a google search at least 5 times.
Let me know if anything needs clarification.
Some things to note:
I started getting into coding about 2 months ago. I’ve been doing it intensively for that time and I feel I’m picking up the essential concepts, but there are severe gaps in my knowledge.
Here are my tools:
Yosemite 10.10,
Microsoft Office 2011 for Mac
You probably need to be more specific, but the short answer is, in principle, yes.
At a certain level, all python-docx does is modify strings in the XML. A couple things though:
The XML you create needs to remain well-formed and valid according to the schema. So if you change the text enclosed in a <w:t> element, for example, that works fine. Conversely, if you inject a bunch of random XML at an arbitrary point in one of the .xml parts, that will corrupt the file.
The XML "files" (known as parts) that make up a .docx file are contained in a Zip archive known as a package. You must unpackage and repackage that set of parts properly in order to have a valid .docx file afterward. python-docx takes care of all those details for you, but if you're going directly at the .docx file you'll need to take care of that yourself.
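For instance, a minimal python-docx sketch (the file names and the placeholder string are made up for illustration):
from docx import Document

doc = Document("template.docx")   # hypothetical template file

# Replace a placeholder wherever it appears in a run; python-docx keeps the XML well-formed
# and repackages the zip for you on save.
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        if "{{CUSTOMER_NAME}}" in run.text:
            run.text = run.text.replace("{{CUSTOMER_NAME}}", "ACME Corp.")

doc.save("output.docx")
One caveat: Word sometimes splits what looks like a single string across several runs, so a placeholder may not sit inside one run.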

What's the best way to handle XML in Python?

I am working on an XML file that is very large (I think it is about 45 GB). I have to search through this document of OpenStreetMap data and find a particular path that satisfies certain criteria. I am currently using xml.etree.ElementTree; however, I think it is a little slow, since I need to search the XML for specific geographic coordinates. Is there a better way to do this?
I have also seen a little bit of lxml; however, I'd like to know if there's a better choice between the two? Thank you!
The sax package is well suited for parsing huge XML files that can't be loaded into memory all at once. Instead of loading the file into memory, parsing it, and giving you the whole tree, the parser goes through the file "line by line" and notifies you when it encounters the beginning or end of an element.
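A rough sketch of that callback style for OpenStreetMap data, where nodes carry lat/lon attributes (the bounding box and file name are just examples):
import xml.sax

class NodeHandler(xml.sax.ContentHandler):
    """Collect the ids of <node> elements that fall inside a bounding box."""

    def __init__(self, min_lat, max_lat, min_lon, max_lon):
        super().__init__()
        self.bbox = (min_lat, max_lat, min_lon, max_lon)
        self.matches = []

    def startElement(self, name, attrs):
        if name == "node":
            lat, lon = float(attrs["lat"]), float(attrs["lon"])
            min_lat, max_lat, min_lon, max_lon = self.bbox
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                self.matches.append(attrs["id"])

handler = NodeHandler(52.0, 52.1, 4.0, 4.1)
xml.sax.parse("planet.osm", handler)   # streams the file; memory use stays flat
print(len(handler.matches), "nodes inside the box")
lxml.etree.iterparse offers a similar streaming model with an ElementTree-style API, if you prefer that.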

Remove ASCII-encoded binary blobs from .txt files

I want to parse 10-K files (financial statements of firms). An example of Apple's can be found here (look for the .txt file). Now, I was reading this research paper (look at pages 30-31) on how to parse these files. Step one is described as removing all ASCII-encoded segments ... and that's what I want to figure out how to remove.
I see several questions on StackOverflow on how to remove non-ASCII codes, but this is different. ASCII-Encoded segments are: All document segments with <TYPE> tags of GRAPHIC, ZIP, EXCEL and PDF - I want to delete them.
So if I load a txt file as follow:
fil = open('F:\\file.txt','r')
x = fil.read()
How can I remove all ASCII Encoded segments from this txt file? To remove HTML tags, I use the procedure here, but what about ASCII Encoded segments?
If I understand you correctly, the format you are processing is somehow related to the SEC EDGAR process.
I have not taken the time to look it up formally. Perhaps you should.
From inspecting the Apple statement you link to, it looks like you want to replace anything matching the regular expression <DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT> with an empty string.
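A minimal sketch of that substitution (the re.S / re.DOTALL flag matters, since .*? has to span many lines):
import re

with open('F:\\file.txt', 'r') as fil:
    x = fil.read()

# Drop GRAPHIC/ZIP/EXCEL/PDF document segments in one pass.
pattern = r"<DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT>"
cleaned = re.sub(pattern, "", x, flags=re.S)

with open('F:\\file_cleaned.txt', 'w') as out:
    out.write(cleaned)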
Disclaimer: A proper implementation would use an XML parser and extract the elements you want, instead of attempting to lexically zap things you don't want. This should not be hard in lxml.
I first thought this was XBRL, but it's not. Attempting to parse it with ETree throws an exception because the close tags for some elements (including <TYPE>) appear to be optional. The best way forward would be to find out what format this is (the EDGAR site has several specifications; one of them, perhaps?) and locate a proper DTD, then proceed from there.
Once you have that sorted out, you want to see how to remove elements with XPath and perhaps how to use regex in (lxml) XPath. Then probably reimplement the other extractions you have already done using XML and XPath.

Map xml file with extracted information

I am trying to map one xml file to another, based on a configuration file (that too can be an xml file).
Input
<ia>
  <ib>...</ib>
  <ic>...</ic>
</ia>
Output
<oa>
  <ob>...</ob>
  <oc>...</oc>
</oa>
Config
<config>
  <conf>
    <input>ia</input>
    <output>oa</output>
  </conf>
  <conf>
    <input>ib</input>
    <output>ob</output>
  </conf>
  .....
</config>
So, the intention is to parse an XML file, retrieve the information interesting to me, and write it into another XML file, where the mapping information is specified in the config file.
Because of the scripting nature (and the option of extending with plugins later on), and the support for XML processing, I was considering Python. I have just learned the syntax and basics of the language, and came to know about lxml.
One way of doing this:
parse the config file (where the input tag can have an XPath to the node I am interested in)
read the input file
write into the output, using etbuilder, based on the config file
Being new to Python, and not seeing XPath support in etbuilder, I wonder whether this is the best approach. I am also not sure about all the exceptional cases. Is there an easier way, or native support in any other library? If possible, I do not want to spend too much time on this task, so that I can focus on the core task.
Thanks in advance.
If you wish to transform an XML file into another XML file then XSLT was made for this purpose. You have to define a .xslt file that describes the transformation of XML content and what the eventual output should look like. That's one way to do it.
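A sketch of applying such a stylesheet with lxml (the file names are placeholders):
from lxml import etree

transform = etree.XSLT(etree.parse("mapping.xslt"))   # the .xslt describing the mapping
output_doc = transform(etree.parse("input.xml"))

with open("output.xml", "wb") as f:
    f.write(etree.tostring(output_doc, pretty_print=True))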
You can also read the XML file using lxml and generate the output XML with lxml.etree.ElementTree. I'm not familiar with etbuilder, but I don't think generating the desired output is that difficult. Once you have parsed the input and config files, you can build the output XML and write it to a file.
XPath is primarily for reading XML content, you don't need it for constructing XML files. In fact, if you use a proper XML parser then you don't need XPath either to read the file contents, although XPath could make life a bit easier.
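For the lxml route, here is a rough config-driven sketch; it assumes the config simply maps input tag names to output tag names, as in the example above:
from lxml import etree

# Build a {input tag: output tag} mapping from the config file.
config = etree.parse("config.xml")
mapping = {conf.findtext("input"): conf.findtext("output")
           for conf in config.findall("conf")}

def convert(elem):
    """Recursively copy an element, renaming tags according to the mapping."""
    new = etree.Element(mapping.get(elem.tag, elem.tag))
    new.text = elem.text
    for child in elem:
        new.append(convert(child))
    return new

result = convert(etree.parse("input.xml").getroot())
etree.ElementTree(result).write("output.xml", pretty_print=True)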
