Problems isolating XML from a string - Python

In my Python script, I am struggling with XML files. I am using urllib to download XML files and convert them to a string. Next, I'd like to parse the XML file.
Sample link of a typical file
import urllib.request
data = urllib.request.urlopen(link).read()
data = str(data)
data2 = data.replace('\n', '')
I wanted to strip data of \n, but data2 is not stripped of the \n characters; sample output for data2 looks like this:
SwapInvolved>\n </transactionCoding>\n <transactionTimeliness>\n <value></value>\n
Why?
Also, since the file I pull is XML, I would like to go through ElementTree to parse it, but I get an error.
import xml.etree.ElementTree
e = xml.etree.ElementTree.parse(data).getroot()
OSError: [Errno 36] File name too long:
In the end, I want to fetch the XML from the link and parse it, but I am doing something wrong.

Your first problem is that you need to escape the '\n' in str.replace(), because your string contains the literal two-character sequence \ and n: calling str() on the bytes returned by urlopen() produces a printable representation in which real linefeeds appear as backslash-n. Your code is looking for linefeeds, but your data contains string representations of linefeeds.
Do this instead: data2 = data.replace(r"\n","")
Your second problem is that xml.etree.ElementTree.parse() expects a filename, not a string of XML. Use xml.etree.ElementTree.fromstring() instead.
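A small, self-contained sketch of both fixes together. Since the question's sample link is not available, the bytes that urlopen(...).read() would return are inlined here as a literal:

```python
import xml.etree.ElementTree as ET

raw = b"<root>\n  <value>42</value>\n</root>"  # what urlopen(link).read() returns

# str() on bytes produces a repr like "b'<root>\\n...'": the newlines become
# the two characters backslash and n, which is why replace('\n', '') found nothing.
wrong = str(raw)

# decode() turns the bytes into real text with real newlines.
text = raw.decode("utf-8")

# fromstring() parses XML held in a string; parse() wants a filename,
# which is why passing the document itself raised "File name too long".
root = ET.fromstring(text)
print(root.find("value").text)
```

With the bytes decoded properly, the replace-versus-strip confusion disappears entirely, because the string contains actual newline characters again.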

Related

Parse a binary string in Python from a file previously written from C++

I have a file whose contents are a serialized string from a protobuf in C++. I want to read it in a Python file. Here is the content of the file:
\n\013\010\202\331\242\233\006\020\200\256\263F\022$\022\024\r\255\314\355>\025\303\255&?\035<V\312=%\014H\223>\035\200\242\376>\"\007\022\005Pants\022\"\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\341G\360>\"\005\022\003Man\022\"\022\024\r\332K\335>\025a\261\252>\035\\\3534>%z\330\"?\035\324\334\354>\"\005\022\003Man\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\2778\352>\"\007\022\005Pants\022\"\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035\256\301\277>\"\005\022\003Man\022\'\022\024\r\262\271\336>\025\364\261\261>\035T\3310>%E\344\035?\035\261l\253>\"\n\022\010Clothing\022\'\022\024\rs\255\037?\025\023\207\226>\035\3347\'>%\222\3721?\035m\364\247>\"\n\022\010Clothing\022\"\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\273\336\240>\"\005\022\003Top\022$\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\036\035\226>\"\007\022\005Shirt\022\"\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035v2\225>\"\005\022\003Top\022%\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035*\204\217>\"\010\022\006Person\022&\022\024\rB\264!?\025\016r\314>\035\354t\036>%\260\373\213>\035\037/\217>\"\t\022\007T-shirt\022%\022\024\rJp\325=\025\212\260\241>\035V\361\344=%\224\036i>\035K\253\213>\"\010\022\006Person\022\'\022\024\r\307{\262>\025\332\360\"<\035\214\307\252=%\027hP>\035\033D\211>\"\n\022\010Lighting\022$\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\0352t\207>\"\007\022\005Shirt\022\'\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\0355\014\207>\"\n\022\010Lighting\022\"\022\024\r\370Q\330=\025NH\247>\035\354;\260=%\270gQ>\035\215\304\203>\"\005\022\003Man\0220\022\024\r\3123\217<\025\272\252#?\035\250\340\244>%z\257\267>\035\217\353~>\"\023\022\021Electronic 
device\022\'\022\024\r\330>\340>\025R\352\345>\035(\000%>%l\255l>\035\265\002~>\"\n\022\010Clothing\022$\022\024\r\302\3779>\025\206)\244>\035|\337\342=%\360jK>\035\244t{>\"\007\022\005Table\022%\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\342E{>\"\010\022\006Person\022\'\022\024\r \323\330=\025\334\200\262>\035d\307\267=%\250._>\035-\224u>\"\n\022\010Clothing\022\'\022\024\re\017/?\025,\240;?\035`!\032>%P\333\210>\035\221up>\"\n\022\010Clothing\022\'\022\024\r\007\315$?\025\306E%?\035\240\312\307=%\362\014\237>\035\203)d>\"\n\022\010Clothing\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\010\261a>\"\007\022\005Jeans\022$\022\024\rk\247\355>\025NT&?\035\2240\313=%\272\272\223>\035\243\227a>\"\007\022\005Jeans\022\'\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035#_`>\"\n\022\010Clothing\0220\022\024\rF\'\027?\025dn}>\035\360J\232=%l\244;>\035m%\\>\"\023\022\021Electronic device\022,\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\035F\276R>\"\017\022\rLight fixture\022(\022\024\r\247s=>\025\337a\251>\035\216\365\321=%zY\r>\035LIN>\"\013\022\tTable top
If I take the above string and assign it as a binary string literal, I am able to parse it and interpret the values correctly.
object_reader = objreader.ObjectDetectionPredictionResult()
orig = b'\n\013\010\202\331\242\233\006\020\200\256\263F\022$\022\024\r\255\314\355>\025\303\255&?\035<V\312=%\014H\223>\035\200\242\376>\"\007\022\005Pants\022\"\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\341G\360>\"\005\022\003Man\022\"\022\024\r\332K\335>\025a\261\252>\035\\\3534>%z\330\"?\035\324\334\354>\"\005\022\003Man\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\2778\352>\"\007\022\005Pants\022\"\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035\256\301\277>\"\005\022\003Man\022\'\022\024\r\262\271\336>\025\364\261\261>\035T\3310>%E\344\035?\035\261l\253>\"\n\022\010Clothing\022\'\022\024\rs\255\037?\025\023\207\226>\035\3347\'>%\222\3721?\035m\364\247>\"\n\022\010Clothing\022\"\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\273\336\240>\"\005\022\003Top\022$\022\024\r0\341\337>\025\224\032\346>\035\354q%>%D%m>\035\036\035\226>\"\007\022\005Shirt\022\"\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035v2\225>\"\005\022\003Top\022%\022\024\r^\350-?\025\246\3034?\035\240\250E>%h~\226>\035*\204\217>\"\010\022\006Person\022&\022\024\rB\264!?\025\016r\314>\035\354t\036>%\260\373\213>\035\037/\217>\"\t\022\007T-shirt\022%\022\024\rJp\325=\025\212\260\241>\035V\361\344=%\224\036i>\035K\253\213>\"\010\022\006Person\022\'\022\024\r\307{\262>\025\332\360\"<\035\214\307\252=%\027hP>\035\033D\211>\"\n\022\010Lighting\022$\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\0352t\207>\"\007\022\005Shirt\022\'\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\0355\014\207>\"\n\022\010Lighting\022\"\022\024\r\370Q\330=\025NH\247>\035\354;\260=%\270gQ>\035\215\304\203>\"\005\022\003Man\0220\022\024\r\3123\217<\025\272\252#?\035\250\340\244>%z\257\267>\035\217\353~>\"\023\022\021Electronic 
device\022\'\022\024\r\330>\340>\025R\352\345>\035(\000%>%l\255l>\035\265\002~>\"\n\022\010Clothing\022$\022\024\r\302\3779>\025\206)\244>\035|\337\342=%\360jK>\035\244t{>\"\007\022\005Table\022%\022\024\rmv\036?\025|\032\224>\0354z->%\334\3354?\035\342E{>\"\010\022\006Person\022\'\022\024\r \323\330=\025\334\200\262>\035d\307\267=%\250._>\035-\224u>\"\n\022\010Clothing\022\'\022\024\re\017/?\025,\240;?\035`!\032>%P\333\210>\035\221up>\"\n\022\010Clothing\022\'\022\024\r\007\315$?\025\306E%?\035\240\312\307=%\362\014\237>\035\203)d>\"\n\022\010Clothing\022$\022\024\r-\202%?\025*\331$?\035\300\240\310=%\364\r\236>\035\010\261a>\"\007\022\005Jeans\022$\022\024\rk\247\355>\025NT&?\035\2240\313=%\272\272\223>\035\243\227a>\"\007\022\005Jeans\022\'\022\024\r\245G!?\025e\340\314>\0354\222 >%;\003\213>\035#_`>\"\n\022\010Clothing\0220\022\024\rF\'\027?\025dn}>\035\360J\232=%l\244;>\035m%\\>\"\023\022\021Electronic device\022,\022\024\r\222\034\207<\025\037`\266>\035D\351\254=%\262BN>\035F\276R>\"\017\022\rLight fixture\022(\022\024\r\247s=>\025\337a\251>\035\216\365\321=%zY\r>\035LIN>\"\013\022\tTable top'
object_reader.ParseFromString(orig)
print(object_reader)
If the above string is in a file and I try to read it as a binary file and parse it, I get an error.
with open("./payload", "rb") as fd:
    val = fd.read()

object_reader.ParseFromString(val)
print(object_reader)
Error:
google.protobuf.message.DecodeError: Error parsing message
On looking further, it seems that while reading the file in binary format it is adding extra '\' escape characters, which look like this:
b'\\n\\013\\010\\202\\331\\242\\233\\006\\020\\200\\256\\263F\\022$\\022\\024\\r\\255\\314\\355>\\025\\303\\255&?\\035<V\\312=%\\014H\\223>\\035\\200\\242\\376>\\"\\007\\022\\005Pants\\022\\"\\022\\024\\rmv\\036?\\025|\\032\\224>\\0354z->%\\334\\3354?\\035\\341G\\360>\\"\\005\\022\\003Man\\022\\"\\022\\024\\r\\332K\\335>\\025a\\261\\252>\\035\\\\\\3534>%z\\330\\"?\\035\\324\\334\\354>\\"\\005\\022\\003Man\\022$\\022\\024\\r-\\202%?\\025*\\331$?\\035\\300\\240\\310=%\\364\\r\\236>\\035\\2778\\352>\\"\\007\\022\\005Pants\\022\\"\\022\\024\\r^\\350-?\\025\\246\\3034?\\035\\240\\250E>%h~\\226>\\035\\256\\301\\277>\\"\\005\\022\\003Man\\022\\\'\\022\\024\\r\\262\\271\\336>\\025\\364\\261\\261>\\035T\\3310>%E\\344\\035?\\035\\261l\\253>\\"\\n\\022\\010Clothing\\022\\\'\\022\\024\\rs\\255\\037?\\025\\023\\207\\226>\\035\\3347\\\'>%\\222\\3721?\\035m\\364\\247>\\"\\n\\022\\010Clothing\\022\\"\\022\\024\\r0\\341\\337>\\025\\224\\032\\346>\\035\\354q%>%D%m>\\035\\273\\336\\240>\\"\\005\\022\\003Top\\022$\\022\\024\\r0\\341\\337>\\025\\224\\032\\346>\\035\\354q%>%D%m>\\035\\036\\035\\226>\\"\\007\\022\\005Shirt\\022\\"\\022\\024\\r\\245G!?\\025e\\340\\314>\\0354\\222 >%;\\003\\213>\\035v2\\225>\\"\\005\\022\\003Top\\022%\\022\\024\\r^\\350-?\\025\\246\\3034?\\035\\240\\250E>%h~\\226>\\035*\\204\\217>\\"\\010\\022\\006Person\\022&\\022\\024\\rB\\264!?\\025\\016r\\314>\\035\\354t\\036>%\\260\\373\\213>\\035\\037/\\217>\\"\\t\\022\\007T-shirt\\022%\\022\\024\\rJp\\325=\\025\\212\\260\\241>\\035V\\361\\344=%\\224\\036i>\\035K\\253\\213>\\"\\010\\022\\006Person\\022\\\'\\022\\024\\r\\307{\\262>\\025\\332\\360\\"<\\035\\214\\307\\252=%\\027hP>\\035\\033D\\211>\\"\\n\\022\\010Lighting\\022$\\022\\024\\r\\245G!?\\025e\\340\\314>\\0354\\222 
>%;\\003\\213>\\0352t\\207>\\"\\007\\022\\005Shirt\\022\\\'\\022\\024\\r\\222\\034\\207<\\025\\037`\\266>\\035D\\351\\254=%\\262BN>\\0355\\014\\207>\\"\\n\\022\\010Lighting\\022\\"\\022\\024\\r\\370Q\\330=\\025NH\\247>\\035\\354;\\260=%\\270gQ>\\035\\215\\304\\203>\\"\\005\\022\\003Man\\0220\\022\\024\\r\\3123\\217<\\025\\272\\252#?\\035\\250\\340\\244>%z\\257\\267>\\035\\217\\353~>\\"\\023\\022\\021Electronic device\\022\\\'\\022\\024\\r\\330>\\340>\\025R\\352\\345>\\035(\\000%>%l\\255l>\\035\\265\\002~>\\"\\n\\022\\010Clothing\\022$\\022\\024\\r\\302\\3779>\\025\\206)\\244>\\035|\\337\\342=%\\360jK>\\035\\244t{>\\"\\007\\022\\005Table\\022%\\022\\024\\rmv\\036?\\025|\\032\\224>\\0354z->%\\334\\3354?\\035\\342E{>\\"\\010\\022\\006Person\\022\\\'\\022\\024\\r \\323\\330=\\025\\334\\200\\262>\\035d\\307\\267=%\\250._>\\035-\\224u>\\"\\n\\022\\010Clothing\\022\\\'\\022\\024\\re\\017/?\\025,\\240;?\\035`!\\032>%P\\333\\210>\\035\\221up>\\"\\n\\022\\010Clothing\\022\\\'\\022\\024\\r\\007\\315$?\\025\\306E%?\\035\\240\\312\\307=%\\362\\014\\237>\\035\\203)d>\\"\\n\\022\\010Clothing\\022$\\022\\024\\r-\\202%?\\025*\\331$?\\035\\300\\240\\310=%\\364\\r\\236>\\035\\010\\261a>\\"\\007\\022\\005Jeans\\022$\\022\\024\\rk\\247\\355>\\025NT&?\\035\\2240\\313=%\\272\\272\\223>\\035\\243\\227a>\\"\\007\\022\\005Jeans\\022\\\'\\022\\024\\r\\245G!?\\025e\\340\\314>\\0354\\222 >%;\\003\\213>\\035#_`>\\"\\n\\022\\010Clothing\\0220\\022\\024\\rF\\\'\\027?\\025dn}>\\035\\360J\\232=%l\\244;>\\035m%\\\\>\\"\\023\\022\\021Electronic device\\022,\\022\\024\\r\\222\\034\\207<\\025\\037`\\266>\\035D\\351\\254=%\\262BN>\\035F\\276R>\\"\\017\\022\\rLight fixture\\022(\\022\\024\\r\\247s=>\\025\\337a\\251>\\035\\216\\365\\321=%zY\\r>\\035LIN>\\"\\013\\022\\tTable top:'
I am looking for some help in reading the above file while undoing the extra escape characters, or any other way of reading the file and loading it into protobuf to parse the values.
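One way to undo the doubled escapes, assuming the file really does contain the textual representation of the bytes (a literal backslash followed by n, 013, and so on) rather than the raw bytes themselves. unescape_payload is a hypothetical helper name, not from the question:

```python
def unescape_payload(val: bytes) -> bytes:
    # unicode_escape interprets the textual \n, \013, ... sequences as
    # escape codes; latin-1 then maps each resulting code point back to
    # a single raw byte, suitable for ParseFromString().
    return val.decode("unicode_escape").encode("latin-1")

# e.g. the textual sequences \n, \013, \010 become bytes 0x0A, 0x0B, 0x08:
escaped = rb"\n\013\010Pants"  # what the file actually holds
print(unescape_payload(escaped))
```

That said, the cleaner long-term fix is probably on the C++ side: write the serialized message bytes directly to a binary stream (for example with SerializeToOstream) instead of printing their escaped representation, so Python can pass fd.read() straight to ParseFromString().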

Searching for and manipulating the content of a keyword in a huge file

I have a huge HTML file that I have converted to a text file. (The file is the Facebook home page's source.) Assume the text file has a specific keyword in some places, for example: "some_keyword: [bla bla]". How would I print all the different bla blas that follow some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names with this format in the page. How would I print all the names that follow "name:", considering the text is very large and crashes the program when you read() it or try to search through its lines?
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment, since you are the person responsible for writing the data to the file, write the data in JSON format and read it from the file using json.loads():
import json

json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)
for item in json_data:
    print(item['name'])
Explanation:
Let's say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which will be changing dynamically within your code where you perform the write operation on the file. Instead, append it to a list:
a = []
for item in page_content:
    # data = some xy logic on the HTML file
    a.append(data)
Now write this list to the file using json.dump().
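A minimal round trip of the approach above: dump the list as JSON, then load it back and read the names. The filename names.json and the list contents are illustrative placeholders:

```python
import json

# a hypothetical list built while scanning the page, as described above
a = [{"id": "1126830890", "name": "Hillary Clinton", "firstName": "Hillary"}]

with open("names.json", "w") as fp:
    json.dump(a, fp)  # writes valid JSON to disk

with open("names.json") as fp:
    for item in json.load(fp):  # json.load() reads from a file object directly
        print(item["name"])
```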
I just wanted to throw this out there, even though I agree with all the comments about dealing with the HTML directly or using Facebook's API (probably the safest way): open file objects in Python can be used as generators, yielding lines without reading the entire file into memory, and the re module can be used to extract information from text.
This can be done like so:
import re

regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")
with open("filename.txt", "r") as fp:
    for line in fp:
        for match in regex.findall(line):
            print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
Here are the Python 2 docs for the re module.
Here are the Python 3 docs for the re module.
I cannot find documentation detailing the generator capabilities of file objects in Python; it seems to be one of those well-known secrets. Please feel free to edit and remove this paragraph if you know where in the Python docs this is detailed.

Python tool to generate a txt file by copying only directory/folder names but not the other file names

This is my code:
import os

filenames = os.listdir(".")
file = open("XML.txt", "w")
for filename in filenames:
    result = "<capltestcase name =\"" + filename + "\"\n"
    file.write(result)
    result = "title = \"" + filename + "\"\n"
    file.write(result)
    result = "/>\n"
    file.write(result)
file.close()
My question / help needed:
1) I want to add the standard text ""
to the generated txt file, but I can't add it; it gives syntax errors. Can somebody help with the code, please?
2) How can I copy just folder names from the directory instead of file names? With my code, it copies all the file names into the txt file.
Thank you, friends.
file.write("\\")
Use the escape character (\) to write special characters:
print("\\<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\\")
Rather than escaping all those double-quotes, why not embed your string inside single quotes instead? In Python (unlike many other languages) there is no difference between using single or double quotes, provided they are balanced (the same at each end).
If you need the backslashes in the text, then use a raw string:
file.write(r'"\<?xml version="1.0" encoding="iso-8859-1"?>\"')
That will preserve all the double quotes and the backslashes.
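For the second part of the question (folders only), a minimal sketch: filter os.listdir() with os.path.isdir() so only directories are written. The filename XML.txt and the element layout follow the question's code:

```python
import os

with open("XML.txt", "w") as out:
    for name in os.listdir("."):
        # os.path.isdir() is True only for directories, so plain
        # files in the listing are skipped.
        if os.path.isdir(name):
            out.write('<capltestcase name ="%s"\n' % name)
            out.write('title = "%s"\n' % name)
            out.write("/>\n")
```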

Handling Python Requests JSON Response

I'm trying to handle a JSON response to a Python Requests call to an API, in Python, a language I'm still learning.
Here's the structure of the sample returned JSON data:
{"sports":[{"searchtype":"seasonal", "sports":["'baseball','football','softball','soccer','summer','warm'","'hockey','curling','luge','snowshoe','winter','cold'"]}]}
Currently, I'm parsing and writing output to a file like this:
output = response.json()
results = output['sports'][0]['sports']
if results:
    with open(filename, "w") as fileout:
        fileout.write(pprint.pformat(results))
Giving me this as my file:
[u"'baseball','football','softball','soccer','summer','warm'",
"'hockey','curling','luge','snowshoe','winter','cold'"]
Since I'm basically creating double-quoted JSON arrays consisting of comma-separated strings, how can I manipulate the array to print only the comma-separated values I want? In this case, everything except the fifth column, which represents seasons.
[u"'baseball','football','softball','soccer','warm'",
"'hockey','curling','luge','snowshoe','cold'"]
Ultimately, I'd like to strip away the unicode too, since I have no non-ASCII characters. I currently do this manually after the fact with a language I'm more familiar with (AWK). My desired output is really:
'baseball','football','softball','soccer','warm'
'hockey','curling','luge','snowshoe','cold'
Your results is actually a list of strings. To get your desired output, you can do it like this, for example:
if results:
    with open(filename, "w") as fileout:
        for line in results:
            fileout.write(line + "\n")
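To also drop the fifth column (the season) that the question asks about: each entry is one comma-separated string, so split it, delete index 4, and rejoin before writing. A sketch using the sample data from the question, with sports.txt as a placeholder filename:

```python
results = ["'baseball','football','softball','soccer','summer','warm'",
           "'hockey','curling','luge','snowshoe','winter','cold'"]

with open("sports.txt", "w") as fileout:
    for line in results:
        fields = line.split(",")
        del fields[4]  # drop the fifth column (the season)
        fileout.write(",".join(fields) + "\n")
```

This produces plain str lines like 'baseball','football','softball','soccer','warm', which also takes care of the u"..." prefix, since that was only the repr of Python 2 unicode strings.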

lxml adds urlencoding in xml?

I'll preface this by indicating I'm using Python 2.7.3 (x64) on Windows 7, with lxml 2.3.6.
I have a little, odd problem I'm hoping somebody can help with. I haven't found a solution online; perhaps I'm not searching for the right thing.
Anyway, I have a problem where I'm programmatically building some XML with lxml, then outputting this to a text file. The problem is that lxml is converting carriage returns to the text &#13;, almost like URL encoding - but I'm not using HTML, I'm using XML.
For example, I have a simple text file created in Notepad, like this:
This
is
my
text
I then build some xml and add this text into the xml:
from lxml import etree

textstr = ""
fh = open("mytext.txt", "rb")
for line in fh:
    textstr += line
root = etree.Element("root")
a = etree.SubElement(root, "some_element")
a.text = textstr
print etree.tostring(root)
The problem here is the output of the print looks like this:
<root><some_element>This&#13;
is&#13;
my&#13;
text</some_element></root>
For my purposes the line breaks are fine, but the &#13; entities are not.
What I have been able to figure out is that this is happening because I'm opening the text file in binary mode "rb" (which I actually need to do, as my app is indexing a large text file). If I don't open the file in binary mode ("r" instead), then the output does not contain &#13; (but of course, then my indexing doesn't work).
I've also tried changing the etree.tostring to:
print etree.tostring(root, method="xml")
However there is no difference in the output.
Now, I CAN dump the XML text to a string and then replace the &#13; artifacts; however, I was hoping for a more elegant solution, because the text files I parse are not under my control and I'm worried that other parts of the text file might be converted to URL-style encoding without my knowledge.
Does anyone know a way of preventing this encoding from happening?
Windows uses \r\n to represent a line ending, Unix uses \n.
This will remove the \r at the end of the line, if there is one there (so the code will work with unix text files too.) It will remove at most one \r, so if there is an \r somewhere else in the line it will be preserved.
import re

textstr = ""
with open("mytext.txt", "rb") as fh:
    for line in fh:
        textstr += re.sub(r'\r$', '', line)
print(repr(textstr))
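Equivalently, the normalization can be done in one pass on the whole buffer instead of line by line. Like the regex above, replacing the two-byte sequence \r\n only touches carriage returns that form a Windows line ending, leaving any lone \r elsewhere in the data intact. A sketch on inline sample bytes standing in for fh.read() from binary mode:

```python
data = b"This\r\nis\r\nmy\r\ntext"  # e.g. fh.read() from a file opened "rb"

# Collapse Windows line endings before handing the text to lxml,
# so there are no \r characters left for it to serialize as &#13;.
textstr = data.replace(b"\r\n", b"\n").decode("utf-8")
print(repr(textstr))  # no carriage returns remain
```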
