I have a bunch of XML files dumped to disk in batches.
When I tried to prase them I found that some hade a control character inserted into an attribute.
It looked like this:
<root ^KIND="A"></root>
When it was supposed to look like this:
<root KIND="A"></root>
Now in this case it was easily fixed, just some regexp magic:
import re
xml = re.sub(r'<([^>]*)\v([^>]*)>', r'<\1K\2>', xml)
But then the requirements changed, I had to dump the docs out to disk, individually.
Naturally I raw the substitution before saving so i wouldn't have that problem again.
There are alot of these documents you see, many millions...
And so, I was getting ready to extract some data from them again.
This time however I got a new error:
<root KIND="A"><CLASSIFICATION></CLASSIFICATIO^N></root>
When it was supposed to look like this:
<root KIND="A"><CLASSIFICATION></CLASSIFICATION></root>
I am not sure why I keep getting these errors not why its always 'ctrl-characters` that are inserted. It migth be that its pure luck so far.
The regexp I used in hte first case wont wore in general, ^K translates to vertical tab so I could match agains that. But is there some what to filter out any ctrl-character?
Try using a translate table to get rid of ctrl-A through ctrl-Z:
in_chars = ''.join([chr(x) for x in range(1, 27)])
out_chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
tr_table = str.maketrans(in_chars, out_chars)
# pass all strings through the translate table:
x = input('Enter text: ')
print(x.translate(tr_table))
Prints:
Enter text: abc^Kdef
abcKdef
Related
I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, lets take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimentionIndexMap, so i can associate the IndexMap numbers to the variationValues one.
Another person in the site (thanks for the help btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_frist()
import re
# capture everything between {}
data = re.findall(script, '(\{.+?\}_')
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is it there a way to get, from that data, 2 dictionaries or lists with the variationValues and asinToDimentionIndexMap? (maybe using some regular expressions in the middle to get some data out of a big string). Or explain a little bit what happens with the json part.
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things a challenge is if there are any errors when accessing a particular "box" inside you json object.
Your code format looks correct, but your access within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slash a little bit so you can better understand the structure of your particular json file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in an look at structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to json as use them combine as you wish.
I'm kinda new to Python. And I'm trying to find out how to do parsing in Python?
I've got a task: to do parsing with some piece of unknown for me symbols and put it to DB. I guess I can create DB and tables with help of SQLAlchemy, but I have no idea how to do parsing and what all these symbols below mean?
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance those who can give me some advices what to start from and how the parsing is going on?
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
As other suggest take a look to some regex tutorials, and also re module help.
Probably you're looking to something like this:
import re
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
'batchDate': (22,29), 'batchTime': (30,35)}
poaBatchHeaders = re.findall('\$\$HDR\d{30}', text)
parsedBatchHeaders = []
batchHeaderDict = {}
for poaHeader in poaBatchHeaders:
for key in headerMapping:
start = headerMapping[key][0]-1
end = headerMapping[key][1]
batchHeaderDict.update({key: poaHeader[start:end]})
parsedBatchHeaders.append(batchHeaderDict)
Then you have list with dicts, each dict contains data for each attribute. I assume that you have your datafile in text which is string. Each dict is made for one found structure (POA Batch Header in example).
If you want to parse it further, you have to made a function to parse each date in each attribute.
def batchDate(batch):
return (batch[0:2]+'-'+batch[2:4]+'-20'+batch[4:])
for header in parsedBatchHeaders:
header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, that's an example and I don't know documentation of your data! I guess it works like that, but rest is up to you.
I am trying to learn to use Whoosh. I have a large collection of html documents I want to search. I discovered that the text_content() method creates some interesting problems for example I might have some text that is organized in a table that looks like
<html><table><tr><td>banana</td><td>republic</td></tr><tr><td>stateless</td><td>person</td></table></html>
When I take the original string and and get the tree and then use text_content to get the text in the following manner
mytree = html.fromstring(myString)
text = mytree.text_content()
The results have no spaces (as should be expected)
'bananarepublicstatelessperson'
I tried to insert new lines using string.replace()
myString = myString.replace('</tr>','</tr>\n')
I confirmed that the new line was present
'<html><table><tr><td>banana</td><td>republic</td></tr>\n<tr><td>stateless</td><td>person</td></table></html>'
but when I run the same code from above the line feeds are not present. Thus the resulting text_content() looks just like above.
This is a problem from me because I need to be able to separate words, I thought I could add non-breaking spaces after each td and line breaks after rows as well asd line breaks after body elements etc to get text that reasonably conforms to my original source.
I will note that I did some more testing and found that line breaks inserted after paragraph tag closes were preserved. But there is a lot of text in the tables that I need to be able to search.
Thanks for any assistance
You could use this solution:
import re
def striphtml(data):
p = re.compile(r'<.*?>')
return p.sub('', data)
>>> striphtml('I Want This <b>text!</b>')
>>> 'I Want This text!'
Found here: using python, Remove HTML tags/formatting from a string
I am parsing an XML file with the minidom parser, where I'm iterating over the XML and output specific information that stands between the tags into a dictionary.
Like this:
d={}
dom = parseString(data)
macro=dom.getElementsByTagName('macro')
for node in macro:
d={}
id_name=node.getElementsByTagName('id')[0].toxml()
id_data=id_name.replace('<id>','').replace('</id>','')
print (id_data)
cl_name=node.getElementsByTagName('cl')[1].toxml()
cl_data=cl_name.replace('<cl>','').replace('</cl>','')
print (cl_data)
d_source[id_data]=(cl_data)
Now, my problem is that the data where I'm looking for in cl_name=node.getElementsByTagName('cl')[1].toxml() is sometimes non-existent!
In this case the part of the XML looks like this:
<cl>blabla</cl>
<cl></cl>
Because of this I receive an "index is out of range"-error.
However, I really need this "nothing" in my dictionary. My dictionary should look like this:
d={blabla:'',xyz:'abc'}
I have to look for the empty text node, which I tried by doing this:
if node.getElementsByTagName('cl')[1].toxml is None:
print ('')
else:
cl_name=node.getElementsByTagName('cl')[1].toxml()
cl_data=cl_name.replace('<cl>','').replace('</cl>','')
print (cl_data)
d_target[id_data]=(cl_data)
print(d_target)
I still receive that indexing error...I also thought about inserting a white space into the original source file, but am not sure if this would solve the issue. Any ideas?
If the minidom is not dictated somehow, I suggest to change your mind and use the standard xml.etree.ElementTree. It is much easier.
I figured out it's working when adding a white space into the original source file. This looks a bit messy though. So if anyone has a better idea, I'm looking forward to it!
Im working on a data packet retrieval system which will take a packet, and process the various parts of the packet, based on a system of tags [similar to HTML tags].
[text based files only, no binary files].
Each part of the packet is contained between two identical tags, and here is a sample packet:
"<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
The entire packet is contained within the <PACKET><PACKET> tags.
All meta-data is contained within the <HEAD><HEAD> tags and the filename from which the packet is part of is contained within the, you guessed it, the <FILENAME><FILENAME> tags.
Lets say, for example, a single packet is received and stored in a temporary string variable called sTemp.
How do you efficiently retrieve, for example, only the contents of a single pair of tags, for example the contents of the <FILENAME><FILENAME> tags?
I was hoping for such functionality as saying getTagFILENAME( packetX ), which would return the textual string contents of the <FILENAME><FILENAME> tags of the packet.
Is this possible using Python?
Any suggestions or comments appreciated.
If the packet format effectively uses XML-looking syntax (i.e., if the "closing tags" actually include a slash), the xml.etree.ElementTree could be used.
This libray is part of Python Standard Library, starting in Py2.5. I find it a very convenient one to deal with this kind of data. It provides many ways to read and to modify this kind of tree structure. Thanks to the generic nature of XML languages and to the XML awareness built-in the ElementTree library, the packet syntax could evolve easily for example to support repeating elements, element attributes.
Example:
>>> import xml.etree.ElementTree
>>> myPacket = '<PACKET><HEAD><ID>123</ID><SEQ>1</SEQ><FILENAME>Test99.txt</FILE
NAME></HEAD><DATA>spam and cheese</DATA></PACKET>'
>>> xt = xml.etree.ElementTree.fromstring(myPacket)
>>> wrk_ele = xt.find('HEAD/FILENAME')
>>> wrk_ele.text
'Test99.txt'
>>>
Something like this?
import re
def getPacketContent ( code, packetName ):
match = re.search( '<' + packetName + '>(.*?)<' + packetName + '>', code )
return match.group( 1 ) if match else ''
# usage
code = "<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
print( getPacketContent( code, 'HEAD' ) )
print( getPacketContent( code, 'SEQ' ) )
As mjv points out, there's not the least sense in inventing an XML-like format if you can just use XML.
But: If you're going to use XML for your packet format, you need to really use XML for it. You should use an XML library to create your packets, not just to parse them. Otherwise you will come to grief the first time one of your field values contains an XML markup character.
You can, of course, write your own code to do the necessary escaping, filter out illegal characters, guarantee well-formedness, etc. For a format this simple, that may be all you need to do. But going down that path is a way to learn things about XML that you perhaps would rather not have to learn.
If using an XML library to create your packets is a problem, you're probably better off defining a custom format (and I'd define one that didn't look anything like XML, to keep people from getting ideas) and building a parser for it using pyparsing.