Quick question: I can create/parse a chunk of HTML using libxml2dom, etc.
However, is there a way to display the XPath used to generate/extract a given HTML chunk? I'm assuming there's some method/way of doing this that I can't find.
ex:
import libxml2dom

# s holds the HTML document as a string
d = libxml2dom.parseString(s, html=1)

hdr = "//div[3]/table[1]/tr/th"
thdr_ = d.xpath(hdr)
print "len =", len(thdr_)
At this point, thdr_ is a list of node objects, each of which points to a chunk of HTML (if you will).
I'm trying to figure out if there's a way to get the XPath for, say, the thdr_[x] element/item of the list...
ie:
thdr_[0]=//div[3]/table[1]/tr[1]/th
thdr_[1]=//div[3]/table[1]/tr[2]/th
thdr_[2]=//div[3]/table[1]/tr[3]/th
.
.
.
any thoughts/comments..
thanks
-tom
I did this by iterating each node and comparing the textContent with my expected text. For fuzzy comparisons I used the SequenceMatcher class from difflib.
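A minimal sketch of that approach against the question's thdr_ list (the helper name and the 0.8 cutoff are my own assumptions, as is the use of textContent on the nodes):

from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.8):
    # ratio() returns a similarity score between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio() >= threshold

# report which matched nodes carry text close to what we expect
for i, node in enumerate(thdr_):
    if fuzzy_match(node.textContent, "Expected header text"):
        print "match at index", i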
I need to filter a rather long (but very regular) set of .html files to modify a few constructs, but only when they appear in text elements.
One good example is changing <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his “good” side! He has <i>none</i>!<div></p>.
I can easily parse my files with html.parser, but it's unclear how to generate the result file, which should be as similar to the input as possible (no reformatting).
I had a look at beautiful-soup, but it really seems too big for this (supposedly?) simple task.
Note: I do not need/want to serve the .html files to a browser of any kind; I just need them updated (possibly in place) with (slightly) changed content.
UPDATE:
Following @soundstripe's advice, I wrote the following code:
import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'“\1”', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)
Unfortunately, BeautifulSoup tries to be too smart for its (and my) own good:
b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to “find” his “good” side! He has <i>none</i>!<div></div></div></p>'
i.e.: it transforms a plain & into &amp;, thus breaking the &ldquo; entity (notice I'm working with bytearrays, not strings; is that relevant?).
How can I fix this?
I don't know why you wouldn't use BeautifulSoup. Here's an example that replaces your quotes like you're asking.
import re
import bs4

raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his “good” side! He has <i>none</i>!<div></p>"""

soup = bs4.BeautifulSoup(raw, features='html.parser')

def replace_quotes(s):
    return re.sub(r'"([^"]+)"', r'“\1”', s)

for e in list(soup.strings):
    # wrapping the new string in a BeautifulSoup() call to correctly parse entities
    new_string = bs4.BeautifulSoup(replace_quotes(e), features='html.parser')
    e.replace_with(new_string)

# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')

print(raw)
print(new)
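(A note on the design choice, as I understand it: formatter='html' makes Beautiful Soup convert characters that have named HTML entities, including the curly quotes, into entity references on output, which is exactly the piece the update's str(sp) version was missing.)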
I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, let's take this page:
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
This means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the variationValues size_name section) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimentionIndexMap so I can associate the IndexMap numbers with the variationValues ones.
Another person on the site (thanks for the help, btw) suggested doing it this way:
import re
import json

script = response.xpath('//script/text()').extract_first()

# capture everything between {}
data = re.findall(r'(\{.+?\})', script)

d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part: we get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some stuff about it didn't help that much.
Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimentionIndexMap? (Maybe using some regular expressions in the middle to get some data out of a big string.) Or could you explain a little bit what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close, Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming the challenge you're running into is errors when accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g., if your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below: on the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know to access what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined the general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON and combine them however you wish.
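As a rough sketch of that last step (the size_name and color_name keys are my assumptions, based on the question's description of the variationValues 'dictionary'):

import json

variation_values = json.loads(variationValues)
index_map = json.loads(asinToDimensionIndexMap)

# e.g. "B01KWIUH5M": [0, 0] -> size '8M US', color 'Teal'
for asin, (size_idx, color_idx) in index_map.items():
    print(asin, variation_values['size_name'][size_idx],
          variation_values['color_name'][color_idx])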
<graphiceditor>
<plot name="DS_Autobahn 1.Track: Curvature <78.4204 km>" type="CurvePlot">
<parent>DS_Autobahn 1</parent>
...
<curve name="" isXTopAxis="0" markerSize="8" symbol="-1"
<point x="19.986891478960015" y="-0.00020825890723451596"/>
<point ....
Hello, I want to open the .xml file, find "curve", and import the y-coordinates of the curve into a list. I know that "curve" has the index [16], so I am using this right now:
import xml.etree.ElementTree as ET

tree = ET.parse(file_name)
root = tree.getroot()
curvature = [float(i) for i in [x["y"] for x in [root[0][16][i].attrib for i in range(len(root[0][16]))]]]
But how do I do it if "curve" is not at the 16th position? How do I find "curve" in any XML file? I have been trying for several hours now, but I simply do not get it. Thank you very much in advance.
You could use XPath, for instance.
This would then essentially look like:
root.findall(xpath)
where your xpath would be './/curve' if you are just interested in all children of tag-type curve.
For more information regarding XPath, see w3schools.
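A minimal sketch of what that could look like for your file (reusing file_name from your snippet):

import xml.etree.ElementTree as ET

tree = ET.parse(file_name)
root = tree.getroot()

# './/curve' matches every <curve> element at any depth, so this works
# no matter where the curve sits in the tree
curvature = [float(point.attrib["y"])
             for curve in root.findall(".//curve")
             for point in curve.findall("point")]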
I recommend learning about Regular Expressions (more commonly referred to as Regex), I use them all the time for problems like this.
This is a good place to reference the different aspects of Regex:
Regex
Regex is a way to match text; it's a lot like if "substring" in string: except a million times more powerful. The entire purpose of regex is to find that "substring" even when you don't know exactly what it is.
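For instance (a tiny illustration of the difference, with made-up data):

import re

line = 'y="-0.00020825890723451596"'
print('y="' in line)                           # True, but tells you nothing about the value
print(re.search(r'y="(.*?)"', line).group(1))  # -0.00020825890723451596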
So let's take a closer look at your example in particular. The first thing to do is figure out exactly which rules need to be true in order to "match" the y value.
I don't know exactly how you are actually reading in your data, but I am reading it in as a single string.
string = '<graphiceditor>' \
'<plot name="DS_Autobahn 1.Track: Curvature <78.4204 km>" type="CurvePlot">' \
'<parent>DS_Autobahn 1</parent>' \
'<curve name="" isXTopAxis="0" markerSize="8" symbol="-1"' \
'<point x="19.986891478960015" y="-0.00020825890723451596"/>' \
'<point ....'
You can see I split the string into multiple lines to make it more readable. If you are reading it from a file with open(), make sure to remove the "\n" meta-characters or my regex won't work (not that you can't write regex that would!)
The first thing I want to do is find the curve tag, then I want to continue on to find the y= section, then grab just the number. Let's simplify that out into really defined steps:
Find where the curve section begins
Continue until the next y= section begins
Get the value from inside the quotes after the y= section.
Now for the regex. I could explain how exactly it works, but we would be here all day; go back to that doc I linked at the start and read up.
import re
string = "[see above]"
y_val = re.search('<curve.*?y="(.*?)"', string).group(1)
That's it! Cast your y_val to a float() and you are ready to go!
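And if you ever need every point's y value rather than just the first (an assumption about your needs), a findall variant on the same string would be:

y_vals = [float(m) for m in re.findall(r'y="(.*?)"', string)]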
Use an XML parser to parse XML; not regex.
As mentioned in another answer, I would also use XPath. If you need to use complex XPaths, I'd recommend lxml. In your example, though, ElementTree will suffice.
For example, this Python...
import xml.etree.ElementTree as ET
tree = ET.parse("file_name.xml")
root = tree.getroot()
curvature = [float(y) for y in [point.attrib["y"] for point in root.findall(".//curve/point[@y]")]]
print(curvature)
using this XML ("file_name.xml")...
<graphiceditor>
<plot name="DS_Autobahn 1.Track: Curvature &lt;78.4204 km&gt;" type="CurvePlot">
<parent>DS_Autobahn 1</parent>
<curve name="" isXTopAxis="0" markerSize="8" symbol="-1">
<point x="19.986891478960015" y="-0.00020825890723451596"/>
<point x="19.986891478960015" y="-0.00030825690983451678"/>
</curve>
</plot>
</graphiceditor>
will print...
[-0.00020825890723451596, -0.0003082569098345168]
Note: Notice the difference between the second y coordinate in the list and what's in the XML. That's because you're casting the value to a float. Consider casting to a decimal if you need to maintain precision.
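For instance, a minimal sketch of the Decimal variant (reusing root from the example above):

from decimal import Decimal

# Decimal parses the attribute text directly, keeping every digit from the source
curvature = [Decimal(point.attrib["y"]) for point in root.findall(".//curve/point[@y]")]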
I have a simple code like:
p = soup.find_all("p")
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I am trying to convert a list I obtained from XML into strings. I want to keep each element with its original tags so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 observations, so a recursion error occurs because of the str() call:
"RuntimeError: maximum recursion depth exceeded while calling a Python object"
I read that you can raise the maximum recursion limit, but it's not wise to do so. My next idea was to split the conversion into batches of 500, but I am sure there has to be a better way to do this. Does anyone have any advice?
The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.
Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:
# boot out the last `<document>`, which contains the binary data
soup.find_all('document')[-1].extract()

p = soup.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I believe the issue is that the BeautifulSoup object p is not built iteratively, so the method call limit is reached before you can finish constructing p = soup.find_all('p'). Note that a RecursionError is similarly thrown when building soup.prettify().
For my solution, I used the re module to gather all <p>...</p> tags (see code below). My final result was len(p) = 5571. This count is lower than yours because the regex conditions did not match any text within the binary graphic data.
import re
from urllib.request import urlopen

url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
response = urlopen(url).read()

# two capture groups, so findall returns (inner_text, last_char) tuples;
# the first element of each tuple is the tag's inner text
p = re.findall(r'<P((.|\s)+?)</P>', str(response))

paragraphs = []
for x in p:
    paragraphs.append(str(x))
I have been trying to strip some data out of HTML files. I have the logic coded to get the right cells; now I am struggling to get the actual contents of the cell. Here is my HTML snippet:
headerRows[0][10].contents
[<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3">
</font></font></font>]
Note that this is an item from a Python list ([]).
I need the value Apples Produced but can't get to it.
Any suggestions would be appreciated.
Suggestions on a good book that explains this would earn my eternal gratitude
Thanks for that answer. However, isn't there a more general answer? What happens if my cell doesn't have a bold attribute?
Say it is:
[<font size="+0"><font face="serif" size="1"><I>Apples Produced</I><font size="3">
</font></font></font>]
Apples Produced
I am trying to learn to read/understand the documentation, and your response will help.
I really appreciate this help. The best thing about these answers is that it is a lot easier to generalize from them than it has been from the BeautifulSoup documentation. I learned to program in the Fortran era, and now I am learning Python; I am amazed at its power, and BeautifulSoup is an example. Making a coherent whole of the documentation is tough for me.
Cheers
The BeautifulSoup documentation should cover everything you need - in this case it looks like you want to use findNext:
headerRows[0][10].findNext('b').string
A more generic solution which doesn't rely on the <b> tag would be to use the text argument to findAll, which allows you to search only for NavigableString objects:
>>> s = BeautifulSoup(u'<p>Test 1 <span>More</span> Test 2</p>')
>>> u''.join([s.string for s in s.findAll(text=True)])
u'Test 1 More Test 2'
headerRows[0][10].contents[0].find('b').string
I have a base class with which I extend all Beautiful Soup classes; it adds a bunch of methods that help me get at text within a group of elements whose structure I don't necessarily want to rely on. One of those methods is the following:
import re
from types import StringType

def clean(self, val):
    if type(val) is not StringType:
        val = str(val)
    val = re.sub(r'<.*?>', '', val)  # remove tags
    val = re.sub(r'\s+', ' ', val)   # collapse internal whitespace
    return val.strip()               # remove leading & trailing whitespace
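Applied to the question's snippet, a call like self.clean(headerRows[0][10]) (assuming the method lives on your extended class) should come back as just 'Apples Produced'.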