I'm trying to extract the data from the mouseovers on the map at the bottom of this webpage, which shows which planes are currently in the air fighting bush fires: http://dsewebapps.dse.vic.gov.au/fires/updates/report/aircraft/aircraftlist.htm
I can extract the beginning and end of the map, and I can also extract the areas. For example, this is the code I have tried and the results.
from bs4 import BeautifulSoup
import urllib2
url = "http://dsewebapps.dse.vic.gov.au/fires/updates/report/aircraft/aircraftlist.htm"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
#find the map at the bottom of the page with all the coordinates in it
findmap = soup.find_all("area")
print(findmap[1])
This code prints just one of the planes, which is fine here:
<area coords="87,369,6" href="javascript:void(0)" onmouseout=" return nd()" onmouseover="return overlib('<p class=PopupText><STRONG>BOMBER 352</STRONG><br><STRONG>Last Observation: </STRONG>Feb 10 2014 10:26AM<br><STRONG>Speed: </STRONG>0 Knots<br><STRONG>Course: </STRONG>0 Deg True<br><STRONG>Latitude: </STRONG>-37.6074 <STRONG>Longitude: </STRONG>141.362 <br><br><STRONG>Bomber 362</STRONG><br><STRONG>Last Observation: </STRONG>Feb 10 2014 10:29AM<br><STRONG>Speed: </STRONG>0 Knots<br><STRONG>Course: </STRONG>0 Deg True<br><STRONG>Latitude: </STRONG>-37.6072 <STRONG>Longitude: </STRONG>141.362 <br></p>',ABOVE)" shape="circle"></area>
I would ideally like to convert these paragraphs into JSON so I can feed them into something else. Am I better off writing a lot of regexes, or can BeautifulSoup parse this data into JSON? From what I have read it can't, because of the JavaScript. Or is there another option?
Thanks.
You can do it using BeautifulSoup.
The example below follows this algorithm:
Iterate over all <area> elements
Use the coords attribute as the index to store each area's data in the result dictionary
Parse the onmouseover attribute of the <area> elements using the following rules:
The HTML to parse starts after the return overlib(' string and ends before the ',ABOVE string
Every plane record starts with the plane's name enclosed in a <strong> element, followed by a non-text element (<p/> in this case, but I test it as element.name != None), followed by another <strong> element
Below is my sample code:
from bs4 import BeautifulSoup
import urllib2
import pprint
pp = pprint.PrettyPrinter(indent=4)
url = "http://dsewebapps.dse.vic.gov.au/fires/updates/report/aircraft/aircraftlist.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
fields = ['Last Observation', 'Speed', 'Course', 'Latitude', 'Longitude']
areas = dict()
for area in soup.find_all("area"):
    area_coords = area.get('coords')
    print area_coords
    data_soup = BeautifulSoup(area.get('onmouseover')[len("return overlib('"):
                                                      -len("',ABOVE")])
    planes = list()
    elements = data_soup.find_all('p')[0].contents
    for i in range(len(elements) - 2):
        if elements[i].name == 'strong' and \
           elements[i+1].name and \
           elements[i+2].name == 'strong':
            plane = dict()
            plane[u'Name'] = elements[i].contents[0]
            planes.append(plane)
        if hasattr(elements[i], 'contents') and len(elements[i].contents) > 0:
            field_name = elements[i].contents[0].strip(' :')
            if field_name in fields:
                plane[field_name] = elements[i+1]
    areas[area_coords] = planes
pp.pprint(areas)
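Since the original goal was JSON, the areas dictionary built above can be fed straight to the standard json module, provided the stored values are plain strings (elements[i+1] may be a NavigableString; converting it with unicode()/str() first is safest). A minimal sketch with made-up data in the same shape:

```python
import json

# Same shape as the `areas` dict built above, with made-up values;
# the real values come out of the scraping loop.
areas = {
    "87,369,6": [
        {"Name": "BOMBER 352",
         "Last Observation": "Feb 10 2014 10:26AM",
         "Speed": "0 Knots",
         "Course": "0 Deg True",
         "Latitude": "-37.6074",
         "Longitude": "141.362"},
    ]
}

as_json = json.dumps(areas, indent=2)
print(as_json)
```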
Using lxml might be a little better than regex...
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> t1 = etree.parse(url, parser)
# use xpath to extract onmouseover
>>> o = t1.xpath('//area[2]/@onmouseover')[0]
# delete javascript function call from both sides, yep, that's the ugly part...
>>> h = o[len("return overlib('"):-len("',ABOVE)")]
>>> t2 = etree.fromstring(h, parser)
# note the [1:] to remove first unwanted strong tag
# also note the use of zip here
>>> {k:v for k,v in zip(t2.xpath('//strong/text()')[1:], t2.xpath('//p/text()'))}
{'Latitude: ': '-34.232 ', 'Last Observation: ': 'Feb 9 2014 6:36PM',
'Speed: ': '3 Knots', 'Course: ': '337 Deg True', 'Longitude: ': '142.086 '}
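The one-liner above grabs only the second <area>; looping over every area and reusing the same slicing gives the full picture. A self-contained sketch against a minimal stand-in for the page's markup (the overlib(...) boundaries are the same assumption as above):

```python
from lxml import etree

# Offline stand-in: one <area> shaped like the live page's markup.
html = ('<map><area coords="1,2,3" onmouseover="return overlib(\'<p>'
        '<strong>Speed: </strong>3 Knots<strong>Course: </strong>337 Deg True'
        '</p>\',ABOVE)"/></map>')

parser = etree.HTMLParser()
root = etree.fromstring(html, parser)

result = {}
for area in root.xpath('//area'):
    o = area.get('onmouseover')
    # strip the javascript call from both sides, as before
    h = o[len("return overlib('"):-len("',ABOVE)")]
    t2 = etree.fromstring(h, parser)
    # zip the <strong> labels with the text nodes that follow them
    result[area.get('coords')] = dict(
        zip(t2.xpath('//strong/text()'), t2.xpath('//p/text()')))

print(result)
```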
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
I've been trying my best to access ns1:AreaId (which is 10YDK-1--------W) through ns1:AffectedAreas by using B = soup.find('ns1:area') and then B.next_element, but all I get is an empty string.
Try this method:
import bs4
import re
data = """
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
"""
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
bs = bs4.BeautifulSoup(data, "html.parser")
areaid = bs.find_all('ns1:areaid')
print((striphtml(str(areaid))))
Here, the striphtml function removes all the tags enclosed in <>. So the output will be:
[10YDK-1--------W]
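As a side note, the regex strip isn't strictly required: each tag found by BeautifulSoup already exposes its inner text through .text, which sidesteps the string conversion altogether (a sketch using the same sample):

```python
import bs4

data = """
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
"""

bs = bs4.BeautifulSoup(data, "html.parser")
# html.parser lowercases tag names, hence the lower-case search
area_ids = [tag.text for tag in bs.find_all('ns1:areaid')]
print(area_ids)  # ['10YDK-1--------W']
```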
If you have defined namespaces in your HTML/XML document, you can use the xml parser and CSS selectors.
For example:
txt = '''<root xmlns:ns1="some namespace">
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
</root>'''
soup = BeautifulSoup(txt, 'xml')
area_id = soup.select_one('ns1|AffectedAreas ns1|AreaId').text
print(area_id)
Prints:
10YDK-1--------W
You can iterate over the children of soup.find('ns1:area') to find the ns1:areaid tag and then get its text.
for i in soup.find('ns1:area').children:
    if i.name == "ns1:areaid":
        b = i.text
        print(b)
And starting from ns1:AffectedAreas it will look like:
for i in soup.find_all('ns1:AffectedAreas'.lower()):
    for child in i.children:
        if child.name == "ns1:area":
            for y in child.children:
                if y.name == "ns1:areaid":
                    print(y.text)
Or search for the tag ns1:AreaId in lower case and get its text. This way you can get the text values from all ns1:AreaId tags:
soup.find_all("ns1:AreaId".lower())[0].text
Both cases will output
"10YDK-1--------W"
Another method.
from simplified_scrapy import SimplifiedDoc, req, utils
html = '''
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
<ns1:Area>
<ns1:AreaId>10YDK-2--------W</ns1:AreaId>
<ns1:AreaName>DK2</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
'''
doc = SimplifiedDoc(html)
AffectedArea = doc.select('ns1:AffectedAreas')
Areas = AffectedArea.selects('ns1:Area')
AreaIds = Areas.select('ns1:AreaId').html
print (AreaIds)
# or
# print (doc.select('ns1:AffectedAreas').selects('ns1:Area').select('ns1:AreaId').html)
Result:
['10YDK-1--------W', '10YDK-2--------W']
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
I am trying to extract paragraph elements from a Wikitravel page, under the ID 'See', into a list.
Using:
import bs4
import requests
response = requests.get("https://wikitravel.org/en/Bhopal")
if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')
    plot = []
    # find the node with id of "See"
    mark = html.find(id="See")
    # walk through the siblings of the parent (H2) node
    # until we reach the next H2 node
    for elt in mark.parent.nextSiblingGenerator():
        if elt.name == "h2":
            break
        if hasattr(elt, "text"):
            plot.append(elt.text)
Now I want to extract only the paragraphs which contain a bold element. How can I achieve this?
Is this what you are looking for?
I added a few lines to your code. I used the lxml parser (the html parser works as well).
from bs4 import BeautifulSoup as bs
import lxml
import ssl
import requests
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://wikitravel.org/en/Bhopal'
content = requests.get('https://wikitravel.org/en/Bhopal').text
soup = bs(content, 'lxml')
plot =[]
mark = soup.find(id="See")
# walk through the siblings of the parent (H2) node
# until we reach the next H2 node
for elt in mark.parent.next_siblings:
    if elt.name == "h2":
        break
    if hasattr(elt, "text") and elt.find('b'):
        plot.append(elt.text)

print(*plot, sep='\n')  # just to print the list in a readable way
First few lines of the output in my Jupyter notebook:
The script used to work, but it no longer does and I can't figure out why. I am trying to go to the link and extract/print the religion field. Using Firebug, the religion field entry is within the 'tbody' then 'td' tag structure. But now the script finds None when searching for these tags. I also looked at the parsed output via print Soup_FamSearch and couldn't see any of the 'tbody' and 'td' tags that appeared in Firebug.
Please let me know what I am missing.
import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time
from unicodedata import normalize
FamSearchURL = 'https://familysearch.org/pal:/MM9.1.1/KH21-211'
OpenFamSearchURL = urllib2.urlopen(FamSearchURL)
Soup_FamSearch = BeautifulSoup(OpenFamSearchURL, 'lxml')
OpenFamSearchURL.close()
tbodyTags = Soup_FamSearch.find('tbody')
trTags = tbodyTags.find_all('tr', class_='result-item ')

for trTag in trTags:
    tdTags_label = trTag.find('td', class_='result-label ')
    if tdTags_label:
        tdTags_label_string = tdTags_label.get_text(strip=True)
        if tdTags_label_string == 'Religion: ':
            print trTag.find('td', class_='result-value ')
Find the Religion: label by text and get the next td sibling:
soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> response = requests.get('https://familysearch.org/pal:/MM9.1.1/KH21-211')
>>> soup = BeautifulSoup(response.content, 'lxml')
>>>
>>> soup.find(text='Religion:').parent.find_next_sibling('td').get_text(strip=True)
Methodist
Then, you can make a nice reusable function:
def get_field_value(soup, field):
    return soup.find(text='%s:' % field).parent.find_next_sibling('td').get_text(strip=True)
print get_field_value(soup, 'Religion')
print get_field_value(soup, 'Nationality')
print get_field_value(soup, 'Birthplace')
Prints:
Methodist
Canadian
Ontario
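One caveat (my addition, not from the original answer): soup.find returns None when the label text is absent, so the chained call raises AttributeError for missing fields. A defensive variant might look like this, demoed against a tiny stand-in for the real page's table:

```python
from bs4 import BeautifulSoup

def get_field_value(soup, field):
    # Returns None instead of raising when the label or value cell is missing.
    label = soup.find(text='%s:' % field)
    if label is None:
        return None
    cell = label.parent.find_next_sibling('td')
    return cell.get_text(strip=True) if cell else None

# Tiny stand-in for the real page's table structure.
html = '<table><tr><td>Religion:</td><td>Methodist</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
print(get_field_value(soup, 'Religion'))     # Methodist
print(get_field_value(soup, 'Nationality'))  # None
```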
I am parsing HTML from the following website: http://www.asusparts.eu/partfinder/Asus/All In One/E Series. I was just wondering if there is any way I could explore a parsed attribute in Python?
For example, the code below outputs the following:
datas = s.find(id='accordion')
a = datas.findAll('a')

for data in a:
    if(data.has_attr('onclick')):
        model_info.append(data['onclick'])
        print data
[OUTPUT]
Bracket
These are the values I would like to retrieve:
nCategoryID = Bracket
nModelID = ET10B
family = E Series
As the page is rendered via AJAX, they are using a script source, resulting in the following URL in the script file:
url = 'http://json.zandparts.com/api/category/GetCategories/' + country + '/' + currency + '/' + nModelID + '/' + family + '/' + nCategoryID + '/' + brandName + '/' + null
How can I retrieve only the 3 values listed above?
[EDIT]
import string, urllib2, urlparse, csv, sys
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval

changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

#Array to hold all options
redirects = []
#Array to hold all data
model_info = []

print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
#print select.get_text()
options = select.findAll('option')

for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s

    print "FETCHING MAIN TITLE"
    #Finding all the headings for each specific Model
    maintitle = s.find(id='puffBreadCrumbs')
    print maintitle.get_text()

    #Find entire HTML container holding all data, rendered by AJAX
    datas = s.find(id='accordion')
    #Find all 'a' tags inside data container
    a = datas.findAll('a')
    #Find all 'span' tags inside data container
    content = datas.findAll('span')

    print "FETCHING CATEGORY"
    #Find all 'a' tags which have an attribute of 'onclick'
    #Error: (doesn't display anything, can't seem to find 'onclick' attr)
    if(hasattr(a, 'onclick')):
        arguments = literal_eval('(' + a['onclick'].replace(', this', '').split('(', 1)[1])
        model_info.append(arguments)
        print arguments #arguments[1] + " " + arguments[3] + " " + arguments[4]

    print "FETCHING DATA"
    for complete in content:
        #Find all 'class' attributes inside 'span' tags
        if(complete.has_attr('class')):
            model_info.append(complete['class'])
            print complete.get_text()

    #Find all 'table data cells' inside table held in data container
    print "FETCHING IMAGES"
    img = s.find('td')
    #Find all 'img' tags held inside these 'td' cells and print out
    images = img.findAll('img')
    print images
I have added an Error comment where the problem lies...
Similar to Martijn's answer, but this makes primitive use of pyparsing (i.e., it could be refined to recognise the function and only take quoted strings within the parentheses):
from bs4 import BeautifulSoup
from pyparsing import QuotedString
from itertools import chain
s = '''<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>'''
soup = BeautifulSoup(s)
for a in soup('a', onclick=True):
print list(chain.from_iterable(QuotedString("'", unquoteResults=True).searchString(a['onclick'])))
# ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
You could parse that as a Python literal if you remove the this argument from it and take only everything between the parentheses:
from ast import literal_eval
if data.has_attr('onclick'):
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    print arguments
We remove the this argument because it is not a valid Python literal and you don't want it anyway.
Demo:
>>> literal_eval('(' + "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')".replace(', this', '').split('(', 1)[1])
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
Now you have a Python tuple and can pick out any value you like.
You want the values at indices 1, 2 and 4, for example:
nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
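To then hit the AJAX endpoint from the question's URL template, the extracted values can be substituted in. Note that country, currency and brand below are hypothetical placeholders (they are not part of the scraped data), and parts containing spaces need percent-encoding:

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3

arguments = ('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]

# Hypothetical values -- the real ones would come from elsewhere on the site.
country, currency, brand = 'EU', 'EUR', 'Asus'

url = ('http://json.zandparts.com/api/category/GetCategories/'
       + '/'.join(quote(part) for part in
                  [country, currency, nModelID, family, nCategoryID, brand, 'null']))
print(url)
```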
html = """
...
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
...
"""
I want to get all text from the starting <big> tag up to before the first occurrence of an <a> tag. In this example, that means I must get (iterable) as a string.
An iterative approach.
from BeautifulSoup import BeautifulSoup as bs
from itertools import takewhile, chain
def get_text(html, from_tag, until_tag):
    soup = bs(html)
    for big in soup(from_tag):
        until = big.findNext(until_tag)
        strings = (node for node in big.nextSiblingGenerator()
                   if getattr(node, 'text', '').strip())
        selected = takewhile(lambda node: node != until, strings)
        try:
            yield ''.join(getattr(node, 'text', '')
                          for node in chain([big, next(selected)], selected))
        except StopIteration:
            pass

for text in get_text(html, 'big', 'a'):
    print text
I would avoid nextSibling, as from your question, you want to include everything up until the next <a>, regardless of whether that is in a sibling, parent or child element.
Therefore I think the best approach is to find the node that is the next <a> element and loop recursively until then, adding each string as encountered. You may need to tidy up the below if your HTML is vastly different from the sample, but something like this should work:
from bs4 import BeautifulSoup
#by taking the `html` variable from the question.
html = BeautifulSoup(html)
firstBigTag = html.find_all('big')[0]
nextATag = firstBigTag.find_next('a')
def loopUntilA(text, firstElement):
    text += firstElement.string
    if firstElement.next.next == nextATag:
        return text
    else:
        # using double next to skip the string nodes themselves
        return loopUntilA(text, firstElement.next.next)

targetString = loopUntilA('', firstBigTag)
print targetString
You can do it like this:
from BeautifulSoup import BeautifulSoup
html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="test" title="Permalink to this definition"></a>
"""
soup = BeautifulSoup(html)
print soup.find('big').nextSibling.next.text
For details, check DOM traversal with BeautifulSoup here.
>>> from BeautifulSoup import BeautifulSoup as bs
>>> parsed = bs(html)
>>> txt = []
>>> for i in parsed.findAll('big'):
... txt.append(i.text)
... if i.nextSibling.name != u'a':
... txt.append(i.nextSibling.text)
...
>>> ''.join(txt)
u'(iterable)'
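The snippet above uses the old BeautifulSoup 3 import; for what it's worth, roughly the same idea works in bs4, with an explicit step to skip the whitespace text nodes between tags (a sketch, only checked against the sample markup):

```python
from bs4 import BeautifulSoup

html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="test" title="Permalink to this definition"></a>
"""

parsed = BeautifulSoup(html, "html.parser")
txt = []
for big in parsed.find_all("big"):
    txt.append(big.text)
    sib = big.next_sibling
    # skip the whitespace-only text nodes between tags
    while sib is not None and getattr(sib, "name", None) is None:
        sib = sib.next_sibling
    if sib is not None and sib.name != "a":
        txt.append(sib.text)

print("".join(txt))  # (iterable)
```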