I am parsing HTML from the following website: http://www.asusparts.eu/partfinder/Asus/All In One/E Series. I was just wondering if there is any way I could explore a parsed attribute in Python?
For example, the code below outputs the following:

datas = s.find(id='accordion')
a = datas.findAll('a')
for data in a:
    if(data.has_attr('onclick')):
        model_info.append(data['onclick'])
        print data

[OUTPUT]

<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>
These are the values I would like to retrieve:
nCategoryID = Bracket
nModelID = ET10B
family = E Series
As the page is rendered from AJAX, they are using a script source, resulting in the following URL from the script file:
url = 'http://json.zandparts.com/api/category/GetCategories/' + country + '/' + currency + '/' + nModelID + '/' + family + '/' + nCategoryID + '/' + brandName + '/' + null
How can I retrieve only the 3 values listed above?
[EDIT]
import string, urllib2, urlparse, csv, sys
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval

changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

# Array to hold all options
redirects = []
# Array to hold all data
model_info = []

print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
#print select.get_text()
options = select.findAll('option')

for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s

    print "FETCHING MAIN TITLE"
    # Finding all the headings for each specific Model
    maintitle = s.find(id='puffBreadCrumbs')
    print maintitle.get_text()

    # Find entire HTML container holding all data, rendered by AJAX
    datas = s.find(id='accordion')
    # Find all 'a' tags inside data container
    a = datas.findAll('a')
    # Find all 'span' tags inside data container
    content = datas.findAll('span')

    print "FETCHING CATEGORY"
    # Find all 'a' tags which have an attribute of 'onclick'
    # Error: doesn't display anything; can't seem to find the 'onclick' attr
    if(hasattr(a, 'onclick')):
        arguments = literal_eval('(' + a['onclick'].replace(', this', '').split('(', 1)[1])
        model_info.append(arguments)
        print arguments  # arguments[1] + " " + arguments[3] + " " + arguments[4]

    print "FETCHING DATA"
    for complete in content:
        # Find all 'class' attributes inside 'span' tags
        if(complete.has_attr('class')):
            model_info.append(complete['class'])
            print complete.get_text()

    # Find all 'table data cells' inside table held in data container
    print "FETCHING IMAGES"
    img = s.find('td')
    # Find all 'img' tags held inside these 'td' cells and print out
    images = img.findAll('img')
    print images
I have added an Error comment where the problem lies...
Similar to Martijn's answer, but this makes primitive use of pyparsing (i.e., it could be refined to recognise the function and only take quoted strings within the parentheses):
from bs4 import BeautifulSoup
from pyparsing import QuotedString
from itertools import chain

s = '''<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>'''

soup = BeautifulSoup(s)
for a in soup('a', onclick=True):
    print list(chain.from_iterable(QuotedString("'", unquoteResults=True).searchString(a['onclick'])))
    # ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
You could parse that as a Python literal, if you remove the ', this' part from it and only take everything between the parentheses:
from ast import literal_eval
if data.has_attr('onclick'):
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    print arguments
We remove the this argument because it is not a valid Python string literal, and you don't want it anyway.
Demo:
>>> literal_eval('(' + "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')".replace(', this', '').split('(', 1)[1])
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
Now you have a Python tuple and can pick out any value you like.
You want the values at indices 1, 2 and 4, for example:
nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
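Putting the pieces together with your original loop, a minimal sketch (assuming every onclick value follows the getProductsBasedOnCategoryID(...) pattern shown above) would be:

from ast import literal_eval

for data in datas.findAll('a', onclick=True):
    # Strip ', this' and evaluate the argument list as a Python tuple
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
    model_info.append((nCategoryID, nModelID, family))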
Python code:
import string
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/'
initial = list(string.ascii_lowercase)
initial_url = [url + i for i in initial]
html_initial = [urllib.request.urlopen(i).read() for i in initial_url]
soup_initial = [BeautifulSoup(i, 'html.parser') for i in html_initial]
tags_initial = [i('a') for i in soup_initial]
print(tags_initial[0][50])
Results example:
<a href="/players/a/abdursh01.html">Shareef Abdur-Rahim</a>
From the example above, I want to extract the name of the player, which is 'Shareef Abdur-Rahim', but I want to do it for all of the tags_initial lists.
Does anyone have an idea?
Could you modify your post by adding your code so that we can help you better?
Maybe this could help you:
name = soup.findAll(YOUR_SELECTOR)[0].string
UPDATE
import re
import string
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.basketball-reference.com/players/'
# Alphabet
initial = list(string.ascii_lowercase)
datas = []
# URLS
urls = [url + i for i in initial]
for url in urls:
    # Soup Object
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # Players link
    url_links = soup.findAll("a", href=re.compile("players"))
    for link in url_links:
        # Player name
        datas.append(link.string)

print("datas : ", datas)
Then, "datas" contains all the names of the players, but I advise you to do a little processing afterwards to remove some erroneous information like "..." or perhaps duplicates
There are probably better ways, but I'd do it like this:

html = '<a href="/teams/LAL/2021.html">Los Angeles Lakers</a>'
index = html.find("a href")
index = html.find(">", index) + 1
index_end = html.find("<", index)
print(html[index:index_end])
If you're using a scraper library it probably has a similar function built-in.
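For instance, BeautifulSoup (which the question already uses) can do the same extraction in one call; a small sketch using the example string from above:

from bs4 import BeautifulSoup

html = '<a href="/teams/LAL/2021.html">Los Angeles Lakers</a>'
print(BeautifulSoup(html, 'html.parser').a.get_text())  # Los Angeles Lakers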
I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a xenforo forum for all of the threads. So far so good, except for the fact that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Really, what I would ideally want to scrape is just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")

main_url = "http://example.com/forum/"

forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []

for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])

print('Fetched ' + str(len(forums)) + ' forums')

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink

print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the url to check for uniqueness is always going to be "/page-#...". If that is not the case this solution will not work.
Instead of using a list to store your urls you can use a set, which will only add unique values. Then in the url remove the last instance of "page" and anything that comes after it if it is in the format of "/page-#", where # is any number, before adding it to the set.
forums = set()

for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
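An equivalent way to normalize the URL, if you prefer one expression over the index arithmetic (a sketch under the same "/page-#" assumption):

import re

def normalize_thread_url(url):
    # Collapse 'threads/foo.13846/page-9' and friends down to 'threads/foo.13846/'
    return re.sub(r'/page-\d.*$', '/', url)

forums = set()
for link in forumLinks:
    if link.has_attr('href') and '.rss' not in link.attrs['href']:
        forums.add(normalize_thread_url(link.attrs['href']))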
I have some HTML-formatted text that I've got with BeautifulSoup. I'd like to convert all italic (tag i), bold (b) and links (a href) to Word format via the docx run command.
I can make a paragraph:
p = document.add_paragraph('text')
I can add the next sequence as bold/italic:
p.add_run('bold').bold = True
p.add_run('italic.').italic = True
Intuitively, I could find all particular tags (i.e. soup.find_all('i')), then watch indices and concatenate partial strings...
...but maybe there's a better, more elegant way?
I don't want libraries or solutions that just convert an HTML page to Word and save it. I want a little more control.
I got nowhere with a dictionary. Here is the code; it currently produces the wrong result (the raw HTML as plain text) rather than the desired formatted text:
from docx import Document
import os
from bs4 import BeautifulSoup
html = 'hi, I am <a href="http://example.com">link</a> this is some nice regular text. <i> oooh, but I am italic</i> ' \
       ' or I can be <b>bold</b> ' \
       ' or even <i><b>bold and italic</b></i>'
def get_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    tags = {}
    tags["i"] = soup.find_all("i")
    tags["b"] = soup.find_all("b")
    return tags
def make_test_word():
    document = Document()
    document.add_heading('Demo HTML', 0)

    soup = BeautifulSoup(html, "html.parser")
    p = document.add_paragraph(html)
    # p.add_run('bold').bold = True
    # p.add_run(' and some ')
    # p.add_run('italic.').italic = True

    file_name = "demo_html.docx"
    document.save(file_name)
    os.startfile(file_name)
make_test_word()
I just wrote a bit of code to convert the text from a tkinter Text widget over to a Word document, including any bold tags that the user can add. This isn't a complete solution for you, but it may help you start toward a working one. I think you're going to have to do some regex work to get the hyperlinks transferred to the Word document. Stacked formatting tags may also get tricky. I hope this helps:
from docx import Document

html = 'HTML string <b>here</b>.'
html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]

doc = Document()
p = doc.add_paragraph()
for run in html:
    if run.startswith('<b>'):
        # Slice off the tag; lstrip('<b>') would strip characters, not the prefix
        run = run[len('<b>'):]
        runner = p.add_run(run)
        runner.bold = True
    elif run.startswith('</b>'):
        run = run[len('</b>'):]
        runner = p.add_run(run)
    else:
        p.add_run(run)
doc.save('test.docx')
I came back to it and made it possible to parse out multiple formatting tags. This will keep a tally of what formatting tags are in play in a list. At each tag, a new run is created, and formatting for the run is set by the current tags in play.
from docx import Document
import re
import docx
from docx.shared import Pt
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_hyperlink(paragraph, text, url):
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id, )

    # Create a w:r element and a new w:rPr element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')

    # Join all the xml elements together and add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)

    # Create a new Run object and add the hyperlink into it
    r = paragraph.add_run()
    r._r.append(hyperlink)

    # A workaround for the lack of a hyperlink style (doesn't go purple after using the link)
    # Delete this if using a template that has the hyperlink style in it
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True

    return hyperlink

html = '<H1>I want to</H1> <u>convert HTML to docx in <b>bold and <i>bold italic</i></b>.</u>'
html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]
tags = []

doc = Document()
p = doc.add_paragraph()
for run in html:
    tag_change = re.match('(?:<)(.*?)(?:>)', run)
    if tag_change is not None:
        tag_strip = tag_change.group(0)
        tag_change = tag_change.group(1)
        if tag_change.startswith('/'):
            if tag_change.startswith('/a'):
                tag_change = next(tag for tag in tags if tag.startswith('a '))
            tag_change = tag_change.strip('/')
            tags.remove(tag_change)
        else:
            tags.append(tag_change)
    else:
        tag_strip = ''
    hyperlink = [tag for tag in tags if tag.startswith('a ')]
    if run.startswith('<'):
        run = run.replace(tag_strip, '')
        if hyperlink:
            hyperlink = hyperlink[0]
            hyperlink = re.match('.*?(?:href=")(.*?)(?:").*?', hyperlink).group(1)
            add_hyperlink(p, run, hyperlink)
        else:
            runner = p.add_run(run)
            if 'b' in tags:
                runner.bold = True
            if 'u' in tags:
                runner.underline = True
            if 'i' in tags:
                runner.italic = True
            if 'H1' in tags:
                runner.font.size = Pt(24)
    else:
        p.add_run(run)
doc.save('test.docx')
Hyperlink function thanks to this question. My concern here is that you will need to manually code for every HTML tag that you want to carry over to the docx. I imagine that could be a large number. I've given some examples of tags you may want to account for.
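One way to keep that manageable, sketched here as an extension of the answer above rather than tested code, is a dispatch table mapping each supported tag name to the run formatting it implies:

from docx.shared import Pt

# Hypothetical mapping from HTML tag name to a python-docx run formatter
FORMATTERS = {
    'b': lambda runner: setattr(runner, 'bold', True),
    'i': lambda runner: setattr(runner, 'italic', True),
    'u': lambda runner: setattr(runner, 'underline', True),
    'H1': lambda runner: setattr(runner.font, 'size', Pt(24)),
}

def apply_tags(runner, tags):
    # Apply every formatter whose tag is currently open
    for tag in tags:
        if tag in FORMATTERS:
            FORMATTERS[tag](runner)

This replaces the chain of "if ... in tags" checks, so supporting a new tag becomes one more dictionary entry.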
Alternatively, you can just save your html code as a string and do:
from htmldocx import HtmlToDocx
new_parser = HtmlToDocx()
new_parser.parse_html_file("html_filename", "docx_filename")
# File extensions are not needed, but tolerated
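If your HTML is already in a string (as in the question) rather than a file, htmldocx can, as far as I recall its API, parse the string directly; a sketch, assuming parse_html_string returns a python-docx Document:

from htmldocx import HtmlToDocx

new_parser = HtmlToDocx()
doc = new_parser.parse_html_string(html)  # assumption: builds a Document from an HTML string
doc.save("demo_html.docx")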
I'm having a problem with a for loop. In the script, I use a text list to build a URL and then run a for loop for each element of the list. After having all the URLs, I want to extract information from the website. That's where I have a problem.
I checked the program and it's building the correct URLs, but it extracts the information using just the 1st URL for all elements of the list.
Please, does anyone have an idea where I'm going wrong?
import urllib2
import re
from bs4 import BeautifulSoup
import time

date = date = (time.strftime('%Y%m%d'))
symbolslist = open('pistas.txt').read().split()

for symbol in symbolslist:
    url = "http://trackinfo.com/entries-race.jsp?raceid=" + symbol + "$" + date + "A01"
    htmltext = BeautifulSoup(urllib2.urlopen(url).read())
    names = soup.findAll('a', {'href': re.compile("dog")})
    for name in names:
        results = ' '.join(name.string.split())
        print results
and that is the text list:
GBM
GBR
GCA
GDB
GSP
GDQ
GEB
Hey man, try this:
import urllib2
import re
from bs4 import BeautifulSoup
import time

date = (time.strftime('%Y%m%d'))
symbolslist = open('pistas.txt').read().split()

for symbol in symbolslist:
    url = "http://trackinfo.com/entries-race.jsp?raceid=" + symbol + "$" + date + "A01"
    htmltext = BeautifulSoup(urllib2.urlopen(url).read())
    names = htmltext.findAll('a', {'href': re.compile("dog")})
    for name in names:
        results = ' '.join(name.string.split())
        print results
I'm trying to extract the data in the mouseovers on the map at the bottom of this webpage, which shows what planes are in the air at the moment fighting bush fires: http://dsewebapps.dse.vic.gov.au/fires/updates/report/aircraft/aircraftlist.htm
Now I can extract the beginning and end of the map, and I can also extract the areas. For example, this is the code I have tried and the results.
from bs4 import BeautifulSoup
import urllib2

url = "http://dsewebapps.dse.vic.gov.au/fires/updates/report/aircraft/aircraftlist.htm"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)

# Find the map at the bottom of the page with all the coordinates in it
findmap = soup.find_all("area")
print(findmap[1]).prettify
This code results in printing out just one of the planes, which is fine here:
<bound method Tag.prettify of <area coords="87,369,6" href="javascript:void(0)" onmouseout=" return nd()" onmouseover="return overlib('<p class=PopupText><STRONG>BOMBER 352</STRONG><br><STRONG>Last Observation: </STRONG>Feb 10 2014 10:26AM<br><STRONG>Speed: </STRONG>0 Knots<br><STRONG>Course: </STRONG>0 Deg True<br><STRONG>Latitude: </STRONG>-37.6074 <STRONG>Longitude: </STRONG>141.362 <br><br><STRONG>Bomber 362</STRONG><br><STRONG>Last Observation: </STRONG>Feb 10 2014 10:29AM<br><STRONG>Speed: </STRONG>0 Knots<br><STRONG>Course: </STRONG>0 Deg True<br><STRONG>Latitude: </STRONG>-37.6072 <STRONG>Longitude: </STRONG>141.362 <br></p>',ABOVE)" shape="circle"></area>>
I would ideally like to convert these paragraphs into JSON so I can feed them into something else. Am I better off doing a lot of regexes, or can BeautifulSoup work with this data and parse it into JSON? From what I have read it can't, because of the JavaScript. Or is there another option?
Thanks.
You can do it using BeautifulSoup.
The example hereafter follows this algorithm:
Iterate over all <area> elements
Use the coords attribute as an index to store the area's data in the result dictionary
Parse the onmouseover attribute of the <area> elements using the following rules:
The html to parse starts after the return overlib(' string and ends before the ',ABOVE string
Every plane record starts with the plane's name enclosed in a <strong> element, followed by a non-text element (<br/> in this case, but I test it as element.name != None), followed by another <strong> element
Below is my sample code:
from bs4 import BeautifulSoup
import urllib2
import pprint

pp = pprint.PrettyPrinter(indent=4)

url = "http://dsewebapps.dse.vic.gov.au/fires/updates/report/aircraft/aircraftlist.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

fields = ['Last Observation', 'Speed', 'Course', 'Latitude', 'Longitude']
areas = dict()
for area in soup.find_all("area"):
    area_coords = area.get('coords')
    print area_coords

    # Cut off the javascript wrapper: everything between "return overlib('" and "',ABOVE"
    data_soup = BeautifulSoup(area.get('onmouseover')[len("return overlib('"):-len("',ABOVE")])

    planes = list()
    elements = data_soup.find_all('p')[0].contents
    for i in range(len(elements) - 2):
        # A plane's name: <strong>, non-text element, <strong>
        if elements[i].name == 'strong' and \
           elements[i+1].name and \
           elements[i+2].name == 'strong':
            plane = dict()
            plane[u'Name'] = elements[i].contents[0]
            planes.append(plane)
        # A field: <strong>field name</strong> followed by its text value
        if hasattr(elements[i], 'contents') and len(elements[i].contents) > 0:
            field_name = elements[i].contents[0].strip(' :')
            if field_name in fields:
                plane[field_name] = elements[i+1]
    areas[area_coords] = planes

pp.pprint(areas)
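Since the goal was JSON, the resulting dictionary can be serialized directly; the values stored above are bs4 NavigableString objects, which subclass unicode, so json.dumps handles them (a quick sketch):

import json

print json.dumps(areas, indent=4)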
Using lxml might be a little better than regex...
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> t1 = etree.parse(url, parser)
# use xpath to extract onmouseover
>>> o = t1.xpath('//area[2]/#onmouseover')[0]
# delete javascript function call from both sides, yep, that's the ugly part...
>>> h = o[len("return overlib('"):-len("',ABOVE)")]
>>> t2 = etree.fromstring(h, parser)
# note the [1:] to remove first unwanted strong tag
# also note the use of zip here
>>> {k:v for k,v in zip(t2.xpath('//strong/text()')[1:], t2.xpath('//p/text()'))}
{'Latitude: ': '-34.232 ', 'Last Observation: ': 'Feb 9 2014 6:36PM',
'Speed: ': '3 Knots', 'Course: ': '337 Deg True', 'Longitude: ': '142.086 '}