I'm trying to get the text from a webpage with Python 3.3 and then search through that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy
and I need to save the text after each category (card text, rarity, etc.) in the card info.
Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code:
urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
).read()
htmlRaw = BeautifulSoup(urlStr)
htmlText = htmlRaw.get_text

for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc + "* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type + "* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText + "* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor + "* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity + "* \n\n"
This is incorrect:
htmlText = htmlRaw.get_text
As get_text is a method of the BeautifulSoup class, you're assigning the method itself to htmlText, not its result. There is a property variant that does what you want here:
htmlText = htmlRaw.text
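The same trap is easy to reproduce without bs4; a minimal stdlib sketch of method-vs-property access (the class and names here are illustrative, not part of BeautifulSoup):

```python
class Page:
    """Stand-in for a BeautifulSoup-like object (illustrative only)."""
    def get_text(self):
        return "card text"

    @property
    def text(self):
        return self.get_text()

p = Page()
bound = p.get_text     # no parentheses: a bound method object, not the text
called = p.get_text()  # parentheses: actually calls the method
prop = p.text          # property: plain attribute access triggers the call
```

Iterating over `bound` is what produces the confusing errors: a method object is not iterable, while `called` and `prop` are ordinary strings.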
You're also using an HTML parser simply to strip tags, when you could use it to target exactly the data you want:
# unique id for the html section containing the card info
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol'
# grab the html section with the card info
card_data = htmlRaw.find(id=card_id)
# create a generator to iterate over the rows
card_rows = ( row for row in card_data.find_all('div', 'row') )
# create a generator to produce functions for retrieving the values
card_rows_getters = ( lambda x: row.find('div', x).text.strip() for row in card_rows )
# create a generator to get the values
card_values = ( (get('label'), get('value')) for get in card_rows_getters )
# dump them into a dictionary
cards = dict( card_values )
print(cards)
{'Artist:': 'Scott Chou',
 'Card Name:': 'Dark Prophecy',
 'Card Number:': '93',
 'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
 'Community Rating:': 'Community Rating: 3.617 / 5\xa0\xa0(64 votes)',
 'Converted Mana Cost:': '3',
 'Expansion:': 'Magic 2014 Core Set',
 'Flavor Text:': 'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
 'Mana Cost:': '',
 'Rarity:': 'Rare',
 'Types:': 'Enchantment'}
Now you have a dictionary of the information you want (plus a few extras), which will be a lot easier to work with.
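The chained generators stay lazy until dict() drains them. Here is a pure-Python sketch of the same label/value pipeline (the row dicts stand in for the parsed div.row elements; the `row=row` default is an extra safeguard not present in the code above, pinning each row to its lambda in case the generators are ever consumed out of step):

```python
# Stand-in rows mimicking the label/value divs inside each div.row
rows = [
    {"label": " Card Name: ", "value": " Dark Prophecy "},
    {"label": " Rarity: ", "value": " Rare "},
]

# one getter function per row, produced lazily
getters = (lambda key, row=row: row[key].strip() for row in rows)
# pull a (label, value) pair through each getter
values = ((get("label"), get("value")) for get in getters)
# dict() is what finally drives the whole pipeline
cards = dict(values)
```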
I am trying to extract certain information from a long list of text to display it nicely, but I cannot seem to figure out how to tackle this problem.
My text is as follows:
"(Craw...Crawley\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\n**Hotstage**\n **248236**\n\n\n\n\n\n\n\n\n\n\n\n\n\nCosta Collect...Costa Coffee (Bedf...Bedford\n\n\n\n\n\n\n08:00\n\n\n\n \n\n\n**Hotstage**\n **247962**\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Sheffield Qu...Sheffield\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\nHotstage\n 247971\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Brentford...BRENTFORD\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\nHotstage\n 248382\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Newport"
I would like to extract what is highlighted.
I'm thinking the solution is simple, and maybe I am just not storing or extracting the information properly.
This is my code:
from bs4 import BeautifulSoup
import requests
import re
import time

def main():
    url = "http://antares.platinum-computers.com/schedule.htm"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    response.close()
    # Get
    tech_count = 0
    technicians = []  # list to hold technicians' names
    xcount = 0
    test = 0
    name_links = soup.find_all('td', {"class": "resouce_on"})  # get all table data with class name "resouce_on"
    # iterate through html data and add the names to "technicians"
    for i in name_links:
        technicians.append(str(i.text.strip()))  # append value to the list
        tech_count += 1
    print("Found: " + str(tech_count) + " technicians + 1 default unallocated.")
    for t in technicians:
        print(xcount, t)
        xcount += 1
    test = int(input("choose technician: "))
    for link in name_links:
        if link.find(text=re.compile(technicians[test])):
            jobs = []
            numbers = []
            unique_cr = []
            jobs.append(link.parent.text.strip())
            for item in jobs:
                for subitem in item.split():
                    if subitem.isdigit():
                        numbers.append(subitem)
            for number in numbers:
                if number not in unique_cr:
                    unique_cr.append(number)
            print("tasks for technician " + str(technicians[test]) + " are as follows")
            for cr in unique_cr:
                print(jobs)

if __name__ == '__main__':
    main()
It's fairly simple:
myStr = "your complicated text"
words = myStr.split("\n")
niceWords = []
for word in words:
    if "**" in word:
        niceWords.append(word.replace("**", ""))
print(niceWords)
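If the interesting entries are always wrapped in double asterisks, a regular expression can pull them out in one pass as well (a sketch; the sample string is abbreviated from the question):

```python
import re

text = "Crawley\n08:00\n**Hotstage**\n **248236**\nCosta Coffee"
# non-greedy capture of whatever sits between ** and **
nice = [m.group(1) for m in re.finditer(r"\*\*(.+?)\*\*", text)]
```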
My question is a follow-up question to one asked here.
The function:
periodic_figure_values()
seems to work fine except in the case where the name of the line item being searched for appears twice. The specific case I am referring to is trying to get data for "Long Term Debt". The function in the link above returns the following error:
Traceback (most recent call last):
File "test.py", line 31, in <module>
LongTermDebt=(periodic_figure_values(soup, "Long Term Debt"))
File "test.py", line 21, in periodic_figure_values
value = int(str_value)
ValueError: invalid literal for int() with base 10: 'Short/Current Long Term Debt'
because it seems to get tripped up on "Short/Current Long Term Debt". You see, the page has both "Short/Current Long Term Debt" and "Long Term Debt". You can see an example of the source page using Apple's balance sheet here.
I'm trying to find a way for the function to return data for "Long Term Debt" without getting tripped up on "Short/Current Long Term Debt".
Here is the function and an example that fetches "Cash and Cash Equivalents", which works fine, and "Long Term Debt", which does not work:
import requests, bs4, re, sys

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)
    title = soup.find("strong", text=pattern)  # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)  # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")
    cells = row.find_all("td")[1:]  # exclude the <td> with the figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:  # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
            value = int(str_value)
            values.append(value)
    return values

res = requests.get('https://ca.finance.yahoo.com/q/bs?s=AAPL')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
Cash = periodic_figure_values(soup, "Cash And Cash Equivalents")
print(Cash)
LongTermDebt = periodic_figure_values(soup, "Long Term Debt")
print(LongTermDebt)
The easiest fix is a try/except around the conversion, catching the raised ValueError:
import requests, bs4, re, sys

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)
    title = soup.find("strong", text=pattern)  # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)  # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")
    cells = row.find_all("td")[1:]  # exclude the <td> with the figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:  # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
            ### from here
            try:
                value = int(str_value)
                values.append(value)
            except ValueError:
                continue
            ### to here
    return values

res = requests.get('https://ca.finance.yahoo.com/q/bs?s=AAPL')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
Cash = periodic_figure_values(soup, "Cash And Cash Equivalents")
print(Cash)
LongTermDebt = periodic_figure_values(soup, "Long Term Debt")
print(LongTermDebt)
This version prints out your numbers just fine.
Note that you don't really need the re module here, since you're only matching literal strings (no wildcards, no boundaries, etc.).
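To see why the regex adds nothing for a literal figure name, compare the two checks directly (a small sketch with stand-in cell texts):

```python
import re

yahoo_figure = "Cash And Cash Equivalents"
texts = ["Cash And Cash Equivalents", "Long Term Debt"]

# for a purely literal pattern, re.search and a plain substring test agree
regex_hits = [bool(re.search(yahoo_figure, t)) for t in texts]
substr_hits = [yahoo_figure in t for t in texts]
```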
You could change the function so that it accepts a regular expression instead of a plain string. Then you can search for ^Long Term Debt to make sure there's no text before it. All you need to do is change
if cell.text.strip() != yahoo_figure:
to
if not re.match(yahoo_figure, cell.text.strip()):
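The anchored pattern behaves as described; a quick sketch against the two cell texts from the error:

```python
import re

yahoo_figure = r"^Long Term Debt"

# re.match anchors at the start of the string, so the
# "Short/Current..." variant no longer matches at all
short_term = re.match(yahoo_figure, "Short/Current Long Term Debt")
long_term = re.match(yahoo_figure, "Long Term Debt")
```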
Hi, I'm new to Python and creating my first program using Beautiful Soup. I have
an if statement that I want to use: I want to check whether the first element I pull back is true. I'm getting a syntax error and haven't figured out how to fix it. This is my latest iteration of the code, where I add a header before iterating over the remaining elements:
import requests
from bs4 import BeautifulSoup

req = requests.get("http://espn.go.com/college-football/rankings")
soup = BeautifulSoup(req.content, "lxml")
g_data = soup.find_all("tr")
myCaptions = soup.find_all("caption")

print "Here are the college-football Rankings for 2015"
if myCaptions[0]:
    print "College Football Playoff Rankings"  # let's print the first header
for rows in g_data:
    contro = 0
    try:
        ranking = rows.contents[0].find_all("span")[0].text
        myTeam = rows.contents[1].find_all("span", {"class": "team-names"})[0].text
        myRecord = rows.find_all("td")[2].text
        print ranking + " " + myTeam + " " + myRecord
    except:
        pass
    contro = contro + 1
First of all, you have an indentation error - contro = 0 is strangely indented for some reason.
Also, if you want to check the truthiness of a value, there is no need to explicitly compare it to True, just use if myCaptions[0]:.
Anyway, there is no need for find_all() here in the first place if you just want to check whether an element was found. Instead, use find() and check that the result is not None:
caption = rows.find("caption")
if caption is not None:  # "if caption:" would also work
    print "College Football Playoff Rankings"
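The two lookup styles also fail differently when nothing matches, which is worth keeping in mind; a bs4-free sketch:

```python
# find_all() returns an empty list when nothing matches,
# so indexing [0] raises instead of evaluating falsy
results = []
try:
    first = results[0]
except IndexError:
    first = None

# find() simply returns None, which an "is not None" check handles cleanly
caption = None
found = caption is not None
```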
Here is how I would approach the problem of locating multiple tables with captions:
for table in soup.find_all("table", class_="rankings"):
    caption = table.find("caption").get_text(strip=True)
    print caption
    for row in table.find_all("tr")[1:-1]:
        ranking = row.find("span", class_="number")
        ranking = ranking.get_text() if ranking else "Unknown ranking"
        myTeam = row.find("span", class_="team-names").get_text()
        myRecord = row.find_all("td")[2].text
        print ranking + " " + myTeam + " " + myRecord
    print "----"
Works for me, printing:
College Football Playoff Rankings
1 Clemson 13-0
2 Alabama 12-1
3 Michigan State 12-1
...
24 Temple 10-3
25 USC 8-5
----
AP Top 25
1 Clemson 13-0
2 Alabama 12-1
...
I am a beginner to Python. I am the developer of the Easy APIs Project (http://gcdc2013-easyapisproject.appspot.com) and was doing a Python implementation of the weather API using my project. Visit http://gcdc2013-easyapisproject.appspot.com/APIs_Doc.html to see the Weather API. Below is my implementation, but it raises HTTPError: HTTP Error 400: Bad Request.
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + a + ' in celsius')
    htmlx = responsex.read()
    responsex.close()
    htmlx = html[1:]  # remove first {
    htmlx = html[:-1]  # remove last }
    htmlx = html.split('}{')  # split and put each result in an array
    return str(htmlx[1])

print "Enter a city name:",
q = raw_input()  # get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q=' + q)
html = response.read()
response.close()
html = html[1:]  # remove first {
html = html[:-1]  # remove last }
html = html.split('}{')  # split and put each result in an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3])
Please help me..
The query string should be quoted using urllib.quote or urllib.quote_plus:
import urllib
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + urllib.quote(a + ' in celsius'))
    html = responsex.read()
    responsex.close()
    html = html[1:]  # remove first {
    html = html[:-1]  # remove last }
    html = html.split('}{')  # split and put each result in an array
    return html[0]

print "Enter a city name:",
q = raw_input()  # get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q=' + urllib.quote(q))
html = response.read()
print repr(html)
response.close()
html = html[1:]  # remove first {
html = html[:-1]  # remove last }
html = html.split('}{')  # split and put each result in an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3].split()[0])
In addition to that, I modified celsius to use html consistently; the original code mixed html and htmlx.
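A stdlib sketch of what the quoting actually does to a query string (Python 3 names shown, with the Python 2 spelling as a fallback; the sample inputs are illustrative):

```python
try:
    from urllib.parse import quote, quote_plus  # Python 3
except ImportError:
    from urllib import quote, quote_plus        # Python 2

city_query = quote("New York")              # spaces become %20
unit_query = quote_plus("21 F in celsius")  # spaces become +
```

Either form keeps the URL valid; unquoted spaces are what trigger the HTTP 400.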
I have found the answer: the query should be quoted with urllib2.quote(q).
I am parsing HTML from the following website: http://www.asusparts.eu/partfinder/Asus/All In One/E Series. I was just wondering if there is any way I could explore a parsed attribute in Python?
For example, the code below outputs the following:
datas = s.find(id='accordion')
a = datas.findAll('a')
for data in a:
    if data.has_attr('onclick'):
        model_info.append(data['onclick'])
        print data
[OUTPUT]
<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>
These are the values I would like to retrieve:
nCategoryID = Bracket
nModelID = ET10B
family = E Series
As the page is rendered via AJAX, they are using a script source, resulting in the following URL from the script file:
url = 'http://json.zandparts.com/api/category/GetCategories/' + country + '/' + currency + '/' + nModelID + '/' + family + '/' + nCategoryID + '/' + brandName + '/' + null
How can I retrieve only the three values listed above?
[EDIT]
import string, urllib2, urlparse, csv, sys
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval

changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

#Array to hold all options
redirects = []
#Array to hold all data
model_info = []

print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
#print select.get_text()
options = select.findAll('option')
for option in options:
    if option.has_attr('redirectvalue'):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s

    print "FETCHING MAIN TITLE"
    #Finding all the headings for each specific model
    maintitle = s.find(id='puffBreadCrumbs')
    print maintitle.get_text()

    #Find entire HTML container holding all data, rendered by AJAX
    datas = s.find(id='accordion')
    #Find all 'a' tags inside the data container
    a = datas.findAll('a')
    #Find all 'span' tags inside the data container
    content = datas.findAll('span')

    print "FETCHING CATEGORY"
    #Find all 'a' tags which have an 'onclick' attribute. Error: (doesn't display anything,
    #can't seem to find the 'onclick' attr)
    if hasattr(a, 'onclick'):
        arguments = literal_eval('(' + a['onclick'].replace(', this', '').split('(', 1)[1])
        model_info.append(arguments)
        print arguments  #arguments[1] + " " + arguments[3] + " " + arguments[4]

    print "FETCHING DATA"
    for complete in content:
        #Find all 'class' attributes inside 'span' tags
        if complete.has_attr('class'):
            model_info.append(complete['class'])
            print complete.get_text()

    #Find all 'table data cells' inside the table held in the data container
    print "FETCHING IMAGES"
    img = s.find('td')
    #Find all 'img' tags held inside these 'td' cells and print them out
    images = img.findAll('img')
    print images
I have added an Error line where the problem lies...
Similar to Martijn's answer, but this makes primitive use of pyparsing (i.e., it could be refined to recognise the function and only take quoted strings within the parentheses):
from bs4 import BeautifulSoup
from pyparsing import QuotedString
from itertools import chain
s = '''<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>'''
soup = BeautifulSoup(s)
for a in soup('a', onclick=True):
    print list(chain.from_iterable(QuotedString("'", unquoteResults=True).searchString(a['onclick'])))
# ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
You could parse that as a Python literal, if you remove the this argument and take only what's between the parentheses:
from ast import literal_eval
if data.has_attr('onclick'):
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    print arguments
We remove the this argument because it is not a valid Python string literal, and you don't want it anyway.
Demo:
>>> literal_eval('(' + "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')".replace(', this', '').split('(', 1)[1])
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
Now you have a Python tuple and can pick out any value you like.
You want the values at indices 1, 2 and 4, for example:
nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
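Putting the pieces together, the whole extraction can be sketched end to end (the onclick string is the one from the demo above; the variable names are illustrative):

```python
from ast import literal_eval

onclick = "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')"
# drop the bare `this`, keep everything inside the parentheses
arguments = literal_eval('(' + onclick.replace(', this', '').split('(', 1)[1])
# unpack by position into named values
brand, nCategoryID, nModelID, number, family = arguments
```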