I'm trying to scrape temperatures from a weather site using the following:
import urllib2
from BeautifulSoup import BeautifulSoup
f = open('airport_temp.tsv', 'w')
f.write("Location" + "\t" + "High Temp (F)" + "\t" + "Low Temp (F)" + "\t" + "Mean Humidity" + "\n" )
# eventually parse from http://www.wunderground.com/history/airport/\w{4}/2012/\d{2}/1/DailyHistory.html
for x in range(10):
    locationstamp = "Location " + str(x)
    print "Getting data for " + locationstamp
    url = 'http://www.wunderground.com/history/airport/KAPA/2013/3/1/DailyHistory.html'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    location = soup.findAll('h1').text
    locsent = location.split()
    loc = str(locsent[3,6])
    hightemp = soup.findAll('nobr')[6].text
    htemp = hightemp.split()
    ht = str(htemp[1])
    lowtemp = soup.findAll('nobr')[10].text
    ltemp = lowtemp.split()
    lt = str(ltemp[1])
    avghum = soup.findAll('td')[23].text
    f.write(loc + "\t|" + ht + "\t|" + lt + "\t|" + avghum + "\n")
f.close()
Unfortunately, I get an error saying:
Getting data for Location 0
Traceback (most recent call last):
File "airportweather.py", line 18, in <module>
location = soup.findAll('H1').text
AttributeError: 'list' object has no attribute 'text'
I've looked through BS and Python documentation, but am still pretty green, so I couldn't figure it out. Please help this newbie!
The .findAll() method returns a list of matches. If you want just one result, use the .find() method instead. Alternatively, pick out a specific element the way the rest of the code does, or loop over the results:
location = soup.find('h1').text
or
locations = [el.text for el in soup.findAll('h1')]
or
location = soup.findAll('h1')[2].text
This is quite simple: findAll returns a list, so if you are sure there is only one element you are interested in, then soup.findAll('h1')[0].text should work.
This might be a pretty obvious error since I'm pretty new to coding, but I'm trying to read a file for a certain value, which I'll gather by using re.search and slicing, since I only know the text before and after it.
I'm running into a bit of an annoying bug. When I use re.search(r"firstPart(.*?)secondPart", data).group(1) it raises:
Traceback (most recent call last):
File "<stdin>", line 10, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
Which is a problem with this line:
englishWord = re.search(r"<i>(.*?)</i>", str(englishWord)).group(1)
If you read the code, you can see that I've added some seemingly unnecessary lines: instead of writing the entire string in the re.search call, I use only a small part of it and then add or remove text in a separate step, because doing it all in one re.search call doesn't work.
Possibly the most annoying and confusing part is that if I run everything before englishWord = re.search(r"<i>(.*?)</i>", str(englishWord)).group(1) and then run that line on its own, it works; but if I run all of the code at once I get that error. Any idea why? How can I fix this? Thanks! (I am using Python 3.6.)
My Code vvv
#!/Library/Frameworks/Python.framework/Versions/3.6/bin/python3
import re
import itertools
with open('Desktop/data.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')

num = 0
for x in itertools.repeat(None, 8):
    num = int(num) + 1
    if int(num) < 10:
        num = '0' + str(num)
    firstString = re.search(r"id=\"question_" + num + "_whole_question\" data-sidebar-reference=\"\"> (.*?) <input", data).group(1)
    secondString = re.search(r"id=\"question_" + num + "_wol_1\"(.*?) </div>", data).group(1)
    secondString = secondString.replace(" name=\"question_" + num + "_wol_1\" onchange=\"has_unsaved_work();\" size=\"10\" type=\"text\" />", "")
    finalString = firstString + " _" + secondString
    englishWord = re.search(r"(<i><span lang=\"en-US\">(.*?)</span></i>)", finalString)
    englishWord = re.search(r"<i>(.*?)</i>", str(englishWord)).group(1)
    englishWord = "<i>" + englishWord + "</i>"
    finalString = finalString.replace(englishWord, "")
    finalString = finalString.replace("()", "")
    print(finalString)
Call group only if there is a match:

res = re.search(r"<i>(.*?)</i>", str(englishWord))
# if there is a match
if res:
    englishWord = res.group(1)

As pointed out in the comments, re.search returns None when no match is found. See the docs: https://docs.python.org/3/library/re.html#re.search
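If this pattern comes up in several places, a small helper keeps the None check in one spot. A minimal sketch (the name search_group and the empty-string default are illustrative, not from the original code):

import re

def search_group(pattern, text, default=""):
    """Return the first capture group, or default when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else default

search_group(r"<i>(.*?)</i>", "<i>word</i>")   # -> 'word'
search_group(r"<i>(.*?)</i>", "no italics")    # -> ''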
I'm working on a web parser for a webpage containing mathematical constants. I need to replace some characters to get them into a specific format, but I don't know why, when I print the result it looks fine, yet when I open the output file the formatting done by replace() doesn't seem to have taken effect.
Here's the code:
#!/usr/bin/env python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.ebyte.it/library/educards/constants/ConstantsOfPhysicsAndMath.html"
soup = BeautifulSoup(urlopen(url).read(), "html5lib")
f = open("ebyteParse-output.txt", "w")
table = soup.find("table", attrs={"class": "grid9"})
rows = table.findAll("tr")
for tr in rows:
    # If it's a category of constants, write it as a comment
    if tr.has_attr("bgcolor"):
        f.write("\n\n# " + tr.find(text=True) + "\n")
        continue
    cols = tr.findAll("td")
    if (len(cols) >= 2):
        if (cols[0]["class"][0] == "box" or cols[0]["class"][0] == "boxi" and cols[1]["class"][0] == "boxa"):
            constant = str(cols[0].find(text=True)).replace(" ", "-")
            value = str(cols[1].find(text=True))
            value = value.replace(" ", "").replace("...", "").replace("[", "").replace("]", "")
            print(constant + "\t" + value)
            f.write(constant + "\t" + value)
            f.write("\n")
f.close()
The print output shows the replacements applied correctly, but the output file does not.
Thank you,
Salva
The file I was looking at was cached, so no changes were visible. Thanks for answering.
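Even though the cause here turned out to be a cached file, a habit worth adopting when "print shows it but the file doesn't" is to write through a context manager, which guarantees the buffer is flushed and the file closed even if the script dies partway. A minimal sketch (the sample line is illustrative):

with open("ebyteParse-output.txt", "w") as f:
    f.write("pi\t3.14159\n")  # flushed and closed automatically on exit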
I'm new to Python and to coding in general, so I hope this is not too easy a question.
I'm trying to make a CSV file from data scraped from the web.
AttributeError: 'Doctype' object has no attribute 'find_all'
But this error won't go away!
Here's the whole code:
import bs4 as bs
import urllib.request
req = urllib.request.Request('http://www.mobygames.com/game/tom-clancys-rainbow-six-siege',headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
scores = soup.find_all("div")
filename = "scores1.csv"
f = open(filename, "w")
headers = "Hi, Med, Low\n"
f.write(headers)
for scores in soup:
    scoreHi = scores.find_all("div", {"class": "scoreHi"})
    Hi = scoreHi[0].text
    scoreMed = scores.find_all("div", {"class": "scoreMed"})
    Med = scoreMed[0].text
    scoreLow = scores.find_all("div", {"class": "scoreLow"})
    Low = scoreLow[0].text
    print("Hi: " + Hi)
    print("Med: " + Med)
    print("Low: " + Low)
    f.write(Hi + "," + Med.replace(",", "|") + "," + Low + "\n")
f.close()
You first assign to scores:
scores = soup.find_all("div")
which is fine, but then you should walk over those scores:
for score in scores:
    scoreHi = score.find_all("div", {"class": "scoreHi"})
    Hi = scoreHi[0].text
    scoreMed = score.find_all("div", {"class": "scoreMed"})
    Med = scoreMed[0].text
    scoreLow = score.find_all("div", {"class": "scoreLow"})
    Low = scoreLow[0].text
Trying to iterate over the document (i.e. soup) using:
for scores in soup:
makes no sense: iterating over soup yields its top-level nodes, the first of which is the Doctype object, and that object has no find_all method, which is exactly the AttributeError you are seeing.
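Note also that most of those divs won't contain a score block, so scoreHi[0] will raise an IndexError whenever the list is empty. A sketch of one way to guard against that, using find, which returns None when nothing matches (the variable names are just for illustration):

for score in scores:
    hi = score.find("div", {"class": "scoreHi"})
    med = score.find("div", {"class": "scoreMed"})
    low = score.find("div", {"class": "scoreLow"})
    if hi is None or med is None or low is None:
        continue  # this div has no complete score block, skip it
    f.write(hi.text + "," + med.text.replace(",", "|") + "," + low.text + "\n")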
So I have this Python script. Right now, I run the script and it gives me an output file in CSV.
What I want: when it finishes, to restart and check for changes to those output values (not refresh the output file when it restarts and erase all the previously collected data).
Also, it takes about 3 seconds per line of data to be retrieved. Does anyone know how I can speed it up to handle large data sets?
import urllib2,re,urllib,urlparse,csv,sys,time,threading,codecs
from bs4 import BeautifulSoup
def extract(url):
    try:
        sys.stdout.write('0')
        global file
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        product = soup.find("div", {"class": "js-product-price"})
        price = product.findNext('div', {'class': 'js-price-display'}).getText().strip()
        oos = product.findNext('p', attrs={'class': "price-oos"})
        if oos is None:
            oos = 'In Stock'
        else:
            oos = oos.getText()
        val = url + "," + price + "," + oos + "," + time.ctime() + '\n'
        ifile.write(val)
        sys.stdout.write('1')
    except Exception as e:
        print e
        #pass
    return
ifile = open('output.csv', "a", 0)
ifile.write('URL' + "," + 'Price' + "," + 'Stock' + "," + "Time" + '\n')
inputs = csv.reader(open('input.csv'))
#inputs = csv.reader(codecs.open('input.csv', 'rU', 'utf-16'))
for i in inputs:
extract(i[0])
ifile.close()
print("finished")
I seem to have hit a wall with my script. I'm trying to make it grab the text of a commentary from a website and put in some basic XML tags. It grabs everything on a page and that needs to be fixed, but that's a secondary concern right now. I've gotten the script to split the text into chapters, but I can't figure out how to further divide it into the verses. I'm trying to replace every occurrence of "Verse" in a chapter with </verse><verse name = "n">, with "n" being the verse number. I've tried a few things, including for loops and ElementTree, but it either doesn't work or makes every verse name the same.
I tried putting in the following code, but it never seemed to finish when I ran it:
x = "Verse"
for x in para:
    para = para.replace(x, '</verse><verse name = " ' + str(n+1) + ' " >')
    n = n + 1
The code below seems to be the most...functional that I've managed to make it. Any advice on how I should fix this or what else I might try?
from lxml import html
import requests
name = open("new.txt", "a")
name.write("""<?xml version="1.0"?>""")
name.write("<data>")
n = 0
for i in range(0, 17):
    url_base = "http://www.studylight.org/commentaries/acc/view.cgi?bk=45&ch="
    url_norm = url_base + str(i)
    page = requests.get(url_norm)
    tree = html.fromstring(page.text)
    para = tree.xpath('/html/body/div[2]//table//text()')
    name.write("<chapter name =\"" + str(i) + "\" >")
    para = str(para)
    para = para.replace("&", " ")
    para = para.replace("Verse", '</verse><verse name = " ' + str(n+1) + ' " >')
    name.write(str(para))
    name.write("</chapter>")
name.write("</data>")
name.close()
print "done"
You shouldn't be manipulating the document as text; when transforming an XHTML document, use XSLT.
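That said, the immediate numbering bug in the question (every verse in a chapter getting the same name) comes from the fact that str.replace substitutes one fixed string for every occurrence, computed once before the call. If you do stay with the text-replacement approach, re.sub with a callback gives each match its own number. A minimal sketch, not the poster's code; the sample para string is illustrative:

import re
from itertools import count

verse_number = count(1)  # yields 1, 2, 3, ... one per match

def number_verse(match):
    return '</verse><verse name="%d">' % next(verse_number)

para = "Verse In the beginning Verse And the earth Verse And God said"
para = re.sub(r"Verse", number_verse, para)
# para now contains </verse><verse name="1">, </verse><verse name="2">, ...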