I'm new to Python and to coding in general, so I hope this isn't too easy a question.
I'm trying to make a CSV file from data scraped from the web, but I keep getting this error and it won't go away:
AttributeError: 'Doctype' object has no attribute 'find_all'
Here's the whole code:
import bs4 as bs
import urllib.request

req = urllib.request.Request('http://www.mobygames.com/game/tom-clancys-rainbow-six-siege', headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

scores = soup.find_all("div")

filename = "scores1.csv"
f = open(filename, "w")
headers = "Hi, Med, Low\n"
f.write(headers)

for scores in soup:
    scoreHi = scores.find_all("div", {"class":"scoreHi"})
    Hi = scoreHi[0].text
    scoreMed = scores.find_all("div", {"class":"scoreMed"})
    Med = scoreMed[0].text
    scoreLow = scores.find_all("div", {"class":"scoreLow"})
    Low = scoreLow[0].text
    print("Hi: " + Hi)
    print("Med: " + Med)
    print("Low: " + Low)
    f.write(Hi + "," + Med.replace(",","|") + "," + Low + "\n")

f.close()
You first assign to scores:
scores = soup.find_all("div")
which is fine, but you then should walk over those scores:
for score in scores:
    scoreHi = score.find_all("div", {"class":"scoreHi"})
    Hi = scoreHi[0].text
    scoreMed = score.find_all("div", {"class":"scoreMed"})
    Med = scoreMed[0].text
    scoreLow = score.find_all("div", {"class":"scoreLow"})
    Low = scoreLow[0].text
Trying to iterate over the document itself (i.e. soup) using:
for scores in soup:
makes no sense: iterating over soup walks its top-level children, and the first of those is the Doctype node, which has no find_all method. That is exactly the error you are seeing.
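Putting the pieces together, a minimal corrected version of the script might look like the sketch below. It is untested against the live page; the scoreHi/scoreMed/scoreLow class names are taken straight from your own code, and a guard is added so divs that don't contain all three score blocks are skipped instead of raising an IndexError.

import bs4 as bs
import urllib.request

req = urllib.request.Request(
    'http://www.mobygames.com/game/tom-clancys-rainbow-six-siege',
    headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

with open("scores1.csv", "w") as f:
    f.write("Hi, Med, Low\n")
    # walk over the <div> tags, not over the soup object itself
    for score in soup.find_all("div"):
        hi = score.find_all("div", {"class": "scoreHi"})
        med = score.find_all("div", {"class": "scoreMed"})
        low = score.find_all("div", {"class": "scoreLow"})
        if not (hi and med and low):
            # this div doesn't hold a complete score block; skip it
            continue
        f.write(hi[0].text + "," + med[0].text.replace(",", "|") + "," + low[0].text + "\n")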
Running a program in cmd; the print function
with open('test1.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for index, url in enumerate(URL_LIST):
        page = requests.get(url)
        print '\r' 'Scraping URL ' + str(index+1) + ' of ' + str(len(URL_LIST)),
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
Which prints: Scraping URL 1 of 400
over and over until the count ends.
What I'm trying to learn today is how to print two outcomes on two separate lines, over and over until the loop ends.
Example:
Scraping URL 1 of 400 (where the URL number is the only thing that changes)
Then, if the scraper finds a result in the list:
Adding Result 1 to CSV (where the result number is the only thing that changes)
So far I have tried a few print commands, but either it overwrites the entire sentence on the same line:
with open('test1.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for index, url in enumerate(URL_LIST):
        page = requests.get(url)
        print '\r' 'Scraping URL ' + str(index+1) + ' of ' + str(len(URL_LIST)),
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
            print '\r' 'URL_FOUND' + str(index+1) + 'adding to CSV',
Or, if I try to link the two print statements through an else clause, it only prints the first statement and the second is never acknowledged:
with open('test1.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for index, url in enumerate(URL_LIST):
        page = requests.get(url)
        print '\r' 'Scraping URL ' + str(index+1) + ' of ' + str(len(URL_LIST)),
    else:
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
            print '\n' 'title'
Just wondering if anyone could point me in the right direction for printing two outcomes on 2 lines.
Full code below if required:
import requests
import csv
import datetime
import pandas as pd
import csv
from lxml import html

df = pd.read_excel("C:\Python27\Projects\REA_SCRAPER\\REA.xlsx", sheetname="REA")

dnc = df['Property']
dnc_list = list(dnc)
url_base = "https://www.realestate.com.au/property/"

URL_LIST = []
for nd in dnc_list:
    nd = nd.strip()
    nd = nd.lower()
    nd = nd.replace(" ", "-")
    URL_LIST.append(url_base + nd)

text2search = '''RECENTLY SOLD'''

with open('test1.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for index, url in enumerate(URL_LIST):
        page = requests.get(url)
        print '\r' 'Scraping URL ' + str(index+1) + ' of ' + str(len(URL_LIST)),
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
I would have recommended curses, but you're on Windows and just writing what appears to be a small script; reason enough to not go down that rabbit hole.
The reason your lines overwrite each other is that you are printing carriage returns (\r), which move the cursor back to the start of the line. Any text written after that overwrites the previously printed text.
I found this with a quick Google, which may be of interest to you.
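For what it's worth, one simple pattern that gets close to what the question asks, without curses or ANSI escape codes, is to keep a single progress line that rewrites itself with \r and to push each "found" message onto its own fresh line as it happens. A small self-contained sketch (the URL list and the "found" condition are stand-ins, not your real data):

import sys
import time

URL_LIST = ['url-%d' % n for n in range(1, 21)]   # stand-in for the real URL_LIST
found = 0

for index, url in enumerate(URL_LIST):
    # \r rewinds to the start of the line so the progress line overwrites itself;
    # the trailing spaces wipe out leftovers from a longer previous message
    sys.stdout.write('\rScraping URL %d of %d   ' % (index + 1, len(URL_LIST)))
    sys.stdout.flush()
    if index % 5 == 0:                            # stand-in for "text2search in page.text"
        found += 1
        # move to a fresh line for the result message; the progress line
        # then continues updating on the line below it
        sys.stdout.write('\nAdding Result %d to CSV\n' % found)
    time.sleep(0.1)

sys.stdout.write('\n')

This prints one permanent line per match and keeps exactly one self-updating progress line, which avoids trying to repaint two lines at once on a plain Windows console.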
I'm attempting to scrape data for all the quarterbacks who have been drafted. http://www.nfl.com/draft/history/fulldraft?type=position
I'm able to scrape the data; however, there are blank lines in the output file that I cannot get rid of.
Here is the code that I used.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdata = playerdatasaved = ""
soup = make_soup("http://www.nfl.com/draft/history/fulldraft?type=position")
for record in soup.findAll('tr'):
    playerdata = ""
    for data in record.findAll('td'):
        playerdata = playerdata + "," + data.text
    if len(playerdata) != 0:
        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Round, Selection #, Player, Position, School, Team Drafted" + "\n"
file = open("Quarterbacks.csv", "wb")
file.write(bytes(header, encoding="ascii", errors='ignore'))
file.write(bytes(playerdatasaved, encoding="ascii", errors='ignore'))
I've tried to use an if statement to check for \n breaks and remove the breaks. Also, I've tried to turn the data into a string and use a replace or split command. None of these corrected the issue.
Thanks for any help that you can give me!
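A likely culprit for those blank rows (this is an assumption, not something verified against the NFL page) is newline characters embedded in data.text, which survive into playerdatasaved and show up as extra empty lines in the CSV. A minimal sketch of the same loop with whitespace collapsed in each cell and empty rows skipped:

import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    page = urllib.request.urlopen(url)
    return BeautifulSoup(page, "html.parser")

soup = make_soup("http://www.nfl.com/draft/history/fulldraft?type=position")

rows = []
for record in soup.findAll('tr'):
    # " ".join(text.split()) collapses newlines and runs of spaces inside a cell
    cells = [" ".join(data.text.split()) for data in record.findAll('td')]
    if any(cells):  # skip rows whose cells are all empty
        rows.append(",".join(cells))

header = "Round, Selection #, Player, Position, School, Team Drafted"
with open("Quarterbacks.csv", "w", encoding="ascii", errors="ignore") as f:
    f.write(header + "\n")
    f.write("\n".join(rows) + "\n")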
I am a complete programming beginner, so please forgive me if I am not able to express my problem very well. I am trying to write a script that will look through a series of pages of news and record the article titles and their links. I have managed to get that done for the first page; the problem is getting the content of the subsequent pages. By searching on Stack Overflow, I think I managed to find a solution that makes the script access more than one URL, but it seems to be overwriting the content extracted from each page it accesses, so I always end up with the same number of recorded articles in the file. Something that might help: I know that the URLs follow this model: "/ultimas/?page=1", "/ultimas/?page=2", etc., and the site appears to use AJAX to request new articles.
Here is my code:
import csv
import requests
from bs4 import BeautifulSoup as Soup
import urllib

r = base_url = "http://agenciabrasil.ebc.com.br/"
program_url = base_url + "/ultimas/?page="
for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))

letters = soup.find_all("div", class_="titulo-noticia")
letters[0]
lobbying = {}
for element in letters:
    lobbying[element.a.get_text()] = {}

letters[0].a["href"]
prefix = "http://agenciabrasil.ebc.com.br"
for element in letters:
    lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]

for item in lobbying.keys():
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

import os, csv
os.chdir("...")
with open("lobbying.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["name", "link",])
    for a in lobbying.keys():
        writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

import json
with open("lobbying.json", "w") as writeJSON:
    json.dump(lobbying, writeJSON)

print "Fim"
Any help on how I might go about adding the content of each page to the final file would be greatly appreciated. Thank you!
How about this one, if it serves the same purpose:
import csv, requests
from lxml import html

base_url = "http://agenciabrasil.ebc.com.br"
program_url = base_url + "/ultimas/?page={0}"

outfile = open('scraped_data.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Caption", "Link"])

for url in [program_url.format(page) for page in range(1, 4)]:
    response = requests.get(url)
    tree = html.fromstring(response.text)
    for title in tree.xpath("//div[@class='noticia']"):
        caption = title.xpath('.//span[@class="field-content"]/a/text()')[0]
        policy = title.xpath('.//span[@class="field-content"]/a/@href')[0]
        writer.writerow([caption, base_url + policy])
It looks like the code meant for your for loop (for page in range(1, 4):) isn't being run inside the loop because your file isn't correctly indented.
If you tidy up your code, it works:
import csv, requests, os, json, urllib
from bs4 import BeautifulSoup as Soup

r = base_url = "http://agenciabrasil.ebc.com.br/"
program_url = base_url + "/ultimas/?page="

for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))
    letters = soup.find_all("div", class_="titulo-noticia")

    lobbying = {}
    for element in letters:
        lobbying[element.a.get_text()] = {}

    prefix = "http://agenciabrasil.ebc.com.br"
    for element in letters:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]

    for item in lobbying.keys():
        print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

    #os.chdir("...")
    with open("lobbying.csv", "w") as toWrite:
        writer = csv.writer(toWrite, delimiter=",")
        writer.writerow(["name", "link",])
        for a in lobbying.keys():
            writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

    with open("lobbying.json", "w") as writeJSON:
        json.dump(lobbying, writeJSON)

print "Fim"
So I have this Python script. Right now, I run the script and it gives me a CSV output file.
What I want: when the script finishes, restart it and check for changes to those output values (without refreshing the output file on restart and erasing all the previously collected data).
Also, it takes about 3 seconds to retrieve each line of data. Does anyone know how I can speed it up to handle large data sets?
import urllib2, re, urllib, urlparse, csv, sys, time, threading, codecs
from bs4 import BeautifulSoup

def extract(url):
    try:
        sys.stdout.write('0')
        global file
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        product = soup.find("div", {"class": "js-product-price"})
        price = product.findNext('div', {'class': 'js-price-display'}).getText().strip()
        oos = product.findNext('p', attrs={'class': "price-oos"})
        if oos is None:
            oos = 'In Stock'
        else:
            oos = oos.getText()
        val = url + "," + price + "," + oos + "," + time.ctime() + '\n'
        ifile.write(val)
        sys.stdout.write('1')
    except Exception as e:
        print e
        #pass
    return

ifile = open('output.csv', "a", 0)
ifile.write('URL' + "," + 'Price' + "," + 'Stock' + "," + "Time" + '\n')
inputs = csv.reader(open('input.csv'))
#inputs = csv.reader(codecs.open('input.csv', 'rU', 'utf-16'))
for i in inputs:
    extract(i[0])
ifile.close()
print("finished")
I'm trying to scrape temperatures from a weather site using the following:
import urllib2
from BeautifulSoup import BeautifulSoup

f = open('airport_temp.tsv', 'w')
f.write("Location" + "\t" + "High Temp (F)" + "\t" + "Low Temp (F)" + "\t" + "Mean Humidity" + "\n")

# eventually parse from http://www.wunderground.com/history/airport/\w{4}/2012/\d{2}/1/DailyHistory.html
for x in range(10):
    locationstamp = "Location " + str(x)
    print "Getting data for " + locationstamp
    url = 'http://www.wunderground.com/history/airport/KAPA/2013/3/1/DailyHistory.html'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    location = soup.findAll('h1').text
    locsent = location.split()
    loc = str(locsent[3,6])
    hightemp = soup.findAll('nobr')[6].text
    htemp = hightemp.split()
    ht = str(htemp[1])
    lowtemp = soup.findAll('nobr')[10].text
    ltemp = lowtemp.split()
    lt = str(ltemp[1])
    avghum = soup.findAll('td')[23].text
    f.write(loc + "\t|" + ht + "\t|" + lt + "\t|" + avghum + "\n")

f.close()
Unfortunately, I get an error saying:
Getting data for Location 0
Traceback (most recent call last):
File "airportweather.py", line 18, in <module>
location = soup.findAll('H1').text
AttributeError: 'list' object has no attribute 'text'
I've looked through BS and Python documentation, but am still pretty green, so I couldn't figure it out. Please help this newbie!
The .findAll() method returns a list of matches. If you wanted one result, use the .find() method instead. Alternatively, pick out a specific element like the rest of the code does, or loop over the results:
location = soup.find('h1').text
or
locations = [el.text for el in soup.findAll('h1')]
or
location = soup.findAll('h1')[2].text
This is quite simple: findAll returns a list, so if you are sure there is only one element you are interested in, then soup.findAll('H1')[0].text should work.