I wrote some code to download a CSV file from a URL. The code downloads the HTML of the page instead; when I paste the URL I constructed into a browser it works, but it does not work in the code.
I tried os, requests, and urllib, but all of these options gave the same result.
This is the file I ultimately want to download as CSV:
https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf/1506575576011.ajax?fileType=csv&fileName=IAPD_holdings&dataType=fund
import requests

# this is the url where the csv is
url = 'https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf?switchLocale=y&siteEntryPassthrough=true'
response = requests.get(url, allow_redirects=True)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

# find the url for the CSV
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')
for i in soup.find_all('a', {'class': "icon-xls-export"}):
    print(i.get('href'))

# I get two types of files, one CSV and the other xls.
link_list = []
for i in soup.find_all('a', {'class': "icon-xls-export"}):
    link_list.append(i.get('href'))

# I create the link with the CSV
url_csv = "https://www.ishares.com//" + link_list[0]
response_csv = requests.get(url_csv)
if response_csv.status_code == 200:
    print("Success")
else:
    print("Failure")

# Here I want to download the file
import urllib.request
with urllib.request.urlopen(url_csv) as holdings1, open('dataset.csv', 'w') as f:
    f.write(holdings1.read().decode())
I would like to get the CSV data downloaded.
It needs cookies to work correctly.
I use requests.Session() to get and keep the cookies automatically.
And I write response_csv.content to the file, because I already have it after the second request, so I don't have to make another one. Using urllib.request would create a request without the cookies, and it might not work.
import requests
from bs4 import BeautifulSoup

s = requests.Session()

url = 'https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf?switchLocale=y&siteEntryPassthrough=true'
response = s.get(url, allow_redirects=True)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

# find the url for the CSV
soup = BeautifulSoup(response.content, 'lxml')
for i in soup.find_all('a', {'class': "icon-xls-export"}):
    print(i.get('href'))

# I get two types of files, one CSV and the other xls.
link_list = []
for i in soup.find_all('a', {'class': "icon-xls-export"}):
    link_list.append(i.get('href'))

# I create the link with the CSV
url_csv = "https://www.ishares.com//" + link_list[0]
response_csv = s.get(url_csv)
if response_csv.status_code == 200:
    print("Success")
    f = open('dataset.csv', 'wb')
    f.write(response_csv.content)
    f.close()
else:
    print("Failure")
I found some code online that downloads every PDF found at a URL, and it works, but it fails on the website I need it for. I'm trying to download the PDF of the menu for each day of the week, and I can't figure out how to narrow it down to only those 7 PDF files.
from bs4 import BeautifulSoup
import requests

url = "https://calbaptist.edu/dining/alumni-dining-commons"

# Requests URL and get response object
response = requests.get(url)

# Parse text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on webpage
links = soup.find_all('a')

i = 0

# From all links check for pdf link and
# if present download file
for link in links:
    if ".pdf" in link.get('href', []):
        i += 1
        print("Downloading file: ", i)

        # Get response object for link
        response = requests.get(link.get('href'))

        # Write content in pdf file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
I tried to change the if-statement to look for /dining/menus-and-hours/adc-menus/ instead of .pdf. This gave me an error on the line that gets the response object for the link.
Check the href values: they are relative, not absolute, so you have to prepend the "base url".
You could also select your elements more specifically with a CSS selector, e.g. href contains something:
soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]')
or ends with .pdf:
soup.select('a[href$=".pdf"]')
You may also want to take a look at enumerate():
for i,e in enumerate(soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]'),start=1):
And check the content type of the response header:
requests.get('https://calbaptist.edu'+e.get('href')).headers['Content-Type']
Example
from bs4 import BeautifulSoup
import requests

url = "https://calbaptist.edu/dining/alumni-dining-commons"
soup = BeautifulSoup(requests.get(url).text)

for i, e in enumerate(soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]'), start=1):
    r = requests.get('https://calbaptist.edu' + e.get('href'))
    if r.headers['Content-Type'] == 'application/pdf':
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(r.content)
        pdf.close()
        print("File ", i, " downloaded")
How do I make it so that each image I garnered from web scraping is then stored to a folder? I use Google Colab currently since I am just practicing stuff. I want to store them in my Google Drive folder.
This is my code for web scraping:
import requests
from bs4 import BeautifulSoup

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata('https://www.yahoo.com/')
soup = BeautifulSoup(htmldata, 'html.parser')

imgdata = []
for i in soup.find_all('img'):
    imgdata = i['src']
    print(imgdata)
I created a pics folder manually in the folder where the script is running, to store the pictures in it. Then I changed your code in the for loop so it's appending the urls to the imgdata list. The try/except block is there because not every url in the list is valid.
import requests
from bs4 import BeautifulSoup

def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata('https://www.yahoo.com/')
soup = BeautifulSoup(htmldata, 'html.parser')

imgdata = []
for i in soup.find_all('img'):
    imgdata.append(i['src'])  # made a change here so it's appending to the list

filename = "pics/picture{}.jpg"
for i in range(len(imgdata)):
    print(f"img {i+1} / {len(imgdata)}")
    # try block because not everything in the imgdata list is a valid url
    try:
        r = requests.get(imgdata[i], stream=True)
        with open(filename.format(i), "wb") as f:
            f.write(r.content)
    except Exception:
        print("Url is not valid")
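One small refinement to the manual step above (a sketch, not part of the original answer): the pics folder can be created from the script itself, so the code also works when the folder does not exist yet:

```python
import os

# Create the pics folder if it is missing; exist_ok avoids an error on re-runs
os.makedirs("pics", exist_ok=True)
print(os.path.isdir("pics"))
```

In Google Colab, pointing filename at a path under a mounted Drive (after from google.colab import drive; drive.mount('/content/drive')) stores the files in your Google Drive folder instead of the local runtime.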
This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text)
table = soup.find_all(class_='ag-header-cell-text')
This gives me no value. Any idea how to scrape data from this site? I would appreciate it.
BeautifulSoup can only see what's directly baked into the HTML of a resource at the time it is initially requested. The content you're trying to scrape isn't baked into the page, because normally, when you view this particular page in a browser, the DOM is populated asynchronously using JavaScript.

Fortunately, logging your browser's network traffic reveals requests to a REST API, which serves the contents of the table as JSON. The following script makes an HTTP GET request to that API, given a desired "dataset_id" (you can change the key-value pair in the params dict as desired). The response is then dumped into a CSV file:
def main():
    import requests
    import csv

    url = "https://portal.karandaaz.com.pk/api/table"
    params = {
        "dataset_id": "1000"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()
    content = response.json()

    filename = "dataset_{}.csv".format(params["dataset_id"])
    with open(filename, "w", newline="") as file:
        fieldnames = content["data"]["columns"]
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for row in content["data"]["rows"]:
            writer.writerow(dict(zip(fieldnames, row)))

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The tag you're searching for isn't in the source code, which is why you're returning no data. Is there some reason you expect this to be there? You may be seeing different source code in a browser than you do when pulling it with the requests library.
You can view the code being pulled via:
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text, "lxml")
print( soup )
I'm trying to use BeautifulSoup to scrape .xls tables which are available for download from Xcel Energy's website (https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports).
This function gets the URL links of the tables and attempts to download them:
url = 'https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports'
dir = 'C:/Users/aobrien/PycharmProjects/xceldatascraper/'

def scraper(page):
    from bs4 import BeautifulSoup as bs
    import urllib.request
    import requests
    import os
    import re

    tld = r'https://www.xcelenergy.com'
    pageobj = requests.get(page, verify=False)
    sp = bs(pageobj.content, 'html.parser')
    xlst, fnms = [], []
    links = [a['href'] for a in sp.find_all('a', attrs={'href': re.compile("/staticfiles/")})]
    for idx, a in enumerate(links):
        if a.endswith('.xls'):
            furl = tld + str(a)
            xlst.append(furl)
            fnms.append(a.split('/')[4])
    naur = zip(fnms, xlst)
    if not os.path.exists(dir + 'tables'):
        os.makedirs(dir + 'tables')
    for name, url in naur:
        print(url)
        res = urllib.request.urlopen(url)
        xls = open(dir + 'tables/' + name, 'wb')
        xls.write(res.read())
        xls.close()

scraper(url)
The script fails when urllib.request.urlopen(url) attempts to access the file, returning "urllib.error.HTTPError: HTTP Error 404: Not Found". The "print(url)" statement prints the url that I had the script construct (https://www.xcelenergy.com/staticfiles/xe-responsive/Working With Us/MI-City-Forest-Lake-2016.xls), and manually pasting that url into a browser downloads the file just fine.
What am I missing?
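One likely culprit (an assumption, since the question ends here): the constructed URL contains literal spaces ("Working With Us"), which a browser percent-encodes automatically before sending the request, but urllib.request does not. Percent-encoding the path while leaving the scheme and slashes intact would give urlopen a well-formed URL:

```python
from urllib.parse import quote

# The URL printed by the script above, with literal spaces in the path
furl = "https://www.xcelenergy.com/staticfiles/xe-responsive/Working With Us/MI-City-Forest-Lake-2016.xls"

# safe=":/" keeps the scheme separator and path slashes unescaped
safe_url = quote(furl, safe=":/")
print(safe_url)
# urllib.request.urlopen(safe_url) should then send a properly encoded request
```

This is a sketch of the encoding step only; whether the server accepts the encoded URL has to be verified against the live site.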
I am extracting the "2016-Annual" table in http://www.americashealthrankings.org/api/v1/downloads/131 to a csv. The table has 3 fields: STATE, RANK, VALUE. I am getting an error with the following:
import urllib2
from bs4 import BeautifulSoup
import csv

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find('2016-Annual', {'class': 'STATE-RANK-VALUE'})

f = open('output.csv', 'w')
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        STATE = cells[0].find(text=True)
        RANK = cells[1].find(text=True)
        VALUE = cells[2].find(text=True)
        print write_to_file
        f.write(write_to_file)
f.close()
What am I missing here? Using python 2.7
Your code is wrong: the URL 'http://www.americashealthrankings.org/api/v1/downloads/131' already downloads a CSV file.
Download the CSV file to your local computer, and then you can use that file directly.
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
import urllib2
url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
html = urllib2.urlopen(url).read()
with open('output.csv', 'w') as output:
    output.write(html)
According to the BeautifulSoup docs, you need to pass a string to be parsed on initialization. However, page = urllib2.urlopen(req) returns a file-like response object, not a string.
Try using soup = BeautifulSoup(page.read(), 'html.parser') instead.
Also, the variable write_to_file doesn't exist.
If this doesn't solve it, please also post which error you get.
The reason it's not working is that you're pointing to a file that is already a CSV - you can literally load that URL in your browser and it will download in CSV file format. The table you're expecting, though, is not at that endpoint - it is at this URL:
http://www.americashealthrankings.org/explore/2016-annual-report
Also - I don't see a class called STATE-RANK-VALUE; I only see table headers called STATE, RANK, and VALUE.
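Since the endpoint already serves CSV, no HTML parsing is needed at all - the csv module can read the downloaded text directly. A minimal sketch (the column names STATE, RANK, VALUE come from the question; the sample rows and exact layout of the real file are assumptions):

```python
import csv
import io

# Hypothetical CSV payload shaped like the table described in the question
payload = "STATE,RANK,VALUE\nHawaii,1,0.812\nMassachusetts,2,0.790\n"

# DictReader maps each row to the header fields - no BeautifulSoup required
rows = list(csv.DictReader(io.StringIO(payload)))
for row in rows:
    print(row["STATE"], row["RANK"], row["VALUE"])
```

In the real script, payload would be the body returned by the HTTP request (decoded to text) rather than a hard-coded string.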