I'm trying to extract the "2016-Annual" table at http://www.americashealthrankings.org/api/v1/downloads/131 to a CSV. The table has 3 fields: STATE, RANK, VALUE. I'm getting an error with the following:
import urllib2
from bs4 import BeautifulSoup
import csv

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
header = {'User-Agent': 'Mozilla/5.0'}

req = urllib2.Request(url, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find('2016-Annual', {'class': 'STATE-RANK-VALUE'})

f = open('output.csv', 'w')
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        STATE = cells[0].find(text=True)
        RANK = cells[1].find(text=True)
        VALUE = cells[2].find(text=True)
        print write_to_file
        f.write(write_to_file)
f.close()
What am I missing here? I'm using Python 2.7.
Your code is wrong. The URL 'http://www.americashealthrankings.org/api/v1/downloads/131' downloads a CSV file directly; there is no HTML table to parse.
Download the CSV file to your local computer, and then you can use that file.
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
import urllib2

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
html = urllib2.urlopen(url).read()
with open('output.csv', 'w') as output:
    output.write(html)
According to the BeautifulSoup docs, you need to pass a string (or an open filehandle) to be parsed on initialization. However, page = urllib2.urlopen(req) is a response object; calling page.read() gives you the HTML string.
Try using soup = BeautifulSoup(page.read(), 'html.parser') instead.
Also, the variable write_to_file doesn't exist.
If this doesn't solve it, please also post which error you get.
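Putting those fixes together, a minimal sketch of what the corrected loop might look like (note this still assumes the endpoint returns an HTML table; as the next answer points out, it actually serves a CSV, and the 'table' selector below is a placeholder since no '2016-Annual' tag exists in HTML):

import csv
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urllib2.urlopen(req)

# Pass the HTML string, not the response object, and name a parser.
soup = BeautifulSoup(page.read(), 'html.parser')

# Placeholder selector -- adjust to the real table's tag/class.
table = soup.find('table')

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    for row in table.findAll('tr'):
        cells = row.findAll('td')
        if len(cells) == 3:
            # This row list is what the undefined 'write_to_file'
            # variable was presumably meant to hold.
            writer.writerow([cell.find(text=True) for cell in cells])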
The reason it's not working is that you're pointing to a file that is already a CSV - you can literally load that URL in your browser and it will download in CSV file format. The table you're expecting, though, is not at that endpoint - it is at this URL:
http://www.americashealthrankings.org/explore/2016-annual-report
Also - I don't see a class called STATE-RANK-VALUE. I only see th headers called state, rank, and value.
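Since that endpoint already serves CSV, a minimal sketch would be to download it and keep only the columns you need with the csv module (the header names below are guesses -- inspect the real header row of the downloaded file and adjust):

import csv
import urllib2

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
response = urllib2.urlopen(url)

# DictReader takes its field names from the file's header row.
reader = csv.DictReader(response)

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['STATE', 'RANK', 'VALUE'])
    for row in reader:
        # 'State Name', 'Rank' and 'Value' are assumed header names.
        writer.writerow([row.get('State Name'), row.get('Rank'), row.get('Value')])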
I'm new to the Python world and I'm wondering how I can scrape data from GitHub into a CSV file, e.g.
https://gist.github.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19#file-pokemon-csv
I'm trying with this code, but it is not very successful. There must be an easier way to do it.
Thank you in advance!
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://gist.github.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pokemon_table = soup.find('table', class_='highlight tab-size js-file-line-container')

for pokemon in pokemon_table.find_all('tr'):
    name = [pokemon.find('td', class_='blob-code blob-code-inner js-file-line').text]
    with open('output.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(name)
If you do not mind taking the 'Raw version' of the CSV file, the following code would work:
import requests
response = requests.get("https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/bd584ee6c307cc9fab5ba38916e98a85de9c2ba7/pokemon.csv")
with open("output.csv", "w") as file:
file.write(response.text)
The URL you are using does not link to the CSV's Raw version, but the following does:
https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/bd584ee6c307cc9fab5ba38916e98a85de9c2ba7/pokemon.csv
Edit 1: Just to clarify, you can access the Raw version of that CSV file by pressing the 'Raw' button on the top-right side of the CSV file shown in the link you provided.
Edit 2: Also, it looks like the following URL would also work, and it is shorter and easier to 'build' based on the original URL:
https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/pokemon.csv
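As a rough illustration of that shorter pattern, you could build the raw URL from the original gist URL and the file name (this sketch assumes the gist host and /raw/<filename> path layout stay as they are today):

import requests

gist_url = 'https://gist.github.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19'

# Swap the host and append /raw/<filename>; 'pokemon.csv' is the file
# name shown in the gist itself.
raw_url = gist_url.replace('gist.github.com', 'gist.githubusercontent.com') + '/raw/pokemon.csv'

response = requests.get(raw_url)
response.raise_for_status()
with open('output.csv', 'w') as f:
    f.write(response.text)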
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text)
table = soup.find_all(class_='ag-header-cell-text')
This gives me a None value. Any idea how to scrape data from this site? I would appreciate it.
BeautifulSoup can only see what's directly baked into the HTML of a resource at the time it is initially requested. The content you're trying to scrape isn't baked into the page, because normally, when you view this particular page in a browser, the DOM is populated asynchronously using JavaScript. Fortunately, logging your browser's network traffic reveals requests to a REST API, which serves the contents of the table as JSON. The following script makes an HTTP GET request to that API, given a desired "dataset_id" (you can change the key-value pair in the params dict as desired). The response is then dumped into a CSV file:
def main():

    import requests
    import csv

    url = "https://portal.karandaaz.com.pk/api/table"
    params = {
        "dataset_id": "1000"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()
    content = response.json()

    filename = "dataset_{}.csv".format(params["dataset_id"])

    with open(filename, "w", newline="") as file:
        fieldnames = content["data"]["columns"]
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for row in content["data"]["rows"]:
            writer.writerow(dict(zip(fieldnames, row)))

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The tag you're searching for isn't in the source code, which is why you're returning no data. Is there some reason you expect this to be there? You may be seeing different source code in a browser than you do when pulling it with the requests library.
You can view the code being pulled via:
import requests
from bs4 import BeautifulSoup as bs
import csv
r = requests.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
soup = bs(r.text, "lxml")
print( soup )
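If you do want the rendered DOM rather than the underlying API, one hedged alternative (my assumption; neither answer above used it) is to let a real browser execute the JavaScript first, e.g. with Selenium, and only then hand the page to BeautifulSoup:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://portal.karandaaz.com.pk/dataset/total-population/1000')
time.sleep(5)  # crude wait for the JavaScript to populate the DOM

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

print(soup.find_all(class_='ag-header-cell-text'))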
import requests
from bs4 import BeautifulSoup

#get website, trick website into thinking you're using chrome.
url = 'http://www.espn.com/mlb/stats/pitching/_/sort/wins/league/al/year/2019/seasontype/2'
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get(url, headers)

#sets soup to res and spits it out in html parsing format
soup = BeautifulSoup(res.content, 'html.parser')

#finds table from website
stats = soup.find_all('table', class_='tablehead')
stats = stats[0]

#saves the table into a text file.
with open('pitchers_stats.txt', 'w') as r:
    for row in stats.find_all('tr'):
        r.write(row.text.ljust(5))
        #delete next two lines. the program sort of works, for those trying to help me on stackOverflow.
        for cell in stats.find_all('td'):
            r.write('\n')
        # divide each row by a new line
        r.write('\n')
This simply prints out "Sortable pitching" and I'm not sure why.
The program sort of works as expected after deleting these lines:

for cell in stats.find_all('td'):
    r.write('\n')

Doing this shows output closer to what I expect (screenshot omitted).
I assume you are trying to parse the table and put each column on a new line.
The issue is that you're writing the text of the whole <tr> tag at once. What you should do instead is iterate over the <td> elements inside it.
import requests
from bs4 import BeautifulSoup

#get website, trick website into thinking you're using chrome.
url = 'http://www.espn.com/mlb/stats/pitching/_/sort/wins/league/al/year/2019/seasontype/2'
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)  # pass headers as a keyword argument

#sets soup to res and spits it out in html parsing format
soup = BeautifulSoup(res.content, 'html.parser')

#finds table from website
stats = soup.find_all('table', class_='tablehead')
stats = stats[0]

#saves the table into a text file.
with open('pitchers_stats.txt', 'w') as r:
    for row in stats.find_all('tr'):
        for data in row.find_all('td'):
            r.write(data.text.ljust(5))
        # divide each row by a new line
        r.write('\n')
I'm working on a scraper for a number of Chinese documents. As part of the project I'm trying to scrape the body of the document into a list and then write an HTML version of the document from that list (the final version will include metadata as well as the text, along with a folder full of individual HTML files for the documents).
I've managed to scrape the body of the document into a list and then use the contents of that list to create a new HTML document. I can even view the contents when I output the list to a csv (so far so good....).
Unfortunately the HTML document that is output is all escape sequences like "\u6d88\u9664\u8d2b\u56f0\u3001" instead of Chinese characters.
Is there a way to encode the output so that this won't happen? Or do I just need to grow up and scrape the page for real (parsing and organizing it <p> by <p> instead of just copying all of the existing HTML as-is) and then build the new HTML page element by element?
Any thoughts would be most appreciated.
from bs4 import BeautifulSoup
import urllib
#csv is for the csv writer
import csv

#initiates the list to hold the output
holder = []

#this is the target URL
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm"

data = []
filename = "fullbody.html"
target = open(filename, 'w')

def bodyscraper(url):
    #opens the url for read access
    this_url = urllib.urlopen(url).read()
    #creates a new BS holder based on the URL
    soup = BeautifulSoup(this_url, 'lxml')
    #finds the body text
    body = soup.find('td', {'class':'b12c'})
    data.append(body)
    holder.append(data)
    print holder[0]
    for item in holder:
        target.write("%s\n" % item)

bodyscraper(target_url)

with open('bodyscraper.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(holder)
Since the source page is UTF-8 encoded, just decode what urllib returns before handing it to BeautifulSoup and it will work. I have tested that both the HTML and the CSV output show Chinese characters; here is the amended code:
from bs4 import BeautifulSoup
import urllib
#csv is for the csv writer
import csv

#initiates the list to hold the output
holder = []

#this is the target URL
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm"

data = []
filename = "fullbody.html"
target = open(filename, 'w')

def bodyscraper(url):
    #opens the url for read access
    this_url = urllib.urlopen(url).read()
    #creates a new BS holder based on the URL
    soup = BeautifulSoup(this_url.decode("utf-8"), 'lxml') #decoding what urllib returns
    #finds the body text
    body = soup.find('td', {'class':'b12c'})
    target.write("%s\n" % body) #write the whole decoded body to html directly
    data.append(body)
    holder.append(data)

bodyscraper(target_url)

with open('bodyscraper.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(holder)
I am building a crawler in Python and I have a list of hrefs from the page.
Now I have a list of file extensions to download, like

list = ['zip','rar','pdf','mp3']

How can I save the files from those URLs to a local directory using Python?
EDIT:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.example.com/downlaod"
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
list_urls = soup.find_all('a')
print list_urls[6]
Going by your posted example:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.example.com/downlaod"
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
list_urls = soup.find_all('a')
print list_urls[6]
So, the URL you want to fetch next is presumably list_urls[6]['href'].
The first trick is that this might be a relative URL rather than absolute. So:
newurl = list_urls[6]['href']
absurl = urlparse.urljoin(site.url, newurl)
Also, you want to only fetch the file if it has the right extension, so:
if not absurl.endswith(tuple(extensions)):  # endswith needs a str or tuple, not a list
    return # or break or whatever
But once you've decided what URL you want to download, it's no harder than your initial fetch:
page = urllib2.urlopen(absurl)
html = page.read()
path = urlparse.urlparse(absurl).path
name = os.path.basename(path)
with open(name, 'wb') as f:
f.write(html)
That's mostly it.
There are a few things you might want to add, but if so, you have to add them all manually. For example:
Look for a Content-disposition header with a suggested filename to use in place of the URL's basename (see the sketch after this list).
copyfile from page to f instead of reading the whole thing into memory and then writing it out.
Deal with existing files with the same name.
…
But that's the basics.
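For the first of those, here is a rough sketch, assuming Python 2's urllib2 as in the rest of this answer, and treating the header parsing as best-effort:

import cgi
import os
import urlparse
import urllib2

page = urllib2.urlopen(absurl)  # 'absurl' as computed above

# Prefer a filename suggested by the server, if any.
disposition = page.info().getheader('Content-Disposition')
name = None
if disposition:
    _, params = cgi.parse_header(disposition)
    name = params.get('filename')

if not name:
    # Fall back to the URL's basename, as before.
    name = os.path.basename(urlparse.urlparse(absurl).path)

with open(name, 'wb') as f:
    f.write(page.read())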
You can use the Python requests library, as you asked about in the question: http://www.python-requests.org
You can save a file from a URL like this:
import requests
url='http://i.stack.imgur.com/0LJdh.jpg'
data=requests.get(url).content
filename="image.jpg"
with open(filename, 'wb') as f:
    f.write(data)
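For large files you may not want to hold the whole body in memory; a hedged variant using requests' streaming support looks like this:

import requests

url = 'http://i.stack.imgur.com/0LJdh.jpg'
filename = "image.jpg"

# stream=True keeps the body out of memory; iter_content yields chunks.
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)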
A solution using urllib3:
import os
import urllib3
from bs4 import BeautifulSoup
import urllib.parse
url = "https://path/site"
site = urllib3.PoolManager()
html = site.request('GET', url)
soup = BeautifulSoup(html.data, "lxml")
list_urls = soup.find_all('a')
and then a recursive function to get all the files:

def recursive_function(list_urls):
    if not list_urls:  # base case: stop once the list is exhausted
        return
    newurl = list_urls[0]['href']
    absurl = url + newurl
    list_urls.pop(0)
    if absurl.endswith(extensions): # verify it has one of the targeted extensions (a str or tuple)
        html = site.request('GET', absurl)  # reuse the PoolManager created above
        name = os.path.basename(absurl)
        with open(name, 'wb') as f:
            f.write(html.data)
    return recursive_function(list_urls)
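A short usage sketch tying it together (the extensions tuple is my assumption, based on the question's list; str.endswith needs a str or tuple, not a list):

extensions = ('zip', 'rar', 'pdf', 'mp3')
recursive_function(list_urls)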