I'm new to the Python world and wondering how I can scrape data from GitHub into a CSV file, e.g.
https://gist.github.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19#file-pokemon-csv
I'm trying with this code, but it is not very successful. There must be an easier way to do it.
Thank you in advance!
from bs4 import BeautifulSoup
import requests
import csv
url = 'https://gist.github.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pokemon_table = soup.find('table', class_= 'highlight tab-size js-file-line-container')
for pokemon in pokemon_table.find_all('tr'):
    name = [pokemon.find('td', class_= 'blob-code blob-code-inner js-file-line').text]

with open('output.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(name)
If you do not mind taking the 'Raw version' of the CSV file, the following code would work:
import requests
response = requests.get("https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/bd584ee6c307cc9fab5ba38916e98a85de9c2ba7/pokemon.csv")
with open("output.csv", "w") as file:
file.write(response.text)
The URL you are using does not link to the CSV's Raw version, but the following does:
https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/bd584ee6c307cc9fab5ba38916e98a85de9c2ba7/pokemon.csv
Edit 1: Just to clarify: you can access the Raw version of that CSV file by pressing the 'Raw' button at the top right of the CSV file shown in the link you provided.
Edit 2: Also, it looks like the following URL would also work, and it is shorter and easier to 'build' based on the original URL:
https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/pokemon.csv
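If you have pandas installed, here is a minimal sketch (the library choice is mine, not part of the original answer) that reads that raw URL straight into a DataFrame and saves a local copy:

# Sketch: let pandas fetch and parse the gist's raw CSV directly (assumes pandas is installed)
import pandas as pd

raw_url = "https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/pokemon.csv"
df = pd.read_csv(raw_url)               # downloads and parses the CSV in one step
df.to_csv("output.csv", index=False)    # write a local copy without the index column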
I am working on a web-scraping script. It works fine, but now I want to replace the single URL with a CSV file that contains thousands of URLs, like this:
url1
url2
url3
...
urlX
The first lines of my scraping code are basic:
from bs4 import BeautifulSoup
import requests
from csv import writer
url= "HERE THE URL FROM EACH LINE OF THE CSV FILE"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
How can I tell Python to use the URLs from the CSV? I thought about using a dict, but I don't really know how to do that. Does anyone have a solution? I know it seems very simple to you, but it would be very useful for me.
If this is just a list of URLs, you don't really need the csv module, but here is a solution assuming the URL is in column 0 of the file. You want a csv reader, not a writer, and then it's a simple case of iterating over the rows and taking action.
from bs4 import BeautifulSoup
import requests
import csv
with open("url-collection.csv", newline="") as fileobj:
for row in csv.reader(fileobj):
# TODO: add try/except to handle errors
url = row[0]
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
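If your CSV has a header row with a column named url (that column name is an assumption; adjust it to match your file), csv.DictReader lets you pick the column by name instead of position:

# Sketch assuming the file has a header row with a column called "url"
from bs4 import BeautifulSoup
import requests
import csv

with open("url-collection.csv", newline="") as fileobj:
    for row in csv.DictReader(fileobj):
        page = requests.get(row["url"])
        soup = BeautifulSoup(page.content, 'html.parser')
        # ... scrape whatever you need from soup here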
I am very new to Python and BeautifulSoup. I wrote the code below to call up the website https://www.baseball-reference.com/leagues/MLB-standings.shtml, with the goal of scraping the table at the bottom named "MLB Detailed Standings" and exporting it to a CSV file. My code successfully creates a CSV file, but it pulls the wrong table and is missing the first column with the team names: it grabs the "East Division" table at the top (excluding the first column) rather than the full "MLB Detailed Standings" table at the bottom.
Is there a simple way to pull the MLB Detailed Standings table at the bottom? When I inspect the page, the ID of the specific table I am trying to pull is "expanded_standings_overall". Do I need to reference this in my code? Any other guidance to rework the code to pull the correct table would also be greatly appreciated. Again, I'm very new and trying my best to learn.
import requests
import csv
import datetime
from bs4 import BeautifulSoup
# static urls
season = datetime.datetime.now().year
URL = "https://www.baseball-reference.com/leagues/MLB-standings.shtml".format(season=season)
# request the data
batting_html = requests.get(URL).text
def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["sortable,", "stats_table", "now_sortable"])
    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)
    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)
    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'BBRefTest.csv')
First of all, yes, it would be better to reference the ID: you would expect the developer to have made the ID unique to this table, whereas classes are just style descriptors.
Now, the problem runs deeper. A quick look at the page source shows that the HTML defining the table is actually commented out a few tags above. I suspect a script 'enables' this code on the client side (in your browser). requests.get, which just pulls the HTML without processing any JavaScript, doesn't catch it (you can check the content of batting_html to verify that).
A very quick and dirty fix would be to catch the commented out code and reprocess it in BeautifulSoup:
from bs4 import Comment
...
# parse input
soup = BeautifulSoup(input_html, "lxml")
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comments, "lxml")
# get headers
By the way, you want to specify utf8 encoding when writing your file ...
with open(out_file_name, "w", encoding="utf8") as out_file:
writer = csv.writer(out_file)
...
Now, that's really 'quick and dirty', and I would look deeper into the HTML and JavaScript to see what is really happening before scaling this out to other pages.
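Put together, an end-to-end sketch of that approach could look like the following. The thead/tbody layout and the handling of th cells are assumptions about the page (the first cell of each row may be a th rather than a td, which would also explain the missing team names in the original attempt), so adjust them to what you actually see in the commented-out table:

# Sketch of the 'quick and dirty' fix applied end to end.
# Assumptions: thead/tbody layout and that row labels live in <th> cells.
import csv
import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.baseball-reference.com/leagues/MLB-standings.shtml"
html = requests.get(URL).text

soup = BeautifulSoup(html, "lxml")
wrapper = soup.find("div", id="all_expanded_standings_overall")
comment = wrapper.find(string=lambda text: isinstance(text, Comment))

# Re-parse the commented-out HTML and grab the table by its id
inner = BeautifulSoup(comment, "lxml")
table = inner.find("table", id="expanded_standings_overall")

headers = [th.text for th in table.find("thead").find_all("th")]
rows = []
for tr in table.find("tbody").find_all("tr"):
    # include th as well as td so the first column is not dropped
    rows.append([cell.text for cell in tr.find_all(["th", "td"])])

with open("BBRefTest.csv", "w", newline="", encoding="utf8") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(headers)
    writer.writerows(rows)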
I know this is a repeated question, but I could not find a solution in any of the answers on the web, as they all throw errors.
I am simply trying to scrape headlines from the web and save them to a txt file.
The scraping code works well; however, it saves only the last string, skipping all headlines except the last one.
I have tried looping, putting the writing code before the scraping, appending to a list, etc., and different scraping methods, but they all have the same issue.
Please help.
Here is my code:
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")

    for headlines in page.find_all("h2"):
        print(headlines.text.strip())

    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))
Every time your for loop runs, it overwrites the headlines variable, so when you get to writing to the file, the headlines variable only stores the last headline. An easy solution to this is to bring the for loop inside your with statement, like so:
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip()+"\n") # write a newline after each headline
Here is the full working code, corrected as per the advice.
from bs4 import BeautifulSoup
import requests
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        for headlines in page.find_all("h2"):
            print(headlines.text.strip())
            file_object.write(headlines.text.strip()+"\n")
This code throws an error in Jupyter when opening the file; however, when the file is opened outside Jupyter, the headlines are saved...
I am extracting the "2016-Annual" table at http://www.americashealthrankings.org/api/v1/downloads/131 to a CSV. The table has 3 fields: STATE, RANK, VALUE. I am getting an error with the following:
import urllib2
from bs4 import BeautifulSoup
import csv
url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
table = soup.find('2016-Annual', {'class': 'STATE-RANK-VALUE'})
f = open('output.csv', 'w')
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        STATE = cells[0].find(text=True)
        RANK = cells[1].find(text=True)
        VALUE = cells[2].find(text=True)
        print write_to_file
        f.write(write_to_file)
f.close()
What am I missing here? I am using Python 2.7.
Your code is wrong.
The URL 'http://www.americashealthrankings.org/api/v1/downloads/131' downloads a CSV file.
Download the CSV file to your local computer, and then you can use that file.
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
import urllib2
url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
html = urllib2.urlopen(url).read()
with open('output.csv', 'w') as output:
    output.write(html)
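Once output.csv is saved locally, a quick sketch for reading the three fields (the STATE, RANK, VALUE column names are taken from the question; adjust them if the file's header differs):

# Sketch: read the downloaded CSV with the csv module (Python 2.7, as in the question)
import csv

with open('output.csv') as f:
    for row in csv.DictReader(f):
        print row['STATE'], row['RANK'], row['VALUE']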
According to the BeautifulSoup docs, you need to pass a string to be parsed on initialization. However, page = urllib2.urlopen(req) returns a response object, not a string.
Try using soup = BeautifulSoup(page.read(), 'html.parser') instead.
Also, the variable write_to_file doesn't exist.
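For that last point, here is a minimal sketch of what write_to_file could be (building a comma-separated line by hand is just an illustration; csv.writer would handle quoting for you):

# Sketch: define the missing write_to_file from the cells already extracted above
write_to_file = '{},{},{}\n'.format(STATE, RANK, VALUE)
print write_to_file
f.write(write_to_file)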
If this doesn't solve it, please also post which error you get.
The reason it's not working is that you're pointing to a file that is already a CSV; you can literally load that URL in your browser and it will download in CSV format. The table you're expecting, though, is not at that endpoint; it is at this URL:
http://www.americashealthrankings.org/explore/2016-annual-report
Also, I don't see a class called STATE-RANK-VALUE; I only see th headers called state, rank, and value.
I have problems scraping data from certain URLs with Beautiful Soup.
I've successfully made the part where my code opens a text file with a list of URLs and goes through them.
The first problem I encounter is when I want to go through two separate places on the HTML page.
With the code I have written so far, it only goes through the first "class" and just doesn't search and scrape the other one I defined.
The second issue is that I can only get data if I run my script in the terminal with:
python mitel.py > mitel.txt
The output I get is not the one I want. I am just looking for two strings from it, but I cannot find a way to extract them.
Finally, I cannot get my results written to a CSV.
I only get the last string of the last URL from the URL list into my CSV.
Can you assist a TOTAL beginner in Python?
Here's my script:
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
with open('urllist.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll(True, {"class":["tel-number", "tel-result main"]}):
            finalt = target.text.strip()
            print finalt

with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(finalt)
For some reason, I cannot successfully paste the targeted HTML code, so I'll just put a link here to one of the pages; if it's needed, I'll try to paste it somehow, although it's very big and complex.
Targeted URL for scraping
Thank you so much in advance!
Well, I managed to get some results with the help of #furas and Google.
With this code, I can get all "a" elements from the page, and then in MS Excel I was able to get rid of everything that wasn't a name or a phone number.
Sorting and the rest of the cleanup were also done in Excel... I guess I am too much of a newbie to accomplish everything in one script.
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import requests
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
finalt = []
proxy = urllib2.ProxyHandler({'http': 'http://163.158.216.152:80'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
with open('mater.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll('a'):
            finalt.append( target.text.strip() )
            print finalt

with open('imena1-50.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in finalt:
        writer.writerow([i])
It also uses a proxy... sort of. I didn't get it to pull proxies from a .txt list.
Not bad for a first Python scrape, but far from efficient and far from the way I imagined it.
Maybe your selector is wrong; try this:
for target in soup.findAll(True, {"class":["tel-number",'tel-result-main']}):
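To also get more than the last string into the CSV, you could open the writer once and write one row per URL. A rough sketch, assuming each page yields the values you want from those two classes (Python 2, to match the question's code):

# Sketch: one CSV row per URL, using the corrected selector above
import csv
import urllib2
from bs4 import BeautifulSoup

with open('urllist.txt') as inf, open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    for url in (line.strip() for line in inf):
        soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
        row = [t.text.strip() for t in soup.findAll(True, {"class": ["tel-number", 'tel-result-main']})]
        writer.writerow(row)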