Writing URLs via BeautifulSoup to a CSV file vertically - Python

I have a project for one of my college classes that requires me to pull all URLs from a page on the U.S. Census Bureau website and store them in a CSV file. For the most part I've figured out how to do that, but for some reason when the data gets appended to the CSV file, all the entries are being inserted horizontally. I would expect the data to be arranged vertically, meaning row 1 has the first item in the list, row 2 has the second item, and so on. I have tried several approaches but the data always ends up as a horizontal representation. I am new to Python and obviously don't have a firm enough grasp on the language to figure this out. Any help would be greatly appreciated.
I am parsing the website using Beautifulsoup4 and the request library. Pulling all the 'a' tags from the website was easy enough and getting the URLs from those 'a' tags into a list was pretty clear as well. But when I append the list to my CSV file with a writerow function, all the data ends up in one row as opposed to one separate row for each URL.
import requests
import csv
from bs4 import BeautifulSoup
from pprint import pprint

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

# Create a list to append the web data to
links = []

# Pull the href from all instances of the <a> tag
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(links)

pprint(links)

Try this:

import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

# Create a list to append the web data to
links = []

# Pull the href from all instances of the <a> tag
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    for link in links:
        if isinstance(link, str):
            f.write(link + "\n")
I changed it to check whether a given link is actually a string and, if so, to write it followed by a newline.

Try making a list of lists by appending each URL inside its own list:

links.append([link.get('href')])

Then the csv writer will put each inner list on a new line with writerows:

writer.writerows(links)
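Putting those two changes together, a minimal sketch of this approach (same page and output file as in the question; the filter on href is only there to skip anchors that have no href at all) might look like this:

import csv
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

# wrap each href in its own one-element list so it becomes one CSV row
links = [[a.get('href')] for a in soup.find_all('a') if a.get('href')]

with open("htmlTable.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(links)  # one URL per row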

Related

Extract data table from a specific page with multiple same name tables using Python BeautifulSoup

I am very new to python and BeautifulSoup. I wrote the code below to call up the website https://www.baseball-reference.com/leagues/MLB-standings.shtml, with the goal of scraping the table at the bottom named "MLB Detailed Standings" and exporting it to a CSV file. My code successfully creates a CSV file, but it pulls the wrong table and is missing the first column with the team names. It pulls in the "East Division" table up top (excluding the first column) rather than the full "MLB Detailed Standings" table at the bottom.
Wondering if there is a simple way to pull the MLB Detailed Standings table at the bottom. When I inspect the page, the ID for the specific table I am trying to pull is "expanded_standings_overall". Do I need to reference this ID in my code? Any other guidance on reworking the code to pull the correct table would be greatly appreciated. Again, I am very new and trying my best to learn.
import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.baseball-reference.com/leagues/MLB-standings.shtml".format(season=season)

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["sortable,", "stats_table", "now_sortable"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'BBRefTest.csv')
First of all, yes, it would be better to reference the ID, since you can expect the developer to have made the ID unique to this table, whereas classes are just style descriptors.
Now, the problem runs deeper. A quick look at the page source shows that the HTML defining the table is commented out a few tags above. I suspect a script 'enables' this code on the client side (in your browser). requests.get, which just pulls the HTML without processing any JavaScript, doesn't catch it (you can check the content of batting_html to verify that).
A very quick and dirty fix would be to catch the commented out code and reprocess it in BeautifulSoup:
from bs4 import Comment
...
# parse input
soup = BeautifulSoup(input_html, "lxml")
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comments, "lxml")
# get headers
By the way, you want to specify utf8 encoding when writing your file ...
with open(out_file_name, "w", encoding="utf8") as out_file:
writer = csv.writer(out_file)
...
Now, that's really 'quick and dirty', and I would dig deeper into the HTML and JavaScript to see what is really happening before scaling this out to other pages.
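For reference, here is a hedged end-to-end sketch of that quick-and-dirty approach, combining the snippets above into one runnable piece (treat it as a starting point only; the list comprehensions over the headers and rows just restate the loops from the question's function):

import csv
import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.baseball-reference.com/leagues/MLB-standings.shtml"
batting_html = requests.get(URL).text
soup = BeautifulSoup(batting_html, "lxml")

# the full standings table sits inside an HTML comment in this wrapper div
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comments, "lxml")

headers = [th.text for th in table.find("thead").find_all("th")]
rows = [[cell.text for cell in tr.find_all("td")] for tr in table.find_all("tr")]

with open("BBRefTest.csv", "w", encoding="utf8") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(headers)
    writer.writerows(rows)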

Scrape texts from multiple websites and save separately in text files

I am a beginner in Python and have been using it for my master's thesis to conduct textual analysis of the gaming industry. I have been trying to scrape reviews from several gaming critic sites.
I used a list of URLs in the code to scrape the reviews and have been successful. Unfortunately, I could not write each review to a separate file: when I write the files, either every file gets only the review from the last URL in the list, or (after changing the indentation) every file gets all of the reviews. My code is below. Could you kindly suggest what's wrong here?
from bs4 import BeautifulSoup
import requests

urls = ['http://www.playstationlifestyle.net/2018/05/08/ao-international-tennis-review/#/slide/1',
        'http://www.playstationlifestyle.net/2018/03/27/atelier-lydie-and-suelle-review/#/slide/1',
        'http://www.playstationlifestyle.net/2018/03/15/attack-on-titan-2-review-from-a-different-perspective-ps4/#/slide/1']

for url in urls:
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')

for i in range(len(urls)):
    file = open('filename%i.txt' % i, 'w')
    for article_body in soup.find_all('p'):
        body = article_body.text
        file.write(body)
    file.close()
I think you only need one for loop. If I understand correctly, you only want to iterate through urls and store an individual file for each.
Therefore, I would suggest removing the second for statement. You do, though, need to modify for url in urls to get a unique index for the current url that you can use as i, and enumerate gives you exactly that.
Your single for statement would become:
for i, url in enumerate(urls):
I've not tested this myself but this is what I believe should resolve your issue.
I can see you are a beginner in Python. I'll post the corrected code first and then explain it.
for i, url in enumerate(urls):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')
    file = open('filename{}.txt'.format(i), 'w')
    for article_body in soup.find_all('p'):
        body = article_body.text
        file.write(body)
    file.close()
The reason why you receive only the review from the last URL in all the files:
one variable holds one value at a time, so after the first for loop finishes you are left with only the last result (the third one). The results of the first and second requests have been overwritten:
for url in urls:
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')

What am I missing from this script to scrape a row of a table from a webpage?

As you can see, I've made my very first novice attempt to scrape this webpage. I inspected the page, found the td elements, and what I want is in the a href inside each td.
import requests
from bs4 import BeautifulSoup
import lxml

# URL for the table
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'

# grab the page
html = requests.get(url).text

# import into BS
soup = BeautifulSoup(html, "lxml")
print(soup)

# find data we want, starting with first row
for item_name in soup.find_all("td", {"class": "table-item-link"}):
    print(table-item-link.text)
My objective: scrape the page, grab the name of each item, and place those names into a table of some kind. I'm not writing to CSV yet, as I'm way too novice for that; I'm just taking it one step at a time. For this step, I'm just trying to figure out how to grab the item name and store it in a table. Next, I will learn how to move on to the other things in each row I want to grab: the total rise, and finally the percent change.
End goal: to be able to scrape a table like this, grab everything I need from each row, store it in my own table, and then export to CSV. But I'm not there yet, so one step at a time!
The following should do what you need. Creating a CSV file using Python's csv library is quite straightforward. It simply takes a list of items and writes them correctly as comma-separated entries into the file for you:
import requests
from bs4 import BeautifulSoup
import lxml
import csv

header = ['Item', 'Start price', 'End price', 'Total Rise', 'Change']
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
table = soup.find("a", {"class": "table-item-link"}).parent.parent.parent

with open('prices.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for tr in table.find_all('tr'):
        row = [td.get_text(strip=True) for td in tr.find_all('td')]
        del row[1]
        csv_output.writerow(row)
Giving you an output prices.csv starting as:
Item,Start price,End price,Total Rise,Change
Opal bolt tips,2,3,1,+50%
Half plain pizza,541,727,186,+34%
Poorly-cooked bird mea...,79,101,22,+27%
I use .parent.parent.parent simply to work backwards to find the start of the containing table for the entry that you were looking for.
The HTML table is composed of a load of <tr> elements, and within each are a load of <td> elements. So the trick is to first find the table, then use find_all() to iterate through all of the <tr> elements inside it. Then, within each of those, find the <td> elements and use get_text(strip=True) to extract the text inside each one. strip=True removes any extra newlines or spaces to ensure you just get the text you need.
I used a Python list comprehension to create the list of values in each row. A separate for loop could also be used, and might be easier to understand initially, e.g.
row = []
for td in tr.find_all('td'):
    row.append(td.get_text(strip=True))
Note, the advantage of using Python's CSV library rather than simply writing the information yourself to the file is that if any of the values were to contain a comma in them, it would automatically correctly enclose the entry in quotes for you.
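As a tiny illustration of that quoting behaviour (the values here are made up, not taken from the scraped page):

import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerow(['Half plain pizza', '1,000', '+34%'])
# the middle value contains a comma, so it is written quoted: Half plain pizza,"1,000",+34%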
I edited your code to get the item:

import requests
from bs4 import BeautifulSoup
import lxml

# URL for the table
url = 'http://services.runescape.com/m=itemdb_rs/top100?list=2'

# grab the page
html = requests.get(url).text

# import into BS
soup = BeautifulSoup(html, "lxml")

# find data we want, starting with first row
# Tag is <a> not <td>, as <td> is just holding the <a> tags
# You were also not using the right var name in your for loop
for item_name in soup.find_all("a", {"class": "table-item-link"}):
    print(item_name.text)
To store your data easily in any format, I suggest tablib which is well documented and can handle many formats.
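If you do try tablib later, a minimal sketch of exporting the same kind of row data (the headers and values here are illustrative, not taken from the page) might look like this:

import tablib

data = tablib.Dataset()
data.headers = ['Item', 'Total Rise', 'Change']
data.append(['Opal bolt tips', '1', '+50%'])

# the same Dataset can be exported to other formats, e.g. csv or json
with open('items.csv', 'w', newline='') as f:
    f.write(data.export('csv'))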

How to extract text from a webpage using python 2.7?

I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:
http://www.ebi.ac.uk/ena/data/view/ERS019623
I have thousands of assemblies that I want to track down and extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies is of the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies so I can construct the URL for each one.
I've tried a few different methods. Firstly, if you add "&display=XML" to the end of the URL it prints the XML (or at least I'm presuming it's printing the XML for the entire page; the problem is that the study accession "PRJ******" is nowhere to be seen there). I had utilised this to extract another code that I needed from the same webpage, the run accession, which is always of the format "ERR******", using the code below:
import urllib2
from bs4 import BeautifulSoup
import re
import csv

with open('/Users/bj5/Desktop/web_scrape_test.csv','rb') as f:
    reader = csv.reader(f)  # opens csv containing list of ERS numbers
    for row in reader:
        sample = row[0]  # reads index 0 (1st row)
        ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml"  # creates URL using ERS number from 1st row
        page = urllib2.urlopen(ERSpage)  # opens url and assigns it to variable page
        soup = BeautifulSoup(page, "html.parser")  # parses the html/xml from page and assigns it to variable called soup
        page_text = soup.text  # returns text from variable soup, i.e. no tags
        ERS = re.search('ERS......', page_text, flags=0).group(0)  # returns first ERS followed by six wildcards
        ERR = re.search('ERR......', page_text, flags=0).group(0)  # returns first ERR followed by six wildcards
        print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample  # prints ERS,ERR,URL
This worked very well, but as the study accession is not in the XML I can't use it to access this.
I also attempted to use BeautifulSoup again to download the HTML by doing this:
from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"

def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print soup

get_category_links(BASE_URL)
But again I can't see the study accession in the output from this either...
I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.
When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.
So my question is this: what is the code that I'm looking at in inspect element, XML or HTML (or something else)? Why does it look different to the code that prints in my console when I try and use BeautifulSoup to parse HTML? And finally how can I scrape the study accessions (PRJ******) from these webpages?
(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')
ERS = soup.find('primary_id').text
ERR = soup.find('id', text=re.compile(r'^ERR')).text
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)
print(ERS, ERR, url)
out:
ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623
bs4 can parse XML files; just treat them like HTML, they are all the same to it, so there is no need to use a regex to extract the info.
I found a TEXT download link:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt
This link's fields can be changed to get the data you want, like this:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession&download=txt
By doing so, you can get all your data in a text file.
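A hedged sketch of reading that report programmatically (assuming, as the endpoint appears to do, that it returns tab-separated text with a header line first):

import requests

sample = 'ERS019623'
report_url = ('http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=' + sample +
              '&result=read_run&fields=study_accession&download=txt')
report = requests.get(report_url).text

# skip the header line, then take the first column of each remaining row
lines = report.strip().split('\n')
study_accessions = [line.split('\t')[0] for line in lines[1:]]
print(study_accessions)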
In your sample, soup is a BeautifulSoup object: a representation of the parsed document.
If you want to print the entire HTML of the document, you can call print(soup.prettify()) or if you want the text within it print(soup.get_text()).
The soup object has other possibilities to access parts of the document you are interested in: to navigate the parsed tree, to search in it ...
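For example, given a soup parsed from the page, a couple of hedged one-liners for searching and navigating (the assumption that the study accession shows up as link text starting with "PRJ" is mine, not something verified against the ENA page):

# search: every <a> whose text starts with "PRJ"
prj_links = soup.find_all('a', string=lambda s: s and s.startswith('PRJ'))

# navigate: the table row containing the first match, if there is one
if prj_links:
    print(prj_links[0].find_parent('tr'))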

Python Scraper for Links

import requests
from bs4 import BeautifulSoup

data = requests.get("http://www.basketball-reference.com/leagues/NBA_2014_games.html")
soup = BeautifulSoup(data.content)
soup.find_all("a")
for link in soup.find_all("a"):
    "<a href='%s'>%s</a>" % (link.get("href=/boxscores"), link.text)
I am trying to get the links for the box scores only, then run a loop and organize the data from the individual links into a CSV. I need to save the links as vectors and run a loop... then I am stuck, and I am not sure if this is even the proper way to do it.
The idea is to iterate over all links that have an href attribute (the a[href] CSS selector), then loop over the links and construct an absolute link if the href attribute value doesn't start with http. Collect all links into a list of lists and use writerows() to dump it to CSV:
import csv
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests

base_url = 'http://www.basketball-reference.com'
data = requests.get("http://www.basketball-reference.com/leagues/NBA_2014_games.html")
soup = BeautifulSoup(data.content)

links = [[urljoin(base_url, link['href']) if not link['href'].startswith('http') else link['href']]
         for link in soup.select("a[href]")]

with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(links)
output.csv now contains:
http://www.sports-reference.com
http://www.baseball-reference.com
http://www.sports-reference.com/cbb/
http://www.pro-football-reference.com
http://www.sports-reference.com/cfb/
http://www.hockey-reference.com/
http://www.sports-reference.com/olympics/
http://www.sports-reference.com/blog/
http://www.sports-reference.com/feedback/
http://www.basketball-reference.com/my/auth.cgi
http://twitter.com/bball_ref
...
It is unclear exactly what your output should be, but this is, at least, something you can use as a starting point.
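Since the question is specifically after the box score links, one possible refinement, reusing the imports, base_url and soup from the snippet above, would be to keep only hrefs that start with /boxscores (an assumption about the page's URL pattern, not something I have verified) before writing them out:

# keep only hrefs that look like box score pages
boxscore_links = [[urljoin(base_url, link['href'])]
                  for link in soup.select("a[href]")
                  if link['href'].startswith('/boxscores')]

with open('boxscores.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(boxscore_links)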
