r.write('\n') cuts off needed data - python

import requests
from bs4 import BeautifulSoup

# Get the website; the User-Agent header tricks the site into
# thinking the request comes from a real browser.
url = 'http://www.espn.com/mlb/stats/pitching/_/sort/wins/league/al/year/2019/seasontype/2'
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
# Parse the response as HTML.
soup = BeautifulSoup(res.content, 'html.parser')
# Find the stats table on the page.
stats = soup.find_all('table', class_='tablehead')
stats = stats[0]
# Save the table into a text file.
with open('pitchers_stats.txt', 'w') as r:
    for row in stats.find_all('tr'):
        r.write(row.text.ljust(5))
        # Delete the next two lines and the program sort of works
        # (for those trying to help me on Stack Overflow).
        for cell in stats.find_all('td'):
            r.write('\n')
        # Divide each row by a new line.
        r.write('\n')
The program simply prints out "Sortable pitching" and I'm not sure why. It sort of works as expected after deleting these lines:

for cell in stats.find_all('td'):
    r.write('\n')

Doing this shows the following:
.

I assume you are trying to parse the table and post each column on a new line.
The issue is that you're writing the text content of the whole <tr> tag. What you should do instead is iterate over the <td> elements inside it.
import requests
from bs4 import BeautifulSoup

# Get the website; the User-Agent header tricks the site into
# thinking the request comes from a real browser.
url = 'http://www.espn.com/mlb/stats/pitching/_/sort/wins/league/al/year/2019/seasontype/2'
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
# Parse the response as HTML.
soup = BeautifulSoup(res.content, 'html.parser')
# Find the stats table on the page.
stats = soup.find_all('table', class_='tablehead')
stats = stats[0]
# Save the table into a text file: one line per cell, with a blank
# line between rows.
with open('pitchers_stats.txt', 'w') as r:
    for row in stats.find_all('tr'):
        for data in row.find_all('td'):
            r.write(data.text.ljust(5))
            r.write('\n')
        r.write('\n')
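As a follow-up (not part of the original answer), the same loop adapts to comma-separated output with the csv module, which avoids fixed-width padding. This is only a sketch; everything in it reuses names from the question except the .csv filename:

import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.espn.com/mlb/stats/pitching/_/sort/wins/league/al/year/2019/seasontype/2'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')
stats = soup.find_all('table', class_='tablehead')[0]

# One CSV row per table row, one field per cell.
with open('pitchers_stats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in stats.find_all('tr'):
        writer.writerow(cell.text for cell in row.find_all('td'))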

Related

BeautifulSoup not working after the first page

I'm trying to use Python's BeautifulSoup to scrape data from the following website. The data on the website is split over four different pages. Each page has a unique link (i.e. http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/0 for the first page, http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/1 for the second page, etc.). I am able to successfully scrape the data on the first page, but when I try to scrape data for the second page onward it comes up empty. Here is the code I'm using:
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd

# Define url and request webpage
season = 2019
page = 1
url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

# Scrape all of the data in the table
rows = page_soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]

# Get the column headers
headers = player_stats[0]
# Remove the first row
player_stats.pop(0)

# Convert to pandas dataframe
df = pd.DataFrame(player_stats, columns=headers)
# Remove all rows where Name = None
df = df[~df['NAME'].isnull()]
# Remove PLAYER column because it's empty
df = df.drop(columns='PLAYER')
df
Any advice would be much appreciated! I'm a bit new to using BeautifulSoup, so I apologize in advance if the code isn't particularly nice or efficient.
Update: The links only work if opened in Chrome, which is likely what is causing the problem. Is there any way around it?
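One sketch worth trying is to loop the question's own request over all four set values. This is untested and does not address the Chrome-only behaviour noted in the update, so the later pages may still come back empty:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

season = 2019
all_rows = []
for page in range(4):  # pages are set/0 through set/3
    url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page_soup = soup(urlopen(req).read(), "html.parser")
    all_rows.extend(page_soup.findAll('tr')[1:])
print(len(all_rows))  # check how many rows actually came back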

Soup works on one IMDb page but not on another. How to solve?

import requests
from bs4 import BeautifulSoup

# headers was defined earlier in the original post; a typical
# browser User-Agent is shown here.
headers = {'User-Agent': 'Mozilla/5.0'}

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"

results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)

soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")

# Using the unique tag for each movie in the respective link.
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')

print(movie_div1)  # empty list
print(movie_div)   # gives perfect list
Why is movie_div1 giving an empty list? I am not able to identify any difference in the URL structures to indicate the code should be different. All leads appreciated.
Unfortunately the div you want is rendered by JavaScript, so you can't get it by scraping the raw HTML response.
You can get the movies you want from the JSON request your browser makes; then you won't need to scrape the page with BeautifulSoup at all, which makes your script much faster.
The second option is using Selenium.
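A minimal sketch of the Selenium option (not part of the original answer; it assumes a chromedriver binary is installed and on PATH):

from selenium import webdriver
from bs4 import BeautifulSoup

# Let a real browser execute the page's JavaScript, then parse the
# rendered HTML with BeautifulSoup as before. Slow pages may need
# an explicit wait before reading page_source.
driver = webdriver.Chrome()
driver.get("https://www.imdb.com/user/ur34087578/watchlist")
soup1 = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

movie_div1 = soup1.find_all('div', class_='lister-item-content')
print(movie_div1)  # should no longer be empty once JS has run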
Good luck.
As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON response is embedded within the HTML itself and is later converted to HTML by browser-side JS (that is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json

url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# The watchlist JSON lives in a script inside this span.
details = str(soup.find('span', class_='ab_widget'))

# Cut out the JSON literal between "IMDbReactInitialState.push(" and ");".
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])

imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])

Web Scraping Python For Two Different Buttons

I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries.
There are two tables on this website, which are switched when you select one of the options:
1. Treasury Notes and Bonds
2. Treasury Bills
I want to scrape the data for Treasury Bills, but nothing in the link or the page's attributes changes when I click that option. Everything I have tried so far ends up scraping the data for Treasury Notes and Bonds instead.
Can someone help me with that?
Here is my code:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.find_all('tr')
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)

df = pd.DataFrame(list_rows)
df1 = df[0].str.split(',', expand=True)
All the data on the site is loaded once, and then JS is used to switch the values shown in the table.
Here is a working, quickly written piece of code:
import requests
from bs4 import BeautifulSoup
import json

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.find_all('script')  # we get all the script tags
importantJson = ''
for r in rows:
    text = r.text
    if 'NOTES_AND_BONDS' in text:  # the script tag containing the data; you can probably do this better
        importantJson = text
        break

# remove the non-JSON stuff
importantJson = importantJson\
    .replace('window.__STATE__ =', '')\
    .replace(';', '')\
    .strip()

# parse the JSON
jsn = json.loads(importantJson)
print(jsn)  # json object containing all the data you need
How did I get to this conclusion?
First I noticed that switching between the two tables makes no HTTP requests to the server, meaning the data is already there.
Then I inspected the table HTML and noticed that there is only one table whose contents change dynamically, which led me to the conclusion that the data is already on the page.
Then, with a simple search in the source, I found the script tag containing the JSON.
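As a hedged follow-up sketch: with jsn parsed, scanning its top-level keys is one way to locate the Treasury Bills entries. The key names matched below are assumptions to verify against the real payload, not documented structure:

# List any top-level keys that look bond- or bill-related;
# 'NOTES_AND_BONDS' appears in the page, so a similarly named
# key for bills is plausible but unverified.
for key in jsn:
    if 'bond' in key.lower() or 'bill' in key.lower():
        print(key)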

Python Extract Table from URL to csv

I'm extracting the "2016-Annual" table at http://www.americashealthrankings.org/api/v1/downloads/131 to a CSV. The table has 3 fields: STATE, RANK, VALUE. I'm getting an error with the following:
import urllib2
from bs4 import BeautifulSoup
import csv

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
header = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find('2016-Annual', {'class': 'STATE-RANK-VALUE'})
f = open('output.csv', 'w')
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        STATE = cells[0].find(text=True)
        RANK = cells[1].find(text=True)
        VALUE = cells[2].find(text=True)
        print write_to_file
        f.write(write_to_file)
f.close()
What am I missing here? I'm using Python 2.7.
Your code is wrong: 'http://www.americashealthrankings.org/api/v1/downloads/131' already downloads a CSV file.
Download the CSV file to your local computer, and you can use that file directly.
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
import urllib2

url = 'http://www.americashealthrankings.org/api/v1/downloads/131'
html = urllib2.urlopen(url).read()
with open('output.csv', 'w') as output:
    output.write(html)
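Once the file is saved, a minimal sketch for pulling out the three fields the question names (this assumes the CSV's header row actually spells them STATE, RANK, VALUE; verify against the real file):

import csv

# Read the downloaded CSV and print the three fields per row.
with open('output.csv') as f:
    for row in csv.DictReader(f):
        print row['STATE'], row['RANK'], row['VALUE']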
According to the BeautifulSoup docs, you need to pass a string to be parsed on initialization. However, page = urllib2.urlopen(req) returns a file-like response object, not a string.
Try using soup = BeautifulSoup(page.read(), 'html.parser') instead.
Also, the variable write_to_file doesn't exist.
If this doesn't solve it, please also post the error you get.
The reason it's not working is that you're pointing at a file that is already a CSV: you can literally load that URL in your browser and it will download in CSV format. The table you're expecting, though, is not at that endpoint; it is at this URL:
http://www.americashealthrankings.org/explore/2016-annual-report
Also, I don't see a class called STATE-RANK-VALUE; I only see <th> headers called state, rank, and value.

How to go through a list of urls to retrieve page data - Python

In a .py file, I have a variable that's storing a list of urls. How do I properly build a loop to retrieve the code from each url, so that I can extract specific data items from each page?
This is what I've tried so far:
import requests
import re
from bs4 import BeautifulSoup
import csv

# Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()
print csvfilelist

# Get data from each url
def get_page_data():
    for page_data in csvfilelist.splitlines():
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

pages = get_page_data()
print pages
By not using the csv module, you are reading the gymsfinal.csv file as a plain text file. Read through the documentation on reading/writing CSV files here: CSV File Reading and Writing.
Also, you will only get the first page's soup from your current code, because get_page_data() returns after creating the first soup. You can yield from the function instead:
def get_page_data():
    for page_data in csvfilelist.splitlines():
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup

pages = get_page_data()
# iterate over the generator
for page in pages:
    print page
Also, close the file you just opened.
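Putting both suggestions together, a sketch that uses the csv module and a with-block so the file closes itself (it assumes each row's first column holds a URL):

import csv
import requests
from bs4 import BeautifulSoup

def get_page_data():
    # The with-block closes gymsfinal.csv automatically.
    with open("gymsfinal.csv") as csvfile:
        for row in csv.reader(csvfile):
            r = requests.get(row[0].strip())
            yield BeautifulSoup(r.text, 'html.parser')

for page in get_page_data():
    print page.title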
