Syntax issues when scraping data - python

    import requests
    from bs4 import BeautifulSoup
    import csv
    from urlparse import urljoin
    import urllib2
    from lxml import html
    base_url = 'http://www.pro-football-reference.com'  # base url for concatenation
    data = requests.get("http://www.pro-football-reference.com/years/2014/games.htm")  # website for scraping
    soup = BeautifulSoup(data.content)
    list_of_cells = []
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            if link.get_text() == 'boxscore':
                url = base_url + link['href']
                for x in url:
                    response = requests.get('x')
                    html = response.content
                    soup = BeautifulSoup(html)
                    table = soup.find('table', attrs={'class': 'stats_table x_large_text'})
                    for row in table.findAll('tr'):
                        for cell in row.findAll('td'):
                            text = cell.text.replace(' ', '')
                            list_of_cells.append(text)
                    print list_of_cells
I am using the code above to get all the boxscore URLs from http://www.pro-football-reference.com/years/2014/games.htm. After I get these URLs, I would like to loop through them to scrape the quarter-by-quarter data for each team, but my syntax always seems to be off no matter how I format the code.
If possible, I would like to scrape more than just the scoring data by also getting the Game Info, officials, and expected points per game.

If you modify your loop slightly to:
    for link in soup.find_all('a'):
        if not link.has_attr('href'):
            continue
        if link.get_text() != 'boxscore':
            continue
        url = base_url + link['href']
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html)
        # Scores
        table = soup.find('table', attrs={'id': 'scoring'})
        for row in table.findAll('tr'):
            for cell in row.findAll('td'):
                text = cell.text.replace(' ', '')
                list_of_cells.append(text)
        print list_of_cells
That returns each of the cells for each row in the scoring table for each page linked to with the 'boxscore' text.
The issues I found with the existing code were:
You were attempting to loop through each character in the href returned for the 'boxscore' link.
You were always requesting the string 'x'.
Not so much an issue, but I changed the table selector to identify the table by its id 'scoring' rather than by its class. Ids at least should be unique within the page (though there is no guarantee).
I'd recommend that you find each table (or HTML element) containing the data you want in the main loop (e.g. score_table = soup.find('table', ...)), but that you move the code that parses that data, e.g.
    for row in table.findAll('tr'):
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
    print list_of_cells
...into a separate function that returns said data (one for each type of data you are extracting), just to keep the code slightly more manageable. The more the code indents to handle if tests and for loops the more difficult it tends to be to follow the flow. For example:
    score_table = soup.find('table', attrs={'id': 'scoring'})
    score_data = parse_score_table(score_table)
    other_table = soup.find('table', attrs={'id': 'other'})
    other_data = parse_other_table(other_table)
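A minimal sketch of what such a helper might look like, in Python 3. The inline HTML is made up for illustration and is not the real page markup, and parse_score_table is only the hypothetical name used in the snippet above:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a boxscore page; the real markup may differ.
SAMPLE_HTML = """
<table id="scoring">
  <tr><th>Team</th><th>1</th><th>2</th><th>Final</th></tr>
  <tr><td>Home</td><td>7</td><td>3</td><td>10</td></tr>
  <tr><td>Away</td><td>0</td><td>14</td><td>14</td></tr>
</table>
"""


def parse_score_table(table):
    """Return one list of cell strings per data row; [] if the table is missing."""
    if table is None:
        return []
    rows = []
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
        if cells:  # header rows contain only <th>, so they come back empty
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
score_data = parse_score_table(soup.find('table', attrs={'id': 'scoring'}))
print(score_data)
```

Returning an empty list when the table is missing keeps the main loop from crashing on pages that lack that table.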

Related

HTML parts locating

I am trying to extract each row individually to eventually create a dataframe to export them into a csv. I can't locate the individual parts of the html.
I can find and save the entire content (although I can only seem to save this on a loop so the pages appear hundreds of times), but I can't find any html parts nested beneath this. My code is as follows, trying to find the first row:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    content = soup.find('div', {'class': 'view-content'})
    for infos in content:
        try:
            data = infos.find('div', {'class': 'type type_18'}).text
        except:
            print("None found")
    df = pd.DataFrame(data)
    df.columns = df.columns.str.lower().str.replace(': ','')
    df[['type','rrr']] = df['rrr'].str.split("|",expand=True)
    df.to_csv (r'savehere.csv', index = False, header = True)
This code just prints "None found" because, I assume, it hasn't found anything else to print. I don't know if I am not finding the right html part or what.
Any help would be much appreciated.
What happens?
The main issue is that content = soup.find('div', {'class': 'view-content'}) is not a ResultSet; it is a single element. Your for loop therefore iterates over that element's direct children (tags and the whitespace strings between them) rather than over a list of matches.
As a consequence, for the text children you switch from BeautifulSoup's find() method to Python's string method find(), and the two work differently. Without the try/except you can see what is going on, because for those children it tries to find a substring:
    for x in soup.find('div', {'class': 'view-content'}):
        print(x.find('div'))
Output
...
-1
<div class="views-field views-field-title-1"> <span class="views-label views-label-title-1">RRR: </span> <span class="field-content"><div class="type type_18">Eleemosynary grant</div>2256</span> </div>
...
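To see where the -1 comes from, here is a small sketch on made-up markup (an assumption, not the real page). Iterating over a single Tag yields its children, and the whitespace between tags is a NavigableString, which is a str subclass, so calling .find() on it runs the plain string method:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up markup with the same shape as the page in the question.
html = """<div class="view-content">
<div class="views-row"><div class="type type_18">Eleemosynary grant</div></div>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
content = soup.find('div', {'class': 'view-content'})

results = []
for child in content:
    if isinstance(child, NavigableString):
        # Text node: str.find() returns -1 when the substring is absent
        results.append(child.find('div'))
    elif isinstance(child, Tag):
        # Element: Tag.find() returns the first matching descendant
        results.append(child.find('div').get_text())

print(results)
```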
How to fix?
Select your elements more specifically, in this case the views-row containers:
    sections = soup.find_all('div', {'class': 'views-row'})
While you iterate over the sections, you can select the expected value from each one:
    sections = soup.find_all('div', {'class': 'views-row'})
    for section in sections:
        print(section.select_one('div[class*="type_"]').text)
Example
This scrapes all the information and creates a DataFrame:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    data = []
    url = #link here#
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    sections = soup.find_all('div', {'class': 'views-row'})
    for section in sections:
        d = {}
        for row in section.select('div.views-field'):
            d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|',strip=True)
        data.append(d)
    df = pd.DataFrame(data)
    ### replacing : in header and set all to lower case
    df.columns = df.columns.str.lower().str.replace(': ','')
    ...
I think you wanted to paginate using a for loop with range() and grab the RRR value. I've handled the pagination through the page parameter in the long URL.
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    url = #insert url#
    data=[]
    for page in range(1,7):
        req=requests.get(url.format(page=page))
        soup = BeautifulSoup(req.content,'lxml')
        for r in soup.select('[class="views-field views-field-title-1"] span:nth-child(2)'):
            rr=list(r.stripped_strings)[-1]
            #print(rr)
            data.append(rr)
    df = pd.DataFrame(data,columns=['RRR'])
    print(df)
    #df.to_csv('data.csv',index=False)

How would I scrape some of the 'title' attributes if there are multiple under the same HTML tree?

I'm trying to scrape the front page of https://nyaa.si/ for torrent names and torrent magnet links. I was successful in getting the magnet links but am having issues with the torrent names, because of how the HTML around the names is structured. The contents I'm trying to scrape are located in a <td> tag (a table cell) which can be uniquely identified through an attribute, but inside that cell the name sits in the title attribute of an <a> tag, and that <a> tag has no uniquely identifying attribute that I can see. Is there any way I can scrape this information?
Here is my code:
    import re, requests
    from bs4 import BeautifulSoup
    nyaa_link = 'https://nyaa.si/'
    request = requests.get(nyaa_link, headers={'User-Agent': 'Mozilla/5.0'})
    source = request.content
    soup = BeautifulSoup(source, 'lxml')
    #GETTING TORRENT NAMES
    title = []
    rows = soup.findAll("td", colspan="2")
    for row in rows:
        title.append(row.content)
    #GETTING MAGNET LINKS
    magnets = []
    for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
        magnets.append(link.get('href'))
    print(magnets)
You need to pull the title from the link inside the table cell. Since each <td> here contains an <a>, just call td.find('a')['title']:
    import re, requests
    from bs4 import BeautifulSoup
    nyaa_link = 'https://nyaa.si/'
    request = requests.get(nyaa_link, headers={'User-Agent': 'Mozilla/5.0'})
    source = request.content
    soup = BeautifulSoup(source, 'lxml')
    #GETTING TORRENT NAMES
    title = []
    rows = soup.findAll("td", colspan="2")
    for row in rows:
        #UPDATED CODE
        desired_title = row.find('a')['title']
        if 'comment' not in desired_title:
            title.append(desired_title)
    #GETTING MAGNET LINKS
    magnets = []
    for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
        magnets.append(link.get('href'))
    print(magnets)
So I've figured out the problem and found a solution.
The problem was this line: if 'comment' not in desired_title:
That condition only kept rows whose first link title did not contain 'comment'. The issue is the HTML structure of the page I was trying to scrape: if a torrent has comments, the comment link appears in the HTML before the link that carries the torrent name. So my code completely skipped torrents with comments on them.
Here is a working solution:
    import re, requests
    from bs4 import BeautifulSoup
    nyaa_link = 'https://nyaa.si/?q=test'
    request = requests.get(nyaa_link)
    source = request.content
    soup = BeautifulSoup(source, 'lxml')
    #GETTING TORRENT NAMES
    title = []
    n = 0
    rows = soup.findAll("td", colspan="2")
    for row in rows:
        if 'comment' in row.find('a')['title']:
            desired_title = row.findAll('a', title=True)[1].text
            print(desired_title)
            title.append(desired_title)
            n = n+1
        else:
            desired_title = row.find('a')['title']
            title.append(desired_title)
            print(row.find('a')['title'])
        print('\n')
    #print(title)
    #GETTING MAGNET LINKS
    magnets = []
    for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
        magnets.append(link.get('href'))
    #print(magnets)
    #GETTING NUMBER OF MAGNET LINKS AND TITLES
    print('Number of rows', len(rows))
    print('Number of magnet links', len(magnets))
    print('Number of titles', len(title))
    print('Number of removed', n)
Thank you CannedScientist for some of the code needed for the solution
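A possibly simpler variant, sketched against made-up markup that mirrors the structure described above (an optional comments link placed before the title link). It takes the last <a> with a title attribute in each cell, so rows with and without comments go through the same line. The sample HTML is an assumption, not a copy of the live page:

```python
from bs4 import BeautifulSoup

# Made-up rows: the first has a comments link before the title link.
SAMPLE_HTML = """
<table>
  <tr><td colspan="2">
    <a href="/view/1#comments" class="comments" title="2 comments">2</a>
    <a href="/view/1" title="First torrent">First torrent</a>
  </td></tr>
  <tr><td colspan="2">
    <a href="/view/2" title="Second torrent">Second torrent</a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

titles = []
for cell in soup.find_all('td', colspan='2'):
    anchors = cell.find_all('a', title=True)
    if anchors:
        # The title link comes last whether or not a comments link precedes it
        titles.append(anchors[-1]['title'])

print(titles)
```

This relies on the title link always being the last titled link in the cell, which should be checked against the real page before use.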

Python Web Scraping - How to scrape this type of site?

Okay, so I need to scrape the following webpage: https://www.programmableweb.com/category/all/apis?deadpool=1
It's a list of APIs. There are approx 22,000 APIs to scrape.
I need to:
1) Get the URL of each API in the table (pages 1-889), and also to scrape the following info:
API name
Description
Category
Submitted
2) I then need to scrape a bunch of information from each URL.
3) Export the data to a CSV
The thing is, I'm a bit lost on how to approach this project. From what I can see, there are no AJAX calls being made to populate the table, which means I'm going to have to parse the HTML directly (right?)
In my head, the logic would be something like this:
Use the requests & BS4 libraries to scrape the table
Then, somehow grab the HREF from every row
Access that HREF, scrape the data, move onto the next one
Rinse and repeat for all table rows.
Am I on the right track, is this possible with requests & BS4?
Thank you SOO much for any help. This is hurting my head haha
Here we go using requests, BeautifulSoup and pandas:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page='
    num = int(input('How Many Page to Parse?> '))
    print('please wait....')
    name = []
    desc = []
    cat = []
    sub = []
    for i in range(0, num):
        r = requests.get(f"{url}{i}")
        soup = BeautifulSoup(r.text, 'html.parser')
        for item1 in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
            name.append(item1.text)
        for item2 in soup.findAll('td', attrs={'class': 'views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8'}):
            desc.append(item2.text)
        for item3 in soup.findAll('td', attrs={'class': 'views-field views-field-field-article-primary-category'}):
            cat.append(item3.text)
        for item4 in soup.findAll('td', attrs={'class': 'views-field views-field-created'}):
            sub.append(item4.text)
    result = []
    for item in zip(name, desc, cat, sub):
        result.append(item)
    df = pd.DataFrame(
        result, columns=['API Name', 'Description', 'Category', 'Submitted'])
    df.to_csv('output.csv')
    print('Task Completed, Result saved to output.csv file.')
Now For href parsing:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    url = 'https://www.programmableweb.com/category/all/apis?deadpool=0&page='
    num = int(input('How Many Page to Parse?> '))
    print('please wait....')
    links = []
    for i in range(0, num):
        r = requests.get(f"{url}{i}")
        soup = BeautifulSoup(r.text, 'html.parser')
        for link in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
            for href in link.findAll('a'):
                result = 'https://www.programmableweb.com'+href.get('href')
                links.append(result)
    spans = []
    for link in links:
        r = requests.get(link)
        soup = BeautifulSoup(r.text, 'html.parser')
        span = [span.text for span in soup.select('div.field span')]
        spans.append(span)
    data = []
    for item in spans:
        data.append(item)
    df = pd.DataFrame(data)
    df.to_csv('data.csv')
    print('Task Completed, Result saved to data.csv file.')
In case you want those two CSV files combined, here's the code:
    import pandas as pd
    a = pd.read_csv("output.csv")
    b = pd.read_csv("data.csv")
    merged = a.merge(b)
    merged.to_csv("final.csv", index=False)
You should read more about scraping if you are going to pursue it.
    from bs4 import BeautifulSoup
    import csv , os , requests
    from urllib import parse
    def SaveAsCsv(list_of_rows):
        try:
            with open('data.csv', mode='a', newline='', encoding='utf-8') as outfile:
                csv.writer(outfile).writerow(list_of_rows)
        except PermissionError:
            print("Please make sure data.csv is closed\n")
    if os.path.isfile('data.csv') and os.access('data.csv', os.R_OK):
        print("File data.csv Already exists \n")
    else:
        SaveAsCsv([ 'api_name','api_link','api_desc','api_cat'])
    BaseUrl = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page={}'
    for i in range(1, 890):
        print('## Getting Page {} out of 889'.format(i))
        url = BaseUrl.format(i)
        res = requests.get(url)
        soup = BeautifulSoup(res.text,'html.parser')
        table_rows = soup.select('div.view-content > table[class="views-table cols-4 table"] > tbody tr')
        for row in table_rows:
            tds = row.select('td')
            api_name = tds[0].text.strip()
            api_link = parse.urljoin(url, tds[0].find('a').get('href'))
            api_desc = tds[1].text.strip()
            api_cat = tds[2].text.strip() if len(tds) >= 3 else ''
            SaveAsCsv([api_name,api_link,api_desc,api_cat])
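The script above builds each API link with urljoin rather than string concatenation. A short sketch of how urljoin resolves different kinds of href against the listing page URL; the href values here are hypothetical examples, not links taken from the site:

```python
from urllib.parse import urljoin

page_url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page=1'

# Root-relative href: replaces the whole path and drops the page's query string
root_relative = urljoin(page_url, '/api/example')
# Document-relative href: resolved against the directory of the current page
doc_relative = urljoin(page_url, 'example')
# Absolute href: returned unchanged
absolute = urljoin(page_url, 'https://example.com/docs')

print(root_relative)
print(doc_relative)
print(absolute)
```

Plain concatenation with a fixed base only works for the root-relative case, which is why urljoin is the safer choice when the href format is not known in advance.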

Unable to print once to get all the data altogether

I've written a script in Python to scrape the tabular content from a webpage. The first column of the main table contains names. Some names link to another page; others are plain names without any link. When a name has no link, I want to parse just that row. When a name does link to another page, the script should first parse the row from the main table and then follow the link to parse the associated information for that name from the table at the bottom of the page, under the title Companies. Finally, everything should be written to a CSV file.
I've tried so far:
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
    base = "https://suite.endole.co.uk"
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table tr")[1:]:
        if not item.select_one("td a[href]"):
            first_table = [i.text for i in item.select("td")]
            print(first_table)
        else:
            first_table = [i.text for i in item.select("td")]
            print(first_table)
            url = urljoin(base,item.select_one("td a[href]").get("href"))
            resp = requests.get(url)
            soup_ano = BeautifulSoup(resp.text,"lxml")
            for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
                associated_info = [elem.text for elem in elems.select("td")]
                print(associated_info)
My script above does almost everything, but I can't work out the logic to print once, rather than three times, so that I get all the data together and can write it to a CSV file.
Put all your scraped data into a list (here I've called it associated_info). Then all the data is in one place, and you can iterate over the list to write it out to a CSV if you like:
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
    base = "https://suite.endole.co.uk"
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    associated_info = []
    for item in soup.select("table tr")[1:]:
        if not item.select_one("td a[href]"):
            associated_info.append([i.text for i in item.select("td")])
        else:
            associated_info.append([i.text for i in item.select("td")])
            url = urljoin(base,item.select_one("td a[href]").get("href"))
            resp = requests.get(url)
            soup_ano = BeautifulSoup(resp.text,"lxml")
            for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
                associated_info.append([elem.text for elem in elems.select("td")])
    print(associated_info)
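Once everything is in one list, writing it out is a single call. A minimal sketch using the standard csv module; the rows and the output filename people.csv are placeholders, not data from the site:

```python
import csv
import os
import tempfile

# Placeholder rows standing in for the scraped associated_info list
associated_info = [
    ['Jane Doe', 'Director', 'Active'],
    ['Example Ltd', 'Secretary', 'Resigned'],
]

# Written to the temp directory here; use any path you like
out_path = os.path.join(tempfile.gettempdir(), 'people.csv')

# newline='' lets the csv module control line endings itself
with open(out_path, 'w', newline='', encoding='utf-8') as outfile:
    csv.writer(outfile).writerows(associated_info)

# Read the file back to confirm each inner list became one CSV row
with open(out_path, newline='', encoding='utf-8') as infile:
    written = list(csv.reader(infile))

print(written)
```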

How to use BeautifulSoup to parse a table?

This is a context-specific question regarding how to use BeautifulSoup to parse an html table in python2.7.
I would like to extract the html table here and place it in a tab-delim csv, and have tried playing around with BeautifulSoup.
Code for context:
    proxies = {
        "http://": "198.204.231.235:3128",
    }
    site = "http://sloanconsortium.org/onlineprogram_listing?page=11&Institution=&field_op_delevery_mode_value_many_to_one[0]=100%25%20online"
    r = requests.get(site, proxies=proxies)
    print 'r: ', r
    html_source = r.text
    print 'src: ', html_source
    soup = BeautifulSoup(html_source)
Why doesn't this code get the 4th row?
    soup.find('table','views-table cols-6').tr[4]
How would I print out all of the elements in the first row (not the header row)?
Okay, someone might be able to give you a one-liner, but the following should get you started:
    table = soup.find('table', class_='views-table cols-6')
    for row in table.find_all('tr'):
        row_text = list()
        for item in row.find_all('td'):
            text = item.text.strip()
            row_text.append(text.encode('utf8'))
        print row_text
I believe your tr[4] is being treated as an attribute lookup rather than as an index, which is what you intended.
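To illustrate on made-up markup (Python 3, not the real page): indexing a Tag with [] looks up an HTML attribute, so a row has to be picked by position from the list returned by find_all() instead.

```python
from bs4 import BeautifulSoup

# Made-up table standing in for the listing page
SAMPLE_HTML = """
<table class="views-table cols-6">
  <tr><th>Institution</th><th>Program</th></tr>
  <tr><td>College A</td><td>Nursing</td></tr>
  <tr><td>College B</td><td>History</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('table', class_='views-table cols-6')

# tag[...] is an attribute lookup, so an integer key raises KeyError
try:
    table.tr[1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# find_all() returns a list-like ResultSet, which can be indexed by position
rows = table.find_all('tr')
first_data_row = [td.get_text(strip=True) for td in rows[1].find_all('td')]

print(lookup_failed, first_data_row)
```

Here rows[0] is the header row, so rows[1] is the first data row; the fourth row overall would be rows[3].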
