I am having an issue where not all instances are captured within a relatively simple BeautifulSoup scrape. What I am running is the below:
from bs4 import BeautifulSoup as bsoup
import requests as reqs
home_test = "https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two"
away_test = "https://fbref.com/en/matches/ea736ad1/Carlisle-United-Northampton-Town-August-11-2018-League-Two"
page_to_parse = home_test
page = reqs.get(page_to_parse)
status_code = page.status_code
status_code = str(status_code)
parse_page = bsoup(page.content, 'html.parser')
find_stats = parse_page.find_all('div',id="team_stats_extra")
print(find_stats)
for stat in find_stats:
    add_stats = stat.find_next('div').get_text()
    print(add_stats)
If you have a look at the first print, the scrape captures the part of the website that I'm after; however, if you inspect the second print, half of the instances from the first one aren't being picked up at all. I haven't set any limits on this, so in theory it should capture all the right ones.
I've already tested quite a few different variants of find_next, find, and find_all, but the find in the second loop never captures all of them.
Results are always:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Where it should capture the following instead:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Northampton Lincoln City
2Offsides2
9Goal Kicks15
32Throw Ins24
18Long Balls23
parse_page.find_all returns a list of one item, the Tag with id="team_stats_extra". The loop needs to be over its child elements:
find_stats = parse_page.find_all('div', id="team_stats_extra")
all_stats = find_stats[0].find_all('div', recursive=False)
for stat in all_stats:
    print(stat.get_text())
If you have multiple tables, use two loops:
find_stats = parse_page.find_all('div', id="team_stats_extra")
for stats in find_stats:
    all_stats = stats.find_all('div', recursive=False)
    for stat in all_stats:
        print(stat.get_text())
find_stats = parse_page.find_all('div',id="team_stats_extra") actually returns only one block, so the next loop performs only one iteration.
You can change the way you select the div blocks with:
find_stats = parse_page.select('div#team_stats_extra > div')
print(len(find_stats)) # >>> returns 2
for stat in find_stats:
    add_stats = stat.get_text()
    print(add_stats)
To explain the selector select('div#team_stats_extra > div'), it is the same as:
find the div block with the id team_stats_extra
and select all direct children that are div
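For instance, here's a minimal standalone sketch of that child combinator (the HTML is made up for illustration):
from bs4 import BeautifulSoup

html = """
<div id="team_stats_extra">
  <div>block one</div>
  <div>block two <div>nested, not a direct child</div></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# '>' restricts the selection to direct children, so the nested inner div is not returned on its own
print(len(soup.select('div#team_stats_extra > div')))  # 2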
With bs4 4.7.1+ you can use :has to ensure you get only the divs that have a child with class th, so you have the appropriate elements to loop over:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two')
soup = bs(r.content, 'lxml')
for div in soup.select('#team_stats_extra div:has(.th)'):
    print(div.get_text())
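If your bs4 is older than 4.7.1 and :has isn't available, a rough equivalent would be to filter the direct children by hand (a sketch, assuming the same soup as above):
wrapper = soup.find('div', id='team_stats_extra')
for div in wrapper.find_all('div', recursive=False):
    if div.find(class_='th'):  # keep only blocks that contain a .th header cell
        print(div.get_text())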
Related
I'm a beginner at programming, so I have a problem with the find method in BeautifulSoup when I use it in web scraping. I have this code:
from os import execle, link, unlink, write
from typing import Text
import requests
from bs4 import BeautifulSoup
import csv
from itertools import zip_longest
job_titleL =[]
company_nameL=[]
location_nameL=[]
experience_inL=[]
links=[]
salary=[]
job_requirementsL=[]
date=[]
result= requests.get(f"https://wuzzuf.net/search/jobs/?a=%7B%7D&q=python&start=1")
source = result.content
soup= BeautifulSoup(source , "lxml")
job_titles = soup.find_all("h2",{"class":"css-m604qf"} )
companies_names = soup.find_all("a",{"class":"css-17s97q8"})
locations_names = soup.find_all("span",{"class":"css-5wys0k"})
experience_in = soup.find_all("a", {"class":"css-5x9pm1"})
posted_new = soup.find_all("div",{"class":"css-4c4ojb"})
posted_old = soup.find_all("div",{"class":"css-do6t5g"})
posted = [*posted_new,*posted_old]
for L in range(len(job_titles)):
    job_titleL.append(job_titles[L].text)
    links.append(job_titles[L].find('a').attrs['href'])
    company_nameL.append(companies_names[L].text)
    location_nameL.append(locations_names[L].text)
    experience_inL.append(experience_in[L].text)
    date_text = posted[L].text.replace("-", "").strip()
    date.append(posted[L].text)
for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    requirements = soup.find("div", {"class": "css-1t5f0fr"}).ul
    requirements1 = soup.find("div", {"class": "css-1t5f0fr"}).p
    respon_text = ""
    if requirements:
        for li in requirements.find_all("li"):
            print(li)
    if requirements1:
        for br in requirements1.find_all("br"):
            print(br)
    respon_text += li.text + "|"
    job_requirementsL.append(respon_text)
file_list = [job_titleL, company_nameL, date, location_nameL, experience_inL, links, job_requirementsL]
exported = zip_longest(*file_list)
with open('newspeard2.csv', "w") as spreadsheet:
    wr = csv.writer(spreadsheet)
    wr.writerow(["job title", "company name", "date", "location", "experience in", "links", "job requirements"])
    wr.writerows(exported)
Note: I'm not very good at English :(
So when I use the find method to get the job requirements from each job on the website page (Wuzzuf) and use a for loop to go through each text in the job requirements, it returns an error saying "'NoneType' object has no attribute 'find_all'". After searching for why this happens and inspecting each job's page, I found that some job pages use br, p, and strong tags for the job requirements. I didn't know what to do, so I used an if statement to test for them; it returns the tags, but the br tag is empty, without text. Please can you see where the problem is and answer me, thanks.
the webpage:
https://wuzzuf.net/search/jobs/?a=hpb&q=python&start=1
the job used p and br tags:
https://wuzzuf.net/jobs/p/T9WuTpM3Mveq-Senior-Data-Scientist-Evolvice-GmbH-Cairo-Egypt?o=28&l=sp&t=sj&a=python|search-v3|hpb
Sorry, I didn't understand the problem with p sooner.
for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    requirements_div = soup.find("div", {"class": "css-1t5f0fr"})
    respon_text = []
    for child in requirements_div.children:
        if child.name == 'ul':
            for li in child.find_all("li"):
                respon_text.append(li.text)
        elif child.name == 'p':
            for x in child.contents:
                if x.name == 'br':
                    pass
                elif x.name == 'strong':
                    respon_text.append(x.text)
                else:
                    respon_text.append(x)
    job_requirementsL.append('|'.join(respon_text))
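One caveat (my addition, not part of the original answer): if a job page has no div with that class at all, requirements_div itself will be None and the children loop raises the same kind of AttributeError, so a small guard right after the find is worth having:
requirements_div = soup.find("div", {"class": "css-1t5f0fr"})
if requirements_div is None:  # e.g. a page with a different layout, or a blocked request
    job_requirementsL.append("")
    continue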
How do I return data from https://finance.yahoo.com/quote/FB?p=FB. I am trying to pull the open and close data. The thing is that both of these numbers share the same class in the code.
They both share this class 'Trsdu(0.3s) '
How can I differentiate these if the classes are the same?
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
This function:
googclose = googsoup.find(class_='Trsdu(0.3s) ').get_text()
will return just the text of the first element with class Trsdu(0.3s).
Using:
googclose = googsoup.find_all(class_='Trsdu(0.3s)')
will return an array containing the page's elements with class Trsdu(0.3s).
Then you can iterate them:
for element in googsoup.find_all(class_='Trsdu(0.3s)'):
    print(element.get_text())
Check this out, if this is what you wanted:
import requests
from bs4 import BeautifulSoup
goog = requests.get('https://finance.yahoo.com/quote/FB?p=FB')
googsoup = BeautifulSoup(goog.text, 'html.parser')
googclose = googsoup.select("span[data-reactid=42]")[1].text
googopen = googsoup.select("span[data-reactid=48]")[0].text
print("Close: {}\nOpen: {}".format(googclose,googopen))
Result:
Close: 172.17
Open: 171.69
If you want just the values for Open and Previous Close, you can either use findAll and grab the first 2 items in the results
googclose, googopen = googsoup.findAll('span', class_='Trsdu(0.3s) ')[:2]
googclose = googclose.get_text()
googopen = googopen.get_text()
print(googclose, googopen)
>>> 172.17 171.69
Or you can go one level higher, and find the values based on the parent td using the data-test attribute
googclose = googsoup.find('td', attrs={'data-test': 'PREV_CLOSE-value'}).get_text()
googopen = googsoup.find('td', attrs={'data-test': 'OPEN-value'}).get_text()
print(googclose, googopen)
>>> 172.17 171.69
If you use the Chrome browser, you can right-click on the item that you want to know more about and then select Inspect from the resulting menu. The browser will then show you the markup behind the number associated with OPEN.
Notice that, not only is there a class attribute, there's also a data-reactid attribute that might do the trick. In fact, if you also inspect the close number you will find, as I did, that its attribute value is different.
This suggests the following code.
>>> import requests
>>> import bs4
>>> page = requests.get('https://finance.yahoo.com/quote/FB?p=FB').text
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> soup.findAll('span', attrs={'data-reactid': '42'})[0].text
'172.17'
>>> soup.findAll('span', attrs={'data-reactid': '48'})[0].text
'171.69'
I am writing a simple Python program which grabs a webpage and finds all the URL links in it. However, when I try to index the starting and ending delimiters (") of each href link, the ending one is always indexed wrong.
# open a url and find all the links in it
import urllib2
url=urllib2.urlopen('right.html')
urlinfo = url.info()
urlcontent = url.read()
bodystart = urlcontent.index('<body')
print 'body starts at',bodystart
bodycontent = urlcontent[bodystart:].lower()
print bodycontent
linklist = []
n = bodycontent.index('<a href=')
while n:
    print n
    bodycontent = bodycontent[n:]
    a = bodycontent.index('"')
    b = bodycontent[(a+1):].index('"')
    print a, b
    linklist.append(bodycontent[(a+1):b])
    n = bodycontent[b:].index('<a href=')
print linklist
I would suggest using an HTML parsing library instead of manually searching the DOM string.
Beautiful Soup is an excellent library for this purpose. Here is the reference link
With bs4, your link-searching functionality could look like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(bodycontent, 'html.parser')
linklist = [a.get('href') for a in soup.find_all('a')]
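For a self-contained sketch (using requests and Python 3 purely for illustration; the original used urllib2 against a local right.html):
import requests
from bs4 import BeautifulSoup

page = requests.get('http://example.com')  # stand-in URL; substitute the page you are scraping
soup = BeautifulSoup(page.text, 'html.parser')
linklist = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(linklist)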
I pass multiple class values to BeautifulSoup.find_all(). The value is something like l4 center OR l5 center (i.e., "l4 center" | "l5 center").
soup.find_all("ul", {"class": value})
I fail (it outputs nothing) with the following two solutions:
soup.find_all("ul", {"class": re.compile(r"l[4-5]\scenter")})
#OR
soup.find_all("ul", {"class" : ["l4 center", "l5 center"]})
The source code is as follows:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import bs4
import requests
import requests.exceptions
import re
### function, , .... ###
def crawler_chinese_idiom():
    url = 'http://chengyu.911cha.com/zishu_8.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    #for result_set in soup.find_all("ul", class_=re.compile("l[45] +center")): # l4 center or l5 center
    for result_set in soup.find_all("ul", {"class": re.compile(r"l[45]\s+center")}): # nothing output
    #for result_set in soup.find_all("ul", {"class": "l4 center"}): # the normal one
        print(result_set)

crawler_chinese_idiom()
# [] output nothing
Update: resolved https://bugs.launchpad.net/bugs/1476868
At first I thought the problem was that class='l4 center' in HTML is actually two classes -- thinking that soup won't match because it's looking for a single class that contains a space (impossible).
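(You can check that premise directly; a minimal sketch on inline HTML:)
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='l4 center'>x</div>", 'html.parser')
print(soup.div['class'])  # ['l4', 'center']: the class attribute is parsed into a list of separate values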
Tried:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("<html><div class='l5 center'>l5test</div><div class='l4 center'>l4test</div><div class='l6 center'>l6test</div>", 'html.parser')
results1 = soup.findAll('div', re.compile(r'l4 center'))
print(results1)
results2 = soup.findAll('div', 'l4 center')
print(results2)
Output:
[]
[<div class="l4 center">l4test</div>]
But wait: the non-regex option worked fine, matching the element with both classes.
At this point, it looks to me like a BeautifulSoup bug.
To work around it, you could do:
soup.findAll('div', ['l4 center', 'l5 center'])
# update: ^ that doesn't work either
# or
soup.findAll('div', ['l4', 'l5', 'center'])
I'd recommend the second one just in case you want to match l4 otherclass center, but you might need to iterate over the results to make sure you don't have any unwanted captures in there (an l6 center element would match too). Something like:
for result in soup.findAll('div', ['l4', 'l5', 'center']):
    classes = result.get('class', [])
    if ('l4' in classes or 'l5' in classes) and 'center' in classes:
        pass  # yay!
I've submitted a bug here for investigation.
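For what it's worth, once that bug was fixed (see the update above), bs4 also tries the pattern against the full space-joined class string, so on a reasonably recent release the original regex approach should work; a quick sketch, assuming a post-fix bs4:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul class='l4 center'>a</ul><ul class='l5 center'>b</ul>", 'html.parser')
# the pattern is also tested against 'l4 center' / 'l5 center' as whole strings
print(soup.find_all('ul', class_=re.compile(r'l[45]\s+center')))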
I'm trying to parse a web page, and that's my code:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
read = BeautifulSoup(openurl.read())
soup = BeautifulSoup(openurl)
x = soup.find('ul', {"class": "i_p0"})
sp = soup.findAll('a href')
for x in sp:
    print x
I really wish I could be more specific, but as the title says, it gives me no response. No errors, nothing.
First of all, omit the line read = BeautifulSoup(openurl.read()).
Also, the line x = soup.find('ul', {"class": "i_p0"}) doesn't actually make any difference, because you are reusing the x variable in the loop.
Also, soup.findAll('a href') doesn't find anything.
Also, instead of the old-fashioned findAll(), there is find_all() in BeautifulSoup4.
Here's the code with several alterations:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl)
sp = soup.find_all('a')
for x in sp:
    print x['href']
This prints the values of href attribute of all links on the page.
Hope that helps.
I altered a couple of lines in your code and I do get a response; not sure if that is what you want, though.
Here:
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl.read())  # This is what you need to use for selecting elements
# soup = BeautifulSoup(openurl)  # This is not needed
# x = soup.find('ul', {"class": "i_p0"})  # You don't seem to be making use of this either
sp = soup.findAll('a')
for x in sp:
    print x.get('href')  # This is to get the href
Hope this helps.