I want to pass multiple class values to BeautifulSoup.find_all(). The value is either "l4 center" or "l5 center":
soup.find_all("ul", {"class" : value})
Both of the following attempts fail (they output nothing):
soup.find_all("ul", {"class" : re.compile(r"l[4-5]\scenter")})
#OR
soup.find_all("ul", {"class" : ["l4 center", "l5 center"]})
The source code is as follows:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import bs4
import requests
import requests.exceptions
import re
### functions ###
def crawler_chinese_idiom():
    url = 'http://chengyu.911cha.com/zishu_8.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    #for result_set in soup.find_all("ul", class_=re.compile("l[45] +center")): #l4 center or l5 center
    #for result_set in soup.find_all("ul", {"class": "l4 center"}): #normal one
    for result_set in soup.find_all("ul", {"class": re.compile(r"l[45]\s+center")}): #nothing output
        print(result_set)

crawler_chinese_idiom()
#[] output nothing
Update: resolved https://bugs.launchpad.net/bugs/1476868
At first I thought the problem was that class='l4 center' in HTML is actually two classes, so soup won't match because it's looking for a single class that contains a space (which is impossible).
Tried:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("<html><div class='l5 center'>l5test</div><div class='l4 center'>l4test</div><div class='l6 center'>l6test</div>")
results1 = soup.findAll('div', re.compile(r'l4 center'))
print results1
results2 = soup.findAll('div', 'l4 center')
print results2
Output:
[]
[<div class="l4 center">l4test</div>]
But wait: the non-regex option worked fine, it matched the element with both classes.
At this point, it looks to me like a BeautifulSoup bug.
To work around it, you could do:
soup.findAll('div', ['l4 center', 'l5 center'])
# update: ^ that doesn't work either.
# or
soup.findAll('div', ['l4', 'l5', 'center'])
I'd recommend the second one just in case you want to match l4 otherclass center, but you might need to iterate the results to make sure you don't have any unwanted captures in there. Something like:
for result in soup.findAll('div', ['l4', 'l5', 'center']):
    classes = result.get('class', [])
    if ('l4' in classes or 'l5' in classes) and 'center' in classes:
        print(result)  # yay!
I've submitted a bug here for investigation.
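For what it's worth, a working approach today relies on BeautifulSoup treating class as a multi-valued attribute. Sketched on made-up markup (since I don't have the original page), matching the individual classes instead of the joined string does what the regex was meant to do:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the original page.
html = """
<ul class="l4 center"><li>l4 item</li></ul>
<ul class="l5 center"><li>l5 item</li></ul>
<ul class="l6 center"><li>l6 item</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup stores class="l4 center" as the list ['l4', 'center'],
# so the regex is tested against each class on its own and never matches,
# even though the exact-string search does. CSS selectors match individual
# classes and can be grouped with a comma:
matches = soup.select("ul.l4.center, ul.l5.center")
print([ul["class"] for ul in matches])  # [['l4', 'center'], ['l5', 'center']]

# A filter function over the whole tag works too, and scales to
# more complex conditions:
matches2 = soup.find_all(
    lambda tag: tag.name == "ul"
    and tag.get("class")
    and "center" in tag["class"]
    and ("l4" in tag["class"] or "l5" in tag["class"])
)
print(len(matches2))  # 2
```

Both forms still match "l4 otherclass center", which is usually what you want with multi-valued classes.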
Related
I need to output the exchange rates given by the ECB API, but the output shows an error:
"TypeError: string indices must be integers"
How to fix this error?
import requests, config
from bs4 import BeautifulSoup
r = requests.get(config.ecb).text
soup = BeautifulSoup(r, "lxml")
course = soup.findAll("cube")
for i in course:
    for x in i("cube"):
        for y in x:
            print(y['currency'], y['rate'])
You have too many for-loops:
for i in course:
    print(i['currency'], i['rate'])
But this also needs find_all() to search for <cube> tags that have a currency attribute:
course = soup.findAll("cube", currency=True)
# or
course = soup.findAll("cube", {"currency": True})
Otherwise you would have to check whether each item has a currency attribute:
for i in course:
    if 'currency' in i.attrs:
        print(i['currency'], i['rate'])
Full code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
course = soup.find_all("cube", currency=True)
for i in course:
    #print(i)
    print(i['currency'], i['rate'])
try this
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r, "lxml")
result = [{currency.get('currency'): currency.get('rate')} for currency in soup.find_all("cube", {'currency': True})]
print(result)
OUTPUT:
[{'USD': '0.9954'}, {'JPY': '142.53'}, {'BGN': '1.9558'}, {'CZK': '24.497'}, {'DKK': '7.4366'}, {'GBP': '0.87400'}, {'HUF': '403.98'}, {'PLN': '4.7143'}, {'RON': '4.9238'}, {'SEK': '10.7541'}, {'CHF': '0.9579'}, {'ISK': '138.30'}, {'NOK': '10.1985'}, {'HRK': '7.5235'}, {'TRY': '18.1923'}, {'AUD': '1.4894'}, {'BRL': '5.2279'}, {'CAD': '1.3226'}, {'CNY': '6.9787'}, {'HKD': '7.8133'}, {'IDR': '14904.67'}, {'ILS': '3.4267'}, {'INR': '79.3605'}, {'KRW': '1383.58'}, {'MXN': '20.0028'}, {'MYR': '4.5141'}, {'NZD': '1.6717'}, {'PHP': '57.111'}, {'SGD': '1.4025'}, {'THB': '36.800'}, {'ZAR': '17.6004'}]
Just in addition to the answer from Sergey K, which is on point for how it should be done, to show what the main issue is.
The main issue in your code is that your selection is not as precise as it should be:
soup.findAll("cube")
This will also find_all() the parent <cube> elements that do not have an attribute called currency or rate, but much more decisive is that there is whitespace in the markup between the nodes, which BeautifulSoup turns into NavigableStrings.
Using the index to get the attribute values won't work when you hit a NavigableString instead of the next Tag.
You can see this if you print(y.name) only:
None
cube
None
cube
...
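A minimal sketch of that behaviour on made-up markup (the tag names and attributes are only illustrative): the whitespace between sibling tags is itself a child node, with .name set to None.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the ECB feed: the newline and spaces
# between the tags become NavigableString children, whose .name is None.
xml = "<cube>\n  <cube currency='USD' rate='0.9954'></cube>\n</cube>"
soup = BeautifulSoup(xml, "html.parser")
outer = soup.find("cube")
for child in outer.children:
    print(repr(child.name))  # None, 'cube', None
```

Indexing one of those whitespace strings with ['currency'] is exactly what raises the TypeError in the question.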
How to fix this error?
There are two approaches, in my opinion.
The best one is already shown in https://stackoverflow.com/a/73756178/14460824 by Sergey K, who used very precise arguments to find_all() the specific elements.
The other, staying closer to your code, is to implement an if-statement that checks whether tag.name equals 'cube'. It works fine, but I would recommend using the more precise selection instead.
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml?c892a2e0fae19504ef05028330310886').text
soup = BeautifulSoup(r, "lxml")
course = soup.findAll("cube")
for i in course:
    for x in i("cube"):
        for y in x:
            if y.name == 'cube':
                print(y['currency'], y['rate'])
Output
USD 0.9954
JPY 142.53
BGN 1.9558
CZK 24.497
DKK 7.4366
GBP 0.87400
HUF 403.98
PLN 4.7143
...
I'm a beginner at programming, so I have a problem with the find method in BeautifulSoup when I use it for web scraping. I have this code:
import requests
from bs4 import BeautifulSoup
import csv
from itertools import zip_longest
job_titleL =[]
company_nameL=[]
location_nameL=[]
experience_inL=[]
links=[]
salary=[]
job_requirementsL=[]
date=[]
result= requests.get(f"https://wuzzuf.net/search/jobs/?a=%7B%7D&q=python&start=1")
source = result.content
soup= BeautifulSoup(source , "lxml")
job_titles = soup.find_all("h2",{"class":"css-m604qf"} )
companies_names = soup.find_all("a",{"class":"css-17s97q8"})
locations_names = soup.find_all("span",{"class":"css-5wys0k"})
experience_in = soup.find_all("a", {"class":"css-5x9pm1"})
posted_new = soup.find_all("div",{"class":"css-4c4ojb"})
posted_old = soup.find_all("div",{"class":"css-do6t5g"})
posted = [*posted_new,*posted_old]
for L in range(len(job_titles)):
    job_titleL.append(job_titles[L].text)
    links.append(job_titles[L].find('a').attrs['href'])
    company_nameL.append(companies_names[L].text)
    location_nameL.append(locations_names[L].text)
    experience_inL.append(experience_in[L].text)
    date_text = posted[L].text.replace("-", "").strip()
    date.append(date_text)
for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    requirements = soup.find("div", {"class": "css-1t5f0fr"}).ul
    requirements1 = soup.find("div", {"class": "css-1t5f0fr"}).p
    respon_text = ""
    if requirements:
        for li in requirements.find_all("li"):
            print(li)
            respon_text += li.text + "|"
    if requirements1:
        for br in requirements1.find_all("br"):
            print(br)
    job_requirementsL.append(respon_text)
file_list=[job_titleL,company_nameL,date,location_nameL,experience_inL,links,job_requirementsL]
exported=zip_longest(*file_list)
with open('newspeard2.csv', "w") as spreadsheet:
    wr = csv.writer(spreadsheet)
    wr.writerow(["job title", "company name", "date", "location", "experience in", "links", "job requirements"])
    wr.writerows(exported)
Note: I'm not very good at English :(
So when I use the find method to get the job requirements for each job on the website (Wuzzuf), and use a for loop to go through each text in the job requirements, it returns an error: "'NoneType' object has no attribute 'find_all'". After searching for why this happens and inspecting each job page, I found that some job pages use <br>, <p> and <strong> tags for the job requirements instead of a <ul>. I didn't know what to do, so I used an if statement to test it: it returns the tags, but the <br> tag is empty, without text. Please can you see where the problem is? Thanks.
the webpage:
https://wuzzuf.net/search/jobs/?a=hpb&q=python&start=1
the job used p and br tags:
https://wuzzuf.net/jobs/p/T9WuTpM3Mveq-Senior-Data-Scientist-Evolvice-GmbH-Cairo-Egypt?o=28&l=sp&t=sj&a=python|search-v3|hpb
Sorry I didn't understand the problem with <p> sooner.
for link in links:
    result = requests.get(link)
    source = result.content
    soup = BeautifulSoup(source, "lxml")
    requirements_div = soup.find("div", {"class": "css-1t5f0fr"})
    respon_text = []
    for child in requirements_div.children:
        if child.name == 'ul':
            for li in child.find_all("li"):
                respon_text.append(li.text)
        elif child.name == 'p':
            for x in child.contents:
                if x.name == 'br':
                    pass
                elif x.name == 'strong':
                    respon_text.append(x.text)
                else:
                    respon_text.append(x)
    job_requirementsL.append('|'.join(respon_text))
I am having an issue where not all instances are captured within a relatively simple BeautifulSoup scrape. What I am running is the below:
from bs4 import BeautifulSoup as bsoup
import requests as reqs
home_test = "https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two"
away_test = "https://fbref.com/en/matches/ea736ad1/Carlisle-United-Northampton-Town-August-11-2018-League-Two"
page_to_parse = home_test
page = reqs.get(page_to_parse)
status_code = page.status_code
status_code = str(status_code)
parse_page = bsoup(page.content, 'html.parser')
find_stats = parse_page.find_all('div',id="team_stats_extra")
print(find_stats)
for stat in find_stats:
    add_stats = stat.find_next('div').get_text()
    print(add_stats)
If you have a look at the first print, the scrape captures the part of the website that I'm after; however, if you inspect the second print, half of the instances from the first one aren't picked up at all. I don't have any limits on this, so in theory it should take in all the right ones.
I've already tested quite a few different variants of find_next, find, or find_all, but the second loop never takes all of them.
Results are always:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Where it should take on the following instead:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Northampton Lincoln City
2Offsides2
9Goal Kicks15
32Throw Ins24
18Long Balls23
parse_page.find_all returns a list with only one item: the element with id="team_stats_extra". The loop needs to be over its child elements:
find_stats = parse_page.find_all('div', id="team_stats_extra")
all_stats = find_stats[0].find_all('div', recursive=False)
for stat in all_stats:
    print(stat.get_text())
If you have multiple tables, use two loops:
find_stats = parse_page.find_all('div', id="team_stats_extra")
for stats in find_stats:
    all_stats = stats.find_all('div', recursive=False)
    for stat in all_stats:
        print(stat.get_text())
find_stats = parse_page.find_all('div',id="team_stats_extra") actually returns only one block, so the next loop performs only one iteration.
You can change the way you select the div blocks with:
find_stats = parse_page.select('div#team_stats_extra > div')
print(len(find_stats)) # >>> returns 2
for stat in find_stats:
    add_stats = stat.get_text()
    print(add_stats)
To explain the selector select('div#team_stats_extra > div'), it is the same as:
find the div block with the id team_stats_extra
and select all direct children that are div
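A minimal sketch of the direct-child combinator on made-up markup, in case the difference isn't obvious: the '>' form selects only immediate children, not nested divs.

```python
from bs4 import BeautifulSoup

# Made-up markup: '#team_stats_extra > div' selects only the immediate
# <div> children, skipping any div nested deeper inside them.
html = """
<div id="team_stats_extra">
  <div>home stats<div>nested detail</div></div>
  <div>away stats</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
blocks = soup.select("div#team_stats_extra > div")
print(len(blocks))  # 2 (the nested div is a grandchild, so it is skipped)
```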
With bs4 4.7.1+ you can use :has to ensure you get the appropriate divs with class th as a child so you have the appropriate elements to loop over
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two')
soup = bs(r.content, 'lxml')
for div in soup.select('#team_stats_extra div:has(.th)'):
    print(div.get_text())
I am trying to scrape some words from a random website, but the following program shows no errors and no output when I try printing the results.
I have checked the code twice and even added an if statement to see whether the program is getting any words at all.
import requests
import operator
from bs4 import BeautifulSoup
def word_count(url):
    wordlist = []
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('a', {'class': 'txt'}):
        word_string = post_text.string
        if word_string is not None:
            word = word_string.lower().split()
            for each_word in word:
                print(each_word)
                wordlist.append(each_word)
        else:
            print("None")
word_count('https://mumbai.craigslist.org/')
I am expecting all the words under the "class= txt" to be displayed in the output.
OP: I am expecting all the words of the class text to be displayed in the output
The culprit:
for post_text in source.findAll('a', {'class':'txt'}):
The reason:
The anchor tag has no class txt, but the span tag inside it does.
Hence:
import requests
from bs4 import BeautifulSoup
def word_count(url):
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('a'):
        s_text = post_text.find('span', class_="txt")
        if s_text is not None:
            print(s_text.text)

word_count('https://mumbai.craigslist.org/')
OUTPUT:
community
activities
artists
childcare
classes
events
general
groups
local news
lost+found
missed connections
musicians
pets
.
.
.
You are targeting the wrong elements.
If you use
print(source)
everything works fine, but the moment you try to target the element with findAll, you are targeting something wrong, because you get an empty list.
If you replace
for post_text in source.findAll('a', {'class':'txt'}):
with
for post_text in source.find_all('a'):
everything seems to work fine.
I have visited https://mumbai.craigslist.org/ and found there is no <a class="txt">, only <span class="txt">, so I think you can try this:
def word_count(url):
    wordlist = []
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('span', {'class': 'txt'}):
        word_string = post_text.text
        if word_string is not None:
            word = word_string.lower().split()
            for each_word in word:
                print(each_word)
                wordlist.append(each_word)
        else:
            print("None")
it will output correctly:
community
activities
artists
childcare
classes
events
general
...
Hope that helps you, and comment if you have further questions. : )
I'm trying to parse a web page, and this is my code:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
read = BeautifulSoup(openurl.read())
soup = BeautifulSoup(openurl)
x = soup.find('ul', {"class": "i_p0"})
sp = soup.findAll('a href')
for x in sp:
    print x
I really wish I could be more specific, but as the title says, it gives me no response. No errors, nothing.
First of all, omit the line read = BeautifulSoup(openurl.read()).
Also, the line x = soup.find('ul', {"class": "i_p0"}) doesn't actually make any difference, because you are reusing the x variable in the loop.
Also, soup.findAll('a href') doesn't find anything.
Also, instead of old-fashioned findAll(), there is a find_all() in BeautifulSoup4.
Here's the code with several alterations:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl)
sp = soup.find_all('a')
for x in sp:
    print x['href']
This prints the values of href attribute of all links on the page.
Hope that helps.
I altered a couple of lines in your code and I do get a response, not sure if that is what you want though.
Here:
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl.read()) # This is what you need to use for selecting elements
# soup = BeautifulSoup(openurl) # This is not needed
# x = soup.find('ul', {"class": "i_p0"}) # You don't seem to be making a use of this either
sp = soup.findAll('a')
for x in sp:
    print x.get('href') # This is to get the href
Hope this helps.