Hi, I need to pass a variable to the soup.find() function, but it doesn't work :(
Does anyone know a solution for this?
from bs4 import BeautifulSoup
html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''
sources = {'source1': '\'p\', class_=\'findme\'',
           'source2': '\'span\', class_=\'findme2\'',
           'source3': '\'div\', class_=\'findme3\''}
test = BeautifulSoup(html)
# this works
#print(test.find('p', class_='findme'))
# >>> <p class="findme"> p-tag content</p>
# this doesn't work
tag = '\'p\' class_=\'findme\''
# a source gets passed
print(test.find(sources[source]))
# >>> None
I am trying to split it up as suggested like this:
pattern = '"p", {"class": "findme"}'
tag = pattern.split(', ')
tag1 = tag[0]
filter = tag[1]
date = test.find(tag1, filter)
I don't get errors, just None for date. The problem is probably the content of tag1 and filter. The debugger in PyCharm gives me:
tag1 = '"p"'
filter = '{"class": "findme"}'
Printing them doesn't show these apostrophes. Is it possible to remove them?
The first argument is a tag name, and your string doesn't contain that. BeautifulSoup (or Python, generally) won't parse a string like that; it cannot guess that you put arbitrary Python syntax in that value.
Separate out the components:
tag = 'p'
filter = {'class_': 'findme'}
test.find(tag, **filter)
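A pattern that avoids the quoting problem entirely is to store each source as a (tag, attrs) pair and unpack it when searching; a minimal sketch using the question's data (the key names are illustrative):

```python
from bs4 import BeautifulSoup

html = "<div> blabla <p class='findme'> p-tag content</p></div>"
soup = BeautifulSoup(html, 'html.parser')

# store structured data, not Python syntax inside strings
sources = {
    'source1': ('p', {'class': 'findme'}),
    'source2': ('span', {'class': 'findme2'}),
    'source3': ('div', {'class': 'findme3'}),
}

tag, attrs = sources['source1']
result = soup.find(tag, attrs)
print(result)  # <p class="findme"> p-tag content</p>
```

Note that when passing an attribute dictionary as the second argument, the key is plain 'class' (the class_ spelling is only needed for the keyword-argument form).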
Okay I got it, thanks again.
dic_date = {'source1': 'p, class:findme', other sources ...}
pattern = dic_date[source]
tag = pattern.split(', ')
if len(tag) == 2:
    att = tag[1].split(':')  # getting the attribute
    att = {att[0]: att[1]}   # building a dictionary for the attributes
    date = soup.find(tag[0], att)
else:
    date = soup.find(tag[0])  # if there is only a tag without an attribute
Well it doesn't look very nice but it's working :)
Related
I'm trying to use BeautifulSoup to retrieve the value "XXXXX" from the self-closing html tag below (apologies if my terminology is incorrect).
Is this possible? All the questions I can find are around getting data out that is between div tags, rather than an attribute in a self closing tag.
<input name="nonce" type="hidden" value="XXXXX"/>
Considering the text you need to parse is in the file variable, you can use the following code:
soup = BeautifulSoup(file, "html.parser")
X = soup.find('input').get('value')
print(X)
I don't think it should make a difference that it's a self-closing tag in this case. The same methods should still be applicable. (Any of the methods in the comments should also work as an alternative.)
nonceInp = soup.select_one('input[name="nonce"]')
# nonceInp = soup.find('input', {'name': 'nonce'})
if nonceInp:
    nonceVal = nonceInp['value']
    # nonceVal = nonceInp.attrs['value']
    # nonceVal = nonceInp.get('value')
    # nonceVal = nonceInp.attrs.get('value')
else:
    nonceVal = None  # print('could not find an input named "nonce"')
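Putting it together with the question's snippet, a minimal runnable sketch:

```python
from bs4 import BeautifulSoup

html = '<input name="nonce" type="hidden" value="XXXXX"/>'
soup = BeautifulSoup(html, 'html.parser')

nonce_inp = soup.select_one('input[name="nonce"]')
nonce_val = nonce_inp['value'] if nonce_inp else None
print(nonce_val)  # XXXXX
```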
I am new to Python web scraping, and I am scraping productreview.com for reviews. The following code pulls all the data I need for a single review:
#Scrape TrustPilot for User Reviews (Rating, Comments)
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import json
import requests
import datetime as dt
final_list=[]
url = 'https://www.productreview.com.au/listings/world-nomads'
r = requests.get(url)
soup = bs(r.text, 'lxml')
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
    try:
        name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
        name = name.find('span').text
        location = soup.find('h4').find('small').text
        policy = soup.find('div', class_ ='px-4_1Cw pt-4_9Zz pb-2_1Ex card-body_2iI').find('span').text
        title = soup.find('h3').find('span').text
        content = soup.find('p', class_ = 'mb-0_2CX').text
        rating = soup.find('div', class_ = 'mb-4_2RH align-items-center_3Oi flex-wrap_ATH d-flex_oSG')
        rating = rating.find('div')['title']
        final_list.append([name, location, policy, rating, title, content])
    except AttributeError:
        pass

reviews = pd.DataFrame(final_list, columns = ['Name', 'Location', 'Policy', 'Rating', 'Title', 'Content'])
print(reviews)
But when I edit
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
to
for div in soup.findAll('div', class_ = 'loadingOverlay_24D'):
I don't get all reviews, I just get the same entry looped over and over.
Any help would be much appreciated.
Thanks!
Issue 1: Repeated data inside the loop
Your loop has the following form:
for div in soup.find('div' , ...):
    name = soup.find('h4', ... )
    policy = soup.find('div', ... )
    ...
Notice that you are calling find on the soup object inside the loop. This means that every time you look up a value such as name, the search starts from the beginning of the whole document and returns the first match, on every iteration.
This is why you are getting the same data over and over.
To fix this, you need to call find on the div of the review you are currently processing. That is:
for div in soup.find('div' , ...):
    name = div.find('h4', ... )
    policy = div.find('div', ... )
    ...
Issue 2: Missing data and error handling
In your code, any errors inside the loop are silently ignored, yet many errors actually occur while parsing and extracting the values. For example:
location = div.find('h4').find('small').text
Not all reviews have location information. For those, the code will extract the h4, then try to find small, find none, and return None. Calling .text on that None then raises an AttributeError, so the review never makes it into the result data frame.
To fix this, you need to add more error checking. For example:
locationDiv = div.find('h4').find('small')
if locationDiv:
    location = locationDiv.text
else:
    location = ''
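To keep the loop readable, these checks can be factored into a small helper (a sketch; safe_text is my name, not part of the original code):

```python
from bs4 import BeautifulSoup

def safe_text(parent, *args, default='', **kwargs):
    """Find inside `parent` and return its stripped .text,
    or `default` if `parent` is None or nothing matched."""
    el = parent.find(*args, **kwargs) if parent else None
    return el.text.strip() if el else default

# tiny demo with a review-like fragment
demo = BeautifulSoup('<h4>Jane D. <small>Sydney</small></h4>', 'html.parser')
print(safe_text(demo.find('h4'), 'small'))        # Sydney
print(repr(safe_text(demo.find('h4'), 'em')))     # ''
```

Then location = safe_text(div.find('h4'), 'small') stays safe even when the review has no location.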
Issue 3: Identifying and extracting data
The page you're trying to parse has broken HTML, and uses CSS classes that seem random or at least inconsistent. You need to find the correct and unique identifiers for the data that you are extracting such that they strictly match all the entries.
For example, you are extracting the review-container div using CSS class loadingOverlay_24D. This is incorrect. This CSS class seems to be for a "loading" placeholder div or something similar. Actual reviews are enclosed in div blocks that look like this:
<div itemscope="" itemType="http://schema.org/Review" itemProp="review">
....
</div>
Notice that the uniquely identifying property is the itemProp attribute. You can extract those div blocks using:
soup.find_all('div', {'itemprop': 'review'})
Similarly, you have to find the correct identifying properties of the other data you want to extract to ensure you get all your data fully and correctly.
One more thing, when a tag has more than one CSS class, usually only one of them is the identifying property you want to use. For example, for names, you have this:
name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
but in reality, you don't need all of them. The first class, in this case, is sufficient to identify the name h4 blocks:
name = soup.find('h4', class_ = 'my-0_27D')
Example:
Here's an example that extracts the author names from the review page:
for div in soup.find_all('div', {'itemprop': 'review'}):
    name = div.find('h4', class_ = 'my-0_27D')
    if name:
        name = name.find('span').text
    else:
        name = '-'
    print(name)
Output:
Aidan
Bruno M.
Ba. I.
Luca Evangelista
Upset
Julian L.
Alison Peck
...
The page serves broken HTML, and html.parser is better at dealing with it.
Change soup = bs(r.text, 'lxml') to soup = bs(r.text, 'html.parser')
Does Beautiful Soup allow for the exclusion of html code by div (or other filters)?
I am trying to parse code that is very poorly written, where there is no appropriate tag, id, class, or anything else to key on for parsing the desired content.
What I am looking for is a select or findAll of everything in an id that is not in a certain class. Per the sample code below, I want everything in id 'main' that is not contained in class 'toc-indentation'.
Below I have main_txt and toc_txt, though my goal is to have main_txt with toc_txt further parsed out.
soup = BeautifulSoup(orig_file)
title = soup.find('title')
main_txt = soup.findAll(id='main')[0]
toc_txt = soup.findAll(class_ ='toc-indentation')
I did my best to find the answer but just can't seem to locate anything that will help me.
Please let me know if you have any questions or require further info.
Any assistance will be highly appreciated.
To get all elements inside main_text except those that are inside elements with class 'toc-indentation':
def not_inside_toc(tag):
    # True for non-toc tags; toc tags get their contents cleared
    # (a side effect of tag.clear()), so their descendants are never matched
    return tag.get('class') != ['toc-indentation'] or tag.clear()

main_text = soup.find(id='main')
tags = main_text.find_all(not_inside_toc)
By passing a function to find_all you can make a filter doing what you want.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
def myFilter(tag, ID, cls):
    '''
    Returns every element that has a parent with id=ID and class != cls
    '''
    if tag.has_attr('class') and cls not in tag['class']:
        parent = next(tag.parents)
    else:
        return False
    if parent.has_attr('id') and ID in parent['id']:
        return True
    else:
        return False

print(soup.find_all(lambda tag: myFilter(tag, 'main', 'toc-indentation')))
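With a recent bs4 (which uses the soupsieve CSS engine), roughly the same exclusion can be written as a single selector; a sketch, assuming the id/class names from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div id="main">
  <p>keep me</p>
  <div class="toc-indentation"><p>skip me</p></div>
  <span>also keep</span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# everything under #main that is not a toc block
# and not inside one (level-4 :not with a selector list)
tags = soup.select('#main *:not(.toc-indentation, .toc-indentation *)')
print([t.text for t in tags])  # ['keep me', 'also keep']
```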
I am using BeautifulSoup for the first time and trying to collect several data such as email,phone number, and mailing address from a soup object.
Using regular expressions, I can identify the email address. My code to find the email is:
def get_email(link):
    mail_list = []
    for i in link:
        a = str(i)
        email_pattern = re.compile("<a\s+href=\"mailto:([a-zA-Z0-9._#]*)\">", re.IGNORECASE)
        ik = re.findall(email_pattern, a)
        if (len(ik) == 1):
            mail_list.append(i)
        else:
            pass
    s_email = str(mail_list[0]).split('<a href="')
    t_email = str(s_email[1]).split('">')
    print t_email[0]
Now, I also need to collect the phone number, mailing address and web url. I think in BeautifulSoup there must be an easy way to find those specific data.
A sample html page is as below:
<ul>
<li>
<span>Email:</span>
Message Us
</li>
<li>
<span>Website:</span>
<a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
</li>
<li>
<span>Phone:</span>
(123)456-789
</li>
</ul>
And using BeautifulSoup, I am trying to collect the span values of Email, Website and Phone.
Thanks in advance.
The most obvious problem with your code is that you're turning the object representing the link back into HTML and then parsing it with a regular expression again - that defeats much of the point of using BeautifulSoup in the first place. You might need a regular expression to deal with the contents of the href attribute, but that's it. Also, the else: pass is unnecessary - you can just leave it out entirely.
Here's some code that does something like what you want, and might be a useful starting point:
from BeautifulSoup import BeautifulSoup
import re

# Assuming that html is your input as a string:
soup = BeautifulSoup(html)

all_contacts = []

def mailto_link(e):
    '''Return the email address if the element is a mailto link,
    otherwise return None'''
    if e.name != 'a':
        return None
    for key, value in e.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)', value)
            if m:
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = {}
    for li in ul.findAll('li'):  # search within this ul, not the whole soup
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

print all_contacts
That will produce a list of one dictionary per contact found, in this case that will just be:
[{'website': u'http://www.abcl.com', 'phone': u'(123)456-789', 'email': u'abc#gmail.com'}]
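For reference, under bs4 on Python 3 the same approach can be sketched roughly like this (I've restored a mailto link in the sample, since the question's snippet lost it):

```python
import re
from bs4 import BeautifulSoup

html = '''<ul>
<li><span>Email:</span> <a href="mailto:abc@gmail.com">Message Us</a></li>
<li><span>Website:</span> <a target="_blank" href="http://www.abcl.com">Visit Our Website</a></li>
<li><span>Phone:</span> (123)456-789</li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')

contact = {}
for li in soup.find_all('li'):
    span = li.find('span')
    if not (span and span.string):
        continue
    if span.string == 'Email:':
        a = li.find('a', href=re.compile(r'^mailto:'))
        if a:
            contact['email'] = a['href'][len('mailto:'):]
    elif span.string == 'Website:':
        a = li.find('a')
        if a:
            contact['website'] = a['href']
    elif span.string == 'Phone:':
        # the phone number is the text node right after the span
        contact['phone'] = span.next_sibling.strip()
print(contact)
```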
I would like to get all the <script> tags in a document and then process each one based on the presence (or absence) of certain attributes.
E.g., for each <script> tag, if the attribute for is present do something; else if the attribute bar is present do something else.
Here is what I am doing currently:
outputDoc = BeautifulSoup(''.join(output))
scriptTags = outputDoc.findAll('script', attrs = {'for' : True})
But this way I get only the <script> tags with the for attribute... and lose the other ones (those without it).
If I understand correctly, you just want all the script tags, and then to check for some attributes in them?
scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
You don't need any lambdas to filter by attribute, you can simply use some_attribute=True in find or find_all.
script_tags = soup.find_all('script', some_attribute=True)
# or
script_tags = soup.find_all('script', {"some-data-attribute": True})
Here are more examples with other approaches as well:
soup = bs4.BeautifulSoup(html)
# Find all with a specific attribute
tags = soup.find_all(src=True)
tags = soup.select("[src]")
# Find all meta with either name or http-equiv attribute.
soup.select("meta[name],meta[http-equiv]")
# find any tags with a name or src attribute.
soup.select("[name], [src]")
# find first/any script with a src attribute.
tag = soup.find('script', src=True)
tag = soup.select_one("script[src]")
# find all tags with a name attribute beginning with foo
# or any src beginning with /path
soup.select("[name^=foo], [src^='/path']")
# find all tags with a name attribute that contains foo
# or any src containing whatever
soup.select("[name*=foo], [src*=whatever]")
# find all tags with a name attribute that ends with foo
# or any src that ends with whatever
soup.select("[name$=foo], [src$=whatever]")
You can also use regular expressions with find or find_all:
import re
# starting with
soup.find_all("script", src=re.compile("^whatever"))
# contains
soup.find_all("script", src=re.compile("whatever"))
# ends with
soup.find_all("script", src=re.compile("whatever$"))
For future reference, has_key has been deprecated in BeautifulSoup 4. Now you need to use has_attr:
scriptTags = outputDoc.find_all('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
If you only need to get tag(s) with attribute(s), you can use a lambda:
soup = bs4.BeautifulSoup(YOUR_CONTENT)
Tags with attribute
tags = soup.find_all(lambda tag: 'src' in tag.attrs)
OR
tags = soup.find_all(lambda tag: tag.has_attr('src'))
Specific tag with attribute
tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)
Etc ...
Thought it might be useful.
You can check if an attribute is present:
scriptTags = outputDoc.findAll('script', some_attribute=True)
for script in scriptTags:
    do_something()
By using the pprint module you can examine the contents of an element.
from pprint import pprint
pprint(vars(element))
Using this on a bs4 element will print something similar to this:
{'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
'can_be_empty_element': False,
'contents': [u'\n\t\t\t\tNESNA\n\t'],
'hidden': False,
'name': u'span',
'namespace': None,
'next_element': u'\n\t\t\t\tNESNA\n\t',
'next_sibling': u'\n',
'parent': <h1 class="pie-compoundheader" itemprop="name">\n<span class="pie-description">Bedside table</span>\n<span class="pie-productname size-3 name global-name">\n\t\t\t\tNESNA\n\t</span>\n</h1>,
'parser_class': <class 'bs4.BeautifulSoup'>,
'prefix': None,
'previous_element': u'\n',
'previous_sibling': u'\n'}
To access an attribute - let's say the class list - use the following:
class_list = element.attrs.get('class', [])
You can filter elements using this approach:
for script in soup.find_all('script'):
    if script.attrs.get('for'):
        pass  # ... has a 'for' attribute
    elif "myClass" in script.attrs.get('class', []):
        pass  # ... has class "myClass"
    else:
        pass  # ... do something else
A simple way to select just what you need:
outputDoc.select("script[for]")
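Combining these, the question's for/bar dispatch can be sketched end to end (the sample HTML and attribute values are mine):

```python
from bs4 import BeautifulSoup

html = '''<script for="window">a()</script>
<script bar="x">b()</script>
<script>c()</script>'''
soup = BeautifulSoup(html, 'html.parser')

kinds = []
for script in soup.find_all('script'):
    if script.has_attr('for'):
        kinds.append('for')    # handle scripts with a for attribute
    elif script.has_attr('bar'):
        kinds.append('bar')    # handle scripts with a bar attribute
    else:
        kinds.append('other')  # everything else
print(kinds)  # ['for', 'bar', 'other']
```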