Using Selenium to find indexed element within a div - python

I'm scraping the front end of a webpage and having difficulty getting the HTML text of a div within a div.
Basically, I'm simulating clicks - one for each event listed on the page. From there, I want to scrape the date and time of the event, as well as the location of the event.
Here's an example of one of the pages I'm trying to scrape:
https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event
<div class="eventInfoContainer-54d5deb3">
<div class="lineupContainer-570750d2">
<div class="eventInfoContainer-9e539994">
<img src="assets.bandsintown.com/images.clock.svg">
<div>Sunday, April 21st, 2019</div> <!––***––>
<div class="eventInfoContainer-50768f6d">5:00PM</div><!––***––>
</div>
<div class="eventInfoContainer-1a68a0e1">
<img src="assets.bandsintown.com/images.clock.svg">
<div class="eventInfoContainer-2d9f07df">
<div>Aura Nightclub</div> <!––***––>
<div>283 1st St., San Jose, CA 95113</div> <!––***––>
</div>
I've marked the elements I want to extract with asterisks - the date, time, venue, and address. Here's my code:
base_url = 'https://www.bandsintown.com/?came_from=257&page='
events = []
eventContainerBucket = []
for i in range(1, 2):
    driver.get(base_url + str(i))
    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=eventList-] a[class^=event-]')
    # collect href attribute of events in event_list
    events.extend(list(event.get_attribute("href") for event in event_list))

# iterate through all events and open them
for event in events:
    driver.get(event)
    uniqueEventContainer = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0]
    print "Event information: " + uniqueEventContainer.text
This prints:
Event information: Sunday, April 21st, 2019
3:00 PM
San Francisco Brewing Co.
3150 Polk St, Sf, CA 94109
View All The Fourth Son Tour Dates
My issue is that I can't access the nested eventInfoContainer divs individually. For example, the date div is at position [1], as it is the second element (after the img) in its parent div "eventInfoContainer-9e539994". That parent div is likewise at position [1], being the second element (after "lineupContainer") in its own parent div "eventInfoContainer-54d5deb3".
By this logic, shouldn't I be able to access the date text with the following code (accessing the element at position [1], whose parent is at position [1], inside the container at position [0])?
for event in events:
    driver.get(event)
    uniqueEventContainer = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0][1][1]
I get the following error:
TypeError: 'WebElement' object does not support indexing

When you index into the list of WebElements (which is what find_elements_by_css_selector('div[class^=eventInfoContainer-]') returns) you get a single WebElement; you cannot index further into that. You can, however, split the text of a WebElement to generate a list for further indexing.
If the structure is regular across pages, you could load the HTML of the div into BeautifulSoup. Example with your URL:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event')
soup = bs(d.find_element_by_css_selector('[class^=eventInfoContainer-]').get_attribute('outerHTML'), 'lxml')
date = soup.select_one('img + div').text
time = soup.select_one('img + div + div').text
venue = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div > div').text
address = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div + div').text
print(date, time, venue, address)
If line breaks were consistent:
containers = d.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
array = containers[0].text.split('\n')
date = array[3]
time = array[4]
venue = array[5]
address = array[6]
print(date, time, venue, address)
With index and split:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event')
containers = d.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
date_time = containers[1].text.split('\n')
i_date = date_time[0]
i_time = date_time[1]
venue_address = containers[3].text.split('\n')
venue = venue_address[0]
address = venue_address[1]
print(i_date, i_time, venue, address)

As the error suggests, a WebElement does not support indexing. What you are confusing it with is a list.
Here
driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
This code returns a list of WebElements. That is why you can access an individual WebElement using an index into the list. But that element does not itself support indexing into another WebElement; you are not getting a list of lists.
That is why
driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0] works, but driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0][1] doesn't.
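What works instead is calling find_elements_by_css_selector again on the WebElement you already have, which searches only within that element. A minimal sketch of the nesting the question describes, assuming the posted HTML:
container = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0]

# search within the container rather than the whole page
inner = container.find_elements_by_css_selector('div[class^=eventInfoContainer-]')

# per the posted HTML, inner[0] is the div holding the img, date and time;
# its first child div is the date
date = inner[0].find_elements_by_css_selector('div')[0].text
print(date)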
Edit: (answer for the question in the comment)
It is not Selenium code.
The code posted in the answer by QHarr uses BeautifulSoup, a Python package for parsing HTML and XML documents.
BeautifulSoup has a .select() method which runs a CSS selector against the parsed document and returns all matching elements.
There's also a method called select_one(), which finds only the first tag that matches the selector.
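A minimal illustration of the difference, using a throwaway snippet (the HTML here is just an assumption for demonstration):
from bs4 import BeautifulSoup as bs

snippet = '<div><span>first</span><span>second</span></div>'
soup = bs(snippet, 'html.parser')

print(soup.select('span'))           # list of all matches
print(soup.select_one('span').text)  # text of the first match only: 'first'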
In the code,
time = soup.select_one('img + div + div').text
venue = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div > div').text
Each gets the first element found by the given CSS selector and returns the text inside the tag. The first line finds an img tag, then finds its immediate sibling div tag, then again finds the next sibling div tag of that div.
The second line finds the third sibling tag whose class starts with eventInfoContainer-, then finds its child div, and then the child of that div.
Check out CSS selectors
This could be done directly using Selenium:
date = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='clock.svg'] + div")
time = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'] + div + div")
venue = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='pin.svg'] + div > div")
address = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='pin.svg'] + div > div:nth-of-type(2)")
I've used different CSS selectors, but they still select the same elements.
I'm not sure about BeautifulSoup, but in QHarr's answer the date selector would return a different value than the intended one if used with Selenium.
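Note that find_element_by_css_selector returns a WebElement rather than a string, so (assuming the selectors above match) you would still read the values with .text:
print(date.text, time.text, venue.text, address.text)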

Related

How to find data in HTML using bs4 when it has : and not =

Hi, I am currently having issues using bs4 and regex to find information within the HTML, as the values are contained within : and not = like I am used to.
<div data-react-cache-id="ListItemSale-0" data-react-class="ListItemSale" data-react-props='{"imageUrl":"https://laced.imgix.net/products/aa0ff81c-ec3b-4275-82b3-549c819d1404.jpg?w=196","title":{"label":"Air Jordan 1 Mid Madder Root GS","href":"/products/air-jordan-1-mid-madder-root-gs"},"contentCount":3,"info":"UK 4.5 | EU 37.5 | US 5","subInfo":"DM9077-108","hasStatus":true,"isBuyer":false,"status":"pending_shipment","statusOverride":null,"statusMessage":"Pending","statusMods":["red"],"price":"£125","priceAction":null,"subPrice":null,"actions":[{"label":"View","href":"/account/selling/M2RO1DNV"},{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label","options":{"disabled":false}},{"label":"View Postage","href":"/account/selling/M2RO1DNV/shipping-label.pdf","options":{"target":"_blank","disabled":false}}]}'></div>
I am trying to extract the href link in
{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label"
How do I do this? I've tried regex and find_all, but to no avail. Thanks
My code is below for reference; I've put # next to the solutions I have tried, among many others
account_soup = bs(my_account.text, 'lxml')
links = account_soup.find_all('div', {'data-react-class': 'ListItemSale'})

#for links in download_link['actions']:
#    print(links['href'])

#for i in links:
#    link_main = i.find('title')
#    link = re.findall('^/account*shipping-label$', link_main)
#    print(link)
You need to fetch the data-react-props attribute of each div, then parse that as JSON. You can then iterate over the actions property and get each href property that matches your regex:
import json
import re

actions = []
for l in links:
    props = json.loads(l['data-react-props'])
    for a in props['actions']:
        m = re.match(r'^/account.*shipping-label$', a['href'])
        if m is not None:
            actions.append(m[0])
print(actions)
Output for your sample data:
['/account/selling/M2RO1DNV/shipping-label']
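As a sketch of an alternative (assuming the "Re-Print Postage" label is stable across listings), you could skip the regex and match on each action's label field instead:
actions = []
for l in links:
    props = json.loads(l['data-react-props'])
    actions.extend(a['href'] for a in props['actions'] if a['label'] == 'Re-Print Postage')
print(actions)  # ['/account/selling/M2RO1DNV/shipping-label'] for the sample div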

How to fetch/scrape all elements from a html "class" which is inside "span"?

I am trying to scrape data from a website where I am collecting data from all elements under "class" which is inside "span", using this piece of code. But I am ending up fetching only one element instead of all.
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    #element = soup.findAll("div", {"class": "sold-property-listing__location"})
    place_name = expand_hits[1].find("div", {"class": "sold-property-listing__location"}).findAll("span", {"class": "item-link"})[1].getText()
    print(place_name)
    apartments.append(final_str)
Expected result for print(place_name)
Stockholm
Malmö
Copenhagen
...
..
.
The result which is am getting for print(place_name)
Malmö
Malmö
Malmö
...
..
.
When I try to fetch the contents from expand_hits[1] I get only one element. If I don't specify the index, the scraper throws an error regarding the usage of find(), find_all() and findAll(). As far as I understand, I have to access the content of the elements iteratively.
Any help is much appreciated.
Thanks in Advance!
Use the loop variable rather than indexing into the same collection with the same index (expand_hits[1]), and append place_name, not final_str:
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    place_name = hit_property.find("div", {"class": "sold-property-listing__location"}).find("span", {"class": "item-link"}).getText()
    print(place_name)
    apartments.append(place_name)
You then only need find() and no indexing.
Add a User-Agent header to ensure results. Also, note that I have to pick a parent node, because at least one result will not be captured by using the class item-link, e.g. Övägen 6C. I use replace to get rid of the hidden text that comes with selecting the parent node.
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.hemnet.se/salda/bostader?location_ids%5B%5D=474035"
page = requests.get(url, headers = {'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(page.content,'html.parser')

for result in soup.select('.sold-results__normal-hit'):
    print(re.sub(r'\s{2,}',' ', result.select_one('.sold-property-listing__location h2 + div').text).replace(result.select_one('.hide-element').text.strip(), ''))
If you only want the locality within Malmö, e.g. Limhamns Sjöstad, you need to check how many child span tags there are for each listing:
for result in soup.select('.sold-results__normal-hit'):
    nodes = result.select('.sold-property-listing__location h2 + div span')
    if len(nodes) == 2:
        place = nodes[1].text.strip()
    else:
        place = 'not specified'
    print(place)

Webscraping Issue w/ BeautifulSoup

I am new to Python web scraping, and I am scraping productreview.com for reviews. The following code pulls all the data I need for a single review:
#Scrape TrustPilot for User Reviews (Rating, Comments)
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import json
import requests
import datetime as dt

final_list = []
url = 'https://www.productreview.com.au/listings/world-nomads'
r = requests.get(url)
soup = bs(r.text, 'lxml')
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
    try:
        name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
        name = name.find('span').text
        location = soup.find('h4').find('small').text
        policy = soup.find('div', class_ ='px-4_1Cw pt-4_9Zz pb-2_1Ex card-body_2iI').find('span').text
        title = soup.find('h3').find('span').text
        content = soup.find('p', class_ = 'mb-0_2CX').text
        rating = soup.find('div', class_ = 'mb-4_2RH align-items-center_3Oi flex-wrap_ATH d-flex_oSG')
        rating = rating.find('div')['title']
        final_list.append([name, location, policy, rating, title, content])
    except AttributeError:
        pass

reviews = pd.DataFrame(final_list, columns = ['Name', 'Location', 'Policy', 'Rating', 'Title', 'Content'])
print(reviews)
But when I edit
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
to
for div in soup.findAll('div', class_ = 'loadingOverlay_24D'):
I don't get all the reviews; I just get the same entry looped over and over.
Any help would be much appreciated.
Thanks!
Issue 1: Repeated data inside the loop
Your loop has the following form:
for div in soup.find('div', ...):
    name = soup.find('h4', ...)
    policy = soup.find('div', ...)
    ...
Notice that you are calling find on the soup object inside the loop. This means that each time you try to find the value for name, it searches the whole document from the beginning and returns the first match, in every iteration.
This is why you are getting the same data over and over.
To fix this, you need to call find on the current review div you are iterating over. That is:
for div in soup.find('div', ...):
    name = div.find('h4', ...)
    policy = div.find('div', ...)
    ...
Issue 2: Missing data and error handling
In your code, any errors inside the loop are ignored. However, there are many errors that are actually happening while parsing and extracting the values. For example:
location = div.find('h4').find('small').text
Not all reviews have location information. Hence, the code will extract h4, then try to find small, but won't find any, returning None. Then you are calling .text on that None object, causing an exception. As a result, this review will not be added to the result data frame.
To fix this, you need to add more error checking. For example:
locationDiv = div.find('h4').find('small')
if locationDiv:
    location = locationDiv.text
else:
    location = ''
Issue 3: Identifying and extracting data
The page you're trying to parse has broken HTML, and uses CSS classes that seem random or at least inconsistent. You need to find the correct and unique identifiers for the data that you are extracting such that they strictly match all the entries.
For example, you are extracting the review-container div using CSS class loadingOverlay_24D. This is incorrect. This CSS class seems to be for a "loading" placeholder div or something similar. Actual reviews are enclosed in div blocks that look like this:
<div itemscope="" itemType="http://schema.org/Review" itemProp="review">
....
</div>
Notice that the uniquely identifying property is the itemProp attribute. You can extract those div blocks using:
soup.find_all('div', {'itemprop': 'review'})
Similarly, you have to find the correct identifying properties of the other data you want to extract to ensure you get all your data fully and correctly.
One more thing, when a tag has more than one CSS class, usually only one of them is the identifying property you want to use. For example, for names, you have this:
name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
but in reality, you don't need all of these classes. The first class, in this case, is sufficient to identify the name h4 blocks:
name = soup.find('h4', class_ = 'my-0_27D')
Example:
Here's an example to extract the author names from review page:
for div in soup.find_all('div', {'itemprop': 'review'}):
    name = div.find('h4', class_ = 'my-0_27D')
    if (name):
        name = name.find('span').text
    else:
        name = '-'
    print(name)
Output:
Aidan
Bruno M.
Ba. I.
Luca Evangelista
Upset
Julian L.
Alison Peck
...
The page serves broken HTML, and html.parser is better at dealing with it.
Change soup = bs(r.text, 'lxml') to soup = bs(r.text, 'html.parser')

How to find text of <div><span>text</span></div> in beautifulsoup?

This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text
I'd not go with getting it by the class directly, since I think "list_count" is too broad of a class value and might be used for other things on the page.
There are definitely several different options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use the "Followers" text/label and get its next sibling:
from bs4 import BeautifulSoup
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""
soup = BeautifulSoup(data, "html.parser")
count = soup.find(text=lambda text: text and text.startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
Or another, very concise and reliable, approach would be to use a partial match (the *= part below) on the href value of the parent a element:
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())

select a specific set of cell under a set of tables using python and beautifulsoup

Consider there are N web pages.
Each web page has one or more tables. The common thing the tables have is that their class is same, consider "table_class."
We need the contents under the same column (the third column, whose heading is "Title") of every table.
By contents I mean the href links in column three, from all rows.
Some rows might just be plain text and some might have an href link in them.
You should print each href link on a separate line, one after the other.
Using attributes to filter is not valid as some tags have different attributes. The position of the cell is the only hint available.
How do you code this?
Consider these two links for the web pages:
http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2014
http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2013
Consider the table: wikitable
Required content: href links of column Title
Code I tried for one page:
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)

for sp in soup.find_all('tr'):
    for bt in sp.find_all('td'):
        for link in bt.find_all('a'):
            print(link.get("href"))
    print()
The idea is to iterate over every table with the wikitable class; for every table, find links directly inside an i tag directly inside a td directly inside a tr:
import requests
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2014"
soup = BeautifulSoup(requests.get(url).content)

# iterate over tables
for table in soup.select('table.wikitable.sortable'):
    # get the table header/description, continue if not found
    h3 = table.find_previous_sibling('h3')
    if h3 is None:
        continue
    print h3.text

    # get the links
    for link in table.select('tr > td > i > a'):
        print link.text, "|", link.get('href', '')
    print "------"
Prints (also printing table names for clarity):
January 2014–june 2014[edit]
Celebrity | /wiki/Celebrity
Kshatriya | /wiki/Kshatriya
1: Nenokkadine | /wiki/1:_Nenokkadine
...
Oohalu Gusagusalade | /wiki/Oohalu_Gusagusalade
Autonagar Surya | /wiki/Autonagar_Surya
------
July 2014 – December 2014[edit]
...
O Manishi Katha | /wiki/O_Manishi_Katha
Mukunda | /wiki/Mukunda
Chinnadana Nee Kosam | /wiki/Chinnadana_Nee_Kosam
------
