How to get text with no class/id using selenium/python? - python

I'm trying to get a list of variables (date, size, medium, etc.) from this page (https://www.artprice.com/artist/844/hans-arp/lots/pasts) using python/selenium.
For the titles it was easy enough to use :
titles = driver.find_elements_by_class_name("sln_lot_show")
for title in titles:
    print(title.text)
However the other variables seem to be text within the source code which have no identifiable id or class.
For example, to fetch the dates made I have tried:
dates_made = driver.find_elements_by_xpath("//div[@class='col-sm-6']/p[1]")
for date_made in dates_made:
    print(date_made.get_attribute("date"))
and
dates_made = driver.find_elements_by_xpath("//div[@class='col-sm-6']/p[1]/date")
for date_made in dates_made:
    print(date_made.text)
which both produce no error, but are not printing any results.
Is there some way to get this text, which has no specific class or id?
Specific html here :
......
<div class="col-xs-8 col-sm-6">
<p>
<i><a id="sln_16564482" class="sln_lot_show" href="/artist/844/hans-arp/print-multiple/16564482/vers-le-blanc-infini" title=""Vers le Blanc Infini"" ng-click="send_ga_event_now('artist_past_lots_search', 'select_lot_position', 'title', {eventValue: 1})">
"Vers le Blanc Infini"
</a></i>
<date>
(1960)
</date>
</p>
<p>
Print-Multiple, Etching, aquatint,
<span ng-show="unite_to == 'in'" class="ng-hide">15 3/4 x 18 in</span>
<span ng-show="unite_to == 'cm'">39 x 45 cm</span>
</p>

Progressive mode: the JavaScript below returns a two-dimensional array (lots and their details; indexes 0, 1, 2, 8 and 9 are the ones you need):
lots = driver.execute_script('return [...document.querySelectorAll(".lot .row")].map(e => [...e.querySelectorAll("p")].map(e1 => e1.textContent.trim()))')
Classic mode:
lots = driver.find_elements_by_css_selector(".lot .row")
for lot in lots:
    lotNo = lot.find_element_by_xpath("./div[1]/p[1]").get_attribute("textContent").strip()
    title = lot.find_element_by_xpath("./div[2]/i").get_attribute("textContent").strip()
    details = lot.find_element_by_xpath("./div[2]/p[2]").get_attribute("textContent").strip()
    date = lot.find_element_by_xpath("./div[3]/p[1]").get_attribute("textContent").strip()
    country = lot.find_element_by_xpath("./div[3]/p[2]").get_attribute("textContent").strip()

Related

Scraping the attribute of the first child from multiple div (selenium)

I'm trying to scrape the class name of the first child (span) from multiple divs.
Here is the html code:
<div class="ui_column is-9">
<span class="name1"></span>
<span class="...">...</span>
...
<div class="ui_column is-9">
<span class="name2"></span>
<span class="...">...</span>
...
<div class ..
URL of the page for the complete code.
I'm achieving this task with this code for the first five div:
i = 0
liste = []
while i <= 4:
    parent = driver.find_elements_by_xpath("//div[@class='ui_column is-9']")[i]
    child = parent.find_element_by_xpath("./child::*")
    class_name = child.get_attribute('class')
    i = i + 1
    liste.append(class_name)
But do you know if there is an easier way to do it?
You can directly get all these first span elements and then extract their class attribute values as following:
liste = []
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[1]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)
You can also extract the class attribute values from only the first 5 elements by limiting the loop to 5 iterations.
UPD
Well, after updating your question the answer becomes different and much simpler.
You can get the desired elements directly and extract their class name attribute values as following:
liste = []
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[contains(@class,'ui_bubble_rating')]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)

Remove redundant class names in HTML using BeautifulSoup

I want to convert:
<span class = "foo">data-1</span>
<span class = "foo">data-2</span>
<span class = "foo">data-3</span>
to
<span class = "foo"> data-1 data-2 data-3 </span>
using BeautifulSoup in Python. This HTML pattern appears in multiple areas of the page body, so I want to collapse it and then scrape it. The middle span originally had an em class, which is why the text was split into separate spans.
Adapted from this answer to show how this could be used for your span tags:
span_tags = container.find_all('span')
# combine all the text from the span tags, separated by spaces
text = ' '.join(span.get_text(strip=True) for span in span_tags)
# choose the tag you want to preserve and update its text
span_main = span_tags[0]  # target it however you want; here it is simply the first one in the list
span_main.string = text  # replace the text
for tag in span_tags:
    if tag is not span_main:
        tag.decompose()

Format data of string of dates and associated values from an unstructured list

I want to save movie reviews (the dates and the reviews) obtained from web scraping with Beautiful Soup into a data frame. There is at least one review for each posted date, and there can be several reviews per day.
The thing is, the HTML doesn't have a div wrapping each date and its associated reviews; instead, dates and reviews are all sibling tags, ordered one after another.
Here is a snippet of the html:
<div class="more line-bottom">
<a class="next" href="es/news/374528/cat/113418/#cm"> <span>anterior</span> <span class="icon"> </span> </a>
</div>
<div class="date">
<p>miércoles, 7 de agosto de 2019</p>
</div>
<div class="article clear-block no-photo">
<div class="box-text-article">
<p class="news-info">
<a href="es/newsdetail/376261">
<span>Dokufest 2019</span>
</a>
</p>
<h2>
Crítica: <i>Aether</i>
</h2>
</div>
</div>
<div class="date">
<p>viernes, 2 de agosto de 2019</p>
</div>
<div class="article clear-block no-photo">
<div class="box-text-article">
<p class="news-info">
<span>Peliculas / Reviews</span>
</p>
<h2>Crítica: <i>Remember Me (Recuérdame)</i></h2>
</div>
</div>
<div class="article clear-block no-photo">
<div class="box-text-article">
<p class="news-info">
<span>Peliculas / Reviews</span>
</p>
<h2>Crítica: <i>Animals</i></h2>
</div>
</div>
I am able to fetch all the text of interest using a for loop and .next_siblings, but formatting the obtained text then takes many steps. Is there a more pythonic solution that you can suggest?
I have seen other posts with solutions that might apply, but only if I had a known number of elements (for example, using tuples and converting them to dictionaries). Since there can be more than one review per date, those answers don't apply.
Here is my code for the web scraping and formatting:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from itertools import groupby
req = requests.get("https://www.cineuropa.org/es/news/cat/113418/")
soup = BeautifulSoup(req.text, "lxml")
# result is the container of the tags of interest.
result = soup.find("div", attrs = {'class':'grid-65'})
# This is the element prior to the list of movie reviews
prior_sib_1st_review = result.find("div", attrs= {'class':"more line-bottom"})
Then what I do is use the only attribute present in the date div to differentiate it from the review tags, and add it to the head of a tuple. Since the fetched data is ordered, there is always a date followed by a variable number of review titles. I add the titles to the tuple until a new date comes up. I have to do it using try/except because an error is raised otherwise. The list ends up with some unicode.
_list = []
tup = ()
for sibling in prior_sib_1st_review.next_siblings:
    try:
        if list(sibling.attrs.values())[0][0] == "date":
            tup = (repr(sibling.text),)
        else:
            tup = tup + (repr(sibling.text),)
    except AttributeError as error:
        pass
    _list.append(tup)
The problem with this is that I get tuples that start with the same date and grow in length with each loop iteration. So I remove the empty elements and the ones that only contain the date:
_list_dedup = [item for item in _list if len(item)>1]
Then I group by the dates.
group_list = []
for key, group in groupby(_list_dedup, lambda x: x[0]):
    group_list.append(list(group))
And finally keep the longest tuple of the list, which would be the one containing all the associated reviews for each date.
final_list = []
for elem in group_list:
    final_list.append(max(elem))
df_ = pd.DataFrame(final_list)
Have you tried iterating through all divs, inspecting the class of each, and then storing the most recent date encountered? I think that is the most common solution for a problem like yours. For example:
from bs4 import BeautifulSoup
import requests
req = requests.get("https://www.cineuropa.org/es/news/cat/113418/")
soup = BeautifulSoup(req.text, "lxml")
# result is the container of the tags of interest.
result = soup.find("div", attrs = {'class':'grid-65'})
entries = {}
date = ""
for o in result.find_all('div'):
    classes = o.get('class', [])  # divs without a class attribute yield an empty list
    if 'date' in classes:
        date = o.text
    if 'box-text-article' in classes:
        try:
            entries[date].append(o)
        except KeyError:
            entries[date] = [o]
print(entries)
The result of this sample is a dictionary with dates as keys and lists of BeautifulSoup objects matching the class 'box-text-article'. Since dates always precede their corresponding articles, there's always a date to match. You can add a few lines to get the title, the link, etc. (The try/except bit in the middle just allows you to make a new entry for a date not yet in the dictionary or to append to an existing date entry if it is found.)
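To get from such a grouping to the data frame asked for in the question, the same "remember the latest date" idea can emit one row per review. The following is only a minimal sketch in plain Python: it assumes the sibling divs have already been reduced to (css_class, text) pairs, and the sample pairs as well as the name rows_from_siblings are illustrative, not taken from the real page.

```python
def rows_from_siblings(items):
    """Turn a flat, ordered stream of (css_class, text) pairs into
    one {'date': ..., 'title': ...} row per article."""
    rows = []
    current_date = None
    for css_class, text in items:
        if css_class == "date":
            # Remember the latest date; it applies to the articles that follow
            current_date = text.strip()
        elif css_class == "article" and current_date is not None:
            rows.append({"date": current_date, "title": text.strip()})
    return rows


# Hypothetical pairs mirroring the HTML snippet in the question
siblings = [
    ("date", "miércoles, 7 de agosto de 2019"),
    ("article", "Crítica: Aether"),
    ("date", "viernes, 2 de agosto de 2019"),
    ("article", "Crítica: Remember Me (Recuérdame)"),
    ("article", "Crítica: Animals"),
]

rows = rows_from_siblings(siblings)
for row in rows:
    print(row["date"], "|", row["title"])
```

A list of dictionaries like rows can be passed directly to pd.DataFrame(rows), which gives one line per review with the date repeated where needed.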
Here is a way to produce a nested dict in which the outer keys are the dates and the inner values are dicts with the event name as key and the link to the review as value. It uses :contains (requires bs4 4.7.1+) to isolate each date section, and CSS filtering to exclude the entries that belong to the following date sections.
from bs4 import BeautifulSoup
import requests
from collections import OrderedDict
from pprint import pprint
req = requests.get("https://www.cineuropa.org/es/news/cat/113418/")
soup = BeautifulSoup(req.text, "lxml")
base = 'https://www.cineuropa.org'
listings = OrderedDict()
for division in soup.select('.date'):
    date = division.text
    next_date = soup.select_one('.date:contains("' + date + '") ~ .date')
    if next_date:
        next_date = next_date.text
        current_info = soup.select('.date:contains("' + date + '") ~ div .box-text-article:not(.date:contains("' + next_date + '") ~ div .box-text-article) a')
    else:
        current_info = soup.select('.date:contains("' + date + '") ~ div .box-text-article a')
    listings[date] = {i.text: base + i['href'] for i in current_info}
pprint(listings)
Only Crítica
If you only want the Crítica then you can filter again with :contains
for division in soup.select('.date'):
    date = division.text
    next_date = soup.select_one('.date:contains("' + date + '") ~ .date')
    if next_date:
        next_date = next_date.text
        current_info = soup.select('.date:contains("' + date + '") ~ div .box-text-article:not(.date:contains("' + next_date + '") ~ div .box-text-article) a:contains("Crítica:")')
    else:
        current_info = soup.select('.date:contains("' + date + '") ~ div .box-text-article a:contains("Crítica:")')
    listings[date] = {i.text: base + i['href'] for i in current_info}

Extracting information by scraping products

I'm learning how to extract product information by scraping e-commerce sites. I have made some progress, but there are parts that I'm not able to parse.
With this code I can take the information that is in the tags:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup('A LOT OF HTML HERE', 'html.parser')
productos = soup.find_all('li', {'class': 'item product product-item col-xs-12 col-sm-6 col-md-4'})
web_content_list = []
for product_info in productos:
    # Store the information in a dictionary
    web_content_dict = {}
    web_content_dict['Marca'] = product_info.find('div', {'class': 'product-item-manufacturer'}).text
    web_content_dict['Producto'] = product_info.find('strong', {'class': 'product name product-item-name'}).text
    web_content_dict['Precio'] = product_info.find('span', {'class': 'price'}).text
    # Append the dictionary to the list
    web_content_list.append(web_content_dict)
df_kiwoko = pd.DataFrame(web_content_list)
I can take information from for example:
<div class="product-item-manufacturer"> PEDIGREE </div>
And I'd like to take information from this part:
<a href="https://www.kiwoko.com/sobre-pedigree-pollo-en-salsa-100-g-pollo-y-verduras.html"
class="product photo product-item-photo" tabindex="-1"
data-id="PED321441" data-name="Sobre Pedigree Vital Protection pollo y verduras en salsa 100 g"
data-price="0.49" data-category="PERROS" data-list="PERROS" data-brand="PEDIGREE"
data-quantity="1" data-click=""
For example take "Perros" from
data-category="PERROS"
How can I take information that is not between > and <, that is, the attribute values between the quotes?
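One way to read such values with BeautifulSoup is to treat the tag like a dictionary: attributes such as data-category can be read by subscripting the tag or with .get(). The following is a minimal sketch, assuming a shortened copy of the anchor tag above; the html string, the link text and the variable names are illustrative and not taken from the real page.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the anchor tag from the question
html = '''
<a href="https://www.kiwoko.com/sobre-pedigree-pollo-en-salsa-100-g-pollo-y-verduras.html"
   class="product photo product-item-photo" data-id="PED321441"
   data-price="0.49" data-category="PERROS" data-brand="PEDIGREE">Sobre Pedigree</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# class_ matches a tag if any one of its classes equals the given name
link = soup.find('a', class_='product-item-photo')

# Attribute values are read like dictionary entries
category = link['data-category']
# .get() returns None instead of raising KeyError when the attribute is missing
brand = link.get('data-brand')
missing = link.get('data-color')

# Collect every data-* attribute at once
data_attrs = {k: v for k, v in link.attrs.items() if k.startswith('data-')}
print(category, brand, missing)
print(data_attrs)
```

Inside the loop from the question, the same lookup could be applied to product_info.find('a', class_='product-item-photo') and the value stored under a new key of web_content_dict, assuming every product item contains such a link.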

Using requests and Beautifulsoup to find text in page (With CSS)

I'm making a request to a webpage and trying to retrieve some text from it. The text is split up with span tags like this:
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
There are inline style sheets (CSS) that say whether or not each piece of text is displayed, so that the gibberish is not shown on screen. This is an example of one of the sheets:
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}
But there are more CSS files like this, so I don't know if there is a better way to achieve my goal (printing the text that shows on screen and leaving out the gibberish that is not displayed).
My script is able to print the text, but all of it (gibberish included), as follows: "This is jvgviehrgjfne my gt4ugirdfgr script!"
If I understood you correctly, you should parse the CSS files with a regex to find the classes associated with inline, and pass the results to the BeautifulSoup API. Here is one way:
import re
import bs4
page_txt = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""
css_file_read_output = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}"""
css_file_lines = css_file_read_output.splitlines()
css_lines_text = []
for line in css_file_lines:
    inline_search = re.search(".*inline.*", line)
    if inline_search is not None:
        inline_group = inline_search.group()
        class_name_search = re.search(r"\..*\{", inline_group)
        class_name_group = class_name_search.group()
        class_name_group = class_name_group[1:-1]  # getting rid of the last { and first .
        css_lines_text.append(class_name_group)
    else:
        pass
page_bs = bs4.BeautifulSoup(page_txt, "lxml")
wanted_text_list = []
for line in css_lines_text:
    wanted_line = page_bs.find("span", class_=line)
    wanted_text = wanted_line.get_text(strip=True)
    wanted_text_list.append(wanted_text)
wanted_string = " ".join(wanted_text_list)
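Note that the loop above builds the result in the order of the CSS rules and picks only the first span per class. A variation that may be more robust is to collect the hidden classes instead and walk the spans in document order. The sketch below does this with the standard library's html.parser rather than BeautifulSoup; the names hidden_classes and VisibleSpanText are illustrative, the CSS is assumed to consist of simple single-class rules like those in the question, and nested spans are not handled.

```python
import re
from html.parser import HTMLParser


def hidden_classes(css_text):
    """Return the class names whose rule contains display:none."""
    hidden = set()
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css_text):
        if re.search(r"display\s*:\s*none", body):
            hidden.add(name)
    return hidden


class VisibleSpanText(HTMLParser):
    """Collect the text of <span> elements in document order,
    skipping spans that carry a hidden class."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden
        self.keep = False  # True while inside a visible span
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self.keep = not any(c in self.hidden for c in classes)

    def handle_endtag(self, tag):
        if tag == "span":
            self.keep = False

    def handle_data(self, data):
        if self.keep and data.strip():
            self.words.append(data.strip())


css = """
.ed{display:inline}
.1{display:inline}
.12{display:none}
.dfe{display:inline}
.fd{display:none}
.df{display:inline}
"""
page = """
<span class="ed">This</span>
<span class="1">is</span>
<span class="12">jvgviehrgjfne</span>
<span class="dfe">my</span>
<span class="fd">gt4ugirdfgr</span>
<span class="df">string</span>
"""

parser = VisibleSpanText(hidden_classes(css))
parser.feed(page)
visible_text = " ".join(parser.words)
print(visible_text)
```

If several style sheets are involved, their contents could be concatenated before calling hidden_classes, assuming a later sheet never re-enables a class that an earlier one hides.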
