Python/selenium webscraping - python

for link in data_links:
    driver.get(link)
    review_dict = {}
    # get the size of company
    size = driver.find_element_by_xpath('//*[@id="EmpBasicInfo"]//span')
    # location = ??? need to get this part as well.
My concern:
I am trying to scrape a website. I am using Selenium with Python to scrape "501 to 1000 employees" and "Biotech & Pharmaceuticals" from the spans, but I am not able to extract the text from the elements using XPath. I have tried getText, get_attribute, everything. Please help!
This is the output for each iteration: I am not getting the text value.
Thank you in advance!

It seems you only want the text rather than to interact with an element. One solution is to let Selenium render the page that JavaScript builds and then use BeautifulSoup to parse the HTML. First get the HTML content with html = driver.page_source, and then you can do something like:
html ='''
<div id="CompanyContainer">
<div id="EmpBasicInfo">
<div class="">
<div class="infoEntity"></div>
<div class="infoEntity">
<label>Industry</label>
<span class="value">Woodcliff</span>
</div>
<div class="infoEntity">
<label>Size</label>
<span class="value">501 to 1000 employees</span>
</div>
</div>
</div>
</div>
''' # Just a sample, since I don't have the actual page to interact with.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
>>> soup.find("div", {"id":"EmpBasicInfo"}).findAll("div", {"class":"infoEntity"})[2].find("span").text
'501 to 1000 employees'
Or, of course, avoiding specific indexing and looking for the <label>Size</label>, it should be more readable:
>>> [a.span.text for a in soup.findAll("div", {"class":"infoEntity"}) if (a.label and a.label.text == 'Size')]
['501 to 1000 employees']
Using selenium you can do:
>>> driver.find_element_by_xpath("//*[@id='EmpBasicInfo']/div[1]/div/div[3]/span").text
'501 to 1000 employees'
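Since the question also asks for a second field, here is a minimal sketch that maps every label to its value in one pass instead of indexing by position. The function name company_info and the sample markup are assumptions for illustration (the industry value is taken from the question text); in real use the HTML would come from driver.page_source.

```python
from bs4 import BeautifulSoup

# Same shape as the sample markup above; the real page may differ.
SAMPLE_HTML = """
<div id="EmpBasicInfo">
  <div class="">
    <div class="infoEntity"></div>
    <div class="infoEntity">
      <label>Industry</label>
      <span class="value">Biotech &amp; Pharmaceuticals</span>
    </div>
    <div class="infoEntity">
      <label>Size</label>
      <span class="value">501 to 1000 employees</span>
    </div>
  </div>
</div>
"""


def company_info(html):
    """Map each label in #EmpBasicInfo to the text of its value span."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="EmpBasicInfo")
    if container is None:
        return {}
    info = {}
    for entity in container.find_all("div", class_="infoEntity"):
        # Skip entries that lack either the label or the value span.
        if entity.label and entity.span:
            key = entity.label.get_text(strip=True)
            info[key] = entity.span.get_text(strip=True)
    return info


info = company_info(SAMPLE_HTML)
print(info.get("Size"))
print(info.get("Industry"))
```

Looking values up by label should keep working if the order of the infoEntity blocks changes, which positional indexing would not.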

Related

Scrapy - get content after an identified class

I'm trying to extract the content from this html:
<div class=product_detail>
<p>
Random stuff
</p>
<p>
<span class="brand_color">Brand:</span>Product Brand
</p>
</div>
I'm able to get "Brand:" with response.css('span.brand_color::text'), but I'm not able to get "Product Brand".
I'd like to build something that:
finds the brand_color span --> this is not present in 100% of cases
goes up to find the parent
then goes down, somehow ignores the span, and selects the ::text.
(My logic may be completely wrong, though.)
Thanks a lot!
I would suggest using BeautifulSoup. It's a very powerful parsing library.
Read More about BeautifulSoup here: https://beautiful-soup-4.readthedocs.io/en/latest/
You can install it easily:
pip install beautifulsoup4
from bs4 import BeautifulSoup
HTML = '<div class=product_detail> <p> Random stuff </p> <p> <span class="brand_color">Brand:</span>Product Brand </p> </div>'
parsed_object = BeautifulSoup(HTML, 'html.parser')
res = [p.get_text().strip() for p in parsed_object.find_all('p')]
print(res)
You'd get the following content:
['Random stuff', 'Brand:Product Brand']
You can then use split to extract your data (splitting only on the first colon, in case the brand itself contains one):
brand_name, paragraph_content = res[1].split(':', 1)
print(brand_name) # Brand
print(paragraph_content) # Product Brand
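As an alternative to splitting on the colon, the text node that follows the span can be read directly, which is closer to the "find the span, then take what comes after it" logic in the question. This is only a sketch: the helper name brand_after_span is made up for illustration, and it returns None when the span is missing.

```python
from bs4 import BeautifulSoup

HTML = (
    '<div class=product_detail> <p> Random stuff </p> '
    '<p> <span class="brand_color">Brand:</span>Product Brand </p> </div>'
)


def brand_after_span(html):
    """Return the text that directly follows span.brand_color, or None."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.select_one("span.brand_color")
    if span is None:
        # The span is not present on every page.
        return None
    sibling = span.next_sibling
    # next_sibling may be missing, or may be another tag rather than text.
    if sibling is None or not isinstance(sibling, str):
        return None
    return sibling.strip() or None


print(brand_after_span(HTML))
print(brand_after_span("<p>No brand here</p>"))
```

In Scrapy itself, an XPath along the lines of response.xpath('//span[@class="brand_color"]/following-sibling::text()[1]').get() should select the same text node without bringing in a second parser.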

Extracting string from <h1> element with logic attached

I am trying to scrape some sports game data and I have run into some issues with my code. Eventually I will move this data into a dataframe and then into a database.
In the code, I have found the class element of one of the headers I would like to parse. There are multiple h1's in the HTML I am parsing.
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Blackhawks vs. Ducks</h1>
</div>
With this HTML structure, how can I get the h1 to return to a string I can use to populate a dataframe?
Code I have tried so far is:
req = requests.get(url) # + str(page) + '/')
soup = bs(req.text, 'html.parser')
stype = soup.find('h1', class_ ='type-game')
print(stype)
This code returns "None". I have checked other articles on here and nothing has worked so far.
For the next level of my question, is there a way to create a For loop or similar to go through all of the pages (website is numbered sequentially for events) for any games that contain a string?
For example, if I wanted to only save games that have the Chicago Blackhawks in the h1 for the div element that has class= type-game?
Pseudocode would be something like this:
For webpages 1 to 10000:
    if class_='type-game' 'h1' contains "Blackhawks"
        then proceed with parsing the code
    if not, skip the code and go to the next webpage
I know this is a little open ended, but I have a good VBA background and trying to apply those coding ideas to Python has been a challenge.
Select your elements more specifically, for example with CSS selectors:
soup.select('h1:-soup-contains("Blackhawks")')
or
soup.select('div.type-game h1:-soup-contains("Blackhawks")')
To get the text from a tag, just use .text or get_text():
for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Example
html='''
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Blackhawks vs. Ducks</h1>
</div>
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Hawks vs. Ducks</h1>
</div>
<div class="type-game">
<div class="type">NHL Regular Season</div>
<h1>Ducks vs. Blackhawks</h1>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for e in soup.select('h1:-soup-contains("Blackhawks")'):
    print(e.text)
Output
Blackhawks vs. Ducks
Ducks vs. Blackhawks
EDIT
for e in soup.select('div.type-game h1'):
    if 'Blackhawks' in e.text:
        print(e.text)  # or do whatever needs to be done
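For the second part of the question (walking through sequentially numbered pages and keeping only matching games), a rough sketch could look like the following. The names matching_games, fetch_html and FAKE_PAGES are hypothetical; fetch_html stands in for whatever downloads a page, so the example runs without network access.

```python
from bs4 import BeautifulSoup


def matching_games(page_numbers, fetch_html, keyword="Blackhawks"):
    """Return (page, title) pairs for pages whose type-game h1 mentions keyword.

    fetch_html is a hypothetical callable that takes a page number and
    returns that page's HTML, or None if the page could not be loaded.
    """
    matches = []
    for page in page_numbers:
        html = fetch_html(page)
        if html is None:
            continue  # skip pages that failed to load
        soup = BeautifulSoup(html, "html.parser")
        for h1 in soup.select("div.type-game h1"):
            title = h1.get_text(strip=True)
            if keyword in title:
                matches.append((page, title))
    return matches


# Stand-in pages so the sketch runs without network access.
FAKE_PAGES = {
    1: '<div class="type-game"><h1>Blackhawks vs. Ducks</h1></div>',
    2: '<div class="type-game"><h1>Hawks vs. Ducks</h1></div>',
    3: '<div class="type-game"><h1>Ducks vs. Blackhawks</h1></div>',
}

games = matching_games(range(1, 5), FAKE_PAGES.get)
print(games)
```

With a real site, fetch_html would typically wrap requests.get on the page URL and return None for non-200 responses. The list of pairs can be passed straight to a DataFrame constructor later.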

Python Selenium, Scraping LinkedIn: Looping through Work and Education Histories

I am scraping data from LinkedIn profiles in Python using Selenium. It is mostly working but I can't figure out how to extract information for each employer or school in their history section.
I am working from the following tutorial: https://www.linkedin.com/pulse/how-easy-scraping-data-from-linkedin-profiles-david-craven/
And I am looking at this profile: https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk
Here is a partial snippet of the HTML section I am struggling with:
<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
<h2 class="pv-profile-section__card-heading">
Experience
</h2>
<!----></header>
<ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
<li id="ember136" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view"> <section id="1762786165" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view"> <div class="display-flex justify-space-between full-width">
<div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/wagestream/" id="ember138" class="full-width ember-view"> <div class="pv-entity__logo company-logo">
<img src="https://media-exp1.licdn.com/dms/image/C560BAQEkzVWoORqWFQ/company-logo_100_100/0/1615996325297?e=1631145600&v=beta&t=SoZQKV09PqqYxYTzbjqV4XTJa7HkGUZRe4QT0jU5hmE" loading="lazy" alt="Wagestream" id="ember140" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
<h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
<p class="visually-hidden">Company Name</p>
<p class="pv-entity__secondary-title t-14 t-black t-normal">
Wagestream
<span class="pv-entity__secondary-title separator">Full-time</span>
</p>
<div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>Apr 2021 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">3 mos</span>
</h4>
</div>
<h4 class="pv-entity__location t-14 t-black--light t-normal block">
<span class="visually-hidden">Location</span>
<span>London, England, United Kingdom</span>
</h4>
<!---->
</div>
</a>
<!----> </div>
<!----> </div>
</section>
And this is followed by more "li" sections. So the overall history section can be identified with id="experience-section", work (as opposed to education) history can be identified in the "ul" section class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more". The information for the first job in the list can be identified with "li" section id="ember136".
I am trying to get job title, company, years in job etc. from this section but can't figure out how to do it. Here is a bit of python code to show what I have tried (skipping my log-in):
from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
# driver.get method() will navigate to a page given by the URL address
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
text=driver.page_source
sel = Selector(text)
# Using the "Copy xPath" option in Inspect in Google Chrome, I can manually extract the company name
sel.xpath('//*[@id="ember187"]/div[2]/p[2]/text()').extract_first()
# This will give me all of the text in the Work Experience section
html_list = driver.find_element_by_id("experience-section")
items = html_list.find_elements_by_tag_name("ul")
items = html_list.find_elements_by_tag_name("h3")
for item in items:
    print(type(item))
    text = item.text
    print(text)
But these approaches are not great for an automated and systematic extraction of info from each job across profiles. What I would like to do is something like looping across "li" sections within each "ul" section, and within the "li" part, extract only the company name with class = "pv-entity__secondary-title t-14 t-black t-normal". But find_element_by_class_name only yields NoneTypes.
I'm not sure conceptually how to generate an iterable list of "ul" and "li" with selenium, and within each iteration extract specific bits of text using class names.
Here is a solution I came up with. I should point out I "cross posted" in a YouTube comment for the following tutorial: https://www.youtube.com/watch?v=W4Md-koupmE
Run the whole code, but substitute your own email and password.
First, open the browser, sign into LinkedIn, and navigate to the relevant profile:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from time import sleep
# Path to the chromedriver.exe
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
driver.get('https://www.linkedin.com')
# Log into LinkedIn
username = driver.find_element_by_id('session_key')
username.send_keys('mail@mail.com')
sleep(0.5)
password = driver.find_element_by_id('session_password')
password.send_keys('password')
sleep(0.5)
log_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
log_in_button.click()
sleep(3)
# The example profile I am trying to scrape
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
sleep(3)
If I just start trying to extract stuff, I will get an error. It turns out that I need to scroll down to the relevant section for it to load, otherwise no data is created:
# The experience section doesn't load until you scroll to it, this will scroll to the section
l= driver.find_element_by_xpath('//*[@id="oc-background-section"]')
driver.execute_script("arguments[0].scrollIntoView(true);", l)
To loop through the work experience, first I identify the "id" value for it, in this case "experience-section". Grab it with "find_element_by_id" method.
# Get stuff in work experience section
html_list = driver.find_element_by_id("experience-section")
This section contains a list of "li" elements (i.e. tag value "li"), each of which contains all the work info for each past job. Create a list of these WebElement types using "find_elements_by_tag_name".
# Jobs listed as li sections, create list of li
items = html_list.find_elements_by_tag_name("li")
Looking at the source code, I notice for instance that employer names can be identified by tag "p". This generates a list, and sometimes it contains multiple items. Make sure you select what you need:
x = items[0].find_elements_by_tag_name("p")
print(x[0].text)
# "Company Name"
print(x[1].text)
# "Wagestream Full-time"
Finally loop through the "li" sections, extracting relevant info, extract strings, and print desired info (or save as row in CSV):
# Loop through li list, extract each piece by tag name
for item in items:
    name_job = item.find_elements_by_tag_name("h3")
    name_emp = item.find_elements_by_tag_name("p")
    more = item.find_elements_by_tag_name("h4")
    job = name_job[0].text
    emp = name_emp[1].text
    # Keep the second line of each h4's text (the first line is the hidden label)
    yrs = more[0].text.split('\n')[1]
    loc = more[2].text.split('\n')[1]
    print(job)
    print(emp)
    print(yrs)
    print(loc)
# terminates the application
driver.quit()
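The question asked about selecting by class name rather than by tag position. One possible way to do that, sketched here under the assumption that the page source has already been captured with driver.page_source after scrolling, is to hand the HTML to BeautifulSoup. The helper names parse_jobs and last_span_text are made up for this example, and the class names are copied from the snippet in the question, so they may no longer match the live site.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one job entry from the question's HTML.
SAMPLE = """
<section id="experience-section"><ul>
<li class="pv-entity__position-group-pager">
  <h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
  <p class="visually-hidden">Company Name</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
    Wagestream
    <span class="pv-entity__secondary-title separator">Full-time</span>
  </p>
  <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
    <span class="visually-hidden">Dates Employed</span>
    <span>Apr 2021 – Present</span>
  </h4>
  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Location</span>
    <span>London, England, United Kingdom</span>
  </h4>
</li>
</ul></section>
"""


def last_span_text(parent, selector):
    """Text of the last span under the first match of selector, or None."""
    node = parent.select_one(selector)
    spans = node.find_all("span") if node else []
    return spans[-1].get_text(strip=True) if spans else None


def parse_jobs(html):
    """Return one dict per li in the experience section."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for li in soup.select("#experience-section li"):
        title = li.find("h3")
        company = li.select_one("p.pv-entity__secondary-title")
        # The company name is the first text node directly inside the <p>.
        name = company.find(string=True, recursive=False) if company else None
        jobs.append({
            "title": title.get_text(strip=True) if title else None,
            "company": name.strip() if name else None,
            "dates": last_span_text(li, "h4.pv-entity__date-range"),
            "location": last_span_text(li, "h4.pv-entity__location"),
        })
    return jobs


jobs = parse_jobs(SAMPLE)
print(jobs)
```

Each missing piece comes back as None instead of raising an IndexError, which may matter for entries that have no location or date range.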

How can I find <img src> nested within <div> using Beautiful Soup?

New to both Python and Beautiful Soup. I am trying to collect the src of an img inserted into a collapsible section on an e-commerce site. The collapsible sections that contain the images have the class of accordion__contents, but <img> inserted into the collapsible sections do not have a specific class. Not every page contains an image; some contain multiple.
I am trying to extract the src from img that are randomly nested within <div>. In the HTML example below, my desired output would be: https://example.com/image1.png
<div class="accordion__title">Description</div>
<div class="accordion__contents">
<p>Enjoy Daiya’s Hon’y Mustard Dressing on your salads</p>
</div>
<div class="accordion__title">Ingredients</div>
<div class="accordion__contents">
<p>Non-GMO Expeller Pressed Canola Oil, Filtered Water</p>
<p><strong>CONTAINS: MUSTARD</strong></p>
</div>
<div class="accordion__title">Nutrition</div>
<div class="accordion__contents">
<p>
<img alt="" class="alignnone size-medium wp-image-57054" height="300" src="https://example.com/image1.png" width="162"/>
</p>
</div>
<div class="accordion__title">Warnings</div>
<div class="accordion__contents">
<p><strong>Contains mustard</strong></p>
</div>
I've written the following code that successfully drills down to the full tag, but I can't figure out how to extract src once I'm there.
img_href = container.find_all(class_ ='accordion__contents') # generates the output above, in a list form
img_href = [img.find_all('img') for img in img_href]
for x in img_href:
    if len(x)==0: # skip over empty items in the list that don't have images
        continue
    else:
        print(x) # print to make sure the image is there
        x.find('img')['src'] # generates error - see below
The error I am getting is ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? My intent is not to be treating a list like an item, thus the loop.
I've tried find_all() combined with .attrs('src') but that also didn't work. What am I doing wrong?
I've simplified my example, but the URL for the page I'm scraping is here.
You can use CSS selector ".accordion__contents img":
import requests
from bs4 import BeautifulSoup
url = "https://gtfoitsvegan.com/product/hony-mustard-dressing-by-daiya/?v=7516fd43adaa"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_imgs = [img["src"] for img in soup.select(".accordion__contents img")]
print(all_imgs)
Prints:
['https://gtfoitsvegan.com/wp-content/uploads/2021/04/Daiya-Honey-Mustard-Nutrition-Facts-162x300.png']
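To show why the original loop failed: each x was itself a ResultSet (the list returned by find_all), so it needs one more loop before indexing into a tag. A minimal sketch of that fix, using a shortened stand-in for the page markup (the third section without a src attribute is an added assumption to show the guard):

```python
from bs4 import BeautifulSoup

HTML = """
<div class="accordion__contents"><p>No image here</p></div>
<div class="accordion__contents">
  <p><img alt="" src="https://example.com/image1.png" width="162"/></p>
</div>
<div class="accordion__contents"><p><img alt="no src attribute"/></p></div>
"""

container = BeautifulSoup(HTML, "html.parser")

srcs = []
for section in container.find_all(class_="accordion__contents"):
    # find_all returns a list-like ResultSet, so loop over it as well.
    for img in section.find_all("img"):
        src = img.get("src")  # .get returns None when the attribute is missing
        if src:
            srcs.append(src)

print(srcs)
```

Using img.get("src") instead of img["src"] avoids a KeyError on images that lack the attribute.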

How do I scrape this text from an HTML <span id> using Python, Selenium, and BeautifulSoup?

I'm working on creating a web scraping tool that generates a .csv report by using Python, Selenium, beautifulSoup, and pandas.
Unfortunately, I'm running into an issue with grabbing the "data-date" text from the HTML below. I am looking to pull the "2/4/2020" into the .csv my code is generating.
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
My python script starts off with the following:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
driver = webdriver.Chrome(r'C:\chromedriver.exe')
lastdatadate=[]
lastprocesseddate=[]
Then I have it log in to a website, enter my un/pw credentials, and click the continue/login button.
From there, I am using the following to parse the html, scrape the website, and pull the relevant data/text into a .csv:
content = driver.page_source
soup = bs(content, 'html.parser')
for a in soup.findAll('div', attrs={'class':'large-header-welcome'}):
    datadate=a.find(?????)
    processeddate=a.find('span', attrs={'id':'LargeHeader_dateText'})
    lastdatadate.append(datadate.text)
    lastprocesseddate.append(processeddate.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')
So far, I've got it working for the "last processed date" component of the HTML, but I am having trouble getting it to pull the "last data date" from the HTML. It's there, I just don't know how to have python find it. I've tried using the find method but I have not been successful.
I've tried googling around and checking here for what I should try, but I've come up empty-handed so far. I think I'm having trouble knowing what to search for.
Any insight would be much appreciated as I am trying to learn and get better. Thanks!
edit: here is a closer look of the HTML:
<div class="large-header-welcome">
<div class="row">
<div class="col-sm-6">
<h3 class="welcome-header">Welcome, <span id="LargeHeader_fullname">Rhett</span></h3>
<p class="">
<b>Site:</b> <span id="LargeHeader_Name">redacted</span>
<br />
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
</p>
</div>
To find one element, use find():
processeddate=soup.find('span', attrs={'id':'LargeHeader_dateText'}).text
To find multiple elements, use find_all():
for item in soup.find_all('span', attrs={'id':'LargeHeader_dateText'}):
    processeddate=item.text
Or you can use the CSS selector method select():
for item in soup.select('#LargeHeader_dateText'):
    processeddate=item.text
EDIT
To get the value of the data-date attribute, use the following code:
lastdatadate=[]
for item in soup.find_all('span',attrs={"id": "LargeHeader_dateText","data-date": True}):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Or use a CSS selector:
lastdatadate=[]
for item in soup.select('#LargeHeader_dateText[data-date]'):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Both give the same output; however, the latter may execute faster.
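Putting the two pieces together, a sketch that collects both the data-date attribute and the visible "Last Processed" date from the same span might look like this. The HTML is a shortened copy of the markup in the question, and taking the last whitespace-separated token as the date is an assumption about the text format.

```python
from bs4 import BeautifulSoup

HTML = (
    '<div class="large-header-welcome"><span class="import-popover">'
    '<span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" '
    'data-step="3" data-error="False">, Last Processed 2/5/2020</span>'
    '</span></div>'
)

soup = BeautifulSoup(HTML, "html.parser")

rows = []
for header in soup.find_all("div", class_="large-header-welcome"):
    span = header.find("span", id="LargeHeader_dateText")
    if span is None:
        continue  # no date span in this header block
    rows.append({
        # Attribute values are read with .get (or span["data-date"]).
        "Last Data Date": span.get("data-date"),
        # Drop the ", Last Processed " prefix from the visible text.
        "Last Processed Date": span.get_text(strip=True).split()[-1],
    })

print(rows)
```

A list of dicts like rows can be passed to pd.DataFrame(rows) and then written out with to_csv, as in the original script.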
