Python Selenium, Scraping LinkedIn: Looping through Work and Education Histories

I am scraping data from LinkedIn profiles in Python using Selenium. It is mostly working but I can't figure out how to extract information for each employer or school in their history section.
I am working from the following tutorial: https://www.linkedin.com/pulse/how-easy-scraping-data-from-linkedin-profiles-david-craven/
And I am looking at this profile: https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk
Here is a partial snippet of the HTML section I am struggling with:
<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
<h2 class="pv-profile-section__card-heading">
Experience
</h2>
<!----></header>
<ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
<li id="ember136" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view"> <section id="1762786165" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view"> <div class="display-flex justify-space-between full-width">
<div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/wagestream/" id="ember138" class="full-width ember-view"> <div class="pv-entity__logo company-logo">
<img src="https://media-exp1.licdn.com/dms/image/C560BAQEkzVWoORqWFQ/company-logo_100_100/0/1615996325297?e=1631145600&v=beta&t=SoZQKV09PqqYxYTzbjqV4XTJa7HkGUZRe4QT0jU5hmE" loading="lazy" alt="Wagestream" id="ember140" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
<h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
<p class="visually-hidden">Company Name</p>
<p class="pv-entity__secondary-title t-14 t-black t-normal">
Wagestream
<span class="pv-entity__secondary-title separator">Full-time</span>
</p>
<div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>Apr 2021 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">3 mos</span>
</h4>
</div>
<h4 class="pv-entity__location t-14 t-black--light t-normal block">
<span class="visually-hidden">Location</span>
<span>London, England, United Kingdom</span>
</h4>
<!---->
</div>
</a>
<!----> </div>
<!----> </div>
</section>
And this is followed by more "li" sections. So the overall history section can be identified by id="experience-section"; work (as opposed to education) history can be identified by the "ul" element with class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more"; and the information for the first job in the list can be identified by the "li" element with id="ember136".
I am trying to get job title, company, years in job etc. from this section but can't figure out how to do it. Here is a bit of python code to show what I have tried (skipping my log-in):
from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
# driver.get() will navigate to the page given by the URL
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
text=driver.page_source
sel = Selector(text)
# Using the "Copy xPath" option in Inspect in Google Chrome, I can manually extract the company name
sel.xpath('//*[@id="ember187"]/div[2]/p[2]/text()').extract_first()
# This will give me all of the text in the Work Experience section
html_list = driver.find_element_by_id("experience-section")
# I have tried pulling child elements by tag name, e.g. "ul" or "h3"
items = html_list.find_elements_by_tag_name("h3")
for item in items:
    print(type(item))
    text = item.text
    print(text)
But these approaches are not great for an automated and systematic extraction of info from each job across profiles. What I would like to do is something like looping across "li" sections within each "ul" section, and within the "li" part, extract only the company name with class = "pv-entity__secondary-title t-14 t-black t-normal". But find_element_by_class_name only yields NoneTypes.
I'm not sure conceptually how to generate an iterable list of "ul" and "li" with selenium, and within each iteration extract specific bits of text using class names.
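A note on why find_element_by_class_name came up empty: it expects a single class token, so passing the full multi-class string "pv-entity__secondary-title t-14 t-black t-normal" will not match, while selecting by one of those classes via a CSS selector does. A minimal sketch of the nested loop idea, assuming the pv-entity class names are stable across profiles:
experience = driver.find_element_by_id("experience-section")
for li in experience.find_elements_by_tag_name("li"):
    # one class token from the multi-class attribute is enough
    company = li.find_elements_by_css_selector("p.pv-entity__secondary-title")
    if company:  # some li entries may not contain a company <p>
        print(company[0].text)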

Here is a solution I came up with. I should point out I "cross posted" in a YouTube comment for the following tutorial: https://www.youtube.com/watch?v=W4Md-koupmE
Run the whole code, but replace the email and password with your own.
First, open the browser, sign in to LinkedIn, and navigate to the relevant profile:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from time import sleep
# Path to the chromedriver.exe
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
driver.get('https://www.linkedin.com')
# Log into LinkedIn
username = driver.find_element_by_id('session_key')
username.send_keys('mail@mail.com')
sleep(0.5)
password = driver.find_element_by_id('session_password')
password.send_keys('password')
sleep(0.5)
log_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
log_in_button.click()
sleep(3)
# The example profile I am trying to scrape
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
sleep(3)
If I just start trying to extract stuff, I will get an error. It turns out that I need to scroll down to the relevant section for it to load, otherwise no data is created:
# The experience section doesn't load until you scroll to it, this will scroll to the section
l = driver.find_element_by_xpath('//*[@id="oc-background-section"]')
driver.execute_script("arguments[0].scrollIntoView(true);", l)
To loop through the work experience, first I identify the "id" value for it, in this case "experience-section", and grab it with the "find_element_by_id" method.
# Get stuff in work experience section
html_list = driver.find_element_by_id("experience-section")
This section contains a list of "li" elements (i.e. tag value "li"), each of which contains all the work info for each past job. Create a list of these WebElement types using "find_elements_by_tag_name".
# Jobs listed as li sections, create list of li
items = html_list.find_elements_by_tag_name("li")
Looking at the source code, I notice for instance that employer names can be identified by tag "p". This generates a list, and sometimes it contains multiple items. Make sure you select what you need:
x = items[0].find_elements_by_tag_name("p")
print(x[0].text)
# "Company Name"
print(x[1].text)
# "Wagestream Full-time"
Finally, loop through the "li" sections, extract the relevant strings, and print the desired info (or save it as a row in a CSV):
# Loop through li list, extract each piece by tag name
for item in items:
    name_job = item.find_elements_by_tag_name("h3")
    name_emp = item.find_elements_by_tag_name("p")
    more = item.find_elements_by_tag_name("h4")
    job = name_job[0].text
    emp = name_emp[1].text
    # The h4 text holds a hidden label and the value separated by a newline;
    # keep only the value
    yrs = more[0].text.split('\n')[1]
    loc = more[2].text.split('\n')[1]
    print(job)
    print(emp)
    print(yrs)
    print(loc)
# terminates the application
driver.quit()
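To save a row per job to a CSV instead of printing, here is a minimal sketch using the standard csv module (the jobs.csv filename and column names are placeholders of mine); it must run before driver.quit(), since the WebElements go stale once the browser closes:
import csv

rows = []
for item in items:
    job = item.find_elements_by_tag_name("h3")[0].text
    emp = item.find_elements_by_tag_name("p")[1].text
    more = item.find_elements_by_tag_name("h4")
    rows.append([job, emp, more[0].text.split('\n')[1], more[2].text.split('\n')[1]])

with open('jobs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['job_title', 'employer', 'dates', 'location'])
    writer.writerows(rows)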

Related

How do I scrape this text from an HTML <span id> using Python, Selenium, and BeautifulSoup?

I'm working on creating a web scraping tool that generates a .csv report by using Python, Selenium, beautifulSoup, and pandas.
Unfortunately, I'm running into an issue with grabbing the "data-date" text from the HTML below. I am looking to pull the "2/4/2020" into the .csv my code is generating.
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
My python script starts off with the following:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
driver = webdriver.Chrome('C:\chromedriver.exe')
lastdatadate=[]
lastprocesseddate=[]
Then I have it log in to a website, enter my un/pw credentials, and click the continue/login button.
From there, I am using the following to parse the html, scrape the website, and pull the relevant data/text into a .csv:
content = driver.page_source
soup = bs(content, 'html.parser')
for a in soup.findAll('div', attrs={'class':'large-header-welcome'}):
    datadate=a.find(?????)
    processeddate=a.find('span', attrs={'id':'LargeHeader_dateText'})
    lastdatadate.append(datadate.text)
    lastprocesseddate.append(processeddate.text)
df = pd.DataFrame({'Last Data Date':lastdatadate,'Last Processed Date':lastprocesseddate})
df.to_csv('hqm.csv', index=False, encoding='utf-8')
So far, I've got it working for the "last processed date" component of the HTML, but I am having trouble getting it to pull the "last data date" from the HTML. It's there, I just don't know how to have python find it. I've tried using the find method but I have not been successful.
I've tried googling around and checking here for what I should try, but I've come up empty-handed so far. I think my real trouble is knowing what to search for.
Any insight would be much appreciated as I am trying to learn and get better. Thanks!
Edit: here is a closer look at the HTML:
<div class="large-header-welcome">
<div class="row">
<div class="col-sm-6">
<h3 class="welcome-header">Welcome, <span id="LargeHeader_fullname">Rhett</span></h3>
<p class="">
<b>Site:</b> <span id="LargeHeader_Name">redacted</span>
<br />
<span class="import-popover"><span id="LargeHeader_glyphStatus" class="glyphicon glyphicon-ok-sign white"></span><b><span id="LargeHeader_statusText">Processing Complete</span></b><span id="LargeHeader_dateText" data-date="2/4/2020" data-delay="1" data-step="3" data-error="False">, Last Processed 2/5/2020</span></span>
</p>
</div>
To find one element, use find():
processeddate=soup.find('span', attrs={'id':'LargeHeader_dateText'}).text
To find multiple elements, use find_all():
for item in soup.find_all('span', attrs={'id':'LargeHeader_dateText'}):
    processeddate=item.text
Or you can use the CSS selector method select():
for item in soup.select('#LargeHeader_dateText'):
    processeddate=item.text
EDIT
To get the value of the data-date attribute, use the following code:
lastdatadate=[]
for item in soup.find_all('span', attrs={"id": "LargeHeader_dateText", "data-date": True}):
    processeddate=item['data-date']
    lastdatadate.append(processeddate)
Or with a CSS selector:
lastdatadate=[]
for item in soup.select('#LargeHeader_dateText[data-date]'):
    processeddate=item['data-date']
    print(processeddate)
    lastdatadate.append(processeddate)
Both will give the same output; however, the latter executes a bit faster.
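Folding this back into the question's pandas flow, a minimal sketch (assuming a single such span per page, as in the HTML shown; the strip(', ') just trims the leading comma from the visible text):
import pandas as pd
from bs4 import BeautifulSoup as bs

soup = bs(driver.page_source, 'html.parser')
span = soup.find('span', attrs={'id': 'LargeHeader_dateText'})
df = pd.DataFrame({'Last Data Date': [span['data-date']],
                   'Last Processed Date': [span.text.strip(', ')]})
df.to_csv('hqm.csv', index=False, encoding='utf-8')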

Having trouble saving links to a list variable with selenium

I am practicing web scraping with Selenium by opening users' dating profiles on a dating site. I need Selenium to save an href link for every profile on the page, but unfortunately it only saves the first profile in the list rather than creating a list variable with all the links saved. All of the profiles start with the same two div class/style attributes, "member-thumbnail" and "position: absolute". Thank you for any help you can offer.
Here is the website code:
<div class="member-thumbnail">
<div style="position: absolute;">
<a href="/Member/Details/LvL-Up">
<img src="//storage.com/imgcdn/m/t/502b24cb-3f75-49a1-a61a-ae80e18d86a0" class="presenceLine online">
</a>
</div>
</div>
Here is my code:
link_list = []
link_list = browser.find_element_by_css_selector('.member-thumbnail a').get_attribute('href')
length_link_list = len(link_list)
for i in range(0, length_link_list):
    browser.get(link_list[i])
Use find_elements_by_css_selector instead of find_element_by_css_selector.
If you're going to loop through the whole list returned by find_elements_by_css_selector, consider this instead; it is a bit more Pythonic:
link_list = browser.find_elements_by_css_selector('.member-thumbnail a')
for element in link_list:
    browser.get(element.get_attribute('href'))
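One caveat to add: calling browser.get invalidates the WebElements found on the previous page, so the loop above can raise a StaleElementReferenceException on its second pass. A safer sketch extracts the href strings up front:
# Collect plain href strings first so navigation cannot invalidate them
link_list = [el.get_attribute('href')
             for el in browser.find_elements_by_css_selector('.member-thumbnail a')]
for href in link_list:
    browser.get(href)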

Using Python and BeautifulSoup to scrape list with variable orders and tags based on text strings

Details: MacOS, Python3, BeautifulSoup4
I am new to Python and even newer to BeautifulSoup, so please excuse any beginner mistakes here. I am attempting to scrape HTML pages which do not heavily differentiate their tags by classes or div ids. In other words, I am trying to scrape the middle section of a list. The list will have an unpredictable number of tags and elements (sometimes they use an unordered list, other times a description list), so what I am scraping is fairly unpredictable. However, I do have two known variables: the header string text I want to START at and the header string text I want to END at.
I have assembled the following example html to test this on:
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">First Section Title - Known Variable or String</h3>
</div>
</div>
<div>
<ul class="unstyled">
<li>Item1</li>
<li>Item2</li>
<li>Empty LI Tags Also Exist</li>
</ul>
<dl class="dl-horizontal">
<dt>Title of some description list</dt>
<dd>Another item may exist here</dd>
</dl>
</div>
<div>
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">Another Section Title</h3>
</div>
</div>
<ul class="unstyled">
<li>Item1</li>
<li></li>
</ul>
<dl class="dl-horizontal">
<dt>Another Description List Title</dt>
<dd>Another item may exist here</dd>
<dt>And here</dt>
<dd>And Here</dd>
</dl>
</div>
<div>
<div class="panel panel-default">
<div class="panel-heading">
<h3 class="panel-title">Section Title (String) I Wish To Stop At - Known Variable or String</h3>
</div>
</div>
</div>
Again, using the above model, I want to start at the first section I listed and end at the known text string of a particular section towards the bottom.
I have listed my Python script below. So far, the following Python is grabbing the correct information, however, I do not believe it will work under all circumstances, and there is probably a more efficient way to go about this. Here are some of the issues I believe are in my script:
My script is rather static - while it appears to start at the correct header, I have pieced out two sections separately as I do not believe my For loop is working the way it should be (I do not think ##Section 2 should be needed if written correctly).
Because my For loop is likely not doing what I probably think it is (I'd like it to iterate through the sections) I never had to define the stopping point (the string of text at the section I wish to stop at).
Since I am not convinced the loop is working correctly, I do not believe this will handle any curveballs I am thrown by the site - for example variable numbers of items on the list and if they add an additional section I would want between the "Beginning section" and "Ending section" defined.
I believe what needs to happen is:
Libraries need to be imported
Locate first section
Find next sibling
Keep finding siblings and returning text until the stop string matches
Python:
##Scrape
#import beautifulsoup and requests library
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(open("mock.html"), "html.parser")#BeautifulSoup(page.read())
#Begin by grabbing the section
stuff = soup.find_all(class_="panel-heading")
#Search for the first section title text string
next_elem = soup.find(text="First Section Title - Known Variable or String").findNext('li').contents[0]
#Attempt to scan the remainder of the section, starting with the next line item
next_next = next_elem.parent.find_next_sibling()
for item in next_next.findAll('li','dt','dd'):
    if isinstance(item, Tag):
        print(item.text)
print(next_elem)
print(next_next.text)
##Section 2 - I'd like to cut this out
s2_elem = soup.find(text="Another Section Title").findNext('li').contents[0]
s2_nxnx = s2_elem.parent.find_next_sibling()
s2_nxnxnx = s2_nxnx.parent.find_next_sibling()
print(s2_elem)
print(s2_nxnx.text)
print(s2_nxnxnx.text)
You could use a variable to spot when you are between search_start and search_end:
from bs4 import BeautifulSoup, Tag
import requests
search_start = "First Section Title - Known Variable or String"
search_end = "Section Title (String) I Wish To Stop At - Known Variable or String"
soup = BeautifulSoup(open("mock.html"), "html.parser")
start = False
for el in soup.find_all(['li', 'dt', 'dd', 'h3']):
    if el.name == 'h3':
        if el.text == search_start:
            start = True
        elif el.text == search_end:
            break
    elif start and isinstance(el, Tag):
        print(el.text)
This would give you the following output:
Item1
Item2
Empty LI Tags Also Exist
Title of some description list
Another item may exist here
Item1
Another Description List Title
Another item may exist here
And here
And Here
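If you need this across many pages, the same flag pattern can be wrapped in a reusable function (a generalization of mine, not part of the original answer):
def section_items(soup, search_start, search_end):
    """Collect the text of li/dt/dd elements between two h3 headings."""
    items, started = [], False
    for el in soup.find_all(['li', 'dt', 'dd', 'h3']):
        if el.name == 'h3':
            if el.text == search_start:
                started = True
            elif el.text == search_end:
                break
        elif started:
            items.append(el.text)
    return items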

Python/Selenium web scraping

for link in data_links:
    driver.get(link)
    review_dict = {}
    # get the size of company
    size = driver.find_element_by_xpath('//*[@id="EmpBasicInfo"]//span')
    #location = ??? need to get this part as well.
#location = ??? need to get this part as well.
My concern:
I am trying to scrape a website. I am using Selenium/Python to scrape the "501 to 1000 employees" and "Biotech & Pharmaceuticals" text from the span, but I am not able to extract the text element from the website using XPath. I have tried getText, get_attribute, everything. Please help!
This is the output for each iteration: I am not getting the text value.
Thank you in advance!
It seems you want only the text rather than to interact with an element. One solution is to let BeautifulSoup parse the HTML for you, with Selenium supplying the code built by JavaScript: first get the page content with html = driver.page_source, and then you can do something like:
html ='''
<div id="CompanyContainer">
<div id="EmpBasicInfo">
<div class="">
<div class="infoEntity"></div>
<div class="infoEntity">
<label>Industry</label>
<span class="value">Woodcliff</span>
</div>
<div class="infoEntity">
<label>Size</label>
<span class="value">501 to 1000 employees</span>
</div>
</div>
</div>
</div>
''' # Just a sample, since I don't have the actual page to interact with.
soup = BeautifulSoup(html, 'html.parser')
>>> soup.find("div", {"id":"EmpBasicInfo"}).findAll("div", {"class":"infoEntity"})[2].find("span").text
'501 to 1000 employees'
Or, of course, avoiding specific indexing and looking for the <label>Size</label>, it should be more readable:
>>> [a.span.text for a in soup.findAll("div", {"class":"infoEntity"}) if (a.label and a.label.text == 'Size')]
['501 to 1000 employees']
Using selenium you can do:
>>> driver.find_element_by_xpath("//*[@id='EmpBasicInfo']/div[1]/div/div[3]/span").text
'501 to 1000 employees'
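A possibly more robust Selenium variant of that XPath, anchoring on the label text rather than the div position (this assumes the label reads exactly "Size", as in the sample HTML above):
# Select the span that follows the <label>Size</label> sibling
size = driver.find_element_by_xpath(
    "//div[@id='EmpBasicInfo']//label[text()='Size']/following-sibling::span").text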

Fetching name and email from a web page [duplicate]

This question already has an answer here:
How to get data off from a web page in selenium webdriver [closed] (1 answer; closed 7 years ago)
I'm trying to fetch data from a Link. I want to fetch the name/email/location/etc. content from the web page and save it, but whenever I run this code it just stores a blank list. Please help me copy this data from the web page.
I want to fetch the company name, email, and phone number from this Link and put them in an Excel file, and do the same for all pages of the website. I have the logic to fetch the links in the browser and switch between them, but I'm unable to fetch the data from the website. Can anybody suggest an enhancement to the code I have written?
Below is the code i have written:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
from lxml import html
import requests
import xlwt
browser = webdriver.Firefox() # Get local session of firefox
# wait until the pages are loaded
browser.implicitly_wait(3) # 3 secs should be enough. if not, increase it
browser.get("http://ae.bizdirlib.com/taxonomy/term/1493") # Load page
links = browser.find_elements_by_css_selector("h2 > a")
#print link
for link in links:
    link.send_keys(Keys.CONTROL + Keys.RETURN)
    link.send_keys(Keys.CONTROL + Keys.PAGE_UP)
    #tree = html.fromstring(link.text)
    time.sleep(5)
    companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
    companyName = companyNameElement
    print companyNameElement
The HTML code is given below:
<div class="content">
<div id="node-946273" class="node node-country node-promoted node-full clearfix">
<div class="content clearfix">
<div itemtype="http://schema.org/Corporation" itemscope="">
<fieldset>
<legend>Company Information</legend>
<div style="width:100%;">
<div style="float:right; width:340px; vertical-align:top;">
<br/>
<ul>
<li>
<strong>Company Name</strong>
:
<span itemprop="name">Sabbro - F.Z.C</span>
</li>
</ul>
When I use the following, it gives me an error that 'list' object has no attribute 'text'. Can somebody help me enhance the code and make it work? I'm kind of stuck forever on this issue:
companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
companyName = companyNameElement
print companyNameElement
find_elements_by_... returns a list; you can either access the first element of that list or use the equivalent find_element_by_... method, which returns just the first matching element.
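A minimal sketch of both options against the asker's selector (.text belongs on a single element, never on the list):
# Option 1: index into the list returned by find_elements_by_...
companyName = browser.find_elements_by_css_selector(
    ".content.clearfix>div>fieldset>div>ul>li")[0].text

# Option 2: find_element_by_... returns the first match directly
companyName = browser.find_element_by_css_selector(
    ".content.clearfix>div>fieldset>div>ul>li").text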
