Extract company names from the Experience section of a LinkedIn profile with Python

I need to extract the Experience section and scrape the companies where the person has worked (previous or current) with Python. I have tried, but it is not working.
My code:
div = driver.find_element('div', class_='t-14 t-normal')
span = div.find_element('span', {'aria-hidden': 'true'})
text = span.get_text()
print(text)
The inspected element looks like this (just for a single company; the markup looks the same for the rest of the companies I checked):
<span class="t-14 t-normal">
<span aria-hidden="true">
<!---->
"Financial Times . Full-time"
<!---->
</span>
link - https://www.linkedin.com/in/fred-thompson-8a892a19b/
So I just want to extract all the companies that appear in these elements. I'm unable to work out a loop that goes through all the companies and gets the names.

You are mixing Selenium syntax with BeautifulSoup syntax.
If you want to do this with Selenium, your syntax should be as follows:
from selenium.webdriver.common.by import By

elements = driver.find_elements(By.CSS_SELECTOR, ".t-14.t-normal span[aria-hidden='true']")
for element in elements:
    print(element.text)
However, this will not work perfectly, since the locators you are using here will match irrelevant elements too.
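One way to narrow it down is to anchor the search on the Experience section and then strip the "· Full-time" style suffix (rendered as " . " in the snippet above). A minimal sketch, assuming LinkedIn's current class names and an `experience` anchor id; LinkedIn requires login and changes its markup frequently, so every selector here is an assumption that may need updating:

```python
import re

def company_name(raw):
    """Strip the ' · Full-time' style suffix (shown as ' . ' in some dumps)."""
    return re.split(r"\s[·.]\s", raw)[0].strip()

def scrape_companies(profile_url):
    # Imported lazily so company_name() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(profile_url)
        # Assumed structure: a <section> containing the #experience anchor div.
        spans = driver.find_elements(
            By.XPATH,
            "//section[.//div[@id='experience']]"
            "//span[@class='t-14 t-normal']/span[@aria-hidden='true']",
        )
        return [company_name(s.text) for s in spans]
    finally:
        driver.quit()
```

The text-splitting helper is independent of the selectors, so it keeps working even if the locators have to change.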

Related

Can't find element using by.XPATH

I'm trying to find one element using Selenium in Python using the following code:
element_A = driver.find_element(By.XPATH, '/html/body/div[5]/div[1]/div[2]/div[3]/div/div/div/div[1]/div[1]/div[2]/div[1]/div[3]/div[1]/i')
in this HTML code:
<div class="display-flex">
<span class=" cc-customer-info__value"="">212121 <i title="copiar" class="fa fa-clone _clipboard_" data-clipboard="212121 "></i> <br>
</div>
I want the number 212121, but I'm getting a "no such element" error. The problem is that this number can be different every time I open the website; it's the customer's number.
Is it possible to help me to locate this element?
I'm also trying to find two more elements:
Customer profile, which also changes:
<text class="legend-graph font-menu" x="416" y="38">Customer Profile: A</text>
and Diamond, which can also vary:
<span class="cc-customer-info__value ">Diamond</span>
Thank you!
For the first one, the number is the text of the span element, not of the i tag.
driver.find_element(By.XPATH, '/html/body/div[5]/div[1]/div[2]/div[3]/div/div/div/div[1]/div[1]/div[2]/div[1]/div[3]/div[1]/span')
Try this; it may work. Also consider shortening the XPath you are creating, since this is an absolute XPath.
Try using Selenium IDE to record the scenario and check what locator is captured for the element.
You could also try DevTools and other XPath-locating tools that are available as Chrome extensions.
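A relative XPath keyed on the class name survives layout changes that break an absolute path. The expression can be tried offline by feeding the pasted snippet (lightly cleaned here so it parses predictably) to lxml; the same expression drops into Selenium's `driver.find_element(By.XPATH, ...)`:

```python
from lxml import html

# Snippet adapted from the question; note the stray trailing space in the
# second span's class attribute, which is why contains() is used below
# instead of an exact @class match.
snippet = """
<div class="display-flex">
<span class="cc-customer-info__value">212121 <i title="copiar" class="fa fa-clone _clipboard_" data-clipboard="212121 "></i> <br></span>
</div>
<span class="cc-customer-info__value ">Diamond</span>
"""
tree = html.fromstring(snippet)
# Relative XPath: direct text nodes of every matching span, wherever it sits.
values = [t.strip() for t in
          tree.xpath("//span[contains(@class,'cc-customer-info__value')]/text()")
          if t.strip()]
print(values)
```

If the live page still raises "no such element" with a relative locator, the value is probably rendered after page load, in which case an explicit wait is needed before locating it.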

Best XPath practices for extracting data from a field that varies in format

I was using Python 3.8, XPath and Scrapy, where things just seemed to work; I took my XPath expressions for granted.
Now I'm using Python 3.8, XPath and lxml.html, and things are much less forgiving. For example, using this URL and this XPath:
//dt[text()='Services/Products']/following-sibling::dd[1]
I would get back a paragraph or a list, depending on what the inner HTML was. This is how I am attempting to extract the text now:
data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
which returns a list of matching elements in Services_Product. On this page the field contains "li" elements, but other times it can be any of these:
<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
or
<dd>
<ul>
<li><p>some text</p></li>
<li><p>some text</p></li>
</ul>
</dd>
What is the best practice for extracting text from situations like this where the target field can be a number of different things?
I used this test code to see what my options are:
with open('html_01.txt', 'r') as file:
    data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)
That returned this:
Health
Health
doctors
Health
doctors
Which is not correct. Here's a screenshot of it in Google Chrome:
[Screenshot: the XPath tool in Google Chrome alongside the HTML in question]
What's the best way to scrape this data using Python and XPath, or other options?
Thank you.
After spending hours googling and then writing the post above, it just came to me.
old code:
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
and new code that returns a nice list of text:
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li/text()")
Adding "/text()" at the end fixed it.
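One caveat to the fix above: a bare "//li" in a node-level xpath() call still scans the whole document, not just the selected dd; a leading dot keeps it relative, and "//text()" reaches text nested inside the optional "p" wrappers. A sketch with a stand-in document (assumed, since the real page isn't reproduced here):

```python
from lxml import html

doc = html.fromstring("""
<dl>
  <dt>Services/Products</dt>
  <dd><ul><li><p>Health</p></li><li><p>doctors</p></li></ul></dd>
  <dt>Hours</dt>
  <dd><ul><li>unrelated</li></ul></dd>
</dl>
""")
dd = doc.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# ".//li//text()" stays inside this dd and handles both <li>text</li>
# and <li><p>text</p></li> variants:
texts = [t.strip() for t in dd.xpath(".//li//text()") if t.strip()]
print(texts)

# For the variants with no <li> at all, text_content() flattens everything:
plain = html.fromstring("<dl><dd><p>some text</p></dd></dl>").xpath("//dd")[0]
print(plain.text_content())
```

`text_content()` is the most format-agnostic option when the target field can be any of the shapes listed in the question.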

Find span value with xpath

I'm following this tutorial for scraping information from a website after a login.
Now, part of the code makes use of an xpath variable to scrape specific content. I'm not familiar with XPath, and after a lot of searching I can't find the right solution. I hope one of you can help me out!
I need the value within the "price" <span>:
<div class="price-box">
<span class="regular-price" id="product-price-64">
<span class="price">€ 4,90</span>
</span>
</div>
My piece of code right now is:
# Scrape url
result = session_requests.get(URL, headers = dict(referer = URL))
tree = html.fromstring(result.content)
bucket_names = tree.xpath("//span[@class='price']/text()")
What should be the xpath code to get the information from within the <span>?
Edit: It seems indeed -as per the comments- that the initial page source came not good through.
Your xpath looks almost fine; maybe you forgot a dot?
bucket_names = tree.xpath(".//span[@class='price']/text()")
Your xpath seems correct; note that tree.xpath() returns a list, so take the first element:
tree.xpath("//span[@class='price']")[0].text

xpath in lxml to find id number based on href

I'm trying to rewrite someone's library to parse some XML returned with requests. However, they use lxml in a way I'm not used to. I believe it uses regular expressions to find the data, and while most of the library works, it fails when the site being parsed has the file id in a list structure. Essentially I get a page back and I'm looking for an id that matches the href athlete number. So say I want to get only the ids for athlete 567377.
</div>
</a></div>
<ul class='list-entries'>
<li class='entity-details feed-entry' id='Activity-123120999590'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/567377' >
</a>
</div>
</li>
<li class='entity-details feed-entry' id='Activity-16784940202'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/5252525'>
</a>
</div>
The code:
lst_group_activity = parser.xpath(".//li[substring(@id, 1, 8)='Activity']")
provides all list items perfectly, but for all activities. I want only the one related to the right athlete. The library uses the following, with an @href, to select the right athlete:
lst_athlethe_act_in_group_activity = parser.xpath(".//li[substring(@id, 1, 8)='Activity']/*[@href='/athletes/"+athlethe_id+"']/..")
However, this never seems to work: it finds the activities but then throws them all away.
Is there a better way to get this working? Any tutorial that can point me in the right direction on correlating one element to the next?
The element with the href attribute isn't an immediate child of your li element, so your xpath is failing. You're matching:
.//li/*[@href="..."]
You want:
.//li/div/a[@href="..."]
(You could match * instead of a if you think another element might contain the href attribute, and you can match against .//li//a[@href="..."] if you think the path to the a element might not always be li/div/a).
So to find the li element:
parser.xpath(".//li[substring(@id, 1, 8)='Activity']/div/a[@href='/athletes/%s']/../.." % '5252525')
But you can also write that without the ../..:
parser.xpath(".//li[substring(@id, 1, 8)='Activity' and div/a/@href='/athletes/%s']" % '5252525')
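The second form can be checked offline by parsing the fragment from the question with lxml (trimmed slightly here so it parses cleanly; 567377 is the athlete id from the question):

```python
from lxml import html

snippet = """
<ul class='list-entries'>
<li class='entity-details feed-entry' id='Activity-123120999590'>
  <div class='avatar avatar-athlete avatar-default'>
    <a class='avatar-content' href='/athletes/567377'></a>
  </div>
</li>
<li class='entity-details feed-entry' id='Activity-16784940202'>
  <div class='avatar avatar-athlete avatar-default'>
    <a class='avatar-content' href='/athletes/5252525'></a>
  </div>
</li>
</ul>
"""
parser = html.fromstring(snippet)
# Keep only Activity list items whose nested link points at this athlete:
matches = parser.xpath(
    ".//li[substring(@id, 1, 8)='Activity' and div/a/@href='/athletes/%s']"
    % "567377"
)
ids = [li.get("id") for li in matches]
print(ids)
```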

Selenium-Python: Class containing link-text

I am using Python and Selenium to scrape the content of a certain webpage, and currently have the following problem: there are multiple div elements with the same class name, but each has different content. I only need the information from one particular div. In the following example, I would need the information in the first "show_result" div, since the "Important-Element" is within its link text:
<div class="show_result">
<a href="?submitaction=showMoreid=77" title="Go-here">
<span class="new">Important-Element</span></a>
Other text, links, etc within the class...
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=78" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text, links, etc within the class...
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=79" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text, links, etc within the class...
</div>
With the following code I can get the "Important-Element" and its link:
driver.find_element_by_partial_link_text('Important-Element')
However, I also need the other information within the same "show_result" div. How can I refer to the entire div that contains the Important-Element in its link text? driver.find_elements_by_class_name('show_result') does not work, since I do not know in which of the divs the Important-Element is located.
Thanks,
Finn
Edit / Update: Oops, I found the solution on my own using xpath:
driver.find_element_by_xpath("//div[contains(@class, 'show_result') and contains(., 'Important-Element')]")
I know you've found an answer, but I believe it's wrong, since you would also select the other nodes: 'Important-Element' is still a substring of 'Not-Important-Element'.
Maybe it works for your specific case, since that's not really the text you're after. But here are a few more options:
//div[@class='show_result' and starts-with(.,'Important-Element')]
//div[.//span[text()='Important-Element']]
//div[contains(.//span/text(),'Important-Element') and not(contains(.//span/text(),'Not'))]
There are more ways to write this...
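The difference between the two approaches can be demonstrated offline with lxml on the markup from the question; the same expressions work with Selenium's find_element_by_xpath:

```python
from lxml import html

doc = html.fromstring("""
<div class="show_result">
<a href="?submitaction=showMoreid=77" title="Go-here">
<span class="new">Important-Element</span></a>
Other text, links, etc within the class...
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=78" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text, links, etc within the class...
</div>
""")

# contains(., ...) also matches the Not-Important-Element block, because
# 'Not-Important-Element' contains 'Important-Element' as a substring:
loose = doc.xpath(
    "//div[contains(@class,'show_result') and contains(., 'Important-Element')]")

# An exact match on the span text avoids the false positive:
strict = doc.xpath(
    "//div[@class='show_result' and .//span[text()='Important-Element']]")

print(len(loose), len(strict))
```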
