Code to extract a value from html code in Python - python

I need urgent help with the issue described below. I am automating a
project using the Selenium Python bindings.
Scenario: create a new member with a profile picture and add this
member to a group. Then check whether the profile picture given to the
member at profile creation is the same one that appears in
the friends list.
For this I would like to compare the image IDs at profile
creation and in the friends list.
I found the image ID using Firebug. It is given inside a
<div><a class=........Imageid=234563453.....................>
But how can I extract this image ID from the <a> element?
print self.driver.find_element_by_xpath("")._getattribute_(ImageId)
Can anybody provide the code to extract this Imageid from <a class>?

You can use a regex.
import re
s = "<a class=........Imageid=234563453.....................>"
# capture the digits that follow "Imageid="
m = re.search(r"Imageid=(\d+)", s)
print(m.group(1))
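The regex approach above can be wrapped in a small helper that also copes with a missing attribute. This is only a sketch: the function name extract_image_id and the sample markup are hypothetical, since the real markup is elided in the question. If the element has already been located with Selenium, reading the attribute with the element's get_attribute method is likely simpler than parsing the HTML text.

```python
import re

def extract_image_id(html):
    """Return the digits following 'Imageid=' in html, or None if absent."""
    # Allow an optional opening quote around the attribute value.
    match = re.search(r'Imageid="?(\d+)', html)
    return match.group(1) if match else None

# Hypothetical markup for illustration only.
sample = '<div><a class="thumb" Imageid=234563453 href="#">photo</a></div>'
print(extract_image_id(sample))
```

The same helper can be called on the markup captured at profile creation and in the friends list, and the two results compared.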

Related

How to find all elements of form Using Selenium

I am trying to get all the elements of a form using Selenium, but I can't get it to work. I need to do it dynamically, without hard-coding the id or class of the elements in the form: Selenium should detect the form and automatically get the names of all the elements in it.
The problem I'm facing is that the form has an action attribute but no class.
Here is the website:
driver = webdriver.Chrome()
driver.get("http://stevens.ekkel.ai")
#find all form input fields via form name
content = driver.find_element_by_class_name('form')
print(content)
Do you want to print the Name, People, date, and Message values of the form?
content = driver.find_element_by_tag_name('form')
for field in content.find_elements_by_xpath('./p/input'):
    print(field.get_attribute('name'))
The following looks for any HTML element inside the form that has a name attribute:
content.find_elements_by_xpath('.//*[@name]')
You can use find_element_by_tag_name:
content = driver.find_element_by_tag_name('form')
This might have problems when there are multiple forms to choose from. Then you have to use find_elements_by_tag_name and potentially choose manually later.
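The idea of collecting every named element inside a form can also be sketched without a browser, using only the standard library on an HTML string (for example the text returned by driver.page_source). The class FormFieldCollector, the function form_field_names, and the sample markup are hypothetical names for illustration; this is not the Selenium approach from the answers, just a stand-in that shows the same logic.

```python
from html.parser import HTMLParser

class FormFieldCollector(HTMLParser):
    """Collect the name attribute of every element nested inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0   # > 0 while the parser is inside a form
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif self.form_depth:
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth:
            self.form_depth -= 1

def form_field_names(html):
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.names

# Hypothetical form markup; the real page may be structured differently.
sample = """
<input name="outside">
<form action="/submit">
  <p><input type="text" name="Name"></p>
  <p><input type="number" name="People"></p>
  <p><input type="text" name="date"></p>
  <p><textarea name="Message"></textarea></p>
</form>
"""
print(form_field_names(sample))
```

Elements outside the form, such as the first input in the sample, are skipped because the depth counter is still zero when they are seen.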

Unable to extract data using xpath within a script tag

I'm trying to extract the subscriber count of a channel using Scrapy. I have found the script tag that contains the subscriber count, but when I test the XPath I get blank data. Please help.
split = '\"subscriberCountText\":{\"simpleText\"'
response.xpath("//script[contains(.,'" + split + "')]").extract()
You can find the text "subscriberCountText":{"simpleText" in the page source of a channel's About page, but how do you extract it?
You need to add the channel ID to this Google API URL,
and you will get all the information about the channel in JSON format.
For example, your channel's ID is "UCqwUrj10mAEsqezcItqvwEw"; add it as the "id" parameter of the API. The final URL will be "https://www.googleapis.com/youtube/v3/channels?id=UCqwUrj10mAEsqezcItqvwEw&part=snippet%2CcontentDetails%2Cstatistics&key=AIzaSyAWpx46-G9ZByLe8Nk_wqtUekCXvTPM2oI"
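If the script text itself is available, a regular expression is one way to pull the value that follows the marker from the question. The sketch below assumes the text has already been obtained as a string; the variable script_text, its contents, and the function subscriber_count_text are hypothetical, and the real page embeds a much larger JSON blob whose layout may differ.

```python
import re

# Hypothetical script text for illustration only.
script_text = (
    'var ytInitialData = {"header":{"subscriberCountText":'
    '{"simpleText":"1.2K subscribers"}}};'
)

def subscriber_count_text(script):
    """Return the simpleText value under subscriberCountText, or None."""
    match = re.search(
        r'"subscriberCountText":\{"simpleText":"([^"]*)"', script
    )
    return match.group(1) if match else None

print(subscriber_count_text(script_text))
```

Returning None when the marker is missing makes it easy to tell an empty response apart from a pattern that no longer matches the page.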

Extracting HTML tag content with xpath from a specific website

I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[@id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[@id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, the print commands return empty strings, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
select the text of the child span with /span[@class='company']/text() to get the company name
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag, but we can select any descendant text using //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
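The difference between reading an element's direct text and gathering all descendant text can be tried on a small inline document. The sketch below swaps lxml for the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so it is a stand-in rather than the answer's method. The sample markup is hypothetical and only mimics the structure described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the job header markup.
sample = """
<html><body>
<div data-tn-component="jobHeader">
  <b class="jobtitle"><font size="+1">Janitor-A (Scattered Sites)</font></b>
  <span class="company">The Habitat Company</span>
</div>
</body></html>
""".strip()

root = ET.fromstring(sample)
header = root.find(".//div[@data-tn-component='jobHeader']")

# The company name is the direct text of the span.
company = header.find("span[@class='company']").text

# The job title sits inside a nested font tag, so join all descendant text.
position = "".join(header.find("b[@class='jobtitle']").itertext())

print(company)
print(position)
```

Here itertext() plays the role that //text() or text_content() plays in lxml: it walks into the nested font tag instead of stopping at the b element.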
Despite your assumption, it seems that the content on the page is loaded dynamically and is thus not present in the HTML when the page first loads.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, try to look for job-content in the actual file on your computer, which will only contain placeholders and descriptors).
It seems you would have to use a tool like Selenium to perform this task.
Again, I want to stress that whatever you are doing (automatically) is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.

Selenium Webscraping Twitter - Getting hold of tweet timestamp?

When inspecting a Twitter results page, inside the following element:
<small class="time">
....
</small>
there is a timestamp for each tweet in the 'data-time' attribute:
<span class="_timestamp js-short-timestamp js-relative-timestamp" data-time="1510698047" data-time-ms="1510698047000" data-long-form="true" aria-hidden="true">12m</span>
In Selenium I am using the following code:
tweet_date = browser.find_elements_by_class_name('_timestamp')
But looking at the text of a single entry only returns, in this case, 12m.
How can I access the other attributes of the element in Selenium?
I usually use find_elements_by_xpath; this lets you grab a specific element from a page without worrying about names. Or at least that's how it seems to work.
EDIT
Alright, so I think I've got it figured out. First, find the element by XPath and assign it.
ts=browser.find_elements_by_xpath('//*[@id="stream-item-tweet-929138668551380992"]/div/div[2]/div[1]/small/a/span')
Because find_elements (plural) returns a list rather than a single element, you'll need to take an item from it:
ts=ts[0]
Then you can use the get_attribute method to get the value of 'data-time' in the HTML.
raw_time=ts.get_attribute('data-time')
Returns
raw_time == '1510358895'
Thank you to SuperStew, who found the key to the answer: get_attribute()
My final solution for anyone wondering:
tweet_date = browser.find_elements_by_class_name("_timestamp")
And then for any date in that list:
tweet_date[1].get_attribute('data-time')
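The value returned for 'data-time' is a string of seconds since the Unix epoch, so it usually needs one more step before it is useful as a date. A minimal sketch, assuming the string has already been read with get_attribute; the function name parse_data_time is hypothetical, and the sample value is the one from the span in the question.

```python
from datetime import datetime, timezone

def parse_data_time(raw_time):
    """Convert a data-time string (epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(raw_time), tz=timezone.utc)

print(parse_data_time("1510698047").isoformat())
```

Passing tz=timezone.utc keeps the result independent of the machine's local time zone; for the 'data-time-ms' attribute the value would first need dividing by 1000.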

python - how can I select a class and get the number of another class with whitespace between them using selenium

Problem: automate answering on a website when there are new messages.
Breakdown:
Log in to the website DONE
Get to the page where the messages are DONE
Find the new-messages class NOT DONE
Get the number of the new message that appears in the URL NOT DONE YET
I'm using Selenium to automate this process in Python 2.5
Can't select classes with whitespace in them
I was looking through the website source and noticed that every time there is a new message, an element with this class appears: notifycircle new fnormal abs nowrap br3 cfff lheight16.
The red icon showing that we have a new message
The class that appears, showing that there is a new message
As you can see, there is whitespace between the class names, so I can't use find_element_by_class_name from the Selenium library. I know this is the classic "can't select classes that have whitespace between them" problem in Python Selenium. I have tried using find_element_by_css_selector without luck either.
...
browser.get("websitehere")
# if a certain class is found, I will proceed to the final goal
if browser.find_element_by_css_selector(".notifycircle.new.fnormal.abs.nowrap.br3.cfff.lheight16"):
    print "Found element"
Select the number of the new message
As you can see in image 2, id_answer_row_#### contains the number of the message. I would like to grab that message number. How can I achieve this?
Do you really need to examine that span element? The tr element includes a class "unreaded". If that is indicating whether the element is unread, then you could do something like this:
unread_answers = browser.find_elements_by_css_selector("tr.unreaded")
which should give you the list of WebElements for the unread rows.
You can then loop through this list and extract the number from each id with something like:
import re
for unread_row in unread_answers:
    row_id = unread_row.get_attribute("id")
    # the parentheses capture the digits so that group(1) returns them
    m = re.search(r'answer_row_(\d+)', row_id)
    row_number = m.group(1)
(note: my Python may be a little rough, and this is all untested, but you should get the idea)
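The id-parsing step can be checked on its own, without a browser. This sketch assumes ids of the form id_answer_row_#### as described in the question; the function name message_number and the sample ids are hypothetical.

```python
import re

def message_number(row_id):
    """Return the numeric suffix of an id like 'id_answer_row_1234', or None."""
    match = re.search(r'answer_row_(\d+)', row_id)
    return int(match.group(1)) if match else None

# Hypothetical ids for illustration only.
for row_id in ["id_answer_row_1234", "id_answer_row_87", "header_row"]:
    print(row_id, message_number(row_id))
```

The capture group is what makes group(1) valid; a pattern without parentheses only has group(0), the whole match.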
You can use the full class attribute value to identify the element, as follows:
browser.find_element_by_xpath('//tr[@class="notifycircle new fnormal abs nowrap br3 cfff lheight16"]')
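An XPath test of the form @class="..." compares the whole attribute string, so it stops matching if the class names are reordered or another one is added. A compound CSS selector matches each class token independently. The helper below builds such a selector from a class attribute value; the function name css_selector_for_classes and its optional tag parameter are hypothetical, and this is only a sketch of the string construction, not of the lookup itself.

```python
def css_selector_for_classes(class_attr, tag=""):
    """Turn 'a b c' into '.a.b.c', optionally prefixed with a tag name."""
    tokens = class_attr.split()  # split() also collapses repeated whitespace
    return tag + "".join("." + token for token in tokens)

classes = "notifycircle new fnormal abs nowrap br3 cfff lheight16"
print(css_selector_for_classes(classes))
print(css_selector_for_classes(classes, tag="span"))
```

The resulting string is what would be passed to a CSS-selector lookup such as find_element_by_css_selector.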
