How to get data-timestamp using python/selenium - python

Below is the html of the table I want to extract the data-timestamp from.
The webpage is at https://nl.soccerway.com/national/argentina/primera-division/20182019/regular-season/r47779/matches/?ICID=PL_3N_02
So far I have tried various variants I found on here, but nothing seemed to work. Can someone help me extract, for example, 1536962400? In other words, I want to extract every data-timestamp value in the table. Any suggestions are more than welcome! I have used selenium/python to extract table data from the website, but data-timestamp always gives errors.

data-timestamp is an attribute of the tr element, so you can try this:
element_list = driver.find_elements_by_xpath("//table[contains(@class,'matches')]/tbody/tr")
for items in element_list:
    print(items.get_attribute('data-timestamp'))
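The values read this way are strings, and the question's example (1536962400) looks like seconds since the Unix epoch. A minimal sketch of converting them follows; the helper name `timestamps_from_rows` is hypothetical, and the interpretation as UTC epoch seconds is an assumption. It accepts any objects with a get_attribute method, such as the tr elements found above, and skips rows that lack the attribute.

```python
from datetime import datetime, timezone


def timestamps_from_rows(rows):
    """Return a UTC datetime for each row that carries data-timestamp.

    `rows` is any iterable of objects with a get_attribute(name) method,
    for example the Selenium elements for the table's <tr> tags.
    """
    result = []
    for row in rows:
        raw = row.get_attribute("data-timestamp")
        if not raw:  # header or separator rows may lack the attribute
            continue
        # Assumption: the value is seconds since the Unix epoch.
        result.append(datetime.fromtimestamp(int(raw), tz=timezone.utc))
    return result
```

With the element_list from the answer, `timestamps_from_rows(element_list)` would give one datetime per match row; under the epoch-seconds assumption, 1536962400 corresponds to 2018-09-14 22:00 UTC.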

Related

I found a span on a website that is not visible and I can't scrape it! Why?

Currently I'm trying to scrape data from a website. Therefore I'm using Selenium.
Everything was working as it should until I realised I have to scrape a tooltip text.
I have already found different threads on Stack Overflow that provide an answer, but I have not managed to solve this issue so far.
After a few hours of frustration I realised the following:
This span has nothing to do with the tooltip, I guess, because the tooltip looks like this:
There is actually a span that I can't read. I try to read it like this:
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.text)
So Selenium finds this element, but unfortunately '.text' returns nothing. Why is it always empty?
And what is the span from the first screenshot for? By the way, it is not displayed on the website either.
Since you mentioned that Selenium finds this element, I assume you have printed the length of the bewertung list, something like:
print(len(bewertung))
If this list has some elements in it, you could probably use innerText ('.text' only returns text that is rendered as visible, so it is empty for hidden elements):
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.get_attribute("innerText"))
Note that you are using find_elements, which does not throw an error when it finds nothing; it returns an empty list instead.
If you use find_element instead, it throws an exception that tells you exactly what went wrong.
Also, I think your XPath points at a span that does not appear in the UI (sometimes such elements do not appear until some action is triggered).
You can try this XPath instead:
//i[@data-hook='average-stars-rating-anywhere']//span[@data-hook='acr-average-stars-rating-text']
Something like this in code:
bewertung = driver.find_elements_by_xpath("//i[@data-hook='average-stars-rating-anywhere']//span[@data-hook='acr-average-stars-rating-text']")
for item in bewertung:
    print(item.text)
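The choice between '.text' and innerText can be wrapped in a small fallback. This is a sketch rather than part of the original answer: `element_text` is a hypothetical helper name, and it relies only on the `.text` property and the `get_attribute` method that Selenium elements provide.

```python
def element_text(element):
    """Return an element's visible text, falling back to its DOM text."""
    visible = element.text  # empty when the element is not rendered
    if visible and visible.strip():
        return visible.strip()
    # innerText / textContent are DOM properties, readable even when hidden
    hidden = element.get_attribute("innerText") or element.get_attribute("textContent")
    return (hidden or "").strip()
```

Used in the loop above, `print(element_text(item))` would print the rating text whether or not the span is currently displayed.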

Need Selenium to return the class title content of given HTML

I'm using Selenium to perform some web scraping. I have it log in to a site, where an HTML table of data is returned with five values at a time. I'm going to have Selenium scrape a particular bit of data off the table, write it to a file, click next, and repeat with the next five.
This is a new automation script. I've tried a myriad of variations of get_attribute, find_elements_by_class_name, etc. Example:
pnum = prtnames.get_attribute("title")
for x in prtnames:
    print('pnum')
Here's the HTML from one of the returned values:
<div class="text-container prtname"><span class="PrtName" title="P011">P011</span></div>
I need to get that "P011" value. Obviously Selenium doesn't have "find_elements_by_title", and there is no HTML id for the value. The Xpath for that line of HTML is:
//*[@id="printerConnectTable"]/tbody/tr[5]/td/table/tbody/tr[1]/td[2]/div/span
But I don't see a reference to "title" or "P011" in that Xpath.
pnum = prtnames.get_attribute("title")
AttributeError: 'list' object has no attribute 'get_attribute'
It's like get_attribute doesn't exist, but there is some (albeit not much) documentation on it.
Fundamentally I'd like to grab that "P011" value and print to console, then I know Selenium is working with the right data.
P.S. I'm self-taught with all of this, I'm automating a sysadmin task.
I think the problem is that prtnames is a list of elements, not a single element. You can use a list comprehension if you want a list of the title attributes for the elements in prtnames.
pnums = [x.get_attribute('title') for x in prtnames]
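Put together, the lookup and the list comprehension might look like the sketch below. It rests on assumptions: `printer_names` is a hypothetical helper, the selector `span.PrtName` is inferred from the HTML snippet in the question, and the locator strategy is passed as the plain string "css selector" (the value behind Selenium's By.CSS_SELECTOR), so the function works with any driver-like object that exposes find_elements(by, value).

```python
def printer_names(driver):
    """Collect the title attribute of every span.PrtName on the page."""
    # find_elements returns a list, so get_attribute is called per item
    spans = driver.find_elements("css selector", "span.PrtName")
    return [span.get_attribute("title") for span in spans]
```

Printing the result of `printer_names(driver)` after each click on "next" would show whether Selenium is reading the right five values.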

How to click one by one to get data from website by using selenium python

I am trying to get data from a website: I want to select the first 1000 links, open them one by one, and get data from each.
I have tried:
list_links = driver.find_elements_by_tag_name('a')
for i in list_links:
    print (i.get_attribute('href'))
This also returns extra links that are not required.
For example: https://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1,2,3,4,5,%3E5&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&cityName=Mumbai
gives more than 50k links. How can I open only the first 1000 links, that is, the listings shown with property photos?
Edit
I have also tried this:
driver.find_elements_by_xpath("//div[@class='.l-srp__results.flex__item']")
driver.find_element_by_css_selector('a').get_attribute('href')
for matches in driver:
    print('Liking')
    print (matches)
    #matches.click()
    time.sleep(5)
But I get an error: TypeError: 'WebDriver' object is not iterable
Why am I not getting the link with this line: driver.find_element_by_css_selector('a').get_attribute('href')
Edit 1
I am trying to filter the links as shown below, but I get an error:
result = re.findall(r'https://www.magicbricks.com/propertyDetails/', my_list)
print (result)
Error: TypeError: expected string or bytes-like object
I also tried:
a = ['https://www.magicbricks.com/propertyDetails/']
output_names = [name for name in a if (name[:45] in my_list)]
print (output_names)
This returns nothing.
All the links are in a list. Please suggest a fix. Thank you in advance.
Selenium is not a good idea for web scraping. I would suggest you use JMeter, which is free and open source.
http://www.testautomationguru.com/jmeter-how-to-do-web-scraping/
If you want to use Selenium, the approach you are trying to follow (clicking and grabbing the data) is not stable. Instead, I would suggest something like the following. The example is in Java, but you should get the idea.
driver.get("https://www.yahoo.com");
Map<Integer, List<String>> map = driver.findElements(By.xpath("//*[@href]"))
    .stream() // find all elements which have an href attribute & process one by one
    .map(ele -> ele.getAttribute("href")) // get the value of href
    .map(String::trim) // trim the text
    .distinct() // there could be duplicate links, so keep unique ones
    .collect(Collectors.groupingBy(LinkUtil::getResponseCode)); // group the links based on the response code
More info is here:
http://www.testautomationguru.com/selenium-webdriver-how-to-find-broken-links-on-a-page/
I believe you should collect into a list all the elements with tag name "a" whose "href" attribute is not null.
Then traverse the list and click on the elements one by one.
Create a list of type WebElement and store all the valid links.
Here you can apply more filters or conditions, e.g. that the link contains certain characters.
To store the WebElements in a list you can use driver.findElements(); this method returns a list of type WebElement.
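In Python, the filtering that both answers describe can be done on the href strings before anything is opened. A minimal sketch, assuming a hypothetical helper named `first_matching_links` and the prefix from the question's own attempt; working on one string at a time also avoids the TypeError from re.findall, which expects a single string rather than a list.

```python
def first_matching_links(hrefs, prefix, limit=1000):
    """Return up to `limit` unique links that start with `prefix`, in order."""
    seen = set()
    selected = []
    for href in hrefs:
        if not href:  # anchors without an href yield None
            continue
        href = href.strip()
        if href.startswith(prefix) and href not in seen:
            seen.add(href)
            selected.append(href)
            if len(selected) == limit:
                break
    return selected
```

The hrefs could come from `[i.get_attribute('href') for i in list_links]`, and each selected link can then be opened with `driver.get(link)` instead of clicking the element.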

How to skip certain tags with BeautifulSoup?

I'm a beginner in Python and currently I'm trying to write a simple script using BeautifulSoup to extract some information from a web page and write it to a CSV file. What I'm trying to do here, is to go through all the lists on the web page. In the specific HTML file which I'm looking to work with, only one 'ul' has an id and I wish to skip that one and save all the other list elements in an array. My code doesn't work and I can't figure out how to solve my problem.
for ul in content_container.findAll('ul'):
    if 'id' in ul:
        continue
    else:
        for li in ul.findAll('li'):
            list.append(li.text)
            print(li.text)
Here, when I print the list out, I still see the elements from the ul with the id. I know it's a simple problem, but I'm stuck at the moment. Any help would be appreciated.
You are looking for id=False. Use this:
for ul in content_container.find_all('ul', id=False):
    for li in ul.find_all('li'):
        list.append(li.text)
        print(li.text)
This will ignore all ul tags that have an id attribute. Also, your approach was nearly correct. You just need to check whether id is present in the tag's attributes, not in the tag itself (as you are doing; 'in' on a tag checks its children). So, use if 'id' in ul.attrs instead of if 'id' in ul.
Try this:
all_uls = content_container.find_all('ul')
# assuming that the ul with the id is the first ul
for i in range(1, len(all_uls)):
    print(all_uls[i])
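The explicit attribute check can be tried on a small document. This sketch assumes a made-up HTML snippet, and it uses the variable name `items` instead of `list` so that the built-in is not shadowed.

```python
from bs4 import BeautifulSoup

# Made-up sample: only the first <ul> has an id
html = """
<div id="content">
  <ul id="menu"><li>Home</li><li>About</li></ul>
  <ul><li>Apples</li><li>Pears</li></ul>
  <ul><li>Plums</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content_container = soup.find("div", id="content")

items = []
for ul in content_container.find_all("ul"):
    if "id" in ul.attrs:  # attrs is a dict of the tag's attributes
        continue
    for li in ul.find_all("li"):
        items.append(li.text)

print(items)  # ['Apples', 'Pears', 'Plums']
```

Unlike the index-based variant, this does not depend on the ul with the id being the first one on the page.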

trouble getting text from xpath entry in python

I am on the website
http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b
and trying to scrape the data from the tables. When I pull the xpath from one entry, say the pitcher
"Terry Mulholland," I retrieve this:
pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tbody/tr[2]/td/a")
When I try to print pitcher.text for each pitcher in pitchers, I get [] rather than the text. Any idea why?
The problem is that the last tbody doesn't exist in the original source. If you got that XPath from a browser, keep in mind that browsers can guess and add missing elements to make the HTML valid.
Removing the last tbody resolves the problem.
In : import lxml.html as html
In : site = html.parse("http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b")
In : pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tr[2]/td/a")
In : pitchers[0].text
Out: 'Terry Mulholland'
I should add that the XPath expression you are using is pretty fragile. A single div added somewhere along the path and you have a broken script. If possible, try to find better references, like an id or class that points to your expected location.
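The effect of a browser-added tbody can be reproduced offline. The sketch below is an illustration under assumptions: it uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) instead of lxml, and a made-up, well-formed snippet with a hypothetical id of "homers" standing in for the real page.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the raw page source: note there is no <tbody>
RAW_SOURCE = """
<html><body>
  <table id="homers">
    <tr><th>Pitcher</th></tr>
    <tr><td><a href="#">Terry Mulholland</a></td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(RAW_SOURCE)  # root is the <html> element

# Path as a browser would report it, with the tbody it inserted itself
with_tbody = root.findall("./body/table/tbody/tr[2]/td/a")
# Same path with the tbody step removed, matching the raw source
without_tbody = root.findall("./body/table/tr[2]/td/a")
# Less fragile: anchor on an id instead of counting positions
by_id = root.findall(".//table[@id='homers']//a")

print(with_tbody)             # []
print(without_tbody[0].text)  # Terry Mulholland
print(by_id[0].text)          # Terry Mulholland
```

The id-based expression keeps working if wrapper elements are added around the table, which is the robustness the answer recommends.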
