How do I extract the text from this HTML tag? - python

I'm trying to build a web scraper in Python to extract the dates from an HTML tag. I've attached what the tag looks like; I'm specifically trying to extract the date. I'm able to isolate the tag itself using dates = {item.find('span', {'class': 'POSITIVE'})}, which retrieves the whole tag.
To only get the text, I'm trying to use something along the lines of:
dates = {item.find('span', {'class': 'POSITIVE'}).text}
However, when I do, I get a traceback saying: "AttributeError: 'NoneType' object has no attribute 'text'". I've also tried .get_text(), which doesn't work either and gives a similar traceback. How do I change my code to get the date text? I'm new to coding, so I'm sorry if this is a basic problem.
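That error usually means find() returned None for at least one item, i.e. some item has no matching span. A minimal sketch of guarding against that is below; the HTML, the "row" class and the date value are hypothetical, since the original tag was only attached as an image.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second row has no span with class "POSITIVE".
html = """
<div class="row"><span class="POSITIVE">2021-03-04</span></div>
<div class="row"><span class="NEGATIVE">n/a</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

dates = []
for item in soup.find_all("div", class_="row"):
    span = item.find("span", {"class": "POSITIVE"})
    # find() returns None when nothing matches, so check before reading the text.
    if span is not None:
        dates.append(span.get_text(strip=True))

print(dates)
```

Skipping the items without a match keeps the loop from failing on the first row that lacks the span.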

Related

How to get non element text adjacent to a tag using Scrapy?

I am trying to scrape a page using Scrapy Framework.
<div class="info"><span class="label">Establishment year</span> 2014</div>
The tag I want to deal with looks like the one above. I want to get the value 2014. I can't use the info or label classes, as they are common throughout the page.
So I tried the XPath expressions below, but I am getting null:
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling").get()
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling::text()").get()
Any clue what can be the issue?
Since you are trying to extract the text in between the tags, you should use the tag at the end of the expression. I don't know what website you are trying to scrape, but here is an example of me scraping the text in between the 'a' tags on this website: http://books.toscrape.com/ Here is the code I used for it:
response.xpath("(//h3)[1]/a/text()").extract_first()
In your second line of code you did not use the function for extracting text correctly. The ::text form is what CSS selectors use; in XPath it would be /text(). For your code, I think you should try one of these options. Let me know if it helps.
response.xpath("//span[contains(text(),'Establishment year')]/div/text()").get()
or
response.xpath("//span[contains(text(),'Establishment year')]/span/text()").get()
Extract direct text children (/text()) from the parent element:
>>> from parsel import Selector
>>> selector = Selector(text='<div class="info"><span class="label">Establishment year</span> 2014</div>')
>>> selector.xpath('//*[@class="info"]/text()').get()
' 2014'

Extracting the value of an attribute with a hyphen using Beautiful Soup 4

I am traversing through my tags using Beautiful Soup 4. I have the following tag contents and am unable to extract the value of the 'data-event-name' attribute. I want '15:02' from this.
This is the html I need to extract 15:02 from
I have tried many things but am unable to get this value. I tried using the re package, Python's getattr, find, find_all, etc. This is one example of something I tried:
for racemeetnum, r1_a in enumerate(r1, start=1):
    event1 = getattr(r1_a, 'data-event-name')  # doesn't work
Thank you @Jack Fleming. I managed to sort this last night. In the end my issue wasn't that I couldn't find the attribute, it was that I wasn't trapping the errors when the attribute wasn't found. I surrounded the code with a try/except and it worked fine.
Thanks for responding!
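For reference, a minimal sketch of reading a hyphenated attribute without a try/except. The markup is hypothetical because the original HTML was not included in the question. In Beautiful Soup, attribute-style access such as getattr(tag, 'data-event-name') searches for a child tag with that name, so HTML attributes are read with tag.get(...) or tag[...] instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link lacks the data-event-name attribute.
html = """
<a class="race" data-event-name="15:02" href="#r1">Race 1</a>
<a class="race" href="#r2">Race 2</a>
"""
soup = BeautifulSoup(html, "html.parser")

event_names = []
for num, link in enumerate(soup.find_all("a", class_="race"), start=1):
    # .get() returns None when the attribute is absent,
    # whereas link["data-event-name"] would raise KeyError.
    event_name = link.get("data-event-name")
    event_names.append(event_name)
    print(num, event_name)
```

Using .get() covers the missing-attribute case that the try/except was trapping.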

How to get data-timestamp using python/selenium

Below is the html of the table I want to extract the data-timestamp from.
The webpage is at https://nl.soccerway.com/national/argentina/primera-division/20182019/regular-season/r47779/matches/?ICID=PL_3N_02
So far I have tried various variants I found on here, but nothing seemed to work. Can someone help me to extract (for example) 1536962400? In other words, I want to extract every data-timestamp value of the table. Any suggestions are more than welcome! I have used selenium/python to extract table data from the website, but data-timestamp always gives errors.
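data-timestamp is an attribute of the tr element, so you can try this:
from selenium.webdriver.common.by import By  # Selenium 4; older versions used driver.find_elements_by_xpath(...)
element_list = driver.find_elements(By.XPATH, "//table[contains(@class,'matches')]/tbody/tr")
for item in element_list:
    print(item.get_attribute('data-timestamp'))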

re.search webpage source with a carriage return mid string causing errors in Python

Using Python 2.6.6
I'm trying to grab the title of youtube links using mechanize browser, and while it does work on links to actual videos, linking to a channel's page, or their playlists, etc, causes it to crash.
The relevant code segment:
ytpage = br.open(ytlink).read()
yttitle = re.search('<title>(.*)</title>', ytpage)
yttitle = yttitle.group(1)
The error:
yttitle = yttitle.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
The only difference I can see is that a direct video link lays out the title tags on a single line in the source whereas every other youtube page seems to put a carriage return/newline in the middle of the title tags.
Anyone know how I can get around that carriage return, assuming that is the problem?
Cheers.
You can use the re.DOTALL flag, which makes . match any character, including a newline (see the documentation for the re module).
So your second line of code should look like:
yttitle = re.search('<title>(.*)</title>', ytpage, re.DOTALL)
By the way, to extract data from a webpage it might be easier to use Beautiful Soup.
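A small self-contained sketch of the difference the flag makes. The page source here is a made-up string, not real YouTube output. It also uses a non-greedy group and checks for None before calling group(), so a page without a title does not raise the AttributeError shown above.

```python
import re

# Hypothetical page source with newlines inside the <title> element.
ytpage = "<html><head><title>\n  Some Channel - YouTube\n</title></head></html>"

# Without re.DOTALL, "." does not match "\n", so the search finds nothing.
no_flag = re.search(r"<title>(.*?)</title>", ytpage)

# With re.DOTALL, "." also matches newlines; strip the surrounding whitespace.
match = re.search(r"<title>(.*?)</title>", ytpage, re.DOTALL)
yttitle = match.group(1).strip() if match else None

print(no_flag)
print(yttitle)
```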

BeautifulSoup fails at finding element using regex because of extra <br> [duplicate]

I'm trying to extract the text from inside a <dt> tag with a <span> inside on www.uszip.com:
Here is an example of what I'm trying to get:
<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>
I want to get the 14.28 out of the tag. This is how I'm currently approaching it:
Note: soup is the BeautifulSoup version of the entire webpage's source code:
soup.find("dt",text="Land area").contents[0]
However, this gives me an
AttributeError: 'NoneType' object has no attribute 'contents'
I've tried a lot of things and I'm not sure how to approach this. This method works for some of the other data on this page, like:
<dt>Total population</dt>
<dd>22,234<span class="trend trend-down" title="-15,025 (-69.77% since 2000)">▼</span></dd>
Using soup.find("dt",text="Total population").next_sibling.contents[0] on this returns '22,234'.
How should I try to first identify the correct tag and then get the right data out of it?
Unfortunately, you cannot match tags with both text and nested tags, based on the contained text alone.
You'd have to loop over all <dt> without text:
for dt in soup.find_all('dt', text=False):
    if 'Land area' in dt.text:
        print(dt.contents[0])
This sounds counter-intuitive, but the .string attribute for such tags is empty, and that is what BeautifulSoup is matching against. .text contains all strings in all nested tags combined, and that is not matched against.
You could also use a custom function to do the search:
soup.find_all(lambda t: t.name == 'dt' and 'Land area' in t.text)
which essentially does the same search with the filter encapsulated in a lambda function.
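To get from the matching dt to the value itself, one option is to step to the next dd sibling. The sketch below assumes the dt/dd pairs sit next to each other as in the snippets from the question; value_for is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html = """
<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>
<dt>Total population</dt>
<dd>22,234<span class="trend trend-down" title="-15,025 (-69.77% since 2000)">▼</span></dd>
"""
soup = BeautifulSoup(html, "html.parser")


def value_for(label):
    """Return the first string of the <dd> following the <dt> that contains label."""
    dt = soup.find(lambda t: t.name == "dt" and label in t.get_text())
    if dt is None:
        return None
    # find_next_sibling skips the whitespace text node between </dt> and <dd>.
    dd = dt.find_next_sibling("dd")
    return dd.contents[0].strip() if dd is not None else None


print(value_for("Land area"))
print(value_for("Total population"))
```

The same helper works for both kinds of dt, with or without nested tags, because it matches on the combined text.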
