How to extract last modified date of a link using beautiful soup - python

Using this code:
soup = BeautifulSoup(requests.get(url).content, "html.parser")
Here is the data I extracted with Beautiful Soup:
<pre>Name Last modified Size</pre>
<hr/>
<pre>
../
0.1.0/
21-Oct-2020 14:06 -
</pre>
I am trying to get the 'Last Modified' data associated with the 'a' tag. In this example, I want to make a dict with the key being '0.1.0' (I know how to extract this) and the value being '21-Oct-2020 14:06'.
EDIT
OK, so after playing around I was able to get the text:
(Pdb) soup.findAll("pre")[1].get_text()
'../\n0.1.0/ 21-Oct-2020 14:06 -\n'
I guess just iterating around each 'pre' tag will do it
thx

You can use a regex for that!
import re
import requests
data = re.findall(r'\d{2}-\w{3}-\d{4} \d{2}:\d{2}', requests.get(url).text)[0]
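To build the dict the question actually asks for (version name → date), you can pair each <a> tag in the listing <pre> with the text node that follows it. A minimal sketch, assuming an autoindex-style listing like the one shown (the sample HTML below is reconstructed for illustration, not taken from the real server):

```python
import re
from bs4 import BeautifulSoup

# Reconstructed sample of an autoindex-style directory listing.
html = """<pre><a href="../">../</a>
<a href="0.1.0/">0.1.0/</a>                21-Oct-2020 14:06    -
</pre>"""

soup = BeautifulSoup(html, "html.parser")
date_re = re.compile(r"\d{2}-\w{3}-\d{4} \d{2}:\d{2}")

versions = {}
for a in soup.find_all("a"):
    name = a.get_text().strip("/")
    if name in ("", ".."):          # skip the parent-directory link
        continue
    # The "Last modified" date sits in the text node right after the link.
    trailing = str(a.next_sibling or "")
    match = date_re.search(trailing)
    if match:
        versions[name] = match.group()

print(versions)  # {'0.1.0': '21-Oct-2020 14:06'}
```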

Related

BeautifulSoup: how to find all the about attributes from html string

In a text file, these items all have the same structure and I would like to parse them with Beautiful Soup.
An extract:
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser')
I know the data is not really pure HTML.
I would like to extract all "about" section for example.
print(soup.find_all('about')) => it returns an empty array!
Perhaps I use a wrong parser?
Thanks a lot.
Best regards.
Théo
If you check the documentation carefully for find_all, it looks for tags with the specified name.
So in this case, you should look for the text tag(s) and then retrieve the about attribute from them.
A working example would look like this:
from bs4 import BeautifulSoup
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser')
# to get the 'about' attribute from the first text element
print(soup.find_all('text')[0]['about'])
# to get the 'about' attributes from all the text elements, as a list
print([text['about'] for text in soup.find_all('text')])
Output:
Le Pen|Macron
['Le Pen|Macron']
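Alternatively, if you don't want to hard-code the tag name, find_all can also filter on attribute presence by passing True as the attribute value:

```python
from bs4 import BeautifulSoup

html = '<text about="Le Pen|Macron"><p type="title">Éditorial</p></text>'
soup = BeautifulSoup(html, "html.parser")

# attrs={"about": True} matches any tag that carries an `about` attribute,
# whatever the tag name is.
abouts = [tag["about"] for tag in soup.find_all(attrs={"about": True})]
print(abouts)  # ['Le Pen|Macron']
```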

Extracting href from 'a' element with text only attribute

I am trying to build a function in a python webscraper that moves to the next page in a list of results. I am having trouble locating the element in beautiful soup as the link is found at the end of many other tags, and doesn't have any attributes such as class or ID.
Here is a snippet of the html:
<a href="http://www.url?&=page=2">
Next
</a>
I have been reading the bs4 documentation trying to understand how I can extract the URL, but I am coming up stumped. I am thinking that it could be done by either:
1. finding the last a['href'] in the parent element, as it is always the last one, or
2. finding the href based on the fact that it always has the text 'Next'.
I don't know how to write something that would solve either 1. or 2.
Am I along the right lines? Does anyone have any suggestions to achieve my goal? Thanks
To find the <a> tag that contains the text Next, you can do:
from bs4 import BeautifulSoup
txt = '''
<a href="http://www.url?&=page=2">
Next
</a>'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select_one('a:contains("Next")')['href'])
Prints:
http://www.url?&=page=2
Or:
print(soup.find('a', text=lambda t: t and t.strip() == 'Next')['href'])
To get the last <a> tag inside some element, you can index the ResultSet with [-1]:
from bs4 import BeautifulSoup
txt = '''
<div id="block">
<a href="http://www.url?&=page=1">Some other link</a>
<a href="http://www.url?&=page=2">Next</a>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select('div#block > a')[-1]['href'])
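Putting it together, the "move to the next page" function the question describes can be sketched as a loop that follows the Next link until a page no longer has one. The pages dict below stands in for real HTTP responses (hypothetical data, not from the original site):

```python
from bs4 import BeautifulSoup

# Hypothetical pages keyed by URL, standing in for requests.get() responses.
pages = {
    "page1": '<div id="block"><a href="page2">Next</a></div>',
    "page2": '<div id="block"><span>no more results</span></div>',
}

visited = []
url = "page1"
while url is not None:
    visited.append(url)
    soup = BeautifulSoup(pages[url], "html.parser")
    # Follow the link whose text is exactly 'Next'; stop when there is none.
    next_link = soup.find("a", string=lambda t: t and t.strip() == "Next")
    url = next_link["href"] if next_link else None

print(visited)  # ['page1', 'page2']
```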

How do I scrape only the numbers and not the text following the numbers?

Following is extract from the HTML code I would like to web scrape from. Given:
<tbody>
<tr>
<th>SAT Math</th>
<td>"541 average"</td>
</tr>
</tbody>
I am using Python and Beautiful Soup to web scrape and extract out the 541 but my problem is:
Once I do extract the "541 average", how do I get rid of all the excess material - for example, how do I get rid of the "average"?
Thank you so much, I would be extremely grateful to anyone who can help!
(Sorry I am a beginner to Python and web scraping)
Current code:
import urllib2
from bs4 import BeautifulSoup
import csv
from datetime import datetime

quote_page = 'https://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId=' + str(i)
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
table = soup.find("div", attrs={"id": "section8"})
name_box = soup.find('div', attrs={'class': 'cp_left'}).find('h1')
name = name_box.text.strip()  # strip() is used to remove starting and trailing whitespace
print name
datasets = []
for row in table.find_all("tr")[1:]:
    if zip(th.get_text() for th in row.find_all("th")) != [(u'SAT Math',)]:
        continue
    dataset = zip((th.get_text() for th in row.find_all("th")), (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)
for dataset in datasets:
    for field in dataset:
        print format(field[1])
If you'll always have the " average" text in the result, you can extract only the number with a regular expression.
You basically need to manipulate the string.
Something like this:
import re
s = "541 average"
extractTheNumber = re.findall(r'(\d+?)\s', s)
print(extractTheNumber[0])
This matches one or more digit characters up to a space (the space is excluded from the match).
Try your regex on this tool, which might be very useful: https://regex101.com/
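If the format is always as regular as "541 average", plain string methods work too, no regex needed:

```python
s = "541 average"

# The number always comes first and is separated by a space,
# so splitting on whitespace and taking the first token is enough.
number = int(s.split()[0])
print(number)  # 541
```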

Wrangle Facebook page for all the friend names using Beautiful Soup in Python

I am trying to do the following: wrangle my own Facebook page for all my friend names. I initially stored my webpage as a local file and then parsed it using BeautifulSoup. I used the find_all method to look for all the elements of the following pattern: <div class="name">friend name</div>. However, when I run the code, I get an empty list with no matches. So what do you think is the problem?
Thanks
Code snippet:
from bs4 import BeautifulSoup
page = "myFacebook.html"
with open(page, "r") as html:
    page_html = BeautifulSoup(html, "lxml")
    print page_html.find_all(attrs={"class": "name"})
I was able to extract the required information using Python's regular expression library, re.
Following is a simple script to pull friend names from a Facebook page stored locally as an HTML file:
import re
page = "friends.html"
f = open(page, 'r')
s = f.read()
regX = r'<div class="fsl fwb fcb">.*?</div>'  # making it less greedy using ?
for elem in re.findall(regX, s):
    print re.findall(r'<a.*>(.*)</a>', elem)[0]  # getting the name's text
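The same extraction can also be done with BeautifulSoup instead of a raw regex, selecting on the "fsl fwb fcb" class the script above relies on. A sketch, with hypothetical sample HTML mimicking the saved page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the saved Facebook page.
html = '''
<div class="fsl fwb fcb"><a href="#">Alice Example</a></div>
<div class="fsl fwb fcb"><a href="#">Bob Example</a></div>
'''
soup = BeautifulSoup(html, "html.parser")

# Select divs carrying all three classes, then take their text content.
names = [div.get_text() for div in soup.select("div.fsl.fwb.fcb")]
print(names)  # ['Alice Example', 'Bob Example']
```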

BeautifulSoup - how to extract text without opening tag and before <br> tag?

I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. This I managed to extract it.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
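A self-contained, runnable version of this approach on the question's HTML, filtering out the strings that are empty after stripping:

```python
from bs4 import BeautifulSoup

# The div from the question, with the h4 followed by bare text and <br> tags.
html = """<div>
<h4 class="actorboxLink">
<a href="#">Decheterie de Bagnols</a>
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>"""

soup = BeautifulSoup(html, "html.parser")
lines = []
for h4 in soup.find_all("h4", class_="actorboxLink"):
    lines.append(h4.get_text().strip())
    # Sibling text nodes after </h4> hold the address lines.
    for text in h4.find_next_siblings(string=True):
        if text.strip():
            lines.append(text.strip())

print(lines)  # ['Decheterie de Bagnols', 'Route des 4 Vents', '63810 Bagnols']
```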
If you don't need each of the 3 elements you are looking for in a different variable, you can just use the get_text() function on the <div> to get them all in one string. If there are other div tags but they all have classes, you can find all the <div> elements with class_=False. If you can't isolate the <div> you are interested in, then this solution won't work for you.
import urllib.request
from bs4 import BeautifulSoup
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
BTW this is Python 3 & bs4.
