How to extract an element from a webpage by a specific class name? - python

I have a txt file filled with multiple URLs; each URL is an article with text and its corresponding SDG (example of one article 1).
The text parts of an article are in the tags 'div.text.-normal.content' and then in 'p',
and the SDGs are in 'div.tax-section.text.-normal.small' and then in 'span'.
To extract them I use the following lines of code:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []
with open('urls_news.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                text = soup.select_one('div.text-normal').get_text(strip=True)
                topic = soup.select_one('div.tax-section').get_text(strip=True)
                data.append(
                    {
                        'text': text,
                        'topic': topic,
                    }
                )
                pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)
            except AttributeError:
                print(" ")
        time.sleep(3)
But I get no result. I've previously used this code to extract the same type of information from another website with clearer class names. I've also tried entering "div.text.-normal.content" and "div.tax-section.text.-normal.small", with the same result.
I think the classes I'm calling in this example are wrong. I would like to know what I've missed in these class names.

To select the text you can go with:
soup.select_one('div.text.-normal.content').get_text(strip=True)
I think there is something wrong with the names of the classes; just chain them with a . for every whitespace between them.
or:
soup.select_one('div.c-single-content').get_text(strip=True)
To get the topics, as mentioned, you can go with:
'^^'.join([topic.get_text(strip=True) for topic in soup.select_one('div.tax-section.text.-normal.small').select('a')])
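Put together, a minimal sketch of the whole loop, under the assumption that those chained class names match the pages (the '^^' separator and the file names are carried over from the question and this answer):
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []
with open('urls_news.txt') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if not response.ok:
            continue
        soup = BeautifulSoup(response.text, 'html.parser')
        # All three classes chained with '.': the div carries 'text', '-normal' and 'content'.
        text_tag = soup.select_one('div.text.-normal.content')
        topic_tag = soup.select_one('div.tax-section.text.-normal.small')
        if text_tag is None or topic_tag is None:
            continue  # layout differs on this page, skip it
        data.append({
            'text': text_tag.get_text(strip=True),
            'topic': '^^'.join(a.get_text(strip=True) for a in topic_tag.select('a')),
        })
        time.sleep(3)

# Write once, after the loop, rather than on every iteration.
pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)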

Related

Output non JSON data from regex web scraping to a JSON file

I'm using requests and regex to scrape data from an entire website and then save it to a JSON file hosted on GitHub, so I and anyone else can access the data from other devices.
The first thing I tried was simply to open every single page on the website and get all the data I want, but I found that to be unnecessary, so I decided to make two scripts: the first one finds the URL of every page on the site, and the second one is the one that gets called and scrapes the given URL. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
    "Console": "/neo-geo-aes",
    "Call ID": "62815",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
    "Console": "/neo-geo-cd",
    "Call ID": "62817",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
    "Console": "/neo-geo-pocket-color",
    "Call ID": "62578",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
    "Console": "/playstation",
    "Call ID": "62580",
    "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution. Here's the code in question:
import re
import requests
import json

## The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text

## Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)

## For each item (number of consoles) find games
for i in range(len(dataUrl)):
    ## Make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ## Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ## For every item in list (items per console)
    out_list = []
    for i in range(len(urlOne)):
        ## Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ## Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
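The sample output is several separate JSON documents written back to back, which is what repeated json.dump calls on a file opened in append mode produce. A minimal sketch of one fix (the records below are just the ones from the sample): collect everything in a single list and dump it once, so the file holds one valid JSON array.
import json

# Hypothetical records, standing in for what the scraping loops above collect.
all_items = [
    {"Console": "/neo-geo-aes", "Call ID": "62815",
     "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"},
    {"Console": "/neo-geo-cd", "Call ID": "62817",
     "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"},
]

# One open in 'w' mode and one dump: the file is then a single valid JSON array.
with open('docs/result.json', 'w') as data_json_file:
    json.dump(all_items, data_json_file, indent=4)

# Reading it back later is a single json.load:
with open('docs/result.json') as f:
    print(len(json.load(f)))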

Why is BeautifulSoup(...).find(...) returning None?

I have a problem with my code (I use bs4):
elif 'temperature' in query:
    speak("where?")
    miejsce = takecommand().lower()
    search = (f"Temperature in {miejsce}")
    url = (f'https://www.google.com/search?q={search}')
    r = requests.get(url)
    data = BeautifulSoup(r.text, "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"In {search} there is {temp}")
and the error is:
temp = data.find("div", class_="BNeawe").text
AttributeError: 'NoneType' object has no attribute 'text'
Could you help me, please?
data.find("div", class_="BNeawe") didn't return anything, so I believe Google changed how it displays weather since you last ran this code successfully.
If you search for yourself 'Weather in {place}', then right-click the weather widget and choose Inspect Element (browser dependent), you can look for yourself at where the data is in the page and see which class the data is under.
It appears it was previously under the BNeawe class.
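Until you find the current class, it also helps to guard against find returning None, which is exactly what raises the AttributeError. A minimal sketch of that pattern (the class name is just the one from the question and may well be stale):
import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=Temperature+in+London"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")

div = soup.find("div", class_="BNeawe")  # may be None if the markup changed
if div is None:
    print("Temperature element not found; the page layout may have changed.")
else:
    print(div.text)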
elif "temperature" in query or "temperatures" in query:
    search = "Temperature in New York"
    url = f"https://www.google.com/search?q={search}:"
    r = requests.get(url)
    data = BeautifulSoup(r.text, "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"Currently, the temperature in your region is {temp}")
Try this one; you were experiencing your problem in line 5, which is '(r.text, "html.parser")'.
Try to avoid these comma-space mistakes in the code...
Best practice would be to use the Google weather API directly. If you want to scrape, try to avoid selecting your elements by classes, because they are often dynamic.
Instead, focus on an id if possible, or use the HTML structure:
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
Example:
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/search?q=temperature"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.5'}, cookies={'CONSENT': 'YES+'})
soup = BeautifulSoup(response.text, 'html.parser')

for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break

Find multiple tags in BeautifulSoup4 and insert them into one string

I have some code that gets your Pastebin data:
import urllib.parse
import urllib.request

def user_key():
    user_key_data = {'api_dev_key': 'my-dev-key',
                     'api_user_name': 'my-username',
                     'api_user_password': 'my-password'}
    req = urllib.request.urlopen('https://pastebin.com/api/api_login.php',
                                 urllib.parse.urlencode(user_key_data).encode('utf-8'),
                                 timeout=7)
    return req.read().decode()

def user_pastes():
    data = {'api_dev_key': 'my-dev-key',
            'api_user_key': user_key(),
            'api_option': 'list'}
    req = urllib.request.urlopen('https://pastebin.com/api/api_post.php',
                                 urllib.parse.urlencode(data).encode('utf-8'),
                                 timeout=7)
    return req.read().decode()
Every paste has unique HTML tags, e.g. url, title, paste key, etc.
The above code will print these out per paste.
I made code that only takes certain tags: the paste url, paste title and the paste key.
my_pastes = []
src = user_pastes()
soup = BeautifulSoup(src, 'html.parser')
for paste in soup.findAll(['paste_url', 'paste_title', 'paste_key']):
    my_pastes.append(paste.text)
print(my_pastes)
What I want is to join the url, title and key per paste together into one string.
I tried using the .join method, but it only joins the characters (might not make sense, but you'll see when you try it).
Unrelated to the problem: once they're joined, I'll split them again and put them in a PyQt5 table.
So this is kind of the answer, but I'm still looking for simpler code:
title = []
key = []
url = []
src = user_pastes()
soup = BeautifulSoup(src, 'html.parser')
for paste_title in soup.findAll('paste_title'):
    title.append(paste_title.text)
for paste_key in soup.findAll('paste_key'):
    key.append(paste_key.text)
for paste_url in soup.findAll('paste_url'):
    url.append(paste_url.text)
for i in range(len(title)):
    print(title[i], key[i], url[i])
Maybe from this answer you'll get the idea of what I want to achieve, since the post was kind of confusing because I couldn't really express what I want.
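A somewhat simpler sketch of the same idea, assuming the three lists line up one to one per paste: zip pairs the i-th entries together, and each triple is joined into one string.
# Hypothetical sample data, standing in for the lists built above.
title = ['first paste', 'second paste']
key = ['abc123', 'def456']
url = ['https://pastebin.com/abc123', 'https://pastebin.com/def456']

# Pair the i-th entries of each list and join every triple into one string.
# The separator is arbitrary; pick one that cannot appear inside a title.
joined = ['\t'.join(parts) for parts in zip(title, key, url)]
for row in joined:
    print(row)

# Splitting a row back apart later (e.g. for the PyQt5 table):
t, k, u = joined[0].split('\t')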

How do I get the first 3 sentences of a webpage in python?

I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display them. Finding the webpage text is easy enough, but I'm having problems figuring out how to find the first 3 sentences.
import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)
output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]
for t in text:
    if (t.parent.name not in blacklist):
        output += '{} '.format(t)
tempout = output.split('.')
tempout = tempout[:3]  # keep only the first three '.'-separated chunks
output = '.'.join(tempout)
print(output)
Finding sentences out of text is difficult. Normally you would look for characters that might complete a sentence, such as '.' and '!'. But a period ('.') could appear in the middle of a sentence as in an abbreviation of a person's name, for example. I use a regular expression to look for a period followed by either a single space or the end of the string, which works for the first three sentences, but not for any arbitrary sentence.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break
print(sentences)
Prints:
['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
To scrape the first three sentences, just add these lines to your code:
section = soup.find('section',class_ = "article_text post") #Finds the section tag with class "article_text post"
txt = section.p.text #Gets the text within the first p tag within the variable section (the section tag)
print(txt)
Output:
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
Hope that this helps!
Actually, using Beautiful Soup you can filter by the class "article_text post" (see the page source):
myData = soup.find('section', class_="article_text post")
print(myData.p.text)
and get the inner text of the p element.

Beautifulsoup can't find text

I'm trying to write a scraper in Python using urllib and Beautiful Soup. I have a CSV of URLs for news stories, and the scraper works for ~80% of the pages, but when there is a picture at the top of the story the script no longer pulls the time or the body text. I am mostly confused because soup.find and soup.find_all don't seem to produce different results. I have tried a variety of different tags that should capture the text, as well as 'lxml' and 'html.parser'.
Here is the code:
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []
data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        titlemess = soup.find(id="title").get_text()  # getting the title
        titlestring = str(titlemess)  # make it a string
        title = titlestring.replace("\n", "").replace("\r", "")
        titles1.append(title)
        bodymess = soup.find(class_="article").get_text()  # get the body with markup
        bodystring = str(bodymess)  # make body a string
        body = bodystring.replace("\n", "").replace("\u3000", "")  # scrub markup
        bodies1.append(body)  # add to list for export
        timemess = soup.find('span', {"class": "time"}).get_text()
        timestring = str(timemess)
        time = timestring.replace("\n", "").replace("\r", "").replace("年", "-").replace("月", "-").replace("日", "")
        times1.append(time)
        testcount = testcount + 1  # counter
        print(testcount)
    except Exception as e:
        print(testcount, e)
And here are some of the results I get (those marked 'nonetype' are the ones where the title was successfully pulled but body/time is empty)
1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm
2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'
Any help would be much appreciated! Thanks.
EDIT: I don't have '10 reputation points' so I can't post more links to test but will comment with them if you need more examples of pages.
The issue is that there is no class="article" on the page with the picture in it, and the same goes for "class":"time". Consequently, it seems that you'll have to detect whether there's a picture on the page, and if there is, search for the date and text as follows:
For the date, try:
timemess = soup.find(id="pubtime").get_text()
For the body text, it seems that the article is rather just the caption for the picture. Consequently, you could try the following:
bodymess = soup.find('img').findNext().get_text()
In brief, the soup.find('img') finds the image and findNext() goes to the next block which, coincidentally, contains the text.
Thus, in your code, I would do something as follows:
try:
    bodymess = soup.find(class_="article").get_text()
except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span', {"class": "time"}).get_text()
except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
As a general flow in web scraping, I usually open the website in a browser first and locate the elements I need in the page structure there.
