The element I am looking through has no class attribute. I have tried both of these approaches. When I search with containers = soup.find_all('div', class_='today_nowcard-temp') it returns the line that has the value I want, but I have no way of extracting the number. Because the line I want has no class attribute name, I can't get the number directly from the find function. If I do containers = soup.find('span', class_=None) it doesn't return the value I want. This is the whole of what I have so far:
import requests
from bs4 import BeautifulSoup
url = 'https://weather.com/en-CA/weather/today/l/8663b88e4a1c7d6068b7f33e360396ac1c89f3dde9533082cd342aef06ad1e87'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
containers = soup.find_all('div', class_='today_nowcard-temp')
print(containers)
Use find() instead of find_all() and then call find_next('span'):
import requests
from bs4 import BeautifulSoup
url = 'https://weather.com/en-CA/weather/today/l/8663b88e4a1c7d6068b7f33e360396ac1c89f3dde9533082cd342aef06ad1e87'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
containers = soup.find('div', class_='today_nowcard-temp')
print(containers.find_next('span').text)
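If you need the bare number rather than the rendered text, a small follow-up sketch (this assumes the span's text renders as something like "18°" on this page; adjust to the actual markup):
temp_text = containers.find_next('span').text
# Assumption: temp_text looks like "18°"; strip the degree sign to get an int
temp_value = int(temp_text.rstrip('°'))
print(temp_value)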
If you want to use find_all(), iterate over the results like below:
containers = soup.find_all('div', class_='today_nowcard-temp')
for item in containers:
    print(item.find_next('span').text)
Or you can use a CSS selector:
print(soup.select_one("div.today_nowcard-temp > span").text)
I'm just trying to get data from a webpage called "Elgiganten", at this URL: https://www.elgiganten.se/
I want to get each product's name and URL. When I tried to get the a tag I got an empty list, but I could get the span tag, even though they were in the same div tag.
Here is the whole code:
from bs4 import BeautifulSoup
import requests
respons = requests.get("https://www.elgiganten.se")
soup = BeautifulSoup(respons.content, "lxml")
g_data = soup.find_all("div", {"class": "col-flex S-order-1"})
for item in g_data:
    print(item.contents[1].find_all("span")[0])
    print(item.contents[1].find_all("a", {"class": "product-name"}))
I hope someone can tell me why the a tag seems to be invisible, and how to fix the issue.
Go for the a tags directly. You can extract both the product name and the URL from that tag:
from bs4 import BeautifulSoup
import requests
respons = requests.get("https://www.elgiganten.se")
soup = BeautifulSoup(respons.content, "lxml")
g_data = soup.find_all("a", {"class": "product-name"}, href=True)
for item in g_data:
    print(item['title'], item['href'])
If you wish to stick to the way you started, the following is how you can achieve that:
import requests
from bs4 import BeautifulSoup
respons = requests.get("https://www.elgiganten.se")
soup = BeautifulSoup(respons.text,"lxml")
for item in soup.find_all(class_="mini-product-content"):
    product_name = item.find("span", class_="table-cell").text
    product_link = item.find("a", class_="product-name").get("href")
    print(product_name, product_link)
Try:
g_data = soup.find_all("a", class_="product-name")
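For completeness, a minimal sketch of using those tags, assuming (as in the first answer) that each anchor carries title and href attributes:
from bs4 import BeautifulSoup
import requests

respons = requests.get("https://www.elgiganten.se")
soup = BeautifulSoup(respons.content, "lxml")

for a in soup.find_all("a", class_="product-name"):
    # .get() returns None instead of raising if an attribute is missing
    print(a.get("title"), a.get("href"))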
I want to extract the text here (the element contains a lot of text). I used:
import requests
from bs4 import BeautifulSoup

url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url, headers=headers)  # headers is defined earlier in my script
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class": "bbcode bbcode--profile-page"})
but it always returns "None" in the terminal.
How can I go about this?
The link is "https://osu.ppy.sh/users/1521445".
(This is a repost since the old question was very old. I don't know if I should have made another question or not.)
Data is dynamically loaded from a script tag, so, as in the other answer, you can grab it from there. Target the tag by its id, pull out the relevant JSON, extract the HTML from that JSON, and then parse that HTML, which would otherwise have been loaded dynamically on the page (at this point you can use your original class selector):
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
You could try this:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script", {"id": re.compile(r"json-user")})
result = re.findall(r'raw\":(.+)},\"previous_usernames', x[0].text.strip())
print(result)
I'm not sure why the div with class='bbcode bbcode--profile-page' is a string inside the script tag with id='json-user'; that's why you can't get its value by searching for the div with class='bbcode bbcode--profile-page' directly.
Hope this helps.
I am curious why I cannot get the div elements with this class as follows (this worked before, but on different sites). Maybe it is an issue with this site?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url = "https://www.docmorris.de/produkte/abnehmen"
page=requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, features="lxml")
divs = soup.find_all("div", attrs={"class": "l-product mod-standard product list-item ff-slider"})
print(divs)
prints an empty list. I want all div elements with class 'l-product mod-standard product list-item ff-slider'.
You only need a single class from the multi-valued class attribute, and that will be less fragile. Also, remove headers.
from bs4 import BeautifulSoup
import requests
url = "https://www.docmorris.de/produkte/abnehmen"
page = requests.get(url)
soup = BeautifulSoup(page.content, features="lxml")
divs = soup.select('.l-product')
print(divs)
The multi-valued (more fragile) would be:
divs = soup.select('.l-product.mod-standard.product.list-item.ff-slider')
Or (as noted in the comments, ensure it is on one line):
divs = soup.find_all("div",attrs={"class": "l-product mod-standard product list-item ff-slider"})
I'm new-ish to Python and started experimenting with Beautiful Soup 4. I tried writing code that would get all the links on one page and then, with those links, repeat the process until I had an entire website parsed.
import bs4 as bs
import urllib.request as url
links_unclean = []
links_clean = []
soup = bs.BeautifulSoup(url.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')
for url in soup.find_all('a'):
    print(url.get('href'))
    links_unclean.append(url.get('href'))

for link in links_unclean:
    if link[:8] == 'https://':
        links_clean.append(link)

print(links_clean)

while True:
    for link in links_clean:
        soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
        for url in soup.find_all('a'):
            print(url.get('href'))
            links_unclean.append(url.get('href'))
        for link in links_unclean:
            if link[:8] == 'https://':
                links_clean.append(link)
        links_clean = list(dict.fromkeys(links_clean))
        input()
But I'm now getting this error:
'NoneType' object is not callable
line 20, in
soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
Can you please help?
Be careful when importing modules under an alias. In this case, url (imported on line 2) gets overridden in your for loop when you iterate.
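A minimal sketch of the fix, keeping an alias for urllib.request but choosing names that no longer collide:
import bs4 as bs
import urllib.request as request  # alias distinct from any loop variable

soup = bs.BeautifulSoup(request.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')
for tag in soup.find_all('a'):  # 'tag' instead of 'url', so the module stays reachable
    print(tag.get('href'))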
Here is a shorter solution that will also give back only URLs containing https as part of the href attribute:
from bs4 import BeautifulSoup
from urllib.request import urlopen
content = urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('body')
for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        if 'https' in link['href']:
            print(link['href'])
However, this paints an incomplete picture, as not all links are captured because of errors in the page's HTML tags. May I also recommend the following alternative, which is very simple and works flawlessly in your scenario (note: you will need the Requests-HTML package):
from requests_html import HTML, HTMLSession
session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')
for link in r.html.absolute_links:
    print(link)
This will output all URLs, including both those that reference other URLs on the same domain and those that are external websites.
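If you later want to split same-domain links from external ones, a small sketch (the split is my own illustration on top of Requests-HTML, not part of its API):
from urllib.parse import urlparse
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')

# Compare each absolute link's host against the page's own host
base_netloc = urlparse(r.url).netloc
internal = {link for link in r.html.absolute_links if urlparse(link).netloc == base_netloc}
external = r.html.absolute_links - internal
print(len(internal), 'internal,', len(external), 'external')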
I would consider using an attribute = value CSS selector with the ^ operator to specify that the href attributes begin with https. You will then only have valid protocols. Also, use set comprehensions to ensure no duplicates, and a Session to re-use the connection.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
final = []
with requests.Session() as s:
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)

tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)
I am trying to parse a div class from an HTML table on Amazon, and when I run the code, find_all() sometimes returns the right div classes I am looking for, and other times it returns an empty list. Any ideas on why the results vary?
I am pulling from this url: https://www.amazon.com/dp/B0767653BK
My code:
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.amazon.com/dp/B0767653BK')
page = req.text
BSoup = BeautifulSoup(page, 'html.parser')
divClass = BSoup.find_all('div', class_='a-section a-spacing-none a-padding-none overflow_ellipsis')
It is better to use a BeautifulSoup CSS selector when trying to find all elements with a combination of CSS classes:
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.amazon.com/dp/B0767653BK')
soup = BeautifulSoup(req.text, 'html.parser')
for div_class in soup.select('div.a-section.a-spacing-none.a-padding-none.overflow_ellipsis'):
    print(div_class.get_text(strip=True))
This is preferable as it allows the four class elements to be present in any order. So if the page decides to change the ordering of the classes, it will still find them.
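To see why the ordering matters, a minimal standalone sketch (the class names here are made up for illustration):
from bs4 import BeautifulSoup

html = '<div class="b a">match me</div>'
soup = BeautifulSoup(html, 'html.parser')

# Matching a multi-class string is an exact string match, so order matters: finds nothing
print(soup.find_all('div', class_='a b'))  # []

# A CSS selector matches regardless of the order the classes appear in
print(soup.select('div.a.b'))  # [<div class="b a">match me</div>]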
Take a look at Searching by CSS class in the documentation.