BeautifulSoup does something weird and I can't figure out why.
import requests
from bs4 import BeautifulSoup
url = "nsfw"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
cards = soup.find_all("div", {"class": "card-body"})
cards.pop(0)
cards.pop(0)
cards.pop(0) # i really like to pop
texte = []
print(soup)
for i, card in enumerate(cards):
    texte.append(card.text)
    if i == len(cards)-1:
        print(card)
What I expect it to do is get the divs and put the text of the divs into the list. And it does work, for the first 8 out of 9 divs. The 9th div is extremely shortened. Result of the print:
<div class="card-body" id="card_Part_9"><p class="storytext"><span class="brk2_firstwords">“Door’s open,” Brendan shouted.</span></p>
<p class="storytext">Jeffrey</p></div>
But on the website itself it doesn't end there. Here is a screenshot: https://i.imgur.com/CmvYzfJ.png
Why does this happen? What can I do to prevent it? I have already tried changing the parser, but that does not change the result. The site does not use JavaScript to load content.
Structure when opening with a browser: https://pastebin.com/N2bPYFBD
But when I print(soup) I get:
<p class="storytext">Jeffrey</p></div></div></div></div></div></div></div></body></html> entered the apartment
Seems like html.parser messes up the DOM. The lxml parser works for me:
import requests
from bs4 import BeautifulSoup
url = "six-pack-thingy"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
cards = soup.find_all("div", {"class": "card-body"})
texte = [card.text for card in cards[3:]]
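If lxml is not installed, BeautifulSoup raises FeatureNotFound, so a fallback is cheap to add. A minimal sketch of that pattern (the inline HTML string is a stand-in for r.text, since the original URL is not reproducible here):

```python
from bs4 import BeautifulSoup, FeatureNotFound

# hypothetical stand-in for r.text from the story page
html = '<div class="card-body"><p>one</p></div><div class="card-body"><p>two</p></div>'

try:
    # prefer the more forgiving lxml tree builder when available
    soup = BeautifulSoup(html, 'lxml')
except FeatureNotFound:
    # fall back to the stdlib parser
    soup = BeautifulSoup(html, 'html.parser')

cards = soup.find_all("div", {"class": "card-body"})
print(len(cards))  # 2
```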
Thought I could post my scribble as well:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('six-pack-thingy')
# find_elements_by_class_name was removed in Selenium 4; use find_elements with By
elems = driver.find_elements(By.CLASS_NAME, 'card-body')
texte = [t.text for t in elems[3:]]
You will have to get some webdriver to run selenium, though. Are you familiar with that?
Related
from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')
a = 'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page='
for c in range(8):
    # a = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={c}'
    cd = driver.get(a+str(c))
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    fetch_data = bs.find_all('div', {'class': 's-expand-height.s-include-content-margin.s-latency-cf-section.s-border-bottom'})
    for f_data in fetch_data:
        product_name = f_data.find('span', {'class': 'a-size-medium.a-color-base.a-text-normal'})
        print(product_name + '\n')
The problem here is that the webdriver successfully visits 7 pages, but the script doesn't produce any output or raise an error.
I don't know where I'm going wrong.
Any suggestions, or a reference to an article that addresses this problem, are always welcome.
You are not selecting the right div tag to fetch the products with BeautifulSoup, which is why there is no output.
Try the following snippet:
# range of pages
for i in range(1, 20):
    driver.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={i}')
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    # get search results
    products = bs.find_all('div', {'data-component-type': "s-search-result"})
    # for each product in the search results, print the product name
    for product in products:
        product_name = product.find('span', class_="a-size-medium a-color-base a-text-normal")
        if product_name:
            print(product_name.text)
You can print bs or fetch_data to debug.
In any case, in my opinion you can use requests or urllib to get the page source instead of selenium.
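For reference, here is the same extraction run against a local HTML sample. The markup below is a simplified, invented stand-in for Amazon's real search results (which in practice usually also require request headers and may block automated clients):

```python
from bs4 import BeautifulSoup

# hypothetical, simplified stand-in for one Amazon search-results page
page_source = '''
<div data-component-type="s-search-result">
  <span class="a-size-medium a-color-base a-text-normal">Fire 7 tablet</span>
</div>
<div data-component-type="s-search-result">
  <span class="a-size-medium a-color-base a-text-normal">Echo Dot</span>
</div>
'''

bs = BeautifulSoup(page_source, 'html.parser')
# the data-component-type attribute is a more stable hook than the layout classes
products = bs.find_all('div', {'data-component-type': 's-search-result'})
names = [p.find('span', class_='a-size-medium a-color-base a-text-normal').text
         for p in products]
print(names)  # ['Fire 7 tablet', 'Echo Dot']
```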
I am new to Python and web scraping. I wrote some code to scrape quotes and the corresponding author names from https://www.brainyquote.com/topics/inspirational-quotes and ended up with no result. Here is the code I used:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"C:\Users\Sandheep\Desktop\chromedriver.exe")
product = []
prices = []
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
for a in soup.findAll("a", href=True, attrs={"class": "clearfix"}):
    quote = a.find("a", href=True, attrs={"title": "view quote"}).text
    author = a.find("a", href=True, attrs={"class": "bq-aut"}).text
    product.append(quote)
    prices.append(author)
print(product)
print(prices)
I am not seeing where I need to edit to get the result.
Thanks in advance!
As I understand it, the site has this information in the alt attribute of its images, with the quote and author separated by ' - '.
So you need to iterate over soup.find_all('img'); the function to fetch results may look like this:
def fetch_quotes(soup):
    for img in soup.find_all('img'):
        try:
            quote, author = img['alt'].split(' - ')
        except ValueError:
            pass
        else:
            yield {'quote': quote, 'author': author}
Then, use it like: print(list(fetch_quotes(soup)))
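To see the splitting logic in isolation, here is the same generator run against a small inline sample (the alt strings are invented for illustration):

```python
from bs4 import BeautifulSoup

def fetch_quotes(soup):
    # yield one dict per <img> whose alt text splits cleanly into quote and author
    for img in soup.find_all('img'):
        try:
            quote, author = img['alt'].split(' - ')
        except ValueError:
            pass  # alt text without exactly one ' - ' separator is skipped
        else:
            yield {'quote': quote, 'author': author}

html = '''
<img alt="Stay hungry, stay foolish. - Steve Jobs">
<img alt="logo">
'''
soup = BeautifulSoup(html, 'html.parser')
print(list(fetch_quotes(soup)))
```

The second image ("logo") is silently skipped because its alt text has no ' - ' separator, so the unpacking raises ValueError.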
Also, note that you can often replace selenium with pure requests, e.g.:
import requests
from bs4 import BeautifulSoup
content = requests.get("https://www.brainyquote.com/topics/inspirational-quotes").content
soup = BeautifulSoup(content, "lxml")
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"ChromeDriver path")
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
root_tag=["div", {"class":"m-brick grid-item boxy bqQt r-width"}]
quote_author=["a",{"title":"view author"}]
quote=[]
author=[]
all_data = soup.findAll(root_tag[0], root_tag[1])
for div in all_data:
    try:
        quote.append(div.find_all("a", {"title": "view quote"})[1].text)
        author.append(div.find(quote_author[0], quote_author[1]).text)
    except (AttributeError, IndexError):
        continue
The output will be:
for i in range(len(author)):
    print(quote[i])
    print(author[i])
    break
Start by doing what's necessary; then do what's possible; and suddenly you are doing the impossible.
Francis of Assisi
I'm trying to take a name from an HTML page with BeautifulSoup:
import urllib.request
from bs4 import BeautifulSoup
nightbot = 'https://nightbot.tv/t/tonyxzero/song_requests'
page = urllib.request.urlopen(nightbot)
soup = BeautifulSoup(page, 'html5lib')
list_item = soup.find('strong', attrs={'class': 'ng-binding'})
print(list_item)
But when I print(list_item) I get None as the reply. Is there a way to fix it?
The webpage is rendered by JavaScript, so you have to use a package like selenium to get what you want.
You can try this:
CODE:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://nightbot.tv/t/tonyxzero/song_requests')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
list_item = soup.find('strong', attrs={'class': 'ng-binding'})
print(list_item)
RESULT:
<strong class="ng-binding" ng-bind="$state.current.title">Song Requests: TONYXZERO</strong>
I'm trying to get the length of a YouTube video with BeautifulSoup. By inspecting the site I can find this line: <span class="ytp-time-duration">6:14:06</span>, which seems perfect for getting the duration, but I can't figure out how to do it.
My script:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
soup = BeautifulSoup(response.text, "html5lib")
mydivs = soup.findAll("span", {"class": "ytp-time-duration"})
print(mydivs)
The problem is that the output is []
From the documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/:
soup.findAll('span', attrs={"class": u"ytp-time-duration"})
Your syntax is an equivalent shortcut, though, so I would rather suspect that the span.ytp-time-duration element is not present when the YouTube page first loads; it is only added by JavaScript afterwards. That's why the soup does not pick it up.
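As a workaround, the raw watch-page HTML embeds player metadata as JSON, which at the time of writing includes a "lengthSeconds" field; that this field exists and keeps its current shape is an assumption. A sketch of extracting it with a regex, shown against an invented inline excerpt of that JSON rather than a live request:

```python
import re

# hypothetical excerpt of the player JSON embedded in the raw watch-page HTML;
# with requests you would search response.text instead
html = '..."videoDetails":{"videoId":"_uQrJ0TkZlc","lengthSeconds":"22446"}...'

match = re.search(r'"lengthSeconds"\s*:\s*"(\d+)"', html)
seconds = int(match.group(1))
# format as H:MM:SS
duration = f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"
print(duration)  # 6:14:06
```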
I am still new to Python, and especially to BeautifulSoup. I've been reading up on this stuff for a few days and playing around with a bunch of different code, with mixed results. However, on this page is the Bitcoin price I would like to scrape. The price is located in:
<span class="text-large2" data-currency-value="">$16,569.40</span>
Meaning that I'd like my script to print only the line where the value is. My current code prints the whole page, which doesn't look very nice since it's a lot of data. Could anybody please help me improve my code?
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find('text-large2', attrs={'class': 'stripe'})
for row in soup.findAll('div'):
    for cell in row.findAll('tr'):
        print cell.text
And this is a snip of the output I get after running the code. It doesn't look very nice or readable.
#SourcePairVolume (24h)PriceVolume (%)Updated
1BitMEXBTC/USD$3,280,130,000$15930.0016.30%Recently
2BithumbBTC/KRW$2,200,380,000$17477.6010.94%Recently
3BitfinexBTC/USD$1,893,760,000$15677.009.41%Recently
4GDAXBTC/USD$1,057,230,000$16085.005.25%Recently
5bitFlyerBTC/JPY$636,896,000$17184.403.17%Recently
6CoinoneBTC/KRW$554,063,000$17803.502.75%Recently
7BitstampBTC/USD$385,450,000$15400.101.92%Recently
8GeminiBTC/USD$345,746,000$16151.001.72%Recently
9HitBTCBCH/BTC$305,554,000$15601.901.52%Recently
Try this:
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find("div", {"class": "col-xs-6 col-sm-8 col-md-4 text-left"}).find("span", {"class": "text-large2"})
for i in div:
    print i
This prints 16051.20 for me.
Later Edit: and if you put the above code in a function and loop it it will constantly update. I get different values now.
This works. But I think you are using an older version of BeautifulSoup; try pip install bs4 in Command Prompt or PowerShell, then:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
value = soup.find('span', {'class': 'text-large2'})
print(''.join(value.stripped_strings))
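The stripped_strings generator yields the tag's text fragments with surrounding whitespace removed; run against just the snippet quoted in the question, it gives the bare price:

```python
from bs4 import BeautifulSoup

# the snippet quoted in the question, used as a self-contained sample
html = '<span class="text-large2" data-currency-value="">$16,569.40</span>'
soup = BeautifulSoup(html, 'html.parser')
value = soup.find('span', {'class': 'text-large2'})
print(''.join(value.stripped_strings))  # $16,569.40
```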