I am new to Python and I've seen multiple tutorial videos online about web scraping.
This is the element from the targeted website:
<span class="status ng-binding"> 14 </span>
And this is my coding:
import requests
import bs4
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
res = requests.get('https://gleam.io/cevFk/castrio-october-streaming-pc-giveaway?gsr=cevFk-SxAtZtT4Ir', headers=headers)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print(soup.select("#status ng-binding"))
I'm trying to extract/output the number (which is 14) from the targeted website. Am I doing something wrong? Any answers is greatly appreciated.
soup.find('span',{'class':'classname'},recursive=True).text
can add more attributes:
{'attr':'value','attr':'value'}
find_all() returns a list
recusrive searches in nested tags
ewrewr
The problem is that the expected span is missing in the raw response.
You can verify by following these steps.
response = requests.get(url, headers)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
#print soup.prettify()
# print(soup.prettify())
print('title',soup.title)
print('fiind all spans', soup.find_all('span'))
Related
Hello developers,
I am trying to learn how to scrape, and I came across this website here.
I want to get the link to the map using requests and beautifulsoup library.
This is what I have done so far
URL = 'https://www.zoopla.com/new-homes/details/61239885/?search_identifier=0dcdcfea4b6e6e84e1a93c25c4c0d808'
headers = {'User-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
res = requests.get(URL, timeout=15,
headers = headers)
soup = BeautifulSoup(res.content, 'html.parser')
soup.find('div', class_=re.compile('MapInner'))
<div class="css-1md2b5a-MapInner e74mx470"></div>
I do able to locate the parent tag, but there is no img tag, I believe it is because it has been not rendered.
What can be done here, If I don't want to use selenium?
Im learning Beautiful Soup and I dont know what I could be doing wrong, I'm using soup.find on an id, and Ive tried this on multiple different sites, and I run it and it always returns None.
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
def stock_check():
page = requests.get(site, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('span', id = 'productTitle')
print(title)
stock_check()
There are 3 errors in your code:
1.incorrect locator
2.not invoking text
3.not inject cookies
Now your code is working fine:
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
cookies={'session':'141-2320098-4829807'}
def stock_check():
page = requests.get(site, headers = headers,cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('span', attrs={'id':'productTitle'})
print(title.get_text(strip=True))
stock_check()
Output:
PlayStation 5 Console
The answers of HedgeHog and Fazlul are of course correct, but I want to comment on this.
When you scrape something from the web and try to extract tags from the HTML but get nothing, first check the whole HTML document you recieved to make sure it's what you expected. Personally I just print out soup.prettify() to debug this, as explained in BeautifulSoup's Quick Start:
Another nifty trick if the HTML is impractical to read is to paste it into a HTML previewer like this one, and we get the answer quickly.
BeautifulSoup can be a great tool, but you need to pay attention when using it.
I'm trying to scrape a website, but it returns blank, can you help please? what am i missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://ks.wjx.top/jq/50921280.aspx'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.text)
To get a response, add the User-Agent header to requests.get(), otherwise, the website thinks that your a bot, and will block you.
import requests
from bs4 import BeautifulSoup
URL = "https://ks.wjx.top/jq/50921280.aspx"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
print(soup.prettify())
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.amazon.com/s?k=iphone+5s&ref=nb_sb_noss')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
all = soup.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})
print(all)
so this is my simple script of python trying to scrape a page in amazon but not all the html is returned in the "soup" variable therefor i get nothing when trying to find a specific series of tags and extract them.
Try the below code, it should do the trick for you.
You actually missed to add headers in your code
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
url = 'https://www.amazon.com/s?k=iphone+5s&ref=nb_sb_noss'
response = requests.get(url, headers=headers)
print(response.text)
soup = BeautifulSoup(response.content, features="lxml")
my_all = soup.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})
print(my_all)
Here is my python code:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
it works for google.com and many other websites, but it doesn't work for amazon.com.
I can open amazon.com in my browser, but the resulting "soup" is still none.
Besides, I find that it cannot scrape from appannie.com, either. However, rather than give none, the code returns an error:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
So I doubt whether Amazon and App Annie block scraping.
Add a header, then it will work.
from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"
# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print soup
You can try this:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
In python arbitrary text is called a string and it must be enclosed in quotes(" ").
I just ran into this and found that setting any user-agent will work. You don't need to lie about your user agent.
response = HTTParty.get #url, headers: {'User-Agent' => 'Httparty'}
Add a header
import urllib2
from bs4 import BeautifulSoup
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup