Web scraping with beautifulsoup not finding anything

I'm trying to scrape coinmarketcap.com to get the current price of a particular currency, and also to learn how to web scrape. I'm still a beginner and can't figure out where I'm going wrong: whenever I run the script it just prints None, although I know that element exists on the page. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('data-currency-price data-usd=')
print (price)

If you are going to be doing a lot of this, consider making a single call to the official API to get all the prices, then extracting what you want. The following is from the site, with an amendment by me to show the desired value for Electroneum. The API guidance also shows how to retrieve one currency at a time, though that requires a higher plan than the basic one.
from requests import Request, Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'limit': '5000',
    'convert': 'USD',
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': 'yourKey',
}
session = Session()
session.headers.update(headers)
try:
    response = session.get(url, params=parameters)
    # print(response.text)
    data = json.loads(response.text)
    print(data['data'][64]['quote']['USD']['price'])
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
You can always use a loop and check each item against a list of the names you are interested in, e.g.:
interested = ['Electroneum', 'Ethereum']
for item in data['data']:
    if item['name'] in interested:
        print(item)
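The hard-coded index in data['data'][64] only works while Electroneum happens to sit at that position in the listing. A minimal sketch of a lookup by name instead, using a small made-up payload that only mimics the shape used above (the prices and the find_price helper are illustrative assumptions, not part of the API):

```python
# Made-up sample shaped like the 'data' list used above; values are illustrative only.
sample = {
    "data": [
        {"name": "Bitcoin", "quote": {"USD": {"price": 100.0}}},
        {"name": "Ethereum", "quote": {"USD": {"price": 10.0}}},
        {"name": "Electroneum", "quote": {"USD": {"price": 0.006778}}},
    ]
}


def find_price(payload, name, currency="USD"):
    """Return the price for the entry called `name`, or None if it is absent."""
    for item in payload["data"]:
        if item["name"] == name:
            return item["quote"][currency]["price"]
    return None


print(find_price(sample, "Electroneum"))
print(find_price(sample, "NoSuchCoin"))
```

Returning None for a missing name lets the caller decide what to do, rather than failing with an IndexError or KeyError.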
For your current example, you can use an attribute selector for data-currency-value:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('[data-currency-value]').text)

You can use the class attribute to get the value.
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('span', attrs={"class": "h2 text-semi-bold details-panel-item--price__value"})
print(price.text)
Output:
0.006778
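Passing the whole class string only matches when the attribute is written exactly that way. As a sketch on a made-up snippet (the HTML below is an assumption, with the classes deliberately reordered), matching on a single distinctive class is less brittle:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes as above, but in a different order.
html = '<span class="details-panel-item--price__value h2 text-semi-bold">0.006778</span>'
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute fails once the order changes.
exact = soup.find("span", attrs={"class": "h2 text-semi-bold details-panel-item--price__value"})

# Matching one distinctive class works regardless of order.
by_class = soup.find("span", class_="details-panel-item--price__value")
by_css = soup.select_one("span.details-panel-item--price__value")

print(exact)
print(by_class.text, by_css.text)
```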

You can get the value like this:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find("span", id="quote_price").get('data-usd')
print (price)

You should try to be more specific about how you want to find the item.
You are currently using soup.find(''), and I am not sure what you have put inside it, as you wrote data-currency-price data-usd=
Is that an ID or a class name? The first argument to find is a tag name, and that string is not one.
Why not try finding the item by its ID?
soup.find(id="link3")
Or find it by tag:
soup.find("relevant tag name like div or a")
Or combine the two, something like this:
find_this = soup.find("a", id="ID HERE")
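To make the difference concrete, here is a small sketch run against an inline snippet rather than the live site. The HTML is an assumed reconstruction based only on the attribute names and the ID mentioned in this thread:

```python
from bs4 import BeautifulSoup

# Assumed reconstruction of the price element, based on names quoted in this thread.
html = '<span id="quote_price" data-currency-price data-usd="0.006778">0.006778</span>'
soup = BeautifulSoup(html, "html.parser")

# The first positional argument is a tag name, so this looks for a tag
# literally named 'data-currency-price data-usd=' and finds nothing.
wrong = soup.find('data-currency-price data-usd=')

# Look the element up by attribute or by ID instead.
by_attr = soup.find(attrs={"data-usd": True})
by_id = soup.find("span", id="quote_price")

print(wrong)
print(by_attr["data-usd"], by_id.get("data-usd"))
```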

import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
x = soup.find(id="quote_price").text
print(x)
It is better to search by ID. Note that find_all(text=...) matches text nodes, not attribute names such as data-currency-price or data-usd, so it will not find this element.

Related

Extracting json when web scraping

I was following a python guide on web scraping and there's one line of code that won't work for me. I'd appreciate it if anybody could help me figure out what the issue is, thanks.
from bs4 import BeautifulSoup
import json
import re
import requests
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile('root\.App\.main'))
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
Error Message:
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
AttributeError: 'NoneType' object has no attribute 'string'
Link to the guide I was looking at: https://www.mattbutton.com/how-to-scrape-stock-upgrades-and-downgrades-from-yahoo-finance/

The main issue, in my opinion, is that you should add a User-Agent header to your request so that you get the expected HTML:
headers = {'user-agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
Note: First of all, always take a closer look at your soup to check whether the expected information is actually available.
Example
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
# Send a browser-like User-Agent so the server returns the full page
headers = {'user-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
# Name the parser explicitly (same one as in the question)
soup = BeautifulSoup(page.content, 'lxml')
script = soup.find('script', text=re.compile(r'root\.App\.main'))
json_text = json.loads(re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', script.string, flags=re.MULTILINE).group(1))
json_text
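The AttributeError in the question comes from using the result of find without checking it. A minimal sketch of the same extraction with a guard, run on made-up inline pages instead of the live site (the extract_app_state helper and both sample strings are assumptions for illustration):

```python
import json
import re
from bs4 import BeautifulSoup

PATTERN = re.compile(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', flags=re.MULTILINE)


def extract_app_state(html):
    """Return the parsed root.App.main object, or None if the page lacks it."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", string=re.compile(r"root\.App\.main"))
    if script is None:  # the expected script tag is not in this response
        return None
    match = PATTERN.search(script.string)
    return json.loads(match.group(1)) if match else None


# Made-up pages standing in for the two kinds of response.
good = '<html><script>\nroot.App.main = {"context": {"ticker": "AAPL"}};\n</script></html>'
bad = "<html><body>Please enable JavaScript</body></html>"

print(extract_app_state(good))
print(extract_app_state(bad))
```

When the helper returns None, printing the start of the response body is usually enough to see what the server actually sent.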

Soup works on one IMDb page but not on another. How to solve?

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')
#using unique tag for each movie in the respective link
print(movie_div1)
#empty list
print(movie_div)
#gives perfect list
Why is movie_div1 giving an empty list? I am not able to identify any difference in the URL structures to indicate the code should be different. All leads appreciated.

Unfortunately, the div you want is generated by JavaScript, so you can't get it by scraping the raw HTML of the response.
You can get the movies you want from the JSON that your browser requests; then you won't need to parse the page with BeautifulSoup at all, which makes your script much faster.
A second option is to use Selenium.
Good luck.
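A quick way to check whether an element is present in the downloaded HTML or only appears after JavaScript runs is to count matches in the static markup. This is only a sketch: the count_in_static_html helper and both sample strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def count_in_static_html(html, css_selector):
    """Count elements matching css_selector in the HTML exactly as downloaded."""
    return len(BeautifulSoup(html, "html.parser").select(css_selector))


# Made-up stand-ins: one page ships the markup, the other only ships a script.
server_rendered = '<div class="lister-item mode-advanced">Movie</div>'
js_rendered = '<div id="root"></div><script>/* builds the list in the browser */</script>'

print(count_in_static_html(server_rendered, "div.lister-item"))
print(count_in_static_html(js_rendered, "div.lister-item-content"))
```

A count of zero for a selector that matches in the browser's inspector suggests the element is built client-side.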

As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON is embedded within the HTML itself and is later converted to HTML by the browser's JavaScript (this is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display the movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
details = str(soup.find('span', class_='ab_widget'))
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])
imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])
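Searching for the closing ");\n" can cut the JSON short if that sequence ever appears inside the data. One alternative, sketched here on a made-up string that follows the keys used above, is json.JSONDecoder.raw_decode, which parses one JSON value and stops at its end. The extract_pushed_json helper is hypothetical, and it assumes the JSON starts immediately after the marker:

```python
import json

MARKER = "IMDbReactInitialState.push("


def extract_pushed_json(text, marker=MARKER):
    """Decode the JSON value that directly follows `marker` in `text`."""
    start = text.find(marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever comes after it.
    obj, _end = json.JSONDecoder().raw_decode(text, start + len(marker))
    return obj


# Made-up sample following the keys used in the answer above.
sample = 'IMDbReactInitialState.push({"titles": {"tt1": {"primary": {"title": "Example Movie"}}}});\n'
state = extract_pushed_json(sample)
titles = [t["primary"]["title"] for t in state["titles"].values()]
print(titles)
```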

Unable to parse a product title and its price from a webpage

I'm trying to fetch a product title and its price from a webpage, but every time I run my script I get this error `` instead of the content. I checked the page source, and the selectors I've used in my script are there.
Site link
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://www.amazon.com/dp/B01DOLQ0BY'
res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
product_name = soup.select_one("#productTitle").get_text(strip=True)
product_price = soup.select_one("[id='priceblock_ourprice']").text
print(product_name,product_price)
How can I get the product name and its price from the aforementioned site?

Change the header to the one the server expects
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.9'}
res = requests.get('https://www.amazon.com/dp/B01DOLQ0BY/', headers=headers)
soup = BeautifulSoup(res.text,"lxml")
product_name = soup.select_one("#productTitle").get_text(strip=True)
product_price = soup.select_one("[id='priceblock_ourprice']").text
print(product_name,product_price)
For different products you will need to find a selector that is common across all ASINs. For the two supplied you can use:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.9','User-Agent':'Mozilla/4.0'}
asins = ['B013TCZVVS','B01DOLQ0BY']
with requests.Session() as s:
    s.headers = headers
    for asin in asins:
        res = s.get(f'https://www.amazon.com/dp/{asin}/')
        soup = BeautifulSoup(res.text, "lxml")
        product_name = soup.select_one("#productTitle").get_text(strip=True)
        product_price = soup.select_one(".comparison_baseitem_column .a-offscreen").text
        print(product_name, product_price)

Instead of res.text, try res.content.
Also, as a debugging technique, print the request's response. That will help you see what data is being returned from the request with your current configuration.
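Alongside printing the response, it can help to look elements up in a way that reports a missing match instead of raising. A minimal sketch on a made-up response body in which the price element is absent (the text_or_none helper is hypothetical):

```python
from bs4 import BeautifulSoup


def text_or_none(soup, css_selector):
    """Return the stripped text of the first match, or None when nothing matches."""
    node = soup.select_one(css_selector)
    return node.get_text(strip=True) if node is not None else None


# Made-up response body in which the price element is missing.
html = '<span id="productTitle"> Example Product </span>'
soup = BeautifulSoup(html, "html.parser")

print(text_or_none(soup, "#productTitle"))
print(text_or_none(soup, "#priceblock_ourprice"))
```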

Web scraping with python how to get to the text

I'm trying to get the text from a website but can't find a way to do it. How should I write it?
link="https://www.ynet.co.il/articles/0,7340,L-5553905,00.html"
response = requests.get(link)
soup = BeautifulSoup(response.text,'html.parser')
info = soup.find('div', attrs={'class':'text14'})
name = info.text.strip()
print(name)
I'm getting None every time.

import requests
from bs4 import BeautifulSoup
import json
link="https://www.ynet.co.il/articles/0,7340,L-5553905,00.html"
response = requests.get(link)
soup = BeautifulSoup(response.text,'html.parser')
info = soup.findAll('script',attrs={'type':"application/ld+json"})[0].text.strip()
jsonDict = json.loads(info)
print(jsonDict['articleBody'])
The page seems to store all of the article data as JSON in a <script> tag, so try this code.
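A page can carry more than one JSON-LD block, so taking index [0] may pick the wrong one. A small sketch that checks each block for the articleBody key, run on a made-up page (the HTML and its values are illustrative assumptions):

```python
import json
from bs4 import BeautifulSoup

# Made-up page; real articles carry many more fields.
html = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Example"}</script>
<script type="application/ld+json">{"@type": "NewsArticle", "articleBody": "Example body text."}</script>
</head><body><div id="app"></div></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
article_body = None
for tag in soup.find_all("script", attrs={"type": "application/ld+json"}):
    data = json.loads(tag.string)
    # Only some JSON-LD blocks describe the article itself.
    if isinstance(data, dict) and "articleBody" in data:
        article_body = data["articleBody"]
        break

print(article_body)
```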

The solution is:
info = soup.find('meta', attrs={'property':'og:description'})
It gave me the text I needed.
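Note that find returns the meta tag itself, and a meta tag has no inner text; the description sits in its content attribute. A short sketch on a made-up head section:

```python
from bs4 import BeautifulSoup

# Made-up head section for illustration.
html = '<head><meta property="og:description" content="Short summary of the article."></head>'
soup = BeautifulSoup(html, "html.parser")

info = soup.find("meta", attrs={"property": "og:description"})
# The value is stored in the content attribute, not in the tag's text.
description = info["content"] if info is not None else None
print(description)
```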

Scraping Table from Nasdaq (beginner)

url = https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts
I am trying to grab the table from the url above. However, when I try to find the table using beautifulsoup, I am unsuccessful. I simply get an empty list.
Please help.
Thanks
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts")
soup = BeautifulSoup(page.text, "lxml")
item = soup.find(class_="genTable")
print(item)

I think the code is quite clear, but if you have any questions feel free to ask.
import requests
from bs4 import BeautifulSoup
# Post the same JSON request the page sends; the 'result' field of the reply is parsed as HTML below
headers = {"Referer": "https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts"}
data = {"id":2,"method":"BL_TradeHalt.GetTradeHalts","params":"[]","version":"1.1"}
url = "https://www.nasdaqtrader.com/RPCHandler.axd"
req = requests.post(url, json=data, headers=headers)
result = req.json()['result']
soup = BeautifulSoup(result, 'html.parser')
table = soup.find('table')
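Once the table is in hand, its rows can be turned into dictionaries keyed by the header cells. A sketch on a made-up table standing in for the result string (the column names and values are illustrative, not the real response):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the 'result' string; column names are illustrative.
result = """
<table>
  <tr><th>Symbol</th><th>Reason</th></tr>
  <tr><td>ABC</td><td>T1</td></tr>
  <tr><td>XYZ</td><td>LUDP</td></tr>
</table>
"""

table = BeautifulSoup(result, "html.parser").find("table")
header = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    dict(zip(header, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in table.find_all("tr")
    if tr.find("td")  # skip the header row, which has no <td> cells
]
print(rows)
```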
