I was following a python guide on web scraping and there's one line of code that won't work for me. I'd appreciate it if anybody could help me figure out what the issue is, thanks.
from bs4 import BeautifulSoup
import json
import re
import requests
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile('root\.App\.main'))
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
Error Message:
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
AttributeError: 'NoneType' object has no attribute 'string'
Link to the guide I was looking at: https://www.mattbutton.com/how-to-scrape-stock-upgrades-and-downgrades-from-yahoo-finance/
The main issue, in my opinion, is that you should add a user-agent to your request so that you get the expected HTML:
headers = {'user-agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
Note: First of all, take a deeper look into your soup to check whether the expected information is available at all.
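A quick way to do that check is to guard the find() result before touching .string, so a blocked request fails with a clear message instead of an AttributeError. Sketched here on a stand-in snippet (the embedded JSON is made up), since the real response depends on the headers you send:

```python
import json
import re
from bs4 import BeautifulSoup

# A small stand-in document; the real Yahoo page embeds its data the same
# way, in a <script> that assigns a JSON object to root.App.main.
html = """<html><body>
<script>root.App.main = {"context": {"symbol": "AAPL"}};</script>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", text=re.compile(r"root\.App\.main"))

if script is None:
    # Without a user-agent Yahoo serves a consent/error page, the tag is
    # missing, and script.string would raise AttributeError on None.
    raise SystemExit("script tag not found - inspect soup.title / status code")

match = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',
                  script.string, flags=re.MULTILINE)
data = json.loads(match.group(1))
print(data["context"]["symbol"])
```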
Example
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
headers = {'user-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
script = soup.find('script', text=re.compile(r'root\.App\.main'))
json_text = json.loads(re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', script.string, flags=re.MULTILINE).group(1))
json_text
This is the code I used to get the data from a website with all the possible Wordle words. I'm trying to put them in a list so I can create a Wordle clone, but I get a weird output when I do this. Please help.
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)
You don't need BeautifulSoup for this; simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or, if you would like to do it with BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]
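The reason .split() with no argument works here is that it splits on any run of whitespace, newlines included, so the newline-separated word file collapses into a flat list. A tiny stand-in shows the shape of the result:

```python
# .split() without an argument splits on any whitespace run (spaces,
# newlines, tabs), so a newline-separated word file becomes a flat list.
text = "women\nnikau\nswack\n"
words = text.split()
print(words)  # ['women', 'nikau', 'swack']
```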
I'm trying to get the text from a website but can't find a way to do it. How do I need to write it?
import requests
from bs4 import BeautifulSoup

link = "https://www.ynet.co.il/articles/0,7340,L-5553905,00.html"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
info = soup.find('div', attrs={'class': 'text14'})
name = info.text.strip()
print(name)
I'm getting None every time.
import requests
from bs4 import BeautifulSoup
import json
link="https://www.ynet.co.il/articles/0,7340,L-5553905,00.html"
response = requests.get(link)
soup = BeautifulSoup(response.text,'html.parser')
info = soup.find_all('script', attrs={'type': "application/ld+json"})[0].text.strip()
jsonDict = json.loads(info)
print(jsonDict['articleBody'])
The page seems to store all the article data as JSON in a <script> tag, so try this code.
The solution is:
info = soup.find('meta', attrs={'property':'og:description'})
It gave me the text I needed.
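One detail worth noting: a <meta> tag keeps its text in the content attribute rather than in .text, so the lookup should read the attribute. A minimal sketch on a stand-in snippet (the attribute value here is invented):

```python
from bs4 import BeautifulSoup

# Stand-in for the article page; a real og:description holds the summary
# in its content attribute, and .text on a <meta> tag is empty.
html = '<head><meta property="og:description" content="Article summary here"></head>'
soup = BeautifulSoup(html, "html.parser")

info = soup.find("meta", attrs={"property": "og:description"})
summary = info["content"] if info is not None else None
print(summary)  # Article summary here
```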
I'm trying to scrape coinmarketcap.com just to get an update of a certain currency price, also just to learn how to web scrape. I'm still a beginner and can't figure out where I'm going wrong, because whenever I try to run it, it just tells me there are none. Although I know that line does exist. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('data-currency-price data-usd=')
print (price)
If you are going to be doing a lot of this, consider making a single call to the official API and getting all the prices, then extracting what you want. The following is from the site, with an amendment by me to show the desired value for Electroneum. The API guidance shows how to retrieve one currency at a time as well, though that requires a higher plan than the basic one.
from requests import Request, Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'limit': '5000',
    'convert': 'USD',
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': 'yourKey',
}

session = Session()
session.headers.update(headers)

try:
    response = session.get(url, params=parameters)
    # print(response.text)
    data = json.loads(response.text)
    print(data['data'][64]['quote']['USD']['price'])
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
You can always deploy a loop and check against a desired list, e.g.
interested = ['Electroneum', 'Ethereum']

for item in data['data']:
    if item['name'] in interested:
        print(item)
For your current example:
You can use an attribute selector for data-currency-value
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
soup.select_one('[data-currency-value]').text
You can use the class attribute to get the value.
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('span' ,attrs={"class" : "h2 text-semi-bold details-panel-item--price__value"})
print (price.text)
Output :
0.006778
You can get the value like this:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find("span", id="quote_price").get('data-usd')
print (price)
You should try to be more specific in how you want to FIND the item.
You are currently using soup.find('data-currency-price data-usd='), and I am not sure what that string is supposed to be. Is that an id? A class name?
Why not try finding the item using an id:
soup.find(id="link3")
or find by tag
soup.find("relevant tag name like div or a")
or something like this
find_this = soup.find("a", id="ID HERE")
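The lookup styles above can be seen side by side on a toy document (the tag names, ids, and values here are invented for illustration):

```python
from bs4 import BeautifulSoup

# A toy document to demonstrate the lookup styles the answer mentions.
html = """
<div>
  <a id="link3" href="https://example.com">Example</a>
  <span data-currency-price data-usd="0.006778">0.006778</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="link3")                       # find by id
by_tag = soup.find("span")                          # find by tag name
by_attr = soup.select_one("[data-currency-price]")  # attribute selector

print(by_id["href"], by_tag.get("data-usd"), by_attr.text)
```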
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
x = soup.find(id="quote_price").text
print(x)
Look for the id instead; note that soup.find_all(text="data-currency-price data-usd") will not find anything, since the text argument matches a tag's text content, not its attributes.
I am running Beautiful Soup 4.5 with Python 3.4 on Windows 7. Here is my script:
from bs4 import BeautifulSoup
import urllib3
http = urllib3.PoolManager()
url = 'https://scholar.google.com'
response = http.request('GET', url)
html2 = response.read()
soup = BeautifulSoup([html2])
print (type(soup))
Here is the error I am getting:
TypeError: Expected String or Buffer
I have researched and there seem to be no fixes except going to an older version of Beautiful Soup which I don't want to do. Any help would be much appreciated.
Not sure why you are putting the html string into a list here:
soup = BeautifulSoup([html2])
Replace it with:
soup = BeautifulSoup(html2)
Or, you can also pass the response file-like object, BeautifulSoup would read it for you:
response = http.request('GET', url)
soup = BeautifulSoup(response)
It is also a good idea to specify a parser explicitly:
soup = BeautifulSoup(html2, "html.parser")
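As a sketch of why the parser argument matters: BeautifulSoup accepts bytes directly (with urllib3 left at its defaults, response.data is the documented way to get the body bytes), and naming the parser keeps the result stable across machines. The HTML below is a stand-in so the example runs offline:

```python
from bs4 import BeautifulSoup

# Stand-in bytes for what urllib3 hands back via response.data; naming the
# parser explicitly avoids the "no parser was explicitly specified" warning
# and bs4 picking a different parser on different machines.
html_bytes = b"<html><body><p>hello</p></body></html>"
soup = BeautifulSoup(html_bytes, "html.parser")
print(type(soup), soup.p.text)
```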
I am writing a program to extract unique web links from www.stevens.edu (it is an assignment), but there is one problem. My program works and extracts links for all sites except www.stevens.edu, for which I get the output None. I am very frustrated with this and need help. I am using this url for testing - http://www.stevens.edu/
import urllib
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = bs(html)
tags = soup('a')

for tag in tags:
    print tag.get('href', None)
Please guide me here and let me know why it is not working with www.stevens.edu.
The site checks the User-Agent header and returns different HTML based on it.
You need to set the User-Agent header to get the proper html:
import urllib2
from bs4 import BeautifulSoup as bs

url = raw_input('enter - ')
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # <--
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')

for tag in tags:
    print tag.get('href', None)
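For anyone on Python 3, where urllib and urllib2 were merged into urllib.request, the same fix looks roughly like this. The actual fetch is replaced by stand-in HTML so the sketch is self-contained:

```python
from urllib.request import Request
from bs4 import BeautifulSoup

url = 'http://www.stevens.edu/'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# urllib.request.urlopen(req).read() would perform the real fetch; a
# stand-in document is parsed instead so the example runs offline.
html = b'<a href="https://www.stevens.edu/about">About</a>'
soup = BeautifulSoup(html, 'html.parser')
links = [tag.get('href', None) for tag in soup('a')]
print(links)  # ['https://www.stevens.edu/about']
```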