I am trying to collect block data, which forms a small table, from a webpage. Please see my code below:
import requests
import re
import json
import sys
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.investing.com/instruments/OptionsDataAjax'
params = {'pair_id': 525,         # SPX
          'date': 1536555600,     # 2018-9-4
          'strike': 'all',        # all strike prices
          'callspots': 'calls',   # or 'call_andputs'
          'type': 'analysis',     # webpage viewer
          'bringData': 'true',
          }
headers = {'User-Agent': 'Chrome/39.0.2171.95 Safari/537.36'}

def R(text, end='\n'): print('\033[0;31m{}\033[0m'.format(text), end=end)  # red
def G(text, end='\n'): print('\033[0;32m{}\033[0m'.format(text), end=end)  # green

page = requests.get(url, params=params, headers=headers)
if page.status_code != 200:
    R('ERROR CODE: {}'.format(page.status_code))
    R('Problem in connection!')
    sys.exit()
else:
    G('OK')

soup = BeautifulSoup(page.content, 'lxml')
spdata = json.loads(soup.text)
print(spdata['data'])
The result, spdata['data'], gives me a str, and I just want to extract blocks like the following from it. There are many such data blocks in the str, all with the same format:
SymbolSPY180910C00250000
Delta0.9656
Imp Vol0.2431
Bid33.26
Gamma0.0039
Theoretical33.06
Ask33.41
Theta-0.0381
Intrinsic Value33.13
Volume0
Vega0.0617
Time Value-33.13
Open Interest0
Rho0.1969
Delta / Theta-25.3172
I use json and BeautifulSoup here; maybe a regular expression would help, but I don't know much about re. Any approach that gets the result is appreciated. Thanks.
Add this after your code:
# each data block starts at "SymbolSPY..." and runs to the next blank line
regex = r"((SymbolSPY[1-9]*):?\s*)(.*?)\n[^\S\n]*\n[^\S\n]*"

for match in re.finditer(regex, spdata['data'], re.MULTILINE | re.DOTALL):
    for line in match.group().splitlines():
        print(line.strip())
Outputs
OK
SymbolSPY180910C00245000
Delta0.9682
Imp Vol0.2779
Bid38.26
Gamma0.0032
Theoretical38.05
Ask38.42
Theta-0.0397
Intrinsic Value38.13
Volume0
Vega0.0579
Time Value-38.13
Open Interest0
Rho0.1934
Delta / Theta-24.3966
SymbolSPY180910P00245000
Delta-0.0262
Imp Vol0.2652
...
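If you want the blocks as an actual table rather than printed lines, you can split each matched block into label/value pairs and load them into the pandas DataFrame you already import. A minimal sketch, reusing regex and spdata from above and assuming every block uses exactly the field labels shown in the sample output:
import re
import pandas as pd

# field labels as they appear in the sample blocks, with "Delta / Theta"
# listed before "Delta" so the longer label wins in the alternation
labels = ['Delta / Theta', 'Intrinsic Value', 'Open Interest', 'Time Value',
          'Theoretical', 'Imp Vol', 'Symbol', 'Delta', 'Gamma', 'Theta',
          'Vega', 'Rho', 'Bid', 'Ask', 'Volume']
field_re = re.compile('({})\\s*(\\S+)'.format('|'.join(map(re.escape, labels))))

rows = []
for match in re.finditer(regex, spdata['data'], re.MULTILINE | re.DOTALL):
    # turn one block into a {label: value} dict
    rows.append(dict(field_re.findall(match.group())))

df = pd.DataFrame(rows)
print(df.head())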
I want to scrape data from the charts on this page: http://188.166.44.172/match/live-stats/100941310
I tried requests and bs4 but failed to get any data, and I also tried selenium with no data as well.
Here's the code using requests:
import requests
from bs4 import BeautifulSoup

u = "http://188.166.44.172/match/live-stats/100941310"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(u, timeout=30, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

for i in soup.find_all('rect'):
    if i.has_attr("onmouseover"):
        text = i.get('onmouseover')
        print(text)
And the code using selenium:
from selenium import webdriver
from bs4 import BeautifulSoup

u = "http://188.166.44.172/match/live-stats/100941310"
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe")
driver.get(u)
soup = BeautifulSoup(driver.page_source, 'html.parser')

for i in soup.find_all('rect'):  # I also tried soup.select('*')
    if i.has_attr("onmouseover"):
        text = i.get('onmouseover')
        print(text)
Is there any way to scrape data from those charts using Python?
The reason you're not getting anything back is that the charts are generated dynamically by JavaScript, so an HTML parser like bs4 never sees them.
However, the data for the charts is embedded in the HTML as JavaScript variables. You can parse that and plot it yourself.
Here's how:
import ast
import re

import requests
import matplotlib.pyplot as plt

target_url = "http://188.166.44.172/match/live-stats/100941310"
page_source = requests.get(target_url).text

# pull the JavaScript array assigned to `var all_attack` out of the page source
raw_attack_data = ast.literal_eval(
    re.search(r"var all_attack = (\[.*\])", page_source).group(1),
)
# keep only the [time, value] pairs; dict entries are chart icon markers
all_attack = [i[1] for i in raw_attack_data if isinstance(i, list)]

plt.plot(all_attack, label="attack")
plt.legend(loc="lower right")
plt.show()
This should give you a line plot of the attack values.
As I've said, everything you need is in the page source, so you'll have to play around with the values. Each stat is a JavaScript array of inner lists, where the first value of an inner list is the game time and the second value is the stat that's plotted on the charts.
Note that some arrays also contain values in {}. Those are the icon markers on the charts; they parse as dicts, so you can filter them out with isinstance(i, list), as shown above.
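For example (hypothetical values, just to illustrate the mixed shape of those arrays):
# hypothetical excerpt of one of the `var ...` arrays in the page source
raw = [[1, 0], [2, 1], {"time": 3, "icon": "goal"}, [3, 1], [4, 2]]

points = [i for i in raw if isinstance(i, list)]  # drop the dict icon markers
game_time = [t for t, _ in points]                # [1, 2, 3, 4]
values = [v for _, v in points]                   # [0, 1, 1, 2]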
EDIT:
Yes, it's possible to get division and team info, as everything is in the HTML. I've reworked the initial answer a bit and came up with this:
import ast
import re

import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

target_url = "http://188.166.44.172/match/live-stats/100941310"
page_source = requests.get(target_url).text


def get_match_info(html_source: str) -> list:
    # the division and team names sit in the <h1> anchors
    return [
        i.getText(strip=True) for i
        in BeautifulSoup(html_source, "lxml").select("h1 a")
    ]


def get_stats(html_source: str, search_str: str) -> tuple:
    # grab the JavaScript array assigned to `var <search_str>`
    raw_data = ast.literal_eval(
        re.search(fr"var {search_str} = (\[.*\])", html_source).group(1),
    )
    filtered = [i[1] for i in raw_data if isinstance(i, list)]
    game_time = [i[0] for i in raw_data if isinstance(i, list)]
    return game_time, filtered


division, home, away = get_match_info(page_source)
time_, attack_home = get_stats(page_source, "dangerous_home")
_, attack_away = get_stats(page_source, "dangerous_away")

plt.suptitle(f"{division} - {home} v {away}")
plt.ylabel("Attack")
plt.xlabel("Game time")
plt.plot(time_, attack_home, color="blue", label=home)
plt.plot(time_, attack_away, color="black", label=away)
plt.legend(loc="lower right")
plt.show()
This produces a plot titled with the division and team names, showing both teams' attack values over game time.
I have the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://windte2001.acepta.com/v01/E67EBB4910CFDCB067EB7D85FBA6E5511D0E64A9'.replace('/v01/', '/depot/')
x = urlopen(url)
new = x.read()
soup = BeautifulSoup(new, "lxml-xml")
result = soup.find_all(['NmbItem', 'QtyItem'])
which returns the following XML elements:
[<NmbItem>SERV. MANEJO DE LIQUIDOS</NmbItem>, <QtyItem>22.00</QtyItem>, <NmbItem>SERV. MANEJO DE RESPEL</NmbItem>, <QtyItem>1.00</QtyItem>]
All I need is: if an NmbItem contains 'LIQUIDOS', bring me its QtyItem, which in this case is 22.00.
How can I do this with Python on this XML?
You can use a regular expression:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(new, 'xml')
# find the NmbItem whose text mentions LIQUIDOS, then jump to the next QtyItem
result = soup.find('NmbItem', text=re.compile("LIQUIDOS")).find_next('QtyItem').text
print(result)
(In newer versions of BeautifulSoup the text argument is also available as string.)
You can do it like this:
result = soup.find_all(['NmbItem'])
for item in result:
    if 'LIQUIDOS' in item.text:
        # here the matching QtyItem happens to be the fourth node in next_siblings
        print(list(item.next_siblings)[3].text)
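If the NmbItem and QtyItem tags always alternate one-to-one, as in the sample above, another option is to pair them up with zip and build a lookup dict. A sketch under that assumption:
names = [n.text for n in soup.find_all('NmbItem')]
qtys = [q.text for q in soup.find_all('QtyItem')]
lookup = dict(zip(names, qtys))

# first quantity whose item name mentions LIQUIDOS -> '22.00'
print(next(qty for name, qty in lookup.items() if 'LIQUIDOS' in name))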
I want to fetch live quotes in Python 3 from the page requested in the code below. The quotes are stored in the JSON response, in the "JsonData" object, and I want to get the value stored under LTP inside it.
from urllib.request import urlopen
import json
url = ("https://ewmw.edelweiss.in/api/Market/Process/GetFutureValue/BANKNIFTY/05%20Apr%202018")
response = urlopen(url)
data = response.read().decode("utf-8")
y = json.loads(data['JsonData'])
print(y)
Replace
y = json.loads(data['JsonData'])
with
y = json.loads(json.loads(data))  # the body is double-encoded JSON, so decode twice
print(y)  # and then access the required variable:
print(y["JsonData"])
Note: your data is escaped, i.e. the response body is a JSON string that itself contains JSON, which is why it has to be decoded twice.
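To see why two passes are needed, here is a small illustration with a made-up payload:
import json

# made-up example of a double-encoded response body
body = '"{\\"JsonData\\": {\\"LTP\\": \\"24339.3\\"}}"'

inner = json.loads(body)        # first pass -> the inner JSON, still a str
print(type(inner))              # <class 'str'>
data = json.loads(inner)        # second pass -> an actual dict
print(data["JsonData"]["LTP"])  # 24339.3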
You can try ast. But before ever using eval(), read Ned Batchelder's post on its dangers.
from urllib.request import urlopen
import json
import ast

url = "https://ewmw.edelweiss.in/api/Market/Process/GetFutureValue/BANKNIFTY/05%20Apr%202018"
response = urlopen(url)
data = response.read().decode("utf-8")

load_data = ast.literal_eval(data)  # safely unwrap the outer quoted string
convert_to = json.loads(load_data)  # parse the inner JSON
print(convert_to['JsonData'])
Output:
{'LTP': '24339.3', 'ChgPer': '-0.68', 'ArticleUrl': '', 'Url': '/quotes/index-future/BANKNIFTY~2018-04-26', 'CoCode': 'BANKNIFTY', 'Date': '2018-04-26T00:00:00', 'Chg': '-165.55'}
You can use eval as suggested in the other answer, but that's not such a good thing to do, really. So you can go with this:
import requests, json

r = requests.get('https://ewmw.edelweiss.in/api/Market/Process/GetFutureValue/BANKNIFTY/05%20Apr%202018').json()
data = json.loads(r)  # .json() returns the inner JSON as a str, so parse it once more
print(data['JsonData'])
Alternatively, if you insist on using urllib, just add another y = json.loads(y). It's not a pretty solution, so you may want to change it later, but it's a bit better than eval. Complete code:
from urllib.request import urlopen
import json

url = "https://ewmw.edelweiss.in/api/Market/Process/GetFutureValue/BANKNIFTY/05%20Apr%202018"
response = urlopen(url)
data = response.read().decode("utf-8")

y = json.loads(data)  # first pass: unwrap the outer JSON string
y = json.loads(y)     # second pass: parse the inner JSON object
print(y['JsonData'])
If it were me, though, I'd go with the first one: cleaner, shorter, better.
I was making a program that would collect the value of the cryptocurrency Verge. This script did the trick:
import urllib2
from bs4 import BeautifulSoup

url = "https://coinmarketcap.com/currencies/verge/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
find_value = soup.find('span', attrs={'class': 'text-large2'})
price = find_value.text
The issue was that the result was in USD and I live in Australia, so I then put that value into a USD-to-AUD converter. I tried that with the following code:
url2 = "http://www.xe.com/currencyconverter/convert/?Amount=" + price + "&From=USD&To=AUD"
print url2
page2 = urllib2.urlopen(url2)
soup2 = BeautifulSoup(page2, "html.parser")
find_value2 = soup.find('span', attrs={'class': 'uccResultAmount'})
print find_value2
The result was that I would get the right URL but the wrong result. Could anybody tell me where I am going wrong? Thank you.
You can use regular expressions to scrape the currency converter. (Note, by the way, that the immediate bug in your snippet is calling soup.find instead of soup2.find on the converter page.)
import urllib
import re
from bs4 import BeautifulSoup

def convert(**kwargs):
    url = "http://www.xe.com/currencyconverter/convert/?Amount={amount}&From={from_curr}&To={to_curr}".format(**kwargs)
    data = str(urllib.urlopen(url).read())
    # pull the number that follows the uccResultAmount span
    val = map(float, re.findall(r"(?<=uccResultAmount'>)[\d\.]+", data))
    return val[0]

url = "https://coinmarketcap.com/currencies/verge/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
find_value = soup.find('span', attrs={'class': 'text-large2'})

print convert(amount=float(find_value.text), from_curr='USD', to_curr='AUD')
Output:
0.170358
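For reference, the same idea in Python 3 with requests. This is a sketch: the text-large2 and uccResultAmount class names are taken from the code above and may have changed on either site since.
import re

import requests
from bs4 import BeautifulSoup

# scrape the USD price of Verge from coinmarketcap
page = requests.get("https://coinmarketcap.com/currencies/verge/").text
price = BeautifulSoup(page, "html.parser").find("span", attrs={"class": "text-large2"}).text

# feed it to the xe.com converter and pull the AUD amount out of the markup
url = "http://www.xe.com/currencyconverter/convert/?Amount={}&From=USD&To=AUD".format(price)
data = requests.get(url).text
print(float(re.search(r"(?<=uccResultAmount'>)[\d.]+", data).group()))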
This link lets me get a random item from the database. However, I would like to retrieve items automatically using Python. Here's my code:
import sys
from urllib.parse import urlencode
from urllib.request import urlopen
# parameters
data = {}
data["query"] = "reviewd:yes+AND+organism:9606"
data["random"] = "yes"
url_values = urlencode(data)
url = "http://www.uniprot.org/uniprot/"
full_url = url + '?' + url_values
data = urlopen(full_url)
out = open("1.html", 'w')
out.write(str(data.read()))
However, I cannot get the desired page. Does anyone know what's wrong with my code? I'm using Python 3.x.
You have a couple of issues:
reviewd is misspelled; it should be reviewed
You need to use spaces instead of + in your query string; urlencode will encode them properly
Here is what that would look like:
from urllib.parse import urlencode
from urllib.request import urlopen

# parameters
data = {}
data["query"] = "reviewed:yes AND organism:9606"
data["random"] = "yes"

url_values = urlencode(data)
url = "http://www.uniprot.org/uniprot/"
full_url = url + '?' + url_values

data = urlopen(full_url)
out = open("1.html", 'w')
out.write(data.read().decode("utf-8"))  # decode the bytes instead of str()-wrapping them
out.close()
This produces the following URL:
http://www.uniprot.org/uniprot/?query=reviewed%3Ayes+AND+organism%3A9606&random=yes
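Equivalently, with requests, which builds the same query string from a params dict (a sketch):
import requests

params = {"query": "reviewed:yes AND organism:9606", "random": "yes"}
r = requests.get("http://www.uniprot.org/uniprot/", params=params)

with open("1.html", "w", encoding="utf-8") as out:
    out.write(r.text)  # requests decodes the response body for us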