I am trying to get the daily price data from this specific webpage:
https://www.londonstockexchange.com/stock/CS1/amundi/company-page
That data is shown in the chart.
I have run out of ideas for reaching it. I assume the data is transferred through one of the websocket connections that are opened and visible in the browser console.
I tried to simulate the websocket connection and send the same binary payloads as the front-end app:
import binascii
from websocket import create_connection

ws = create_connection("wss://82-99-29-151.infrontservices.com/wsrt/2/4")

# Binary payloads captured from the browser (hex, truncated here)
hex_1 = "3e000000010..."
hex_2 = "13000000010..."
hex_3 = "1e000000010..."

ws.send(binascii.unhexlify(hex_1))
ws.send(binascii.unhexlify(hex_2))
ws.send(binascii.unhexlify(hex_3))
result = ws.recv()
Then I tried to decode this response with every available encoding, as follows:
from encodings.aliases import aliases

for codec in aliases.values():
    try:
        print(result.decode(codec))
    except Exception:
        print(f"ERROR {codec}")
Naturally, none of the output is interpretable or usable. A cipher might be in use here, but I have no further ideas on how to investigate.
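One thing I have not fully ruled out is that the payload is simply compressed or length-prefixed binary rather than encrypted. A quick check along these lines (the prefixes I skip are pure guesses on my part) is the sort of thing I have in mind:

import zlib

# Guesswork: see whether the payload is a zlib/gzip stream, possibly
# hidden behind a small length/header prefix.
print("first bytes:", result[:8].hex())
for skip in (0, 2, 4, 8):
    try:
        print(f"after skipping {skip} bytes:",
              zlib.decompress(result[skip:], zlib.MAX_WBITS | 32)[:80])
        break
    except zlib.error:
        pass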
Do you have any idea about that? :)
Thanks in advance!
AL Ko
EDIT 1
In the chart we can see one of the data points with the value 16990 for a given date. What I am looking for is the whole time series behind that chart.
Once you have read my comment, informed yourself about the implications of scraping, and decided to proceed carefully: Python can retrieve this JSON with just a few lines of code.
import requests

url = "https://api.londonstockexchange.com/api/gw/lse/instruments/alldata/CS1"
response = requests.get(url).json()

# print some data from the JSON
print(response)
print(response.get("description"))
print(response.get("bid"))
I found this data using the "Network" tab; a few more requests show up when you hit "reload", but they seem to be empty.
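To see everything that endpoint actually returns, and whether the chart's time series is in this payload or has to come from one of the other requests in the Network tab (I have not checked which field, if any, holds the series), you can dump the whole response:

import json
import requests

url = "https://api.londonstockexchange.com/api/gw/lse/instruments/alldata/CS1"
payload = requests.get(url).json()

# Pretty-print the full structure and the top-level keys to hunt for the series
print(json.dumps(payload, indent=2, sort_keys=True))
print(list(payload.keys()))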
Related
I am following along with Python for Data Analysis and am on Chapter 6 looking at using APIs.
I wish to connect to sources provided by National Grid on their Data Portal. They provide a number of URLs (e.g. several can be found here: https://data.nationalgrideso.com/ancillary-services/obligatory-reactive-power-service-orps-utilisation). I want to read these directly into pandas rather than downloading the Excel/CSV file and then opening that.
I am receiving the error message
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
after attempting the following:
import requests
import codecs
import json
url = 'https://data.nationalgrideso.com/backend/dataset/7e142b03-8650-4f46-8420-7ce1e84e1e5b/resource/a61c6c26-62ec-41e1-ae25-ed95f4562274/download/reactive-utilisation-data-apr-2020-mar-2022.csv'
resp = requests.get(url)
decoded_data = codecs.decode(resp.text.encode(), 'utf-8-sig')
data = json.loads(decoded_data)
I understand that I need to use 'utf-8-sig' because a BOM otherwise appears on the first line.
I have looked at answers regarding the same error message but nothing is working for me at present. The API is working in the browser and I am receiving a response of 200 and data is being returned. Perhaps I am missing something more fundamental in the approach?
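For clarity, what I am ultimately hoping for is something along these lines (just a sketch of the intent, not code I have got working):

import pandas as pd

url = ('https://data.nationalgrideso.com/backend/dataset/7e142b03-8650-4f46-8420-7ce1e84e1e5b'
       '/resource/a61c6c26-62ec-41e1-ae25-ed95f4562274'
       '/download/reactive-utilisation-data-apr-2020-mar-2022.csv')

# Read the published CSV straight into a DataFrame; utf-8-sig strips the BOM
df = pd.read_csv(url, encoding='utf-8-sig')
print(df.head())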
Goal: I want to download daily data for every day going back several years off a website.
The website requires a login, and each page only shows 7 CSV files; you then have to click "previous" to view the 7 before those. Ideally I want to download all of these into one folder so I have all the daily data.
The download link for the files follows a very simple format, which I attempted to take advantage of:
https://cranedata.com/publications/download/mfi-daily-data/issue/2020-09-08/
with only the ending changing for each date (weekends excluded).
I have attempted to modify several versions of code but ultimately have not found anything that works.
#!/usr/bin/env ipython
# --------------------
import requests
import shutil
import datetime
# -----------------------------------------------------------------------------------
dates = [datetime.datetime(2019, 1, 1) + datetime.timedelta(dval) for dval in range(0, 366)]
# -----------------------------------------------------------------------------------
for dateval in dates:
    r = requests.get('https://cranedata.com/publications/download/mfi-daily-data/issue/' + dateval.strftime('%Y-%m-%d'), stream=True)
    if r.status_code == 200:
        with open(dateval.strftime('%Y%m%d') + ".csv", 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
# ---------------------------------------------------------------------------------
This does seem to work for files on other websites, but the CSV files downloaded here do not actually contain any data.
This is what my Excel files show instead of the actual data:
https://prnt.sc/ugju49
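A quick diagnostic I could add (just a sketch) would be to print what the server actually sends back for one of those URLs instead of writing it straight to disk:

import requests

url = 'https://cranedata.com/publications/download/mfi-daily-data/issue/2020-09-08/'
r = requests.get(url)

# The content type and the first few hundred characters usually make it
# obvious whether this is CSV data or an HTML login page.
print(r.status_code, r.headers.get('Content-Type'))
print(r.text[:300])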
You need to add authentication information to the request. This is done either via a header or via a cookie. You can use a requests.Session object to simplify this for both.
It is not possible to give you more details without knowing which technology the site uses for authentication.
Chances are (by the looks of the site) that it uses a server-side session, so there should be something like a "session id" or "sid" in your headers when talking to the back-end. You need to open the browser's developer tools and look closely at the request headers, and also at the response and response headers when you perform a login.
If you are very lucky, just using a requests.Session might be enough, as long as you perform a login at the beginning of the session. Something like this:
#!/usr/bin/env ipython
import requests
import shutil
import datetime

dates = [datetime.datetime(2019, 1, 1) + datetime.timedelta(dval) for dval in range(0, 366)]

with requests.Session() as sess:
    # Placeholder: POST the site's real login URL with the real form fields
    sess.post(login_url, data=credentials)
    for dateval in dates:
        r = sess.get('https://cranedata.com/publications/download/mfi-daily-data/issue/' + dateval.strftime('%Y-%m-%d'), stream=True)
        if r.status_code == 200:
            with open(dateval.strftime('%Y%m%d') + ".csv", 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
If this does not work, you need to closely inspect the Network tab in the developer tools, pick out the interesting bits, and reproduce them in your code.
Which parts are the "interesting bits" depends on the back-end, and without further details I cannot say.
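As a rough illustration only, the "interesting bits" often boil down to a handful of request headers plus the session cookie; copied out of the developer tools, they could be replayed like this (every name and value below is a placeholder, the real ones depend on the site):

import requests

sess = requests.Session()

# Placeholders: copy the real header names, cookie name and value from the
# browser's developer tools after logging in through the browser.
sess.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://cranedata.com/',
})
sess.cookies.set('PHPSESSID', 'paste-your-session-id-here')

r = sess.get('https://cranedata.com/publications/download/mfi-daily-data/issue/2020-09-08/')
print(r.status_code, r.headers.get('Content-Type'))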
I'm working on a Sentiment Analysis project using the Google Cloud Natural Language API and Python. This question might be similar to this other question; what I'm doing is the following:
Reads a CSV file from Google Cloud Storage; the file has approximately 7,000 records.
Converts the CSV into a Pandas DataFrame.
Iterates over the dataframe and calls the Natural Language API to perform sentiment analysis on one of the dataframe's columns; in the same loop I extract the score and magnitude from the result and add those values to new columns on the dataframe.
Stores the resulting dataframe back to GCS.
I'll put my code below, but before that I just want to mention that I have tested it with a sample CSV of fewer than 100 records and it works well. I am also aware of the quota limit of 600 requests per minute, which is why I put a delay on each iteration; still, I'm getting the error I specify in the title.
I'm also aware of the suggestion to increase the ulimit, but I don't think that's a good solution.
Here's my code:
from google.cloud import language_v1
from google.cloud.language_v1 import enums
from google.cloud import storage
from time import sleep
import pandas
import sys

pandas.options.mode.chained_assignment = None

def parse_csv_from_gcs(csv_file):
    df = pandas.read_csv(csv_file, encoding="ISO-8859-1")
    return df

def analyze_sentiment(text_content):
    client = language_v1.LanguageServiceClient()
    type_ = enums.Document.Type.PLAIN_TEXT
    language = 'es'
    document = {"content": text_content, "type": type_, "language": language}
    encoding_type = enums.EncodingType.UTF8
    response = client.analyze_sentiment(document, encoding_type=encoding_type)
    return response

gcs_path = sys.argv[1]
output_bucket = sys.argv[2]
output_csv_file = sys.argv[3]

dataframe = parse_csv_from_gcs(gcs_path)

for i in dataframe.index:
    print(i)
    response = analyze_sentiment(dataframe.at[i, 'FieldOfInterest'])
    dataframe.at[i, 'Score'] = response.document_sentiment.score
    dataframe.at[i, 'Magnitude'] = response.document_sentiment.magnitude
    sleep(0.5)

print(dataframe)
dataframe.to_csv("results.csv", encoding='ISO-8859-1')

gcs = storage.Client()
gcs.get_bucket(output_bucket).blob(output_csv_file).upload_from_filename('results.csv', content_type='text/csv')
The 'analyze_sentiment' function is very similar to what we have in Google's documentation; I just modified it a little, but it does pretty much the same thing.
Now, the program raises that error and crashes when it reaches a record somewhere between 550 and 700. I don't see the connection between the service account JSON and calling the Natural Language API, so I suspect that when I call the API it opens the account credential JSON file but doesn't close it afterwards.
I'm currently stuck on this issue and have run out of ideas, so any help will be much appreciated. Thanks in advance =)!
[UPDATE]
I've solved this issue by extracting the 'client' out of the 'analyze_sentiment' method and passing it as a parameter, as follows:
def analyze_sentiment(text_content, client):
<Code>
Looks like every time it reaches this line:
client = language_v1.LanguageServiceClient()
it opens the account credential JSON file and never closes it,
so extracting it into a global variable made this work =).
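A minimal sketch of that change (the rest of the script stays as in the original post):

from google.cloud import language_v1
from google.cloud.language_v1 import enums

def analyze_sentiment(text_content, client):
    # The client is now created once by the caller and reused for every row
    document = {
        "content": text_content,
        "type": enums.Document.Type.PLAIN_TEXT,
        "language": "es",
    }
    return client.analyze_sentiment(document, encoding_type=enums.EncodingType.UTF8)

client = language_v1.LanguageServiceClient()
for i in dataframe.index:
    response = analyze_sentiment(dataframe.at[i, 'FieldOfInterest'], client)
    dataframe.at[i, 'Score'] = response.document_sentiment.score
    dataframe.at[i, 'Magnitude'] = response.document_sentiment.magnitude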
I've updated the original post with the solution, but in any case, thanks to everyone who saw this and tried to reply =)!
I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json

cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print cb.server_nodes

#tempJson = json.loads(open("myData.json","r"))

try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise

print(cb.get('healthRec').value)
I know that the first commented-out line that loads the JSON is incorrect, because json.loads expects a string rather than a file object... Can anyone help?
Thanks!
Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)

try:
    result = cb.upsert('healthRec', {'record': data})
I am looking into using cbdocloader, but this was my first step to getting it working. Thanks!
I know that you've found a solution that works for you in this instance, but I thought I'd explain the issue you experienced in your initial code snippet.
json.loads() takes a string as input and decodes the JSON string into a dictionary (or whatever custom object you produce via object_hook), which is why you were seeing the issue: you were passing it a file handle.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
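For contrast, the json.loads() route needs the file contents read into a string first, which ends up doing the same thing:

with open("myData.json", "r") as f:
    tempJson = json.loads(f.read())  # equivalent to json.load(f)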
As Kirk mentioned, though, if you have a large number of JSON documents to insert, it might be worth taking a look at cbdocloader, as it will handle all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.
I'm trying to scrape data from Verizon's buyback pricing site. I found the source of the information while going through the "Net" requests in my browser. The response is in JSON format, but nothing I do will let me download that data: https://www.verizonwireless.com/vzw/browse/tradein/ajax/deviceSearch.jsp?act=models&car=Verizon&man=Apple&siz=large
I can't remember everything I've tried, but here are the issues I'm having. Also, I'm not sure how to insert multiple code blocks.
import json
import urllib.request
import requests
import codecs

url = 'https://www.verizonwireless.com/vzw/browse/tradein/ajax/deviceSearch.jsp?act=models&car=Verizon&man=Apple&siz=large'

# Attempt 1
res = urllib.request.urlopen(url)
data = json.loads(res)
# TypeError: the JSON object must be str, not 'bytes'

# Attempt 2
reader = codecs.getreader('utf-8')
obj = json.load(reader(res))
# ValueError: Expecting value: line 1 column 1 (char 0)
# this value error happens with other similar attempts, such as...

# Attempt 3
res = requests.get(url)
res.json()  # same error occurs
At this point I've spent many hours researching and can't find a solution. I'm assuming the site is not formatted normally, or that I'm missing something obvious. I can see the JSON requests/structure in my browser's developer tools.
Does anybody have any ideas or solutions for this? Please let me know if you have questions.
You need to send a User-Agent HTTP header field. Try this program:
import requests
url='https://www.verizonwireless.com/vzw/browse/tradein/ajax/deviceSearch.jsp?act=models&car=Verizon&man=Apple&siz=large'
# Put your own contact info in next line
headers = {'User-agent':'MyBot/0.1 (+user#example.com)'}
r = requests.get(url, headers=headers)
print(r.json()['models'][0]['name'])
Result:
iPhone 6S
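If you need every model rather than just the first, the same payload can be iterated over (assuming each entry carries a 'name' field like the first one does):

for model in r.json()['models']:
    print(model['name'])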