Learning Json scraping with R coming from Python - python

I'm moving to R from python and am trying to use my python skills to become familiar with scraping json with R. I am having some issues viewing and scraping what I would like to. I'm pretty sure I have the For loops down but I am unsure on how to select keys and return their content. I have read some documents but being new to R its a little tough to understand. For this I created a quick script with python to show what I am trying to do in Rstudio.
import requests
from pprint import pprint
start = '2018-10-03'
end = '2018-10-10'
req = requests.get('https://statsapi.web.nhl.com/api/v1/schedule?startDate=' + str(start) + '&endDate=' + str(end) + '&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data = req.json()['dates']
for info in data:
date = info['date']
games = info['games']
for game in games:
gamePk = game['gamePk']
print(date, gamePk)
Below is what I have started with but I am having an issue understanding where I can view my json other than print data which locks up R. I would like to be able to view the dictionaries and keys as I go. The other question is how would i then add the key-values to a "vector? or df?" and write them out. I am familiar with exporting to a file but curious as to how I add the values to a df. Would that be bind? or would i not have to do that?
library(jsonlite)
start <- as.Date(c('2018-10-03'))
end <- as.Date(c('2019-04-15'))
url <- paste0('https://statsapi.web.nhl.com/api/v1/schedule?startDate=', start,'&endDate=', end,'&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data <- fromJSON(url)
To expound on my issue here is a further sample of where the struggle lies.
library(jsonlite)
start <- as.Date(c('2018-10-03'))
end <- as.Date(c('2018-10-04'))
url <- paste0('https://statsapi.web.nhl.com/api/v1/schedule?startDate=', start,'&endDate=', end,'&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data <- fromJSON(url)
date <- data$dates$date
game_id <- data$dates$games
game <- NULL
for (ids in game_id) {
pk <- ids$gamePk
game <- rbind(game, pk)
}
I figured the "pk" would be in 1 column but its in multiple columns and I receive a In rbind: number of columns of result is not a multiple of a vector length

Related

Daily leaderboard or price tracking data

I'll just start from scratch since I feel like I'm lost with all the different possibilities. What I will be talking about is leaderboard but could apply to price tracking as well.
My goal is to scrape data from a website (the all time leaderboard / hidden), put it in a .csv file and update it daily at noon.
What I have succeeded so far : scraping the data.
Tried scraping with BS4 but since the data is hidden, I couldn't be specific enough to only get the all-time points. I find it's a success because I'm able to get a table with all the data I need and the date as a header. My problem with this solution is 1) unuseful data populating the csv 2) table is vertical and not horizontal
Scraped data with CSS selector but I have abandoned this idea because soemtimes the page won't load and the data wasn't scraped. Found out that there's a json file containing the data right away
Json scraping seems to be the best option, but having trouble creating a csv file that's OK to make a graph with.
This is what brings me to what I'm struggling with : storing the data in a table that looks like this where the grey area is the points and the DATE1 is the moment the data has been scraped :
I'd like not to manipulate the data in the csv file too much. If the table would look like what I picture above, then it's gonna be easier to make a graph afterwards but I'm having trouble. The best I did is creating a table that looks like this AND that's vertical and not horizontal.
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
Thank you for your help.
Here's the code
import pandas as pd
import time
timestr = time.strftime("%Y-%m-%d %H:%M")
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)
table = pd.DataFrame.from_records(data, index=['name'], columns=['points','name'])
table['date'] = pd.Timestamp.today().strftime('%m-%d-%Y')
table.to_csv('products.csv', index=True, encoding='utf-8')
If what I want is not possible, I might just scrape individually for each member, make one CSV file per member and make a graph that refers to those different files.
So, I've played around with your question a bit and here's what I came up with.
Basically, your best bet for data storage is a light weight database, as suggested in the comments. However, with a bit of planning, a few hoops to jump, and some hacky code you could get away with a simple (sort of) JSON that eventually ends up as a .csv file that looks like this:
Note: the values are the same as I don't want to wait a day or two for the leader-board to actually update.
What I did was rearranging the data that came back from the request to the API and built a structure that looks like this:
"BobTheElectrician": {
"id": 7160010,
"rank": 14,
"score_data": {
"2020-10-24 18:45": 4187,
"2020-10-24 18:57": 4187,
"2020-10-24 19:06": 4187,
"2020-10-24 19:13": 4187
}
Every player is your main key that has, among others, a scores_data value. This in turn is a dict that holds points value for each day you run the script.
Now, the trick is to get this JSON to look like the .csv you want. The question is - how?
Well, since you intend to update all players' data (I just assumed that) they all should have the same number of entries for score_data.
The keys for score_data are your timestamps. Grab any player's score_data keys and you have the date headers, right?
Having said that, you can build your .csv rows the same way: grab player's name and all their point values from score_data. This should get you a list of lists, right? Right.
Then, when you have all this, you just dump that to a .csv file and there you have it!
Putting it all together:
import csv
import json
import os
import random
import time
from urllib.parse import urlencode
import requests
API_URL = "https://community.koodomobile.com/widget/pointsLeaderboard?"
LEADERBOARD_FILE = "leaderboard_data.json"
def get_leaderboard(period: str = "allTime", max_results: int = 20) -> list:
payload = {"period": period, "maxResults": max_results}
return requests.get(f"{API_URL}{urlencode(payload)}").json()
def dump_leaderboard_data(leaderboard_data: dict) -> None:
with open("leaderboard_data.json", "w") as jf:
json.dump(leaderboard_data, jf, indent=4, sort_keys=True)
def read_leaderboard_data(data_file: str) -> dict:
with open(data_file) as f:
return json.load(f)
def parse_leaderboard(leaderboard: list) -> dict:
return {
item["name"]: {
"id": item["id"],
"score_data": {
time.strftime("%Y-%m-%d %H:%M"): item["points"],
},
"rank": item["leaderboardPosition"],
} for item in leaderboard
}
def update_leaderboard_data(target: dict, new_data: dict) -> dict:
for player, stats in new_data.items():
target[player]["rank"] = stats["rank"]
target[player]["score_data"].update(stats["score_data"])
return target
def leaderboard_to_csv(leaderboard: dict) -> None:
data_rows = [
[player] + list(stats["score_data"].values())
for player, stats in leaderboard.items()
]
random_player = random.choice(list(leaderboard.keys()))
dates = list(leaderboard[random_player]["score_data"])
with open("the_data.csv", "w") as output:
w = csv.writer(output)
w.writerow([""] + dates)
w.writerows(data_rows)
def script_runner():
if os.path.isfile(LEADERBOARD_FILE):
fresh_data = update_leaderboard_data(
target=read_leaderboard_data(LEADERBOARD_FILE),
new_data=parse_leaderboard(get_leaderboard()),
)
leaderboard_to_csv(fresh_data)
dump_leaderboard_data(fresh_data)
else:
dump_leaderboard_data(parse_leaderboard(get_leaderboard()))
if __name__ == "__main__":
script_runner()
The script also checks if you have a JSON file that pretends to be a proper database. If not, it'll write the leader-board data. Next time you run the script, it'll update the JSON and spit out a fresh .csv file.
Hope this answer will nudge you in the right direction.
Hey since you are loading it in a panda frame it makes the operations fairly simple. I ran your code first
import pandas as pd
import time
timestr = time.strftime("%Y-%m-%d %H:%M")
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)
table = pd.DataFrame.from_records(data, index=['name'], columns=['points','name'])
table['date'] = pd.Timestamp.today().strftime('%m-%d-%Y')
Then I added a few more lines of code to modify the panda frame table to your need.
idxs = table['date'].index
for i,val in enumerate(idxs):
table.at[ val , table['date'][i] ] = table['points'][i]
table = table.drop([ 'date', 'points' ], axis = 1)
In the above snippet I am using pandas frames ability to assign values using indexes. So first I get index values for the date column then I go through each of them to add column for the required date(values from date column) and get the corresponding points according to the indexes we pulled earlier
This gives me the following output:
name 10-24-2020
Dennis 52570.0
Dinh 40930.0
Sophia 26053.0
Mayumi 25300.0
Goran 24689.0
Robert T 19843.0
Allan M 19768.0
Bernard Koodo 14404.0
nim4165 13629.0
Timo Tuokkola 11216.0
rikkster 7338.0
David AKU 5774.0
Ranjan Koodo 4506.0
BobTheElectrician 4170.0
Helen Koodo 3370.0
Mihaela Koodo 2764.0
Fred C 2542.0
Philosoraptor 2122.0
Paul Deschamps 1973.0
Emilia Koodo 1755.0
I can then save this to csv using last line from your code. Similar you can pull data for more dates and format it to add it to the same panda frame
table.to_csv('products.csv', index=True, encoding='utf-8')

How to work with multiple of the same key?

I have a large dictionary that contains weather data. You can take a look at it here
This weather data is for multiple days, and I want to get all of the values from one key. How would I do this?
Here is a simplified version of the dictionary:
'data': { 'day1' : {'weather_discription': 'cloudy'},
'day2' : {'weather_discription': 'clear'}
}
I tried to use this code:
import requests
r = requests.get('data website')
res = r.json()
print(res['weather_discription'])
You need a loop to get them all.
for day, data in res['data'].items():
print(f"Weather on {day} was {data['weather_description']}")

Trying to translate my Matlab routine for loading and preparing data to Python. Stuck in pandas

this is my first post and I have been struggling with this problem for a few days now. The following code is a Matlab code that I usually use to load my data (.csv files) and prepare them for further calculations.
%I use this later on to predefine my array because they need to have to same length for calculations
maxind = 400;
% i is given a vector (like test_numbers = [1 2 3]) I get from the user so I can iterate over the numbers of test specimen
for i = test_numbers
% Setup the Import Options and import the data
opts = delimitedTextImportOptions("NumVariables", 4);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ";";
% Specify column names and types
opts.VariableNames = ["time", "force", "displ_1", "displ_2"];
opts.VariableTypes = ["double", "double", "double", "double"];
% Specify file level properties
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";
% Import the data
%Here I build the name and read the files/ the csv files are in the same folder as the main program.
data_col = readtable(['specimen_name',num2str(i),'.csv'], opts);
Data.force(:,i)=nan(maxind,1);
Data.force(1:length(data_col.time),i)=data_col.force;
Data.displ(:,i)=nan(maxind,1);
Data.displ(1:length(data_col.time),i)=nanmean([data_col.displ_1,data_col.displ_2]')';
Data.time(:,i)=nan(maxind,1);
Data.time(1:length(data_col.time),i)=data_col.time;
Data.name(i)={['specimen_name',num2str(i)]};
% Clear temporary variables
clear opts
end
Now I have to use Python instead of Matlab and I started with pandas to read my csv as DataFrame.
Now my question. Is there a way to access my data like in this part of my Matlab code or should I not use dataframes in the first place to do something like that? (I know I can access my data with the name of the column, but I got stuck by trying to refer my data like Data.force(1:length(data_col.time),i) in a new dataframe)
Data.force(:,i)=nan(maxind,1);
Data.force(1:length(data_col.time),i)=data_col.force;
Data.displ(:,i)=nan(maxind,1);
Data.displ(1:length(data_col.time),i)=nanmean([data_col.displ_1,data_col.displ_2]')';
Data.time(:,i)=nan(maxind,1);
Data.time(1:length(data_col.time),i)=data_col.time;
Data.name(i)={['specimen_name',num2str(i)]};
Many thanks in advance for your help.

Checking HTTP Status (Python)

Is there a way to check the HTTP Status Code in the code below, as I have not used the request or urllib libraries which would allow for this.
from pandas.io.excel import read_excel
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
#check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8) #Creates the dataframes
short_end_spot_curve = read_excel(url, sheetname=6)
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
#Providing correct names
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
def filter_func(group):
return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
In pandas:
read_excel try to use urllib2.urlopen(urllib.request.urlopen instead in py3x) to open the url and get .read() of response immediately without store the http request like:
data = urlopen(url).read()
Though you need only part of the excel, pandas will download the whole excel each time. So, I voted #jonnybazookatone.
It's better to store the excel to your local, then you can check the status code and md5 of file first to verify data integrity or others.

Python Dataset package & looping / updating rows --

I am trying to retrieve the contents of my sqlite3 database and updating this data utilizing a scraper in a for loop.
The presumed flow is as follows:
Retrieve all rows from the dataset
For each row, find the URL column and fetch some additional (updated) data
Once this data has been obtained, upsert (update, add columns if not existent) this data to the row the URL was taken from.
I love the dataset package because of 'upsert', allowing it to dynamically add whatever columns I may have added to the database if non-existent.
My code produces an error I can't explain, however.
'ResourceClosedError: This result object is closed.'
How would I go about obtaining my goal without running into this? The following snippet recreates my issue.
import dataset
db = dataset.connect('sqlite:///test.db')
# Add two dummy rows
testrow1 = {'TestID': 1}
testrow2 = {'TestID': 2}
db['test'].upsert(testrow1, ['TestID'])
db['test'].upsert(testrow2, ['TestID'])
print("Inserted testdata before loop")
# This works fine
testdata = db['test'].all()
for row in testdata:
print row
# This gives me an 'ResourceClosedError: This result object is closed.' error?
i = 1 # 'i' here exemplifies data that I'll add through my scraper.
testdata = db['test'].all()
for row in testdata:
data = {'TestID': i+1000}
db['test'].upsert(data, ['TestID'])
print("Upserted within loop (i = " + str(i) + ")")
i += 1
The issue might be you are querying the dataset and accessing the result object (under 'this works fine") and reading it all in a loop and then immediately trying to do another loop again with upserts on the same result object. The error is telling you that the resource has been closed, basically once you read it the connection is closed automatically (as a feature!). (see this answer about 'automatic closing' for more on the why and ways to get around it.)
Given that result resources tend to get closed, try fetching the results again at the beginning of your upsert loop:
i = 1 # 'i' here exemplifies data that I'll add through my scraper.
testdata = db['test'].all()
for row in testdata:
data = {'TestID': i}
db['test'].upsert(data, ['TestID'])
print("Upserted within loop (i = " + str(i) + ")")
i += 1
Edit: See comment, the above code would change the testdata inside the loop and thus still gives the same error, so a way to get around this is to read the data into an array first and then loop through that array to do the updates. Something like:
i = 1 # 'i' here exemplifies data that I'll add through my scraper.
testdata = [row for row in db['test'].all()]
for row in testdata:
data = {'TestID': i}
db['test'].upsert(data, ['TestID'])
print("Upserted within loop (i = " + str(i) + ")")
i += 1

Categories

Resources