I'll just start from scratch since I feel lost with all the different possibilities. What I'll be talking about is a leaderboard, but it could apply to price tracking as well.
My goal is to scrape data from a website (the all-time leaderboard, which is hidden), put it in a .csv file, and update it daily at noon.
What I have succeeded at so far: scraping the data.
I tried scraping with BS4, but since the data is hidden, I couldn't be specific enough to get only the all-time points. I'd still call it a success because I'm able to get a table with all the data I need and the date as a header. My problems with this solution are 1) useless data populating the csv and 2) the table is vertical rather than horizontal.
I also scraped the data with a CSS selector, but I abandoned that idea because sometimes the page wouldn't load and the data wasn't scraped. I then found out there's a JSON file containing the data right away.
JSON scraping seems to be the best option, but I'm having trouble creating a csv file that's suitable for making a graph.
This brings me to what I'm struggling with: storing the data in a wide table where each row is a member, each column header is a scrape date (DATE1 being the moment the data was scraped), and the cells hold the points.
I'd rather not manipulate the data in the csv file too much. If the table looked like the one described above, it would be easier to make a graph afterwards, but I'm having trouble. The best I've done is a table that looks like this, and it's vertical rather than horizontal:
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
name,points,date
Dennis,52570,10-23-2020
Dinh,40930,10-23-2020
Thank you for your help.
Here's the code
import pandas as pd
import time

timestr = time.strftime("%Y-%m-%d %H:%M")  # (not used below)

# the hidden JSON endpoint behind the leaderboard widget
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)

# keep only name and points, indexed by name
table = pd.DataFrame.from_records(data, index=['name'], columns=['points','name'])

# stamp every row with today's scrape date
table['date'] = pd.Timestamp.today().strftime('%m-%d-%Y')
table.to_csv('products.csv', index=True, encoding='utf-8')
If what I want is not possible, I might just scrape individually for each member, make one CSV file per member and make a graph that refers to those different files.
So, I've played around with your question a bit and here's what I came up with.
Basically, your best bet for data storage is a lightweight database, as suggested in the comments. However, with a bit of planning, a few hoops to jump through, and some hacky code, you can get away with a simple (sort of) JSON file that eventually ends up as a .csv in exactly the wide layout you want.
Note: the values are all the same because I didn't want to wait a day or two for the leaderboard to actually update.
What I did was rearrange the data that came back from the request to the API and build a structure that looks like this:
"BobTheElectrician": {
"id": 7160010,
"rank": 14,
"score_data": {
"2020-10-24 18:45": 4187,
"2020-10-24 18:57": 4187,
"2020-10-24 19:06": 4187,
"2020-10-24 19:13": 4187
}
Every player is a top-level key that has, among other things, a score_data value. This in turn is a dict holding a points value for each time you run the script.
Now, the trick is to get this JSON to look like the .csv you want. The question is: how?
Well, since you intend to update all players' data (I just assumed that), they should all have the same number of entries in score_data.
The keys of score_data are your timestamps. Grab any player's score_data keys and you have the date headers, right?
Having said that, you can build your .csv rows the same way: grab each player's name and all their point values from score_data. This gets you a list of lists, right? Right.
Then, when you have all this, you just dump it to a .csv file and there you have it!
Putting it all together:
import csv
import json
import os
import random
import time
from urllib.parse import urlencode

import requests

API_URL = "https://community.koodomobile.com/widget/pointsLeaderboard?"
LEADERBOARD_FILE = "leaderboard_data.json"


def get_leaderboard(period: str = "allTime", max_results: int = 20) -> list:
    """Fetch the current leaderboard from the API."""
    payload = {"period": period, "maxResults": max_results}
    return requests.get(f"{API_URL}{urlencode(payload)}").json()


def dump_leaderboard_data(leaderboard_data: dict) -> None:
    """Persist the accumulated scores to the JSON 'database'."""
    with open(LEADERBOARD_FILE, "w") as jf:
        json.dump(leaderboard_data, jf, indent=4, sort_keys=True)


def read_leaderboard_data(data_file: str) -> dict:
    with open(data_file) as f:
        return json.load(f)


def parse_leaderboard(leaderboard: list) -> dict:
    """Reshape the API response into {player: {id, rank, score_data}}."""
    return {
        item["name"]: {
            "id": item["id"],
            "score_data": {
                time.strftime("%Y-%m-%d %H:%M"): item["points"],
            },
            "rank": item["leaderboardPosition"],
        } for item in leaderboard
    }


def update_leaderboard_data(target: dict, new_data: dict) -> dict:
    """Merge freshly scraped scores into the stored data."""
    for player, stats in new_data.items():
        target[player]["rank"] = stats["rank"]
        target[player]["score_data"].update(stats["score_data"])
    return target


def leaderboard_to_csv(leaderboard: dict) -> None:
    """Write the wide table: one row per player, one column per timestamp."""
    data_rows = [
        [player] + list(stats["score_data"].values())
        for player, stats in leaderboard.items()
    ]
    # every player carries the same timestamps, so any player's
    # score_data keys can serve as the date headers
    random_player = random.choice(list(leaderboard.keys()))
    dates = list(leaderboard[random_player]["score_data"])

    with open("the_data.csv", "w", newline="") as output:
        w = csv.writer(output)
        w.writerow([""] + dates)
        w.writerows(data_rows)


def script_runner():
    # if the JSON 'database' already exists, merge in the new scrape
    # and refresh the csv; otherwise create the database first
    if os.path.isfile(LEADERBOARD_FILE):
        fresh_data = update_leaderboard_data(
            target=read_leaderboard_data(LEADERBOARD_FILE),
            new_data=parse_leaderboard(get_leaderboard()),
        )
        leaderboard_to_csv(fresh_data)
        dump_leaderboard_data(fresh_data)
    else:
        dump_leaderboard_data(parse_leaderboard(get_leaderboard()))


if __name__ == "__main__":
    script_runner()
The script also checks whether you already have a JSON file that pretends to be a proper database. If not, it writes the leaderboard data to one. The next time you run the script, it updates the JSON and spits out a fresh .csv file.
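As for the "update it daily at noon" part of your goal, that's a scheduler's job rather than the script's. A minimal sketch, assuming the third-party schedule package (cron or Windows Task Scheduler would do just as well):

import time

import schedule  # pip install schedule

# run the whole scrape -> update -> dump cycle every day at noon
schedule.every().day.at("12:00").do(script_runner)

while True:
    schedule.run_pending()
    time.sleep(60)  # check once a minute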
Hope this answer will nudge you in the right direction.
Hey, since you are loading it into a pandas DataFrame, the operations become fairly simple. I ran your code first:
import pandas as pd
import time
timestr = time.strftime("%Y-%m-%d %H:%M")
url_all_time = 'https://community.koodomobile.com/widget/pointsLeaderboard?period=allTime&maxResults=20&excludeRoles='
data = pd.read_json(url_all_time)
table = pd.DataFrame.from_records(data, index=['name'], columns=['points','name'])
table['date'] = pd.Timestamp.today().strftime('%m-%d-%Y')
Then I added a few more lines of code to modify the DataFrame to your needs:

# for each row (indexed by name), create a column named after that row's
# date value, put the points there, then drop the two helper columns
idxs = table['date'].index
for i, val in enumerate(idxs):
    table.at[val, table['date'].iloc[i]] = table['points'].iloc[i]
table = table.drop(['date', 'points'], axis=1)
In the snippet above I am using the DataFrame's ability to assign values by index. First I get the index values of the date column; then, for each of them, I add a column named after the required date (the value from the date column) and fill in the corresponding points using the indexes pulled earlier.
This gives me the following output:
name 10-24-2020
Dennis 52570.0
Dinh 40930.0
Sophia 26053.0
Mayumi 25300.0
Goran 24689.0
Robert T 19843.0
Allan M 19768.0
Bernard Koodo 14404.0
nim4165 13629.0
Timo Tuokkola 11216.0
rikkster 7338.0
David AKU 5774.0
Ranjan Koodo 4506.0
BobTheElectrician 4170.0
Helen Koodo 3370.0
Mihaela Koodo 2764.0
Fred C 2542.0
Philosoraptor 2122.0
Paul Deschamps 1973.0
Emilia Koodo 1755.0
I can then save this to csv using the last line from your code. Similarly, you can pull the data for more dates and format it to add to the same DataFrame; see the sketch after the snippet below.
table.to_csv('products.csv', index=True, encoding='utf-8')
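Here's that rough sketch: a later run merges a new date column into the saved csv. The wiring is my assumption, reusing url_all_time from your code:

import pandas as pd

# load what the previous run saved, keyed by name
existing = pd.read_csv('products.csv', index_col='name')

# scrape again on a later day, same steps as before
fresh = pd.read_json(url_all_time)
fresh = pd.DataFrame.from_records(fresh, index=['name'], columns=['points', 'name'])

# add today's points as a new date column (aligned on name) and save
existing[pd.Timestamp.today().strftime('%m-%d-%Y')] = fresh['points']
existing.to_csv('products.csv', index=True, encoding='utf-8')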
Related
Good evening everyone. I got the following JSON file from Walmart regarding their product items and prices.
I loaded up a Jupyter notebook, imported pandas, and loaded it into a DataFrame with custom columns, as shown below.
Now this is what I want to do:
Make new columns named minPrice and maxPrice and load the data into them.
How can I do that?
I also want the offer price, since some items don't have a minPrice and maxPrice :)
EDIT: Here is the Python code:
import json
import pandas as pd

# load the Walmart JSON dump
with open("walmart.json") as f:
    data = json.load(f)

walmart = data["items"]

# keep just the two columns of interest
wdf = pd.DataFrame(walmart, columns=["productId", "primaryOffer"])
print(wdf.loc[0, "primaryOffer"])

pd.set_option('display.max_colwidth', None)
print(wdf)
Here is the JSON File:
https://pastebin.com/sLGCFCDC
The following code snippet, on top of your code, achieves the required task:

min_prices = []
max_prices = []
offer_prices = []

for i, row in wdf.iterrows():
    # items with a price range carry the 'showMinMaxPrice' key
    if 'showMinMaxPrice' in row['primaryOffer']:
        min_prices.append(row['primaryOffer']['minPrice'])
        max_prices.append(row['primaryOffer']['maxPrice'])
        offer_prices.append('N/A')
    else:
        # otherwise only a single offer price is available
        min_prices.append('N/A')
        max_prices.append('N/A')
        offer_prices.append(row['primaryOffer']['offerPrice'])

wdf['minPrice'] = min_prices
wdf['maxPrice'] = max_prices
wdf['offerPrice'] = offer_prices

Here we check for the 'showMinMaxPrice' key in the dicts of the 'primaryOffer' column. Where minPrice and maxPrice are available, offerPrice is recorded as 'N/A', and vice versa. The values are first collected in lists and then added to the DataFrame as columns.
The output of wdf.head() then shows the three new columns alongside productId and primaryOffer.
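As a side note, the same logic can be written without the helper lists using Series.apply; a sketch over the same 'primaryOffer' dicts:

# equivalent to the loop above, one column at a time
wdf['minPrice'] = wdf['primaryOffer'].apply(
    lambda o: o['minPrice'] if 'showMinMaxPrice' in o else 'N/A')
wdf['maxPrice'] = wdf['primaryOffer'].apply(
    lambda o: o['maxPrice'] if 'showMinMaxPrice' in o else 'N/A')
wdf['offerPrice'] = wdf['primaryOffer'].apply(
    lambda o: 'N/A' if 'showMinMaxPrice' in o else o['offerPrice'])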
I have a large dictionary that contains weather data. You can take a look at it here
This weather data is for multiple days, and I want to get all of the values from one key. How would I do this?
Here is a simplified version of the dictionary:
{'data': {'day1': {'weather_discription': 'cloudy'},
          'day2': {'weather_discription': 'clear'}}}
I tried to use this code:
import requests
r = requests.get('data website')
res = r.json()
print(res['weather_discription'])
You need a loop to get them all (note that the key in your data is spelled weather_discription):

for day, day_data in res['data'].items():
    print(f"Weather on {day} was {day_data['weather_discription']}")
I have a dataframe where each cell holds a dictionary. Before exporting the dataframe, I could call each cell as an individual dataframe.
However, after saving the dataframe as csv and reopening it, each cell became a string, so I could no longer turn a cell into a dataframe.
The output should look like a nested dataframe, but after saving the dataframe as csv, the dictionary became a string.
I was surprised to learn from my research on Stack Overflow that not many people have experienced the same issue, and I wondered whether my practice is wrong. I only found two posts related to my issue; here is one (dict objects converting to string when read from csv to dataframe pandas python).
I basically tried json, ast.literal_eval and yaml, but none of these solved my issue.
This is the first part of my code (I created these four lists to store the data I pulled from an API):
tickers4 = []
last_1st_bs4 = []
last_2nd_bs4 = []
last_3rd_bs4 = []
for i in range(len(tickers)):
    try:
        ticker = tickers.loc[i, 'ticker']
        ann_yr = 2018
        # pull the last three fiscal-year balance sheets
        yr_1st = intrinio.financials_period(ticker, str(ann_yr-1), fiscal_period='FY', statement='balance_sheet')
        yr_2nd = intrinio.financials_period(ticker, str(ann_yr-2), fiscal_period='FY', statement='balance_sheet')
        yr_3rd = intrinio.financials_period(ticker, str(ann_yr-3), fiscal_period='FY', statement='balance_sheet')
        tickers4.append(ticker)
        last_1st_bs4.append(yr_1st)
        last_2nd_bs4.append(yr_2nd)
        last_3rd_bs4.append(yr_3rd)
        print('{} Feeding data {}'.format(i, ticker))
    except:
        # on any failure, keep the row but store zeros
        tickers4.append(ticker)
        last_1st_bs4.append(0)
        last_2nd_bs4.append(0)
        last_3rd_bs4.append(0)
        print('{} Error {}'.format(i, ticker))
Second part: I put them into a dataframe and saved it as csv.
BS = pd.DataFrame()
BS['ticker'] = tickers4
BS['BS_2017'] = last_1st_bs4
BS['BS_2016'] = last_2nd_bs4
BS['BS_2015'] = last_3rd_bs4
BS.to_csv('Balance_Sheet_2015_2017.csv')
Now I need to read this csv in another notebook:
BS = pd.read_csv('./Balance_Sheet_2015_2017.csv', index_col=0)
BS.loc[9, 'BS_2017']
here is the result I got:
' cashandequivalents shortterminvestments notereceivable \\\nyear \n2017 2.028900e+10 5.389200e+10 1.779900e+10 \n\n accountsreceivable netinventory othercurrentassets \\\nyear \n2017 1.787400e+10 4.855000e+09 1.393600e+10 \n\n totalcurrentassets netppe longterminvestments \\\nyear \n2017 1.286450e+11 3.378300e+10 1.947140e+11 \n\n othernoncurrentassets ... \\\nyear ... \n2017 1.817700e+10 ... \n\n commitmentsandcontingencies commonequity retainedearnings \\\nyear \n2017 0.0 3.586700e+10 9.833000e+10 \n\n aoci totalcommonequity totalequity \\\nyear \n2017 -150000000.0 1.340470e+11 1.340470e+11 \n\n totalequityandnoncontrollinginterests totalliabilitiesandequity \\\nyear \n2017 1.340470e+11 3.753190e+11 \n\n currentdeferredrevenue noncurrentdeferredrevenue \nyear \n2017 7.548000e+09 2.836000e+09 \n\n[1 rows x 30 columns]'
Thanks for your help.
CSV is not an appropriate format for saving dictionaries (and honestly, putting dictionaries into DataFrames isn't a great data structure). You should try writing the DataFrame to json instead: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
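A toy round trip showing that suggestion, with a made-up file name and values:

import pandas as pd

df = pd.DataFrame({'ticker': ['AAPL'],
                   'BS_2017': [{'cash': 2.0289e10, 'ppe': 3.3783e10}]})

df.to_json('balance_sheet.json')           # dict cells become nested JSON
df2 = pd.read_json('balance_sheet.json')
print(type(df2.loc[0, 'BS_2017']))         # <class 'dict'> again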
I had this same error once. I solved it by using DataFrame.to_pickle() instead of DataFrame.to_csv().
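A sketch of that against your BS frame (the file name is just an example):

import pandas as pd

# pickle stores the cells as real Python objects, not text
BS.to_pickle('Balance_Sheet_2015_2017.pkl')
BS2 = pd.read_pickle('Balance_Sheet_2015_2017.pkl')
BS2.loc[9, 'BS_2017']  # no longer a string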
Everything in a CSV file is plain text, even the numerical values. When you load a CSV file into a spreadsheet program, there are parsers which look for strings which are recognizable as numbers, or dates, and convert them accordingly.
A CSV file can't easily hold more complex Python objects, but pandas won't throw an error if you place Python objects in a DataFrame; it silently converts them to their string representations when writing.
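You can see that stringification with a tiny round trip (a toy frame, not your data):

import pandas as pd

df = pd.DataFrame({'col': [{'a': 1}]})
df.to_csv('demo.csv', index=False)

back = pd.read_csv('demo.csv')
print(repr(back.loc[0, 'col']))  # "{'a': 1}" -- now a string
print(type(back.loc[0, 'col']))  # <class 'str'>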
I'm moving to R from Python and am trying to use my Python skills to get familiar with scraping JSON in R. I'm having some issues viewing and scraping what I'd like to. I'm pretty sure I have the for loops down, but I'm unsure how to select keys and return their contents. I've read some documentation, but being new to R, it's a little tough to understand. To illustrate, I wrote a quick Python script showing what I'm trying to do in RStudio:
import requests
from pprint import pprint

start = '2018-10-03'
end = '2018-10-10'

req = requests.get('https://statsapi.web.nhl.com/api/v1/schedule?startDate=' + str(start) + '&endDate=' + str(end) + '&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data = req.json()['dates']

for info in data:
    date = info['date']
    games = info['games']
    for game in games:
        gamePk = game['gamePk']
        print(date, gamePk)
Below is what I have started with, but I'm having trouble viewing my JSON other than printing data, which locks up RStudio. I'd like to be able to view the dictionaries and keys as I go. The other question is how I would then add the key-values to a vector or data frame and write them out. I'm familiar with exporting to a file, but curious how I add the values to a data frame. Would that be rbind, or would I not have to do that?
library(jsonlite)
start <- as.Date(c('2018-10-03'))
end <- as.Date(c('2019-04-15'))
url <- paste0('https://statsapi.web.nhl.com/api/v1/schedule?startDate=', start,'&endDate=', end,'&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data <- fromJSON(url)
To expound on my issue, here is a further sample of where the struggle lies.
library(jsonlite)

start <- as.Date(c('2018-10-03'))
end <- as.Date(c('2018-10-04'))

url <- paste0('https://statsapi.web.nhl.com/api/v1/schedule?startDate=', start,'&endDate=', end,'&hydrate=team(leaders(categories=[points,goals,assists],gameTypes=[P])),linescore,broadcasts(all),tickets,game(content(media(epg),highlights(scoreboard)),seriesSummary),radioBroadcasts,metadata,seriesSummary(series),decisions,scoringplays&leaderCategories=&site=en_nhl&teamId=&gameType=&timecode=')
data <- fromJSON(url)

date <- data$dates$date
game_id <- data$dates$games

game <- NULL
for (ids in game_id) {
  pk <- ids$gamePk
  game <- rbind(game, pk)
}
I figured the pk values would end up in one column, but they're spread across multiple columns, and I get the warning "In rbind: number of columns of result is not a multiple of vector length".
I have a bunch of files containing atmospheric measurements in one directory. The file format is NetCDF. Each file has a timestamp (variable 'base_time'). I can read all the files and plot individual measurement events (temperature vs. altitude).
What I need to do next is group the files by day and plot all measurements taken on a single day together in one plot. Unfortunately, I have no clue how to do that.
One idea is to use the variable measurement_day as it is defined in the code below.
For each day I normally have four different files containing temperature and altitude.
Ideally, the data from those four files should be grouped (e.g. for plotting).
I hope my question is clear. Can anyone please help me?
EDIT: I'm trying to use a dictionary now, but I have trouble determining whether an entry already exists for a given measurement day. Please see the edited code below.
from os import listdir
from os.path import isfile, join
from time import gmtime, strftime

from netCDF4 import Dataset

data = {}  # was edited

for f in listdir(path):  # path is the directory holding the NetCDF files
    if isfile(join(path, f)):
        full_path = join(path, f)
        f = Dataset(full_path, 'r')
        basetime = f.variables['base_time'][:]
        altitude = f.variables['alt'][:]
        temp = f.variables['tdry'][:]
        actual_date = strftime("%Y-%m-%d %H:%M:%S", gmtime(basetime))
        measurement_day = strftime("%Y-%m-%d", gmtime(basetime))
        # check if dict entries for the day already exist; if not,
        # create an empty dict and empty lists inside
        if len(data[measurement_day]) == 0:
            data[measurement_day] = {}
        else: pass
        if len(data[measurement_day]['temp']) == 0:
            data[measurement_day]['temp'] = []
            data[measurement_day]['altitude'] = []
        else: pass
I get the following error message:
Traceback (most recent call last):
  ...
    if len(data[measurement_day]) == 0:
KeyError: '2009/05/28'
Can anyone please help me.
I will try. Though I'm not totally clear on what you already have.
I can read all files and plot individual measurement events
(temperature vs. altitude). What I need to do next is "group the files
by day" and plot all measurements taken at one single day together in
one plot.
From this, I am assuming that you know how to plot the information given a list of Datasets. To get that list of Datasets, try something like this.
from os import listdir
from os.path import isfile, join
from time import gmtime, strftime

from netCDF4 import Dataset

# a dictionary of lists that hold all the datasets from a given day
grouped_datasets = {}

for f in listdir(path):  # path is the directory holding the NetCDF files
    if isfile(join(path, f)):
        full_path = join(path, f)
        f = Dataset(full_path, 'r')
        basetime = f.variables['base_time'][:]
        altitude = f.variables['alt'][:]
        temp = f.variables['tdry'][:]
        actual_date = strftime("%Y-%m-%d %H:%M:%S", gmtime(basetime))
        measurement_day = strftime("%Y-%m-%d", gmtime(basetime))

        # if we haven't encountered any datasets from this day yet...
        if measurement_day not in grouped_datasets:
            # add that day to our dict
            grouped_datasets[measurement_day] = []

        # now append our dataset to the correct day (list)
        grouped_datasets[measurement_day].append(f)
Now you have a dictionary keyed on measurement_day. I'm not sure how you are graphing your data, so this is as far as I can get you. Hope it helps, good luck.
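That said, here's a rough plotting sketch in case it's useful; I'm assuming matplotlib and the tdry/alt variables from your snippet, so adjust to taste:

import matplotlib.pyplot as plt

# one figure per measurement day, with all of that day's files overlaid
for day, datasets in grouped_datasets.items():
    fig, ax = plt.subplots()
    for ds in datasets:
        ax.plot(ds.variables['tdry'][:], ds.variables['alt'][:])
    ax.set_xlabel('Temperature (tdry)')
    ax.set_ylabel('Altitude (alt)')
    ax.set_title(day)
plt.show()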