Converting NBA play by play specific .json to .csv - python
I am trying to build a database containing play-by-play data for several seasons of NBA games, for my MSc in economics dissertation. Currently I am extracting games from the NBA's API (see example) and splitting each game into a separate .json file using this routine (duly adapted for play-by-play purposes), which yields .json files like the following (first plays shown):
{"headers": ["GAME_ID", "EVENTNUM", "EVENTMSGTYPE", "EVENTMSGACTIONTYPE", "PERIOD", "WCTIMESTRING", "PCTIMESTRING", "HOMEDESCRIPTION", "NEUTRALDESCRIPTION", "VISITORDESCRIPTION", "SCORE", "SCOREMARGIN"], "rowSet": [["0041400406", 0, 12, 0, 1, "9:11 PM", "12:00", null, null, null, null, null], ["0041400406", 1, 10, 0, 1, "9:11 PM", "12:00", "Jump Ball Mozgov vs. Green: Tip to Barnes", null, null, null, null]
I plan to write a loop that converts all of the generated .json files to .csv, so that I can proceed to econometric analysis in Stata. At the moment I am stuck at the first step of this procedure: converting a single .json file to CSV (I will design the loop afterwards). The code I am trying is:
f = open('pbp_0041400406.json')
data = json.load(f)
f.close()

with open("pbp_0041400406.csv", "w") as file:
    csv_file = csv.writer(file)
    for rowSet in data:
        csv_file.writerow(rowSet)

f.close()
However, the resulting CSV files show awkward results: one line reading h,e,a,d,e,r,s and another reading r,o,w,S,e,t, so neither the headers nor the rowSet (the plays themselves) are captured.
I have tried to solve this problem taking into account the contributions in this thread, but I have not been able to. Can anybody please give me some insight into solving this problem?
[EDIT] Replacing rowSet with data in the original code also yielded the same results.
Thanks in advance!
Try this. Iterating over the top-level dict gives you its keys ('headers' and 'rowSet'), and writerow() on a string then writes it one character per column; write data['headers'] once and then iterate over data['rowSet'] instead:
import json
import csv

with open('json.json') as f:
    data = json.load(f)

with open("pbp_0041400406.csv", "w") as fout:
    csv_file = csv.writer(fout, quotechar='"')
    csv_file.writerow(data['headers'])
    for rowSet in data['rowSet']:
        csv_file.writerow(rowSet)
Resulting CSV:
GAME_ID,EVENTNUM,EVENTMSGTYPE,EVENTMSGACTIONTYPE,PERIOD,WCTIMESTRING,PCTIMESTRING,HOMEDESCRIPTION,NEUTRALDESCRIPTION,VISITORDESCRIPTION,SCORE,SCOREMARGIN
0041400406,0,12,0,1,9:11 PM,12:00,,,,,
0041400406,1,10,0,1,9:11 PM,12:00,Jump Ball Mozgov vs. Green: Tip to Barnes,,,,
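Since the question mentions looping over all of the generated .json files afterwards, here is a minimal sketch of that loop (assuming each per-game file has the {'headers', 'rowSet'} structure shown in the question and is named pbp_<GameID>.json; adjust the glob pattern if your naming differs):

import csv
import glob
import json

# convert every per-game play-by-play JSON file in the working directory
for json_path in glob.glob('pbp_*.json'):
    with open(json_path) as f:
        data = json.load(f)
    csv_path = json_path.replace('.json', '.csv')
    with open(csv_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerow(data['headers'])   # column names once
        writer.writerows(data['rowSet'])   # one row per play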
I think you might have made a mistake regarding the structure of the JSON input. There are three keys at the top level; 'resultSets' maps to a list whose first element is a dictionary with the key 'rowSet', and that is what I think you want to iterate over.
f = open('playbyplay', 'r')
data = json.load(f)
f.close()

print(data.keys())

rows = data['resultSets'][0]['rowSet']

with open("pbp_0041400406.csv", "w") as file:
    csv_file = csv.writer(file)
    for rowSet in rows:
        csv_file.writerow(rowSet)
Output data:
0041300402,0,12,0,1,8:13 PM,12:00,,,,,,0,0,,,,,,0,0,,,,,,0,0,,,,,
0041300402,1,10,0,1,8:13 PM,12:00,Jump Ball Duncan vs. Bosh: Tip to Wade,,,,,4,1495,Tim Duncan,1610612759,San Antonio,Spurs,SAS,5,2547,Chris Bosh,1610612748,Miami,Heat,MIA,5,2548,Dwyane Wade,1610612748,Miami,Heat,MIA
0041300402,2,5,2,1,8:13 PM,11:45,Green STEAL (1 STL),,James Lost Ball Turnover (P1.T1),,,5,2544,LeBron James,1610612748,Miami,Heat,MIA,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
0041300402,3,1,1,1,8:14 PM,11:26,Green 18' Jump Shot (2 PTS) (Splitter 1 AST),,,0 - 2,2,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,4,201168,Tiago Splitter,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
0041300402,4,6,2,1,8:14 PM,11:03,Green S.FOUL (P1.T1),,,,,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,1,0,,,,,
0041300402,5,3,11,1,8:14 PM,11:03,,,Lewis Free Throw 1 of 2 (1 PTS),1 - 2,1,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,0,0,,,,,,0,0,,,,,
0041300402,7,3,12,1,8:14 PM,11:03,,,MISS Lewis Free Throw 2 of 2,,,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,0,0,,,,,,0,0,,,,,
0041300402,9,4,0,1,8:15 PM,11:01,Splitter REBOUND (Off:0 Def:1),,,,,4,201168,Tiago Splitter,1610612759,San Antonio,Spurs,SAS,0,0,,,,,,0,0,,,,,
0041300402,10,1,5,1,8:15 PM,10:52,Duncan 1' Layup (2 PTS) (Parker 1 AST),,,1 - 4,3,4,1495,Tim Duncan,1610612759,San Antonio,Spurs,SAS,4,2225,Tony Parker,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
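If the column names should appear in the output as well, the same result set carries a 'headers' list next to 'rowSet'; a small sketch following the structure used above:

rows = data['resultSets'][0]['rowSet']
headers = data['resultSets'][0]['headers']

with open("pbp_0041400406.csv", "w") as file:
    csv_file = csv.writer(file)
    csv_file.writerow(headers)   # write the column names first
    for rowSet in rows:
        csv_file.writerow(rowSet)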
Issue solved! Using @MaxU's code and a previously constructed .csv containing all game IDs, every NBA game since the 2001-02 season can be scraped directly as JSON and converted to CSV with the following code (credit to @MaxU):
from __future__ import print_function

import json
import csv
import requests

u_a = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"
url_pattern = "http://stats.nba.com/stats/playbyplayv2?GameID=%(GameID)s&StartPeriod=%(StartPeriod)s&EndPeriod=%(EndPeriod)s&tabView=%(tabView)s"

def write_csv(game_id, resultSet):
    fn = resultSet['name'] + '_' + str(game_id) + '.csv'
    # ignore unimportant resultsets ...
    if resultSet['name'] not in ['PlayByPlay', 'PlayBlahBlah']:
        return
    with open(fn, 'w') as fout:
        csv_file = csv.writer(fout, quotechar='"')
        csv_file.writerow(resultSet['headers'])
        for rowSet in resultSet['rowSet']:
            csv_file.writerow(rowSet)

def process_game_id(game_id, tabView='playbyplay',
                    start_period='0', end_period='0'):
    url_parms = {
        'GameID': game_id,
        'StartPeriod': start_period,
        'EndPeriod': end_period,
        'tabView': tabView,
    }
    r = requests.get((url_pattern % url_parms), headers={"USER-AGENT": u_a})
    if r.status_code == requests.codes.ok:
        data = json.loads(r.text)
        for rset in data['resultSets']:
            write_csv(url_parms['GameID'], rset)
    else:
        r.raise_for_status()

if __name__ == '__main__':
    #
    # assuming that the 'games.csv' file contains all Game_IDs ...
    #
    with open('games.csv', 'r') as f:
        csv_reader = csv.reader(f, delimiter=',')
        for row in csv_reader:
            process_game_id(row[<column_num_containing_Game_ID>])
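To get everything into a single table for Stata afterwards, the per-game CSVs can be appended into one file. A rough sketch, assuming the PlayByPlay_<GameID>.csv files produced above all share the same header (the output filename is only an example):

import csv
import glob

with open('playbyplay_all.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    header_written = False
    for path in sorted(glob.glob('PlayByPlay_*.csv')):
        with open(path, newline='') as fin:
            reader = csv.reader(fin)
            header = next(reader, None)
            if header is None:          # skip empty files
                continue
            if not header_written:      # keep the header only once
                writer.writerow(header)
                header_written = True
            writer.writerows(reader)    # append the data rows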
If you have any further questions on this data, please do PM me. Happy coding, everyone!
Related
Trouble with nested JSON where the key index changes dynamically
I'm scraping an API for NBA player props. It's a nested JSON that I filter down to my desired output.

import json
from urllib.request import urlopen
import csv
import os

# Delete CSV
os.remove('Path')

jsonurl = urlopen('https://sportsbook.draftkings.com//sites/US-SB/api/v4/eventgroups/88670846/categories/583/subcategories/5001')
games = json.loads(jsonurl.read())['eventGroup']['offerCategories'][8]['offerSubcategoryDescriptors'][0]['offerSubcategory']['offers']

# Open a new file as our CSV file
with open('Path', "a", newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    # Add a header
    csv_writer.writerow(["participant", "line", "oddsDecimal"])
    for game in games:
        for in_game in game:
            outcomes = in_game.get('outcomes')
            for outcome in outcomes:
                # Write CSV
                csv_writer.writerow([
                    outcome["participant"],
                    outcome["line"],
                    outcome["oddsDecimal"]
                ])

My issue is that the index "8" at "offerCategories" is hardcoded in my code (as is the "0" at the subcategory level), and it changes from day to day at this provider. I'm not familiar with this stuff and can't figure out how to look the category up by the string "name": "Player Combos" (index 8 in this example). Thanks in advance!
Given the data structure you have, you need to iterate through the 'offerCategories' sub-list looking for the map with the appropriate 'name' key. It seems that your use of '0' as an index is fine, since there is only a single value to choose in that case:

import json
from urllib.request import urlopen

jsonurl = urlopen(
    'https://sportsbook.draftkings.com//sites/US-SB/api/v4/eventgroups/88670846/categories/583/subcategories/5001')
games = json.loads(jsonurl.read())
for offerCategory in games['eventGroup']['offerCategories']:
    if offerCategory['name'] == 'Player Combos':
        offers = offerCategory['offerSubcategoryDescriptors'][0]['offerSubcategory']['offers']
        print(offers)
        break

Result:

[[{'providerOfferId': '129944623', 'providerId': 2, 'providerEventId': '27323732', 'providerEventGroupId': '42648', 'label': 'Brandon Clarke Points + Assists + Rebounds...
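The same lookup can be wrapped in a small reusable helper (find_by_name is a hypothetical name, not part of any API):

def find_by_name(items, name):
    # return the first dict in the list whose 'name' key matches, or None
    return next((item for item in items if item.get('name') == name), None)

category = find_by_name(games['eventGroup']['offerCategories'], 'Player Combos')
if category is not None:
    offers = category['offerSubcategoryDescriptors'][0]['offerSubcategory']['offers']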
Tweepy, how to pull count integer and use it
I trust all is well with everyone here. My apologies if this has been answered before; I am trying to do the following.

cursor = tweepy.Cursor(
    api.search_tweets,
    q = '"Hello"',
    lang = 'en',
    result_type = 'recent',
    count = 2
)

I want to match the number in count to the number of JSON objects I will be iterating through.

for tweet in cursor.items():
    tweet_payload = json.dumps(tweet._json, indent=4, sort_keys=True)

I have tried several different ways to write the data, but it appears that the following does not work (currently it only fires once):

with open("Tweet_Payload.json", "w") as outfile:
    outfile.write(tweet_payload)

time.sleep(.25)
outfile.close()

This is what it looks like put together:

import time
import tweepy
from tweepy import cursor
import Auth_Codes
import json

twitter_auth_keys = {
    "consumer_key": Auth_Codes.consumer_key,
    "consumer_secret": Auth_Codes.consumer_secret,
    "access_token": Auth_Codes.access_token,
    "access_token_secret": Auth_Codes.access_token_secret
}

auth = tweepy.OAuthHandler(
    twitter_auth_keys["consumer_key"],
    twitter_auth_keys["consumer_secret"]
)
auth.set_access_token(
    twitter_auth_keys["access_token"],
    twitter_auth_keys["access_token_secret"]
)
api = tweepy.API(auth)

cursor = tweepy.Cursor(
    api.search_tweets,
    q = '"Hello"',
    lang = 'en',
    result_type = 'recent',
    count = 2
)

for tweet in cursor.items():
    tweet_payload = json.dumps(tweet._json, indent=4, sort_keys=True)

with open("Tweet_Payload.json", "w") as outfile:
    outfile.write(tweet_payload)

time.sleep(.25)
outfile.close()

Edit: using the suggestion by Mickael, the current code

tweet_payload = []
for tweet in cursor.items():
    tweet_payload.append(tweet._json)

print(json.dumps(tweet_payload, indent=4, sort_keys=True))

with open("Tweet_Payload.json", "w") as outfile:
    outfile.write(json.dumps(tweet_payload, indent=4, sort_keys=True))

time.sleep(.25)

just loops; I am not sure why that is the case when the count is 10. I thought it would make just one call for 10 results or fewer, then end.
Opening the file with the write mode erases its previous data, so if you want to add each new tweet to the file, you should use the append mode instead. As an alternative, you could also store all the tweets' JSON in a list and write them all at once. That should be more efficient, and the list at the root of your JSON file will make it valid.

json_tweets = []
for tweet in cursor.items():
    json_tweets.append(tweet._json)

with open("Tweet_Payload.json", "w") as outfile:
    outfile.write(json.dumps(json_tweets, indent=4, sort_keys=True))

On a side note, the with closes the file automatically; you don't need to do it yourself.
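If appending as tweets arrive is preferred instead, one common pattern is JSON Lines: one object per line, written in append mode. A sketch, reusing the cursor from the question (the filename is only an example):

import json

for tweet in cursor.items():
    with open("tweets.jsonl", "a", encoding="utf-8") as outfile:
        # each line is a complete JSON object; the file as a whole is not a single JSON document
        outfile.write(json.dumps(tweet._json, sort_keys=True) + "\n")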
Combine two python scripts for web search
I'm trying to download files from a site, and due to search result limitations (max 300) I need to search each item individually. I have a csv file with the complete list, and I've written some basic code to return the ID# column. With some help, I've got another script that iterates through each search result and downloads a file. What I need to do now is combine the two, so that it searches each individual ID# and downloads the file. I know my loop is messed up here; I just can't figure out where, and whether I'm even looping in the right order.

import requests, json, csv

faciltiyList = []
with open('Facility List.csv', 'r') as f:
    csv_reader = csv.reader(f, delimiter=',')
    for searchterm in csv_reader:
        faciltiyList.append(searchterm[0])

url = "https://siera.oshpd.ca.gov/FindFacility.aspx"
r = requests.get(url+"?term="+str(searchterm))
searchresults = json.loads(r.content.decode('utf-8'))
for report in searchresults:
    rpt_id = report['RPT_ID']
    reporturl = f"https://siera.oshpd.ca.gov/DownloadPublicFile.aspx?archrptsegid={rpt_id}&reporttype=58&exportformatid=8&versionid=1&pageid=1"
    r = requests.get(reporturl)
    a = r.headers['Content-Disposition']
    filename = a[a.find("filename=")+9:len(a)]
    file = open(filename, "wb")
    file.write(r.content)
    r.close()

The original code I have is here:

import requests, json

searchterm="ALAMEDA (COUNTY)"
url="https://siera.oshpd.ca.gov/FindFacility.aspx"
r=requests.get(url+"?term="+searchterm)
searchresults=json.loads(r.content.decode('utf-8'))
for report in searchresults:
    rpt_id=report['RPT_ID']
    reporturl=f"https://siera.oshpd.ca.gov/DownloadPublicFile.aspx?archrptsegid={rpt_id}&reporttype=58&exportformatid=8&versionid=1&pageid=1"
    r=requests.get(reporturl)
    a=r.headers['Content-Disposition']
    filename=a[a.find("filename=")+9:len(a)]
    file = open(filename, "wb")
    file.write(r.content)
    r.close()

The searchterm = "ALAMEDA (COUNTY)" returns more than 300 results, so I'm trying to replace "ALAMEDA (COUNTY)" with a list that runs through each name (ID# in this case), so that I get just one result, then run again for the next one on the list.
CSV - just 1 line

Tested with a CSV file with just 1 line:

406014324,"HOLISTIC PALLIATIVE CARE, INC.",550004188,Parent Facility,5707 REDWOOD RD,OAKLAND,94619,1,ALAMEDA,Not Applicable,,Open,1/1/2018,Home Health Agency/Hospice,Hospice,37.79996,-122.17075

Python code

This script reads the IDs from the CSV file. Then, it fetches the results from the URL and finally writes the desired contents to disk.

import requests, json, csv

# read Ids from csv
facilityIds = []
with open('Facility List.csv', 'r') as f:
    csv_reader = csv.reader(f, delimiter=',')
    for searchterm in csv_reader:
        facilityIds.append(searchterm[0])

# fetch and write file contents
url = "https://siera.oshpd.ca.gov/FindFacility.aspx"
for facilityId in facilityIds:
    r = requests.get(url+"?term="+str(facilityId))
    reports = json.loads(r.content.decode('utf-8'))
    # print(f"reports = {reports}")
    for report in reports:
        rpt_id = report['RPT_ID']
        reporturl = f"https://siera.oshpd.ca.gov/DownloadPublicFile.aspx?archrptsegid={rpt_id}&reporttype=58&exportformatid=8&versionid=1&pageid=1"
        r = requests.get(reporturl)
        a = r.headers['Content-Disposition']
        filename = a[a.find("filename=")+9:len(a)]
        # print(f"filename = {filename}")
        with open(filename, "wb") as o:
            o.write(r.content)

Repl.it link
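One detail worth noting: search terms such as "ALAMEDA (COUNTY)" contain spaces and parentheses, so letting requests build (and URL-encode) the query string is safer than concatenating it by hand, for example:

# requests encodes the query string itself
r = requests.get(url, params={"term": facilityId})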
Editing a downloaded CSV in memory before writing
Forewarning: I am very new to Python and programming in general. I am trying to use Python 3 to get some CSV data and make some changes to it before writing it to a file. My problem lies in accessing the CSV data from a variable, like so:

import csv
import requests

csvfile = session.get(url)
reader = csv.reader(csvfile.content)
for row in reader:
    do(something)

This returns:

_csv.Error: iterator should return strings, not int (did you open the file in text mode?)

Googling revealed that I should be feeding the reader text instead of bytes, so I also attempted:

reader = csv.reader(csvfile.text)

This also does not work, as the loop works through it letter by letter instead of line by line. I also experimented with TextIOWrapper and similar options with no success. The only way I have managed to get this to work is by writing the data to a file, reading it, and then making changes, like so:

csvfile = session.get(url)
with open("temp.txt", 'wb') as f:
    f.write(csvfile.content)

with open("temp.txt", 'rU', encoding="utf8") as data:
    reader = csv.reader(data)
    for row in reader:
        do(something)

I feel like this is far from the most optimal way of doing this, even if it works. What is the proper way to read and edit the CSV data directly from memory, without having to save it to a temporary file?
You don't have to write to a temp file. Here is what I would do, using the "csv" and "requests" modules:

import csv
import requests

__csvfilepathname__ = r'c:\test\test.csv'
__url__ = 'https://server.domain.com/test.csv'

def csv_reader(filename, enc = 'utf_8'):
    with open(filename, 'r', encoding = enc) as openfileobject:
        reader = csv.reader(openfileobject)
        for row in reader:
            # do something
            print(row)
    return

def csv_from_url(url):
    line = ''
    datalist = []
    s = requests.Session()
    r = s.get(url)
    for x in r.text.replace('\r',''):
        if not x[0] == '\n':
            line = line + str(x[0])
        else:
            datalist.append(line)
            line = ''
    datalist.append(line)
    # at this point you already have a data list 'datalist'
    # no need really to use the csv.reader object, but here goes:
    reader = csv.reader(datalist)
    for row in reader:
        # do something
        print(row)
    return

def main():
    csv_reader(__csvfilepathname__)
    csv_from_url(__url__)
    return

if __name__ == '__main__':
    main()

Not very pretty, and probably not very good in regards to memory/performance, depending on how "big" your CSV/data is.

HTH, Edwin.
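A shorter route, assuming the response decodes cleanly as text, is to hand csv.reader the response lines directly, a sketch:

import csv
import io
import requests

r = requests.get('https://server.domain.com/test.csv')   # URL reused from the example above
r.raise_for_status()
reader = csv.reader(io.StringIO(r.text))
# or equivalently: reader = csv.reader(r.text.splitlines())
for row in reader:
    print(row)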
extracting data from CSV file using a reference
I have a csv file with several hundred organism IDs and a second csv file with several thousand organism IDs plus additional characteristics (taxonomic information, abundances per sample, etc.).

I am trying to write code that extracts the information from the larger csv using the smaller csv file as a reference. That is, it should look at both files, and if an ID is in both, extract all the information for that ID from the larger file and write it to a new file (basically write the entire row for that ID).

So far I have written the following, and while the code does not error out on me, I get a blank file in the end and I don't exactly know why. I am a graduate student who knows some simple coding, but I'm still very much a novice. Thank you.

import sys
import csv
import os.path

SparCCnames=open(sys.argv[1],"rU")
OTU_table=open(sys.argv[2],"rU")
new_file=open(sys.argv[3],"w")

Sparcc_OTUs=csv.writer(new_file)

d=csv.DictReader(SparCCnames)
ids=csv.DictReader(OTU_table)

for record in ids:
    idstopull=record["OTUid"]
    if idstopull[0]=="OTUid":
        continue
    if idstopull[0] in d:
        new_id.writerow[idstopull[0]]

SparCCnames.close()
OTU_table.close()
new_file.close()
I'm not sure what you're trying to do in your code, but you can try this:

import csv

def csv_to_dict(csv_file_path):
    csv_file = open(csv_file_path, 'rb')
    csv_file.seek(0)
    sniffdialect = csv.Sniffer().sniff(csv_file.read(10000), delimiters='\t,;')
    csv_file.seek(0)
    dict_reader = csv.DictReader(csv_file, dialect=sniffdialect)
    csv_file.seek(0)
    dict_data = []
    for record in dict_reader:
        dict_data.append(record)
    csv_file.close()
    return dict_data

def dict_to_csv(csv_file_path, dict_data):
    csv_file = open(csv_file_path, 'wb')
    writer = csv.writer(csv_file, dialect='excel')
    headers = dict_data[0].keys()
    writer.writerow(headers)  # headers must be the same with dat.keys()
    for dat in dict_data:
        line = []
        for field in headers:
            line.append(dat[field])
        writer.writerow(line)
    csv_file.close()

if __name__ == "__main__":
    big_csv = csv_to_dict('/path/to/big_csv_file.csv')
    small_csv = csv_to_dict('/path/to/small_csv_file.csv')
    output = []
    for s in small_csv:
        for b in big_csv:
            if s['id'] == b['id']:
                output.append(b)
    if output:
        dict_to_csv('/path/to/output.csv', output)
    else:
        print "Nothing."

Hope that will help.
You need to read the data into a data structure; assuming OTUid is unique, you can store it in a dictionary for fast lookup:

with open(sys.argv[1],"rU") as SparCCnames:
    d = csv.DictReader(SparCCnames)
    fieldnames = d.fieldnames
    data = {i['OTUid']: i for i in d}

with open(sys.argv[2],"rU") as OTU_table, open(sys.argv[3],"w") as new_file:
    Sparcc_OTUs = csv.DictWriter(new_file, fieldnames)
    ids = csv.DictReader(OTU_table)
    for record in ids:
        if record['OTUid'] in data:
            Sparcc_OTUs.writerow(data[record['OTUid']])
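If the output file should also start with a header row, csv.DictWriter provides writeheader(); a one-line addition right after the writer is created, assuming the same fieldnames as above:

Sparcc_OTUs = csv.DictWriter(new_file, fieldnames)
Sparcc_OTUs.writeheader()   # write the column names once, before the data rows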
Thank you everyone for your help. I played with things and consulted with an advisor, and finally got a working script. I am posting it in case it helps someone else in the future. Thanks!

import sys
import csv

input_file = csv.DictReader(open(sys.argv[1], "rU"))  # has all info
ref_list = csv.DictReader(open(sys.argv[2], "rU"))  # reference list
output_file = csv.DictWriter(
    open(sys.argv[3], "w"), input_file.fieldnames)  # to write output file with headers
output_file.writeheader()  # write headers in output file

white_list = {}  # create empty dictionary
for record in ref_list:  # for every line in my reference list
    white_list[record["Sample_ID"]] = None  # store the IDs into the dictionary as keys

for record in input_file:  # for every line in my input file
    record_id = record["Sample_ID"]  # store the ID into variable record_id
    if (record_id in white_list):  # if the ID is in the reference list
        output_file.writerow(record)  # write the entire row into the new file
    else:  # if it is not in my reference list
        continue  # ignore it and continue iterating through the file
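As a side note, the whitelist above only ever uses the dictionary keys, so a set expresses the same idea a bit more directly; a sketch of the equivalent logic:

white_list = {record["Sample_ID"] for record in ref_list}   # set of IDs to keep

for record in input_file:
    if record["Sample_ID"] in white_list:
        output_file.writerow(record)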