Trouble with nested JSON where the key index changes dynamically - Python
I'm scraping an API for NBA player props. The response is nested JSON, which I filter down to my desired output.
import json
from urllib.request import urlopen
import csv
import os

# Delete CSV
os.remove('Path')

jsonurl = urlopen('https://sportsbook.draftkings.com//sites/US-SB/api/v4/eventgroups/88670846/categories/583/subcategories/5001')
games = json.loads(jsonurl.read())['eventGroup']['offerCategories'][8]['offerSubcategoryDescriptors'][0]['offerSubcategory']['offers']

# Open a new file as our CSV file
with open('Path', "a", newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    # Add a header
    csv_writer.writerow(["participant", "line", "oddsDecimal"])
    for game in games:
        for in_game in game:
            outcomes = in_game.get('outcomes')
            for outcome in outcomes:
                # Write CSV
                csv_writer.writerow([
                    outcome["participant"],
                    outcome["line"],
                    outcome["oddsDecimal"]
                ])
My issue is that the index "8" for "offerCategories" is hardcoded in my code (as is the "0" for the subcategory), and it changes from day to day at this provider. I'm not familiar with this stuff and can't figure out how to look the category up by the string "name": "Player Combos" (index 8 in this example) instead.
Thanks in advance!
Given the data structure you have, you need to iterate through the 'offerCategories' list looking for the dict whose 'name' key holds the value you want. Your use of '0' as an index seems fine, since there is only a single value to choose from in that case:
import json
from urllib.request import urlopen

jsonurl = urlopen(
    'https://sportsbook.draftkings.com//sites/US-SB/api/v4/eventgroups/88670846/categories/583/subcategories/5001')
games = json.loads(jsonurl.read())
for offerCategory in games['eventGroup']['offerCategories']:
    if offerCategory['name'] == 'Player Combos':
        offers = offerCategory['offerSubcategoryDescriptors'][0]['offerSubcategory']['offers']
        print(offers)
        break
Result:
[[{'providerOfferId': '129944623', 'providerId': 2, 'providerEventId': '27323732', 'providerEventGroupId': '42648', 'label': 'Brandon Clarke Points + Assists + Rebounds...
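As a rough sketch, the name-based lookup can be combined with the CSV-writing loop from the question like this. It keeps the column names and the 'Path' placeholder from the original code, assumes the offers structure (a list of lists of offer dicts, each with an 'outcomes' list) is unchanged, and opens the file in 'w' mode so the os.remove step is no longer needed:

import csv
import json
from urllib.request import urlopen

jsonurl = urlopen('https://sportsbook.draftkings.com//sites/US-SB/api/v4/eventgroups/88670846/categories/583/subcategories/5001')
data = json.loads(jsonurl.read())

# Find the category by name instead of by a hardcoded index
offers = None
for offerCategory in data['eventGroup']['offerCategories']:
    if offerCategory['name'] == 'Player Combos':
        offers = offerCategory['offerSubcategoryDescriptors'][0]['offerSubcategory']['offers']
        break

if offers is not None:
    # 'w' mode truncates the file, so no need to delete it first
    with open('Path', 'w', newline='', encoding='utf-8') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(["participant", "line", "oddsDecimal"])
        for game in offers:
            for in_game in game:
                for outcome in in_game.get('outcomes', []):
                    csv_writer.writerow([
                        outcome["participant"],
                        outcome["line"],
                        outcome["oddsDecimal"]
                    ])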
Related
How to create variables to loop over files and merge into a DataFrame?
I want to create a DataFrame with data for tennis matches of a specific player, 'Lenny Hampel'. For this I downloaded a lot of .json files with data for his matches - all in all there are around 100 files. As each is a json file, I need to convert every single file into a dict to get it into the dataframe in the end. Finally I need to concatenate each file to the dataframe. I could hard-code it, but that seems kind of silly, and I could not find a proper way to iterate through this. Could you help me understand how I could create a loop or something else in order to code it the smart way?

from bs4 import BeautifulSoup
import requests
import json
import bs4 as bs
import urllib.request
from urllib.request import Request, urlopen
import pandas as pd
import pprint

with open('lenny/2016/lenny2016_match (1).json') as json_file:
    lennymatch1 = json.load(json_file)

player = [item for item in lennymatch1["stats"]
          if item["player_fullname"] == "Lenny Hampel"]

with open('lenny/2016/lenny2016_match (2).json') as json_file:
    lennymatch2 = json.load(json_file)

player2 = [item for item in lennymatch2["stats"]
           if item["player_fullname"] == "Lenny Hampel"]

with open('lenny/2016/lenny2016_match (3).json') as json_file:
    lennymatch3 = json.load(json_file)

player33 = [item for item in lennymatch3["stats"]
            if item["player_fullname"] == "Lenny Hampel"]

with open('lenny/2016/lenny2016_match (4).json') as json_file:
    lennymatch4 = json.load(json_file)

player4 = [item for item in lennymatch4["stats"]
           if item["player_fullname"] == "Lenny Hampel"]

tabelle1 = pd.DataFrame.from_dict(player)
tabelle2 = pd.DataFrame.from_dict(player2)
tabelle3 = pd.DataFrame.from_dict(player33)
tabelle4 = pd.DataFrame.from_dict(player4)

tennisstats = [tabelle1, tabelle2, tabelle3, tabelle4]
result = pd.concat(tennisstats)
result
Well, this seems like such basic knowledge that I don't understand why you're asking, but here it is:

# --- before loop ---
tennisstats = []

# --- loop ---
for filename in ["lenny/2016/lenny2016_match (1).json", "lenny/2016/lenny2016_match (2).json"]:
    with open(filename) as json_file:
        lennymatch = json.load(json_file)

    player = [item for item in lennymatch["stats"]
              if item["player_fullname"] == "Lenny Hampel"]

    tabele = pd.DataFrame.from_dict(player)
    tennisstats.append(tabele)

# --- after loop ---
result = pd.concat(tennisstats)

If the filenames are similar and differ only by a number:

for number in range(1, 101):
    filename = f"lenny/2016/lenny2016_match ({number}).json"
    with open(filename) as json_file:

and the rest is the same as in the first version. If all files are in the same folder then maybe you should use os.listdir():

directory = "lenny/2016/"

for name in os.listdir(directory):
    filename = directory + name
    with open(filename) as json_file:

and the rest is the same as in the first version.
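Putting those pieces together, a complete version of the os.listdir() variant might look like the sketch below. The directory path is taken from the question, and the .json filename filter is an assumption to skip unrelated files:

import os
import json
import pandas as pd

directory = "lenny/2016/"
tennisstats = []

for name in os.listdir(directory):
    # Skip anything that is not one of the downloaded match files (assumption)
    if not name.endswith(".json"):
        continue
    filename = os.path.join(directory, name)
    with open(filename) as json_file:
        lennymatch = json.load(json_file)
    player = [item for item in lennymatch["stats"]
              if item["player_fullname"] == "Lenny Hampel"]
    tennisstats.append(pd.DataFrame.from_dict(player))

result = pd.concat(tennisstats, ignore_index=True)
print(result)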
Write data into csv
I am crawling data from Wikipedia and it works so far. I can display it in the terminal, but I can't write it the way I need it into a csv file :-/ The code is pretty long, but I paste it here anyway and hope that somebody can help me.

import csv
import requests
from bs4 import BeautifulSoup

def spider():
    url = 'https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9F-_und_Mittelst%C3%A4dte_in_Deutschland'
    code = requests.get(url).text  # Read source code and make unicode
    soup = BeautifulSoup(code, "lxml")  # create BS object

    table = soup.find(text="Rang").find_parent("table")
    for row in table.find_all("tr")[1:]:
        partial_url = row.find_all('a')[0].attrs['href']
        full_url = "https://de.wikipedia.org" + partial_url
        get_single_item_data(full_url)  # goes into the individual sites

def get_single_item_data(item_url):
    page = requests.get(item_url).text  # Read source code & format with .text to unicode
    soup = BeautifulSoup(page, "lxml")  # create BS object

    def getInfoBoxBasisDaten(s):
        return str(s) == 'Basisdaten' and s.parent.name == 'th'

    basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

    basisdaten_list = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:',
                       'Bevölkerungsdichte:', 'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:',
                       'Gemeindeschlüssel:', 'Stadtgliederung:', 'Adresse', 'Anschrift',
                       'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                       'Oberbürgermeister', 'Oberbürgermeisterin']

    with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:',
                      'Bevölkerungsdichte:', 'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:',
                      'Gemeindeschlüssel:', 'Stadtgliederung:', 'Adresse', 'Anschrift',
                      'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                      'Oberbürgermeister', 'Oberbürgermeisterin']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL, extrasaction='ignore')
        writer.writeheader()

        for i in basisdaten_list:
            wanted = i
            current = basisdaten.parent.parent.nextSibling
            while True:
                if not current.name:
                    current = current.nextSibling
                    continue
                if wanted in current.text:
                    items = current.findAll('td')
                    print(BeautifulSoup.get_text(items[0]))
                    print(BeautifulSoup.get_text(items[1]))
                    writer.writerow({i: BeautifulSoup.get_text(items[1])})
                if '<th ' in str(current):
                    break
                current = current.nextSibling

print(spider())

The output is incorrect in two ways: the cells are not in their right places, and only one city is written - all others are missing. (The original question included screenshots of the actual and the expected CSV output.)
'... only one city is written ...': You call get_single_item_data for each city, and inside this function you open the output file with the same name in the statement

with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:

which overwrites the output file each time you call the function.

'The cells are not in their right places': In the statement

writer.writerow({i: BeautifulSoup.get_text(items[1])})

you write the value of a single variable to its own row. What you need to do instead is build a dictionary of values before you start looking for page values. As you accumulate the values from the page, you put them into the dictionary by field name. Then, after you have found all of the available values, you call writer.writerow once.
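A minimal sketch of that change, reusing the variable names from the question and assuming the writer is created once outside the function (or the file is opened in append mode) so it is not overwritten per city; this fragment would replace the inner loop of get_single_item_data:

# Build one dictionary per city, then write a single row for it
row_data = {}
for i in basisdaten_list:
    wanted = i
    current = basisdaten.parent.parent.nextSibling
    while True:
        if not current.name:
            current = current.nextSibling
            continue
        if wanted in current.text:
            items = current.findAll('td')
            row_data[i] = BeautifulSoup.get_text(items[1])  # accumulate by field name
        if '<th ' in str(current):
            break
        current = current.nextSibling

writer.writerow(row_data)  # one row per city, written once all values are collected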
beautifulsoup to csv: putting paragraph of text into one line
I have a bunch of web text that I'd like to scrape and export to a csv file. The problem is that the text is split over multiple lines on the website and that's how beautifulsoup reads it. When I export to csv, all the text goes into one cell but the cell has multiple lines of text. When I try to read the csv into another program, it interprets the multiple lines in a way that yields a nonsensical dataset. The question is, how do I put all the text into a single line after I pull it with beautifulsoup but before I export to csv? Here's a simple working example demonstrating the problem of multiple lines (in fact, the first few lines in the resulting csv are blank, so at first glance it may look empty):

import csv
import requests
from bs4 import BeautifulSoup

def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, "html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f, delimiter=",")
        abstract = soup.find("article").text
        writer.writerow([abstract])

if __name__ == '__main__':
    main()

UPDATE: there have been some good suggestions, but it's still not working. The following code still produces a csv file with line breaks in a cell:

import csv
import requests
from bs4 import BeautifulSoup

with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, 'lxml')
    find_article = soup.find('article')
    find_2para = find_article.p.find_next_sibling("p")
    find_largetxt = find_article.p.find_next_sibling("p").nextSibling
    writer.writerow([find_2para, find_largetxt])

Here's another attempt based on a different suggestion. This one also ends up producing a line break in the csv file:

import csv
import requests
from bs4 import BeautifulSoup

def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, "html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f, delimiter=",")
        abstract = soup.find("article").get_text(separator=" ", strip=True)
        writer.writerow([abstract])

if __name__ == '__main__':
    main()
Change your abstract = ... line into:

abstract = soup.find("article").get_text(separator=" ", strip=True)

It joins the lines using the separator parameter (in this case it joins the strings with a single space).
The solution that ended up working for me is pretty simple:

abstract = soup.find("article").text.replace("\t", "").replace("\r", "").replace("\n", "")

That gets rid of all line breaks.
r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
soup = BeautifulSoup(r.text, 'lxml')  # I prefer using the lxml parser

find_article = soup.find('article')

# Find the title, in this case: Econometrica: Mar 2017, Volume 85, Issue 2
find_title = find_article.h3

# Find "Search Yield"
find_yield = find_article.h1

# First paragraph, e.g.: DOI: 10.3982/ECTA14057 p. 351-378
find_1para = find_article.p

# Second paragraph, e.g.: David Martinez‐Miera, Rafael Repullo
find_2para = find_article.p.find_next_sibling("p")

# Find the large text area, e.g. 'We present a model of the relationship bet...'
find_largetxt = find_article.p.find_next_sibling("p").nextSibling

I used a variety of methods of getting to the text area you want, just for the purpose of education (you can use .text on each of these to get the text without tags, or you can use Zroq's method). You can then write each one of these into the file by doing, for example:

writer.writerow([find_title.text])
Converting NBA play by play specific .json to .csv
I am trying to build a database containing play by play data for several seasons of NBA games, for my MSc in economics dissertation. Currently I am extracting games from the NBA's API (see example) and splitting each game into a different .json file using this routine (duly adapted for play-by-play purposes), thus yielding .json files like the following (first play example):

{"headers": ["GAME_ID", "EVENTNUM", "EVENTMSGTYPE", "EVENTMSGACTIONTYPE", "PERIOD", "WCTIMESTRING", "PCTIMESTRING", "HOMEDESCRIPTION", "NEUTRALDESCRIPTION", "VISITORDESCRIPTION", "SCORE", "SCOREMARGIN"],
 "rowSet": [["0041400406", 0, 12, 0, 1, "9:11 PM", "12:00", null, null, null, null, null],
            ["0041400406", 1, 10, 0, 1, "9:11 PM", "12:00", "Jump Ball Mozgov vs. Green: Tip to Barnes", null, null, null, null]

I plan on creating a loop to convert all of the generated .json files to .csv, so that I can proceed to econometric analysis in Stata. At the moment, I am stuck on the first step of this procedure: the creation of the JSON to CSV conversion process (I will design the loop afterwards). The code I am trying is:

f = open('pbp_0041400406.json')
data = json.load(f)
f.close()

with open("pbp_0041400406.csv", "w") as file:
    csv_file = csv.writer(file)
    for rowSet in data:
        csv_file.writerow(rowSet)

f.close()

However, the yielded CSV files show awkward results: one line reading h,e,a,d,e,r,s and another reading r,o,w,S,e,t, thus not capturing the headers or rowSet (the plays themselves). I have tried to solve this problem taking into account the contributions on this thread, but I have not been able to do it. Can anybody please provide some insight into solving this problem?

[EDIT] Replacing rowSet with data in the original code also yielded the same results.

Thanks in advance!
try this:

import json
import csv

with open('json.json') as f:
    data = json.load(f)

with open("pbp_0041400406.csv", "w") as fout:
    csv_file = csv.writer(fout, quotechar='"')
    csv_file.writerow(data['headers'])
    for rowSet in data['rowSet']:
        csv_file.writerow(rowSet)

Resulting CSV:

GAME_ID,EVENTNUM,EVENTMSGTYPE,EVENTMSGACTIONTYPE,PERIOD,WCTIMESTRING,PCTIMESTRING,HOMEDESCRIPTION,NEUTRALDESCRIPTION,VISITORDESCRIPTION,SCORE,SCOREMARGIN
0041400406,0,12,0,1,9:11 PM,12:00,,,,,
0041400406,1,10,0,1,9:11 PM,12:00,Jump Ball Mozgov vs. Green: Tip to Barnes,,,,
I think you might have made a mistake regarding the structure of the json input. There are three keys at the top level. resultSets is a list whose first element is a dictionary with the key 'rowSet'. That's what I think you want to iterate over.

f = open('playbyplay', 'r')
data = json.load(f)
f.close()

print(data.keys())

rows = data['resultSets'][0]['rowSet']

with open("pbp_0041400406.csv", "w") as file:
    csv_file = csv.writer(file)
    for rowSet in rows:
        csv_file.writerow(rowSet)

Output data:

0041300402,0,12,0,1,8:13 PM,12:00,,,,,,0,0,,,,,,0,0,,,,,,0,0,,,,,
0041300402,1,10,0,1,8:13 PM,12:00,Jump Ball Duncan vs. Bosh: Tip to Wade,,,,,4,1495,Tim Duncan,1610612759,San Antonio,Spurs,SAS,5,2547,Chris Bosh,1610612748,Miami,Heat,MIA,5,2548,Dwyane Wade,1610612748,Miami,Heat,MIA
0041300402,2,5,2,1,8:13 PM,11:45,Green STEAL (1 STL),,James Lost Ball Turnover (P1.T1),,,5,2544,LeBron James,1610612748,Miami,Heat,MIA,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
0041300402,3,1,1,1,8:14 PM,11:26,Green 18' Jump Shot (2 PTS) (Splitter 1 AST),,,0 - 2,2,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,4,201168,Tiago Splitter,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
0041300402,4,6,2,1,8:14 PM,11:03,Green S.FOUL (P1.T1),,,,,4,201980,Danny Green,1610612759,San Antonio,Spurs,SAS,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,1,0,,,,,
0041300402,5,3,11,1,8:14 PM,11:03,,,Lewis Free Throw 1 of 2 (1 PTS),1 - 2,1,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,0,0,,,,,,0,0,,,,,
0041300402,7,3,12,1,8:14 PM,11:03,,,MISS Lewis Free Throw 2 of 2,,,5,1740,Rashard Lewis,1610612748,Miami,Heat,MIA,0,0,,,,,,0,0,,,,,
0041300402,9,4,0,1,8:15 PM,11:01,Splitter REBOUND (Off:0 Def:1),,,,,4,201168,Tiago Splitter,1610612759,San Antonio,Spurs,SAS,0,0,,,,,,0,0,,,,,
0041300402,10,1,5,1,8:15 PM,10:52,Duncan 1' Layup (2 PTS) (Parker 1 AST),,,1 - 4,3,4,1495,Tim Duncan,1610612759,San Antonio,Spurs,SAS,4,2225,Tony Parker,1610612759,San Antonio,Spurs,SAS,0,0,,,,,
Issue solved! Using @MaxU's code and a previously constructed .CSV containing all game IDs, every NBA game since the 01-02 season can be directly scraped via .JSON and converted to CSV using the following code (credits to @MaxU):

from __future__ import print_function

import json
import csv
import requests

u_a = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"
url_pattern = "http://stats.nba.com/stats/playbyplayv2?GameID=%(GameID)s&StartPeriod=%(StartPeriod)s&EndPeriod=%(EndPeriod)s&tabView=%(tabView)s"

def write_csv(game_id, resultSet):
    fn = resultSet['name'] + '_' + str(game_id) + '.csv'
    # ignore unimportant resultsets ...
    if resultSet['name'] not in ['PlayByPlay', 'PlayBlahBlah']:
        return
    with open(fn, 'w') as fout:
        csv_file = csv.writer(fout, quotechar='"')
        csv_file.writerow(resultSet['headers'])
        for rowSet in resultSet['rowSet']:
            csv_file.writerow(rowSet)

def process_game_id(game_id, tabView='playbyplay',
                    start_period='0', end_period='0'):
    url_parms = {
        'GameID': game_id,
        'StartPeriod': start_period,
        'EndPeriod': end_period,
        'tabView': tabView,
    }

    r = requests.get((url_pattern % url_parms), headers={"USER-AGENT": u_a})

    if r.status_code == requests.codes.ok:
        data = json.loads(r.text)
        for rset in data['resultSets']:
            write_csv(url_parms['GameID'], rset)
    else:
        r.raise_for_status()

if __name__ == '__main__':
    #
    # assuming that the 'games.csv' file contains all Game_IDs ...
    #
    with open('games.csv', 'r') as f:
        csv_reader = csv.reader(f, delimiter=',')
        for row in csv_reader:
            process_game_id(row[<column_num_containing_Game_ID>])

Any further questions on this data, please do PM me. Happy coding everyone!
extracting data from CSV file using a reference
I have a csv file with several hundred organism IDs and a second csv file with several thousand organism IDs and additional characteristics (taxonomic information, abundances per sample, etc.). I am trying to write code that will extract the information from the larger csv using the smaller csv file as a reference. Meaning it will look at both smaller and larger files, and if an ID is in both files, it will extract all the information from the larger file and write that to a new file (basically write the entire row for that ID). So far I have written the following, and while the code does not error out on me, I get a blank file in the end and I don't exactly know why. I am a graduate student who knows some simple coding but I'm still very much a novice. Thank you.

import sys
import csv
import os.path

SparCCnames = open(sys.argv[1], "rU")
OTU_table = open(sys.argv[2], "rU")
new_file = open(sys.argv[3], "w")

Sparcc_OTUs = csv.writer(new_file)

d = csv.DictReader(SparCCnames)
ids = csv.DictReader(OTU_table)

for record in ids:
    idstopull = record["OTUid"]
    if idstopull[0] == "OTUid":
        continue
    if idstopull[0] in d:
        new_id.writerow[idstopull[0]]

SparCCnames.close()
OTU_table.close()
new_file.close()
I'm not sure what you're trying to do in your code but you can try this:

import csv

def csv_to_dict(csv_file_path):
    csv_file = open(csv_file_path, 'rb')
    csv_file.seek(0)
    sniffdialect = csv.Sniffer().sniff(csv_file.read(10000), delimiters='\t,;')
    csv_file.seek(0)
    dict_reader = csv.DictReader(csv_file, dialect=sniffdialect)
    csv_file.seek(0)
    dict_data = []
    for record in dict_reader:
        dict_data.append(record)
    csv_file.close()
    return dict_data

def dict_to_csv(csv_file_path, dict_data):
    csv_file = open(csv_file_path, 'wb')
    writer = csv.writer(csv_file, dialect='excel')
    headers = dict_data[0].keys()
    writer.writerow(headers)  # headers must be the same as dat.keys()
    for dat in dict_data:
        line = []
        for field in headers:
            line.append(dat[field])
        writer.writerow(line)
    csv_file.close()

if __name__ == "__main__":
    big_csv = csv_to_dict('/path/to/big_csv_file.csv')
    small_csv = csv_to_dict('/path/to/small_csv_file.csv')

    output = []
    for s in small_csv:
        for b in big_csv:
            if s['id'] == b['id']:
                output.append(b)

    if output:
        dict_to_csv('/path/to/output.csv', output)
    else:
        print "Nothing."

Hope that will help.
You need to read the data into a data structure; assuming OTUid is unique, you can store it in a dictionary for fast lookup:

with open(sys.argv[1], "rU") as SparCCnames:
    d = csv.DictReader(SparCCnames)
    fieldnames = d.fieldnames
    data = {i['OTUid']: i for i in d}

with open(sys.argv[2], "rU") as OTU_table, open(sys.argv[3], "w") as new_file:
    Sparcc_OTUs = csv.DictWriter(new_file, fieldnames)
    ids = csv.DictReader(OTU_table)
    for record in ids:
        if record['OTUid'] in data:
            Sparcc_OTUs.writerow(data[record['OTUid']])
Thank you everyone for your help. I played with things and consulted with an advisor, and finally got a working script. I am posting it in case it helps someone else in the future. Thanks!

import sys
import csv

input_file = csv.DictReader(open(sys.argv[1], "rU"))  # has all info
ref_list = csv.DictReader(open(sys.argv[2], "rU"))    # reference list
output_file = csv.DictWriter(open(sys.argv[3], "w"), input_file.fieldnames)  # to write output file with headers

output_file.writeheader()  # write headers in output file

white_list = {}  # create empty dictionary
for record in ref_list:  # for every line in my reference list
    white_list[record["Sample_ID"]] = None  # store the IDs into the dictionary as keys

for record in input_file:  # for every line in my input file
    record_id = record["Sample_ID"]  # store the ID into variable record_id
    if record_id in white_list:  # if the ID is in the reference list
        output_file.writerow(record)  # write the entire row into a new file
    else:  # if it is not in my reference list
        continue  # ignore it and continue iterating through the file