I've got a little issue while coding a script that takes a CSV string and is supposed to select a column name and value based on the input. The CSV string contains names of NBA players, their universities, etc. When the input is "name" && "Andre Brown", it should search for those values in the given CSV string. I have rough code laid out, but I am unsure how to implement the where method. Any ideas?
import csv
import pandas as pd
import io
class MySelectQuery:
    def __init__(self, table, columns, where):
        self.table = table
        self.columns = columns
        self.where = where

    def __str__(self):
        return f"SELECT {self.columns} FROM {self.table} WHERE {self.where}"

csvString = "name,year_start,year_end,position,height,weight,birth_date,college\nAlaa Abdelnaby,1991,1995,F-C,6-10,240,'June 24, 1968',Duke University\nZaid Abdul-Aziz,1969,1978,C-F,6-9,235,'April 7, 1946',Iowa State University\nKareem Abdul-Jabbar,1970,1989,C,7-2,225,'April 16, 1947','University of California, Los Angeles'\nMahmoud Abdul-Rauf,1991,2001,G,6-1,162,'March 9, 1969',Louisiana State University\n"
df = pd.read_csv(io.StringIO(csvString), error_bad_lines=False)
where = "name = 'Alaa Abdelnaby' AND year_start = 1991"
df = df.query(where)
print(df)
The CSV string is transformed into a pandas DataFrame, which should then find the values based on the input. However, I get the error "name 'where' not defined". I believe everything up to the df = ... part is correct; now I need help implementing the where method. (I've seen one other solution on SO but wasn't able to understand or figure it out.)
# importing pandas
import pandas as pd
record = {
    'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka', 'Priya', 'Shaurya'],
    'Age': [21, 19, 20, 18, 17, 21],
    'Stream': ['Math', 'Commerce', 'Science', 'Math', 'Math', 'Science'],
    'Percentage': [88, 92, 95, 70, 65, 78]}
# create a dataframe
dataframe = pd.DataFrame(record, columns = ['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", dataframe)
options = ['Math', 'Science']
# selecting rows based on condition
rslt_df = dataframe[(dataframe['Age'] == 21) &
                    dataframe['Stream'].isin(options)]
print('\nResult dataframe :\n', rslt_df)
Output:
Given Dataframe :
        Name  Age    Stream  Percentage
0      Ankit   21      Math          88
1       Amit   19  Commerce          92
2  Aishwarya   20   Science          95
3   Priyanka   18      Math          70
4      Priya   17      Math          65
5    Shaurya   21   Science          78

Result dataframe :
      Name  Age   Stream  Percentage
0    Ankit   21     Math          88
5  Shaurya   21  Science          78
Source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/
Sometimes Googling does the trick ;)
You need the double = there, and since df.query() parses Python syntax, use lowercase and rather than SQL's AND. So it should be:
where = "name == 'Alaa Abdelnaby' and year_start == 1991"
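If you then want to build that where string from the user's input (the "name" && "Andre Brown" pair from the question), here is a minimal sketch; build_where is a hypothetical helper, not part of pandas:

def build_where(column, value):
    # query() evaluates Python syntax, so string values need quotes; !r adds them.
    return f"{column} == {value!r}"

where = build_where("name", "Alaa Abdelnaby")  # -> "name == 'Alaa Abdelnaby'"
print(df.query(where))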
First time posting here, but I'm having trouble with some code that I'm using to pull fantasy football data from ESPN. I pulled this from Steven Morse's blog (https://stmorse.github.io/journal/espn-fantasy-v3.html) and it appears to work EXCEPT for one error that I'm getting. The error is:
File "<ipython-input-65-56a5896c1c3c>", line 3, in <listcomp>
game['away']['teamId'], game['away']['totalPoints'],
KeyError: 'away'
I've looked in the dictionary and found that 'away' is in there. What I can't figure out is why 'home' works but not 'away'. Here is the code I'm using. Any help is appreciated:
import requests
import pandas as pd
url = 'https://fantasy.espn.com/apis/v3/games/ffl/seasons/2020/segments/0/leagues/721579?view=mMatchupScore'
r = requests.get(url,
                 cookies={"swid": "{1E653FDE-DA4A-4CC6-A53F-DEDA4A6CC663}",
                          "espn_s2": "AECpfE9Zsvwwsl7N%2BRt%2BAPhSAKmSs%2F2ZmQVuHJeKG8LGgLBDfRl0j88CvzRFsrRjLmjzASAdIUA9CyKpQJYBfn6avgXoPHJgDiCqfDPspruYqHNENjoeGuGfVqtPewVJGv3rBJPFMp1ugWiqlEzKiT9IXTFAIx3V%2Fp2GBuYjid2N%2FFcSUlRlr9idIL66tz2UevuH4F%2FP6ytdM7ABRCTEnrGXoqvbBPCVbtt6%2Fu69uBs6ut08ApLRQc4mffSYCONOqW1BKbAMPPMbwgCn1d5Ruubl"})
d = r.json()
df = [[
    game['matchupPeriodId'],
    game['away']['teamId'], game['away']['totalPoints'],
    game['home']['teamId'], game['home']['totalPoints']
] for game in d['schedule']]
df = pd.DataFrame(df, columns=['Week', 'Team1', 'Score1', 'Team2', 'Score2'])
df['Type'] = ['Regular' if w<=14 else 'Playoff' for w in df['Week']]
Seems like some of the games in the schedule don't have an away team:
{'home': {'adjustment': 0.0,
          'cumulativeScore': {'losses': 0, 'statBySlot': None, 'ties': 0, 'wins': 0},
          'pointsByScoringPeriod': {'14': 102.7},
          'teamId': 1,
          'tiebreak': 0.0,
          'totalPoints': 102.7},
 'id': 78,
 'matchupPeriodId': 14,
 'playoffTierType': 'WINNERS_BRACKET',
 'winner': 'UNDECIDED'}
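If you just want the original list comprehension to work, a minimal fix (a sketch reusing d from the question) is to skip the matchups that lack an 'away' entry:

rows = [[
    game['matchupPeriodId'],
    game['away']['teamId'], game['away']['totalPoints'],
    game['home']['teamId'], game['home']['totalPoints']
] for game in d['schedule'] if 'away' in game]  # bye/undecided games have no away team
df = pd.DataFrame(rows, columns=['Week', 'Team1', 'Score1', 'Team2', 'Score2'])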
For nested json data like this, it's often easier to use pandas.json_normalize which flattens the data structure and gives you a data frame with lots of columns with names like home.cumulativeScore.losses etc.
df = pd.json_normalize(r.json()['schedule'])
Then you can reshape the dataframe by dropping columns you don't care about and so on.
df = pd.json_normalize(r.json()['schedule'])
column_names = {
    'matchupPeriodId': 'Week',
    'away.teamId': 'Team1',
    'away.totalPoints': 'Score1',
    'home.teamId': 'Team2',
    'home.totalPoints': 'Score2',
}
df = df.reindex(columns=column_names).rename(columns=column_names)
df['Type'] = ['Regular' if w<=14 else 'Playoff' for w in df['Week']]
For the games where there's no away team, pandas will populate those columns with NaN values.
df[df.Team1.isna()]
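If you'd rather exclude those games entirely, a short follow-up on the reshaped frame above does it:

df = df.dropna(subset=['Team1'])  # keep only matchups that have an away team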
I have two pickle files, each of which is a dictionary of satellite images with their date, PM value, and location of sensor. They have the following setup:
{'Image': array([[[...]]], dtype=uint8),
'Date' : '2018-01-01',
'PM' : 100,
'Location' : 'Los Angeles'}
This is ONE entry in the dictionary; i.e., if my pickle is called data, then data[10] would be the entry for Jan 09, 2018, such as:
{'Image': array([[[...]]], dtype=uint8),
'Date' : '2018-01-09',
'PM' : 19,
'Location' : 'Los Angeles'}
I want to combine two pickle files, with lengths 1079 and 1023 respectively, into ONE pickle file grouped by the location. So after pairing, when I call data[0], I get the 1079 inputs for the first sensor station, and data[1] will be the 1023 inputs for the second sensor station. The second sensor station is located in San Diego, and has the exact same format as the first station in terms of its dictionary.
Here's what I have:
I read in both pickle files using the following code (I ran this code twice with different pickle files, so now I have two lists):
import pickle as pkl

LA_data = []
with open('/work/srs108/LA/data.pkl', 'rb') as f:
    while True:
        try:
            LA_data.append(pkl.load(f))
        except EOFError:
            break

SD_data = []
with open('/work/srs108/SD/data.pkl', 'rb') as f:
    while True:
        try:
            SD_data.append(pkl.load(f))
        except EOFError:
            break
But now I have two lists, and I'm struggling with the following line to convert them into a pandas DataFrame so I can use the groupby function. I don't know what dimensions to give, because my data are lists of dictionaries.
df = pd.DataFrame(np.array(data).reshape(??,??), columns = ['Image', 'Date', 'PM', 'Location'])
Any tips on how to get my dictionaries into a dataframe in order to group them together? Is this the right way I should go about this?
Could you clarify what "data" is as a variable? In case it is like "city_list" in this small rebuild, this might work:
import numpy as np
import pandas as pd
arr1 = np.array([[[1], [2]], [[3], [4]]])
dict1 = {'Image': arr1,
         'Date': '2018-01-01',
         'PM': 100,
         'Location': 'Los Angeles'}
LA_data = [dict1]

arr2 = np.array([[[11], [12]], [[13], [14]]])
dict2 = {'Image': arr2,
         'Date': '2018-01-09',
         'PM': 19,
         'Location': 'San Diego'}
SD_data = [dict2]

city_list = LA_data + SD_data
df = pd.DataFrame(city_list)
print(df.head())
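From there, grouping by the Location column is straightforward; a minimal sketch on the df built above, collecting one list of records per sensor station:

grouped = {loc: g.to_dict('records') for loc, g in df.groupby('Location')}
print(len(grouped['Los Angeles']))  # number of entries for the LA station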
I have raw data in CSV format which looks like this:
product-name brand-name rating
["Whole Wheat"] ["bb Royal"] ["4.1"]
Expected output:
product-name brand-name rating
Whole Wheat bb Royal 4.1
I want this to affect every entry in my dataset; I have 10,000 rows of data. How can I do this using pandas? Can it be done with regular expressions? I'm not sure how to approach it.
Thank you.
Edit 1:
My data looks something like this:
df = {
    'product-name': [["'Whole Wheat'"], ["'Milk'"]],
    'brand-name': [["'bb Royal'"], ["'XYZ'"]],
    'rating': [["'4.1'"], ["'4.0'"]]
}
df_p = pd.DataFrame(data=df)
It outputs like this: ["bb Royal"]
PS: Apologies for my programming. I am quite new to programming and also to this community. I really appreciate your help here :)
IIUC, select the first value of each list:
df = df.apply(lambda x: x.str[0])
Or if values are strings:
df = df.replace(r'[\[\]]', '', regex=True)
You can use the explode function
df = df.apply(pd.Series.explode)
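If the cells turn out to be strings such as '["Whole Wheat"]' rather than actual lists, another option (a sketch, assuming each cell holds a one-element list) is to parse them with ast.literal_eval first:

import ast

# Parse each stringified list and keep its first element.
df = df.applymap(lambda s: ast.literal_eval(s)[0] if isinstance(s, str) else s)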
I am using this database: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&localOrForeign%5B%5D=Foreign&start_year=1992&end_year=2019&group_by=year
I have preprocessed it into this csv (showing only 2 lines of 159):
year,combinedStatus,fullName,sortName,primaryNationality,secondaryNationality,tertiaryNationality,gender,photoUrl,photoCredit,type,lastStatus,typeOfDeath,status,employedAs,organizations,jobs,coverage,mediums,country,location,region,state,locality,province,localOrForeign,sourcesOfFire,motiveConfirmed,accountabilityCrossfire,accountabilityAssignment,impunityMurder,tortured,captive,threatened,charges,motive,lengthOfSentence,healthProblems,impCountry,entry,sentenceDate,sentence,locationImprisoned
1994,Confirmed,Abdelkader Hireche,,,,,Male,,,Journalist,,Murder,Killed,Staff,Algerian Television (ENTV),Broadcast Reporter,Politics,Television,Algeria,Algiers,,,Algiers,,Foreign,,Confirmed,,,Partial Impunity,No,No,No,,,,,,,,,
2014,Confirmed,Ahmed Hasan Ahmed,,,,,Male,,,Journalist,,Dangerous Assignment,Killed,Staff,Xinhua News Agency,"Camera Operator,Photographer","Human Rights,Politics,War",Internet,Syria,Damascus,,,Damascus,,Foreign,,Confirmed,,,,,,,,,,,,,,,
And I want to make this type of JSON out of it:
"Afghanistan": {"year": 2001, "fullName": "Volker Handloik", "gender": "Male", "typeOfDeath": "Crossfire", "employedAs": "Freelance", "organizations": "freelance reporter", "jobs": "Print Reporter", "coverage": "War", "mediums": "Print", "photoUrl": NaN}, "Somalia": {"year": 1994, "fullName": "Pierre Anceaux", "gender": "Male", "typeOfDeath": "Murder", "employedAs": "Freelance", "organizations": "freelance", "jobs": "Broadcast Reporter", "coverage": "Human Rights", "mediums": "Television", "photoUrl": NaN}
The problem is that Afghanistan (as you can see in the link) has had many journalist deaths. I want to list all these killings under the Index 'Afghanistan'. However, as I currently do it, only the last case (Volker Handloik) in the csv file shows up. How can I get it so every case shows up?
This is my code at the moment:
import pandas as pd
import pprint as pp
import json
# list with stand-ins for empty cells
missing_values = ["n/a", "na", "unknown", "-", ""]
# set missing values to NaN
df = pd.read_csv("data_journalists.csv", na_values = missing_values, skipinitialspace = True, error_bad_lines=False)
# columns
columns_keep = ['year', 'fullName', 'gender', 'typeOfDeath', 'employedAs', 'organizations', 'jobs', 'coverage', 'mediums', 'country', 'photoUrl']
small_df = df[columns_keep]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(small_df)
# create dict with country-column as index
df_dict = small_df.set_index('country').T.to_dict('dict')
print(df_dict)
# make json file from the dict
with open('result.json', 'w') as fp:
    json.dump(df_dict, fp)
# use pretty print to see if dict matches the json example in the exercise
pp.pprint(df_dict)
I want to include all of these names (and more) in the JSON under the index Afghanistan. I think I will need a list of objects attached to the index of a country, so that every country can show all the cases of journalists' deaths instead of only one (each being overwritten by the next in the csv). I hope this is clear enough.
I'll keep your code until the definition of small_df.
After that, we perform a groupby on the 'country' column and use pd.to_json on it:
country_series = small_df.groupby('country').apply(lambda r : r.drop(['country'], axis=1).to_json())
country_series is a pd.Series with the countries as index.
After that, we create a nested dictionary, so that we have a valid json object:
fullDict = {}
for ind, a in country_series.items():
    b = json.loads(a)
    c = b['fullName']
    smallDict = {}
    for index, journalist in c.items():
        smallDict[journalist] = {}
        for i in b.keys():
            smallDict[journalist][i] = b[i][index]
    fullDict[ind] = smallDict
The nomenclature in my part of the code is pretty bad, but I tried to write all the steps explicitly so that things should be clear.
Finally, we write the results to a file:
with open('result.json', 'w') as f:
    json.dump(fullDict, f)
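As an aside, if you'd rather end up with a list of cases under each country (as described in the question), a more compact sketch, reusing small_df from above, would be:

# One list of journalist records per country, then dump straight to JSON.
fullDict = {
    country: group.drop(columns='country').to_dict('records')
    for country, group in small_df.groupby('country')
}
with open('result.json', 'w') as f:
    json.dump(fullDict, f)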