I have some variables and a dictionary of strings and imported Google Sheets:

grad_year = '2029'
df_dict = {'grade_1': grade_1_class_2029,
           'grade_2': grade_2_class_2029,
           'grade_3': grade_3_class_2029,
           'grade_4': grade_4_class_2029,
           'grade_5': grade_5_class_2029}

I then turn the Google Sheets into dataframes, naming them dynamically:

for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )

Now I would like to reference them without a pre-created dictionary of their names.
There is still a bunch of stuff I would like to do to the new dataframes, such as deleting blank rows. I have tried:

for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = pd.DataFrame.from_records(
        rows[2:], columns=rows[1]
    )
    vars()["df_" + key + "_class_" + grad_year].replace("", nan_value, inplace=True)
    vars()["df_" + key + "_class_" + grad_year].dropna(
        subset=["Last Name"], inplace=True
    )
and:

for key, val in df_dict.items():
    rows = val.get_all_values()
    vars()["df_" + key + "_class_" + grad_year] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", nan_value, inplace=True)
        .dropna(subset=["Last Name"], inplace=True)
    )

but neither worked.
If you replace nan_value with pd.NA (pandas 1.0.0 and beyond), your first code snippet works fine:
import pandas as pd

grad_year = "2029"
vars()[f"df_{grad_year}"] = pd.DataFrame(
    {
        "class": {
            0: "class1",
            1: "class2",
            2: "class3",
            3: "class4",
        },
        "name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
    }
)
vars()[f"df_{grad_year}"].replace("", pd.NA, inplace=True)
vars()[f"df_{grad_year}"].dropna(subset=["name"], inplace=True)
print(vars()[f"df_{grad_year}"])
# Outputs
    class   name
0  class1   John
1  class2   Jack
3  class4  Butch
In your second code snippet, you also have to set inplace to False instead of True both times for the method chain to work: with inplace=True each call returns None, so there is nothing for the next call in the chain to operate on:
vars()[f"df_{grad_year}"] = (
pd.DataFrame(
{
"class": {
0: "class1",
1: "class2",
2: "class3",
3: "class4",
},
"name": {0: "John", 1: "Jack", 2: "", 3: "Butch"},
}
)
.replace("", pd.NA, inplace=False)
.dropna(subset=["name"], inplace=False)
)
print(vars()[f"df_{grad_year}"])
# Output
    class   name
0  class1   John
1  class2   Jack
3  class4  Butch
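And if the end goal is just to keep working with the frames afterwards, storing them in a dict avoids vars() entirely. A minimal sketch, assuming the df_dict of worksheet objects from the question:

import pandas as pd

dfs = {}  # resulting dataframes, keyed by grade
for key, val in df_dict.items():
    rows = val.get_all_values()
    dfs[key] = (
        pd.DataFrame.from_records(rows[2:], columns=rows[1])
        .replace("", pd.NA)
        .dropna(subset=["Last Name"])
    )

# e.g. dfs["grade_1"] is the cleaned dataframe for grade 1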
I'm trying to flatten this JSON response into a pandas dataframe to export to CSV.
It looks like this:
j = [
    {
        "id": 401281949,
        "teams": [
            {
                "school": "Louisiana Tech",
                "conference": "Conference USA",
                "homeAway": "away",
                "points": 34,
                "stats": [
                    {"category": "rushingTDs", "stat": "1"},
                    {"category": "puntReturnYards", "stat": "24"},
                    {"category": "puntReturnTDs", "stat": "0"},
                    {"category": "puntReturns", "stat": "3"},
                ],
            }
        ],
    }
]
...Many more items in the stats area.
If I run this and flatten to the teams level:
multiple_level_data = pd.json_normalize(j, record_path=['teams'])
I get:
           school      conference homeAway  points                                              stats
0  Louisiana Tech  Conference USA     away      34  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...
How do I flatten it twice so that all of the stats are on their own column in each row?
If I do this:

multiple_level_data = pd.json_normalize(j, record_path=['teams'])
multiple_level_data = multiple_level_data.explode('stats').reset_index(drop=True)
multiple_level_data = multiple_level_data.join(pd.json_normalize(multiple_level_data.pop('stats')))
I end up with multiple rows instead of more columns:
You can try:

# one row per team
df = pd.DataFrame(j).explode("teams")
# expand each team dict into separate columns
df = pd.concat([df, df.pop("teams").apply(pd.Series)], axis=1)
# turn the list of {category, stat} dicts into one {category: stat} dict
df["stats"] = df["stats"].apply(lambda x: {d["category"]: d["stat"] for d in x})
# expand that dict into one column per category
df = pd.concat(
    [
        df,
        df.pop("stats").apply(pd.Series),
    ],
    axis=1,
)
print(df)
Prints:
          id          school      conference homeAway  points rushingTDs puntReturnYards puntReturnTDs puntReturns
0  401281949  Louisiana Tech  Conference USA     away      34          1              24             0           3
Can you try this:

multiple_level_data = pd.json_normalize(j, record_path=['teams'])
multiple_level_data = multiple_level_data.explode('stats').reset_index(drop=True)
multiple_level_data = multiple_level_data.join(pd.json_normalize(multiple_level_data.pop('stats')))
# convert rows to columns: index on the identifying columns, then
# pivot the category/stat pairs into one column per category
multiple_level_data = multiple_level_data.set_index(multiple_level_data.columns[0:4].to_list())
dfx = multiple_level_data.pivot_table(values='stat', columns='category', aggfunc=list).apply(pd.Series.explode).reset_index(drop=True)
multiple_level_data = multiple_level_data.reset_index().drop(['stat', 'category'], axis=1).drop_duplicates().reset_index(drop=True)
multiple_level_data = multiple_level_data.join(dfx)
Output:

           school      conference homeAway  points puntReturnTDs puntReturnYards puntReturns rushingTDs
0  Louisiana Tech  Conference USA     away      34             0              24           3          1
Instead of calling explode() on the output of json_normalize(), you can explicitly pass the paths to the metadata for each column in a single json_normalize() call. For example, ['teams', 'school'] would be one path, ['teams', 'conference'] another, etc. This will create a long dataframe similar to what you already have.
Then you can call pivot() to reshape this output into the correct shape.
# normalize json
df = pd.json_normalize(
    j, record_path=['teams', 'stats'],
    meta=['id', *(['teams', c] for c in ('school', 'conference', 'homeAway', 'points'))]
)

# column name contains 'teams' prefix; remove it
df.columns = [c.split('.')[1] if '.' in c else c for c in df]

# pivot the intermediate result (recent pandas versions require
# keyword arguments for pivot())
df = (
    df.astype({'points': int, 'id': int})
    .pivot(index=['id', 'school', 'conference', 'homeAway', 'points'],
           columns='category', values='stat')
    .reset_index()
)

# remove index name
df.columns.name = None
df
I created this code and I am able to pull the data I want, but not able to arrange it as it should be. I am guessing it has to do with the way I am appending each item with ignore_index, but I can't find my way around it.
This is my code:
import json
import pandas as pd

#load json object
with open("c:\Sample.json", "r", encoding='utf-8') as file:
    data = file.read()
data2 = json.loads(data)
print("Type:", type(data2))

cls = ['Image', 'Email', 'User', 'Members', 'Time']
df = pd.DataFrame(columns=cls)
for d in data2['mydata']:
    for k, v in d.items():
        #print(k)
        if k == 'attachments':
            #print(d.get('attachments')[0]['id'])
            image = (d.get('attachments')[0]['id'])
            df = df.append({'Image': image}, ignore_index=True)
            #df['Message'] = image
        if k == 'author_user_email':
            #print(d.get('author_user_email'))
            email = (d.get('author_user_email'))
            df = df.append({'Email': email}, ignore_index=True)
            #df['Email'] = email
        if k == 'author_user_name':
            #print(d.get('author_user_name'))
            user = (d.get('author_user_name'))
            df = df.append({'User': user}, ignore_index=True)
            #df['User'] = user
        if k == 'room_name':
            #print(d.get('room_name'))
            members = (d.get('room_name'))
            df = df.append({'Members': members}, ignore_index=True)
            #df['Members'] = members
        if k == 'ts_iso':
            #print(d.get('ts_iso'))
            time = (d.get('ts_iso'))
            df = df.append({'Time': time}, ignore_index=True)
            #df['Time'] = time

df
print('Finished getting Data')
df1 = (df.head())
print(df)
print(df.head())
df.to_csv(r'c:\sample.csv', encoding='utf-8')
The code gives me a dataframe where every value ends up on its own row. I am looking to get one row per message, with all five columns filled in.
Data of the file is this:
{
    "mydata": [
        {
            "attachments": [
                {
                    "filename": "image.png",
                    "id": "888888888"
                }
            ],
            "author_user_email": "email#email.com",
            "author_user_id": "91",
            "author_user_name": "Marlone",
            "message": "",
            "room_id": "999",
            "room_members": [
                {
                    "room_member_id": "91",
                    "room_member_name": "Marlone"
                },
                {
                    "room_member_id": "9191",
                    "room_member_name": " +16309438985"
                }
            ],
            "room_name": "SMS [Marlone] [ +7777777777]",
            "room_type": "sms",
            "ts": 55,
            "ts_iso": "2021-06-13T18:17:32.877369+00:00"
        },
        {
            "author_user_email": "email#email.com",
            "author_user_id": "21",
            "author_user_name": "Chris",
            "message": "Hi",
            "room_id": "100",
            "room_members": [
                {
                    "room_member_id": "21",
                    "room_member_name": "Joe"
                },
                {
                    "room_member_id": "21",
                    "room_member_name": "Chris"
                }
            ],
            "room_name": "Direct [Chris] [Joe]",
            "room_type": "direct",
            "ts": 12345678910,
            "ts_iso": "2021-06-14T14:42:07.572479+00:00"
        }
    ]
}
Any help would be appreciated. I am new to python and am learning on my own.
Try:

import json
import pandas as pd

with open("your_data.json", "r") as f_in:
    data = json.load(f_in)

tmp = []
for d in data["mydata"]:
    image = d.get("attachments", [{"id": None}])[0]["id"]
    email = d.get("author_user_email")
    user = d.get("author_user_name")
    members = d.get("room_name")
    time = d.get("ts_iso")
    tmp.append((image, email, user, members, time))

df = pd.DataFrame(tmp, columns=["Image", "Email", "User", "Members", "Time"])
print(df)
Prints:
       Image            Email     User                       Members                              Time
0  888888888  email#email.com  Marlone  SMS [Marlone] [ +7777777777]  2021-06-13T18:17:32.877369+00:00
1       None  email#email.com    Chris          Direct [Chris] [Joe]  2021-06-14T14:42:07.572479+00:00
Although the other answer does work, pandas has a built-in reader for JSON files, pd.read_json: https://pandas.pydata.org/pandas-docs/version/1.1.3/reference/api/pandas.read_json.html
It has the benefit of being able to handle very large datasets via chunking, as well as processing quite a few different formats. The other answer would not be performant for a large dataset.
This would get you started:

import pandas as pd

df = pd.read_json(r"c:\Sample.json")
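For the chunking case, a minimal sketch (my assumption, not part of the original answer): chunksize only works together with lines=True, i.e. line-delimited JSON, so this assumes the data has first been rewritten with one object per line, and process() is a hypothetical placeholder:

import pandas as pd

# assumes a line-delimited copy of the data (one JSON object per line);
# chunksize requires lines=True
reader = pd.read_json(r"c:\Sample.jsonl", lines=True, chunksize=1000)
for chunk in reader:
    process(chunk)  # hypothetical per-chunk handler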
The problem is that append() adds a new row. So you have to use at[] (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html), specifying the index/row. See below. Some print/debug messages were left in, and the paths to the input and output files were changed a little because I'm on Linux.
import json
import pandas as pd
import pprint as pp

#load json object
with open("Sample.json", "r", encoding='utf-8') as file:
    data = file.read()
data2 = json.loads(data)
#pp.pprint(data2)

cls = ['Image', 'Email', 'User', 'Members', 'Time']
df = pd.DataFrame(columns=cls)
pp.pprint(df)

index = 0
for d in data2['mydata']:
    for k, v in d.items():
        #print(k)
        if k == 'attachments':
            #print(d.get('attachments')[0]['id'])
            image = (d.get('attachments')[0]['id'])
            df.at[index, 'Image'] = image
            #df['Message'] = image
        if k == 'author_user_email':
            #print(d.get('author_user_email'))
            email = (d.get('author_user_email'))
            df.at[index, 'Email'] = email
            #df['Email'] = email
        if k == 'author_user_name':
            #print(d.get('author_user_name'))
            user = (d.get('author_user_name'))
            df.at[index, 'User'] = user
            #df['User'] = user
        if k == 'room_name':
            #print(d.get('room_name'))
            members = (d.get('room_name'))
            df.at[index, 'Members'] = members
            #df['Members'] = members
        if k == 'ts_iso':
            #print(d.get('ts_iso'))
            time = (d.get('ts_iso'))
            df.at[index, 'Time'] = time
            #df['Time'] = time
    index += 1

# start indexing from 0 (reset_index returns a new frame, so reassign)
df = df.reset_index(drop=True)
# replace empty cells with 'None'
df.fillna('None', inplace=True)
pp.pprint(df)

print('Finished getting Data')
df1 = (df.head())
print(df)
print(df.head())
df.to_csv(r'sample.csv', encoding='utf-8')
I have an example json data file which has the following structure:
{
    "Header": {
        "Code1": "abc",
        "Code2": "def",
        "Code3": "ghi",
        "Code4": "jkl"
    },
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        },
        "2020-11-26T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        }
    }
}
When I parse this into Databricks with the command:
df = spark.read.json("/FileStore/test.txt")
I get as output 2 objects: Header and TimeSeries. With the TimeSeries I want to be able to flatten the structure so it has the following schema:
Date
UnitPrice
Amount
As the date field is a key, I am currently only able to access it by iterating through the column names and then using them in dot-notation dynamically:
def flatten_json(data):
    columnlist = data.select("TimeSeries.*")
    count = 0
    for name in data.select("TimeSeries.*"):
        df1 = data.select("Header.*").withColumn("Timeseries", lit(columnlist.columns[count])).withColumn("join", lit("a"))
        df2 = data.select("TimeSeries." + columnlist.columns[count] + ".*").withColumn("join", lit("a"))
        if count == 0:
            df3 = df1.join(df2, on=['join'], how="inner")
        else:
            df3 = df3.union(df1.join(df2, on=['join'], how="inner"))
        count = count + 1
    return df3
This is far from ideal. Does anyone know a better method to create the described dataframe?
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.
import pyspark.sql.functions as F

header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")

fieldNames = enumerate(timeseries_df.schema.fieldNames())
# one struct per timestamp field: (name, Amount, UnitPrice)
cols = [
    F.struct(
        F.lit(name).alias("Timeseries"),
        F.col(name).getItem("Amount").alias("Amount"),
        F.col(name).getItem("UnitPrice").alias("UnitPrice"),
    ).alias("ts_" + str(idx))
    for idx, name in fieldNames
]
# merge the structs into an array column and explode it into rows
combined = F.explode(F.array(cols)).alias("comb")
timeseries = timeseries_df.select(combined).select('comb.Timeseries', 'comb.Amount', 'comb.UnitPrice')
result = header_df.crossJoin(timeseries)
result.show(truncate=False)
Output:
+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc |def |ghi |jkl |2020-11-25T03:00:00+00:00|10000 |1000 |
|abc |def |ghi |jkl |2020-11-26T03:00:00+00:00|10000 |1000 |
+-----+-----+-----+-----+-------------------------+------+---------+
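A possible variant (my addition, not from the original answer): if you give spark.read an explicit schema that types TimeSeries as a MapType, a single explode() of the map produces the same rows without building the structs by hand. A sketch, assuming Spark 2.4+ and the file from the question:

from pyspark.sql import functions as F
from pyspark.sql.types import (LongType, MapType, StringType,
                               StructField, StructType)

# read TimeSeries as a map from timestamp string to a struct, so the
# timestamps become exploded rows instead of column names
schema = StructType([
    StructField("Header", StructType([
        StructField(c, StringType()) for c in ("Code1", "Code2", "Code3", "Code4")
    ])),
    StructField("TimeSeries", MapType(StringType(), StructType([
        StructField("UnitPrice", LongType()),
        StructField("Amount", LongType()),
    ]))),
])

df = spark.read.schema(schema).json("/FileStore/test.txt")
result = df.select(
    "Header.*",
    F.explode("TimeSeries").alias("Date", "values"),  # map -> key/value rows
).select("Code1", "Code2", "Code3", "Code4", "Date", "values.UnitPrice", "values.Amount")
result.show(truncate=False)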
I am trying to compare two JSON files and then write another JSON with the column names and the differences flagged as Yes or No. I am using pandas and numpy.
Below are sample files. In reality these JSONs are dynamic, meaning we don't know upfront how many keys there will be.
Input files:
fut.json
[
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]

Curr.json:

[
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]
Below is the code I have tried:

import json

import numpy as np
import pandas as pd

with open(r"c:\csv\fut.json", 'r+') as f:
    data_b = json.load(f)
with open(r"c:\csv\curr.json", 'r+') as f:
    data_a = json.load(f)

df_a = pd.json_normalize(data_a)
df_b = pd.json_normalize(data_b)

_, df_a = df_b.align(df_a, fill_value=np.NaN)
_, df_b = df_a.align(df_b, fill_value=np.NaN)

with open(r"c:\csv\report.json", 'w') as _file:
    for col in df_a.columns:
        df_temp = pd.DataFrame()
        df_temp[col + '_curr'], df_temp[col + '_fut'], df_temp[col + '_diff'] = (
            df_a[col],
            df_b[col],
            np.where((df_a[col] == df_b[col]), 'No', 'Yes'),
        )
        #[df_temp.rename(columns={c:'Missing'}, inplace=True) for c in df_temp.columns if df_temp[c].isnull().all()]
        df_temp.fillna('Missing', inplace=True)
        with pd.option_context('display.max_colwidth', -1):
            _file.write(df_temp.to_json(orient='records'))
Expected output:
[
    {
        "AlarmName_curr": "test",
        "AlarmName_fut": "test",
        "AlarmName_diff": "No"
    },
    {
        "StateValue_curr": "OK",
        "StateValue_fut": "OK",
        "StateValue_diff": "No"
    }
]
Output I get instead: it is not parsable in a JSON validator. The problem is below: the ][ should be replaced by ',' to get valid JSON, and I don't know why it prints like that:
[{"AlarmName_curr":"test","AlarmName_fut":"test","AlarmName_diff":"No"}][{"StateValue_curr":"OK","StateValue_fut":"OK","StateValue_diff":"No"}]
Edit1:
I tried the below as well:
_file.write(df_temp.to_json(orient='records',lines=True))
but now I get JSON that is again not parsable: the ',' is missing, and unless I manually add a ',' between the two dicts and '[' and ']' at the beginning and end, it doesn't parse:
[{"AlarmName_curr":"test","AlarmName_fut":"test","AlarmName_diff":"No"}{"StateValue_curr":"OK","StateValue_fut":"OK","StateValue_diff":"No"}]
Honestly, pandas is overkill for this... however:
- load the dataframes as you did
- concat them as columns; rename the columns
- do the calcs and map the booleans to the desired Yes/No
- to_json() returns a string, so use json.loads() to get it back into a list/dict; filter the columns to get to your required format
import json

import pandas as pd

data_b = [
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]
data_a = [
    {
        "AlarmName": "test",
        "StateValue": "OK"
    }
]

df_a = pd.json_normalize(data_a)
df_b = pd.json_normalize(data_b)

df = pd.concat([df_a, df_b], axis=1)
df.columns = [c + "_curr" for c in df_a.columns] + [c + "_fut" for c in df_a.columns]

# a difference exists when the values are NOT equal
df["AlarmName_diff"] = df["AlarmName_curr"] != df["AlarmName_fut"]
df["StateValue_diff"] = df["StateValue_curr"] != df["StateValue_fut"]
df = df.replace({True: "Yes", False: "No"})

js = json.loads(df.loc[:, [c for c in df.columns if c.startswith("Alarm")]].to_json(orient="records"))
js += json.loads(df.loc[:, [c for c in df.columns if c.startswith("State")]].to_json(orient="records"))
js
Output:

[{'AlarmName_curr': 'test', 'AlarmName_fut': 'test', 'AlarmName_diff': 'No'},
 {'StateValue_curr': 'OK', 'StateValue_fut': 'OK', 'StateValue_diff': 'No'}]
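Since the question says the keys are dynamic, here is a sketch of a generalization (my addition, assuming the df_a and df_b built above) that loops over whatever columns are present instead of hardcoding AlarmName and StateValue:

js = []
for c in df_a.columns:
    part = pd.DataFrame({
        c + "_curr": df_a[c],
        c + "_fut": df_b[c],
        # "Yes" when the values differ, "No" when they match
        c + "_diff": (df_a[c] != df_b[c]).map({True: "Yes", False: "No"}),
    })
    js += json.loads(part.to_json(orient="records"))
print(js)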
Python "Exec" command is not passing local values in exec shell. I thought this should be a simple question but all seem stumped. Here is a repeatable working version of the problem ... it took me a bit to recreate a working problem (my files are much larger than examples shown here, there are up to 10-dfs per loop, often 1800 items per df )
EXEC was only passing "PRODUCT" (as opposed to "PRODUCT.AREA" before I added "["{ind_id}"]" and then also it also shows an error "<string> in <module>".
import pandas as pd

datum_0 = {'Products': ['Stocks', 'Bonds', 'Notes'], 'PRODUCT.AREA': ['10200', '50291', '50988']}
df_0 = pd.DataFrame(datum_0, columns=['Products', 'PRODUCT.AREA'])

datum_1 = {'Products': ['Stocks', 'Bonds', 'Notes'], 'PRODUCT.CODE': ['66', '55', '22']}
df_1 = pd.DataFrame(datum_1, columns=['Products', 'PRODUCT.CODE'])

df_0

summary = {'Prodinfo': ['PRODUCT.AREA', 'PRODUCT.CODE']}
df_list = pd.DataFrame(summary, columns=['Prodinfo'])
df_list

# Create a rankings column for the Prodinfo tables
for rows in df_list.itertuples():
    row = rows.Index
    ind_id = df_list.loc[row]['Prodinfo']
    print(row, ind_id)
    exec(f'df_{row}["rank"] = df_{row}["{ind_id}"].rank(ascending=True) ')
Of course it's this last line that is throwing the exec errors. Any ideas? Do you have a working global or local variable assignment that fixes it? etc... thanks!
I would use a list to keep all the DataFrames:

all_df = []  # list
all_df.append(df_0)
all_df.append(df_1)

and then I would have no need for exec:

for rows in df_list.itertuples():
    row = rows.Index
    ind_id = df_list.loc[row]['Prodinfo']
    print(row, ind_id)
    all_df[row]["rank"] = all_df[row][ind_id].rank(ascending=True)

Alternatively, I would use a dictionary:

all_df = {}  # dict
all_df['PRODUCT.AREA'] = df_0
all_df['PRODUCT.CODE'] = df_1

and then I need neither exec nor df_list:

for key, df in all_df.items():
    df["rank"] = df[key].rank(ascending=True)
Minimal working code with a list:

import pandas as pd

all_df = []  # list

datum = {
    'Products': ['Stocks', 'Bonds', 'Notes'],
    'PRODUCT.AREA': ['10200', '50291', '50988']
}
all_df.append(pd.DataFrame(datum))

datum = {
    'Products': ['Stocks', 'Bonds', 'Notes'],
    'PRODUCT.CODE': ['66', '55', '22']
}
all_df.append(pd.DataFrame(datum))

#print(all_df[0])
#print(all_df[1])

print('--- before ---')
for df in all_df:
    print(df)

summary = {'Prodinfo': ['PRODUCT.AREA', 'PRODUCT.CODE']}
df_list = pd.DataFrame(summary, columns=['Prodinfo'])
#print(df_list)

for rows in df_list.itertuples():
    row = rows.Index
    ind_id = df_list.loc[row]['Prodinfo']
    #print(row, ind_id)
    all_df[row]["rank"] = all_df[row][ind_id].rank(ascending=True)

print('--- after ---')
for df in all_df:
    print(df)
Minimal working code with a dict:

import pandas as pd

all_df = {}  # dict

datum = {
    'Products': ['Stocks', 'Bonds', 'Notes'],
    'PRODUCT.AREA': ['10200', '50291', '50988']
}
all_df['PRODUCT.AREA'] = pd.DataFrame(datum)

datum = {
    'Products': ['Stocks', 'Bonds', 'Notes'],
    'PRODUCT.CODE': ['66', '55', '22']
}
all_df['PRODUCT.CODE'] = pd.DataFrame(datum)

print('--- before ---')
for df in all_df.values():
    print(df)

for key, df in all_df.items():
    df["rank"] = df[key].rank(ascending=True)

print('--- after ---')
for df in all_df.values():
    print(df)
Frankly, for two dataframes I wouldn't waste time on df_list and a for-loop:
import pandas as pd

datum = {
    'Products': ['Stocks', 'Bonds', 'Notes'],
    'PRODUCT.AREA': ['10200', '50291', '50988']
}
df_0 = pd.DataFrame(datum)

datum = {
    'Products': ['Stocks', 'Bonds', 'Notes'],
    'PRODUCT.CODE': ['66', '55', '22']
}
df_1 = pd.DataFrame(datum)

print('--- before ---')
print(df_0)
print(df_1)

df_0["rank"] = df_0['PRODUCT.AREA'].rank(ascending=True)
df_1["rank"] = df_1['PRODUCT.CODE'].rank(ascending=True)

print('--- after ---')
print(df_0)
print(df_1)
And probably I would even put it all in one dataframe:

import pandas as pd

df = pd.DataFrame({
    'Products': ['Stocks', 'Bonds', 'Notes'],
    'PRODUCT.AREA': ['10200', '50291', '50988'],
    'PRODUCT.CODE': ['66', '55', '22'],
})

print('--- before ---')
print(df)

#df["rank PRODUCT.AREA"] = df['PRODUCT.AREA'].rank(ascending=True)
#df["rank PRODUCT.CODE"] = df['PRODUCT.CODE'].rank(ascending=True)
for name in ['PRODUCT.AREA', 'PRODUCT.CODE']:
    df[f"rank {name}"] = df[name].rank(ascending=True)

print('--- after ---')
print(df)
Result:
--- before ---
Products PRODUCT.AREA PRODUCT.CODE
0 Stocks 10200 66
1 Bonds 50291 55
2 Notes 50988 22
--- after ---
Products PRODUCT.AREA PRODUCT.CODE rank PRODUCT.AREA rank PRODUCT.CODE
0 Stocks 10200 66 1.0 3.0
1 Bonds 50291 55 2.0 2.0
2 Notes 50988 22 3.0 1.0
As expected, this was an easy fix. Thanks to the answerers, who gave me much to think about...
Kudos to @holdenweb and his answer at Create multiple dataframes in loop.
dfnew = {}  # CREATE A DICTIONARY!!! - THIS WAS THE TRICK I WAS MISSING
df_ = {}
for rows in df_list.itertuples():
    row = rows.Index
    ind_id = df_list.loc[row]['Prodinfo']
    dfnew[row] = df_[row]  # or pd.read_csv(csv_file) or database_query or ...
    dfnew[row].dropna(inplace=True)
    dfnew[row]["rank"] = dfnew[row][ind_id].rank(ascending=True)

Works well and is very simple...