Optimize performance when creating a DataFrame from a dictionary

Data looks like this:
data = {"date": 20210606,
"B": 11355,
"C": 4,
"ID": "ladygaga"}
I want to convert it to a DataFrame. However, each value needs to be a list, so this is what I do:
data = {key: [item] for key, item in data.items()}
df = pd.DataFrame.from_dict(data)
I want to optimize this code as much as possible, since it is going to run in a production-level API.

You can wrap the dictionary in a list and pass that to the constructor:
df = pd.DataFrame([data])
print(df)
       date      B  C        ID
0  20210606  11355  4  ladygaga
Your own solution should also be faster if you pass the comprehension directly to the constructor:
df = pd.DataFrame({key: [item] for key, item in data.items()})
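For reference, here is a minimal, self-contained sketch of the two constructions mentioned in this answer. It only shows that both produce a one-row frame with the same content; it makes no timing claim, and the names df_from_list and df_from_dict are illustrative.

```python
import pandas as pd

data = {"date": 20210606, "B": 11355, "C": 4, "ID": "ladygaga"}

# Option 1: wrap the dict in a list, so it is treated as a single row (record).
df_from_list = pd.DataFrame([data])

# Option 2: wrap every value in a list, so each key becomes a one-element column.
df_from_dict = pd.DataFrame({key: [item] for key, item in data.items()})

print(df_from_list)
print(df_from_dict)
```

If speed matters for your API, it is worth timing both variants on your own data (for example with timeit) rather than relying on a general rule.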

Related

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (pandas) DataFrame with one row per DATA entry and the columns unit and classification.
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close, but they still put the whole list into a single column instead of making separate columns. record_path sounded right, but I can't get it to work correctly either.
Any help?

It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB: this assumes test is the input list.
Output:
  unit classification
0    A        Energie
1  bar
2  CCM        Volumen
3  CDM        Volumen

Rather than only giving you the code, I'll first explain the idea in detail and then show the exact steps and the final code, so you can adapt it to similar situations.
To create a pandas DataFrame with two columns, you can build a dictionary and pass it to the DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This results in a DataFrame with the columns col1 (values 1, 2) and col2 (values 3, 4).
So to get the DataFrame you specified in your question, the my_data dictionary should look like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)  # requires: import numpy as np
df
(Note the df.index = ... line: it is there because the index of the desired DataFrame in your question starts at 1.)
So all you have to do is extract these values from the data you provided and convert them into exactly the my_data dictionary shown above.
You can do that like this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be:
import numpy as np
import pandas as pd
d = YOUR_DATA  # the list from your question
# This will get the data values like 'bar', 'CCM' and etc
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df  # or print(df)
Note: Of course, you could do all of this in one complex line of code, but to avoid confusion I split it into several lines.
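The two answers above can also be combined into one short, runnable sketch. It assumes the list contains exactly one META row whose data entries are in the same order as the values in each DATA row; the variable name test is taken from the question, everything else is illustrative.

```python
import pandas as pd

test = [
    {'__rowType': 'META', '__type': 'units',
     'data': [{'name': 'units.unit', 'type': 'STRING'},
              {'name': 'units.classification', 'type': 'STRING'}]},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
    {'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']},
]

# Column names come from the first META row: 'units.unit' -> 'unit'.
meta = next(row for row in test if row.get('__rowType') == 'META')
columns = [field['name'].split('.')[-1] for field in meta['data']]

# Each DATA row is already a list of values in column order.
rows = [row['data'] for row in test if row.get('__rowType') == 'DATA']

df = pd.DataFrame(rows, columns=columns)
df.index = range(1, len(df) + 1)  # 1-based index, as in the desired output
print(df)
```

Passing the rows as a list of lists avoids building the intermediate column dictionary, at the cost of relying on the META order matching the DATA order.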

Most efficient way to place a Pandas data frame into a list of dictionaries with a certain format

I have a Pandas data frame that contains one column and an index of timestamps. The code for the data frame looks something like this:
import pandas as pd
indx = pd.date_range(start='12-12-2020 06:00:00', end='12-12-2020 06:02:00', freq='min')
df = pd.DataFrame(data=[0.2, 0.4, 0.6], index=indx, columns=['colname'])
I want to create a list of dictionaries from the rows of df in a certain way. For each row of the data frame, I want to create a dictionary with the keys "Timestamp" and "Value". The value of the "Timestamp" key will be the index of that row. The value of the "Value" key will be the value of the row in the data frame columns. Each of these dictionaries will be appended to a list.
I know I can do this by looping over all of the rows of the data frame like this:
dict_list = []
for i in range(df.shape[0]):
    new_dict = {'Timestamp': df.index[i], 'Value': df.iloc[i, 0]}
    dict_list.append(new_dict)
However, the data frames I'm actually working with may be very large. Is there a faster, more efficient way of doing this other than using a for loop?

You need to rename your column, give your Index a name, and turn the Index into a column. Then use DataFrame.to_dict with the 'records' orientation.
df = df.rename(columns={'colname': 'Value'}).rename_axis(index='Timestamp').reset_index()
dict_list = df.to_dict('records')
#[{'Timestamp': Timestamp('2020-12-12 06:00:00'), 'Value': 0.2},
# {'Timestamp': Timestamp('2020-12-12 06:01:00'), 'Value': 0.4},
# {'Timestamp': Timestamp('2020-12-12 06:02:00'), 'Value': 0.6}]
For larger DataFrames this is somewhat faster than a simple loop, but it still gets slow as the data grows:
import perfplot
import pandas as pd
import numpy as np
def loop(df):
    dict_list = []
    for i in range(df.shape[0]):
        new_dict = {'Timestamp': df.index[i], 'Value': df.iloc[i, 0]}
        dict_list.append(new_dict)
    return dict_list
def df_to_dict(df):
    df = df.rename(columns={'colname': 'Value'}).rename_axis(index='Timestamp').reset_index()
    return df.to_dict('records')
perfplot.show(
    setup=lambda n: pd.DataFrame({'colname': np.random.normal(0, 1, n)},
                                 index=pd.date_range('12-12-2020', freq='min', periods=n)),
    kernels=[
        lambda df: loop(df),
        lambda df: df_to_dict(df),
    ],
    labels=['Loop', 'df.to_dict'],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,
    xlabel='len(df)'
)
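Another option, not benchmarked here, is a list comprehension over zip, which avoids both the per-row iloc lookups and the intermediate renamed frame. This is only a sketch using the example frame from the question; whether it beats to_dict on your data and pandas version is something to measure, for instance by adding it as a third kernel to the perfplot call.

```python
import pandas as pd

indx = pd.date_range(start='2020-12-12 06:00:00', periods=3, freq='min')
df = pd.DataFrame(data=[0.2, 0.4, 0.6], index=indx, columns=['colname'])

# Pair each index entry with its value; tolist() converts the column
# to plain Python floats once, up front.
dict_list = [
    {'Timestamp': ts, 'Value': val}
    for ts, val in zip(df.index, df['colname'].tolist())
]
print(dict_list)
```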

Map two dataframes and perform sum operation using a dictionary

I have a dataframe df
df
   Object        Action  Cost1  Cost2
0     123      renovate  10000   2000
1     456  do something      0     10
2     789        review   1000     50
and a dictionary (called dictionary)
dictionary
{'Object_new': ['Object'],
'Action_new': ['Action'],
'Total_Cost': ['Cost1', 'Cost2']}
Further, I have an (initially empty) dataframe df_new that should contain almost the same information as df, except that the column names need to be different (named according to the dictionary) and some columns from df should be consolidated (e.g. by a sum operation) based on the dictionary.
The result should look like this:
df_new
   Object_new    Action_new  Total_Cost
0         123      renovate       12000
1         456  do something          10
2         789        review        1050
How can I achieve this result using only the dictionary? I tried to use the .map() function but could not figure out how to perform the sum-operation with it.
The code to reproduce both dataframes and the dictionary is attached:
# import libraries
import pandas as pd
### create df
data_df = {'Object': [123, 456, 789],
'Action': ['renovate', 'do something', 'review'],
'Cost1': [10000, 0, 1000],
'Cost2': [2000, 10, 50],
}
df = pd.DataFrame(data_df)
### create dictionary
dictionary = {'Object_new':['Object'],
'Action_new':['Action'],
'Total_Cost' : ['Cost1', 'Cost2']}
### create df_new
# data_df_new = pd.DataFrame(columns=['Object_new', 'Action_new', 'Total_Cost' ])
data_df_new = {'Object_new': [123, 456, 789],
'Action_new': ['renovate', 'do something', 'review'],
'Total_Cost': [12000, 10, 1050],
}
df_new = pd.DataFrame(data_df_new)

A play with groupby:
inv_dict = {x:k for k,v in dictionary.items() for x in v}
df_new = df.groupby(df.columns.map(inv_dict),
axis=1).sum()
Output:
     Action_new  Object_new  Total_Cost
0      renovate         123       12000
1  do something         456          10
2        review         789        1050
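Note that column-wise groupby (axis=1) is deprecated in recent pandas releases (2.1 and later, as far as I know), so the groupby call above may warn or fail depending on your version. A minimal alternative sketch that is driven only by the dictionary is shown below; it assumes a single-element list means "copy this column" and a longer list means "sum these columns row by row".

```python
import pandas as pd

df = pd.DataFrame({
    'Object': [123, 456, 789],
    'Action': ['renovate', 'do something', 'review'],
    'Cost1': [10000, 0, 1000],
    'Cost2': [2000, 10, 50],
})
dictionary = {'Object_new': ['Object'],
              'Action_new': ['Action'],
              'Total_Cost': ['Cost1', 'Cost2']}

# One source column: copy it under the new name.
# Several source columns: sum them row by row.
df_new = pd.DataFrame({
    new: df[cols[0]] if len(cols) == 1 else df[cols].sum(axis=1)
    for new, cols in dictionary.items()
})
print(df_new)
```

Unlike the groupby version, this keeps the columns in the order of the dictionary instead of sorting them alphabetically.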

For a task this simple, I would suggest plain Series addition to solve this problem.
Why? In pandas, every column of a DataFrame is a Series, so two columns can be added element-wise with +.
data_df_new = {
'Object_new': df['Object'],
'Action_new': df['Action'],
'Total_Cost': (df['Cost1'] + df['Cost2']) # Addition of two series
}
df_new = pd.DataFrame(data_df_new)
Running this code builds each new column from the corresponding columns of df, stores them in the data_df_new dictionary, and creates df_new from that dictionary.

You can start from an empty DataFrame, copy the renamed columns into it, add the summed column, and then use to_dict to convert the result to a dictionary.
import pandas as pd
import numpy as np
data_df = {'Object': [123, 456, 789],
'Action': ['renovate', 'do something', 'review'],
'Cost1': [10000, 0, 1000],
'Cost2': [2000, 10, 50],
}
df = pd.DataFrame(data_df)
print(df)
MyEmptydf = pd.DataFrame()
MyEmptydf['Object_new']=df['Object']
MyEmptydf['Action_new']=df['Action']
MyEmptydf['Total_Cost'] = df['Cost1'] + df['Cost2']
print(MyEmptydf)
dictionary = MyEmptydf.to_dict(orient="index")
print(dictionary)
You can run the code here: https://repl.it/repls/RealisticVillainousGlueware

If you are trying to avoid pandas entirely and only use the data_df dictionary, this should solve it:
Object = []
totalcost = []
action = []
for i in range(len(data_df['Object'])):
    Object.append(data_df['Object'][i])
    totalcost.append(data_df['Cost1'][i] + data_df['Cost2'][i])
    action.append(data_df['Action'][i])
dict2 = {'Object': Object, 'Action': action, 'TotalCost': totalcost}

Replace values from pandas dataset with dictionary

I am extracting a column from an Excel document with pandas. After that, I want to replace, in each row of the selected column, every key contained in a list of dictionaries with its corresponding value.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, the column is df['Q10'], and it has more than 10k rows.
Traditionally, if I want to replace a value in df, I use:
df['Q10'].str.replace('val1', 'val1')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    },
    # ... plus many more key/value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in contractions:
        df.str.replace(cont['key'], cont['value'])
Next I call this function passing mydic and my dataframe:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive, because mydic has a lot of items and the whole data set is scanned once for each of them.
Second: it doesn't seem to work :(
Any ideas?

Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with regex=True and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
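replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over the key-replacement pairs one at a time. Note that the result is assigned back to df['Q10']: like str.replace, replace returns a new Series instead of modifying the original, which is why the function in the question appeared to do nothing.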
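One caveat: with regex=True, every key in m is interpreted as a regular expression, so keys containing characters such as "." or "?" would need escaping. The sketch below shows one possible way around that, assuming a single precompiled alternation of the escaped keys; the sample Series stands in for df['Q10'] and its sentences are made up for illustration.

```python
import re
import pandas as pd

mydic = [
    {'key': "wasn't", 'value': 'was not'},
    {'key': "I'm", 'value': 'I am'},
]
m = {d['key']: d['value'] for d in mydic}

# Stand-in for df['Q10']; the sample sentences are made up for illustration.
s = pd.Series(["I'm sure it wasn't me", "nothing to replace"])

# One alternation of all keys, escaped so they are matched literally.
# Longer keys go first, so a key that is a prefix of another cannot shadow it.
keys = sorted(m, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in keys))

# Each match is looked up in m; the Series is scanned once, not once per key.
result = s.str.replace(pattern, lambda match: m[match.group(0)], regex=True)
print(result.tolist())
```

This matches keys anywhere in the text, including inside longer words; if that is not wanted, word boundaries could be added around the alternation.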

How to append to a dictionary of dictionaries

I'm trying to create a dictionary of dictionaries like this:
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
But I am populating it from an SQL table. So I've pulled the SQL table into an SQL object called "result". And then I got the column names like this:
nutCol = [i[0] for i in result.description]
The table has about 40 characteristics, so it is quite long.
I can do this...
foodList = {}
for id, food in enumerate(result):
    addMe = {str(food[1]): {nutCol[id + 2]: food[2], nutCol[id + 3]:
                            food[3] ...}}
    foodList.update(addMe)
But this of course would look horrible and take a while to write. And I'm still working out how I want to build this whole thing so it's possible I'll need to change it a few times...which could get extremely tedious.
Is there a DRY way of doing this?

To make the solution position-independent, you can use dict1.update(dict2), which merges dict2 into dict1.
In our case, since we have a dict of dicts, we can use the inner dictionary (dict['key']) as dict1 and add any additional key-value pairs as dict2.
Here is an example:
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
addthis = {'foo':'bar'}
Suppose you want to add the addthis dict to food["Strawberry"]. You can simply use:
food["Strawberry"].update(addthis)
The result:
>>> food
{'Strawberry': {'Taste': 'Good', 'foo': 'bar', 'Smell': 'Good'},'Broccoli': {'Taste': 'Bad', 'Smell': 'Bad'}}
>>>

Assuming that column 0 is what you wish to use as your key, and you do wish to build a dictionary of dictionaries, then it's:
detail_names = [col[0] for col in result.description[1:]]
foodList = {row[0]: dict(zip(detail_names, row[1:]))
for row in result}
Generalising, if column k is your identity column, then it's:
foodList = {row[k]: {col[0]: row[i]
for i, col in enumerate(result.description) if i != k}
for row in result}
(Here each sub dictionary is all columns other than column k)

addMe = {str(food[1]):dict(zip(nutCol[2:],food[2:]))}
zip takes two (or more) iterables and pairs up their elements; you can then pass the result to dict to turn the pairs into a dictionary.
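To see the zip approach end to end, here is a small sketch that uses an in-memory SQLite table as a stand-in for the real one. The table name, the id column and the sample rows are made up; only nutCol, result and foodList follow the names used in the question.

```python
import sqlite3

# In-memory table standing in for the real one; its name and rows are made up.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE food (id INTEGER, name TEXT, Taste TEXT, Smell TEXT)')
conn.executemany(
    'INSERT INTO food VALUES (?, ?, ?, ?)',
    [(1, 'Broccoli', 'Bad', 'Bad'), (2, 'Strawberry', 'Good', 'Good')],
)

result = conn.execute('SELECT * FROM food')
nutCol = [col[0] for col in result.description]

# Column 1 is the food name; columns 2 onward become the inner dictionary.
foodList = {str(row[1]): dict(zip(nutCol[2:], row[2:])) for row in result}
conn.close()
print(foodList)
```

Because the inner dictionaries are built from the cursor description, adding or removing columns in the table needs no change to this code.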
