Related
How can I convert a list of dictionaries into a DataFrame? Given:
[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
I want to turn the above into a DataFrame:
month points points_h1 time year
0 NaN 50 NaN 5:00 2010
1 february 25 NaN 6:00 NaN
2 january 90 NaN 9:00 NaN
3 june NaN 20 NaN NaN
Note: Order of the columns does not matter.
If ds is a list of dicts:
df = pd.DataFrame(ds)
Note: this does not work with nested data.
How do I convert a list of dictionaries to a pandas DataFrame?
The other answers are correct, but not much has been explained in terms of advantages and limitations of these methods. The aim of this post will be to show examples of these methods under different situations, discuss when to use (and when not to use), and suggest alternatives.
DataFrame(), DataFrame.from_records(), and .from_dict()
Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.
Consider a very contrived example.
np.random.seed(0)
data = pd.DataFrame(
np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('r')
print(data)
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
This list consists of "records" with every keys present. This is the simplest case you could encounter.
# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Word on Dictionary Orientations: orient='index'/'columns'
Before continuing, it is important to make the distinction between the different types of dictionary orientations, and support with pandas. There are two primary types: "columns", and "index".
orient='columns'
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.
For example, data above is in the "columns" orient.
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.
orient='index'
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.
data_i ={
0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}
pd.DataFrame.from_dict(data_i, orient='index')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
This case is not considered in the OP, but is still useful to know.
Setting Custom Index
If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.
pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])
A B C D
a 5 0 3 3
b 7 9 3 5
c 2 4 7 6
This is not supported by pd.DataFrame.from_dict.
Dealing with Missing Keys/Columns
All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,
data2 = [
{'A': 5, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'F': 5},
{'B': 4, 'C': 7, 'E': 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)
A B C D E F
0 5.0 NaN 3.0 3.0 NaN NaN
1 7.0 9.0 NaN NaN NaN 5.0
2 NaN 4.0 7.0 NaN 6.0 NaN
Reading Subset of Columns
"What if I don't want to read in every single column"? You can easily specify this using the columns=... parameter.
For example, from the example dictionary of data2 above, if you wanted to read only columns "A', 'D', and 'F', you can do so by passing a list:
pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])
A D F
0 5.0 3.0 NaN
1 7.0 NaN 5.0
2 NaN NaN NaN
This is not supported by pd.DataFrame.from_dict with the default orient "columns".
pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
ValueError: cannot use columns parameter with orient='columns'
Reading Subset of Rows
Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:
rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
if i not in rows_to_select:
del data2[i]
pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
The Panacea: json_normalize for Nested Data
A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.
pd.json_normalize(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
pd.json_normalize(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.
data_nested = [
{'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}],
'info': {'governor': 'Rick Scott'},
'shortname': 'FL',
'state': 'Florida'},
{'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}],
'info': {'governor': 'John Kasich'},
'shortname': 'OH',
'state': 'Ohio'}
]
pd.json_normalize(data_nested,
record_path='counties',
meta=['state', 'shortname', ['info', 'governor']])
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
For more information on the meta and record_path arguments, check out the documentation.
Summarising
Here's a table of all the methods discussed above, along with supported features/functionality.
* Use orient='columns' and then transpose to get the same effect as orient='index'.
In pandas 16.2, I had to do pd.DataFrame.from_records(d) to get this to work.
You can also use pd.DataFrame.from_dict(d) as :
In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010},
...: {'points': 25, 'time': '6:00', 'month': "february"},
...: {'points':90, 'time': '9:00', 'month': 'january'},
...: {'points_h1':20, 'month': 'june'}]
In [12]: pd.DataFrame.from_dict(d)
Out[12]:
month points points_h1 time year
0 NaN 50.0 NaN 5:00 2010.0
1 february 25.0 NaN 6:00 NaN
2 january 90.0 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN
Pyhton3:
Most of the solutions listed previously work. However, there are instances when row_number of the dataframe is not required and the each row (record) has to be written individually.
The following method is useful in that case.
import csv
my file= 'C:\Users\John\Desktop\export_dataframe.csv'
records_to_save = data2 #used as in the thread.
colnames = list[records_to_save[0].keys()]
# remember colnames is a list of all keys. All values are written corresponding
# to the keys and "None" is specified in case of missing value
with open(myfile, 'w', newline="",encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(colnames)
for d in records_to_save:
writer.writerow([d.get(r, "None") for r in colnames])
The easiest way I have found to do it is like this:
dict_count = len(dict_list)
df = pd.DataFrame(dict_list[0], index=[0])
for i in range(1,dict_count-1):
df = df.append(dict_list[i], ignore_index=True)
I have the following list of dicts with datetime keys and int values:
list = [{datetime.date(2022, 2, 10): 7}, {datetime.date(2022, 2, 11): 1}, {datetime.date(2022, 2, 11): 1}]
I had a problem to convert it to Dataframe with the methods above as it created Dataframe with columns with dates...
My solution:
df = pd.DataFrame()
for i in list:
temp_df = pd.DataFrame.from_dict(i, orient='index')
df = df.append(temp_df)
Given the following dataframe and list of dictionaries:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict([
{'id': '912SAFD', 'key': 3, 'list_index': [0]},
{'id': '812SAFD', 'key': 4, 'list_index': [0, 1]},
{'id': '712SAFD', 'key': 5, 'list_index': [2]}])
designs = [{'designs': [{'color_id': 609090, 'value': 'b', 'lang': ''}]},
{'designs': [{'color_id': 609091, 'value': 'c', 'lang': ''}]},
{'designs': [{'color_id': 609092, 'value': 'd', 'lang': 'fr'}]}]
Dataframe output:
id key list_index
0 912SAFD 3 [0]
1 812SAFD 4 [0, 1]
2 712SAFD 5 [2]
Without using explicit loops (if possible), is it feasible to iterate through the lists in 'list_index' for each row, extract the values and use them to access the list of dictionaries by index and then create new columns based on the values in the dictionaries?
Here is an example of the expected result:
id key list_index 609090 609091 609092 609092_lang
0 912SAFD 3 [0] b NaN NaN NaN
1 812SAFD 4 [0, 1] b c NaN NaN
2 712SAFD 5 [2] NaN NaN d fr
If 'lang' is not empty, it should be added as a column to the dataframe by using the color_id value combined with an underscore and its own name as the column name. For example: 609092_lang.
Any help would be much appreciated.
# this is to get the inner dictionary and make a tidy dataframe from it
designs = [info for design in designs for info in design['designs']]
df_designs = pd.DataFrame(designs)
df_designs['lang_code'] = 'lang_' + df_designs['color_id'].astype(str)
df_designs['lang'] = df_designs.lang.replace('', np.NaN)
df = df.explode('list_index').merge(df_designs, left_on='list_index', right_index=True)
df_color = df.pivot(index=['id', 'key'], columns=['color_id'], values='value')
df_lang = df.pivot(index=['id', 'key'], columns=['lang_code'], values='lang')
df = df_color.join(df_lang).reset_index().dropna(how='all' , axis=1)
print(df)
output :
>>>
id key 609090 609091 609092 lang_609092
0 712SAFD 5 NaN NaN d fr
1 812SAFD 4 b c NaN NaN
2 912SAFD 3 b NaN NaN NaN
alternatively, if you could work with multiIndex df , instead of naming them, that would be simpler :
# this is to get the inner dictionary and make a tidy dataframe from it
designs = [info for design in designs for info in design['designs']]
df_designs = pd.DataFrame(designs)
df_designs['lang'] = df_designs.lang.replace('',np.NaN)
df = df.explode('list_index').merge(df_designs, left_on='list_index', right_index=True)
df = df.pivot(index=['id', 'key'], columns=['color_id'], values=['value','lang']).dropna(how='all' , axis=1).reset_index()
print(df)
output:
>>>
id key value lang
color_id 609090 609091 609092 609092
0 712SAFD 5 NaN NaN d fr
1 812SAFD 4 b c NaN NaN
2 912SAFD 3 b NaN NaN NaN
How can I convert a list of dictionaries into a DataFrame? Given:
[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
I want to turn the above into a DataFrame:
month points points_h1 time year
0 NaN 50 NaN 5:00 2010
1 february 25 NaN 6:00 NaN
2 january 90 NaN 9:00 NaN
3 june NaN 20 NaN NaN
Note: Order of the columns does not matter.
If ds is a list of dicts:
df = pd.DataFrame(ds)
Note: this does not work with nested data.
How do I convert a list of dictionaries to a pandas DataFrame?
The other answers are correct, but not much has been explained in terms of advantages and limitations of these methods. The aim of this post will be to show examples of these methods under different situations, discuss when to use (and when not to use), and suggest alternatives.
DataFrame(), DataFrame.from_records(), and .from_dict()
Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.
Consider a very contrived example.
np.random.seed(0)
data = pd.DataFrame(
np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('r')
print(data)
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
This list consists of "records" with every keys present. This is the simplest case you could encounter.
# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Word on Dictionary Orientations: orient='index'/'columns'
Before continuing, it is important to make the distinction between the different types of dictionary orientations, and support with pandas. There are two primary types: "columns", and "index".
orient='columns'
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.
For example, data above is in the "columns" orient.
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.
orient='index'
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.
data_i ={
0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}
pd.DataFrame.from_dict(data_i, orient='index')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
This case is not considered in the OP, but is still useful to know.
Setting Custom Index
If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.
pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])
A B C D
a 5 0 3 3
b 7 9 3 5
c 2 4 7 6
This is not supported by pd.DataFrame.from_dict.
Dealing with Missing Keys/Columns
All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,
data2 = [
{'A': 5, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'F': 5},
{'B': 4, 'C': 7, 'E': 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)
A B C D E F
0 5.0 NaN 3.0 3.0 NaN NaN
1 7.0 9.0 NaN NaN NaN 5.0
2 NaN 4.0 7.0 NaN 6.0 NaN
Reading Subset of Columns
"What if I don't want to read in every single column"? You can easily specify this using the columns=... parameter.
For example, from the example dictionary of data2 above, if you wanted to read only columns "A', 'D', and 'F', you can do so by passing a list:
pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])
A D F
0 5.0 3.0 NaN
1 7.0 NaN 5.0
2 NaN NaN NaN
This is not supported by pd.DataFrame.from_dict with the default orient "columns".
pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
ValueError: cannot use columns parameter with orient='columns'
Reading Subset of Rows
Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:
rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
if i not in rows_to_select:
del data2[i]
pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
The Panacea: json_normalize for Nested Data
A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.
pd.json_normalize(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
pd.json_normalize(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.
data_nested = [
{'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}],
'info': {'governor': 'Rick Scott'},
'shortname': 'FL',
'state': 'Florida'},
{'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}],
'info': {'governor': 'John Kasich'},
'shortname': 'OH',
'state': 'Ohio'}
]
pd.json_normalize(data_nested,
record_path='counties',
meta=['state', 'shortname', ['info', 'governor']])
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
For more information on the meta and record_path arguments, check out the documentation.
Summarising
Here's a table of all the methods discussed above, along with supported features/functionality.
* Use orient='columns' and then transpose to get the same effect as orient='index'.
In pandas 16.2, I had to do pd.DataFrame.from_records(d) to get this to work.
You can also use pd.DataFrame.from_dict(d) as :
In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010},
...: {'points': 25, 'time': '6:00', 'month': "february"},
...: {'points':90, 'time': '9:00', 'month': 'january'},
...: {'points_h1':20, 'month': 'june'}]
In [12]: pd.DataFrame.from_dict(d)
Out[12]:
month points points_h1 time year
0 NaN 50.0 NaN 5:00 2010.0
1 february 25.0 NaN 6:00 NaN
2 january 90.0 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN
Pyhton3:
Most of the solutions listed previously work. However, there are instances when row_number of the dataframe is not required and the each row (record) has to be written individually.
The following method is useful in that case.
import csv
my file= 'C:\Users\John\Desktop\export_dataframe.csv'
records_to_save = data2 #used as in the thread.
colnames = list[records_to_save[0].keys()]
# remember colnames is a list of all keys. All values are written corresponding
# to the keys and "None" is specified in case of missing value
with open(myfile, 'w', newline="",encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(colnames)
for d in records_to_save:
writer.writerow([d.get(r, "None") for r in colnames])
The easiest way I have found to do it is like this:
dict_count = len(dict_list)
df = pd.DataFrame(dict_list[0], index=[0])
for i in range(1,dict_count-1):
df = df.append(dict_list[i], ignore_index=True)
I have the following list of dicts with datetime keys and int values:
list = [{datetime.date(2022, 2, 10): 7}, {datetime.date(2022, 2, 11): 1}, {datetime.date(2022, 2, 11): 1}]
I had a problem to convert it to Dataframe with the methods above as it created Dataframe with columns with dates...
My solution:
df = pd.DataFrame()
for i in list:
temp_df = pd.DataFrame.from_dict(i, orient='index')
df = df.append(temp_df)
I have data like this :
A,B,C,D
1,50,1 ,3.9
2,20,22,1.5
3,10,10,2.3
2,15,11,1.8
1,16,13,4.2
and I want to group them by A that I would take mean for BandC and sum for D .
the solution would be like this :
df = df.groupby(['A']).agg({
'B': 'mean', 'C': 'mean', 'D': sum
})
I am asking about if there is a way to choose multiple columns for the same function rather than repeating it as in the case of BandC
If you require at most one aggregation per column, you can store the aggregations in a dict {func: col_list}, then unpack it when you aggregate.
d = {'mean': ['B', 'C'], sum: ['D']}
df.groupby(['A']).agg({col: f for f,cols in d.items() for col in cols})
# B C D
#A
#1 33.0 7.0 8.1
#2 17.5 16.5 3.3
#3 10.0 10.0 2.3
I have a 2-D numpy array each row of which consists of three elements - ['dataframe_column_name', 'dataframe_index', 'value'].
Now, I tried populating the pandas dataframe using iloc double for loop but it is quite slow. Is there any faster way of doing this. I am a bit new to pandas, so apologies in case this is something very basic.
Here is the code snippet :
my_nparray = [['a', 1, 123], ['b', 1, 230], ['a', 2, 321]]
for r in range(my_nparray.shape[0]):
[col, ind, value] = my_nparray[r]
df.iloc[col][ind] = value
This takes a lot of time when my_nparray is large, is there any other way of doing this?
Initially assume that I can create this data frame :
'a' 'b'
1 NaN NaN
2 NaN NaN
I want the output as :
'a' 'b'
1 123 230
2 321 NaN
You can use from_records and then pivot:
df = pd.DataFrame.from_records(my_nparray, index=1).pivot(columns=0)
2
0 a b
1
1 123.0 230.0
2 321.0 NaN
This specifies that the index uses field 1 from your array and pivot uses Series 0 for the columns.
Then we can reset the MultiIndex on the columns and the index:
df.columns = df.columns.droplevel(None)
df.columns.name = None
df.index.name = None
a b
1 123.0 230.0
2 321.0 NaN
Use DataFrame constructor with DataFrame.pivot and DataFrame.rename_axis:
df = pd.DataFrame(my_nparray).pivot(1,0,2).rename_axis(index=None, columns=None)
print (df)
a b
1 123.0 230.0
2 321.0 NaN