extract dataframes from list of dictionaries and combine into one - python

I have a list of dictionaries. Each dictionary holds a single key-value pair whose value is a data frame.
I would like to extract all the data frames and combine them into one.
I have tried:
df = pd.DataFrame.from_dict(data)
for both the full data file and for each dictionary in the list.
This gives the following error:
ValueError: If using all scalar values, you must pass an index
I have also tried turning the dictionary into a list and then converting it to a pd.DataFrame, but I get:
KeyError: 0
Any ideas?

It should be doable with pd.concat(). Let's say you have a list of dictionaries l:
import numpy as np
import pandas as pd
l = [
    {'a': pd.DataFrame(np.arange(9).reshape((3,3)))},
    {'b': pd.DataFrame(np.arange(9).reshape((3,3)))},
    {'c': pd.DataFrame(np.arange(9).reshape((3,3)))}
]
You can feed dataframes from each dict in the list to pd.concat():
# take the single data frame stored in each dict and concatenate them
df = pd.concat([next(iter(dict_.values())) for dict_ in l])
In my example all data frames have the same columns, so the result has shape 9 x 3. If your data frames have different columns, pd.concat aligns them by column name and fills the missing cells with NaN, so the output will require extra steps to process.
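To illustrate the differing-columns case mentioned above, here is a minimal sketch. The frames, keys and column names are made up for the example, and it assumes every dictionary holds exactly one data frame:

```python
import pandas as pd

# Hypothetical input: each dictionary holds exactly one DataFrame,
# and the two frames only share column 'x'.
dicts = [
    {'a': pd.DataFrame({'x': [1, 2], 'y': [3, 4]})},
    {'b': pd.DataFrame({'x': [5], 'z': [6]})},
]

# Pull every DataFrame out of every dictionary, whatever its key is.
frames = [frame for d in dicts for frame in d.values()]

# Columns are aligned by name; cells a frame does not have become NaN.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

Whether the NaN cells should then be filled, dropped or kept depends on the data, so that step is left out here.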

This should work.
import pandas as pd
dict1 = {'d1': pd.DataFrame({'a': [1,2,3], 'b': ['one', 'two', 'three']})}
dict2 = {'d2': pd.DataFrame({'a': [4,5,6], 'b': ['four', 'five', 'six']})}
dict3 = {'d3': pd.DataFrame({'a': [7,8,9], 'b': ['seven', 'eight', 'nine']})}
# dicts list. you would start from here
dicts_list = [dict1, dict2, dict3]
dict_counter = 0
for _dict in dicts_list:
    aux_df = list(_dict.values())[0]
    if dict_counter == 0:
        df = aux_df
    else:
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df = pd.concat([df, aux_df])
    dict_counter += 1
# Resetting and dropping the old index
df = df.reset_index(drop=True)
print(df)
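If it matters which dictionary key each row came from, one possible variation (a sketch, not part of the answer above) is to pass a key-to-frame mapping to pd.concat, which then uses the keys as an outer index level. The level name 'source' is an arbitrary choice for this example:

```python
import pandas as pd

dicts_list = [
    {'d1': pd.DataFrame({'a': [1, 2], 'b': ['one', 'two']})},
    {'d2': pd.DataFrame({'a': [3, 4], 'b': ['three', 'four']})},
]

# Merge the single-entry dictionaries into one mapping of key -> DataFrame.
frames_by_key = {key: frame for d in dicts_list for key, frame in d.items()}

# Passing a mapping to pd.concat uses its keys as the outer index level.
df = pd.concat(frames_by_key, names=['source'])

# Turn the outer level into an ordinary column and renumber the rows.
flat = df.reset_index(level='source').reset_index(drop=True)
print(flat)
```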
Just out of curiosity: Why are your sub-dataframes already included in a dictionary? An easy way of creating a dataframe from dictionaries is just building a list of dictionaries and then calling pd.DataFrame(list_with_dicts). If the keys are the same across all dictionaries, it should work. Just a suggestion from my side. Something like this:
list_with_dicts = [{'a': 1, 'b': 2}, {'a': 5, 'b': 4}, ...]
# my_df -> DataFrame with columns [a, b] and two rows with the values in the dict.
my_df = pd.DataFrame(list_with_dicts)

Related

Pandas appending dictionary values with iterrows row values

I have a dict of city names, each having an empty list as a value. I am trying to use
df.iterrows()
to append corresponding names to each dict key(city):
for index, row in df.iterrows():
    dict[row['city']].append(row['fullname'])
Can somebody explain why the code above appends all possible 'fullname' values to each dict's key instead of appending them to their respective city keys?
I.e. instead of getting the result
{"City1":["Name1","Name2"],"City2":["Name3","Name4"]}
I'm getting
{"City1":["Name1","Name2","Name3","Name4"],"City2":["Name1","Name2","Name3","Name4"]}
Edit: providing a sample of the dataframe:
d = {'fullname': ['Jason', 'Katty', 'Molly', 'Nicky'],
'city': ['Arizona', 'Arizona', 'California', 'California']}
df = pd.DataFrame(data=d)
Edit 2:
I'm pretty sure that my problem lies in my dict, since I created it in the following way:
cities = []
for i in df['city']:
    cities.append(i)
dict = dict.fromkeys(set(cities), [])
When I call dict, I get the correct output:
{"Arizona":[],"California":[]}
However, if I specify a key dict['Arizona'], I get this:
{"index":[],"columns":[],"data":[]}
I'm surprised it works at all, because row is a Series.
How about this alternative approach:
for city in your_dict.keys():
    your_dict[city] += list(df["fullname"][df["city"] == city])
You should always avoid iterating through dataframes unless it's absolutely necessary.
The problem is indeed dict.fromkeys: the [] argument is evaluated once, so all of the keys are "pointing to" the same list object.
>>> dict.fromkeys(['one', 'two'], [])
{'one': [], 'two': []}
>>> d = dict.fromkeys(['one', 'two'], [])
>>> d['one'].append('three')
>>> d
{'one': ['three'], 'two': ['three']}
You'd need a comprehension to create a distinct list for each key.
>>> d = { k: [] for k in ['one', 'two'] }
>>> d
{'one': [], 'two': []}
>>> d['one'].append('three')
>>> d
{'one': ['three'], 'two': []}
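Another way to get a distinct list per key, sketched here as an alternative to the comprehension, is collections.defaultdict(list), which creates a fresh list the first time a key is seen. The sample frame mirrors the one in the question; the name names_by_city is made up for the example:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'fullname': ['Jason', 'Katty', 'Molly', 'Nicky'],
    'city': ['Arizona', 'Arizona', 'California', 'California'],
})

# With fromkeys, both keys hold the very same list object.
shared = dict.fromkeys(['one', 'two'], [])
same_object = shared['one'] is shared['two']

# defaultdict(list) builds a new, independent list per missing key.
names_by_city = defaultdict(list)
for city, fullname in zip(df['city'], df['fullname']):
    names_by_city[city].append(fullname)

print(same_object)
print(dict(names_by_city))
```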
You are also manually implementing a groupby with your code:
>>> df.groupby('city')['fullname'].agg(list)
city
Arizona       [Jason, Katty]
California    [Molly, Nicky]
Name: fullname, dtype: object
If you want a dict:
>>> df.groupby('city')['fullname'].agg(list).to_dict()
{'Arizona': ['Jason', 'Katty'], 'California': ['Molly', 'Nicky']}

Creating a dictionary out of pandas dataframe, where the value is the index

I have a pandas dataframe like so:
A
a
b
c
d
I am trying to create a python dictionary which would look like this:
df_dict = {'a':0, 'b':1, 'c':2, 'd':3}
What I've tried:
df.reset_index(inplace=True)
df = {x : y for x in df['A'] for y in df['index']}
But the df is 75k rows long and it's taking a while now; I'm not even sure this produces the result I need. Is there a neat, fast way of achieving this?
Use dict with zip and range:
d = dict(zip(df['A'], range(len(df))))
print (d)
{'a': 0, 'b': 1, 'c': 2, 'd': 3}
You can do it like this:
# creating an example dataframe with 75,000 rows
import uuid
import pandas as pd
df = pd.DataFrame({"col": [str(uuid.uuid4()) for _ in range(75000)]})
# your bit: reset_index() yields (index, value) rows; the value becomes the key
{v: i for i, v in df.reset_index().values}
It runs in seconds.
You could convert the series to a list and use enumerate:
d = {x: i for i, x in enumerate(df['A'].tolist())}
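All of the approaches above assume the values in column A are unique, since a dictionary keeps only one entry per key. A small sketch of the same mapping built through a Series, with an explicit uniqueness check added as an assumption of this example:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']})

# If A contained duplicates, later rows would silently overwrite earlier
# ones in the dictionary, so check first.
assert df['A'].is_unique

# Values of column A become the keys, positional row numbers the values.
mapping = pd.Series(range(len(df)), index=df['A']).to_dict()
print(mapping)
```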

Creating a column, where the value of each row is a key of a specified dict, based on whether existing column contains that dict value as a substring?

Say I have the following dictionary
dict = {'a': ['tool', 'device'], 'b': ['food', 'beverage']},
and I have a dataframe with a column with the first 2 row values as
'tools',
'foods'
and I want to create a new column where the 1st value is a, and the second is b.
What would be the best way to do this?
First, don't use dict as a variable name, because it shadows the Python builtin. Then invert the dictionary so that each item in the value lists becomes a key pointing to its original key, extract the matching keyword from the column with Series.str.findall, and map it through the inverted dictionary with Series.map to create the new column:
d = {'a': ['tool', 'device'], 'b': ['food', 'beverage']}
df = pd.DataFrame({'col':['tools','foods']})
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'tool': 'a', 'device': 'a', 'food': 'b', 'beverage': 'b'}
df['new'] = df['col'].str.findall('|'.join(d1.keys())).str[0].map(d1)
print (df)
     col new
0  tools   a
1  foods   b
Or:
df['new'] = df['col'].str.extract('({})'.format('|'.join(d1.keys())), expand=False).map(d1)
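Both variants build a regular expression directly from the keywords. If a keyword could contain regex metacharacters such as '.' or '+', escaping it with re.escape is a reasonable precaution. The sketch below assumes that, and adds a made-up third row ('chairs') to show that rows without any matching keyword end up as NaN:

```python
import re

import pandas as pd

d = {'a': ['tool', 'device'], 'b': ['food', 'beverage']}
df = pd.DataFrame({'col': ['tools', 'foods', 'chairs']})

# Invert the dictionary: each keyword points back to its original key.
keyword_to_key = {kw: key for key, kws in d.items() for kw in kws}

# Escape the keywords so that every character is matched literally.
pattern = '(' + '|'.join(re.escape(kw) for kw in keyword_to_key) + ')'

# Extract the first matching keyword, then translate it to its key.
df['new'] = df['col'].str.extract(pattern, expand=False).map(keyword_to_key)
print(df)
```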

Can I join two data frames using one column in df1 and one of any values in a cell in df2?

I'm working with some geospatial data, df_geo, and I have a CSV of values, called df_data, that I'd like to join to the location data frame.
My issue, however, is that there are multiple ways to spell the values in the column I'd like to join the two data frames on (region names). Look at the Catalonia example below, in df_geo: there are 6 different ways to spell the region name, depending on the language.
My question is this: if the row is named "Catalonia" in df_data, how would I go about joining df_data to df_geo?
Since the rows are unique to a region, you can create a dictionary that maps any name in 'VARNAME_1' to the index from df_geo.
Then use this to map the names in df_data to a dummy column, and you can do a simple merge on the index in df_geo and the mapped column in df_data.
To get the dictionary do:
d = dict((y,ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
Sample Data:
import pandas as pd
# raw strings keep the backslashes literal and avoid invalid-escape warnings
df_geo = pd.DataFrame({'VARNAME_1': [r'Catalogna\Catalogne\Catalonia', r'A\B\C\D\E\F\G']})
df_data = pd.DataFrame({'Name': ['Catalogna', 'Seven', 'E'],
                        'Vals': [1,2,3]})
Code
d = dict((y,ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
#{'A': 1,
# 'B': 1,
# 'C': 1,
# 'Catalogna': 0,
# 'Catalogne': 0,
# 'Catalonia': 0,
# 'D': 1,
# 'E': 1,
# 'F': 1,
# 'G': 1}
df_data['ID'] = df_data.Name.map(d)
df_data.merge(df_geo, left_on='ID', right_index=True, how='left').drop(columns='ID')
Output:
        Name  Vals                      VARNAME_1
0  Catalogna     1  Catalogna\Catalogne\Catalonia
1      Seven     2                            NaN
2          E     3                  A\B\C\D\E\F\G
How the dictionary works.
df_geo.VARNAME_1.str.split(r'\\') splits the string in VARNAME_1 on the '\' character and places all the separated values in a Series of lists. Using .items on the Series gives you a tuple (which we unpacked into two separate values), with the first value being the index, which is the same as the index of the original DataFrame, and the second item being the list of split names:
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    print(f'id:{ids} and val:{val}')
#id:0 and val:['Catalogna', 'Catalogne', 'Catalonia']
#id:1 and val:['A', 'B', 'C', 'D', 'E', 'F', 'G']
So now val is a list, which we again want to iterate over to create our dictionary.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    for y in val:
        print(f'id:{ids} and y:{y}')
#id:0 and y:Catalogna
#id:0 and y:Catalogne
#id:0 and y:Catalonia
#id:1 and y:A
#id:1 and y:B
#id:1 and y:C
#id:1 and y:D
#id:1 and y:E
#id:1 and y:F
#id:1 and y:G
And so the dictionary I created was with y as the key, and the original DataFrame index ids as the value.
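The same name-to-index dictionary can also be built without the nested loop. The following is a sketch of an alternative, not the approach used above: Series.explode puts each spelling on its own row while keeping the original df_geo index. The sample data is shortened and the name name_to_id is made up for the example:

```python
import pandas as pd

df_geo = pd.DataFrame({'VARNAME_1': [r'Catalogna\Catalogne\Catalonia', r'A\B\C']})
df_data = pd.DataFrame({'Name': ['Catalonia', 'Seven', 'B'], 'Vals': [1, 2, 3]})

# One row per alternative spelling; the index still points at the df_geo row.
spellings = df_geo['VARNAME_1'].str.split(r'\\').explode()

# spelling -> index of the df_geo row it belongs to
name_to_id = dict(zip(spellings, spellings.index))

# Map the names to a helper column, merge on it, then drop it again.
df_data['ID'] = df_data['Name'].map(name_to_id)
merged = (df_data
          .merge(df_geo, left_on='ID', right_index=True, how='left')
          .drop(columns='ID'))
print(merged)
```

Names with no match in df_geo (such as 'Seven' here) keep their row and get NaN in VARNAME_1, because the merge is a left join.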

How to combine multiple columns from a pandas df into a list

How can you combine multiple columns from a dataframe into a list?
Input:
df = pd.DataFrame(np.random.randn(10000, 7), columns=list('ABCDEFG'))
If I wanted to create a list from column A I would perform:
df1 = df['A'].tolist()
But if I wanted to combine numerous columns into this list, it wouldn't be efficient to write df['A','B','C'...'Z'].tolist()
I have tried to do the following but it just adds the columns headers to a list.
df1 = list(df.columns)[0:8]
Intended input:
A B C D E F G
0 0.787576 0.646178 -0.561192 -0.910522 0.647124 -1.388992 0.728360
1 0.265409 -1.919283 -0.419196 -1.443241 -2.833812 -1.066249 0.553379
2 0.343384 0.659273 -0.759768 0.355124 -1.974534 0.399317 -0.200278
Intended Output:
[0.787576, 0.646178, -0.561192, -0.910522, 0.647124, -1.388992, 0.728360,
0.265409, -1.919283, -0.419196, -1.443241, -2.833812, -1.066249, 0.553379,
0.343384, 0.659273, -0.759768, 0.355124, -1.974534, 0.399317, -0.200278]
Is this what you are looking for?
lst = df.values.tolist()
flat_list = [item for x in lst for item in x]
print(flat_list)
You can use to_dict:
df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
# the short alias 'l' is no longer accepted by recent pandas versions; spell out 'list'
df.to_dict('list')
Out[1036]:
{'A': [-0.5611441440595607,
-0.3785906500723589,
-0.19480328695097676,
-0.7472526275034221,
-2.4232786057647457,
0.10506614562827334,
0.4968179288412277,
1.635737019365132,
-1.4286421753281746,
0.4973223222844811],
'B': [-1.0550082961139444,
-0.1420067090193365,
0.30130476834580633,
1.1271866812852227,
0.38587456174846285,
-0.531163142682951,
-1.1335754634118729,
0.5975963084356348,
-0.7361022807495443,
1.4329395663140427],
...}
Or use values.tolist() on the selected columns:
df[list('ABC')].values.tolist()
Out[1041]:
[[0.09552771302434987, 0.18551596484768904, -0.5902249875268607],
[-1.5285190712746388, 1.2922627021799646, -0.8347422966138306],
[-0.4092028716404067, -0.5669107267579823, 0.3627970727410332],
[-1.3546346273319263, -0.9352316948439341, 1.3568726575880614],
[-1.3509518030469496, 0.10487182694997808, -0.6902134363370515]]
Edit: to flatten the selected columns into a single array, column by column: np.concatenate(df[list('ABC')].T.values.tolist())
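The order of the flattened values depends on the direction of the flattening. A small sketch with a made-up 2 x 3 frame, using to_numpy() and ravel(), shows the row-by-row order that the intended output in the question asks for next to the column-by-column order:

```python
import numpy as np
import pandas as pd

# Small frame with recognisable values:
#    A  B  C
# 0  0  1  2
# 1  3  4  5
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('ABC'))

# Row by row: all of row 0, then all of row 1.
row_wise = df[['A', 'B', 'C']].to_numpy().ravel().tolist()

# Column by column: all of A, then all of B, then all of C.
col_wise = df[['A', 'B', 'C']].to_numpy().ravel(order='F').tolist()

print(row_wise)
print(col_wise)
```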
