Nested dictionary of namedtuples to pandas dataframe - python

I have namedtuples defined as follows:
In[37]: from collections import namedtuple
Point = namedtuple('Point', 'x y')
The nested dictionary has the following format:
In[38]: d
Out[38]:
{1: {None: {1: Point(x=1.0, y=5.0), 2: Point(x=4.0, y=8.0)}},
2: {None: {1: Point(x=45324.0, y=24338.0), 2: Point(x=45.0, y=38.0)}}}
I am trying to create a pandas dataframe from the dictionary d without having to do for loops.
I have succeeded in creating the dataframe from a subset of the dictionary by doing this:
In[40]: df=pd.DataFrame(d[1][None].values())
In[41]: df
Out[41]:
x y
0 1 5
1 4 8
But I want to be able to create the dataframe from the entire dictionary.
I want the dataframe to look like this (I am using multi-index notation):
In[42]: df
Out[42]:
Subcase Step  ID        x        y
1       None  1       1.0      5.0
              2       4.0      8.0
2       None  1   45324.0  24338.0
              2      45.0     38.0
The from_dict method of DataFrame only supports up to two levels of nesting, so I was not able to use it. I am also considering modifying the structure of the d dictionary to achieve my goal. Furthermore, it does not even have to be a dictionary.
Thank you.

There are already several answers to similar questions on SO (here, here, or here). These solutions can be adapted to this problem as well. However, none of them is general enough to run on an arbitrary dict, so I decided to write something more universal.
This is a function that can be run on any dict. The dict has to have the same depth in all of its branches; otherwise it will most likely raise an error.
import pandas as pd

def frame_from_dict(dic, depth=None, **kwargs):
    def get_dict_depth(dic):
        if not isinstance(dic, dict):
            return 0
        for v in dic.values():
            return get_dict_depth(v) + 1

    if depth is None:
        depth = get_dict_depth(dic)
    if depth == 0:
        return pd.Series(dic)
    elif depth > 0:
        keys = []
        vals = []
        for k, v in dic.items():
            keys.append(k)
            vals.append(frame_from_dict(v, depth - 1))
        try:
            # sort keys and values together so the concat labels stay aligned
            keys, vals = zip(*sorted(zip(keys, vals), key=lambda kv: kv[0]))
        except TypeError:
            # unorderable types
            pass
        return pd.concat(vals, axis=1, keys=keys, **kwargs)
    raise ValueError("depth should be a nonnegative integer or None")
I sacrificed the namedtuple case from this question for the sake of generality, but it can be tweaked if needed.
In this particular case, it can be applied as follows:
df = frame_from_dict(d, names=['Subcase', 'Step', 'ID']).T
df.columns = ['x', 'y']
df
Out[115]:
                       x        y
Subcase Step ID
1       NaN  1       1.0      5.0
             2       4.0      8.0
2       NaN  1   45324.0  24338.0
             2      45.0     38.0
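Instead of hard-coding df.columns = ['x', 'y'], the column names could also be taken from the namedtuple itself (a small tweak, assuming every leaf value is a Point):
df.columns = Point._fields  # ('x', 'y')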

I decided to flatten the keys into a tuple (tested using pandas 0.18.1):
In [5]: from collections import namedtuple
In [6]: Point = namedtuple('Point', 'x y')
In [11]: from collections import OrderedDict
In [14]: d=OrderedDict()
In [15]: d[(1,None,1)]=Point(x=1.0, y=5.0)
In [16]: d[(1,None,2)]=Point(x=4.0, y=8.0)
In [17]: d[(2,None,1)]=Point(x=45324.0, y=24338.0)
In [18]: d[(2,None,2)]=Point(x=45.0, y=38.0)
Finally,
In [7]: import pandas as pd
In [8]: df=pd.DataFrame(d.values(), index=pd.MultiIndex.from_tuples(d.keys(), names=['Subcase','Step','ID']))
In [9]: df
Out[9]:
                       x        y
Subcase Step ID
1       NaN  1       1.0      5.0
             2       4.0      8.0
2       NaN  1   45324.0  24338.0
             2      45.0     38.0
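If the data already lives in the nested dict d from the question, the flat tuple-keyed dict does not have to be built by hand; a dict comprehension over the three levels does the same thing (a sketch, assuming the fixed Subcase/Step/ID nesting shown in the question):
# flatten the three nesting levels into (Subcase, Step, ID) tuple keys
flat = {(sub, step, pid): pt
        for sub, steps in d.items()
        for step, points in steps.items()
        for pid, pt in points.items()}
df = pd.DataFrame(list(flat.values()),
                  index=pd.MultiIndex.from_tuples(list(flat.keys()),
                                                  names=['Subcase', 'Step', 'ID']))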

Related

how do I do a value count for items inside a list in a dataframe [duplicate]

I have a dataframe column whose entries are lists of strings:
df['colors']
0 ['blue','green','brown']
1 []
2 ['green','red','blue']
3 ['purple']
4 ['brown']
What I'm trying to get is:
'blue' 2
'green' 2
'brown' 2
'red' 1
'purple' 1
[] 1
Without really knowing what I was doing, I even managed to count the characters in the entire column:
b 5
[ 5
] 5
etc.
which I think was pretty cool, but the solution to this escapes me
Solution
Best option: df.colors.explode().dropna().value_counts().
However, if you also want to have counts for empty lists ([]), use Method-1.B/C similar to what was suggested by Quang Hoang in the comments.
You can use any of the following two methods.
Method-1: Use pandas methods alone ⭐⭐⭐
explode --> dropna --> value_counts
Method-2: Use list.extend --> pd.Series.value_counts
## Method-1
# A. If you don't want counts for empty []
df.colors.explode().dropna().value_counts()
# B. If you want counts for empty [] (classified as NaN)
df.colors.explode().value_counts(dropna=False) # returns [] as NaN
# C. If you want counts for empty [] (classified as [])
df.colors.explode().fillna('[]').value_counts() # returns [] as []
## Method-2
colors = []
_ = [colors.extend(e) for e in df.colors if len(e)>0]
pd.Series(colors).value_counts()
Output:
green 2
blue 2
brown 2
red 1
purple 1
# NaN 1 ## For Method-1.B
# [] 1 ## For Method-1.C
dtype: int64
Dummy Data
import pandas as pd
df = pd.DataFrame({'colors': [['blue','green','brown'],
                              [],
                              ['green','red','blue'],
                              ['purple'],
                              ['brown']]})
Use a Counter + chain, which is meant to do exactly this. Then construct the Series from the Counter object.
import pandas as pd
from collections import Counter
from itertools import chain
s = pd.Series([['blue','green','brown'], [], ['green','red','blue']])
pd.Series(Counter(chain.from_iterable(s)))
#blue 2
#green 2
#brown 1
#red 1
#dtype: int64
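If the empty lists should also show up in the tally, as in the question's expected output, one way (a small sketch) is to count them separately and add a bucket to the same Series:
counts = pd.Series(Counter(chain.from_iterable(s)))
counts['[]'] = (s.map(len) == 0).sum()  # number of rows holding an empty list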
While explode + value_counts are the pandas way to do things, they're slower for shorter lists.
import perfplot
import pandas as pd
import numpy as np
from collections import Counter
from itertools import chain
def counter(s):
    return pd.Series(Counter(chain.from_iterable(s)))

def explode(s):
    return s.explode().value_counts()

perfplot.show(
    setup=lambda n: pd.Series([['blue','green','brown'], [], ['green','red','blue']]*n),
    kernels=[
        lambda s: counter(s),
        lambda s: explode(s),
    ],
    labels=['counter', 'explode'],
    n_range=[2 ** k for k in range(17)],
    equality_check=np.allclose,
    xlabel='~len(s)'
)
You can use Counter from the collections module:
import pandas as pd
from collections import Counter
from itertools import chain
df = pd.DataFrame({'colors': [['blue','green','brown'],
                              [],
                              ['green','red','blue'],
                              ['purple'],
                              ['brown']]})
df = pd.Series(Counter(chain(*df.colors)))
print(df)
Output:
blue 2
green 2
brown 2
red 1
purple 1
dtype: int64
A quick and dirty solution would be something like this, I imagine.
You'd still have to add a condition to catch the empty lists, though.
colors = df.colors.tolist()
d = {}
for l in colors:
    for c in l:
        if c not in d.keys():
            d.update({c: 1})
        else:
            current_val = d.get(c)
            d.update({c: current_val + 1})
this produces a dictionary looking like this:
{'blue': 2, 'green': 2, 'brown': 2, 'red': 1, 'purple': 1}
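To get it into the same shape as the other answers, the dict can simply be wrapped in a Series and sorted (a small sketch):
import pandas as pd
pd.Series(d).sort_values(ascending=False)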
I would use .apply with pd.Series.value_counts to accomplish this:
# 1. Expand columns and count them
df_temp = df["colors"].apply(pd.Series.value_counts)
   blue  brown  green  purple  red
0   1.0    1.0    1.0     NaN  NaN
1   NaN    NaN    NaN     NaN  NaN
2   1.0    NaN    1.0     NaN  1.0
3   NaN    NaN    NaN     1.0  NaN
4   NaN    1.0    NaN     NaN  NaN
# 2. Get the value counts from this:
df_temp.sum()
blue 2.0
brown 2.0
green 2.0
purple 1.0
red 1.0
# Alternatively, convert to a dict
df_temp.sum().to_dict()
# {'blue': 2.0, 'brown': 2.0, 'green': 2.0, 'purple': 1.0, 'red': 1.0}

Pandas groupby mean() not ignoring NaNs

If I calculate the mean of a groupby object and one of the groups contains a NaN, the NaN is ignored. Even when applying np.mean, it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as there is one NaN within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
     a
b
1  1.5
2  3.0
c.groupby('b').agg(np.mean)
     a
b
1  1.5
2  3.0
I want to receive the following result:
     a
b
1  1.5
2  NaN
I am aware that I can replace NaNs beforehand and that I could probably write my own aggregation function to return NaN as soon as a NaN is within the group. This function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
     a
b
1  1.5
2  NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on the comments by @Serge Ballesta and @RoelAdriaans:
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
     a
b
1  1.5
2  NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
          a
b
1  1.500000
2       inf
There are three different methods for it.
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply, but slower than the built-in default:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
method3 = c.groupby('b').mean()
# blank out the groups whose 'a' values contain a NaN
nan_groups = c.loc[c['a'].isna(), 'b'].unique()
method3.loc[method3.index.isin(nan_groups)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
     b    c                   d
a
1  3.0  3.0  0.000000+2.000000j
2  3.0  0.0  0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
     b    c    d
a
1  3.0  3.0  NaN
2  NaN  NaN  NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
gb.sum() * (cnt / cnt)
Out[]:
     b    c                   d
a
1  3.0  3.0  0.000000+2.000000j
2  3.0  NaN                 NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
     b    c                   d
a
1  1.5  1.5  0.000000+2.000000j
2  3.0  NaN                 NaN
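Applied back to the question's frame c, the same "single NaN kills the group" mask combined with the fast built-in mean reproduces the output the question asked for (a small sketch):
g = c.groupby('b')
mask = g.size().values[:, None] == g.count().values  # False wherever a group contains a NaN
g.mean().where(mask)
#      a
# b
# 1  1.5
# 2  NaN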

Python Pandas: Passing arguments to a function in agg()

I am trying to reduce data in a pandas dataframe by using different kinds of functions and argument values. However, I did not manage to change the default arguments of the aggregation functions. Here is an example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
... 'y': ['a','a','b','b']})
>>> df
x y
0 1.0 a
1 NaN a
2 2.0 b
3 1.0 b
Here is an aggregation function for which I would like to test different values of b:
>>> def translate_mean(x, b=10):
... y = [elem + b for elem in x]
... return np.mean(y)
In the following code, I can use this function with the default b value, but I would like to pass other values:
>>> df.groupby('y').agg(translate_mean)
      x
y
a   NaN
b  11.5
Any ideas?
Just pass the arguments to agg (this works with apply, too).
df.groupby('y').agg(translate_mean, b=4)
Out:
     x
y
a  NaN
b  5.5
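An equivalent spelling, if you prefer to bind the argument up front rather than pass it through agg, is functools.partial (a small sketch):
from functools import partial
df.groupby('y').agg(partial(translate_mean, b=4))  # same result as agg(translate_mean, b=4)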
Maybe you can try using apply in this case:
df.groupby('y').apply(lambda x: translate_mean(x['x'], 20))
Now the result is:
y
a NaN
b 21.5
In case you have multiple columns and you want to apply different functions and different parameters for each column, you can use a lambda function together with agg.
For example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
...                    'y': ['a','a','b','b'],
...                    'z': [0.1,0.2,0.3,0.4]})
>>> df
     x  y    z
0  1.0  a  0.1
1  NaN  a  0.2
2  2.0  b  0.3
3  1.0  b  0.4
>>> def translate_mean(x, b=10):
... y = [elem + b for elem in x]
... return np.mean(y)
To group by column 'y' and apply translate_mean with b=10 for column 'x' and b=25 for column 'z', you can try this:
df_res = df.groupby(by='y').agg({
    'x': lambda x: translate_mean(x, 10),
    'z': lambda x: translate_mean(x, 25)})
Hopefully, it helps.

python, rank a list of number/string (convert list elements to ordinal value)

Say I have a list (or numpy array or pandas series) as below
l = [1,2,6,6,4,2,4]
I want to return a list of each value's ordinal: 1-->1 (smallest), 2-->2, 4-->3, 6-->4, so that
to_ordinal(l) == [1,2,4,4,3,2,3]
and I want it to also work for lists of strings.
I can try
s = numpy.unique(l)
then loop over each element in l and find its index in s. Just wonder if there is a direct method?
In pandas you can call rank and pass method='dense':
In [18]:
l = [1,2,6,6,4,2,4]
s = pd.Series(l)
s.rank(method='dense')
Out[18]:
0 1
1 2
2 4
3 4
4 3
5 2
6 3
dtype: float64
This also works for strings:
In [19]:
l = ['aaa','abc','aab','aba']
s = pd.Series(l)
s
Out[19]:
0 aaa
1 abc
2 aab
3 aba
dtype: object
In [20]:
s.rank(method='dense')
Out[20]:
0 1
1 4
2 2
3 3
dtype: float64
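If a plain list of integer ordinals is wanted, as in the question, the ranks can be cast and converted (a small follow-up):
s.rank(method='dense').astype(int).tolist()  # [1, 4, 2, 3] for the string example above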
I don't think that there is a "direct method" for this.¹ The most straightforward way that I can think to do it is to sort a set of the elements:
sorted_unique = sorted(set(l))
Then make a dictionary mapping the value to it's ordinal:
ordinal_map = {val: i for i, val in enumerate(sorted_unique, 1)}
Now one more pass over the data and we can get your list:
ordinals = [ordinal_map[val] for val in l]
Note that this is roughly an O(N log N) algorithm (due to the sort), and the more non-unique elements you have, the closer it becomes to O(N).
¹ Certainly not in vanilla Python, and I don't know of anything in numpy. I'm less familiar with pandas, so I can't speak to that.
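For what it's worth, numpy.unique can also hand back the inverse indices, which are exactly these dense ordinals (0-based, so add 1); a small sketch that works for numbers and strings alike:
import numpy as np
l = [1, 2, 6, 6, 4, 2, 4]
_, inv = np.unique(l, return_inverse=True)  # inv indexes each element into the sorted unique values
(inv + 1).tolist()  # [1, 2, 4, 4, 3, 2, 3]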

Interpolation on DataFrame in pandas

I have a DataFrame, say a volatility surface with time as the index and strike as the columns. How do I do two-dimensional interpolation? I can reindex, but how do I deal with the NaNs? I know we can fillna(method='pad'), but that is not even linear interpolation. Is there a way we can plug in our own method to do the interpolation?
You can use DataFrame.interpolate to get a linear interpolation.
In : df = pandas.DataFrame(numpy.random.randn(5,3), index=['a','c','d','e','g'])
In : df
Out:
0 1 2
a -1.987879 -2.028572 0.024493
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
g -1.632493 0.938456 0.492695
In : df2 = df.reindex(['a','b','c','d','e','f','g'])
In : df2
Out:
0 1 2
a -1.987879 -2.028572 0.024493
b NaN NaN NaN
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
f NaN NaN NaN
g -1.632493 0.938456 0.492695
In : df2.interpolate()
Out:
0 1 2
a -1.987879 -2.028572 0.024493
b 0.052363 -1.729055 0.114652
c 2.092605 -1.429537 0.204811
d 0.767215 1.077814 0.565666
e -1.027733 1.330702 -0.490780
f -1.330113 1.134579 0.000958
g -1.632493 0.938456 0.492695
For anything more complex, you need to roll out your own function that deals with a Series object, fills the NaN values as you like, and returns another Series object.
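As an illustration of such a custom fill, here is a minimal sketch that fits a cubic interp1d per column over the row positions (assuming scipy is available; positions are used because the index in this example is non-numeric):
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def spline_fill(s):
    # interpolate (and extrapolate) one column over its positional order
    pos = np.arange(len(s))
    known = s.notna().values
    f = interp1d(pos[known], s.values[known], kind='cubic', fill_value='extrapolate')
    return pd.Series(f(pos), index=s.index)

df2.apply(spline_fill)  # column-wise custom interpolation of the NaN rows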
Old thread, but I thought I would share my solution for 2D extrapolation/interpolation, respecting index values, which also works on demand. The code ended up a bit weird, so let me know if there is a better solution:
import pandas
import numpy
from numpy import nan

dataGrid = pandas.DataFrame({1: {1: 1, 3: 2},
                             2: {1: 3, 3: 4}})

def getExtrapolatedInterpolatedValue(x, y):
    global dataGrid
    if x not in dataGrid.index:
        dataGrid.loc[x] = nan
        dataGrid = dataGrid.sort_index()
        dataGrid = dataGrid.interpolate(method='index', axis=0).ffill(axis=0).bfill(axis=0)
    if y not in dataGrid.columns.values:
        dataGrid = dataGrid.reindex(columns=numpy.append(dataGrid.columns.values, y))
        dataGrid = dataGrid.sort_index(axis=1)
        dataGrid = dataGrid.interpolate(method='index', axis=1).ffill(axis=1).bfill(axis=1)
    return dataGrid[y][x]

print(getExtrapolatedInterpolatedValue(2, 1.4))
# 2.3
