Mean between n and n+1 row in pandas groupby object? - python

I have a groupby object:
col1 col2 x y z
0 A D1 0.269002 0.131740 0.401020
1 B D1 0.201159 0.072912 0.775171
2 D D1 0.745292 0.725807 0.106000
3 F D1 0.270844 0.214708 0.935534
4 C D1 0.997799 0.503333 0.250536
5 E D1 0.851880 0.921189 0.085515
How do I sort the groupby object into the following:
col1 col2 x y z
0 A D1 0.269002 0.131740 0.401020
1 B D1 0.201159 0.072912 0.775171
4 C D1 0.997799 0.503333 0.250536
2 D D1 0.745292 0.725807 0.106000
5 E D1 0.851880 0.921189 0.085515
3 F D1 0.270844 0.214708 0.935534
And then compute the means between Row A {x, y, z} and Row B {x, y, z}, Row B {x, y, z} and Row C {x, y, z}... such that I have:
col1 col2 x_mean y_mean z_mean
0 A-B D1 0.235081 0.102326 0.588095
1 B-C D1 ... ... ...
4 C-D D1 ... ... ...
2 D-E D1 ... ... ...
5 E-F D1 ... ... ...
3 F-A D1 ... ... ...
I am basically trying to computationally find the midpoints between vertices of a hexagonal structure (well... more like 10 million). Hints appreciated!

I believe you need groupby with rolling and a mean aggregation; then, for the pair labels, use shift and drop the first row of each group, which is all NaN:
print (df)
col1 col2 x y z
0 A D1 0.269002 0.131740 0.401020
1 B D1 0.201159 0.072912 0.775171
2 D D1 0.745292 0.725807 0.106000
3 F D2 0.270844 0.214708 0.935534 <-change D1 to D2
4 C D2 0.997799 0.503333 0.250536 <-change D1 to D2
5 E D2 0.851880 0.921189 0.085515 <-change D1 to D2
df = (df.sort_values(['col1','col2'])
        .set_index('col1')
        .groupby('col2')[['x','y','z']]
        .rolling(2)
        .mean()
        .reset_index())
df['col1'] = df.groupby('col2')['col1'].shift() + '-' + df['col1']
df = df.dropna(subset=['col1','x','y','z'], how='all')
#alternative
#df = df[df['col2'].duplicated()]
print (df)
col2 col1 x y z
1 D1 A-B 0.235081 0.102326 0.588095
2 D1 B-D 0.473226 0.399359 0.440586
4 D2 C-E 0.924840 0.712261 0.168026
5 D2 E-F 0.561362 0.567948 0.510524
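Note the rolling approach above only produces the consecutive pairs (A-B, B-C, ...); it does not produce the wrap-around pair F-A that the desired output lists. A minimal sketch of one way to close the ring, starting from the original frame and assuming each col2 group is one polygon whose vertices are already in order after sorting (the helper name ring_midpoints is mine, not part of the original answer):
import numpy as np
import pandas as pd

def ring_midpoints(g):
    # midpoint of every vertex with the next one, wrapping the last back to the first
    vals = g[['x', 'y', 'z']].to_numpy()
    mids = (vals + np.roll(vals, -1, axis=0)) / 2
    labels = g['col1'].tolist()
    pairs = [a + '-' + b for a, b in zip(labels, labels[1:] + labels[:1])]
    return pd.DataFrame(mids, columns=['x', 'y', 'z']).assign(col1=pairs)

out = (df.sort_values(['col2', 'col1'])
         .groupby('col2')
         .apply(ring_midpoints)
         .reset_index(level=0)
         .reset_index(drop=True))
For ~10 million vertices a per-group apply may be slow; a fully vectorized variant (for example appending each group's first row to its end and reusing the rolling mean) should scale better.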


Midway between combine_first and merge?

I have a master dataframe A
id A B C
0 a b c
1 d e f
2 g h i
3 j k l
and a newer dataframe B
id A B D
0 a2 b2 x
1 d2 e2 y
2 g2 h2 z
3 j2 k2 NaN
4 l2 m2 NaN
5 n2 o2 NaN
If I merge them, I get duplicated, suffixed columns such as A_x and A_y. If I use combine_first, I end up with rows 4 and 5 and column D, which I'm not interested in. Besides doing something like
on_a = B["id"].isin(A["id"])
B = B.loc[on_a, ["A", "B"]]
A = A.combine_first(B)
Is there a way to overwrite B on A ignoring everything that isn't on A? The desired output is
id A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l
If the indices are the same, this is a simple update:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": ["a", "d", "g", "j"], "B": ["b", "e", "h", "k"], "C": ["c", "f", "i", "l"]})
>>> df1
A B C
0 a b c
1 d e f
2 g h i
3 j k l
>>> df2 = pd.DataFrame({"A": ["a2", "d2", "g2", "j2", "l2", "n2"], "B": ["b2", "e2", "h2", "k2", "m2", "o2"], "D": ["x", "y", "z", None, None, None]})
>>> df2
A B D
0 a2 b2 x
1 d2 e2 y
2 g2 h2 z
3 j2 k2 None
4 l2 m2 None
5 n2 o2 None
>>> df1.update(df2)
>>> df1
A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l
If you don't want to mutate the first dataframe, you can make a copy first.
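A minimal illustration of that (the name merged is mine):
>>> merged = df1.copy()   # leave df1 untouched
>>> merged.update(df2)    # in place; only cells whose index/column exist in merged are overwritten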
You could use join and then clean up the dataframe as desired. I do this dynamically by giving the columns that come from A a 'drop' suffix, and then using those suffixed columns as the subset for dropna.
df = B.join(A, lsuffix='', rsuffix='drop')
df = df.dropna(subset=[col for col in df.columns if 'drop' in col])
df = df[A.columns]
df
id A B C
0 a2 b2 c
1 d2 e2 f
2 g2 h2 i
3 j2 k2 l

Fill in same amount of characters where other column is NaN

I have the following dummy dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.NaN, 'ii~jj~kk~ll~mm']})
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h NaN
2 i,j,k,l,m ii~jj~kk~ll~mm
The real dataset has shape 500000, 90.
I need to unnest these values to rows and I'm using the new explode method for this, which works fine.
The problem is the NaN: those rows produce unequal lengths after the explode, so I need to fill the NaN with the same number of delimiters as there are values in Col1. In this case that is ~~~, since row 1 has three commas.
expected output
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h ~~~
2 i,j,k,l,m ii~jj~kk~ll~mm
Attempt 1:
df['Col2'].fillna(df['Col1'].str.count(',')*'~')
Attempt 2:
np.where(df['Col2'].isna(), df['Col1'].str.count(',')*'~', df['Col2'])
This works, but I feel like there's an easier method for this:
characters = df['Col1'].str.replace(r'\w', '', regex=True).str.replace(',', '~')
df['Col2'] = df['Col2'].fillna(characters)
print(df)
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h ~~~
2 i,j,k,l,m ii~jj~kk~ll~mm
d1 = df.assign(Col1=df['Col1'].str.split(',')).explode('Col1')[['Col1']]
d2 = df.assign(Col2=df['Col2'].str.split('~')).explode('Col2')[['Col2']]
final = pd.concat([d1,d2], axis=1)
print(final)
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
1 e
1 f
1 g
1 h
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
Question: is there an easier and more generalized method for this, or is my method fine as is?
pd.concat
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in df}, axis=1
).stack()
Col1 Col2
0 0 a aa
1 b bb
2 c cc
3 d dd
1 0 e NaN
1 f NaN
2 g NaN
3 h NaN
2 0 i ii
1 j jj
2 k kk
3 l ll
4 m mm
This loops on columns in df. It may be wiser to loop on keys in the delims dictionary.
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in delims}, axis=1
).stack()
Same thing, different look
delims = {'Col1': ',', 'Col2': '~'}
def f(c): return df[c].str.split(delims[c], expand=True)
pd.concat(map(f, delims), keys=delims, axis=1).stack()
One way is using str.repeat and fillna(), though I'm not sure how efficient this is:
df.Col2.fillna(pd.Series(['~']*len(df)).str.repeat(df.Col1.str.count(',')))
0 aa~bb~cc~dd
1 ~~~
2 ii~jj~kk~ll~mm
Name: Col2, dtype: object
Just split the dataframe into two
df1=df.dropna()
df2=df.drop(df1.index)
d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()
final = pd.concat([d1, d2], axis=1).append(d3.to_frame(),sort=False)
Out[77]:
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
1 e NaN
1 f NaN
1 g NaN
1 h NaN
zip_longest can be useful here, given you don't need the original Index. It will work regardless of which column has more splits:
from itertools import zip_longest, chain
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m', 'x,y'],
                   'Col2':['aa~bb~cc~dd', np.NaN, 'ii~jj~kk~ll~mm', 'xx~yy~zz']})
# Col1 Col2
#0 a,b,c,d aa~bb~cc~dd
#1 e,f,g,h NaN
#2 i,j,k,l,m ii~jj~kk~ll~mm
#3 x,y xx~yy~zz
l = [zip_longest(*x, fillvalue='')
     for x in zip(df.Col1.str.split(',').fillna(''),
                  df.Col2.str.split('~').fillna(''))]
pd.DataFrame(chain.from_iterable(l))
0 1
0 a aa
1 b bb
2 c cc
3 d dd
4 e
5 f
6 g
7 h
8 i ii
9 j jj
10 k kk
11 l ll
12 m mm
13 x xx
14 y yy
15 zz
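If you want the original column names back, you can pass them explicitly; a small addition to the answer above, not part of the original:
pd.DataFrame(chain.from_iterable(l), columns=['Col1', 'Col2'])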

how to convert column names into column values in pandas - python

df=pd.DataFrame(index=['x','y'], data={'a':[1,2],'b':[3,4]})
How can I convert the column names into values of a column? This is my desired output:
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
You can use:
print (df.T.unstack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
Or:
print (df.stack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
try this:
In [279]: df.stack().reset_index().set_index('level_0').rename(columns={'level_1':'c2',0:'c1'})
Out[279]:
c2 c1
level_0
x a 1
x b 3
y a 2
y b 4
Try:
df1 = df.stack().reset_index(-1).iloc[:, ::-1]
df1.columns = ['c1', 'c2']
df1
In [62]: (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
            .set_index('index'))
Out[62]:
c2 c1
index
x a 1
y a 2
x b 3
y b 4
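On newer pandas (1.1+), melt can keep the index directly via ignore_index=False, which avoids the reset_index/set_index round trip; a sketch along the same lines, not one of the original answers:
df.melt(var_name='c2', value_name='c1', ignore_index=False)[['c1', 'c2']]
Add a sort_index() afterwards if you want the x rows and y rows grouped together as in the desired output.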

pandas dataframe: return column that is a compression of other columns

I have a dataframe with a lot of columns, an arbitrary number of which the column names fit a specific string pattern. I want to create a new column that is set to 'r' if any of those other columns has an 'r' in it. I can do something like this:
for col in df.columns:
    if 'abc' in col:
        for i in df.index:
            if df.loc[i, col] == 'r':
                df.loc[i, 'newcol'] = 'r'
However this is a bit ugly and slow. Is there a faster way to do this?
Edit: including a sample of what my source data could look like:
df = pd.DataFrame({'abc1':['r','r','n','n'], 'abc2':['r','n','n','r'], 'xyz1':['r','n','n','n'], 'xyz2':['n','n','r','n']})
The output I need (in 'newcol') is:
abc1 abc2 xyz1 xyz2 newcol
0 r r r n r
1 r n n n r
2 n n n r nan
3 n r n n r
(nan could be replaced by pretty much anything as long as it's not 'r').
Alternatively newcol could contain True, True, False, True which would also work fine for my purposes.
Well, I'd probably do it as follows (an example dataframe that hopefully captures your situation well enough):
>>> df
A B abc1 abc2 abc3 abc4
0 1 4 x r a d
1 1 3 y d b e
2 2 4 z e c r
3 3 5 r g d f
4 4 8 z z z z
Get the columns of interest:
>>> cols = [x for x in df.columns if 'abc' in x]
>>> cols
['abc1', 'abc2', 'abc3', 'abc4']
>>> df['newcol'] = (df[cols] == 'r').any(axis=1).map({True:'r',False:'np.nan'})
>>> df
A B abc1 abc2 abc3 abc4 newcol
0 1 4 x r a d r
1 1 3 y d b e np.nan
2 2 4 z e c r r
3 3 5 r g d f r
4 4 8 z z z z np.nan
This should be pretty fast; I think even the use of map here will be a Cythonized call. If a boolean vector is sufficient for the newcol, you could simplify it to the following:
>>> df['newcol'] = (df[cols] == 'r').any(axis=1)
>>> df
A B abc1 abc2 abc3 abc4 newcol
0 1 4 x r a d True
1 1 3 y d b e False
2 2 4 z e c r True
3 3 5 r g d f True
4 4 8 z z z z False
Now, if you need to check if the strings contain 'r' instead of equalling 'r', you could do as follows:
>>> df
A B abc1 abc2 abc3 abc4
0 1 4 x root a d
1 1 3 y d b e
2 2 4 z e c bar
3 3 5 r g d f
4 4 8 z z z z
>>> cols = [x for x in df.columns if 'abc' in x]
>>> df['newcol'] = df[cols].apply(lambda x: x.str.contains('r'),axis=0).any(axis=1)
>>> df['newcol'] = df['newcol'].map({True:'r',False:'np.nan'})
>>> df
A B abc1 abc2 abc3 abc4 newcol
0 1 4 x root a d r
1 1 3 y d b e np.nan
2 2 4 z e c bar r
3 3 5 r g d f r
4 4 8 z z z z np.nan
This should still be pretty fast because it uses pandas' vectorized string methods for each of the columns (the apply is across the columns, not an iteration over the rows).
Try using apply with a custom function over axis=1, restricted to the matching columns (otherwise an 'r' in an xyz column would also count):
get_val_for_row = lambda items: 'r' if (items == 'r').any() else None
abc_cols = [c for c in df.columns if 'abc' in c]
df['newcol'] = df[abc_cols].apply(get_val_for_row, axis=1)
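As a side note (not part of the original answers), DataFrame.filter gives a compact, vectorized way to select the pattern columns and produce the boolean version:
df['newcol'] = df.filter(like='abc').eq('r').any(axis=1)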

Concatenate column values in Pandas DataFrame with "NaN" values

I'm trying to concatenate Pandas DataFrame columns with NaN values.
In [96]: df = pd.DataFrame({'col1' : ["1","1","2","2","3","3"],
                            'col2' : ["p1","p2","p1",np.nan,"p2",np.nan],
                            'col3' : ["A","B","C","D","E","F"]})
In [97]: df
Out[97]:
col1 col2 col3
0 1 p1 A
1 1 p2 B
2 2 p1 C
3 2 NaN D
4 3 p2 E
5 3 NaN F
In [98]: df['concatenated'] = df['col2'] +','+ df['col3']
In [99]: df
Out[99]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D NaN
4 3 p2 E p2,E
5 3 NaN F NaN
Instead of the NaN values in the "concatenated" column, how can I get "D" and "F" respectively for this example?
I don't think your problem is trivial. However, here is a workaround using numpy vectorization:
In [49]: def concat(*args):
    ...:     strs = [str(arg) for arg in args if not pd.isnull(arg)]
    ...:     return ','.join(strs) if strs else np.nan
    ...: np_concat = np.vectorize(concat)
    ...:
In [50]: np_concat(df['col2'], df['col3'])
Out[50]:
array(['p1,A', 'p2,B', 'p1,C', 'D', 'p2,E', 'F'],
dtype='|S64')
In [51]: df['concatenated'] = np_concat(df['col2'], df['col3'])
In [52]: df
Out[52]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D D
4 3 p2 E p2,E
5 3 NaN F F
[6 rows x 4 columns]
You could first replace NaNs with empty strings, for the whole dataframe or the column(s) you desire.
In [6]: df = df.fillna('')
In [7]: df['concatenated'] = df['col2'] +','+ df['col3']
In [8]: df
Out[8]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 D ,D
4 3 p2 E p2,E
5 3 F ,F
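Note this leaves a stray leading comma (",D" and ",F") compared with the desired output; one way to remove it (my addition, not part of the original answer) is to strip separators from the ends afterwards:
In [9]: df['concatenated'] = df['concatenated'].str.strip(',')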
We can use stack, which drops the NaN, then groupby on the first index level and agg with ','.join to combine the strings:
df['concatenated'] = df[['col2', 'col3']].stack().groupby(level=0).agg(','.join)
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D D
4 3 p2 E p2,E
5 3 NaN F F
