How to shift a dataframe element-wise to fill NaNs? - python

I have a DataFrame like this:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
I am trying to fill each NaN with the value from the previous column in the next row, and then drop that second row. In other words, I want to combine the two rows containing NaNs into a single row without NaNs, like this:
a b
0 A E
1 B C
2 D F
I have tried various flavors of df.fillna(method="<bfill/ffill>"), but none gave the expected output.
I haven't found any other question about this exact problem; here's the closest one. Note that this DataFrame is actually built from a list of DataFrames with .concat(), as you may notice from the indexes. I mention this because it may be easier to do on a single frame rather than across multiple ones.
I have found suggestions to use shift and combine_first, but none of them worked for me. You may try those too.
I also found this, a whole article about filling NaN values, but it doesn't cover a problem/answer like mine.

OK, I misunderstood what you wanted to do the first time; the dummy example was a bit ambiguous.
Here is another:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
To my knowledge, this operation does not exist in pandas, so we will use numpy to do the work.
First transform the dataframe into a numpy array and flatten it to one dimension. Then drop the NaNs using pandas.isna, which works on a wider range of types than numpy.isnan, and finally reshape the array to its original number of columns before converting it back to a dataframe:
array = df.to_numpy().flatten()  # row-major, one-dimensional view of the values
# drop the NaNs, restore the column count, and rebuild the dataframe
pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]), columns=df.columns)
output:
a b
0 A E
1 B C
2 D F
It also works for more complex examples, as long as the NaN pattern is shared among the columns that contain NaNs:
In:
a b c d
0 A H A2 H2
1 B NaN B2 NaN
2 C NaN C2 NaN
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
Out:
a b c d
0 A H A2 H2
1 B B2 C C2
2 D I D2 I2
3 E E2 F F2
4 G J G2 J2
In:
a b c
0 A F H
1 B NaN NaN
2 C NaN NaN
3 D NaN NaN
4 E G I
Out:
a b c
0 A F H
1 B C D
2 E G I
In case the NaN columns do not share the same pattern, such as:
a b c d
0 A H A2 NaN
1 B NaN B2 NaN
2 C NaN C2 H2
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
You can apply the operation per group of two columns:
def elementwise_shift(df):
    array = df.to_numpy().flatten()
    return pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]), columns=df.columns)

(df.groupby(np.repeat(np.arange(df.shape[1] // 2), 2), axis=1)
   .apply(elementwise_shift)
)
output:
a b c d
0 A H A2 B2
1 B C C2 H2
2 D I D2 I2
3 E F E2 F2
4 G J G2 J2
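Note that groupby(..., axis=1) is deprecated in recent pandas versions. A minimal sketch of the same per-pair application without it, assuming the columns are paired left to right as above:

import pandas as pd

# Slice the frame into consecutive two-column blocks, apply the helper
# defined above to each block, then reassemble the results side by side.
pairs = [df.iloc[:, i:i + 2] for i in range(0, df.shape[1], 2)]
result = pd.concat([elementwise_shift(p) for p in pairs], axis=1)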

You can do this in two steps with a placeholder column. First, fill all the NaNs in column b with the a values from the next row. Then apply the filtering. In this example I use ffill with a limit of 1 to drop every NaN in a run after the first; there's probably a better method.
import pandas as pd
import numpy as np
df=pd.DataFrame({"a":[1,2,3,3,4],"b":[1,2,np.nan,np.nan,4]})
# Fill all nans:
df['new_b'] = df['b'].fillna(df['a'].shift(-1))
df = df[df['b'].ffill(limit=1).notna()].copy()  # .copy() to avoid SettingWithCopyWarning on later assignment
df = df.drop('b', axis=1).rename(columns={'new_b': 'b'})
print(df)
# output:
# a b
# 0 1 1
# 1 2 2
# 2 3 3
# 4 4 4
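An equivalent filter, as a sketch on the same toy data: keep a row when its own b or the previous row's b is non-null. It is a drop-in replacement for the ffill line above and likewise keeps only the first NaN of each run:
df = df[df['b'].notna() | df['b'].shift().notna()].copy()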

Related

Comparing 2 columns group by group in pandas or python

I currently have a dataset where I am unsure how to check whether the groups have similar values. Here is a sample of my dataset:
type value
a 1
a 2
a 3
a 4
b 2
b 3
b 4
b 5
c 1
c 3
c 4
d 2
d 3
d 4
I want to know which types are similar, in the sense that all the values in one type are also present in another type. So for example type d has values 2, 3, 4 and type a has values 1, 2, 3, 4, so this is 'similar', or can be considered the same, and I would like output that tells me d is similar to a.
Expected output should be like this
type value similarity
a 1 A is similar to B and D
a 2
a 3
a 4
b 2 b is similar to a and d
b 3
b 4
b 5
c 1 c is similar to a
c 3
c 4
d 2 d is similar to a and b
d 3
d 4
Not sure if this can be done in Python or pandas, but guidance is really appreciated as I'm really lost and not sure where to begin.
The output also does not have to be what I just put as an example here; it can just be another CSV that tells me which types are similar.
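For reference, a minimal reconstruction of the sample data used by the answers below (values taken from the table above):

import pandas as pd

df = pd.DataFrame({
    'type': ['a']*4 + ['b']*4 + ['c']*3 + ['d']*3,
    'value': [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})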
I would use set operations.
assuming similarity means at least N items in common:
from itertools import combinations

# define minimum number of common items
N = 3

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# generate all combinations of sets
# and check if the intersection has at least N items
out = pd.Series([len(a & b) >= N for a, b in combinations(s, 2)],
                index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
# concat and add the reversed combinations (a/b -> b/a);
# we could have used a product in the first part, but this
# would have required performing the computations twice
similarity = (
    pd.concat([out, out.swaplevel()])
      .loc[lambda x: x].reset_index(-1)
      .groupby(level=0)['level_1']
      .apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)
Output:
type value similarity
0 a 1 a is similar to b, c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d, a
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
assuming similarity means one set being a subset of the other:
from itertools import combinations

s = df.groupby('type')['value'].agg(set)
out = pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
                index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
similarity = (
    pd.concat([out, out.swaplevel()])
      .loc[lambda x: x].reset_index(-1)
      .groupby(level=0)['level_1']
      .apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)
Output:
type value similarity
0 a 1 a is similar to c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
You can use:
# Group all rows and transform as set
df1 = df.groupby('type', as_index=False)['value'].agg(set)

# Get all combinations
df1 = df1.merge(df1, how='cross').query('type_x != type_y')

# Compute the intersection between sets
df1['similarity'] = [row.value_x.intersection(row.value_y)
                     for row in df1[['value_x', 'value_y']].itertuples()]

# Keep rows with at least 3 similarities then export report
sim = (df1.loc[df1['similarity'].str.len() >= 3]
          .groupby('type_x')['type_y']
          .agg(', '.join).rename('similarity').rename_axis(index='type')
          .reset_index())
Output:
>>> sim
type similarity
0 a b, c, d
1 b a, d
2 c a
3 d a, b

Pandas: Same indices for each column. Is there a better way to solve this?

Sorry for the lousy title; I can't come up with a concise way to ask this question.
I have a dataframe (variable df) such as the below:
df
ID   A    B    C
1    m    nan  nan
2    n    nan  nan
3    b    nan  nan
1    nan  t    nan
2    nan  e    nan
3    nan  r    nan
1    nan  nan  y
2    nan  nan  u
3    nan  nan  i
The desired output is:
ID   A  B  C
1    m  t  y
2    n  e  u
3    b  r  i
I solved this by running the following lines:
new_df = pd.DataFrame()
for column in df.columns:
    new_df = pd.concat([new_df, df[column].dropna()], join='outer', axis=1)
And then I figured this would be faster:
empty_dict = {}
for column in df.columns:
    empty_dict[column] = df[column].dropna()
new_df = pd.DataFrame.from_dict(empty_dict)
However, the dropna could be a problem if, for example, there is a missing value in one of the rows that supply the values for a column. E.g. if df.loc[2,'A'] = nan, then that key in the dictionary will only have 2 values, causing a misalignment with the rest of the columns. So I'm not convinced.
I have the feeling pandas must have a builtin function that will do a better job than either of my two solutions. Is there? If not, is there any better way of solving this?
Looks like you only need groupby().first():
df.groupby('ID', as_index=False).first()
Output:
ID A B C
0 1 m t y
1 2 n e u
2 3 b r i
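Note that GroupBy.first() takes the first non-null value of each column within each group, which is why the NaNs drop out here.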
Use stack()/unstack() as suggested by @QuangHoang, if ID is the index:
>>> df.stack().unstack()
A B C
ID
1 m t y
2 n e u
3 b r i
You can use melt and pivot (keyword arguments are required for pivot in pandas >= 2.0):
>>> df.melt('ID').dropna().pivot(index='ID', columns='variable', values='value') \
...     .rename_axis(columns=None).reset_index()
ID A B C
0 1 m t y
1 2 n e u
2 3 b r i

Moving rows from one column to another along with respective values in pandas DataFrame

This is the dataframe I have with three rows and three columns.
a d aa
b e bb
c f cc
What I want is to remove the second column and append its values to the first column, along with their respective values from the third column.
This is the expected result:
a aa
b bb
c cc
d aa
e bb
f cc
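The answers below select columns by position, so for reference here is a minimal reconstruction of this frame, assuming it has the default positional column labels 0, 1, 2:

import pandas as pd

df = pd.DataFrame([['a', 'd', 'aa'],
                   ['b', 'e', 'bb'],
                   ['c', 'f', 'cc']])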
First, concatenate the two column pairs:
df1 = pd.concat([df[df.columns[[0,2]]], df[df.columns[[1,2]]]])
Then what you obtain is:
0 1 2
0 a NaN aa
1 b NaN bb
2 c NaN cc
0 NaN d aa
1 NaN e bb
2 NaN f cc
Now, just replace the NaN values in [0] with the corresponding values from [1].
df1[0] = df1[0].fillna(df1[1])
Output:
0 1 2
0 a NaN aa
1 b NaN bb
2 c NaN cc
0 d d aa
1 e e bb
2 f f cc
Now you only need columns [0] and [2]:
df1[[0,2]]
Final Output:
0 2
0 a aa
1 b bb
2 c cc
0 d aa
1 e bb
2 f cc
Here are 4 steps: split into 2 dataframes; make the column names the same; concatenate; reindex.
import pandas as pd

df = pd.DataFrame({'col1': ['a','b','c'], 'col2': ['c','d','e'], 'col3': ['aa','bb','cc']})
df2 = df[['col1','col3']]         # split into 2 dataframes
df3 = df[['col2','col3']]
df3.columns = df2.columns         # make column names the same
df_final = pd.concat([df2, df3])  # concatenate (DataFrame.append was removed in pandas 2.0)
df_final.index = range(len(df_final.index))  # reindex
print(df_final)
pd.concat([df[df.columns[[0, 2]]], df[df.columns[[1, 2]]]])
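As written, that one-liner keeps the two pairs in different columns and so produces NaNs. A sketch that aligns the column labels first so the values actually stack (the labels 'key' and 'val' are placeholders, not from the original):

import pandas as pd

# Give both two-column slices the same labels, then stack them vertically.
out = pd.concat([
    df[df.columns[[0, 2]]].set_axis(['key', 'val'], axis=1),
    df[df.columns[[1, 2]]].set_axis(['key', 'val'], axis=1),
], ignore_index=True)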

Pandas merge on multiple columns ignoring NaN

I am trying to do the same as this answer, but with the difference that I'd like to ignore NaN in some cases. For instance:
#df1
c1 c2 c3
0 a b 1
1 a c 2
2 a nan 1
3 b nan 3
4 c d 1
5 d e 3
#df2
c1 c2 c4
0 a nan 1
1 a c 2
2 a x 1
3 b nan 3
4 z y 2
#merged output based on [c1, c2], dropping instances
#with `NaN` unless both dataframes have `NaN`.
c1 c2 c3 c4
0 a b 1 1 #c1,c2 from df1 because df2 has a nan in c2
1 a c 2 2 #in both
2 a x 1 1 #c1,c2 from df2 because df1 has a nan in c2
3 b nan 3 3 #c1,c2 as found in both
4 c d 1 nan #from df1
5 d e 3 nan #from df1
6 z y nan 2 #from df2
NaNs may come from either c1 or c2, but for this example I kept it simpler.
I'm not sure what the cleanest way to do this is. I was thinking of merging based on [c1, c2] and then looping over the rows with NaN, but this would not be so direct. Do you see a better way to do it?
Edit - clarifying conditions
1. No duplicates are found anywhere.
2. No combination is performed between two rows if they both have values. c1 may not be combined with c2, so order must be respected.
3. For the cases where one of the 2 dfs has a nan in either c1 or c2, find the rows in the other dataframe that don't have a full match on both c1+c2, and use it. For instance:
(a,c) has a match in both so it is no longer discussed.
(a,b) is only in df1. No b is found in df2.c2. The only row in df2 with a known key and a NaN is row 0, so it is combined with this one. Note that order must be respected; this is why (a,b) from df1 cannot be combined with any other row of df2 that also contains a b.
(a,x) is only in df2. No x is found in df1.c2. The only row in df1 with one of the known keys and a NaN is the row with index 2.
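As a starting point, a minimal sketch of the first step described above: an exact outer merge on both keys. The NaN-wildcard matching from the conditions would still need a second pass over the unmatched rows.

import pandas as pd

# Exact matches on both keys; indicator marks which side each row came from.
merged = df1.merge(df2, on=['c1', 'c2'], how='outer', indicator=True)
# Rows with _merge != 'both' are the candidates for the NaN-wildcard pass.
candidates = merged[merged['_merge'] != 'both']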

pandas groupby transpose str column

here is what I am trying to do:
>>> import pandas as pd
>>> dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 3 + [3], 'b': 'a a b c d e f'.split()})
a b
0 1 a
1 1 a
2 1 b
3 2 c
4 2 d
5 2 e
6 3 f
how to transpose column 'b' grouped by column 'a', so that output looks like:
a b0 b1 b2
0 1 a a b
3 2 c d e
6 3 f NaN NaN
Using pivot_table with cumcount:
(df.assign(flag=df.groupby('a').b.cumcount())
   .pivot_table(index='a', columns='flag', values='b', aggfunc='first')
   .add_prefix('B'))
flag B0 B1 B2
a
1 a a b
2 c d e
3 f NaN NaN
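Here cumcount() numbers the rows within each group (0, 1, 2, ...), and that within-group position becomes the flag column label after the pivot.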
You can also group by the column, flatten the values associated with each group, and reframe the result as a dataframe:
df = df.groupby(['a'])['b'].apply(lambda x: x.values.flatten())
pd.DataFrame(df.values.tolist(),index=df.index).add_prefix('B')
Out:
B0 B1 B2
a
1 a a b
2 c d e
3 f None None
You could probably try something like this:
>>> dftemp = pd.DataFrame({'a': [1] * 3 + [2] * 2 + [3]*1, 'b': 'a a b c d e'.split()})
>>> dftemp
a b
0 1 a
1 1 a
2 1 b
3 2 c
4 2 d
5 3 e
>>> dftemp.groupby('a')['b'].apply(lambda df: df.reset_index(drop=True)).unstack()
0 1 2
a
1 a a b
2 c d None
3 e None None
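The reset_index(drop=True) inside the apply renumbers each group from 0, so unstack() can spread those within-group positions into columns.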
Given the ordering of your DataFrame, you could find where the group changes and use np.split to create a new DataFrame.
import numpy as np
import pandas as pd
splits = dftemp[(dftemp.a != dftemp.a.shift())].index.values  # indices where 'a' changes
df = pd.DataFrame(np.split(dftemp.b.values, splits[1:])).add_prefix('b').fillna(np.nan)
df['a'] = dftemp.loc[splits, 'a'].values
Output
b0 b1 b2 a
0 a a b 1
1 c d e 2
2 f NaN NaN 3
