Get first version of a line with duplicate values versus one column - python

Hello I'm looking for a way to get from this dataframe df::
df = pd.DataFrame(dict(X=list('abbcccddef'),
Y=list('ABCDEFGHIJ'),
Z=list('1234123412')))
df
# X Y Z
# 0 a A 1
# 1 b B 2
# 2 b C 3
# 3 c D 4
# 4 c E 1
# 5 c F 2
# 6 d G 3
# 7 d H 4
# 8 e I 1
# 9 f J 2
Only the first lines for each X value, so this one::
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2
I'm looking for a more elegant way than this::
x_unique = df.X.unique()
x_unique
# array(['a', 'b', 'c', 'd', 'e', 'f'], dtype=object)
res = df[df.X == x_unique[0]].iloc[0]
for u in x_unique[1:]:
res = pd.concat([res, df[df.X==u].iloc[0]], axis=1)
res
# 0 1 3 6 8 9
# X a b c d e f
# Y A B D G I J
# Z 1 2 4 3 1 2
res = res.transpose()
res
# X Y Z
# 0 a A 1
# 1 b B 2
# 3 c D 4
# 6 d G 3
# 8 e I 1
# 9 f J 2

You could use drop_duplicates() method on X
In [60]: df.drop_duplicates('X')
Out[60]:
X Y Z
0 a A 1
1 b B 2
3 c D 4
6 d G 3
8 e I 1
9 f J 2

You can also do:
In [3]: import pandas as pd
In [4]: df = pd.DataFrame(dict(X=list('abbcccddef'),
Y=list('ABCDEFGHIJ'),
Z=list('1234123412')))
In [5]: df.groupby('X').first()
Out[5]:
Y Z
X
a A 1
b B 2
c D 4
d G 3
e I 1
f J 2

Related

pandas compare first rows and make identical

I have two dfs.
df1 = pd.DataFrame(["bazzar","dogsss","zxvfzx","anythi"], columns = [0], index = [0,1,2,3])
df2 = pd.DataFrame(["baar","maar","cats","$%&*"], columns = [0], index = [0,1,2,3])
df1 = df1[0].apply(lambda x: pd.Series(list(x)))
df2 = df2[0].apply(lambda x: pd.Series(list(x)))
which look like
df1
0 1 2 3 4 5
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
df2
0 1 2 3
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
I want to compare their first rows and make them identical by inserting new columns containing the character z to df2, so that df2 becomes
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
An additional example:
df3 = pd.DataFrame(["aazzbbzcc","bbbbbbbbb","ccccccccc","ddddddddd"], columns = [0], index = [0,1,2,3])
df4 = pd.DataFrame(["aabbcc","111111","222222","333333"], columns = [0], index = [0,1,2,3])
df3 = df3[0].apply(lambda x: pd.Series(list(x)))
df4 = df4[0].apply(lambda x: pd.Series(list(x)))
df3
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 b b b b b b b b b
2 c c c c c c c c c
3 d d d d d d d d d
df4
0 1 2 3 4 5
0 a a b b c c
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
You can see, an important relationship between the first rows of the two dataframes: they will eventually become the same when character z are added to the later dataframe (i.e. df2 and df4), so that the expected output for this example is:
0 1 2 3 4 5 6 7 8
0 a a z z b b z c c
1 1 1 z z 1 1 z 1 1
2 2 2 z z 2 2 z 2 2
3 3 3 z z 3 3 z 3 3
Any idea how to do that?
Because in first rows are duplicated values are create MultiIndex with first rows and GroupBy.cumcount for both DataFrames:
a = df1.iloc[[0]].T
df1.columns = [a[0], a.groupby(a[0]).cumcount()]
b = df2.iloc[[0]].T
df2.columns = [b[0], b.groupby(b[0]).cumcount()]
print (df1)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 d o g s s s
2 z x v f z x
3 a n y t h i
print (df2)
0 b a r
0 0 1 0
0 b a a r
1 m a a r
2 c a t s
3 $ % & *
And then is used DataFrame.reindex with replace missing values by first row of df1:
df = df2.reindex(df1.columns, axis=1).fillna(df1.iloc[0])
print (df)
0 b a z a r
0 0 0 1 1 0
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Last set range to columns:
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5
0 b a z z a r
1 m a z z a r
2 c a z z t s
3 $ % z z & *
Check where to add:
list(difflib.ndiff(df2[0][0], df1[0][0]))
[' b', ' a', '+ z', '+ z', ' a', ' r']
Add manually
df2[0].str.replace('(.){2}', '\\1zz', regex = True).str.split('(?<=\\S)(?=\\S)', expand = True)
Out[1557]:
0 1 2 3 4 5
0 a z z r z z
1 a z z r z z
2 a z z s z z
3 % z z * z z

Pandas: How to replace values to np.nan based on Condition for multiple columns

Here is my dataframe.
I A B C D E F
1 9 4 0 T F F
2 0 5 1 S X J
3 1 8 0 G G J
Here is my expected output.
I want to replace if value in A ==0, repalce to np.nan in D.
I A B C D E F
1 9 4 0 T F nan
2 0 5 1 nan X J
3 1 8 0 G G nan
I want to replace values on D/E/F columns by values of A/B/C columns.
For example, D column changes according to the A column. (A->D, B->E, C->F)
I tried this code but didn't change value.
list1 = ['A', 'B', 'C']
list2 = ['D', 'E', 'F']
for i in list2:
for j in list1:
df[i] = np.where(df[j] == 0, np.nan, df[i])
For below code, working well.
But there are lots of columns, so I want to using list and for sentence.
df['D'] = np.where(df.A == 0, np.nan, df.D)
First we create a dictionary from your two lists using zip
replace_dict = dict(zip(list1,list2))
then we loop over it to handle your assignments,
for k,v in replace_dict.items():
df.loc[df[k] == 0, v] = np.nan
print(df)
I A B C D E F
0 1 9 4 0 T F NaN
1 2 0 5 1 NaN X J
2 3 1 8 0 G G NaN
another method is to use np.where with your lists.
df[list2] = np.where(df[list1].eq(0), np.nan,df[list2])
print(df)
I A B C D E F
0 1 9 4 0 T F NaN
1 2 0 5 1 NaN X J
2 3 1 8 0 G G NaN
Let us do
df.loc[:,'D':].mask(df.loc[:,'A':'C'].eq(0).values)
D E F
0 T F NaN
1 NaN X J
2 G G NaN
df.loc[:,'D':]= df.loc[:,'D':].mask(df.loc[:,'A':'C'].eq(0).values)
Use DataFrame.mask with DataFrame.rename as:
df[list2] = df[list2].mask(df[list1].rename(columns=dict(zip(list1, list2))).eq(0))
print(df)
I A B C D E F
0 1 9 4 0 T F NaN
1 2 0 5 1 NaN X J
2 3 1 8 0 G G NaN

Separate pandas df by repeating range in a column

Problem:
I'm trying to split a pandas data frame by the repeating ranges in column A. My data and output are as follows. The ranges in columns A are always increasing and do not skip values. The values in column A do start and stop arbitrarily, however.
Data:
import pandas as pd
dict = {"A": [1,2,3,2,3,4,3,4,5,6],
"B": ["a","b","c","d","e","f","g","h","i","k"]}
df = pd.DataFrame(dict)
df
A B
0 1 a
1 2 b
2 3 c
3 2 d
4 3 e
5 4 f
6 3 g
7 4 h
8 5 i
9 6 k
Desired ouptut:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 2 d
1 3 e
2 4 f
df3
A B
0 3 g
1 4 h
2 5 i
3 6 k
Thanks for advice!
Answer times:
from timeit import default_timer as timer
start = timer()
for x ,y in df.groupby(df.A.diff().ne(1).cumsum()):
print(y)
end = timer()
aa = end - start
start = timer()
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _,g_ in g:
print(g_)
end = timer()
bb = end - start
start = timer()
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
end = timer()
cc = end - start
print(aa,bb,cc)
0.0176649530000077 0.018132143000002543 0.018715283999995336
Create the groupby key by using diff and cumsum
for x ,y in df.groupby(df.A.diff().ne(1).cumsum()):
print(y)
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
Just groupby by the difference
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _,g_ in g:
print(g_)
Outputs
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
One-liner
because that's important
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
Print it
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
Assign it
df1, df2, df3 = (d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))

How to reshape a multi-column dataframe by index?

Following from here . The solution works for only one column. How to improve the solution for multiple columns. i.e If I have a dataframe like
df= pd.DataFrame([['a','b'],['b','c'],['c','z'],['d','b']],index=[0,0,1,1])
0 1
0 a b
0 b c
1 c z
1 d b
How to reshape them like
0 1 2 3
0 a b b c
1 c z d b
If df is
0 1
0 a b
1 c z
1 d b
Then
0 1 2 3
0 a b NaN NaN
1 c z d b
Use flatten/ravel
In [4401]: df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
Out[4401]:
0 1 2 3
0 a b b c
1 c z d b
Or, stack
In [4413]: df.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
Out[4413]:
0 1 2 3
0 a b b c
1 c z d b
Also, with unequal indices
In [4435]: df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
Out[4435]:
0 1 2 3
0 a b NaN NaN
1 c z d b
Use groupby + pd.Series + np.reshape:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.reshape(-1, )))
0 1 2 3
0 a b b c
1 c z d b
Solution for unequal number of indices - call the pd.DataFrame constructor instead.
df
0 1
0 a b
1 c z
1 d b
df.groupby(level=0).apply(lambda x: \
pd.DataFrame(x.values.reshape(1, -1))).reset_index(drop=True)
0 1 2 3
0 a b NaN NaN
1 c z d b
pd.DataFrame({n: g.values.ravel() for n, g in df.groupby(level=0)}).T
0 1 2 3
0 a b b c
1 c z d b
This is all over the place and I'm too tired to make it pretty
v = df.values
cc = df.groupby(level=0).cumcount().values
i0, r = pd.factorize(df.index.values)
n, m = v.shape
j0 = np.tile(np.arange(m), n)
j = np.arange(r.size * m).reshape(-1, m)[cc].ravel()
i = i0.repeat(m)
e = np.empty((r.size, m * r.size), dtype=object)
e[i, j] = v.ravel()
pd.DataFrame(e, r)
0 1 2 3
0 a b None None
1 c z d b
Let's try
df1 = df.set_index(df.groupby(level=0).cumcount(), append=True).unstack()
df1.set_axis(labels=pd.np.arange(len(df1.columns)), axis=1)
Output:
0 1 2 3
0 a b b c
1 c d z b
Output for df with NaN:
0 1 2 3
0 a None b None
1 c d z b

Replace values in a column, with certain values from another, ignoring any 'nan' entries

I have the following pandas dataframe:
A B C D
2 a 1 F
4 b 2 G
6 b 3 nan
1 c 4 G
5 c 5 nan
7 d 6 H
I want to replace any values in column B, with the values in column D, while not doing anything for the 'nan' entries in column D.
Desired output:
A B C D
2 F 1 F
4 G 2 G
6 b 3 nan
1 G 4 G
5 c 5 nan
7 H 6 H
You can mask the rows of interest using a boolean mask and pass this to loc so only those rows are overwritten:
In [3]:
df.loc[df['D'].notnull(), 'B'] = df['D']
df
Out[3]:
A B C D
0 2 F 1 F
1 4 G 2 G
2 6 b 3 NaN
3 1 G 4 G
4 5 c 5 NaN
5 7 H 6 H
See the docs on boolean indexing and notnull
Few alternative solutions:
In [72]: df['B'] = df['D'].combine_first(df['B'])
In [73]: df
Out[73]:
A B C D
0 2 F 1 F
1 4 G 2 G
2 6 b 3 NaN
3 1 G 4 G
4 5 c 5 NaN
5 7 H 6 H
or:
df['B'] = df['D'].fillna(df['B'])
or:
df['B'] = df['D'].mask(df['D'].isnull(), df['B'])
or:
df['B'] = df['D'].where(df['D'].notnull(), df['B'])
or:
df['B'] = np.where(df['D'].notnull(), df['D'], df['B'])

Categories

Resources