I have a dataframe with two integer columns like:
a b
0 5 7
1 3 5
2 7 1
I need an additional column containing the index of the row where column a equals the current row's value in column b. For example, b=7 is matched by a=7 at index 2, b=5 is matched by a=5 at index 0, and b=1 has no match. Desired output:
a b c
0 5 7 2
1 3 5 0
2 7 1 NaN
There will never be multiple rows where the condition is fulfilled.
Option with searchsorted:
# sort column a and find candidate position of values in b in a
df.sort_values('a', inplace=True)
pos = df.a.searchsorted(df.b)
# handle edge case when the pos is out of bound
pos[pos == len(pos)] = len(pos) - 1
# assign the index to column as c and mark values that don't match as nan
df['c'] = df.index[pos]
df.loc[df.a.loc[df.c].values != df.b.values, 'c'] = np.nan
df.sort_index()
# a b c
#0 5 7 2.0
#1 3 5 0.0
#2 7 1 NaN
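An alternative that avoids sorting, offered only as a sketch: since the question states that values in a never match more than once, a can serve as the index of a lookup Series whose values are the row labels, and b can be mapped through it. The name index_of_a is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 3, 7], 'b': [7, 5, 1]})

# Lookup table: value in column a -> index label of that row.
# This relies on the values in a being unique, as stated in the question.
index_of_a = pd.Series(df.index, index=df['a'])

# Map every value of b through the lookup; values with no match become NaN.
df['c'] = df['b'].map(index_of_a)
print(df)
```

Because unmatched values become NaN, column c ends up as a float column, just as in the searchsorted version.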
I need to sum the values of column 'D' for every row that shares the same combination of values in columns 'A', 'B' and 'C'. Eventually I need to create a DataFrame with the unique combinations of values from columns 'A', 'B' and 'C' and the corresponding sum in column 'D'.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df
Out:
A B C D
0 0 2 0 2
1 0 1 2 1
2 0 0 2 0
3 1 2 2 2
4 0 2 2 2
5 0 2 2 2
6 2 2 2 1
7 2 1 1 1
8 1 0 2 0
9 1 2 0 0
I've tried to create a temporary data frame with empty cells:
D = pd.DataFrame([i for i in range(len(df))]).rename(columns = {0:'D'})
D['D'] = ''
D
Out:
D
0
1
2
3
4
5
6
7
8
9
And use apply() to sum all 'D' column values for each unique row made up of columns 'A', 'B' and 'C'. For example, the line below returns the sum of values from the 'D' column for 'A'=0, 'B'=2, 'C'=2:
df[(df['A']==0) & (df['B']==2) & (df['C']==2)]['D'].sum()
Out:
4
The function:
def Sumup(cols):
    A = cols[0]
    B = cols[1]
    C = cols[2]
    D = cols[3]
    total = df[(df['A']==A) & (df['B']==B) & (df['C']==C)]['D'].sum()
    return total
I applied it to df and saved the result in the temporary dataframe as D['D']:
D['D'] = df[['A','B','C','D']].apply(Sumup)
Later I wanted to use drop_duplicates, but I get a dataframe consisting of NaNs.
D
Out:
D
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
Could anyone give me a hint on how to deal with the NaN problem, or suggest another approach to solve the original problem?
df.groupby(['A','B','C']).sum()
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df.groupby(["A", "B", "C"])["D"].sum()
sample and expected data
Block one is the current data and block two is the expected data. That is, once I encounter a 1, each following row should be incremented by one, and the same should happen again for the next country, b.
First replace all values after the first 1 with 1, so that it is possible to use GroupBy.cumsum:
df = pd.DataFrame({'c':['a']*3 + ['b']*3+ ['c']*3, 'v':[1,0,0,0,1,0,0,0,1]})
s = df.groupby('c')['v'].cumsum()
df['new'] = s.where(s.eq(0), 1).groupby(df['c']).cumsum()
print (df)
c v new
0 a 1 1
1 a 0 2
2 a 0 3
3 b 0 0
4 b 1 1
5 b 0 2
6 c 0 0
7 c 0 0
8 c 1 1
Another solution is to replace all values that are not 1 with missing values and forward fill the 1 per group; the leading missing values are then replaced with 0, so the cumulative sum also works:
s = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
df['new'] = s.groupby(df['c']).cumsum()
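The replacement step in both variants only matters when a group contains more than one 1. As a quick check under that assumption, the sketch below runs both variants on a made-up frame (not the asker's data) in which group a holds two 1s; the names first and second are illustrative.

```python
import pandas as pd

# Hypothetical data: group 'a' contains two 1s, group 'b' a single trailing 1.
df = pd.DataFrame({'c': ['a'] * 4 + ['b'] * 3,
                   'v': [0, 1, 0, 1, 0, 0, 1]})

# Variant 1: cumulative sum, turn everything from the first 1 onward into 1, sum again.
s = df.groupby('c')['v'].cumsum()
first = s.where(s.eq(0), 1).groupby(df['c']).cumsum()

# Variant 2: keep only the 1s, forward fill per group, fill the leading gaps with 0, sum.
t = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
second = t.groupby(df['c']).cumsum()

print(first.tolist())
print(second.tolist())
```

Both variants keep counting past the second 1 in group a instead of jumping, and they produce identical results.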
In the following dataset, what's the best way to duplicate rows so that every group with groupby(['Type']) count < 3 is padded to 3 rows? df is the input and df1 is my desired outcome. You can see that row 3 from df was duplicated twice at the end. This is only an example; the real data has approximately 20 million rows and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I do not know the best way to write func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append is available in pandas>=0.23.0; remove it if you are using an older version. DataFrame.append itself was removed in pandas 2.0, where pd.concat is the replacement.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
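For pandas versions without DataFrame.append, the same idea can be sketched with pd.concat. The helper pad_groups and its parameters are hypothetical names introduced here. It also makes one assumption the question does not settle: a short group is padded by repeating only its last row, so a group with two rows ends up with exactly three.

```python
import numpy as np
import pandas as pd

def pad_groups(df, key, min_rows=3):
    """Append copies of the last row of every group that has fewer than min_rows rows."""
    sizes = df[key].map(df[key].value_counts())      # group size, per row
    is_last = ~df.duplicated(key, keep='last')       # True on the last row of each group
    # Number of extra copies: only the last row of a short group gets a nonzero count.
    missing = (min_rows - sizes).clip(lower=0).where(is_last, 0)
    positions = np.repeat(np.arange(len(df)), missing.to_numpy())
    return pd.concat([df, df.iloc[positions]], ignore_index=True)

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})
df1 = pad_groups(df, 'Type')
print(df1)
```

Because it works with all columns of the selected rows, this version needs no special handling when there are several value columns.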
I want to create a new column in my dataframe that holds the name of the column containing 8 when exactly one column in that row equals 8; otherwise the new column's value for the row should be "NONE". For the dataframe df below, the new column would be df["New_Column"] = ["NONE","NONE","A","NONE"].
df = pd.DataFrame({"A": [1, 2,8,3], "B": [0, 2,4,8], "C": [0, 0,7,8]})
Cool problem.
Find the 8-fields in each row: df==8
Count them: (df==8).sum(axis=1)
Find the rows where the count is 1: (df==8).sum(axis=1)==1
Select just those rows from the original dataframe: df[(df==8).sum(axis=1)==1]
Find the 8-fields again: df[(df==8).sum(axis=1)==1]==8
Find the columns that hold the True values with idxmax (because True>False): (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
Fill in the gaps with "NONE"
To summarize:
df["New_Column"] = (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
df["New_Column"] = df["New_Column"].fillna("NONE")
# A B C New_Column
#0 1 0 0 NONE
#1 2 2 0 NONE
#2 8 4 7 A
#3 3 8 8 NONE
# I added another line as a proof of concept
#4 0 8 0 B
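The steps above can also be collapsed into a single pass with Series.where, which fills in "NONE" directly instead of relying on index alignment plus fillna. This is a sketch on the question's four-row frame; is8 and exactly_one are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 8, 3], "B": [0, 2, 4, 8], "C": [0, 0, 7, 8]})

is8 = df.eq(8)                          # boolean frame marking the 8-fields
exactly_one = is8.sum(axis=1).eq(1)     # rows that contain exactly one 8

# idxmax gives the first column holding True; rows without exactly one 8 become "NONE".
df["New_Column"] = is8.idxmax(axis=1).where(exactly_one, "NONE")
print(df)
```

Note that is8 is computed before New_Column is added, so the new column never takes part in the comparison.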
You can accomplish this using idxmax and a mask:
out = (df==8).idxmax(axis=1)
m = ~(df==8).any(axis=1) | ((df==8).sum(axis=1) > 1)
df.assign(col=out.mask(m))
A B C col
0 1 0 0 NaN
1 2 2 0 NaN
2 8 4 7 A
3 3 8 8 NaN
Or do:
df2=df[(df==8)]
df['New_Column']=(df2[(df2!=df2.dropna(thresh=2).values[0]).all(1)].dropna(how='all')).idxmax(1)
df['New_Column'] = df['New_Column'].fillna('NONE')
print(df)
dropna + dropna again + idxmax + fillna: that's all you need for this.
Output:
A B C New_Column
0 1 0 0 NONE
1 2 2 0 NONE
2 8 4 7 A
3 3 8 8 NONE
I am trying to use information about a row to inform which other data throughout a DataFrame to look at.
I have a DataFrame like this:
df = pd.DataFrame({'a':[1,5,9],'b':[2,6,3],'c':[0,7,1]})
a b c
0 1 2 0
1 5 6 7
2 9 3 1
I would like to ask something like:
What is the value at the next index location, in the same column, for the highest value in each row?
The result might look something like this:
a b c data
0 1 2 0 6
1 5 6 7 1
2 9 3 1 NaN
The largest number at index 0 is 2, and 6 is found in the same column at the next index location.
The largest number at index 1 is 7, and 1 is found in the same column at the next index location.
And there is no data after index 2 so nothing is returned.
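Use .idxmax to find the column of the maximum value for each row, and then use df.lookup to find the value in the next row within the same column. Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this requires an older version.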
import pandas as pd
# Ignore the last row
lookups = df.idxmax(axis=1)[:-1]
#0 b
#1 c
#dtype: object
df['data'] = pd.Series(df.lookup(lookups.index+1, lookups))
# a b c data
#0 1 2 0 6.0
#1 5 6 7 1.0
#2 9 3 1 NaN
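On pandas versions without DataFrame.lookup, the same result can be sketched with shift and NumPy integer indexing. This is one possible substitute rather than an official replacement; col_pos and next_rows are names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 6, 3], 'c': [0, 7, 1]})

# Positional column number of each row's maximum.
col_pos = df.columns.get_indexer(df.idxmax(axis=1))

# Shift the frame up by one row so that row i holds the values of row i + 1;
# the last row becomes all NaN.
next_rows = df.shift(-1).to_numpy()

# Pick, for every row, the shifted value in the column of that row's maximum.
df['data'] = next_rows[np.arange(len(df)), col_pos]
print(df)
```

The shift takes care of the final row, so no slicing with [:-1] or index arithmetic is needed.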