I want to create a new column in my dataframe that places the name of the column in the row if only that column has a value of 8 in the respective row, otherwise the new column's value for the row would be "NONE". For the dataframe df, the new column df["New_Column"] = ["NONE","NONE","A","NONE"]
df = pd.DataFrame({"A": [1, 2,8,3], "B": [0, 2,4,8], "C": [0, 0,7,8]})
Cool problem.
Find the 8-fields in each row: df==8
Count them: (df==8).sum(axis=1)
Find the rows where the count is 1: (df==8).sum(axis=1)==1
Select just those rows from the original dataframe: df[(df==8).sum(axis=1)==1]==8
Find the 8-fields again: df[(df==8).sum(axis=1)==1]==8)
Find the columns that hold the True values with idxmax (because True>False): (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
Fill in the gaps with "NONE"
To summarize:
df["New_Column"] = (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
df["New_Column"] = df["New_Column"].fillna("NONE")
# A B C New_Column
#0 1 0 0 NONE
#1 2 2 0 NONE
#2 8 4 7 A
#3 3 8 8 NONE
# I added another line as a proof of concept
#4 0 8 0 B
You can accomplish this using idxmax and a mask:
out = (df==8).idxmax(1)
m = ~(df==8).any(1) | ((df==8).sum(1) > 1)
df.assign(col=out.mask(m))
A B C col
0 1 0 0 NaN
1 2 2 0 NaN
2 8 4 7 A
3 3 8 8 NaN
Or do:
df2=df[(df==8)]
df['New_Column']=(df2[(df2!=df2.dropna(thresh=2).values[0]).all(1)].dropna(how='all')).idxmax(1)
df['New_Column'] = df['New_Column'].fillna('NONE')
print(df)
dropna + dropna again + idxmax + fillna. that's all you need for this.
Output:
A B C New_Column
0 1 0 0 NONE
1 2 2 0 NONE
2 8 4 7 A
3 3 8 8 NONE
Related
I have 2 dataframes
df1 = pd.DataFrame([["M","N","O"],["A","B","C"],["X","Y","Z"],[2,3,4],[1,2,3]])
0 1 2
M N O
A B C
X Y Z
2 3 4
1 2 3
df2 = pd.DataFrame([["P","Q","R","S"],["X","Z","W","Y"],[4,5,6,7],[7,8,9,3]])
0 1 2 3
P Q R S
X Z W Y
4 5 6 7
7 8 9 3
I want to read the 1st dataframe drop the rows till the row starts with X and make that row as column names, then read the 2nd dataframe again drop the rows till row starts with X then append it to the 1st dataframe. Repeat the process in loop because I have multiple such dataframes.
Expected Output:
df_out = pd.DataFrame([[2,3,4,0],[1,2,3,0],[4,7,5,6],[7,3,8,9]],columns=["X","Y","Z","W"])
X Y Z W
2 3 4 0
1 2 3 0
4 7 5 6
7 3 8 9
How to do it?
First test if value X exist in any row for all columns in shifted DataFrame for get all rows after match by DataFrame.cummax with DataFrame.any and set columns names by this row in DataFrame.set_axis, same solution use for another DataFrame, join by concat, replace missing values and for expected order add DataFrame.reindex with unon both columns names:
m1 = df1.shift().eq('X').cummax().any(axis=1)
cols1 = df1[df1.eq('X').any(axis=1)].to_numpy().tolist()
df11 = df1[m1].set_axis(cols1, axis=1)
m2 = df2.shift().eq('X').cummax().any(axis=1)
cols2 = df2[df2.eq('X').any(axis=1)].to_numpy().tolist()
df22 = df2[m2].set_axis(cols2, axis=1)
df = (pd.concat([df11, df22], ignore_index=True)
.fillna(0)
.reindex(df11.columns.union(df22.columns, sort=False), axis=1))
print (df)
X Y Z W
0 2 3 4 0
1 1 2 3 0
2 4 7 5 6
3 7 3 8 9
This works,
shift = 0
for index in df.index:
if df.iloc[index - 1, 0] == "X":
X = df.iloc[index - 1, :].values
break
shift -= 1
df = df.shift(shift).dropna()
df.columns = X
df
Output -
X
Y
Z
0
2
3
4
1
1
2
3
I have a dataframe with two integer columns like:
a b
0 5 7
1 3 5
2 7 1
I need an additional column containing the index where the value of column a equals that of column b of the current row. I. e.: b=7 is matched by a=7 at index 2, b=5 by a=5 at index 0, b=1 is not matched. Desired output:
a b c
0 5 7 2
1 3 5 0
2 7 1 NaN
There will never be multiple lines where the condition is fulfilled.
Option with searchsorted:
# sort column a and find candidate position of values in b in a
df.sort_values('a', inplace=True)
pos = df.a.searchsorted(df.b)
# handle edge case when the pos is out of bound
pos[pos == len(pos)] = len(pos) - 1
# assign the index to column as c and mark values that don't match as nan
df['c'] = df.index[pos]
df.loc[df.a.loc[df.c].values != df.b.values, 'c'] = np.nan
df.sort_index()
# a b c
#0 5 7 2.0
#1 3 5 0.0
#2 7 1 NaN
I need to sum up values of 'D' column for every row with the same combination of values from columns 'A','B' and 'C. Eventually I need to create DataFrame with unique combinations of values from
columns 'A','B' and 'C' with corresponding sum in column D.
import numpy as np
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df
OT:
A B C D
0 0 2 0 2
1 0 1 2 1
2 0 0 2 0
3 1 2 2 2
4 0 2 2 2
5 0 2 2 2
6 2 2 2 1
7 2 1 1 1
8 1 0 2 0
9 1 2 0 0
I've tried to create temporary data frame with empty cells
D = pd.DataFrame([i for i in range(len(df))]).rename(columns = {0:'D'})
D['D'] = ''
D
OT:
D
0
1
2
3
4
5
6
7
8
9
And use apply() to sum up all 'D' column values for unique row consisted of columns 'A','B' and 'C'. For example below line returns sum of values from 'D' column for 'A'=0,'B'=2,'C'=2:
df[(df['A']==0) & (df['B']==2) & (df['C']==2)]['D'].sum()
OT:
4
function:
def Sumup(cols):
A = cols[0]
B = cols[1]
C = cols[2]
D = cols[3]
sum = df[(df['A']==A) & (df['B']==B) & (df['C']==C)]['D'].sum()
return sum
apply on df and saved in temp df D['D']:
D['D'] = df[['A','B','C','D']].apply(Sumup)
Later I wanted to use drop_duplicates but I receive dataframe consisted of NaN's.
D
OT:
D
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
Anyone could give me a hint how to manage the NaN problem or what other approach can I apply to solve the original
problem?
df.groupby(['A','B','C']).sum()
import numpy as np
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df.groupby(["A", "B", "C"])["D"].sum()
In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.
EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
Say I have following dataframe (but keep in mind this could have 100+ rows and columns):
I only want to sum the values of some rows that met a condition, in this case of the rows that have a 2 for stream. For the other rows I want them to get a default value, for example 0.
This is what I tried:
cols = [col for col in dataFrame.columns if col != 'stream']
dataFrame.loc[dataFrame['stream'] == 2, cols].sum(axis=1)
But it doesn't get the result I want. What's wrong with my code?
I think you are very close, you need only add new column sum and then fillna with 0:
cols = [col for col in df1.columns if col != 'stream']
print cols
['feat', 'another_feat']
df1['sum'] = df1.loc[df1['stream'] == 2, cols ].sum(axis=1)
df1['sum'] = df1['sum'].fillna(0)
print df1
stream feat another_feat sum
a 1 8 4 0.0
b 2 5 5 10.0
c 2 7 7 14.0
d 3 3 2 0.0
If all values are int, last you can cast float to int by astype:
df1['sum'] = df1['sum'].fillna(0).astype(int)
print df1
stream feat another_feat sum
a 1 8 4 0
b 2 5 5 10
c 2 7 7 14
d 3 3 2 0
Another solution with numpy.where:
df1['sum'] = np.where(df1['stream'] == 2, df1[cols].sum(axis=1), 0)
print df1
stream feat another_feat sum
a 1 8 4 0
b 2 5 5 10
c 2 7 7 14
d 3 3 2 0