Say I have the following dataframe (but keep in mind that it could have 100+ rows and columns):
I only want to sum the values of the rows that meet a condition, in this case the rows that have a 2 for stream. The other rows should get a default value, for example 0.
This is what I tried:
cols = [col for col in dataFrame.columns if col != 'stream']
dataFrame.loc[dataFrame['stream'] == 2, cols].sum(axis=1)
But it doesn't get the result I want. What's wrong with my code?
I think you are very close. You only need to assign the result to a new column sum and then fill the missing values with 0 using fillna:
cols = [col for col in df1.columns if col != 'stream']
print(cols)
['feat', 'another_feat']
df1['sum'] = df1.loc[df1['stream'] == 2, cols ].sum(axis=1)
df1['sum'] = df1['sum'].fillna(0)
print(df1)
stream feat another_feat sum
a 1 8 4 0.0
b 2 5 5 10.0
c 2 7 7 14.0
d 3 3 2 0.0
If all values are integers, you can finally cast the float column to int with astype:
df1['sum'] = df1['sum'].fillna(0).astype(int)
print(df1)
stream feat another_feat sum
a 1 8 4 0
b 2 5 5 10
c 2 7 7 14
d 3 3 2 0
Another solution with numpy.where:
import numpy as np
df1['sum'] = np.where(df1['stream'] == 2, df1[cols].sum(axis=1), 0)
print(df1)
stream feat another_feat sum
a 1 8 4 0
b 2 5 5 10
c 2 7 7 14
d 3 3 2 0
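For reference, here is a self-contained sketch of the numpy.where variant. The sample frame is an assumption, reconstructed from the printed output above since the question's original data is not shown:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the printed output above (an assumption).
df1 = pd.DataFrame(
    {"stream": [1, 2, 2, 3], "feat": [8, 5, 7, 3], "another_feat": [4, 5, 7, 2]},
    index=list("abcd"),
)

# Every column except the one that holds the condition.
cols = df1.columns.difference(["stream"])

# Row sums where stream == 2, the default value 0 everywhere else.
df1["sum"] = np.where(df1["stream"] == 2, df1[cols].sum(axis=1), 0)
print(df1)
```

Because np.where picks between two complete arrays, no NaN values appear and the column stays an integer column.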
I have 2 dataframes
df1 = pd.DataFrame([["M","N","O"],["A","B","C"],["X","Y","Z"],[2,3,4],[1,2,3]])
0 1 2
M N O
A B C
X Y Z
2 3 4
1 2 3
df2 = pd.DataFrame([["P","Q","R","S"],["X","Z","W","Y"],[4,5,6,7],[7,8,9,3]])
0 1 2 3
P Q R S
X Z W Y
4 5 6 7
7 8 9 3
I want to read the first dataframe, drop the rows up to the row that starts with X, and use that row as the column names. Then I want to read the second dataframe, again drop the rows up to the row that starts with X, and append it to the first dataframe. This needs to run in a loop because I have multiple such dataframes.
Expected Output:
df_out = pd.DataFrame([[2,3,4,0],[1,2,3,0],[4,7,5,6],[7,3,8,9]],columns=["X","Y","Z","W"])
X Y Z W
2 3 4 0
1 2 3 0
4 7 5 6
7 3 8 9
How to do it?
First, build a mask of all rows after the row that contains X: shift the DataFrame, compare it with X, then apply DataFrame.cummax and DataFrame.any. Use the row that contains X as the column names via DataFrame.set_axis. Apply the same steps to the other DataFrame, join the results with concat, replace the missing values, and, to get the expected column order, add DataFrame.reindex with the union of both sets of column names:
m1 = df1.shift().eq('X').cummax().any(axis=1)
cols1 = df1[df1.eq('X').any(axis=1)].to_numpy().tolist()
df11 = df1[m1].set_axis(cols1, axis=1)
m2 = df2.shift().eq('X').cummax().any(axis=1)
cols2 = df2[df2.eq('X').any(axis=1)].to_numpy().tolist()
df22 = df2[m2].set_axis(cols2, axis=1)
df = (pd.concat([df11, df22], ignore_index=True)
.fillna(0)
.reindex(df11.columns.union(df22.columns, sort=False), axis=1))
print (df)
X Y Z W
0 2 3 4 0
1 1 2 3 0
2 4 7 5 6
3 7 3 8 9
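Since the question asks for a loop over many dataframes, the same idea can be wrapped in a function. This is only a sketch: promote_header is a hypothetical helper name, and it assumes the marker X always sits in the first column of the header row.

```python
import pandas as pd

def promote_header(frame, marker="X"):
    """Drop the rows above the first row whose first cell equals `marker`
    and use that row as the column names (hypothetical helper)."""
    hits = frame.iloc[:, 0].eq(marker).to_numpy()
    if not hits.any():
        raise ValueError(f"no row starts with {marker!r}")
    pos = hits.argmax()  # position of the first matching row
    out = frame.iloc[pos + 1:].copy()
    out.columns = frame.iloc[pos].tolist()
    return out

df1 = pd.DataFrame([["M", "N", "O"], ["A", "B", "C"], ["X", "Y", "Z"], [2, 3, 4], [1, 2, 3]])
df2 = pd.DataFrame([["P", "Q", "R", "S"], ["X", "Z", "W", "Y"], [4, 5, 6, 7], [7, 8, 9, 3]])

frames = [df1, df2]  # any number of frames can go here
combined = (pd.concat([promote_header(f) for f in frames], ignore_index=True)
              .infer_objects()   # turn the object columns into numeric ones
              .fillna(0)         # columns missing from a frame become 0
              .astype(int))
print(combined)
```

concat aligns the frames on their column names and keeps the order in which the names first appear, so no separate reindex step is needed in this sketch.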
This works:
shift = 0
for index in df.index:
    if df.iloc[index - 1, 0] == "X":
        X = df.iloc[index - 1, :].values
        break
    shift -= 1
df = df.shift(shift).dropna()
df.columns = X
df
Output:
   X  Y  Z
0  2  3  4
1  1  2  3
Let's say I have a DF like this:
Mean 1  Mean 2  Stat 1  Stat 2  ID
     5      10      15      20   Z
     3       6       9      12   X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 for each ID.
Basically I would double the amount of rows for each ID, with each one being dedicated to either #1 or #2, and a new column will be added to specify which number we are looking at. Instead of Mean 1 and 2 being on the same row, they will be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean  Stat  ID  #
   5    15   Z  1
  10    20   Z  2
   3     9   X  1
   6    12   X  2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or set_index then str.split the columns then stack if order must match the OP:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
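As a self-contained sketch, the wide_to_long approach can also be made to return the rows in the question's order. The input frame is typed in from the question; sorting through an ordered categorical is my own assumption about how to restore the original ID order, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Mean 1": [5, 3], "Mean 2": [10, 6],
    "Stat 1": [15, 9], "Stat 2": [20, 12],
    "ID": ["Z", "X"],
})

# Split "Mean 1" / "Stat 2" style names into a stub and a numeric suffix.
long_df = pd.wide_to_long(
    df, stubnames=["Mean", "Stat"], i="ID", j="#", sep=" "
).reset_index()

# Keep the IDs in their original order (Z before X) instead of alphabetical order.
long_df["ID"] = pd.Categorical(long_df["ID"], categories=df["ID"].unique(), ordered=True)
long_df = long_df.sort_values(["ID", "#"]).reset_index(drop=True)
print(long_df)
```

Note that the ID column ends up with a categorical dtype here; cast it back with astype(str) if that matters downstream.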
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
.pivot(index=['idx', 'ID', '#'], columns='variable').reset_index().drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
I need to sum up the values of column 'D' for every row with the same combination of values in columns 'A', 'B' and 'C'. Eventually I need to create a DataFrame with the unique combinations of values from columns 'A', 'B' and 'C' and the corresponding sum in column 'D'.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df
OT:
A B C D
0 0 2 0 2
1 0 1 2 1
2 0 0 2 0
3 1 2 2 2
4 0 2 2 2
5 0 2 2 2
6 2 2 2 1
7 2 1 1 1
8 1 0 2 0
9 1 2 0 0
I've tried to create temporary data frame with empty cells
D = pd.DataFrame([i for i in range(len(df))]).rename(columns = {0:'D'})
D['D'] = ''
D
OT:
D
0
1
2
3
4
5
6
7
8
9
And use apply() to sum up all 'D' column values for each unique row made up of columns 'A', 'B' and 'C'. For example, the line below returns the sum of the values in column 'D' for 'A'=0, 'B'=2, 'C'=2:
df[(df['A']==0) & (df['B']==2) & (df['C']==2)]['D'].sum()
OT:
4
function:
def Sumup(cols):
    A = cols[0]
    B = cols[1]
    C = cols[2]
    D = cols[3]
    sum = df[(df['A']==A) & (df['B']==B) & (df['C']==C)]['D'].sum()
    return sum
apply on df and saved in temp df D['D']:
D['D'] = df[['A','B','C','D']].apply(Sumup)
Later I wanted to use drop_duplicates, but I get a dataframe consisting only of NaNs.
D
OT:
D
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
Could anyone give me a hint on how to handle the NaN problem, or what other approach I could use to solve the original problem?
df.groupby(['A','B','C']).sum()
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,3,size=(10,4)),columns=list('ABCD'))
df.groupby(["A", "B", "C"])["D"].sum()
I have a 21840x39 data frame. A few of my columns are numerically valued and I want to make sure they are all in the same data type (which I want to be a float).
Instead of naming all the columns out and converting them:
df[['A', 'B', 'C', '...]] = df[['A', 'B', 'C', '...]].astype(float)
Can I do a for loop that will allow me to say something like " convert to float from column 18 to column 35"
I know how to do one column: df['A'] = df['A'].astype(float)
But how can I do multiple columns? I tried with list slicing within a loop but couldn't get it right.
The first idea is to convert the selected columns by position. Python counts from 0 and the end of a slice is excluded, so for columns 18 to 35 use:
df.iloc[:, 17:35] = df.iloc[:, 17:35].astype(float)
If that does not work (because of a possible bug), use another solution:
df = df.astype(dict.fromkeys(df.columns[17:35], float))
Sample: convert the 8th to 15th columns:
import numpy as np
import pandas as pd
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
columns=list('abcdefghijklmnopqr')).astype(str)
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8 0 0 8 9 3 7 2 3 6 5
1 0 4 8 6 4 1 1 5 9 5 6 6 6 5 4 6 4 2
2 3 4 7 1 4 9 3 2 0 9 1 2 7 1 0 2 8 8
df = df.astype(dict.fromkeys(df.columns[7:15], float))
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8.0 0.0 0.0 8.0 9.0 3.0 7.0 2.0 3 6 5
1 0 4 8 6 4 1 1 5.0 9.0 5.0 6.0 6.0 6.0 5.0 4.0 6 4 2
2 3 4 7 1 4 9 3 2.0 0.0 9.0 1.0 2.0 7.0 1.0 0.0 2 8 8
Tweaked @jezrael's code, as typing in column names is (I feel) a good option.
import pandas as pd
import numpy as np
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
columns=list('abcdefghijklmnopqr')).astype(str)
print(df)
columns = list(df.columns)
#change the first and last column names below as required
df = df.astype(dict.fromkeys(
df.columns[columns.index('h'):(columns.index('o')+1)], float))
print (df)
Leaving the original answer below here but note: Never loop in pandas if vectorized alternatives exist
If I had a dataframe and wanted to change columns 'col3' to 'col5' (human readable names) to floats I could...
import pandas as pd
import re
df = pd.read_csv('dummy_data.csv')
df
columns = list(df.columns)
#change the first and last column names below as required
start_column = columns.index('col3')
end_column = columns.index('col5')
for index, col in enumerate(columns):
    if (start_column <= index) & (index <= end_column):
        df[col] = df[col].astype(float)
df
...by just changing the column names. Perhaps it's easier to work in column names and 'from this one' and 'to that one' (inclusive).
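The "from this one to that one (inclusive)" idea can also be written without a loop or index arithmetic, because label slices with .loc include both endpoints. A minimal sketch, reusing the sample frame from the earlier answer (target_cols is a name chosen for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list("abcdefghijklmnopqr")).astype(str)

# Label slices with .loc include both endpoints, so this selects 'h' through 'o'.
target_cols = df.loc[:, "h":"o"].columns

# Convert only those columns; every other column keeps its dtype.
df = df.astype(dict.fromkeys(target_cols, float))
print(df.dtypes)
```

This assumes the column labels are unique; with duplicate labels a label slice can raise or select more than intended.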
I want to create a new column in my dataframe that places the name of the column in the row if only that column has a value of 8 in the respective row, otherwise the new column's value for the row would be "NONE". For the dataframe df, the new column df["New_Column"] = ["NONE","NONE","A","NONE"]
df = pd.DataFrame({"A": [1, 2,8,3], "B": [0, 2,4,8], "C": [0, 0,7,8]})
Cool problem.
Find the 8-fields in each row: df==8
Count them: (df==8).sum(axis=1)
Find the rows where the count is 1: (df==8).sum(axis=1)==1
Select just those rows from the original dataframe: df[(df==8).sum(axis=1)==1]
Find the 8-fields again: df[(df==8).sum(axis=1)==1]==8
Find the columns that hold the True values with idxmax (because True>False): (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
Fill in the gaps with "NONE"
To summarize:
df["New_Column"] = (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
df["New_Column"] = df["New_Column"].fillna("NONE")
# A B C New_Column
#0 1 0 0 NONE
#1 2 2 0 NONE
#2 8 4 7 A
#3 3 8 8 NONE
# I added another line as a proof of concept
#4 0 8 0 B
You can accomplish this using idxmax and a mask:
out = (df==8).idxmax(axis=1)
m = ~(df==8).any(axis=1) | ((df==8).sum(axis=1) > 1)
df.assign(col=out.mask(m))
A B C col
0 1 0 0 NaN
1 2 2 0 NaN
2 8 4 7 A
3 3 8 8 NaN
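The mask-based result above leaves NaN where the question asks for "NONE". A small variation, sketched here with Series.where, fills the label in directly (is_eight and exactly_one are names chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 8, 3], "B": [0, 2, 4, 8], "C": [0, 0, 7, 8]})

is_eight = df.eq(8)                        # True where a cell equals 8
exactly_one = is_eight.sum(axis=1).eq(1)   # rows that contain a single 8

# idxmax returns the first True column per row; rows that fail the
# "exactly one 8" test are replaced with the string "NONE".
df["New_Column"] = is_eight.idxmax(axis=1).where(exactly_one, "NONE")
print(df)
```

Computing is_eight before adding the new column keeps the string column out of the comparison.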
Or do:
df2=df[(df==8)]
df['New_Column']=(df2[(df2!=df2.dropna(thresh=2).values[0]).all(1)].dropna(how='all')).idxmax(1)
df['New_Column'] = df['New_Column'].fillna('NONE')
print(df)
dropna, dropna again, idxmax and fillna: that's all you need for this.
Output:
A B C New_Column
0 1 0 0 NONE
1 2 2 0 NONE
2 8 4 7 A
3 3 8 8 NONE