I am new to Python and pandas, coming from SAS. In SAS I can use an IF statement with "DO; END;" to update the values of several columns based on one condition.
I tried np.where(), but it updates only one column. apply(function, ...) also updates only one column, and putting an extra update statement inside the function body didn't help.
Any suggestions?
You can select which columns you want to alter, then use .apply():
df = pd.DataFrame({'a': [1,2,3],
'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df[['a','b']].apply(lambda x: x+1)
a b
0 2 5
1 3 6
2 4 7
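Note that .apply() returns a new DataFrame, so the result above is only displayed, not stored. A minimal sketch of writing it back, reusing the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6]})

# apply() returns a new object; assign it back to keep the change
df[['a', 'b']] = df[['a', 'b']].apply(lambda x: x + 1)
print(df)
```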
You could use:
for col in df:
df[col] = np.where(df[col] == your_condition, value_if, value_else)
eg:
a b
0 0 2
1 2 0
2 1 1
3 2 0
for col in df:
df[col] = np.where(df[col]==0,12, df[col])
Output:
a b
0 12 2
1 2 12
2 1 1
3 2 12
Or, if you want to apply the condition only to some columns, select them in the for loop:
for col in ['a','b']:
or do it in one step:
df[['a','b']] = np.where(df[['a','b']]==0,12, df[['a','b']])
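For the case in the question, where one condition drives updates to several columns with different values, a boolean mask with .loc is a common pattern. A minimal sketch; the column c, the condition and the replacement values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: 'a' holds the condition, 'b' and 'c' get updated
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [10, 20, 30, 40],
                   'c': [5, 6, 7, 8]})

mask = df['a'] >= 2                  # the single condition
df.loc[mask, ['b', 'c']] = [0, -1]   # b becomes 0 and c becomes -1 where mask is True
print(df)
```

Rows where the mask is False keep their original values, which roughly mirrors an IF ... THEN DO; ... END; block without an ELSE branch.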
This might be a fairly easy problem, but I can't solve it properly and didn't find an exact answer here. Say we have a pandas DataFrame as below:
df:
ID a b c d
0 1 3 4 9
1 2 8 8 3
2 1 3 10 12
3 0 1 3 0
I want to remove all the rows that contain repeating values in different columns. In other words, I am only interested in keeping rows with unique values. Referring to the above example, the desired output should be:
ID a b c d
0 1 3 4 9
2 1 3 10 12
(I didn't change the ID values on purpose to make the comparison easier). Please let me know if you have any ideas. Thanks!
You can compare the size of each row's set of values with the number of columns:
lc = len(df.columns)
df1 = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
print (df1)
a b c d
ID
0 1 3 4 9
2 1 3 10 12
Or test with Series.duplicated and Series.any:
df1 = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
Or DataFrame.nunique:
df1 = df[df.nunique(axis=1).eq(lc)]
Or:
df1 = df[[len(set(x)) == lc for x in df.to_numpy()]]
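To try the variants above side by side, here is a self-contained sketch that rebuilds the example frame from the question (with ID as the index name) and checks that three of them keep the same rows:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 0],
                   'b': [3, 8, 3, 1],
                   'c': [4, 8, 10, 3],
                   'd': [9, 3, 12, 0]})
df.index.name = 'ID'

lc = len(df.columns)

# Keep a row only if all of its values are distinct
by_set = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
by_duplicated = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
by_nunique = df[df.nunique(axis=1).eq(lc)]
print(by_nunique)
```

One difference to keep in mind: nunique skips NaN by default, so with missing values in the data the nunique variant would drop those rows.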
Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create 1 final column that would take first value that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?
Back-fill missing values along the rows, then select the first column: with [[0]] for a one-column DataFrame, or with 0 for a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
You can use Series.fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
new_col = new_col.fillna(df[col])
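Both answers can be checked against the example from the question. A small sketch, assuming Column1 is stored with object dtype so it can hold strings and numbers together:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # object dtype so the column can hold both strings and numbers
    'Column1': pd.Series(['a', np.nan, 'b', np.nan], dtype=object),
    'Column2': [1, 3, 6, np.nan],
    'Column3': [np.nan, 4, 7, 7],
})

# Variant 1: back-fill along each row, then take the first column
s = df.bfill(axis=1).iloc[:, 0]

# Variant 2: chain fillna over the remaining columns, left to right
new_col = df['Column1']
for col in df.columns[1:]:
    new_col = new_col.fillna(df[col])

print(s.tolist())
print(new_col.tolist())
```

Both variants avoid a row-wise apply; which one is faster on millions of rows depends on the number of columns and their dtypes, so it is worth timing them on the real data.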
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, I do not know them. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with column range. For example if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
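A self-contained version of the same idea, rebuilding the example frame. The point to note is that label slices with .loc include both endpoints, so "B":"D" covers B, C and D:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1], 'C': [3, 4, 5],
                   'D': [6, 9, 8], 'E': [0, 1, 4]})

# Label-based slice: both 'B' and 'D' are included
to_drop = df.loc[:, 'B':'D'].columns
df = df.drop(columns=to_drop)
print(df)
```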
I'm trying to set column 'b' of my dataframe based on its previous value from the row above. Is there any way to do this without iterating through the rows or using decorators on the apply function?
Pseudocode:
if row != 0:
curr_row['b'] = prev_row['b'] + curr_row['a']
else:
curr_row['b'] = curr_row['a']
Here's what I've tried:
df = pd.DataFrame({'a': [1,2,3,4,5],
'b': [0,0,0,0,0]})
df.b = df.apply(lambda row: row.a if row.name < 1 else (df.iloc[row.name-1].b + row.a), axis=1)
output:
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
Desired output:
a b
0 1 1
1 2 3
2 3 6
3 4 10
4 5 15
If I run the apply function a second time on the new df, one more row of b becomes correct:
a b
0 1 1
1 2 3
2 3 5
3 4 7
4 5 9
This pattern continues if I continue to re-run the apply function until the output is correct.
I'm guessing the issue has something to do with the mechanics of how the apply function works which makes it break when you use a value from the same column you are 'applying' on. That or I'm just being an idiot somehow (very plausible). Can someone explain this?
Do I have to use decorators to store the previous row or is there a cleaner way to do this?
What you need is cumsum():
df = pd.DataFrame({'a': [1,2,3,4,5],
'b': [0,0,0,0,0]})
df.assign(b=df.a.cumsum())
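As for why the apply attempt fails: apply() evaluates the lambda for every row against the existing df, and only afterwards is the result assigned to df.b. Each row therefore sees the old b of the row above, not the freshly computed one, which is why each re-run fixes one more row. A sketch, using the same toy frame, that stores the cumsum() result and checks it against an explicit running total:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [0, 0, 0, 0, 0]})

# Explicit running total, for comparison only (slow on large frames)
running, expected = 0, []
for value in df['a']:
    running += value
    expected.append(running)

# Vectorized equivalent; assign() returns a new frame, so store the result
df = df.assign(b=df['a'].cumsum())
print(df)
```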
In the following dataset, what's the best way to duplicate rows so that every group in groupby(['Type']) with fewer than 3 rows ends up with exactly 3? df is the input and df1 is my desired outcome: row 3 of df is duplicated twice at the end. This is only a small example; the real data has approximately 20 million rows and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0).astype(int)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df[['Type','Val']], extra], sort=False, ignore_index=True)
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: DataFrame.append accepts sort=False from pandas 0.23.0 onward. DataFrame.append itself was removed in pandas 2.0, where pd.concat is the replacement.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
extra = df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], sort=False, ignore_index=True)
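One caveat with the mapping approach above: it gives every row of a small group the same repeat count, so a Type with two rows would end up with four. The following sketch swaps in a different technique, repeating index labels with Index.repeat, and works for any number of value columns. It assumes that the last row of each short group is the one to duplicate (the question does not say which row to pick); Val_2 and target are hypothetical names added for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['a', 'a', 'a', 'b', 'c', 'c'],
                   'Val': [1, 2, 3, 1, 3, 2],
                   'Val_2': [10, 20, 30, 40, 50, 60]})  # hypothetical extra column

target = 3  # desired minimum number of rows per Type

size = df['Type'].map(df['Type'].value_counts())   # group size on every row
is_last = ~df.duplicated('Type', keep='last')      # True on the last row of each Type

# Extra copies are needed only on the last row of groups smaller than target
n_extra = (target - size).clip(lower=0).where(is_last, 0).astype(int)

extra = df.loc[df.index.repeat(n_extra)]
out = pd.concat([df, extra], ignore_index=True)
print(out)
```

Everything here is vectorized, with no groupby().apply(), which should matter at the 20-million-row scale mentioned in the question, though it is worth timing on the real data.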