Pandas DataFrame replace negative values with latest preceding positive value - python

Consider a DataFrame such as
df = pd.DataFrame({'a': [1,-2,0,3,-1,2],
'b': [-1,-2,-5,-7,-1,-1],
'c': [-1,-2,-5,4,5,3]})
For each column, how to replace any negative value with the last positive value or zero ? Last here refers from top to bottom for each column. The closest solution noticed is for instance df[df < 0] = 0.
The expected result would be a DataFrame such as
df_res = pd.DataFrame({'a': [1,1,0,3,3,2],
'b': [0,0,0,0,0,0],
'c': [0,0,0,4,5,3]})

You can use DataFrame.mask to convert all values < 0 to NaN then use ffill and fillna:
df = df.mask(df.lt(0)).ffill().fillna(0).convert_dtypes()
a b c
0 1 0 0
1 1 0 0
2 0 0 0
3 3 0 4
4 3 0 5
5 2 0 3

Use pandas where
df.where(df.gt(0)).ffill().fillna(0).astype(int)
a b c
0 1 0 0
1 1 0 0
2 1 0 0
3 3 0 4
4 3 0 5
5 2 0 3

Expected result may obtained with this manipulations:
mask = df >= 0 #creating boolean mask for non-negative values
df_res = (df.where(mask, np.nan) #replace negative values to nan
.ffill() #apply forward fill for nan values
.fillna(0)) # fill rest nan's with zeros

Related

Merge duplicated cells instead of dropping them [duplicate]

I have a panda dataframe (here represented using excel):
Now I would like to delete all dublicates (1) of a specific row (B).
How can I do it ?
For this example, the result would look like that:
You can use duplicated for boolean mask and then set NaNs by loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternative if need remove duplicates rows by B column:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
'B': [1,2,1,3],
'A':[1,5,7,9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3

Pandas Lag over multiple columns and set number of iterations

I have a dataframe like below:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
I would like to apply the pandas shift function to shift each column 4 times and create a new row for each shift:
col1 col1.lag0 col1.lag1 col1.lag2 col1.lag3 col2 col2.lag0 col2.lag1 col2.lag2 col2.lag3
1 0 0 0 0 3 0 0 0 0
2 1 0 0 0 4 3 0 0 0
0 2 1 0 0 0 4 3 0 0
0 0 2 1 0 0 0 4 3 0
0 0 0 2 1 0 0 0 4 3
I have tried a few solutions with shift like d['col1'].shift().fillna(0), however, I am not sure how to iterate the solution nor how to ensure the correct number of rows are added to the dataframe.
First I extend the given DataFrame by the correct number of rows with zeros. Then iterate over the columns and the amount of shifts to create the desired columns.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
n_shifts = 4
zero_rows = pd.DataFrame(index=pd.RangeIndex(n_shift_rows), columns=df.columns).fillna(0)
df = df.append(zero_rows).reset_index(drop=True)
for col in df.columns:
for shift_amount in range(1, n_shifts+1):
df[f"{col}.lag{shift_amount}"] = df[col].shift(shift_amount)
df.fillna(0).astype(int)
As pointed out by Ben.T the outer loop can be avoided as shift can be applied at once on the whole DataFrame. An alternative for the looping would be
shifts = df
for shift_amount in range(1, n_shifts+1):
columns = df.columns + ".lag" + str(shift_amount)
shift = pd.DataFrame(df.shift(shift_amount).values, columns=columns)
shifts = shifts.join(shift)
shifts.fillna(0).astype(int)

Calculating no of non-zero in a column corresponding to another column

I have a dataframe:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
which looks like-
A B class
0 0 4 0
1 4 1 1
2 8 0 1
3 1 0 0
4 0 3 1
5 0 1 0
I want to calculate for each column the corresponding a,b,c,d which are no of non-zero in column corresponding to class column 1,no of non-zero in column corresponding to class column 0,no of zero in column corresponding to class column 1,no of zero in column corresponding to class column 0
for example-
for column A the a,b,c,d are 2,1,1,2
explantion- In column A we see that where column[class]=1 the number of non zero values in column A are 2 therefore a=2(indices 1,2).Similarly b=1(indices 3)
My attempt(when the dataframe had equal no of 0 and 1 class)-
dataset = pd.read_csv('aaf.csv')
n=len(dataset.columns) #no of columns
X=dataset.iloc[:,1:n].values
l=len(X) #no or rows
score = []
for i in range(n-1):
#print(i)
X_column=X[:,i]
neg_array,pos_array=np.hsplit(X_column,2)##hardcoded
#print(pos_array.size)
a=np.count_nonzero(pos_array)
b=np.count_nonzero(neg_array)
c= l/2-a
d= l/2-b
Use:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
df = (df.set_index('class')
.ne(0)
.stack()
.groupby(level=[0,1])
.value_counts()
.unstack(1)
.sort_index(level=1, ascending=False)
.T)
print (df)
class 1 0 1 0
True True False False
A 2 1 1 2
B 2 2 1 1
df.columns = list('abcd')
print (df)
a b c d
A 2 1 1 2
B 2 2 1 1

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without for?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.

how do I insert a column at a specific column index in pandas?

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
see docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
using loc = 0 will insert at the beginning
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column, with your provided name already exists in the dataframe, it will raise a ValueError.
You can pass an optional parameter allow_duplicates with True value to create a new column with already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line ; however, this looks a bit ugly. Maybe some cleaner proposal may come...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this(only one line).
You can do that after you added the 'n' column into your df as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if you have words in your columns names instead of letters. It should include two brackets around your column names.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can have the following 4-line routine whenever you want to create a new column and insert into a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]

Categories

Resources