I am working with python3 and pandas, and I would like to pass : as a function parameter to state all rows in a slice passed to df.loc.
For example, say I have a function that fills na values like so:
def fill_na_w_value(df, rows, columns, fill):
for col in columns:
df.loc[rows, columns].fillna(
fill,
inplace=True
)
Sometimes I may not want to apply it to some rows but to apply it to all rows, in pandas this is accessed with df.loc[:, col]
If I am calling this from a function it would like
fill_na_w_value(df, :, ['col1'], 0)
But the above will give me a syntax error because of the :; how can I pass this as a function parameter?
Use slice(None) to represent :. Note you can use pipe to pass your dataframe through a function, and loc accepts a list for row and index filtering:
df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5, np.nan, 7, 8, np.nan]})
def fill_na_w_value(df, row_slicer, columns, value):
df.loc[row_slicer, columns] = df.loc[row_slicer, columns].fillna(value)
return df
df1 = df.pipe(fill_na_w_value, slice(None), ['col1'], 0)
print(df1)
col1
0 1.0
1 2.0
2 0.0
3 4.0
4 5.0
5 0.0
6 7.0
7 8.0
8 0.0
Here's an example using a list instead of a slice object:
df2 = df.pipe(fill_na_w_value, [2, 5], ['col1'], 0)
print(df2)
col1
0 1.0
1 2.0
2 0.0
3 4.0
4 5.0
5 0.0
6 7.0
7 8.0
8 NaN
Related
I have a data frame with numeric values, such as
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
and I append a single row with all the column sums
totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)
Simple enough.
Here are the values of df, totals, and df_append
>>> df
A B
0 1 2
1 3 4
>>> totals
A 4
B 6
Name: totals, dtype: int64
>>> df_append
A B
0 1 2
1 3 4
totals 4 6
Unfortunately, in newer versions of pandas the method DataFrame.append is deprecated, and will be removed in some future version of pandas. The advise is to replace it with pandas.concat.
Now, using pd.concat naively as follows
df_concat_bad = pd.concat([df, totals])
produces
>>> df_concat_bad
A B 0
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
Apparently, with df.append the Series object got interpreted as a row, but with pd.concat it got interpreted as a column.
You cannot fix this with something like calling pd.concat with axis=1, because that would add the totals as column:
>>> pd.concat([df, totals], axis=1)
A B totals
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
(In this case, the result looks the same as using the default axis=0, because the indexes of df and totals are disjoint, as are their column names.)
How to handle this (elegantly and efficiently)?
The solution is to convert totals (a Series object) to a DataFrame (which will then be a column) using to_frame() and next transpose it using T:
df_concat_good = pd.concat([df, totals.to_frame().T])
yields the desired
>>> df_concat_good
A B
0 1 2
1 3 4
totals 4 6
I prefer to use df.loc() to solve this problem than pd.concat()
df.loc["totals"]=df.sum()
I have a DataFrame like
A B
1 2
2 -
5 -
4 5
I want to apply a function func() on column B (but the function gives an error if - is passed). I cannot modify the func() function. I need something like:
df['B']=df['B'].apply(func) only if value not equal to -
Use a custom function to apply on a df column if a condition is satisfied:
def func(a):
return a + 10
#new pandas dataframe with four rows and 2 columns. 3rd row having a nan
df = pd.DataFrame([[1, 2], [3, 4], [5, pd.np.nan], [7, 8]], columns=["A", "B"])
print(df)
#coerce column named B to numeric
s = pd.to_numeric(df['B'], errors='coerce')
#a mask has true for numeric rows, false for non numeric rows
mask = s.notna()
#mask
print(mask)
#run function named func across the B column
df.loc[mask, 'B'] = s[mask].apply(func)
print(df)
Which prints:
A B
0 1 2.0
1 3 4.0
2 5 NaN
3 7 8.0
0 True
1 True
2 False
3 True
A B
0 1 12.0
1 3 14.0
2 5 NaN
3 7 18.0
Try:
df['B'] = df[df['B']!='-']['B'].apply(func)
Or when the - is actaully nan you can use:
df['B'] = df[pd.notnull(df['B'])]['B'].apply(func)
so I was trying to impute some missing values with fillna() in pandas, but I don't really know how to impute by using the mean value of the last 3 rows in the same column (not the mean value of the entire column), so if anyone can help it will be greatly appreciated, thanks
You can fillna with rolling(3).mean(). shift gets the alignment correct. This approach fills everything at once, so for consecutive NaN values the fillings are independent. If you need iterative filling (fills the first NaN then that value is used to compute the fill value in the next consecutive NaN) then it cannot be done in this way.
df = pd.DataFrame({'col1': [np.NaN, 3, 4, 5, np.NaN, np.NaN, np.NaN, 7]})
# Fill if
# at least
# one value
df.fillna(df.rolling(3, min_periods=1).mean().shift()) # works for many cols at once
col1
0 NaN # Unfilled because < min_periods
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.NaN, 4, 5])
6 5.0 # np.nanmean([np.NaN, np.naN ,5])
7 7.0
You could do:
df.fillna(df.iloc[-3:].mean())
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1':[1, 2, 3, np.nan, 5, 6, 7],
'var2':[np.nan, np.nan, np.nan, np.nan, np.nan, 1, 0]})
var1 var2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN NaN
4 5.0 NaN
5 6.0 1.0
6 7.0 0.0
print(df.fillna(df.iloc[-3:].mean()))
Output:
var1 var2
0 1.0 0.5
1 2.0 0.5
2 3.0 0.5
3 6.0 0.5
4 5.0 0.5
5 6.0 1.0
6 7.0 0.0
Dan's solution is much simpler if the kink is worked out. If not, this will accomplish it:
df2 = df1.fillna('nan') # Just filling them for the loop
dfrows = df2.shape[0]
dfcols = df2.shape[1]
for row in range(dfrows):
for col in range(dfcols):
if df2.iloc[row, col] == ('nan'):
df2.iloc[row,col] = (df2.iloc[row-1,col] + df2.iloc[row-2,col] + df2.iloc[row-3,col])/3
df2
This question already has answers here:
add a row at top in pandas dataframe [duplicate]
(6 answers)
Closed 4 years ago.
I would like to move an entire row (index and values) from the last row to the first row of a DataFrame. Every other example I can find either uses an ordered row index (to be specific - my row index is not a numerical sequence - so I cannot simply add at -1 and then reindex with +1) or moves the values while maintaining the original index. My DF has descriptions as the index and the values are discrete to the index description.
I'm adding a row and then would like to move that into row 1. Here is the setup:
df = pd.DataFrame({
'col1' : ['A', 'A', 'B', 'F', 'D', 'C'],
'col2' : [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')
#output
In [7]: df
Out[7]:
col2 col3
col1
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
I then add a new row as follows:
df.loc["Deferred Description"] = pd.Series([''])
In [9]: df
Out[9]:
col2 col3
col1
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
Deferred Description NaN NaN
I would like the resulting output to be:
In [9]: df
Out[9]:
col2 col3
col1
Defenses Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
I've tried using df.shift() but only the values shift. I've also tried df.sort_index() but that requires the index to be ordered (there are several SO examples using df.loc[-1] = ... then then reindexing with df.index = df.index + 1). In my case I need the Defenses Description to be the first row.
Your problem is not one of cyclic shifting, but a simpler oneāone of insertion (which is why I've chosen to mark this question as duplicate).
Construct an empty DataFrame and then concatenate the two using pd.concat.
pd.concat([pd.DataFrame(columns=df.columns, index=['Deferred Description']), df])
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
If this were columns, it'd have been easier. Funnily enough, pandas has a DataFrame.insert function that works for columns, but not rows.
Generalized Cyclic Shifting
If you were curious to know how you'd cyclically shift a dataFrame, you can use np.roll.
# apply this fix to your existing DataFrame
pd.DataFrame(np.roll(df.values, 1, axis=0),
index=np.roll(df.index, 1), columns=df.columns
)
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
This, thankfully, also works when you have duplicate index values. If the index or columns aren't important, then pd.DataFrame(np.roll(df.values, 1, axis=0)) works well enough.
You can using append
pd.DataFrame({'col2':[np.nan],'col3':[np.nan]},index=["Deferred Description"]).append(df)
Out[294]:
col2 col3
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
I have a simple dataframe with 2 columns and 2rows.
I also have a list of 4 numbers.
I want to concatenate this list to the FIRST column of the dataframe, and only the first. So the dataframe will have 6rows in the first column, and 2in the second.
I wrote this code:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
numbers = [5, 6, 7, 8]
for i in range(0, 4):
df1['A'].loc[i + 2] = numbers[i]
print(df1)
It prints the original dataframe oddly enough. But when I debug and evaluate the expression df1['A'] then it does show the new numbers. What's going on here?
It's not just that it's printing the original df, it also writes the original df to csv when I use to_csv method.
It seems you need:
for i in range(0, 4):
df1.loc[0, i] = numbers[i]
print (df1)
A B 0 1 2 3
0 1 2 5.0 6.0 7.0 8.0
1 3 4 NaN NaN NaN NaN
df1 = pd.concat([df1, pd.DataFrame([numbers], index=[0])], axis=1)
print (df1)
A B 0 1 2 3
0 1 2 5.0 6.0 7.0 8.0
1 3 4 NaN NaN NaN NaN