df.groupby([df.index.month, df.index.day])[vars_rs].transform(lambda y: y.fillna(y.median()))
I am filling missing values in a dataframe with median values from climatology. The dates range from Jan 1st 2010 to Dec 31st 2016. However, I only want to fill in missing values for days before the current date (say Oct 1st 2016). How do I modify the statement?
The algorithm would be:
Take the part of the data frame containing only the rows before the cutoff date, selected with a boolean mask
Perform the required replacements on it
Append the rest of the initial data frame to the end of the result
Dummy data:
df = pd.DataFrame(np.zeros((5, 2)), columns=['A', 'B'], index=pd.date_range('2000', periods=5, freq='M'))
A B
2000-01-31 0.0 0.0
2000-02-29 0.0 0.0
2000-03-31 0.0 0.0
2000-04-30 0.0 0.0
2000-05-31 0.0 0.0
The code
vars_rs = ['A', 'B']
mask = df.index < '2000-03-31'
early = df[mask]
early = early.groupby([early.index.month, early.index.day])[vars_rs].transform(lambda y: y.replace(0.0, 1))  # replace with your code
result = pd.concat([early, df[~mask]])  # DataFrame.append was removed in pandas 2.0
So the result is
A B
2000-01-31 1.0 1.0
2000-02-29 1.0 1.0
2000-03-31 0.0 0.0
2000-04-30 0.0 0.0
2000-05-31 0.0 0.0
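Mapped back to the original question, the same pattern would look roughly like this (a sketch, assuming df and vars_rs as defined in the question, pandas imported as pd, and Oct 1st 2016 as the cutoff):
mask = df.index < '2016-10-01'
early = df[mask]
early = early.groupby([early.index.month, early.index.day])[vars_rs].transform(lambda y: y.fillna(y.median()))
result = pd.concat([early, df[~mask]])
Note that the climatological medians are then computed from the pre-cutoff rows only; if they should come from the full record, compute them on df before slicing.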
Use np.where. Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['a','a','b','b','c','c'], 'B': [1,2,3,4,5,6], 'C': [1, np.nan, np.nan, np.nan, np.nan, np.nan]})
df['C'] = np.where((df.A != 'c') & (df.B < 4) & (pd.isnull(df.C)), -99, df['C'])  # df.ix is long removed; plain column assignment works
This way you can modify the desired column directly, using boolean expressions over any of the columns.
Original dataframe:
A B C
0 a 1 1.0
1 a 2 NaN
2 b 3 NaN
3 b 4 NaN
4 c 5 NaN
5 c 6 NaN
Modified dataframe:
A B C
0 a 1 1.0
1 a 2 -99.0
2 b 3 -99.0
3 b 4 NaN
4 c 5 NaN
5 c 6 NaN
I have a two-column data frame of the form:
Death HEALTH
0 other 0.0
1 other 1.0
2 vascular 0.0
3 other 0.0
4 other 0.0
5 vascular 0.0
6 NaN 0.0
7 NaN 0.0
8 NaN 0.0
9 vascular 1.0
I would like to create a new column following these steps:
wherever the value 'other' appears, write 'No'
wherever a NaN appears, leave it as it is
wherever 'vascular' appears in the first column and 1.0 in the second, write 'Yes'
wherever 'vascular' appears in the first column and 0.0 in the second, write 'No'
The output should be:
Death HEALTH New
0 other 0.0 No
1 other 1.0 No
2 vascular 0.0 No
3 other 0.0 No
4 other 0.0 No
5 vascular 0.0 No
6 NaN 0.0 NaN
7 NaN 0.0 NaN
8 NaN 0.0 NaN
9 vascular 1.0 Yes
Is there a pythonic way to achieve this? I'm all lost between loops and conditionals.
You can create conditions for No and Yes; for all other values, the original value is kept via the default in numpy.select:
m1 = df['Death'].eq('other') | (df['Death'].eq('vascular') & df['HEALTH'].eq(0))
m2 = (df['Death'].eq('vascular') & df['HEALTH'].eq(1))
df['new'] = np.select([m1, m2], ['No','Yes'], default=df['Death'])
Another idea is to also test for missing values; if no condition matches, the original value is kept (in the sample output below, row 0's Death was changed to 'another val' to show the default passing through):
m1 = df['Death'].eq('other') | (df['Death'].eq('vascular') & df['HEALTH'].eq(0))
m2 = (df['Death'].eq('vascular') & df['HEALTH'].eq(1))
m3 = df['Death'].isna()
df['new'] = np.select([m1, m2, m3], ['No','Yes', np.nan], default=df['Death'])
print(df)
Death HEALTH new
0 another val 0.0 another val
1 other 1.0 No
2 vascular 0.0 No
3 other 0.0 No
4 other 0.0 No
5 vascular 0.0 No
6 NaN 0.0 NaN
7 NaN 0.0 NaN
8 NaN 0.0 NaN
9 vascular 1.0 Yes
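For reference, a minimal self-contained run of the first variant on the question's data (a sketch; NaN rows match neither mask and fall through to the default):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Death': ['other', 'other', 'vascular', 'other', 'other', 'vascular', np.nan, np.nan, np.nan, 'vascular'],
                   'HEALTH': [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]})
m1 = df['Death'].eq('other') | (df['Death'].eq('vascular') & df['HEALTH'].eq(0))
m2 = df['Death'].eq('vascular') & df['HEALTH'].eq(1)
# the default is the original Death value, i.e. NaN for the unmatched rows
df['New'] = np.select([m1, m2], ['No', 'Yes'], default=df['Death'])
print(df)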
A simple way to do this is to implement your conditional logic using if/else inside a function, and apply this function row-wise to the dataframe.
def function(row):
    if row['Death'] == 'other':
        return 'No'
    if row['Death'] == 'vascular':
        if row['HEALTH'] == 1:
            return 'Yes'
        elif row['HEALTH'] == 0:
            return 'No'
    return np.nan

# axis=1 to apply the function row-wise
df['New'] = df.apply(function, axis=1)
It produces the following output as required:
Death HEALTH New
0 other 0 No
1 other 1 No
2 vascular 0 No
3 other 0 No
4 other 0 No
5 vascular 0 No
6 NaN 0 NaN
7 NaN 0 NaN
8 NaN 0 NaN
9 vascular 1 Yes
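A trade-off worth noting: apply with axis=1 calls the Python function once per row, so on large frames the vectorized np.select approach above will generally be much faster; the apply version is arguably easier to read for complex branching.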
I have Excel spreadsheets with data, one for each year. Alas, the columns change slightly over the years. What I want is one dataframe with all the data, with the lacking columns filled with predefined data. I wrote a small example program to test that.
import numpy as np
import pandas as pd
# Initialize three dataframes
df1 = pd.DataFrame([[1,2], [11,22],[111,222]], columns=['een', 'twee'])
df2 = pd.DataFrame([[3,4], [33,44],[333,444]], columns=['een', 'drie'])
df3 = pd.DataFrame([[5,6], [55,66],[555,666]], columns=['twee', 'vier'])
# Store these in a dictionary and print for verification
d = {'df1': df1, 'df2': df2, 'df3': df3}
for key in d:
    print(d[key])
    print()
# Create a list of all columns; as order is relevant, a set is not used
cols = []
# Count total number of rows
nrows = 0
# Loop through each dataframe to determine the total number of rows and columns
for key in d:
    df = d[key]
    nrows += len(df)
    for col in df.columns:
        if col not in cols:
            cols += [col]
# Create total dataframe, fill with default (zeros)
data = pd.DataFrame(np.zeros((nrows, len(cols))), columns=cols)
# Assign dataframe to each slice
c = 0
for key in d:
    data.loc[c:c+len(d[key])-1, d[key].columns] = d[key]
    c += len(d[key])
print(data)
The dataframes are initialized all right, but something weird happens in the assignment to the slice of the data dataframe. What I wanted (and expected) is:
een twee drie vier
0 1.0 2.0 0.0 0.0
1 11.0 22.0 0.0 0.0
2 111.0 222.0 0.0 0.0
3 3.0 0.0 4.0 0.0
4 33.0 0.0 44.0 0.0
5 333.0 0.0 444.0 0.0
6 0.0 5.0 0.0 6.0
7 0.0 55.0 0.0 66.0
8 0.0 555.0 0.0 666.0
But this is what I got:
een twee drie vier
0 1.0 2.0 0.0 0.0
1 11.0 22.0 0.0 0.0
2 111.0 222.0 0.0 0.0
3 NaN 0.0 NaN 0.0
4 NaN 0.0 NaN 0.0
5 NaN 0.0 NaN 0.0
6 0.0 NaN 0.0 NaN
7 0.0 NaN 0.0 NaN
8 0.0 NaN 0.0 NaN
The location AND the data of the first dataframe are correctly assigned. However, the second dataframe is assigned to the correct location, but not its contents: NaN is assigned instead. This also happens for the third dataframe: correct location but missing data. I have tried assigning d[key].loc[0:2, d[key].columns] and some more fanciful expressions to the data slice, but all return NaN. How can I get the contents of the dataframes assigned to data as well?
Per the comments, you can use:
pd.concat([df1, df2, df3])
OR
pd.concat([df1, df2, df3]).fillna(0)
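For context (not part of the original answer): the NaNs in the question's attempt appear because .loc assignment aligns on index labels, and each source frame is labelled 0-2 while the target rows are labelled 3-8, so nothing matches (the first frame worked only because its labels happened to coincide). A plain pd.concat keeps those repeated 0-2 labels; to reproduce the expected 0-8 index with zero fills, pass ignore_index=True:
result = pd.concat([df1, df2, df3], ignore_index=True).fillna(0)
print(result)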
I have a large dataframe that looks similar to this:
As you can tell, there are plenty of blanks. I want to propagate non-null values forward (so, for example, in the first row 1029 goes into the 1963.02.12 column, between 1029 and 1043), but only up to the last entry; that is, it should stop propagating when it encounters the last non-null value (for D that would be the 1992.03.23 column, but for A it'd be 1963.09.21, just outside the screenshot).
Is there a quicker way to achieve this without fiddling around with df.fillna(method='ffill', limit=x)? My original idea was to remember the date of the last entry, propagate values to the end of the row, and then fill the row with nulls after the saved date. I've been wondering if there is a cleverer method to achieve the same result.
This might not be very performant. I couldn't get a pure-pandas solution (which obviously doesn't guarantee performance anyway!)
>>> df
a b c d e
0 0.0 NaN NaN 1.0 NaN
1 0.0 1.0 NaN 2.0 3.0
2 NaN 1.0 2.0 NaN 4.0
What happens if we just ffill everything?
>>> df.ffill(axis=1)
a b c d e
0 0.0 0.0 0.0 1.0 1.0
1 0.0 1.0 1.0 2.0 3.0
2 NaN 1.0 2.0 2.0 4.0
We need to go back and restore NaN in each row's last originally-null column:
>>> new_data = []
>>> for _, row in df.iterrows():
...     new_row = row.ffill()
...     null_columns = [col for col, is_null in zip(row.index, row.isnull().values) if is_null]
...     # restore NaN in the last originally-null column of the row
...     if null_columns:
...         last_null_column = null_columns[-1]
...         new_row.loc[last_null_column] = np.nan
...     new_data.append(new_row.to_dict())
...
>>> new_df = pd.DataFrame.from_records(new_data)
>>> new_df
a b c d e
0 0.0 0.0 0.0 1.0 NaN
1 0.0 1.0 NaN 2.0 3.0
2 NaN 1.0 2.0 NaN 4.0
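For what it's worth, a vectorized sketch that implements the question's stated rule (stop propagating past each row's last non-null value) without the Python loop; note it fills interior gaps, so it differs from the loop above on rows whose last NaN is not trailing:
# ffill everything, then blank out positions with no non-null value
# anywhere to their right, i.e. past the row's last entry
filled = df.ffill(axis=1).where(df.bfill(axis=1).notna())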
Basically, I'm trying to do something like this but for a fillna instead of a sum.
I have a list of df's, each with same colunms/indexes, ordered over time:
import numpy as np
import pandas as pd
np.random.seed(0)
df_list = []
for index in range(3):
    a = pd.DataFrame(np.random.randint(3, size=(5, 3)), columns=list('abc'))
    mask = np.random.choice([True, False], size=a.shape)
    df_list.append(a.mask(mask))
Now, I want to replace the numpy.nan cells of the i-th DataFrame in df_list with the value of the same cell in the (i-1)-th DataFrame in df_list.
so if the first DataFrame is:
a b c
0 NaN 1.0 0.0
1 1.0 1.0 NaN
2 0.0 NaN 0.0
3 NaN 0.0 2.0
4 NaN 2.0 2.0
and the 2nd is:
a b c
0 0.0 NaN NaN
1 NaN NaN NaN
2 0.0 1.0 NaN
3 NaN NaN 2.0
4 0.0 NaN 2.0
Then the output, output_list, should be a list of the same length as df_list, also with DataFrames as elements.
The first entry of output_list is the same as the first entry of df_list.
The second entry of output_list is:
a b c
0 0.0 1.0 0.0
1 1.0 1.0 NaN
2 0.0 1.0 0.0
3 NaN 0.0 2.0
4 0.0 2.0 2.0
I believe the update functionality is very good for this; see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
It is a method that specifically allows you to update a DataFrame, in your case only the NaN elements of it.
In particular, you could use it like this:
new_df_list = df_list[:1]
for df_new, df_old in zip(df_list[1:], df_list[:-1]):
    df_new.update(df_old, overwrite=False)
    new_df_list.append(df_new)
This will give you the desired output.
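One caveat, an observation rather than part of the original answer: update mutates the frames of df_list in place, and since each updated frame feeds the next iteration, fills cascade down the list. If df_list should be left intact while keeping the same cascading result, a sketch (df_curr and df_new are hypothetical names):
new_df_list = [df_list[0].copy()]
for df_curr in df_list[1:]:
    df_new = df_curr.copy()
    # fill only the NaN cells, taking values from the previous (already-filled) frame
    df_new.update(new_df_list[-1], overwrite=False)
    new_df_list.append(df_new)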
How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove, or iloc to filter out, the unnecessary columns:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print(df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
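Not part of the original answer, but if there were several non-numeric columns, selecting by dtype would avoid naming them one by one (assuming numpy is imported as np):
df['MEAN'] = df.select_dtypes(include=np.number).mean(axis=1)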