I'm a new pandas user (as of yesterday), and have found it at times both convenient and frustrating.
My current frustration is in trying to use df.fillna() on multiple columns of a dataframe. For example, I've got two sets of data (a newer set and an older set) which partially overlap. For the cases where we have new data, I just use that, but I also want to use the older data if there isn't anything newer. It seems I should be able to use fillna() to fill the newer columns with the older ones, but I'm having trouble getting that to work.
Attempt at a specific example:
df.ix[:,['newcolumn1','newcolumn2']].fillna(df.ix[:,['oldcolumn1','oldcolumn2']], inplace=True)
But this doesn't work as expected - numbers show up in the new columns that had been NaNs, but not the ones that were in the old columns (in fact, looking through the data, I have no idea where the numbers it picked came from, as they don't exist in either the new or old data anywhere).
Is there a way to fill in the NaNs of specific columns in a DataFrame with values from other specific columns of the same DataFrame?
fillna is generally for carrying an observation forward or backward. Instead, I'd use np.where... if I understand what you're asking.
import numpy as np

# keep newcolumn1 where it has data; otherwise fall back to oldcolumn1
df['newcolumn1'] = np.where(np.isnan(df['newcolumn1']), df['oldcolumn1'], df['newcolumn1'])
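Depending on the data, combine_first may read more naturally; a sketch, assuming the new and old columns share the same index:

# fill NaNs in the new column from the old one, keeping existing values
df['newcolumn1'] = df['newcolumn1'].combine_first(df['oldcolumn1'])
df['newcolumn2'] = df['newcolumn2'].combine_first(df['oldcolumn2'])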
To answer your question: yes. Look at using the value argument of fillna, along with the to_dict() method on the other DataFrame.
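A minimal sketch of that value-argument idea, using the hypothetical column names from your question (a dict maps each target column to its fallback values):

# columns not in the dict are left unfilled
d = {'newcolumn1': df['oldcolumn1'], 'newcolumn2': df['oldcolumn2']}
df = df.fillna(value=d)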
But to really solve your problem, have a look at the update() method of the DataFrame. Assuming your two dataframes are similarly indexed, I think it's exactly what you want.
In [36]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [37]: df
Out[37]:
     A    B
0    0    1
1  NaN    0
2    2    1
3    3  NaN
4  NaN  NaN
5    5    1
In [38]: df2 = pd.DataFrame({'A': [0, np.nan, 2, 3, 4, 5], 'B': [1, 0, 1, 1, 0, 0]})
In [40]: df2
Out[40]:
     A  B
0    0  1
1  NaN  0
2    2  1
3    3  1
4    4  0
5    5  0
In [52]: df.update(df2, overwrite=False)
In [53]: df
Out[53]:
     A  B
0    0  1
1  NaN  0
2    2  1
3    3  1
4    4  0
5    5  1
Notice that all the NaNs in df were replaced, except for (1, A), since that was also NaN in df2. Also, some values, like (5, B), differed between df and df2; with overwrite=False the value from df is kept.
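Incidentally, combine_first gives the same fill here; a sketch (note it returns a new DataFrame rather than modifying df in place):

# NaNs in df are filled from df2; existing df values win
df = df.combine_first(df2)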
EDIT: Based on the comments it seems like you're looking for a solution where the column names don't match across the two DataFrames (it'd be helpful if you posted sample data). Let's try that, replacing column A with C and B with D.
In [33]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [34]: df2 = pd.DataFrame({'C': [0, np.nan, 2, 3, 4, 5], 'D': [1, 0, 1, 1, 0, 0]})
In [35]: df
Out[35]:
     A    B
0    0    1
1  NaN    0
2    2    1
3    3  NaN
4  NaN  NaN
5    5    1
In [36]: df2
Out[36]:
     C  D
0    0  1
1  NaN  0
2    2  1
3    3  1
4    4  0
5    5  0
In [37]: d = {'A': df2.C, 'B': df2.D}  # pass these values to fillna
In [38]: df
Out[38]:
     A    B
0    0    1
1  NaN    0
2    2    1
3    3  NaN
4  NaN  NaN
5    5    1
In [40]: df.fillna(value=d)
Out[40]:
     A  B
0    0  1
1  NaN  0
2    2  1
3    3  1
4    4  0
5    5  1
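Alternatively, you could rename df2's columns to match df and fall back on update (a sketch; rename returns a copy, so df2 itself is left untouched):

df.update(df2.rename(columns={'C': 'A', 'D': 'B'}), overwrite=False)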
I think if you invest the time to learn pandas you'll hit fewer moments of frustration. It's a massive library though, so it takes time.
Related
I need to modify the data while excluding the NaN rows, and then put the NaNs back afterwards. So far I've split the data into a no-NaN DataFrame and a NaN DataFrame, applied the modifications, and used concat to bring the data back together. I'm hoping there is a better way: concat appends the second frame at the bottom, and even though that happens to be where the NaNs are in my example, there may be cases where it isn't. I'd like the NaNs to go back to their original positions rather than to the bottom.
import pandas as pd
import numpy as np

def modify_data():
    d = {'num': [1, 2, 3, 4, np.nan], 'n_obs': [3, 4, 2, 3, 1], 'target': [3, 4, 5, 2, 7]}
    df = pd.DataFrame(data=d)
    nan_df = df[df["num"].isnull()]
    not_nan_df = df[df["num"].notnull()]
    df["num"] = pd.concat([not_nan_df["num"].clip(lower=2), nan_df["num"]])
    print(df["num"])
    return df["num"].values
You don't need all of that. Just restrict both sides of the assignment to the non-null rows of num:
df.loc[df["num"].notnull(), "num"] = df.loc[df["num"].notnull(), "num"].clip(lower=2)
Output:
   num  n_obs  target
0  2.0      3       3
1  2.0      4       4
2  3.0      2       5
3  4.0      3       2
4  NaN      1       7
According to the documentation, clip leaves NaN values alone, so you can use it directly:
# Or df['num'].clip(lower=2, inplace=True)
df['num'] = df['num'].clip(lower=2)
print(df)
# Output
   num  n_obs  target
0  2.0      3       3
1  2.0      4       4
2  3.0      2       5
3  4.0      3       2
4  NaN      1       7
I have two dataframes. The first one is df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]}), i.e.
   A  B
0  5  2
1  0  4
and the other one is df2 = pd.DataFrame({'C': [1, 1], 'D': [3, 3]}), i.e.
   C  D
0  1  3
1  1  3
I want to grab only the 4 from df1 and make a new column in df2. I have tried df2['E'] = df1['B'][df1['B'] == 4] and got
   C  D    E
0  1  3  NaN
1  1  3  4.0
I want both rows of df2 to be 4. How can I achieve this? Any help would be much appreciated.
If the value 4 appears as the last value in your column (as in your example), you could backfill:
df2['E'] = df2['E'].fillna(method='backfill')
For other methods, have a look here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
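Note that in recent pandas versions the method argument of fillna is deprecated; assuming a current version, the equivalent is bfill:

# fill each NaN with the next valid value below it
df2['E'] = df2['E'].bfill()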
It is not entirely clear what you want to accomplish here, but I assume you would like to check whether there is any 4 in df1 (column B) and, if so, fill all rows of df2 (column E) with 4. Then you could do:
import numpy as np
df2['E'] = np.where(df1['B'].isin([4]).any(), 4, np.nan)
Output:
   C  D    E
0  1  3  4.0
1  1  3  4.0
my first post!
I'm running Python 3.8.5 and pandas 1.1.0 on Jupyter notebooks.
I want to divide several columns by the corresponding elements in another column of the same dataframe.
For example:
import pandas as pd
df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c':[6, 9, 12]})
df
   a  b   c
0  2  4   6
1  3  6   9
2  4  8  12
I'd like to divide columns 'b' & 'c' by the corresponding values in 'a' and substitute the values in 'b' and 'c' with the result of this division. So the above dataframe becomes:
   a  b  c
0  2  2  3
1  3  2  3
2  4  2  3
I tried
df.iloc[: , 1:] = df.iloc[: , 1:] / df['a']
but this gives:
   a   b   c
0  2 NaN NaN
1  3 NaN NaN
2  4 NaN NaN
I got it working by doing:
for colname in df.columns[1:]:
    df[colname] = df[colname] / df['a']
Is there a faster way of doing the above by avoiding the for loop?
thanks,
mk
Almost there. The trouble is that / aligns the df['a'] Series against the columns of df.iloc[:, 1:] (labels b and c versus 0, 1, 2), so nothing matches and everything becomes NaN. Use div with axis=0 to align on the index instead:
df.iloc[:, 1:] = df.iloc[:, 1:].div(df.a, axis=0)
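The same operation with labels instead of positions, if you prefer (purely a stylistic variant):

df[['b', 'c']] = df[['b', 'c']].div(df['a'], axis=0)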
df.b = df.b / df.a
df.c = df.c / df.a
or
df[['b', 'c']] = df.apply(lambda x: x[['b', 'c']] / x.a, axis=1)
I am trying to populate a new column within a pandas DataFrame by using values from several columns. The original columns contain either 0 or 1, with exactly one 1 per row. The new column corresponds to the columns ['A', 'B', 'C', 'D'] via new_col = [1, 3, 7, 10], as shown below (a 1 in A means new_col = 1; a 1 in B means new_col = 3, and so on).
df
   A  B  C  D
1  1  0  0  0
2  0  0  1  0
3  0  0  0  1
4  0  1  0  0
The new df should look like this.
df
   A  B  C  D  new_col
1  1  0  0  0        1
2  0  0  1  0        7
3  0  0  0  1       10
4  0  1  0  0        3
I've tried to use map, loc, and where, but can't seem to formulate an efficient way to get it done. The problem seems very close to some other posts I've looked at, but none of them show how to use multiple columns conditionally to fill a new column based on a list.
I can think of a few ways, mostly involving argmax or idxmax, to get either an ndarray or a Series which we can use to fill the column.
We could drop down to numpy, find the maximum locations (where the 1s are) and use those to index into an array version of new_col:
In [147]: new_col = [1, 3, 7, 10]
In [148]: np.take(new_col, np.argmax(df.values, 1))
Out[148]: array([ 1, 7, 10, 3])
We could make a Series with new_col as the values and the columns as the index, and index into that with idxmax:
In [116]: pd.Series(new_col, index=df.columns).loc[df.idxmax(1)].values
Out[116]: array([ 1, 7, 10, 3])
We could use get_indexer to turn the column idxmax results into integer offsets we can use with new_col:
In [117]: np.array(new_col)[df.columns.get_indexer(df.idxmax(axis=1))]
Out[117]: array([ 1, 7, 10, 3])
Or (and this seems very wasteful) we could make a new frame with the new columns and use idxmax directly:
In [118]: pd.DataFrame(df.values, columns=new_col).idxmax(1)
Out[118]:
0     1
1     7
2    10
3     3
dtype: int64
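Since each row contains exactly one 1, a plain dot product also recovers the values; a sketch that relies on that single-1 assumption:

# each row's dot product with [1, 3, 7, 10] picks out the value at the lone 1
df['new_col'] = df[['A', 'B', 'C', 'D']].dot([1, 3, 7, 10])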
It's not the most elegant solution, but for me it beats an if/elif chain:
d = {'A': 1, 'B': 3, 'C': 7, 'D': 10}

def new_col(row):
    # find the (single) column whose value is 1 and map it through d
    k = row[row == 1].index.tolist()[0]
    return d[k]

df['new_col'] = df.apply(new_col, axis=1)
Output:
   A  B  C  D  new_col
1  1  0  0  0        1
2  0  0  1  0        7
3  0  0  0  1       10
4  0  1  0  0        3
Given a MultiIndexed DataFrame, I would like to return only those rows that satisfy a condition for all entries of the inner index level. Here is a small working example:
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [0, 2, 2, 2]})
df = df.set_index(['a', 'b'])
print(df)
out:
     c
a b
1 1  0
  2  2
2 3  2
  4  2
Now, I would like to return the entries for which c > 1. For instance, df[df['c'] > 1] gives
out:
     c
a b
1 2  2
2 3  2
  4  2
But I want to get
out:
     c
a b
2 3  2
  4  2
Any thoughts on how to do this in the most efficient way?
I ended up using groupby:
df.groupby(level=0).filter(lambda x: (x['c'] > 1).all())
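A vectorized alternative, a sketch assuming the MultiIndex from above: compute the per-group condition with transform and index with the resulting boolean mask.

# True for every row whose entire level-0 group satisfies c > 1
mask = (df['c'] > 1).groupby(level=0).transform('all')
df[mask]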