Update missing values in a pandas dataframe with matching conditions

I have a dataframe df1 with 3 columns (A, B, C); NaN represents a missing value here:
A B C
1 2 NaN
2 1 2.3
2 3 2.5
I have a dataframe df2 with 3 columns (A,B,D)
A B D
1 2 2
2 1 2
2 3 4
The expected output would be
A B C
1 2 2
2 1 2.3
2 3 2.5
I want to keep the values in column C of df1 intact where they are not missing, and replace the missing ones with the corresponding value of D where the other two columns match, i.e. df1.A == df2.A and df1.B == df2.B.
Is there a good solution?

One way would be to use the columns A and B as the index. If you then use fillna, pandas will align the indices and give you the correct result:
df1.set_index(['A', 'B'])['C'].fillna(df2.set_index(['A', 'B'])['D']).reset_index()
Out:
A B C
0 1 2 2.0
1 2 1 2.3
2 2 3 2.5
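For reference, a minimal self-contained sketch of this approach, built from the frames in the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 1, 3], 'C': [np.nan, 2.3, 2.5]})
df2 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 1, 3], 'D': [2, 2, 4]})

# Align both frames on (A, B), fill the gaps in C from D, then restore the columns.
out = df1.set_index(['A', 'B'])['C'].fillna(df2.set_index(['A', 'B'])['D']).reset_index()
print(out)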

IIUC:
In [100]: df1['C'] = np.where((np.isnan(df1.C)) & (df1.A == df2.A) & (df1.B == df2.B), df2.D, df1.C)
In [101]: df1
Out[101]:
A B C
0 1.0 2.0 2.0
1 2.0 1.0 2.3
2 2.0 3.0 2.5
np.where is faster in comparison:
In [102]: %timeit df1['C'] = np.where((np.isnan(df1.C)) & (df1.A == df2.A) & (df1.B == df2.B), df2.D, df1.C)
1000 loops, best of 3: 1.3 ms per loop
In [103]: %timeit df1.set_index(['A', 'B'])['C'].fillna(df2.set_index(['A', 'B'])['D']).reset_index()
100 loops, best of 3: 5.92 ms per loop
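Both snippets above compare the frames row by row, so they assume df1 and df2 are in the same row order. If that is not guaranteed, a merge-based variant (a sketch, not from the answers above) avoids that assumption:
# Sketch: a left merge keeps df1's row order and works even if the two frames
# are not row-aligned; assumes each (A, B) pair appears at most once in df2.
merged = df1.merge(df2, on=['A', 'B'], how='left')
df1['C'] = merged['C'].fillna(merged['D']).values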

Related

Grouping dataframe based on consecutive occurrence of values

I have a pandas dataframe with one column that is either true or false (titled 'condition' in the example below). I would like to group the dataframe by consecutive true or false values. I have tried to use pandas.groupby but haven't succeeded with that method, although I think that's down to my lack of understanding. An example of the dataframe can be found below:
df = pd.DataFrame({'condition': [1, 1, 0, 0, 1, 1, 1],
                   'H': [2, 7, 1, 6.5, 7, 9, 22],
                   't': [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0]})
df.index.name = 'index'
print(df)
index condition H t
0 1 2 1.1
1 1 7 1.5
2 0 1 0.9
3 0 6.5 1.6
4 1 7 1.1
5 1 9 1.8
6 1 22 2.0
Ideally the output of the program would be something along the lines of what can be found below. I was thinking of using some sort of 'grouping' method to make it easier to call each set of results but not sure if this is the best method. Any help would be greatly appreciated.
index condition H t group
0 1 2 1.1 1
1 1 7 1.5 1
2 0 1 0.9 2
3 0 6.5 1.6 2
4 1 7 1.1 3
5 1 9 1.8 3
6 1 22 2.0 3
Since you're dealing with 0/1s, here's another alternative using diff + cumsum -
df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
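To see why this works, here are the intermediate steps for the sample data (added for illustration):
df.condition.diff().abs()             # NaN, 0, 1, 0, 1, 0, 0 -- 1 wherever the value flips
df.condition.diff().abs().cumsum()    # NaN, 0, 1, 1, 2, 2, 2
# fillna(0) handles the first row, and + 1 makes the group numbers start at 1.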
If you don't mind floats, this can be made a little faster.
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df
condition H t group
index
0 1 2.0 1.1 1.0
1 1 7.0 1.5 1.0
2 0 1.0 0.9 2.0
3 0 6.5 1.6 2.0
4 1 7.0 1.1 3.0
5 1 9.0 1.8 3.0
6 1 22.0 2.0 3.0
Here's the version with numpy equivalents -
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
On my machine, here are the timings -
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 25.1 ms per loop
%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
10 loops, best of 3: 23.4 ms per loop
%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
10 loops, best of 3: 21.4 ms per loop
%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop
Compare the column with its shifted version using ne (!=), then use cumsum:
df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print (df)
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
Detail:
print (df['condition'].ne(df['condition'].shift()))
index
0 True
1 False
2 True
3 False
4 True
5 False
6 False
Name: condition, dtype: bool
Timings:
df = pd.concat([df]*100000).reset_index(drop=True)
In [54]: %timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 12.2 ms per loop
In [55]: %timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 24.5 ms per loop
In [56]: %%timeit
...: df['group'] = 1
...: df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
...:
10 loops, best of 3: 26.6 ms per loop
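Whichever variant you use, the original goal of picking out each run of consecutive values is then just a groupby over the new column. A minimal sketch:
for key, sub in df.groupby('group'):
    print(key)
    print(sub)

# or keep the runs around as a dict of sub-frames
runs = {key: sub for key, sub in df.groupby('group')}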

If a value in column A of a data frame is null, write the value from column B into column A

I just cannot get my head around this. I have a data frame with the following values:
df = pd.DataFrame([
    (1, np.nan, "a"),
    (1, "a", np.nan),
    (1, np.nan, "b"),
    (1, "c", "b"),
    (2, "a", np.nan),
    (2, np.nan, "b"),
    (3, "a", np.nan)], columns=["A", "B", "C"])
That translates into
A B C
0 1 NaN a
1 1 a NaN
2 1 NaN b
3 1 c b
4 2 a NaN
5 2 NaN b
6 3 a NaN
What I want is that if I have a null value / empty field in "B" it should be replaced with the value from "C". Like this:
A B C
0 1 a a
1 1 a NaN
2 1 b b
3 1 c b
4 2 a NaN
5 2 b b
6 3 a NaN
I can of course filter for the values:
df.loc[df.B.isnull()]
but I cannot manage to assign values from the other column:
df.loc[df.B.isnull()] = df.C
I understand that this tries to replace the three NaNs with all seven entries of column C, so the shapes do not match. So how do I get the corresponding values over?
You can use:
df.loc[df.B.isnull(), 'B'] = df.C
Output:
A B C
0 1 a a
1 1 a NaN
2 1 b b
3 1 c b
4 2 a NaN
5 2 b b
6 3 a NaN
Or, as suggested in a comment below, you can also use:
df.B.where(pd.notnull, df.C, inplace=True)
You can use combine_first; it also seems to be much faster:
df.B = df.B.combine_first(df.C)
1000 loops, best of 3: 764 µs per loop
df.loc[df.B.isnull(), 'B'] = df.C
100 loops, best of 3: 1.54 ms per loop
You get
A B C
0 1 a a
1 1 a NaN
2 1 b b
3 1 c b
4 2 a NaN
5 2 b b
6 3 a NaN
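For completeness, fillna on the column gives the same result as combine_first here, since both align on the index (a sketch):
df['B'] = df['B'].fillna(df['C'])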

How to fillna by groupby outputs in pandas?

I have a dataframe with 4 columns (A, B, C, D). D has some NaN entries. I want to fill the NaN values with the average value of D over the rows that have the same values of A, B, C.
For example, if the values of A, B, C, D are x, y, z and NaN respectively, then I want the NaN value to be replaced by the average of D over the rows where the values of A, B, C are x, y, z respectively.
df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) would be faster than apply
In [2400]: df
Out[2400]:
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0 1.0
1 2.0
2 3.0
3 5.0
Name: D, dtype: float64
In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2403]: df
Out[2403]:
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Details
In [2396]: df.shape
Out[2396]: (10000, 4)
In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop
In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
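One caveat not raised in the question: transform('mean') returns NaN for a group whose D values are all missing, so those rows would stay NaN. A minimal sketch that falls back to the overall mean of D in that case:
group_mean = df.groupby(['A', 'B', 'C'])['D'].transform('mean')
df['D'] = df['D'].fillna(group_mean).fillna(df['D'].mean())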
I think you need:
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
Sample:
df = pd.DataFrame({'A': [1, 1, 1, 3],
                   'B': [1, 1, 1, 3],
                   'C': [1, 1, 1, 3],
                   'D': [1, np.nan, 3, 5]})
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Link to duplicate of this question for further information:
Pandas Dataframe: Replacing NaN with row average
Another suggested way of doing it mentioned in the link is using a simple fillna on the transpose:
df.T.fillna(df.mean(axis=1)).T

Concat a list of pandas data frames, but ignoring column names

Sub-title: Dumb it down pandas, stop trying to be clever.
I've a list (res) of single-column pandas data frames, each containing the same kind of numeric data, but each with a different column name. The row indices have no meaning. I want to put them into a single, very long, single-column data frame.
When I do pd.concat(res) I get one column per input file (and loads and loads of NaN cells). I've tried various values for the parameters (*), but none do what I'm after.
Edit: Sample data:
res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]
I have an ugly-hack solution: copy every data frame and give it a new column name:
newList = []
for r in res:
    r.columns = ["same"]
    newList.append(r)
pd.concat(newList, ignore_index=True)
Surely that is not the best way to do it??
BTW, pandas: concat data frame with different column name is similar, but my question is even simpler, as I don't want the index maintained. (I also start with a list of N single-column data frames, not a single N-column data frame.)
*: E.g. axis=0 is the default behaviour. axis=1 gives an error. join="inner" is just silly (I only get the index). ignore_index=True renumbers the index, but I still get lots of columns and lots of NaNs.
UPDATE for empty lists
I was having problems (with all the given solutions) when the data included an empty data frame, something like:
res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': []}),
    pd.DataFrame({'D': [100, 200, 300, 400]}),
]
The trick was to force the type, by adding .astype('float64'). E.g.
pd.Series(np.concatenate([df.values.ravel().astype('float64') for df in res]))
or:
pd.concat(res,axis=0).astype('float64').stack().reset_index(drop=True)
I think you need concat with stack:
print (pd.concat(res, axis=1))
A B C
0 1.0 9 100.0
1 2.0 8 200.0
2 3.0 7 300.0
3 NaN 6 400.0
4 NaN 5 NaN
5 NaN 4 NaN
print (pd.concat(res, axis=1).stack().reset_index(drop=True))
0 1.0
1 9.0
2 100.0
3 2.0
4 8.0
5 200.0
6 3.0
7 7.0
8 300.0
9 6.0
10 400.0
11 5.0
12 4.0
dtype: float64
Another solution with numpy.ravel for flattening:
print (pd.Series(pd.concat(res, axis=1).values.ravel()).dropna())
0 1.0
1 9.0
2 100.0
3 2.0
4 8.0
5 200.0
6 3.0
7 7.0
8 300.0
10 6.0
11 400.0
13 5.0
16 4.0
dtype: float64
print (pd.DataFrame(pd.concat(res, axis=1).values.ravel(), columns=['col']).dropna())
col
0 1.0
1 9.0
2 100.0
3 2.0
4 8.0
5 200.0
6 3.0
7 7.0
8 300.0
10 6.0
11 400.0
13 5.0
16 4.0
Solution with list comprehension:
print (pd.Series(np.concatenate([df.values.ravel() for df in res])))
0 1
1 2
2 3
3 9
4 8
5 7
6 6
7 5
8 4
9 100
10 200
11 300
12 400
dtype: int64
I would use a list comprehension such as:
import pandas as pd
res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]
x = []
[x.extend(df.values.tolist()) for df in res]
pd.DataFrame(x)
Out[49]:
0
0 1
1 2
2 3
3 9
4 8
5 7
6 6
7 5
8 4
9 100
10 200
11 300
12 400
I tested speed for you.
%timeit x = []; [x.extend(df.values.tolist()) for df in res]; pd.DataFrame(x)
10000 loops, best of 3: 196 µs per loop
%timeit pd.Series(pd.concat(res, axis=1).values.ravel()).dropna()
1000 loops, best of 3: 920 µs per loop
%timeit pd.concat(res, axis=1).stack().reset_index(drop=True)
1000 loops, best of 3: 902 µs per loop
%timeit pd.DataFrame(pd.concat(res, axis=1).values.ravel(), columns=['col']).dropna()
1000 loops, best of 3: 1.07 ms per loop
%timeit pd.Series(np.concatenate([df.values.ravel() for df in res]))
10000 loops, best of 3: 70.2 µs per loop
It looks like
pd.Series(np.concatenate([df.values.ravel() for df in res]))
is the fastest.
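Another option worth mentioning (a sketch, not benchmarked above) is to pull each frame's single column out as a Series and concatenate those, so the column names never come into play:
out = pd.concat([df.iloc[:, 0] for df in res], ignore_index=True)
print(out)
# If some frames are empty, you may still want to force the dtype as in the
# update above, e.g. .astype('float64').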

Fill non-consecutive missings with consecutive numbers

For a given data frame...
data = pd.DataFrame([[1., 6.5], [1., np.nan],[5, 3], [6.5, 3.], [2, np.nan]])
that looks like this...
0 1
0 1.0 6.5
1 1.0 NaN
2 5.0 3.0
3 6.5 3.0
4 2.0 NaN
...I want to create a third column where all missing values of the second column are replaced with consecutive numbers. So the result should look like this:
0 1 2
0 1.0 6.5 NaN
1 1.0 NaN 1
2 5.0 3.0 NaN
3 6.5 3.0 NaN
4 2.0 NaN 2
(my data frame has many more rows, so imagine 70 missing values in the second column, so that the last number in the 3rd column would be 70)
How can I create the 3rd column?
You can do it this way. I took the liberty of renaming the columns to avoid confusion about what I am selecting; you can do the same with your dataframe using:
data = data.rename(columns={0:'a',1:'b'})
In [41]:
data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
Out[41]:
a b c
0 1.0 6.5 NaN
1 1.0 NaN 1
2 5.0 3.0 NaN
3 6.5 3.0 NaN
4 2.0 NaN 2
[5 rows x 3 columns]
Some explanation here of the one liner:
# we want just the rows where column 'b' is null:
data[data.b.isnull()]
# now construct a range of the same length as this dataframe, starting from 1:
range(1,len(data[data.b.isnull()]) + 1) # note we have to add a 1 at the end
# construct a new dataframe from this and crucially use the index of the null values:
pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index)
# now perform a merge and tell it we want to perform a left merge and use both sides indices, I've removed the verbose dataframe construction and replaced with new_df here but you get the point
data.merge(new_df,how='left', left_index=True, right_index=True)
Edit
You can also do it another way using @Karl.D's suggestion:
In [56]:
data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
data
Out[56]:
a b c
0 1.0 6.5 NaN
1 1.0 NaN 1
2 5.0 3.0 NaN
3 6.5 3.0 NaN
4 2.0 NaN 2
[5 rows x 3 columns]
Timings also suggest that Karl's method would be faster for larger datasets but I would profile this:
In [57]:
%timeit data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
%timeit data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
1000 loops, best of 3: 1.31 ms per loop
1000 loops, best of 3: 501 µs per loop
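For reference, a minimal end-to-end sketch of the cumsum/where approach, using the integer column labels from the question:
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5], [1., np.nan], [5, 3], [6.5, 3.], [2, np.nan]])

# Number the rows where column 1 is missing consecutively; leave the rest as NaN.
mask = data[1].isnull()
data[2] = mask.cumsum().where(mask)
print(data)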
