I have an array with missing values in various places.
import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
print(df)
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 NaN
7 8.0
8 9.0
dtype: float64
For each NaN, I want to take the value following it and divide it by two, then propagate that halving backwards through each preceding consecutive NaN, so I would end up with:
0 0.75
1 1.5
2 3.0
3 4.0
4 5.0
5 6.0
6 4.0
7 8.0
8 9.0
dtype: float64
I've tried df.interpolate(), but that doesn't seem to work with consecutive NaN's.
Another solution: back-fill the missing values with bfill() and then divide each back-filled value by 2 raised to its position within the run of consecutive NaNs:
# reverse the Series and mark the NaNs
b = df[::-1].isnull()
# running length of each consecutive-NaN run (0 for valid values); the divisor is 2**count
a = (b.cumsum() - b.cumsum().where(~b).ffill()).rpow(2)
print(a)
8 1.0
7 1.0
6 2.0
5 1.0
4 1.0
3 1.0
2 1.0
1 2.0
0 4.0
dtype: float64
print(df.bfill().div(a))
0 0.75
1 1.50
2 3.00
3 4.00
4 5.00
5 6.00
6 4.00
7 8.00
8 9.00
dtype: float64
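A quick sanity check (added here as a sketch, not part of the original timings) that the divisor really grows as 2, 4, 8, ... over a longer run of NaNs:
# three consecutive NaNs in front of an 8.0 -> expect 1.0, 2.0, 4.0, 8.0
s = pd.Series([np.nan, np.nan, np.nan, 8.0])
b = s[::-1].isnull()
a = (b.cumsum() - b.cumsum().where(~b).ffill()).rpow(2)
print(s.bfill().div(a))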
Timings (len(df)=9k):
In [315]: %timeit (mat(df))
100 loops, best of 3: 11.3 ms per loop
In [316]: %timeit (jez(df1))
100 loops, best of 3: 2.52 ms per loop
Code for timings:
import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
print(df)
df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()
def jez(df):
    b = df[::-1].isnull()
    a = (b.cumsum() - b.cumsum().where(~b).ffill()).rpow(2)
    return df.bfill().div(a)
def mat(df):
    prev = 0
    new_list = []
    for i in df.values[::-1]:
        if np.isnan(i):
            new_list.append(prev/2.)
            prev = prev / 2.
        else:
            new_list.append(i)
            prev = i
    return pd.Series(new_list[::-1])
print (mat(df))
print (jez(df1))
You can do something like this:
import numpy as np
import pandas as pd
x = np.arange(1,10).astype(float)
x[[0,1,6]] = np.nan
df = pd.Series(x)
prev = 0
new_list = []
for i in df.values[::-1]:
    if np.isnan(i):
        new_list.append(prev/2.)
        prev = prev / 2.
    else:
        new_list.append(i)
        prev = i
df = pd.Series(new_list[::-1])
It loops over the values of the df in reverse, keeping track of the previous value. It appends the actual value if it is not NaN, otherwise half of the previous value.
This might not be the most sophisticated pandas solution, but you can change the behaviour quite easily.
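For reuse, the same loop can be wrapped in a small helper that also keeps the original index (a sketch; halve_backfill is just an illustrative name):
def halve_backfill(s):
    # walk the values from the end; each NaN gets half of the value resolved just after it
    prev = 0
    out = []
    for v in s.values[::-1]:
        if np.isnan(v):
            prev = prev / 2.
            out.append(prev)
        else:
            prev = v
            out.append(v)
    return pd.Series(out[::-1], index=s.index)
print(halve_backfill(pd.Series(x)))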
Related
Maybe a bit of a beginner's question, but I'm really stuck.
I have a dataframe with certain values in a column called x, split into two groups.
x group
1 1.7 a
2 0 b
3 2.3 b
4 2.7 b
5 8.6 a
6 5.4 b
7 4.2 a
8 5.7 b
My goal is, for each row, to count how many rows of the other group have a value greater than the current one. To make it clearer: for the first row (group a), I am looking for how many rows of group b are greater than 1.7 (the answer is 4). The end result should look like:
x group result
1 1.7 a 4
2 0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
The real dataframe has many more rows, so ideally I would also like a relatively fast solution.
Use np.searchsorted:
df['result'] = 0
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
# count values of the other group strictly greater than each x
# (side='right' ensures equal values are not counted as "greater")
df.loc[a.index, 'result'] = len(b) - np.searchsorted(np.sort(b), a, side='right')
df.loc[b.index, 'result'] = len(a) - np.searchsorted(np.sort(a), b, side='right')
Output:
>>> df
x group result
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
Performance for 130K records
>>> %%timeit
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
len(b) - np.searchsorted(np.sort(b), a, side='right')
len(a) - np.searchsorted(np.sort(a), b, side='right')
31.8 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Setup:
N = 130000
df = pd.DataFrame({'x': np.random.randint(1, 1000, N),
'group': np.random.choice(['a', 'b'], N, p=(0.7, 0.3))})
Here is one way, based on ranking the x values per group in descending order and using merge_asof to merge df with itself, after swapping the group names so that rows of a are matched against the ranked values of b and vice versa.
# needed for the merge_asof
df = df.sort_values('x')
res = (
    pd.merge_asof(
        df.reset_index(),  # reset to keep the original index order
        df.assign(
            # swap the group names so a rows are compared with b rows in the merge
            group=df['group'].map({'a': 'b', 'b': 'a'}),
            # rank descending to get the number of values above the current one
            result=df.groupby('group')['x'].rank(ascending=False)),
        by='group',  # match on group, knowing the groups were swapped in the second frame
        on='x', direction='forward')  # look forward on x to pick up the rank
    # complete the result column
    .fillna({'result': 0})
    .astype({'result': int})
    # cosmetic: restore the original index
    .set_index('index')
    .rename_axis(None)
    .sort_index()
)
print(res)
# x group result
# 1 1.7 a 4
# 2 0.0 b 3
# 3 2.3 b 2
# 4 2.7 b 2
# 5 8.6 a 0
# 6 5.4 b 1
# 7 4.2 a 2
# 8 5.7 b 1
You can sort the values and use masks to cumsum per the other group:
df2 = df.sort_values(by='x', ascending=False)
m = df2['group'].eq('a')
df['result'] = m.cumsum().mask(m).fillna((~m).cumsum().where(m)).astype(int)
Output:
x group result
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
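The one-liner is dense, so here is the same idea spelled out step by step (an equivalent sketch, not a different algorithm):
df2 = df.sort_values(by='x', ascending=False)
m = df2['group'].eq('a')    # True for 'a' rows, walking x from largest to smallest
a_seen = m.cumsum()         # running count of 'a' values with x at or above the current row's x
b_seen = (~m).cumsum()      # running count of 'b' values with x at or above the current row's x
# 'b' rows keep the 'a' count, 'a' rows are filled with the 'b' count; assignment aligns on the index
df['result'] = a_seen.mask(m).fillna(b_seen.where(m)).astype(int)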
This should be pretty efficient: just one sort of all the x values and then a couple of cumulative sums.
df2 = df.sort_values('x', ascending=False).reset_index()
df2['acount'] = (df2['group'] == 'a').cumsum()
df2['bcount'] = (df2['group'] == 'b').cumsum()
df2
at this point df2 looks like this:
index x group acount bcount
0 5 8.6 a 1 0
1 8 5.7 b 1 1
2 6 5.4 b 1 2
3 7 4.2 a 2 2
4 4 2.7 b 2 3
5 3 2.3 b 2 4
6 1 1.7 a 3 4
7 2 0.0 b 3 5
now restore the index and choose acount or bcount depending on the group:
df2 = df2.set_index('index').sort_index()
df2['result'] = np.where(df2['group'] == 'a', df2['bcount'], df2['acount']).astype(int)
df2[['x', 'group', 'result']]
final result
x group result
index
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
Performance (on the same 130,000-row test as @Corralien, though not on the same hardware, obviously):
65.4 ms ± 957 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not too different from Corralien's solution, but you can use broadcasting to compare every element of group 'a' against every element of group 'b', count how many satisfy the condition, and then join the result back. Note that this materialises a len(a) x len(b) boolean matrix, so memory use grows quadratically.
import pandas as pd
import numpy as np
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
result = pd.concat([
pd.Series(np.sum(a.to_numpy() < b.to_numpy()[:, None], axis=0), index=a.index),
pd.Series(np.sum(b.to_numpy() < a.to_numpy()[:, None], axis=0), index=b.index)])
df['result'] = result
x group result
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
A quick (though quadratic, so slow on large frames) solution is to use pandas' DataFrame.apply method:
df['result'] = df.apply(lambda row: df[(df['group'] != row['group']) & (df['x'] > row['x'])].x.count(), axis=1)
I would like to interpolate missing values within groups in a dataframe, using the values from the preceding and following rows.
Here is the df (there are more records within each group, but for this example I left 3 per group):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Group': ['a','a','a','b','b','b','c','c','c'],'Yval': [1,np.nan,5,2,np.nan,8,5,np.nan,10],'Xval': [0,3,2,4,5,8,3,1,9],'PTC': [0,1,0,0,1,0,0,1,0]})
df:
Group Yval Xval PTC
0 a 1.0 0 0
1 a NaN 3 1
2 a 5.0 2 0
3 b 2.0 4 0
4 b NaN 5 1
5 b 8.0 8 0
6 c 5.0 3 0
7 c NaN 1 1
8 c 10.0 9 0
For each PTC (point to calculate) row, I need to interpolate Yval using the Xval, Yval pairs from the previous (-1) and next (+1) rows.
i.e. for group a I would like:
df.iloc[1,1]=np.interp(3, [0,2], [1,5])
Here is what I tried, using loc, the shift method, and the interp function found in this post:
df.loc[(df['PTC'] == 1), ['Yval']]= \
np.interp(df['Xval'], (df['Xval'].shift(+1),df['Xval'].shift(-1)),(df['Yval'].shift(+1),df['Yval'].shift(-1)))
Error I get:
ValueError: object too deep for desired array
# look up the neighbouring rows; shifting within each group avoids leaking values
# across group boundaries (with this data a plain shift would also work, since no
# PTC row sits at a group edge)
df['Xval_prev'] = df.groupby('Group')['Xval'].shift(1)
df['Xval_next'] = df.groupby('Group')['Xval'].shift(-1)
df['Yval_prev'] = df.groupby('Group')['Yval'].shift(1)
df['Yval_next'] = df.groupby('Group')['Yval'].shift(-1)
# np.interp expects its x coordinates in increasing order, hence (prev, next)
df['Yval_interpol'] = df.apply(lambda r: np.interp(r['Xval'], [r['Xval_prev'], r['Xval_next']], [r['Yval_prev'], r['Yval_next']]), axis=1)
df['Yval'] = np.where(df['PTC'] == 1, df['Yval_interpol'], df['Yval'])
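If you prefer to avoid apply, the same two-point interpolation can be computed column-wise. This is a sketch under the same assumption (every PTC row has a valid previous and next row within its group); note that, unlike np.interp, it extrapolates linearly when Xval falls outside the [previous, next] interval, which happens for groups a and c in this sample:
g = df.groupby('Group')
x_prev, x_next = g['Xval'].shift(1), g['Xval'].shift(-1)
y_prev, y_next = g['Yval'].shift(1), g['Yval'].shift(-1)
mask = df['PTC'].eq(1)
# straight line through the two neighbouring points, evaluated at the row's own Xval
slope = (y_next - y_prev) / (x_next - x_prev)
df.loc[mask, 'Yval'] = y_prev + slope * (df['Xval'] - x_prev)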
I have code and it works fine:
import numpy as np
import pandas as pd
x = np.arange(10)
condlist = [x<3, x==5, x>5]
choicelist = [x, x**2, x**3]
a=np.select(condlist, choicelist)
Now let's add:
y=pd.Series(x)
Let's now use y instead of x. We need to get the same result (the same content as a, but as a Series) using pandas only, with the conditions and choices coded as above. How can that be done?
Construct a DataFrame from choicelist and use DataFrame.where with condlist:
s = pd.DataFrame(choicelist).where(condlist).ffill().fillna(0).iloc[-1]
Out[99]:
0 0.0
1 1.0
2 2.0
3 0.0
4 0.0
5 25.0
6 216.0
7 343.0
8 512.0
9 729.0
Name: 2, dtype: float64
If the conditions do not overlap, you may also use sum:
s = pd.DataFrame(choicelist).where(condlist,0).sum()
Out[114]:
0 0
1 1
2 2
3 0
4 0
5 25
6 216
7 343
8 512
9 729
dtype: int64
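Note that np.select keeps the first matching condition, while the ffill()/iloc[-1] trick keeps the last one, so they only agree when the conditions do not overlap (they do not overlap here, so the results match). A pandas-only sketch that reproduces the first-match rule, rebuilding the conditions and choices from y:
condlist = [y < 3, y == 5, y > 5]
choicelist = [y, y**2, y**3]
s = pd.Series(0, index=y.index)
# apply the conditions in reverse so that earlier conditions win on overlap
for cond, choice in zip(reversed(condlist), reversed(choicelist)):
    s = s.mask(cond, choice)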
I have a 21840x39 data frame. A few of my columns are numerically valued and I want to make sure they are all in the same data type (which I want to be a float).
Instead of naming all the columns out and converting them:
df[['A', 'B', 'C', ...]] = df[['A', 'B', 'C', ...]].astype(float)
Can I do a for loop that will allow me to say something like "convert to float from column 18 to column 35"?
I know how to do one column: df['A'] = df['A'].astype(float)
But how can I do multiple columns? I tried with list slicing within a loop but couldn't get it right.
First idea is to convert the selected columns; Python counts from 0, so for columns 18 to 35 use:
df.iloc[:, 17:35] = df.iloc[:, 17:35].astype(float)
If that does not work (because of a possible pandas bug), use another solution:
df = df.astype(dict.fromkeys(df.columns[17:35], float))
Sample - convert the 8th to 15th columns:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
columns=list('abcdefghijklmnopqr')).astype(str)
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8 0 0 8 9 3 7 2 3 6 5
1 0 4 8 6 4 1 1 5 9 5 6 6 6 5 4 6 4 2
2 3 4 7 1 4 9 3 2 0 9 1 2 7 1 0 2 8 8
df = df.astype(dict.fromkeys(df.columns[7:15], float))
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8.0 0.0 0.0 8.0 9.0 3.0 7.0 2.0 3 6 5
1 0 4 8 6 4 1 1 5.0 9.0 5.0 6.0 6.0 6.0 5.0 4.0 6 4 2
2 3 4 7 1 4 9 3 2.0 0.0 9.0 1.0 2.0 7.0 1.0 0.0 2 8 8
Tweaked @jezrael's code, as typing in column names is (I feel) a good option.
import pandas as pd
import numpy as np
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
columns=list('abcdefghijklmnopqr')).astype(str)
print(df)
columns = list(df.columns)
#change the first and last column names below as required
df = df.astype(dict.fromkeys(
df.columns[columns.index('h'):(columns.index('o')+1)], float))
print (df)
Leaving the original answer below, but note: never loop in pandas if vectorized alternatives exist.
If I had a dataframe and wanted to change columns 'col3' to 'col5' (human readable names) to floats I could...
import pandas as pd
import re
df = pd.read_csv('dummy_data.csv')
df
columns = list(df.columns)
#change the first and last column names below as required
start_column = columns.index('col3')
end_column = columns.index('col5')
for index, col in enumerate(columns):
    if (start_column <= index) & (index <= end_column):
        df[col] = df[col].astype(float)
df
...by just changing the column names. Perhaps it's easier to work with column names and specify 'from this one' and 'to that one' (inclusive).
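A vectorized, name-based alternative is an inclusive label slice with .loc (a sketch; 'col3' and 'col5' are the illustrative names from above, and the columns between them are assumed to be the ones you want):
# pick the column labels from 'col3' through 'col5' (inclusive) and convert them in one go
cols = df.loc[:, 'col3':'col5'].columns
df = df.astype(dict.fromkeys(cols, float))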
Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column.
(You might recognize my data from the Titanic data set)...
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 Nan
I want to fill the NaN with a value from series 'pclass_lookup':
pclass_lookup
1 38.1
2 29.4
3 25.2
I have tried doing fillna with indexing like:
df.Age.fillna(pclass_lookup[df.Pclass]), but it gives me an error of
ValueError: cannot reindex from a duplicate axis
A lambda was a try too:
df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])
but that doesn't seem to fill it right either. Am I totally missing the boat here?
Firstly, you have a duff value in row 4: it is actually the string 'Nan', which is not the same as NaN, so even if your code did work, that value would never be replaced.
So you need to replace that duff value first, and then you can just call map to perform the lookup for the NaN values:
In [317]:
df.Age.replace('Nan', np.nan, inplace=True)
df.loc[df['Age'].isnull(), 'Age'] = df['Pclass'].map(pclass_lookup)
df
Out[317]:
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 29.4
4 1 38.1
Timings
For a df with 5000 rows:
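The benchmark frame itself is not shown; a sketch of a setup that could reproduce its shape (assumed, not from the original post), with pclass_lookup as a Series keyed by Pclass:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 5000
df = pd.DataFrame({'Pclass': np.random.choice([1, 2, 3], n),
                   'Age': np.random.choice([23.0, 24.0, 33.0, np.nan], n)})
pclass_lookup = pd.Series([38.1, 29.4, 25.2], index=[1, 2, 3])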
In [26]:
%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:
%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:
%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop
So you can see that apply, as it iterates row-wise, scales poorly compared to the other two methods, which are vectorised, but map is still the fastest.
Building on the response of @vrajs5:
# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
>>> df
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
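For what it's worth, the same lookup fill can also be written as a one-liner, since fillna accepts a Series aligned on the index (using the dummy data above):
df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))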
The following should work for you:
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
df
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 NaN
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1 38.1
2 29.4
3 25.2
dtype: float64
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1