Pandas rolling apply with missing data - python

I want to do a rolling computation on missing data.
Sample code (for the sake of simplicity I'm giving an example of a rolling sum, but I want to do something more generic):
import numpy as np
import pandas

foo = lambda z: z[pandas.notnull(z)].sum()
x = np.arange(10, dtype="float")
x[6] = np.NaN
x2 = pandas.Series(x)
pandas.rolling_apply(x2, 3, foo)
which produces:
0 NaN
1 NaN
2 3
3 6
4 9
5 12
6 NaN
7 NaN
8 NaN
9 24
I think that during the "rolling", any window containing missing data is being ignored for the computation. I'm looking to get a result along the lines of:
0 NaN
1 NaN
2 3
3 6
4 9
5 12
6 9
7 12
8 15
9 24

In [7]: pandas.rolling_apply(x2, 3, foo, min_periods=2)
Out[7]:
0 NaN
1 1
2 3
3 6
4 9
5 12
6 9
7 12
8 15
9 24
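
Note: on recent pandas versions pandas.rolling_apply has been removed in favour of the Series.rolling interface; assuming a recent release, an equivalent sketch of the call above would be:

x2.rolling(window=3, min_periods=2).apply(foo, raw=True)

Here raw=True passes each window to foo as a NumPy array rather than a Series, which foo already handles via pandas.notnull.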

It would be better to replace the NA values in the data-set with logical substitutions before operating on them.
For Numerical Data:
For your given example, a simple mean around the NA would fill it perfectly, but what if x[7] were NaN as well?
Analysis of the surrounding data shows a linear pattern, so a lerp (linear interpolation) is in order.
The same goes for polynomial, exponential, log, and periodic (cosine) data.
If an inflection point, i.e. a change in the sign of the second derivative of the data (take pairwise differences twice and note whether the sign changes), falls within the missing stretch, its position is unknowable unless the data on the other side pins it down; if it does not, pick a plausible point and continue.
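
As a concrete sketch of the interpolation idea (using the series from the question; this is an illustration, not part of the original answer):

import numpy as np
import pandas as pd

x = np.arange(10, dtype="float")
x[6] = np.nan
x2 = pd.Series(x)

# fill the gap under a linearity assumption, then roll over the filled series
filled = x2.interpolate(method='linear')
result = filled.rolling(window=3).sum()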
For Categorical Data:
Use scipy.stats.mode to replace the missing values with the most common of the nearest 3:
from scipy import stats
x = pandas.rolling_apply(x2, 3, lambda w: stats.mode(w, nan_policy='omit').mode[0])
For Static data:
Use:
x = x.fillna(0)
replacing 0 with the appropriate value.

Related

Dynamically Fill NaN Values in Dataframe

I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill as the rule is dynamic, taking the value from the previous row and dividing by the number of consecutive NaN + 1. For example, rows 3 and 4 should be replaced with 12 as 24/2, rows 6, 7 and 8 should be replaced with 5. All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for the rows whose own value is not NaN but whose next or previous value is NaN. Those rows mark the first row of each such group.
So the m in above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows that look like [True, <all False>], because those are the groups I want to take the average of. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
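That is (a one-liner reusing df and m from the code above):
print(df.groupby(m.cumsum()).ngroup())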
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what are the groups.
Now for each group you can get the mean of the group if the group has any NaN value. This is accomplished by checking for NaNs using x.isna().any().
If the group has any NaN value, assign the mean after filling NaN with 0; otherwise just keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? There is a method= argument that would probably fit your needs.
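For instance, a minimal sketch on the question's frame (note that the default linear method fills the gaps differently from the divide-by-count rule described in the question):

df["Column 1"] = df["Column 1"].interpolate()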
However, if you really want to do it as described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job.)
import pandas as pd
import numpy as np
df = pd.DataFrame([10,
                   12,
                   24,
                   np.NaN,
                   15,
                   np.NaN,
                   np.NaN])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue
df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0

Change NaN values with returned value of API

I need to change the NaN values in one column with the value returned by an API. I have already written a function that calls the API and returns a value, but I am not sure how to change the values now. Would I need to loop through the whole DataFrame, or are there other solutions? The dataframe looks like this:
colname1  colname2  colname3
1         2         3
4         5         6
7         8         9
NaN       11        12
My function takes the values 11 and 12 and returns 10, which should go into df["colname1"] for the last row. My question is how I can loop through the whole DataFrame to handle all such instances.
You can use apply. It is best to first subset the rows with NaN, to avoid calling the API for the other values:
def api_func(a, b):
    return 10

# mask of the rows with NaN in colname1
m = df['colname1'].isna()

# output of API request
s = df.loc[m].apply(lambda r: api_func(r['colname2'], r['colname3']), axis=1)

# updating the column (in place)
df['colname1'].update(s)
output:
colname1 colname2 colname3
0 1.0 2 3
1 4.0 5 6
2 7.0 8 9
3 10.0 11 12
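
For reference, the example frame above can be reconstructed like this (a sketch with the values from the question; in practice api_func would wrap the real API call):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colname1': [1, 4, 7, np.nan],
    'colname2': [2, 5, 8, 11],
    'colname3': [3, 6, 9, 12],
})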

How can I choose a random sample of size n from values from a single pandas dataframe column, with repeating values occurring a maximum of 2 times?

My dataframe looks like this:
Identifier Strain Other columns, etc.
1 A
2 C
3 D
4 B
5 A
6 C
7 C
8 B
9 D
10 A
11 D
12 D
I want to choose n rows at random while maintaining diversity in the strain values. For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.
I've tried converting the Strain column into a numpy array and using np.random.choice, but that didn't seem to work. I've also tried using .sample, but it does not maximize strain diversity.
This is my latest attempt, which outputs a sample of size 7 in order (identifiers 0-7) in which the Strains are all the same.
randomsample = df[df.Strain == np.random.choice(df['Strain'].unique())].reset_index(drop=True)
I believe there's something in numpy that does exactly this, but can't recall which. Here's a fairly fast approach:
1. Shuffle the data for randomness
2. Enumerate the rows within each group
3. Sort by the enumeration above
4. Slice the top n rows
So in code:
n = 6
df = df.sample(frac=1) # step 1
enums = df.groupby('Strain').cumcount() # step 2
orders = np.argsort(enums) # step 3
samples = df.iloc[orders[:n]] # step 4
Output:
Identifier Strain Other columns, etc.
2 3 D NaN
7 8 B NaN
0 1 A NaN
5 6 C NaN
4 5 A NaN
8 9 D NaN
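
To sanity-check the diversity of the draw, you can look at the strain counts of the sample (a quick check, not part of the original answer):

print(samples['Strain'].value_counts())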

How to add or combine two columns into another one in a dataframe if they meet a condition

I'm new to this, so this may sound weird but basically, I have a large dataframe but for simplification purposes let's assume the dataframe is this:
import pandas as pd
import numpy as np
dfn = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [6, 7, 8, 9, 10],
                    'c': np.nan})
dfn
Output:
a b c
0 1 6 NaN
1 2 7 NaN
2 3 8 NaN
3 4 9 NaN
4 5 10 NaN
What I want to do is to fill in values in the 'c' column based off of a condition, namely if the corresponding row value in 'a' is odd, then add it to the corresponding row value 'b' and input into 'c', else, just use the 'a' value for 'c'.
What I currently have is this:
for row in range(dfn.shape[0]):
    if dfn.loc[row]['a'] % 2 != 0:
        dfn.loc[row]['c'] = dfn.loc[row]['a'] + dfn.loc[row]['b']
    else:
        dfn.loc[row]['c'] = dfn.loc[row]['a']
dfn
Output:
a b c
0 1 6 NaN
1 2 7 NaN
2 3 8 NaN
3 4 9 NaN
4 5 10 NaN
Nothing seems to happen here and I'm not entirely sure why.
I've also tried a different approach of:
is_odd=dfn[dfn['a']%2!=0]
is_odd['c'] = is_odd['a'] + is_odd['b']
is_odd
Here, weirdly I get the right output:
a b c
0 1 1 2
2 3 3 6
4 5 5 10
But when I call dfn again, it comes out with all NaN values.
I've also tried doing it without using a variable name and nothing happens.
Any idea what I'm missing or if there's a way of doing this?
Thanks!
Use numpy where, which works for conditionals. It is akin to an if statement in Python, but significantly faster. I rarely use iterrows, since I don't find it as efficient as numpy where.
dfn['c'] = np.where(dfn['a'] % 2 != 0,
                    dfn.a + dfn.b,
                    dfn.a)
a b c
0 1 6 7
1 2 7 2
2 3 8 11
3 4 9 4
4 5 10 15
Basically, the first argument to np.where defines your condition, which in this case is whether the 'a' column holds an odd number. Where it does, the second argument is used; where the number is even, the third argument is used. You can think of it as a vectorized if-else statement.
Use Series.mod and Series.where to get a copy of column b with 0 where there is an even value in a, then add this series to a.
dfn['c'] = dfn['b'].where(dfn['a'].mod(2).eq(1), 0).add(dfn['a'])
print(dfn)
a b c
0 1 6 7
1 2 7 2
2 3 8 11
3 4 9 4
4 5 10 15
Alternative
dfn['c'] = dfn['a'].mask(dfn['a'].mod(2).eq(1), dfn['a'].add(dfn['b']))
dfn.loc[row]['c']=... is always wrong. dfn.loc[row] may be either a copy or a view, so you cannot know what will happen. The correct way is:
dfn.loc[row, 'c']=...
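For completeness, a sketch of the question's loop with that fix applied (it works, but is still slower than the vectorized answers above):

for row in range(dfn.shape[0]):
    if dfn.loc[row, 'a'] % 2 != 0:
        dfn.loc[row, 'c'] = dfn.loc[row, 'a'] + dfn.loc[row, 'b']
    else:
        dfn.loc[row, 'c'] = dfn.loc[row, 'a']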
Anyway, here you should avoid the iteration and use np.where, as suggested in the other answers.
Here is my solution, which stays close to the original idea of the question's author; I hope it helps:
def oddadd(x):
    if x['a'] % 2 != 0:
        return x['a'] + x['b']
    else:
        return x['a']

dfn["c"] = dfn.apply(oddadd, axis=1)

Selecting the top 50% of names from the columns of a pandas dataframe

I have a pandas dataframe that looks like this. The rows and the columns have the same name.
name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10
I can get the 5 largest values with df['column_name'].nlargest(n=5), but if I have to return the largest 50% in descending order, is there anything built into pandas or do I have to write a function for it? How can I get them? I am quite new to Python. Please help me out.
UPDATE: So let's take column a into consideration; it has the values 10, 5, -, 6, 8, 3 and 4. I have to sum them all up and take the top 50% of that total. The total in this case is 36, and 50% of it is 18. So from column a I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%.
Sorting is flexible :)
df.sort_values('column_name',ascending=False).head(int(df.shape[0]*.5))
Update: the frac argument is available only on .sample(), not on .head or .tail. df.sample(frac=.5) does give 50%, but head and tail expect only an int; df.head(frac=.5) fails with TypeError: head() got an unexpected keyword argument 'frac'.
Note: on int() vs round()
int(3.X) == 3 # True where 0 <= X <= 9
round(3.45) == 3 # True
round(3.5) == 4 # True
So when doing .head(int/round ...) do think of what behaviour fits your need.
Updated: Requirements
So let's take column a into consideration and it has values like 10,
5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of
them. so the total, in this case, is 36. 50% of these values would be
18. So from column a, I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%. -Matt
A silly hack would be to sort, take the cumulative sum, divide it by the total to find where the halfway point falls, and then use that to select part of your sorted column, e.g.
import pandas as pd
from io import StringIO

data = pd.read_csv(
    StringIO("""name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10"""),
    sep=' ', index_col='name'
).dropna(axis=1).apply(
    pd.to_numeric, errors='coerce', downcast='signed')

col_a = data[['a']].sort_values(by='a', ascending=False)
x = col_a[(col_a.cumsum() / col_a.sum()) <= .5].dropna()
print(x)
Outcome:
         a
name
a     10.0
e      8.0
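
The same idea can be applied to every column with a small helper (a sketch; top_half is a name introduced here, and rows that became NaN from the '-' entries are simply excluded by the mask):

def top_half(col):
    s = col.sort_values(ascending=False)
    return s[s.cumsum() / s.sum() <= 0.5]

for name in data.columns:
    print(top_half(data[name]))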
You could sort the data frame and display only 90% of the data
df.sort_values('column_name',ascending=False).head(round(0.9*len(df)))
data.csv
name,a,b,c,d,e,f,g
a,10,5,4,8,5,6,4
b,5,10,6,5,4,3,3
c,-,4,9,3,6,5,7
d,6,9,8,6,6,8,2
e,8,5,4,4,14,9,6
f,3,3,-,4,5,14,7
g,4,5,8,9,6,7,10
test.py
#!/bin/python
import pandas as pd

def percentageOfList(l, p):
    return l[0:int(len(l) * p)]

df = pd.read_csv('data.csv')
print(percentageOfList(df.sort_values('b', ascending=False)['b'], 0.9))
