I am doing some pandas interpolation in a series in which the index is not continuous. So it can be something like this:
    Value Customer_id
0     5.0           A
1     NaN           A
10    9.0           A
11   10.0           B
12    NaN           B
13   30.0           B
I'm interpolating per Customer_id (here it makes no difference, but my real dataframe can have NaNs at the first or last row of a customer)
So I'm doing
series = series.groupby('Customer_id').apply(lambda group: group.interpolate(method=interpolation_method))
Where interpolation_method is 'cubic' or 'index' (I'm testing both, for different purposes).
How can I do the interpolation and keep the original index, either as a column or as the index itself, so that I can then join with other dataframes?
You can define your own interpolation function using np.polyfit. Let's say you have this dataframe, where customer A begins with NaN:
Value Customer_id
0 NaN A
1 5.0 A
10 9.0 A
11 10.0 B
12 NaN B
13 30.0 B
Fill the missing values with a custom interpolation:
import numpy as np
import pandas as pd

def interpolate(group):
    # fit a straight line through this customer's non-missing points
    x = group.dropna()
    params = np.polyfit(x.index, x['Value'], deg=1)
    predicted = np.polyval(params, group.index)
    s = pd.Series(predicted, index=group.index)
    # keep the observed values, fill the gaps with the fitted line
    return group['Value'].combine_first(s)

df.groupby('Customer_id').apply(interpolate).to_frame().reset_index(level=0)
Result:
Customer_id Value
0 A 4.555556
1 A 5.000000
10 A 9.000000
11 B 10.000000
12 B 20.000000
13 B 30.000000
This assumes that there are at least 2 valid values per customer.
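As a sanity check on the 4.555556 in the first row: with deg=1, np.polyfit draws the line through customer A's two observed points (1, 5) and (10, 9), so the prediction at index 0 is 5 - 4/9:

```python
import numpy as np

# line through customer A's observed points (1, 5) and (10, 9)
params = np.polyfit([1, 10], [5, 9], deg=1)

# extrapolate back to index 0: the slope is 4/9, so the value is 5 - 4/9
print(np.polyval(params, 0))
```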
Related
I have a dataframe:
x = pd.DataFrame({'1': [1, 2, 3, 2, 5, 6, 7, 8, 9],
                  '2': [2, 5, 6, 8, 10, np.nan, 6, np.nan, np.nan],
                  '3': [10, 10, 10, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
I am trying to generate an average of a row, but only over values greater than 5. For instance, if a row had values of 3, 6, 10, the average would be 8 ((6+10)/2); the 3 is ignored as it is below 5.
The equivalent in excel would be =AVERAGEIF(B2:DX2,">=5")
You can keep only the values greater than 5 with where (everything else becomes NaN), then take the row mean:
x.where(x > 5).mean(axis=1)
Or mask the values that are at most 5:
x.mask(x <= 5).mean(axis=1)
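On the frame from the question, both forms agree; a quick runnable check (mean skips NaN by default, which is what makes this work):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'1': [1, 2, 3, 2, 5, 6, 7, 8, 9],
                  '2': [2, 5, 6, 8, 10, np.nan, 6, np.nan, np.nan],
                  '3': [10, 10, 10, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})

# keep only values > 5 (everything else becomes NaN), then average each row
kept = x.where(x > 5).mean(axis=1)

# equivalent: hide values <= 5, then average each row
masked = x.mask(x <= 5).mean(axis=1)
```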
You can create a small custom function that, within each row, filters out values smaller than or equal to a given threshold, and apply it to each row of your dataframe:
def average_if(s, value=5):
    # keep only entries strictly greater than the threshold
    s = s.loc[s > value]
    return s.mean()

x.apply(average_if, axis=1)
0 10.0
1 10.0
2 8.0
3 8.0
4 10.0
5 6.0
6 6.5
7 8.0
8 9.0
dtype: float64
I want to calculate the max value in the past 3 rolling rows, ignoring NaN if I see them. I assumed that skipna would do that, but it doesn't. How can I ignore NaN, and also what is skipna supposed to do?
In this code
import pandas as pd
df = pd.DataFrame({'sales': [25, 20, 14]})
df['max'] = df['sales'].rolling(3).max(skipna=True)
print(df)
The last column is
sales max
0 25 NaN
1 20 NaN
2 14 25.0
But I want it to be
sales max
0 25 25.0
1 20 25.0
2 14 25.0
skipna= has the default value of True, so adding it explicitly in your code does not have any effect. If you were to set it to False, you would possibly get NaN as the max if you had NaNs in the original sales column. There is a nice explanation of why that would happen here.
In your example, you are getting those NaNs in the first two rows because the .rolling(3) call tells pandas that if there are fewer than 3 values in the rolling window, the result is NaN. You can set the second parameter (min_periods) of the .rolling() call to require at least one value:
df['max'] = df['sales'].rolling(3, min_periods=1).max()
df
# sales max
# 0 25 25.0
# 1 20 25.0
# 2 14 25.0
You can also use Series.bfill with your command:
df['max'] = df['sales'].rolling(3).max().bfill()
Output:
sales max
0 25 25.0
1 20 25.0
2 14 25.0
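For comparison, a minimal sketch running both fixes side by side on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'sales': [25, 20, 14]})

# require only 1 value in the window instead of all 3
via_min_periods = df['sales'].rolling(3, min_periods=1).max()

# or keep the default and backfill the leading NaNs afterwards
via_bfill = df['sales'].rolling(3).max().bfill()
```

Note the two are not always interchangeable: min_periods computes a genuine max over the partial window, while bfill merely copies the first complete-window result backwards; they happen to coincide on this data.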
Let's say I have the following df:
Letter Number
a 0
b 0
c 0
d 1
e 2
f 3
I want to apply the following formula to the df
for i in range(1, len(df)):
    x = df.loc[i, 'Number'] / df.loc[i-1, 'Number'] + df.loc[i, 'Number']
    df.loc[i, 'Number'] = x
Note: The column 'Number' only has zeros in the first few rows. After, there are no more zeros.
How would I apply the formula to the df without slicing the zeros off?
You can get the previous row's number by using shift(). Then you can compute the value using the formula you defined. For that we can use df.apply()
Here's how we can do it.
import pandas as pd
df = pd.DataFrame({'Letter': list('abcdef'), 'Number': [0, 0, 0, 1, 2, 3]})
print (df)
# capture the previous row's value
df['Prev'] = df.Number.shift()

# divide only when Prev is non-zero: when Prev is 0 we skip the division and
# keep just the current value; the first row's NaN Prev makes the result NaN
df['New'] = df.apply(lambda x: x.Number + ((x.Number / x.Prev)
                                           if (pd.isnull(x.Prev) or x.Prev != 0)
                                           else 0), axis=1)
print (df)
The output of this will be (Prev is the previous row; New is the computed result):
Letter Number Prev New
0 a 0 NaN NaN
1 b 0 0.0 0.0
2 c 0 0.0 0.0
3 d 1 0.0 1.0
4 e 2 1.0 4.0
5 f 3 2.0 4.5
If you want the first row to be 0 instead of NaN, chain .fillna(0) onto the .shift() call; Prev then becomes 0 for the first row, so no division happens there. You can drop the Prev column after the computation.
This snippet divides, using pd.Series.div, the Series Number by the shifted values of Number and then adds Number using pd.Series.add
>>> df.Number.div(df.Number.shift()).add(df.Number)
0 NaN
1 NaN
2 NaN
3 inf
4 4.0
5 4.5
Name: Number, dtype: float64
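If you would rather have the zero-denominator rows fall back to the current value (matching the apply-based result above) instead of producing inf, you can mask the ratio wherever the shifted value is 0; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Letter': list('abcdef'), 'Number': [0, 0, 0, 1, 2, 3]})

prev = df['Number'].shift()
# ratio of current to previous, forced to 0 wherever the previous value is 0
ratio = df['Number'].div(prev).mask(prev == 0, 0)
result = ratio.add(df['Number'])
```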
I share a part of my big dataframe here to ask my question. In the Age column there are two missing values that are the first two rows. The way I intend to fill them is based on the following steps:
1. Calculate the mean age for each group (assume the mean Age in group A is X).
2. Iterate through the Age column to detect the null values (which belong to the first two rows).
3. Return the Group value of each null Age (which is 'A').
4. Fill those null Age values with the mean age of their corresponding group (the first two rows belong to A, so fill their null Ages with X).
I know how to do step 1 (I can use data.groupby('Group')['Age'].mean()), but I don't know how to proceed through step 4.
Thanks.
Use:
df['Age'] = (df['Age'].fillna(df.groupby('Group')['Age'].transform('mean'))
.astype(int))
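A runnable sketch of that fillna/transform pattern on a small made-up frame (column names assumed from the question; the .astype(int) step is left out so the example stays general):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B'],
                   'Age': [np.nan, 20.0, 30.0, np.nan, 40.0]})

# every NaN is replaced by the mean Age of that row's Group
df['Age'] = df['Age'].fillna(df.groupby('Group')['Age'].transform('mean'))
```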
I'm guessing you're looking for something like this:
df['Age'] = df.groupby(['Name'])['Age'].transform(lambda x: np.where(np.isnan(x), x.mean(), x))
Assuming your data looks like this (I didn't copy the whole dataframe)
Name Age
0 a NaN
1 a NaN
2 b 15.0
3 d 50.0
4 d 45.0
5 a 8.0
6 a 7.0
7 a 8.0
you would run:
df['Age'] = df.groupby(['Name'])['Age'].transform(lambda x: np.where(np.isnan(x), x.mean(), x))
and get:
Name Age
0 a 7.666667 ---> The mean of group 'a'
1 a 7.666667
2 b 15.000000
3 d 50.000000
4 d 45.000000
5 a 8.000000
6 a 7.000000
7 a 8.000000
Having the following Data Frame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data, grouping the rows by name, then creating columns from the value & count columns aggregated into bins.
Explanation: I have 10 possible values, in the range 0-9, and not all of them are present in each group. In the example above, group B is missing the values 1, 2, 3, 4, 7. I would like to create a histogram with 5 bins, ignore missing values, and calculate the percentage of count for each bin. The result would look like this:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example for bin 0-1 of group A the calculation is the sum of count for the values 0,1 (1+2) divided by the total_count of group A
name 0-1
0 A (1+2)/20 = 0.15
I was looking into hist method and this StackOverflow question, but still struggling with figuring out what is the right approach.
Use pd.cut to bin your feature, then use df.groupby() followed by .unstack() to get the dataframe you are looking for. During the groupby you can use any aggregation function (.sum(), .count(), etc.) to get the results you need. The code below is a worked example.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0,10))
# Option 1: Sums
df.groupby(['number_bin','name'])['value'].sum().unstack(0)
# Option 2: Counts
df.groupby(['number_bin','name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
To get the exact result you could try this.
bins = range(10)
res = df.groupby('name')['count'].sum()

# bin the values; include_lowest puts 0 and 1 together in the first interval
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
df1 = (df.groupby([intervals, 'name'])['count'].sum() / res).unstack(0)
df1.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']  # rename the cols

# merge adjacent columns pairwise: a stays 0-1, then b+c, d+e, f+g, h+i
cols = ['a', 'b', 'd', 'f', 'h']
df1 = df1.add(df1.iloc[:, 1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
a b d f h
name
A 0.150000 0.2 0.3 0.200000 0.15
B 0.133333 NaN 0.4 0.266667 0.20
You can replace the NaN values using df1.fillna(0.0) (use a numeric 0.0 rather than the string "0.0" so the column stays numeric).
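Alternatively, a sketch that bins directly into the five labelled ranges with explicit edges (the bin edges and labels below are my own choice, not from the answers above), reproducing the same percentages on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A'] * 10 + ['B'] * 5,
    'value': list(range(10)) + [0, 5, 6, 8, 9],
    'count': [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    'total_count': [20] * 10 + [75] * 5,
})

# five two-wide bins covering the values 0-9
bins = [-0.5, 1.5, 3.5, 5.5, 7.5, 9.5]
labels = ['0-1', '2-3', '4-5', '6-7', '8-9']
df['bin'] = pd.cut(df['value'], bins=bins, labels=labels)

# sum counts per (name, bin), spread the bins into columns,
# then divide each row by that group's total_count
out = (df.groupby(['name', 'bin'], observed=False)['count'].sum()
         .unstack('bin')
         .div(df.groupby('name')['total_count'].first(), axis=0)
         .fillna(0))
```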