May I know how to ignore NaN when performing rolling on a df?
For example, given a df, perform rolling on column a, but ignore the NaN. This requirement should produce something like this:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 11180.00
5 13426.0 12050.00
6 NaN NaN
7 17514.0 19350.00
8 18408.0 20142.50
9 22128.0 20142.50
10 22520.0 21018.67
11 NaN NaN
12 26164.0 27796.67
13 26590.0 21627.25
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
However, I don't know which part of this line should be tweaked to get the above expected output:
df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
Currently, the following code
import numpy as np
import pandas as pd
arr = [[6772], [7182], [8570], [11078], [11646], [13426], [np.nan], [17514], [18408],
       [22128], [22520], [np.nan], [26164], [26590], [30636], [3119], [32166], [34774]]
df = pd.DataFrame(arr, columns=['a'])
w = 2
df['avg'] = df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
produces the following,
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 11180.00
5 13426.0 13416.00 <<<
6 NaN 15248.50 <<<
7 17514.0 17869.00 <<<
8 18408.0 20142.50
9 22128.0 20142.50
10 22520.0 22305.00 <<<
11 NaN 24350.50 <<<
12 26164.0 26477.50 <<<
13 26590.0 21627.25
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
<<< indicates where the values differ from the expected output.
Update:
Adding fillna
df['avg'] = df['a'].fillna(value=0).rolling(2 * w + 1, center=True, min_periods=1).mean()
does not produce the expected output either; the zeros are counted in the window and drag the mean down instead of being excluded:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 8944.00
5 13426.0 10732.80
6 NaN 12198.80
7 17514.0 14295.20
8 18408.0 16114.00
9 22128.0 16114.00
10 22520.0 17844.00
11 NaN 19480.40
12 26164.0 21182.00
13 26590.0 17301.80
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
(For reference: the expected 12050 at index 5 is (11078 + 11646 + 13426) / 3; the window must not reach past the NaN at index 6.)
IIUC, you want to restart the rolling window whenever a NaN is met. One way would be to use pandas.DataFrame.groupby: m.cumsum() increments at every NaN row, so each stretch of consecutive non-NaN rows gets its own group id and the rolling mean never crosses a NaN:
m = df.isna().any(axis=1)                  # True on rows that contain a NaN
df["avg"] = (df["a"].groupby(m.cumsum())   # new group id after every NaN
             .rolling(2 * w + 1, center=True, min_periods=1).mean()
             .reset_index(level=0, drop=True))
df["avg"] = df["avg"][~m]                  # restore NaN on the NaN rows
Output:
a avg
0 6772.0 7508.000000
1 7182.0 8400.500000
2 8570.0 9049.600000
3 11078.0 10380.400000
4 11646.0 11180.000000
5 13426.0 12050.000000
6 NaN NaN
7 17514.0 19350.000000
8 18408.0 20142.500000
9 22128.0 20142.500000
10 22520.0 21018.666667
11 NaN NaN
12 26164.0 27796.666667
13 26590.0 21627.250000
14 30636.0 23735.000000
15 3119.0 25457.000000
16 32166.0 25173.750000
17 34774.0 23353.000000
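As a quick sanity check (a sketch; np.isclose just guards against float rounding): the value at index 7 is the mean of rows 7-9 only, because the NaN at index 6 starts a new group:
# window of width 5 centered at 7, truncated to its group: rows 7, 8, 9
assert np.isclose(df.loc[7, "avg"], (17514 + 18408 + 22128) / 3)  # 19350.0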
I have the following dataframe:
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
Name: Passengers, dtype: float64
As you can see, the numbers are listed twice from 1-12 / 1-12; instead, I would like to change the index to 1-24. The problem is that when plotting it I see the following:
plt.figure(figsize=(15,5))
plt.plot(esta2,color='orange')
plt.show()
I would like to see a continuous line from 1 to 24.
esta2 = esta2.reset_index(drop=True) will get you 0-23 (without drop=True, reset_index turns the Series into a DataFrame). If you need 1-24 then you could just do esta2.index = np.arange(1, len(esta2) + 1).
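A minimal sketch of both steps together (assuming esta2 is the Series shown above and numpy is imported as np):
esta2 = esta2.reset_index(drop=True)        # index 0..23
esta2.index = np.arange(1, len(esta2) + 1)  # index 1..24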
Quite simply:
df.index = [i for i in range(1, len(df.index) + 1)]
df.index.name = 'Month'
print(df)
Val
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
13 -0.075844
14 -0.089111
15 0.042705
16 0.002147
17 -0.010528
18 0.109443
19 0.198334
20 0.209830
21 0.075139
22 -0.062405
23 -0.211774
24 -0.109167
Just reassign the index:
df.index = pd.Index(range(1, len(df) + 1), name='Month')
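Equivalently, pd.RangeIndex keeps the index as a lazy integer range (a minor variant, not required for correctness):
df.index = pd.RangeIndex(1, len(df) + 1, name='Month')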
I have a pandas series like this:
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
I want to get rid of the dollar sign so I can convert the values to numeric. I made a function in order to do this:
def strip_dollar(series):
    for number in dollar:
        if number[0] == '$':
            number[0].replace('$', ' ')
    return dollar
This function returns the original series untouched; nothing changes, and I don't know why.
Any ideas about how to get this right?
Thanks in advance
Use lstrip and convert to floats (your function has no effect because strings are immutable: str.replace returns a new string, which is never assigned back):
s = s.str.lstrip('$').astype(float)
print (s)
0 233.94
1 214.14
2 208.74
3 232.14
4 187.15
5 262.73
6 176.35
7 266.33
8 174.55
9 221.34
10 199.74
11 228.54
12 228.54
13 196.15
14 269.93
15 257.33
16 246.53
17 226.74
Name: A, dtype: float64
Setup:
s = pd.Series(['$233.94', '$214.14', '$208.74', '$232.14', '$187.15', '$262.73', '$176.35', '$266.33', '$174.55', '$221.34', '$199.74', '$228.54', '$228.54', '$196.15', '$269.93', '$257.33', '$246.53', '$226.74'])
print (s)
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
dtype: object
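One caveat: lstrip('$') strips every leading '$' character, not a single prefix. If that distinction matters, an alternative sketch (Series.str.removeprefix requires pandas 1.4+):
s = pd.to_numeric(s.str.removeprefix('$'))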
Using str.replace("$", "", regex=False)
Ex:
import pandas as pd
df = pd.DataFrame({"Col": ["$233.94", "$214.14"]})
# regex=False keeps "$" literal; as a regex pattern, "$" is the end-of-string anchor
df["Col"] = pd.to_numeric(df["Col"].str.replace("$", "", regex=False))
print(df)
print(df)
Output:
Col
0 233.94
1 214.14
CODE:
ser = pd.Series(data=['$123', '$234', '$232', '$6767'])

def rmDollar(x):
    # assumes the dollar sign is always the first character
    return x[1:]

serWithoutDollar = ser.apply(rmDollar)
serWithoutDollar
OUTPUT:
0 123
1 234
2 232
3 6767
dtype: object
Hope it helps!
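Note that the values above are still strings (dtype: object). A one-liner to finish the conversion, assuming the same ser:
serWithoutDollar = pd.to_numeric(ser.str.lstrip('$'))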
Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do a calculation on selected elements of each subgroup? For example, for each group, I want to extract every element in column 'c' whose corresponding element in column 'b' is between 4 and 9, and sum them all.
Here is the code I wrote (it runs, but I cannot get the correct result):
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))
list = []
def f(x):
    list_new = []
    for row in range(0, len(x)):
        if (x.iloc[row, 0] > 4 and x.iloc[row, 0] < 9):
            list_new.append(x.iloc[row, 1])
    list.append(sum(list_new))
results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations and filter against your criteria first; they do not change after the groupby.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
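A version note: since pandas 1.3, between's boolean inclusive argument is deprecated in favor of a string, so on recent versions the call above would be written as (same result):
z[z.b.between(4, 9, inclusive='neither')].groupby('a', as_index=False).c.sum()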
You could also groupby first.
z = z.groupby('a').apply(lambda x: x.loc[x['b']
        .between(4, 9, inclusive=False), 'c'].sum()).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x: sum(x.loc[(x['b'] > 4) & (x['b'] < 9), 'c']))\
 .reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15
I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every 'clienthostid', look for the 'usersidid' with the highest 'LoginDaysSum'; I check whether there is a usersidid which has the highest LoginDaysSum in two or more different clienthostids (for instance, usersidid = 9 is the highest LoginDaysSum in clienthostids 1, 2 and 3, in rows 0, 4 and 7 accordingly).
In this case, I want to choose the higher LoginDaysSum (in the example it would be the row with 1728); let's call it maxRT.
I want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in the example, it would be rows index 7 and 4).
If the ratio is below 0.8 then I want to drop the row:
index 4: LoginDaysSumLast7Days_ratio = 7/32 < 0.8  // row will drop!
index 7: LoginDaysSumLast7Days_ratio = 7/3 > 0.8   // row will stay!
The same condition will also be applied to LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only couldn't I make it work right, it also ran way too slow :(
My code so far (forgive me if it's written terribly wrong, I only started working with SQL, Pandas and Python for the first time last week and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you
I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find the greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect the greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
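A quick check that only the intended row was dropped (index values taken from the table above):
# row 4 (usersidid 9, clienthostid 2) is the only one below the 0.8 ratio
assert list(data.index) == [0, 1, 2, 3, 5, 6, 7]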
I'm wondering if there is a pythonic way to fill nulls for categorical data by randomly choosing from the distribution of unique values. Basically proportionally / randomly filling categorical nulls based on the existing distribution of the values in the column...
Below is an example of what I'm already doing. (I'm using numbers as categories to save time; I'm not sure how to randomly input letters.)
import numpy as np
import pandas as pd
np.random.seed([1])
df = pd.DataFrame(np.random.normal(10, 2, 20).round().astype(object))
df.rename(columns = {0 : 'category'}, inplace = True)
df.loc[::5] = np.nan
print(df)
category
0 NaN
1 12
2 4
3 9
4 12
5 NaN
6 10
7 12
8 13
9 9
10 NaN
11 9
12 10
13 11
14 9
15 NaN
16 10
17 4
18 9
19 9
This is how I'm currently imputing the values:
df.category.value_counts()
9 6
12 3
10 3
4 2
13 1
11 1
df.category.value_counts() / 16  # 16 = number of non-null values
9 0.3750
12 0.1875
10 0.1875
4 0.1250
13 0.0625
11 0.0625
# to fill categorical info based on percentage
category_fill = np.random.choice((9, 12, 10, 4, 13, 11), size = 4, p = (.375, .1875, .1875, .1250, .0625, .0625))
df.loc[df.category.isnull(), "category"] = category_fill
Final output works, just takes a while to write
df.category.value_counts()
9 9
12 4
10 3
4 2
13 1
11 1
Is there a faster way to do this or a function that would serve this purpose?
Thanks for any and all help!
You could use stats.rv_discrete:
from scipy import stats
counts = df.category.value_counts()
dist = stats.rv_discrete(values=(counts.index, counts/counts.sum()))
fill_values = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = fill_values
EDIT: For general data (not restricted to integers) you can do:
dist = stats.rv_discrete(values=(np.arange(counts.shape[0]),
                                 counts / counts.sum()))
fill_idxs = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = counts.iloc[fill_idxs].index.values
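A pandas/numpy-only alternative sketch (an assumption on my part: it generalizes the manual np.random.choice call in the question by deriving the probabilities from value_counts instead of hardcoding them):
counts = df.category.value_counts(normalize=True)   # empirical distribution
mask = df.category.isnull()
df.loc[mask, "category"] = np.random.choice(counts.index, size=mask.sum(), p=counts.values)
Another option is df.category.dropna().sample(n=mask.sum(), replace=True).values, which draws straight from the observed values without building an explicit distribution.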