I am trying to build a function that uses .shift() but it is giving me an error.
Consider this:
In [40]:
import pandas as pd
from pandas import DataFrame

data = {'level1': [20, 19, 20, 21, 25, 29, 30, 31, 30, 29, 31],
        'level2': [10, 10, 20, 20, 20, 10, 10, 20, 20, 10, 10]}
index = pd.date_range('12/1/2014', periods=11)
frame = DataFrame(data, index=index)
frame
Out[40]:
level1 level2
2014-12-01 20 10
2014-12-02 19 10
2014-12-03 20 20
2014-12-04 21 20
2014-12-05 25 20
2014-12-06 29 10
2014-12-07 30 10
2014-12-08 31 20
2014-12-09 30 20
2014-12-10 29 10
2014-12-11 31 10
A normal function works fine. To demonstrate, I calculate the same result twice, using the direct approach and the function approach:
In [63]:
frame['horizontaladd1'] = frame['level1'] + frame['level2']  # works

def horizontaladd(x):
    test = x['level1'] + x['level2']
    return test

frame['horizontaladd2'] = frame.apply(horizontaladd, axis=1)
frame
frame
Out[63]:
level1 level2 horizontaladd1 horizontaladd2
2014-12-01 20 10 30 30
2014-12-02 19 10 29 29
2014-12-03 20 20 40 40
2014-12-04 21 20 41 41
2014-12-05 25 20 45 45
2014-12-06 29 10 39 39
2014-12-07 30 10 40 40
2014-12-08 31 20 51 51
2014-12-09 30 20 50 50
2014-12-10 29 10 39 39
2014-12-11 31 10 41 41
But while applying shift directly works, inside a function it doesn't:
frame['verticaladd1'] = frame['level1'] + frame['level1'].shift(1)  # works

def verticaladd(x):
    test = x['level1'] + x['level1'].shift(1)
    return test

frame.apply(verticaladd)  # error
results in
KeyError: ('level1', u'occurred at index level1')
I also tried applying to a single column, which makes more sense in my mind, but no luck:
def verticaladd2(x):
    test = x - x.shift(1)
    return test

frame['level1'].map(verticaladd2)  # error, also with apply
error:
AttributeError: 'numpy.int64' object has no attribute 'shift'
Why not call shift directly? I need to embed it in a function so I can calculate multiple columns at the same time, along axis 1. See the related question Ambiguous truth value with boolean logic.
Try passing the frame to the function, rather than using apply (I am not sure why apply doesn't work, even column-wise):
def f(x):
    return x.level1 + x.level1.shift(1)

f(frame)
returns:
2014-12-01 NaN
2014-12-02 39
2014-12-03 39
2014-12-04 41
2014-12-05 46
2014-12-06 54
2014-12-07 59
2014-12-08 61
2014-12-09 61
2014-12-10 59
2014-12-11 60
Freq: D, Name: level1, dtype: float64
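A note on why the apply versions fail (my reading, not part of the answer above): DataFrame.apply hands the function one column at a time by default (one row with axis=1), and Series.map hands it one scalar at a time. A tiny reproduction:
import pandas as pd

frame = pd.DataFrame({'level1': [20, 19, 20], 'level2': [10, 10, 20]},
                     index=pd.date_range('12/1/2014', periods=3))

# axis=0 (default): x is a whole column whose index holds the dates, so
# x['level1'] is a failed label lookup -- the KeyError above.
# axis=1: x is a single row; shifting within one row has no meaning.
# .map: x is a bare numpy.int64 scalar -- the AttributeError above.
# shift belongs on the whole column:
frame['verticaladd'] = frame['level1'] + frame['level1'].shift(1)
print(frame)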
Check whether the values you are trying to shift are stored in a NumPy array rather than a Series. If so, convert the array to a Series; then you will be able to shift the values. I was having the same issue, and now I am able to get the shifted values.
Here is the relevant part of my code for reference.
X = grouped['Confirmed_day'].values  # .values returns a plain NumPy array
X_series = pd.Series(X)              # convert it back to a Series
X_lag1 = X_series.shift(1)           # shift now works
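Note that pd.Series(X) resets the index to 0..n-1. If grouped is a DataFrame and the original index matters, it may be simpler to skip the .values round-trip and shift the column itself (my suggestion, assuming that context):
X_lag1 = grouped['Confirmed_day'].shift(1)  # shift the Series directly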
I'm not entirely following along, but if frame['level1'].shift(1) works, then frame['level1'] must be a Series, while whatever map is passing into verticaladd2 is a numpy.int64 scalar, since map hands the function one element at a time. You probably need to look at your types.
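A quick, assumption-free way to confirm what map or apply is handing your function is to inspect the element types:
# each element of the column is a scalar, not a Series:
print(frame['level1'].map(type).iloc[0])  # <class 'numpy.int64'>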
Related
My question tackles a slightly different angle of this question:
I have a helper function which computes scipy's ttest for independence. Here it is:
# Helper function for testing for independence
from scipy.stats import ttest_ind

def conduct_ttest(data, variable_1="bias", variable_2="score", nan_policy="omit"):
    test_result = ttest_ind(data[variable_1], data[variable_2], nan_policy=nan_policy)
    test_statistic = test_result[0]
    p_value = test_result[1]
    return test_statistic, p_value
I would like to run it using a 5 period rolling window so that it outputs the test results into the dataframe, "data". The dataframe looks like this:
date bias score
1/1/2021 5 1000
1/2/2021 13 1089
1/3/2021 21 1178
1/4/2021 29 1267
1/5/2021 37 1356
1/6/2021 45 1445
1/7/2021 53 1534
1/8/2021 61 1623
1/9/2021 69 1712
1/10/2021 77 1801
1/11/2021 85 1890
1/12/2021 93 1979
1/13/2021 101 2068
1/14/2021 109 2157
1/15/2021 117 2246
1/16/2021 125 2335
1/17/2021 133 2424
What I have tried:
data[["test_statistic", "p_value"]] = \
data.rolling(5).apply(lambda x: conduct_ttest(x, variable_1="bias", variable_2="score", nan_policy="omit")
However, it is not working. Does anyone have tips on what I can do?
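(For context, my reading of why this fails: rolling().apply() calls the function once per column, passing only that column's window of values, and the function must return a single scalar, so a two-column test that returns a tuple cannot be expressed this way.)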
I failed to find a built-in rolling method that does this, so try this simple iterative workaround solution:
# In this function I just added the index to the returned values:
def conduct_ttest(data, variable_1="bias", variable_2="score", nan_policy="omit"):
    test_result = ttest_ind(data[variable_1], data[variable_2], nan_policy=nan_policy)
    test_statistic = test_result[0]
    p_value = test_result[1]
    return data.index.max(), test_statistic, p_value

# Define the rolling apply period:
window = 5
pd.DataFrame(
    [conduct_ttest(data.iloc[range(i, i + window)]) for i in range(len(data) - window)],
    columns=['index', 'test_statistic', 'p_value']
).set_index('index', drop=True)
result:
test_statistic p_value
index
4 -18.310951 8.140624e-08
5 -19.592876 4.788281e-08
6 -20.874800 2.909324e-08
7 -22.156725 1.819271e-08
8 -23.438650 1.167216e-08
9 -24.720575 7.663247e-09
10 -26.002500 5.136947e-09
11 -27.284425 3.509024e-09
12 -28.566349 2.438519e-09
13 -29.848274 1.721420e-09
14 -31.130199 1.232845e-09
15 -32.412124 8.947394e-10
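As a side note, for the default equal-variance two-sample t-test the statistic has a closed form in rolling means and variances, so the Python-level loop can be avoided entirely. A vectorized sketch of my own, assuming NaN-free windows and the same window of 5; each statistic lands on the last row of its window, matching the index above:
import numpy as np
from scipy.stats import t as t_dist

n = 5
m1, m2 = data['bias'].rolling(n).mean(), data['score'].rolling(n).mean()
v1, v2 = data['bias'].rolling(n).var(), data['score'].rolling(n).var()
sp2 = ((n - 1) * v1 + (n - 1) * v2) / (2 * n - 2)  # pooled variance
t_stat = (m1 - m2) / np.sqrt(sp2 * (2 / n))        # equal-variance t statistic
data['test_statistic'] = t_stat
data['p_value'] = 2 * t_dist.sf(t_stat.abs(), df=2 * n - 2)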
I have a random list of series (integers) along with dates in a csv like:
1/1/2019,34 44 57 62 70
12/28/2018,09 10 25 37 38
12/25/2018,02 08 42 43 50
12/21/2018,10 13 61 62 70
12/18/2018,13 22 32 60 69
12/14/2018,05 22 26 43 49
12/11/2018,04 38 39 54 59
12/7/2018,04 10 20 33 57
12/4/2018,28 31 41 42 50
The list goes all the way back to the year 1997; its size is 2336. What I am trying to do is predict the next series (or as close as possible) based on these data.
What have I tried?
The approach that I've used so far is (e.g. for 1/1/2019,34 44 57 62 70):
1) Get the occurrence of each number in the list, i.e. the number 34 has occurred 170 times out of the total list (2336).
2) Find the percentage for each number that has occurred, i.e.
Perc/Chances(34) = Occurrence/TotalNo.
Chances(34) = 170/2336
Chances(34) = 0.072 ~ 07
One way to build the list would be to just pick the 5 numbers with the lowest percentages, but that won't be very effective. On the other hand, I now have data holding each number, its percentage, and its occurrence count. Is there any way I can train a neural network on this to predict the next series, or something close to it?
File hierarchy: the data is split across three CSVs, where comp_data.csv contains data like:
1/1/2019,34 44 57 62 70
12/28/2018,09 10 25 37 38
12/25/2018,02 08 42 43 50
12/21/2018,10 13 61 62 70
12/18/2018,13 22 32 60 69
12/14/2018,05 22 26 43 49
12/11/2018,04 38 39 54 59
12/7/2018,04 10 20 33 57
12/4/2018,28 31 41 42 50
and occurrence.csv contains:
34,170
44,197
57,36
62,38
70,37
09,186
10,210
25,197
37,185
38,206
02,217
08,185
and report.csv contains each number, its occurrence, and its percentage:
34,3,11
44,1,03
57,5,19
62,5,19
70,5,19
09,1,03
10,5,19
25,2,07
37,3,11
38,2,07
02,1,03
08,2,07
So I have the list of series, their occurrences over a period of time, and the percentages. Is there any way I can create a NN that takes some inputs, trains over the data, and predicts the output (a series in this case)?
The Problem:
Which ones would be the inputs, given that this is a purely random problem? P.S. I cannot provide any input, since I need a series produced without one. Perhaps an LSTM network for regression?
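For reference, a minimal sketch of how one might frame the draw history as a supervised sequence problem for an LSTM in Keras. Everything here is an assumption on my part: the window size, layer sizes, and epochs are illustrative, and nothing changes the fact that genuinely random draws carry no learnable signal.
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# comp_data.csv assumed to hold rows like "1/1/2019,34 44 57 62 70", newest first
raw = pd.read_csv('comp_data.csv', header=None, names=['date', 'numbers'])
draws = np.array([list(map(int, s.split())) for s in raw['numbers']], dtype='float32')
draws = draws[::-1].copy()  # reorder oldest-first

window = 10  # use the previous 10 draws to predict the next one
X = np.stack([draws[i:i + window] for i in range(len(draws) - window)])
y = draws[window:]

model = Sequential([
    LSTM(32, input_shape=(window, 5)),  # 5 numbers per draw
    Dense(5),                           # regress one value per position
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20, verbose=0)

next_series = model.predict(draws[np.newaxis, -window:])
print(np.round(next_series))
With a truly random source, a network trained on mean squared error will drift toward predicting the per-position mean, which is itself a useful sanity check on the setup.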
Trying to filter the actions a user has done in a session once the number of actions reaches a threshold.
Here is the data set (only a few records):
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
The list of items that a user has rated in a session can be obtained with:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is the following: if a user has rated more than 3 items in a session, I want to pick only the first three items (keep only the first three per user per session) from the original data frame. Maybe use the time column to sort the items?
I first tried to find which sessions contain more than 3 items, but I'm struggling to go beyond that:
df.groupby(['user_id', 'session_id'])['item_id'].apply(
    lambda x: x.count() > 3)
Example: from the original df, user 123 should keep only its first three records belonging to session 36.
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
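Since the question mentions sorting by time, a small variation (my addition, not part of the answer above) sorts first, so that "first three" means earliest by timestamp:
# "first three" by time rather than by original row order
res = (df.sort_values('time')
         .groupby(['user_id', 'session_id'])
         .head(3))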
One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any series or MultiIndex data before applying any filtering.
The example below filters for user_id/session_id combinations with at least 3 items and only takes the first 3 in each group.
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1  # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index

df = df.sort_values('time')
res = df[(indices.map(sizes.get) >= 3) & (counter <= 3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, mean).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A) F(B) F(C) F(D) F(E) F(F) F(G) F(H) F(I) F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, by applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the min, mean and max of the row onto some coordinates. This is done for each period.
I know that I can iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    row_min = row.min()
    row_max = row.max()
    row_mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output
Help! My current thinking is to build up the new df row by row. I'm trying to do that, but I'm getting a lot of MultiIndex columns etc. Any pointers would be great.
Thanks so much... merry Xmas to you all.
Consider calculating row-wise aggregates with DataFrame methods and then passing the resulting Series into a DataFrame.apply() across columns:
cols = list('ABCDEFGHIJ')

# ROW-WISE AGGREGATES (computed from the original columns only, so each
# new aggregate column does not contaminate the next calculation)
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)

# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
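To see the pattern run end to end, here is a toy stand-in for foo (hypothetical; the real foo does the OLS-based mapping) that simply rescales each value into its row's range:
import numpy as np
import pandas as pd

# Hypothetical foo: min-max rescale using the row aggregates. col, row_min,
# row_max and row_mean are equal-length Series, so this is fully vectorized.
def foo(col, row_min, row_max, row_mean):
    return (col - row_min) / (row_max - row_min)

cols = list('ABCDEFGHIJ')
df = pd.DataFrame(np.random.randint(1, 100, size=(3, 10)), columns=cols,
                  index=pd.date_range('2011-01-01', periods=3))
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)

new_df = df[cols].apply(lambda col: foo(col, df['row_min'],
                                        df['row_max'], df['row_mean']))
print(new_df)  # each row now spans 0.0 to 1.0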
I have been trying to make a Bokeh line chart; however, I am running into an issue indexing the x-axis with a column of timestamps from my pandas data frame. Currently my data frame looks like this:
TMAX TMIN TAVG DAY NUM
2007-04-30 65 46 55.5 2007-04-30 1
2007-05-01 75 45 60.0 2007-05-01 2
2007-05-02 66 52 59.0 2007-05-02 3
2007-05-03 65 43 54.0 2007-05-03 4
2007-05-04 61 45 53.0 2007-05-04 5
2007-05-05 65 43 54.0 2007-05-05 6
2007-05-06 77 51 64.0 2007-05-06 7
2007-05-07 89 66 77.5 2007-05-07 8
2007-05-08 91 56 73.5 2007-05-08 9
2007-05-09 83 48 65.5 2007-05-09 10
2007-05-10 68 47 57.5 2007-05-10 11
2007-05-11 65 46 55.5 2007-05-11 12
2007-05-12 63 43 53.0 2007-05-12 13
2007-05-13 65 46 55.5 2007-05-13 14
2007-05-14 71 46 58.5 2007-05-14 15
....
[3592 rows x 5 columns]
I want to index the line plot with the values of the "DAY" column, however, I get an error no matter the approach I take. The documentation for line plots says that "x (str or list(str), optional) – specifies variable(s) to use for x axis". My code is as follows:
import numpy as np
from bokeh.charts import Line, output_file, show

xyvalues = np.array([df['TAVG'], df_reg['ry'], df['DAY']])
regr = Line(data=xyvalues, x='DAY', title="Linear Regression of Data",
            ylabel="Average Daily Temperature", xlabel="Number of Days")
output_file("regression.html")
show(regr)
This gives me the error "TypeError: Cannot compare type 'Timestamp' with type 'float64'". I have tried converting it to float, but it doesn't seem to have an effect. Any help would be much appreciated. The df_reg['ry'] is data from a linear regression data frame.
Documentation for line graphs can be found here: http://docs.bokeh.org/en/latest/docs/reference/charts.html#line
Inside Line, you need to pass a pandas data frame to the data argument in order to be able to refer to your variable DAY for the x axis ticks. Here I create a new pandas DataFrame from the other two:
import pandas as pd
df2 = pd.DataFrame(data=dict(TAVG=df['TAVG'], ry=df_reg['ry'], DAY=df['DAY']))
regr = Line(data=df2, x='DAY',
title="Linear Regression of Data",
ylabel="Average Daily Temperature",
xlabel="Number of Days")
output_file("regression.html")
show(regr)
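For anyone on current Bokeh: bokeh.charts was deprecated and later removed, and the equivalent plot is built with bokeh.plotting instead. A sketch reusing df2 from above (the glyph and axis keywords are standard bokeh.plotting API; the styling choices are mine):
from bokeh.plotting import figure, output_file, show

p = figure(x_axis_type='datetime', title="Linear Regression of Data",
           x_axis_label="Number of Days",
           y_axis_label="Average Daily Temperature")
p.line(df2['DAY'], df2['TAVG'], legend_label="TAVG")
p.line(df2['DAY'], df2['ry'], legend_label="regression", line_color="red")
output_file("regression.html")
show(p)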