I am looking to predict values based on seasonal data. Bonuses are paid monthly, quarterly, or annually, and the amount usually goes up after a couple of time periods. The data is given below. I have converted the Bonus event to a numerical value (Yes = 1, No = 0). I have tried Excel's forecast functions, but they were not useful.
Is there a package with which I can predict the next bonus Month and Amount, where recent data points have higher weight than older ones?
My dataset has about 10 years' worth of data for about 10,000 personnel, so it is not possible to predict the Month and Amount manually. I am trying to predict the next bonus Month and Amount.
Date    Bonus  Amount
Jan-15  0      000
Feb-15  0      000
Mar-15  1      100
Apr-15  0      000
May-15  0      000
Jun-15  1      100
Jul-15  0      000
Aug-15  0      000
Sep-15  1      145
Oct-15  0      000
Nov-15  0      000
Dec-15  1      145
Jan-16  0      000
Feb-16  0      000
Mar-16  1      145
Apr-16  0      000
May-16  1      150
Jun-16  0      000
Jul-16  0      000
Aug-16  1      150
Sep-16  0      000
Oct-16  0      000
Nov-16  1      150
Dec-16  0      000
Thanks for the help.
Have a look at the pandas package (https://pypi.org/project/pandas/). The problem you have is a fundamental time series analysis task as far as I can tell, and there are many guides online on how to implement it (just search for "ARIMA pandas").
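For example, exponential smoothing (a close relative of ARIMA) weights recent observations more heavily by construction, which matches your requirement. A minimal sketch with pandas and statsmodels, assuming the table above is saved as bonus.csv (a placeholder file name) and using a 3-month season to match the quarterly pattern in the sample:
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# 'bonus.csv' is a placeholder name for the table above
df = pd.read_csv('bonus.csv')
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')  # e.g. 'Jan-15'
amount = df.set_index('Date')['Amount'].astype(float)

# Holt-Winters: recent observations get exponentially more weight;
# seasonal_periods=3 reflects the quarterly pattern in the sample data
model = ExponentialSmoothing(amount, trend='add', seasonal='add',
                             seasonal_periods=3).fit()
print(model.forecast(3))  # predicted Amount for the next three months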
Hello, I am doing my assignment and I have encountered a question that I can't answer. The question is to create another DataFrame df_urban consisting of all columns of the original dataset but comprising only applicants with Urban status in their Property_Area attribute (excluding Rural and Semiurban) and with an ApplicantIncome of at least S$10,000. Reset the row index and display the last 10 rows of this DataFrame.
Picture of the question
My code, however, does not meet the criteria of an ApplicantIncome of at least 10,000 or of only Urban status in Property_Area.
df_urban = df
df_urban.iloc[-10:[11]]
I Was wondering what is the solution to the question.
Data picture
You can use the & operator to filter the data by multiple column conditions:
df_urban = df[(df[col1] == <condition>) & (df[col2] >= <condition>)]
The following is a simple proof-of-principle snippet that extracts from the primary data frame a subset data frame of only "Urban" locations:
import pandas as pd

# Read the tab-delimited applicant data
df = pd.read_csv('Applicants.csv', delimiter='\t')
print(df)

# Keep only the rows whose Property_Area is 'Urban'
df_urban = df[(df['Property_Area'] == 'Urban')]
print(df_urban)
Using a simple hand-built CSV file, here is a sample of the output.
ApplicantIncome CoapplicantIncome LoanAmount Loan_Term Credit_History Property_Area
0 4583 1508 128000 360 1 Rural
1 1222 0 55000 360 1 Rural
2 8285 0 64000 360 1 Urban
3 3988 1144 75000 360 1 Rural
4 2588 0 84700 360 1 Urban
5 5248 0 48550 360 1 Rural
6 7488 0 111000 360 1 SemiUrban
7 3252 1112 14550 360 1 Rural
8 1668 0 67500 360 1 Urban
ApplicantIncome CoapplicantIncome LoanAmount Loan_Term Credit_History Property_Area
2 8285 0 64000 360 1 Urban
4 2588 0 84700 360 1 Urban
8 1668 0 67500 360 1 Urban
Hope that helps.
Regards.
See below. I leave it to you to work out how to reset the index. You might want to look at .tail() to display the last rows.
df_urban = df[(df['ApplicantIncome'] >= 10000) & (df['Property_Area'] == 'Urban')]
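For completeness, a minimal sketch of the remaining steps, assuming the same column names as above:
df_urban = df[(df['ApplicantIncome'] >= 10000)
              & (df['Property_Area'] == 'Urban')].reset_index(drop=True)
print(df_urban.tail(10))  # last 10 rows, with a fresh 0..n-1 index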
Experiment Source RMSE
0 Experiment 10 sat8 931.453756
1 Experiment 10 sat8 861.855506
2 Experiment 10 sat8 859.305796
3 Experiment 10 sat8 655.863104
4 Experiment 10 sat8 935.915268
.. ... ... ...
571 Experiment 27 nel1 807.975352
572 Experiment 27 nel1 1146.975889
573 Experiment 27 nel1 1005.450225
574 Experiment 27 nel1 967.833854
575 Experiment 27 nel1 793.703938
I want to process the dataframe above to find the number of times a Source has the least RMSE value for a given Experiment. The result should look something like this:
      Experiment 10 .... Experiment 27
sat8              0                  0
nel1              1                  1
For any given Experiment, only one Source can have the least RMSE, so any given column sums to 1.
I tried using a pivot table, but I am not sure how to determine the Source with the least RMSE for a given Experiment.
Use get_dummies with DataFrameGroupBy.idxmin to get the minimal index (Source) per Experiment group by the RMSE column:
df2 = (pd.get_dummies(df.set_index('Source')
                        .groupby('Experiment')['RMSE']
                        .idxmin())
       .T)
print(df2)
Experiment Experiment 10 Experiment 27
nel1 0 1
sat8 1 0
Detail:
print(df.set_index('Source').groupby('Experiment')['RMSE'].idxmin())
Experiment
Experiment 10 sat8
Experiment 27 nel1
Name: RMSE, dtype: object
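For comparison, a small variation (not from the original answer) that builds the same table with pd.crosstab:
# Same result via crosstab on the idxmin output
s = df.set_index('Source').groupby('Experiment')['RMSE'].idxmin()
df2 = pd.crosstab(s, s.index)  # rows: Source, columns: Experiment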
I have a dataframe that looks something like the one below.
Spent Products bought Target Variable
0 2300 Car/Mortgage/Leisure 0
1 1500 Car/Education 0
2 150 Groceries 1
3 700 Groceries/Education 1
4 900 Mortgage 1
5 180 Education/Sports 1
6 1800 Car/Mortgage/Others 0
7 900 Sports/Groceries 1
8 1000 Self-Enrichment/Car 1
9 140 Car/Groceries 1
I used pd.get_dummies to one-hot encode the "Products bought" column. Now I have a shape of (5000, 150).
I train/test split my data and then applied PCA: fit_transform on the train set, and only transform on the test set. Following that I used a decision tree classifier to predict, which got me 90% accuracy.
Now here comes the problem. I have a new set of data. My model was trained on a shape of (, 150), and this new data only has a shape of (150, 28) after applying encoding with pd.get_dummies.
I know merging the new data with the old dataset is not a solution. I'm kind of stuck and not sure how to go about solving this. Does anyone have any input? Thanks.
Edit: I tried reindexing the new dataset, but it did not work. There are more unique values in the "Products bought" column in my training set and fewer in my new dataset.
The new dataframe looks more like the one below.
Spent Products bought Target Variable
0 230 Leisure 1
1 150 Others 1
2 100 Groceries 1
3 700 Education 1
4 900 Mortgage 0
5 180 Education/Sports 1
6 1800 Car/Mortgage 0
7 400 Groceries 1
8 4000 Car 1
9 140 Car/Groceries 1
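One way to make the reindexing mentioned in the edit work is to align the newly encoded frame to the exact training columns, so that categories missing from the new data become zero-filled dummies and categories unseen in training are dropped. A minimal sketch, where new_df and train_columns are hypothetical names for the new raw data and the 150 column names the model was trained on:
import pandas as pd

# new_df and train_columns are hypothetical names
new_encoded = pd.get_dummies(new_df['Products bought'])
# Align to the training feature set: categories missing from the new
# data become all-zero columns; categories unseen in training are dropped
new_aligned = new_encoded.reindex(columns=train_columns, fill_value=0)
# new_aligned now has the training shape and can go through the
# already-fitted PCA transform and classifier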
My data frame looks like this. My goal is to predict event_id 3 based on the data for event_id 1 and event_id 2.
ds tickets_sold y event_id
3/12/19 90 90 1
3/13/19 40 130 1
3/14/19 13 143 1
3/15/19 8 151 1
3/16/19 13 164 1
3/17/19 14 178 1
3/20/19 10 188 1
3/20/19 15 203 1
3/20/19 13 216 1
3/21/19 6 222 1
3/22/19 11 233 1
3/23/19 12 245 1
3/12/19 30 30 2
3/13/19 23 53 2
3/14/19 43 96 2
3/15/19 24 120 2
3/16/19 3 123 2
3/17/19 5 128 2
3/20/19 3 131 2
3/20/19 25 156 2
3/20/19 64 220 2
3/21/19 6 226 2
3/22/19 4 230 2
3/23/19 63 293 2
I want to predict sales for the next 10 days of this event:
ds tickets_sold y event_id
3/24/19 20 20 3
3/25/19 30 50 3
3/26/19 20 70 3
3/27/19 12 82 3
3/28/19 12 94 3
3/29/19 12 106 3
3/30/19 12 118 3
So far my model is the one below. However, I am not telling the model that these are two separate events, and it would be useful to consider all data from the different events, as they belong to the same organizer and therefore provide more information than just one event. Is that kind of fitting possible with Prophet?
import pandas as pd
from prophet import Prophet  # older installs: from fbprophet import Prophet

# Load data
df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True)
df.head()
# The important things to note are that cap must be specified for every row in the dataframe,
# and that it does not have to be constant. If the market size is growing, then cap can be an increasing sequence.
df['cap'] = 500
# growth: string 'linear' or 'logistic' to specify a linear or logistic trend.
# (Note that cap is only used by the logistic growth model.)
m = Prophet(growth='linear')
m.fit(df)
# periods is the number of days to forecast into the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Start dates of events seem to cause peaks. You can use holidays for this by setting the starting date of each event as a holiday; this informs Prophet about the events (and their peaks). I noticed events 1 and 2 overlap. You have multiple options to deal with this, and you need to ask yourself what the predictive value of each event is relative to event 3. You don't have much data, which will be the main issue. If the events have equal value, you could shift the date of one event, for example 11 days earlier. In the unequal-value scenario, you could drop one event.
events = pd.DataFrame({
    'holiday': 'events',
    'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
    'lower_window': 0,
    'upper_window': 1,
})
m = Prophet(growth='linear', holidays=events)
m.fit(df)
Also, I noticed you forecast on the cumulative sum. I think your events are stationary, so Prophet would probably benefit from forecasting the daily ticket sales rather than the cumulative sum.
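A minimal sketch of that last point, assuming you keep the tickets_sold column instead of dropping it and reuse the events frame above:
# Forecast daily ticket sales rather than the running total
daily = df[['ds', 'tickets_sold']].rename(columns={'tickets_sold': 'y'})
m = Prophet(growth='linear', holidays=events)
m.fit(daily)
future = m.make_future_dataframe(periods=20)
forecast = m.predict(future)
# A cumulative view can be rebuilt from the daily forecast if needed
forecast['yhat_cum'] = forecast['yhat'].cumsum()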
I have a dataframe with multiple columns
df = pd.DataFrame({"cylinders": [2, 2, 1, 1],
                   "horsepower": [120, 100, 80, 70],
                   "weight": [5400, 6200, 7200, 1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 80 7200
3 1 70 1200
I would like to create a new dataframe and make two subcolumns of weight with the median and the mean while grouping by cylinders.
example:
                          weight
   cylinders  horsepower  median  mean
0          1         100    5299  5000
1          1         120    5100  5200
2          2          70    7200  6500
3          2          80    1200  1000
The values in my example table are random; I can't manage to achieve that structure. I know how to get the median and mean; it's described in this Stack Overflow question:
df.weight.median()
df.weight.mean()
df.groupby('cylinders') #groupby cylinders
But how do I create this subcolumn?
The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and joins the result back onto the original dataframe:
result = df.join(df.groupby('cylinders')['weight']
                   .agg(['mean', 'median']),
                 on='cylinders').sort_values('cylinders')
#    cylinders  horsepower  weight    mean  median
# 2          1          80    7200  4200.0  4200.0
# 3          1          70    1200  4200.0  4200.0
# 0          2         120    5400  5800.0  5800.0
# 1          2         100    6200  5800.0  5800.0
You cannot have "subcolumns" for only some columns in pandas: if one column has subcolumns, all other columns must have them too. This is called a column MultiIndex.
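For illustration, a minimal sketch of such a column MultiIndex, built from the df defined in the question:
# Aggregating into a column MultiIndex: 'weight' is the top level,
# 'mean' and 'median' are its subcolumns
summary = df.groupby('cylinders').agg({'weight': ['mean', 'median']})
print(summary)
#            weight
#              mean  median
# cylinders
# 1          4200.0  4200.0
# 2          5800.0  5800.0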