I have a data set that I made with random numbers, containing the sales data for each sales representative for all previous months, and I want to know if there is a way to predict what the sales would look like for each representative in the upcoming month. I'm not sure whether machine learning methods are something that can be used here.
I am mostly asking for the best way to solve this: not necessarily code, but a method that is best suited to these types of questions. This is something I am interested in and would like to apply to bigger data sets in the future.
import pandas as pd

data = [[1, 55, 12, 25, 42, 66, 89, 75, 32, 43, 15, 32, 45],
        [2, 35, 28, 43, 25, 54, 76, 92, 34, 12, 14, 35, 63],
        [3, 13, 31, 15, 75, 4, 14, 54, 23, 15, 72, 12, 51],
        [4, 42, 94, 22, 34, 32, 45, 31, 34, 65, 10, 15, 18],
        [5, 7, 51, 29, 14, 92, 28, 64, 100, 69, 89, 4, 95],
        [6, 34, 20, 59, 49, 94, 92, 45, 91, 28, 22, 43, 30],
        [7, 50, 4, 5, 45, 62, 71, 87, 8, 74, 30, 3, 46],
        [8, 12, 54, 35, 25, 52, 97, 67, 56, 62, 99, 83, 9],
        [9, 50, 75, 92, 57, 45, 91, 83, 13, 31, 89, 33, 58],
        [10, 5, 89, 90, 14, 72, 99, 51, 29, 91, 34, 25, 2]]
df = pd.DataFrame(data, columns=['sales representative ID#',
                                 'January Sales Quantity',
                                 'February Sales Quantity',
                                 'March Sales Quantity',
                                 'April Sales Quantity',
                                 'May Sales Quantity',
                                 'June Sales Quantity',
                                 'July Sales Quantity',
                                 'August Sales Quantity',
                                 'September Sales Quantity',
                                 'October Sales Quantity',
                                 'November Sales Quantity',
                                 'December Sales Quantity'])
Your case, with multiple sales representatives, is more complex: since they are responsible for the same product, there may be complex correlations between their performances, on top of seasonality, autocorrelation, etc. Your data is not even a pure time series; it belongs to the class of so-called "panel" datasets.
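For example, here is a minimal sketch (using pandas' melt; the column names follow the question's DataFrame) of reshaping the wide table above into the long "panel" layout that panel and time-series tooling generally expects:

# One row per (rep, month) pair instead of one row per rep;
# 'month' holds the original column names, e.g. 'January Sales Quantity'
long_df = df.melt(id_vars='sales representative ID#',
                  var_name='month', value_name='sales')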
I've recently written a Python micro-package, salesplansuccess, which deals with predicting the current (or next) year's annual sales from historic monthly sales data. A major assumption of that model, however, is quarterly seasonality (more specifically, a repeating drift from the 2nd to the 3rd month of each quarter), which is more characteristic of wholesalers.
The package is installed as usual with pip install salesplansuccess.
You can modify its source code for it to better fit your needs.
The minimalistic use case is below:
import pandas as pd
from salesplansuccess.api import SalesPlanSuccess
myHistoricalData = pd.read_excel('myfile.xlsx')
myAnnualPlan = 1000
sps = SalesPlanSuccess(data=myHistoricalData, plan=myAnnualPlan)
sps.fit()
sps.simulate()
sps.plot()
For a more detailed illustration of its use, you may want to refer to the Jupyter Notebook illustration file in its GitHub repository.
Choose a method of prediction and iterate over the reps, calculating each one's parameters. Below is a simple linear regression in Python you can use; with time you can add something smarter.
#!/usr/bin/python
# scipy's linregress stands in here for the unspecified linear_regression() helper
from scipy.stats import linregress

data = [[1, 55, 12, 25, 42, 66, 89, 75, 32, 43, 15, 32, 45],
(...)

months = list(range(1, len(data[0])))  # month indices 1..12

for rep in data:
    rep_id, sales = rep[0], rep[1:]  # first column is the rep ID, not a sales figure
    fit = linregress(months, sales)
    # extrapolate the fitted line one month ahead as a naive forecast
    print(rep_id, fit.intercept + fit.slope * (len(months) + 1))
I have an original array, e.g.:
import numpy as np
original = np.array([56, 30, 48, 47, 39, 38, 44, 18, 64, 56, 34, 53, 74, 17, 72, 13, 30, 17, 53])
The desired output is an array made up of a fixed-size window sliding through multiple iterations, something like
[56, 30, 48, 47, 39, 38],
[30, 48, 47, 39, 38, 44],
[48, 47, 39, 38, 44, 18],
[47, 39, 38, 44, 18, 64],
[39, 38, 44, 18, 64, 56],
[38, 44, 18, 64, 56, 34],
[44, 18, 64, 56, 34, 53],
[18, 64, 56, 34, 53, 74],
[64, 56, 34, 53, 74, 17],
[56, 34, 53, 74, 17, 72]
At the moment I'm using

def myfunc():
    return np.array([original[i: i+k] for i in range(i_range)])

with parameters i_range = 10 and k = 6. Timing with Python's timeit module (10,000 iterations), I get close to 0.1 seconds. Can this be improved 100x by any chance?
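For reference, the timing is obtained with something like this sketch (the exact call is an assumption):

import timeit
print(timeit.timeit(myfunc, number=10000))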
I've also tried Numba, but the result wasn't ideal, as it does better with larger arrays.
NOTE: the arrays used in this post are reduced for demo purposes; the actual size of original is around 500.
As RandomGuy suggested, you can use stride_tricks:

np.lib.stride_tricks.as_strided(original, (i_range, k), (8, 8))

For larger arrays (and larger i_range and k) this is probably the most efficient option, as it does not allocate any additional memory. There is a drawback: editing the created array modifies the original array as well, unless you make a copy.
The (8, 8) parameter defines how many bytes in memory you advance in each direction; I use 8 because that is the original array's stride size.
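To avoid hard-coding the 8, a small sketch that reads the stride off the array itself, so the same call works for any element size:

import numpy as np

s = original.strides[0]  # bytes per element, 8 for int64
windows = np.lib.stride_tricks.as_strided(original, (i_range, k), (s, s))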
Another option, which works better for smaller arrays:
def myfunc2():
    i_s = np.arange(i_range).reshape(-1, 1) + np.arange(k)
    return original[i_s]
This is faster than your original version.
Both, however, are not 100x faster.
Use np.lib.stride_tricks.sliding_window_view
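A minimal sketch (sliding_window_view is available from NumPy 1.20 onward; slicing to i_range rows reproduces the expected output above):

import numpy as np

windows = np.lib.stride_tricks.sliding_window_view(original, k)[:i_range]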
I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.
# An example of the dataframe and a test doi:
ScID.xml journal year topic1
0 0009-3570(2017)050[0199:omfg]2.3.co.xml Journal_1 2017 0.000007
1 0001-3568(2001)750[0199:smdhmf]2.3.co.xml Journal_3 2001 0.000648
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
3 0003-3568(2011)150[0299:easayy]2.3.co.xml Journal_1 2011 0.000003
# A dummy doi:
test_doi = '0002-3568(2004)450'
In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.
I have searched this site and applied the methods described for similar scenarios.
Including:
df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()
This returns:
AttributeError: 'str' object has no attribute 'contains'
and:
df.filter(regex=test_doi)
gives:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[287459 rows x 0 columns]
and finally:
df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]
which also returns the Empty DataFrame as above.
All help is appreciated. Thank you.
There are two reasons why your first approach does not work:
First: if you use apply on a Series, the value passed into the lambda function is not a Series but a scalar. And because contains is a pandas method, not a string method, you get your error message.
Second: brackets have a special meaning in a regex (they delimit a capture group). If you want the brackets matched as literal characters, you have to escape them.
test_doi = r'0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
ScID.xml journal year topic1
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
By the way, pandas' filter function filters on the labels of the index, not on the values.
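As an alternative to escaping by hand, a short sketch using re.escape, which quotes all regex metacharacters in the DOI automatically:

import re

test_doi = '0002-3568(2004)450'
df.loc[df.iloc[:, 0].str.contains(re.escape(test_doi))]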
So, for fun and practice, I have been following along with a college friend in their more advanced programming class, tackling the assignments they are given to ready myself for the same Python class. I'm now at dictionaries and tuples, sorting them from the least to the greatest number in an assigned dictionary. So far I have this code:
cityRevenues = {'Alabaster':[40,50,23,18], 'Anniston':[56,78,34,11],
'Athens':[40,34,18,30],'Auburn':[55,67,23,11],
'Decatur':[44,23,56,11],'Florence':[55,67,33,23],'Gadsden':[45,67,54,77]}
a = (sorted(cityRevenues.items(), key=lambda x: x[1]))
print('Sort the cities by their Quarter 1 Revenues.')
print("")
print(a)
print("")
b = (sorted(cityRevenues.items(), key=lambda x: x[1][1]))
print('Sort the cities by their Quarter 2 Revenues.')
print("")
print(b)
print("")
c = (sorted(cityRevenues.items(), key=lambda x: x[1][2]))
print("Sort the cities by their Quarter 3 Revenues.")
print("")
print(c)
print("")
d = (sorted(cityRevenues.items(), key=lambda x: x[1][3]))
print("Sort the cities by their Quarter 4 Revenues.")
print("")
print(d)
print("")
which gives the output:
Sort the cities by their Quarter 1 Revenues.
[('Athens', [40, 34, 18, 30]), ('Alabaster', [40, 50, 23, 18]), ('Decatur', [44, 23, 56, 11]), ('Gadsden', [45, 67, 54, 77]), ('Auburn', [55, 67, 23, 11]), ('Florence', [55, 67, 33, 23]), ('Anniston', [56, 78, 34, 11])]
Sort the cities by their Quarter 2 Revenues.
[('Decatur', [44, 23, 56, 11]), ('Athens', [40, 34, 18, 30]), ('Alabaster', [40, 50, 23, 18]), ('Auburn', [55, 67, 23, 11]), ('Florence', [55, 67, 33, 23]), ('Gadsden', [45, 67, 54, 77]), ('Anniston', [56, 78, 34, 11])]
Sort the cities by their Quarter 3 Revenues.
[('Athens', [40, 34, 18, 30]), ('Auburn', [55, 67, 23, 11]), ('Alabaster', [40, 50, 23, 18]), ('Florence', [55, 67, 33, 23]), ('Anniston', [56, 78, 34, 11]), ('Gadsden', [45, 67, 54, 77]), ('Decatur', [44, 23, 56, 11])]
Sort the cities by their Quarter 4 Revenues.
[('Auburn', [55, 67, 23, 11]), ('Decatur', [44, 23, 56, 11]), ('Anniston', [56, 78, 34, 11]), ('Alabaster', [40, 50, 23, 18]), ('Florence', [55, 67, 33, 23]), ('Athens', [40, 34, 18, 30]), ('Gadsden', [45, 67, 54, 77])]
I managed to sort them from least to greatest in each quarter, but I do not know how to make it print only the value for the quarter being sorted on, i.e.:
Sort the cities by their Quarter 1 Revenues.
[('Athens', [40]), ('Alabaster', [40]), ('Decatur', [44]), ('Gadsden', [45]), ('Auburn', [55]), ('Florence', [55]), ('Anniston', [56])]
How would I go about having the specific tuple values be the only ones printed?
Try the following:
# To get the first value of the list for the city, use pair[1][0].
# You can get the second value of the list for the city by using pair[1][1]
# And so on...
>>> d = {pair[0]: [pair[1][0]] for pair in sorted(cityRevenues.items(), key=lambda x: x[1][3])}
>>> list(d.items())
[('Gadsden', [45]), ('Athens', [40]), ('Anniston', [56]), ('Alabaster', [40]), ('Florence', [55]), ('Auburn', [55]), ('Decatur', [44])]
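Note that the scrambled order in this output is a pitfall of going through a dict: before Python 3.7, dicts do not preserve insertion order, so the sorting is lost. On 3.7+ the order is kept; to be safe on any version, keep the result as a plain list of tuples instead:

ranked = [(pair[0], [pair[1][0]])
          for pair in sorted(cityRevenues.items(), key=lambda x: x[1][3])]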
Since the largest element ends up at the end of the sorted list, you can use index -1, e.g.:
cityRevenues = {'Alabaster':[40,50,23,18], 'Anniston':[56,78,34,11],
'Athens':[40,34,18,30],'Auburn':[55,67,23,11],
'Decatur':[44,23,56,11],'Florence':[55,67,33,23],'Gadsden':[45,67,54,77]}
a = (sorted(cityRevenues.items(), key=lambda x: x[1]))
print('Sort the cities by their Quarter 1 Revenues.')
print("\n")
print(a[-1])
print("\n")
It will give
('Anniston', [56, 78, 34, 11])
You print the last item in the list:
print(a[-1])
Format as desired.
By the way, you can make this code much shorter if you use the same variable name for each sorted list, and also collect your keys in a list you can index according to the input choice.
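For example, a sketch of that shorter version, looping over the four quarters and printing only the sorted quarter's value for each city:

for q in range(4):
    ranked = sorted(cityRevenues.items(), key=lambda x: x[1][q])
    print('Sort the cities by their Quarter %d Revenues.' % (q + 1))
    print([(city, [revs[q]]) for city, revs in ranked])
    print("")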
This is likely a really simple question, but it's one I've been confused about and stuck on for a while, so I'm hoping I might get some help.
I'm using cross validation to test my data set, but I'm finding that indexing the pandas df is not working as I'm expecting. Specifically, when I print out x_test, I find that there are no data points for x_test. In fact, there are indexes but no columns.
k = 10
N = len(df)
n = N/k + 1
for i in range(k):
    print i*n, i*n+n
    x_train = df.iloc[i*n: i*n+n]
    y_train = df.iloc[i*n: i*n+n]
    x_test = df.iloc[0:i*n, i*n+n:-1]
    print x_test
Typical output:
0 751
Empty DataFrame
Columns: []
Index: []
751 1502
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
I'm trying to work out how to get the data to show up. Any thoughts?
Why don't you use sklearn.cross_validation.KFold? There is a clear example on this site...
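For reference, a minimal sketch using the modern import path, sklearn.model_selection (the 'target' column name follows the update below):

from sklearn.model_selection import KFold

kf = KFold(n_splits=10)
for train_idx, test_idx in kf.split(df):
    x_train = df.iloc[train_idx].drop(columns=['target'])
    y_train = df.iloc[train_idx]['target']
    x_test = df.iloc[test_idx].drop(columns=['target'])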
UPDATE:
For all subsets you have to specify the columns as well: in x_train and x_test you have to exclude the target column, and in y_train only the target column should be present. See slicing and indexing for more details.
target = 'target'                    # name of the target column
list_features = df.columns.tolist()  # use all columns for model training
list_features.remove(target)         # excluding the "target" column

k = 10
N = len(df)
n = int(N/k) + 1                     # 'int()' is necessary in Python 3
for i in range(k):
    print i*n, i*n+n
    x_train = df.loc[i*n: i*n+n-1, list_features]  # '.loc[]' is inclusive, that's why "-1" is present
    y_train = df.loc[i*n: i*n+n-1, target]         # specify columns after ","
    x_test = df.loc[~df.index.isin(range(int(i*n), int(i*n+n))), list_features]
    print x_test
I have been doing multivariable linear regression (equations of the form y = b0 + b1*x1 + b2*x2 + ... + bn*xn) in Python, and I could successfully fit models with the following function:
def MultipleRegressionFunc(X, B):
    y = B[0]
    for i in range(0, len(X)):
        y += X[i] * B[i+1]
    return y
I'll skip the details of that function for now. Suffice it to say that using the curve_fit wrapper in scipy with this function has successfully allowed me to solve systems with many variables.
Now I've been wanting to consider possible interactions between variables, so I've modified the function as follows:
from itertools import combinations

def MultipleRegressionFuncBIS(X, B):
    # Define the terms of the equation.
    # The first term is 1*b0 (the intercept).
    terms = [1]
    # Add terms for the "non-interaction" part of the equation.
    for x in X:
        terms.append(x)
    # Add terms for the 'interaction' part of the equation.
    for x in list(combinations(X, 2)):
        terms.append(x[0] * x[1])
    # I'm proceeding this way because I found that some operations on iterables
    # are not well handled when curve_fit passes numpy arrays to the function.
    # Set a float object to hold the result of the calculation.
    y = 0.0
    # Iterate through each term of the equation, adding its value to y.
    for i in range(0, len(terms)):
        y += B[i] * terms[i]
    return y
I made a wrapper function for the above to be able to pass multiple linear coefficients to it via curve_fit.
def wrapper_func(X, *B):
    return MultipleRegressionFuncBIS(X, B)
Here's some mock input, generated by applying the formula y = 1 + 2*x1 + 3*x2 + 4*x3 + 5*x1*x2 + 6*x1*x3 + 7*x2*x3:
x1=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
x2=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
x3=[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]
y=[91, 192, 329, 502, 711, 956, 1237, 1554, 1907, 2296, 2721, 3182, 3679, 4212, 4781, 5386, 6027, 6704, 7417, 8166, 8951, 9772, 10629, 11522, 12451, 13416, 14417, 15454, 16527, 17636, 18781, 19962, 21179, 22432, 23721, 25046, 26407, 27804, 29237, 30706, 32211, 33752, 35329, 36942, 38591, 40276, 41997, 43754, 45547, 47376, 49241, 51142, 53079]
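As a quick sanity check on the mock data, the y values can be reproduced from the stated formula (for instance, for x1=1, x2=2, x3=3 it gives 1 + 2 + 6 + 12 + 10 + 18 + 42 = 91):

y_check = [1 + 2*a + 3*b + 4*c + 5*a*b + 6*a*c + 7*b*c
           for a, b, c in zip(x1, x2, x3)]
assert y_check == y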
Then I obtain the linear coefficients by calling the following:
from scipy.optimize import curve_fit

linear_coeffs = list(curve_fit(wrapper_func, [x1, x2, x3], y,
                               p0=[1.1, 2.2, 3.1, 4.1, 5.1, 6.1, 7.1],
                               bounds=(0.0, 10.0))[0])
print linear_coeffs
Notice that here the p0 estimates are manually set to values extremely close to the real values to rule out the possibility that curve_fit is having a hard time converging.
Yet, the output for this particular case deviates more than I would expect from the real values (expected: [1.0,2.0,3.0,4.0,5.0,6.0,7.0]):
[1.1020684140370627, 2.1149407566785214, 2.9872182044259676, 3.9734017072175436, 5.0575156518729969, 5.9605293645760549, 6.9819549835509491]
Now, here's my problem. While the coefficients do not perfectly match the input model, that is a secondary concern; I do expect some error in real-life examples, although it is puzzling in this noise-less mock example. My main problem is that the error is systematic: using the coefficients estimated by curve_fit, the residuals are all equal to 0.10206841, for every combination of x1, x2, x3. Other mock datasets produce different, but still systematic, residuals.
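For concreteness, a sketch of how those residuals are computed (using wrapper_func and the fitted linear_coeffs from above):

import numpy as np

fitted = [wrapper_func([a, b, c], *linear_coeffs) for a, b, c in zip(x1, x2, x3)]
residuals = np.array(fitted) - np.array(y)  # every entry comes out as 0.10206841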
Can you think of any explanation for this systematic error?
I post here because I suspect it is a coding issue rather than a statistical one. I'm very willing to move this question to Cross Validated if it turns out I made a stats error.
Many thanks!