Estimating price with Linear Regression - python

I'm posting here because I couldn't find any solution to my problem anywhere else. Basically we are learning Linear Regression using Python at school, and the professor wants us to estimate the price of each ingredient in a sandwich as well as the fixed profit on each sandwich, based on a CSV table. So far we have only worked with one X variable and one Y variable, so I'm pretty confused about what I should do here. Thank you. Here is the table:
tomato,lettuce,cheese,pickles,palmetto,burger,corn,ham,price
0.05,1,0.05,0,0.05,0.2,0.05,0,18.4
0.05,0,0.05,0.05,0,0.2,0.05,0.05,16.15
0.05,1,0.05,0,0.05,0.4,0,0,22.15
0.05,1,0.05,0,0.05,0.2,0.05,0.05,19.4
0.05,1,0,0,0,0.2,0.05,0.05,18.4
0,0,0.05,0,0,0,0.05,0.05,11.75
0.05,1,0,0,0,0.2,0,0.05,18.15
0.05,1,0.05,0.05,0.05,0.2,0.05,0,18.65
0,0,0.05,0,0,0.2,0.05,0.05,15.75
0.05,1,0.05,0,0.05,0,0.05,0.05,15.4
0.05,1,0,0,0,0.2,0,0,17.15
0.05,1,0,0,0.05,0.2,0.05,0.05,18.9
0,1,0.05,0,0,0.2,0.05,0.05,18.75

You have 9 separate variables for regression (tomato ... price), and 13 samples for each of them (the 13 lines).
So the first approach could be doing a regression for "tomato" on the data points
0.05, 0.05, 0.05, 0.05, 0.05, 0, 0.05, 0.05, 0, 0.05, 0.05, 0.05, 0
then doing another one for "lettuce" and the others, up to "price" with
18.4, 16.15, 22.15, 19.4, 18.4, 11.75, 18.15, 18.65, 15.75, 15.4, 17.15, 18.9, 18.75
Online viewer for looking at your CSV data: http://www.convertcsv.com/csv-viewer-editor.htm, but Google SpreadSheet, Excel, etc. can display it nicely too.
SciPy can most likely do the task for you on vectors too (handling the 9 variables together), but the fact that you have 13 samples in the 13 rows remains.
EDIT: bad news, I was tired and have not answered the full question, sorry about that.
While it is true that you can take the first 8 columns (tomato...ham) as time series, and make individual regressions for them (which is probably the first part of this assignment), the last column (price) is expected to be estimated from the first 8.
Using the notation in Wikipedia, https://en.wikipedia.org/wiki/Linear_regression#Introduction, your y vector is the last column (the prices), the X matrix is the first 8 columns of your data (tomato...ham), extended with a column of 1-s somewhere.
Then pick an estimation method (some are listed in that page too, https://en.wikipedia.org/wiki/Linear_regression#Estimation_methods, but you may want to pick one you have learned about at class). The actual math is there, and NumPy can do the matrix/vector calculations. If you go for "Ordinary least squares", numpy.linalg.lstsq does the same (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq - you may find adding that column of 1-s familiar), so it can be used for verifying the results.
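For concreteness, a minimal sketch of that ordinary-least-squares route (it assumes the table above has been saved as sandwiches.csv; the file name is just an example):

import numpy as np
import pandas as pd

data = pd.read_csv("sandwiches.csv")               # the table above, saved as a CSV file

X = data.drop(columns="price").to_numpy()          # 13 x 8 matrix of ingredient amounts
X = np.column_stack([X, np.ones(len(X))])          # extra column of 1s for the constant term
y = data["price"].to_numpy()                       # the 13 prices

coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# First 8 entries: estimated price per unit of each ingredient;
# last entry: the constant term, i.e. the fixed profit per sandwich.
for name, c in zip(list(data.columns[:-1]) + ["fixed profit"], coef):
    print(name, round(float(c), 2))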

Related

Scipy - adjust National Team means based on sample size

I have a population of National Teams (32), and a parameter (mean) that I want to measure for each team, aggregated per match.
For example: I get the mean scouts for all strikers for each team, per match, and then I take the mean (or median) across all of that team's matches.
Now, one group of teams has played 18 matches and another group has played only 8 matches for the World Cup Qualifying.
I have a hypothesis that, for two teams with an equal mean value, the one with the larger sample size (18) should be ranked higher.
less_than_8 = all_stats[all_stats['games']<=8]
I get values:
3 0.610759
7 0.579832
14 0.537579
20 0.346510
25 0.403606
27 0.536443
and with:
sns.displot(less_than_8, x="avg_attack",kind='kde',bw_adjust=2)
I plot the distribution (figure not shown), with a mean of 0.5024547681196802.
Now, for:
more_than_18 = all_stats[all_stats['games']>=18]
I get values:
0 0.148860
1 0.330585
4 0.097578
6 0.518595
8 0.220798
11 0.200142
12 0.297721
15 0.256037
17 0.195157
18 0.176994
19 0.267094
21 0.295228
22 0.248932
23 0.420940
24 0.148860
28 0.297721
30 0.350516
31 0.205128
and I plot the curve (figure not shown), with a lower mean of 0.25982701104003497.
It seems clear that sample size does affect the mean, diminishing it as size increases.
Is there a way I can adjust the means of the larger sample size AS IF they were being calculated on a smaller sample size, or vice versa, using prior and posterior assumptions?
NOTE. I have std for all teams.
There is a proposed solution for a similar matter, using Empirical Bayes estimation and a beta distribution, which can be seen here: Understanding empirical Bayes estimation (using baseball statistics). But I'm not sure how prior means could be extrapolated from the successful attempts.
Sample size does affect the mean, but it's not exactly that the mean should increase or decrease when the sample size is increased. Rather, as the sample size grows, the sample mean and standard deviation get closer and closer to the population mean μ and standard deviation σ.
I cannot give an exact proposal without more information, like how many datapoints there are per team and per match, and what the standard deviations of these values are. But just looking at the details, I have to presume the 6 teams that qualified with only 8 matches somehow smashed whatever stat you have measured. (Probably this is why they only played 8 matches?)
I can make a few simple proposals based on the fact that you want to rank these teams:
Proposal 1:
Extend these stats and calculate a population mean and std for a season. (If you have prior seasons, use them as well.)
Use this mean value to rank teams (without any sample adjustments) - this would likely result in the 6 teams landing on top.
Proposal 2:
Calculate the per-game mean across all teams (call it mean_gt) [mean for game 01, mean for game 02, ... or the mean for games in Week 01, Week 02, ...] - I recommend basing it on weeks, as the 6 teams will only have 8 games and this would otherwise bias datapoints towards the beginning or end.
Plot mean_gt and compare each team's mean for a given week with mean_gt [call this difference diff_gt].
diff_gt gives a better perspective of a team's performance in each week, so you can take the mean of this value to rank teams (see the sketch below).
When filling in datapoints for the 6 teams with only 8 matches, I suggest using the population mean rather than extrapolating, to keep things simple.
But it's possible to get creative, for example by also using the difference from the aggregate total of the 32 teams, like [32*mean_gt_of_week_1 - total of the (32-x) teams]/x.
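A rough pandas sketch of Proposal 2 (the frame layout and the column names "team", "week" and "avg_attack" are assumptions for illustration, not taken from your data):

import pandas as pd

# Hypothetical long-format frame: one row per team per week, with the stat of interest.
stats = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C", "C"],
    "week": [1, 2, 1, 2, 1, 2],
    "avg_attack": [0.61, 0.58, 0.15, 0.33, 0.30, 0.25],
})

# Per-week mean across all teams (mean_gt in the proposal).
mean_gt = stats.groupby("week")["avg_attack"].transform("mean")

# Each team's deviation from that week's mean (diff_gt).
stats["diff_gt"] = stats["avg_attack"] - mean_gt

# Rank teams by their average deviation.
ranking = stats.groupby("team")["diff_gt"].mean().sort_values(ascending=False)
print(ranking)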
I do have another idea, but I'd rather wait for feedback, as I am already way off from a simple solution for adjusting a sample mean. :)

Identifying outliers in an event sequence using a Python Dataframe

I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in meters at a site downstream from the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
[Image: DataFrame showing a typical sequence of rainfall and river stage data]
And here is an example of what it looks like when two sites have been tested
[Image: DataFrame showing abnormal rainfall data due to two sites being tested]
I've come across several ways to statistically cluster data and identify outliers; however, none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which they are located within the catchment, so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp   Site 1   Site 2   Site 3
09:00:00         0        0        0
10:00:00         0       20        0
11:00:00         0        0        0
20mm of rainfall at one site with zero rainfall at the adjacent sites, or at the same site for the hour before and hour after is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp   Site 1   Site 2   Site 3
09:00:00         6        4        0
10:00:00         0       20        2
11:00:00         0        0       11
This is a less obvious example:
Timestamp   Site 1   Site 2   Site 3
09:00:00         1        0        0
10:00:00         0       20        2
11:00:00         0        3        1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and flag it as abnormal if the difference is greater than 15 (or some other arbitrary threshold value).
The exact criteria will probably change as I discover more about the data. The mechanism for applying those criteria to the DataFrame is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across the DataFrame?
An extra complication is how to deal with checking values for Site 1, where there is no preceding site to the left, and Site 3, where there is no following site to the right.
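One way to express that neighbour-comparison idea with plain DataFrame shifts - a minimal sketch, where the threshold of 15 and the toy data are placeholders, and the edge sites simply have fewer neighbours (NaNs are skipped when taking the maximum), which deals with the Site 1 / Site 3 complication:

import pandas as pd

df = pd.DataFrame(
    {"Site 1": [1, 0, 0], "Site 2": [0, 20, 3], "Site 3": [0, 2, 1]},
    index=["09:00:00", "10:00:00", "11:00:00"],
)

THRESHOLD = 15

# Neighbours in time (previous/next row) and space (adjacent site columns).
neighbours = [
    df.shift(1),             # previous hour, same site
    df.shift(-1),            # next hour, same site
    df.shift(1, axis=1),     # site to the left (NaN for the first site)
    df.shift(-1, axis=1),    # site to the right (NaN for the last site)
]

# Maximum of the available neighbours for every cell; NaNs at the edges are ignored.
neighbour_max = pd.concat(neighbours).groupby(level=0).max()

# Flag cells that exceed all of their neighbours by more than the threshold.
outliers = (df - neighbour_max) > THRESHOLD
print(outliers)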

Remove zeros in pandas dataframe without effecting the imputation result

I have a timeseries dataset with 5M rows.
The column has 19.5% missing values and 80% zeroes (don't go by the percentage values - it means only 0.5% of the data is useful, but 0.5% of 5M is still enough). Now, I need to impute this column.
Given the number of rows, it's taking around 2.5 hours for KNN to impute the whole thing.
To make it faster, I thought of deleting all the rows with zero values and then carrying out the imputation. But I feel that using KNN naively after this would lead to overestimation (since all the zero values are gone and, with the number of neighbours kept fixed, the mean is expected to increase).
So, is there a way:
to modify the data input to the KNN model, or
to carry out the imputation after removing the rows with zeros, so that the values obtained after imputation are the same, or at least close?
To understand the problem more clearly, consider the following dummy dataframe:
DATE VALUE
0 2018-01-01 0.0
1 2018-01-02 8.0
2 2018-01-03 0.0
3 2018-01-04 0.0
4 2018-01-05 0.0
5 2018-01-06 10.0
6 2018-01-07 NaN
7 2018-01-08 9.0
8 2018-01-09 0.0
9 2018-01-10 0.0
Now, if I use KNN (k=3), then with the zeros present, the missing value would be imputed with the weighted mean of 0, 10 and 9. But if I remove the zeros naively, the value will be imputed with the weighted mean of 8, 10 and 9.
A few rough ideas which I thought of but could not proceed through were as follows:
Modifying the weights (used in the weighted mean computation) of the KNN imputation process so that the removed 0s are taken into account during the imputation.
Adding a column which says how many neighbouring zeros a particular column has and then somehow use it to modify the imputation process.
Points 1 and 2 are just rough ideas which crossed my mind while thinking about how to solve the problem, and they might help while answering the question.
PS -
Obviously, I am not feeding the time series data directly into KNN. What I am doing is extracting month, day, etc. from the date column, and then using this for imputation.
I do not need parallel processing as an answer to make the code run faster. The data is so large that high RAM usage hangs my laptop.
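(For illustration only, that kind of setup could look roughly like the following, using scikit-learn's KNNImputer on the dummy frame above; the choice of date features is just an example.)

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "DATE": pd.date_range("2018-01-01", periods=10),
    "VALUE": [0.0, 8.0, 0.0, 0.0, 0.0, 10.0, None, 9.0, 0.0, 0.0],
})

# Date-derived features used as KNN coordinates, as described above.
features = df.assign(
    month=df["DATE"].dt.month,
    day=df["DATE"].dt.day,
    dayofyear=df["DATE"].dt.dayofyear,
)[["month", "day", "dayofyear", "VALUE"]]

imputer = KNNImputer(n_neighbors=3, weights="distance")
df["VALUE_imputed"] = imputer.fit_transform(features)[:, -1]
print(df)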
Let's think logically and leave the machine learning part aside for the moment.
Since we are dealing with a time series, it would be good to impute the data with the average of the values for the same date in different years, say 2-3 years (if we consider 2 years, that means 1 year before and 1 year after the year of the missing value); I would recommend not going beyond 3 years. Call this computed value x.
Further, to keep this computed value x close to the current data, use the average of x and y, where y is the linear interpolation value.
In the above example, y = (10 + 9)/2, i.e. the average of the one value before and the one value after the data point to be imputed.
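A minimal pandas sketch of that idea (the toy series and the +/- 1 year window are assumptions for illustration):

import numpy as np
import pandas as pd

# Toy daily series spanning three years, with one value missing.
s = pd.Series(
    np.random.rand(365 * 3),
    index=pd.date_range("2017-01-01", periods=365 * 3, freq="D"),
)
s.iloc[500] = np.nan

# y: linear interpolation of the whole series, computed once.
linear = s.interpolate(method="time")

for ts in s[s.isna()].index:
    # x: average of the same calendar date one year before and one year after.
    same_dates = [ts - pd.DateOffset(years=1), ts + pd.DateOffset(years=1)]
    x = s.reindex(same_dates).mean()          # dates missing from the index are ignored

    # Impute with the average of the seasonal estimate x and the interpolated value y.
    s[ts] = np.nanmean([x, linear[ts]])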

Binning data with non-uniform bin size and variable numbers with Python/Panda

I'm trying to use Python (3.5.1) to sort my data into bins and found some old threads related to this. However, so far I've only been able to see how to sort into a pre-defined number of bins and retrieve the number of datapoints in each bin (e.g. thread 1, thread 2, thread 3, thread 4), which is not quite what I had in mind.
So my situation is this:
I have a Pandas dataframe with two columns of data. It looks something like this:
4646.06 1.69
4886.33 1.17
4989.14 1.93
4992.14 1.00
5057.03 1.36
6417.99 1.15
6418.01 1.26
6418.02 1.04
6418.03 1.34
6419.01 1.20
6419.02 1.09
6422.24 2.01
...... .....
There are some 200 entries like these in the two columns. As you can see the data is separated by a variable interval and occasionally there are multiple numbers bundled together.
What I want:
is to bin every value in the columns such that values lying between .8 and .4 are binned together and taken as an average. For example, in the above, a series like the 17.99 and the three 18.something values belong to the same measurement, and I need their mean, together with the mean of the corresponding entries in the second column, to replace the original entries. So far I've done this by exporting to Excel, manually finding the mean and then reloading it into a dataframe, which cuts the number of entries in half. As long as the number of entries is small this is possible, but if at some point I include more it will take too much time by hand, so I would really like to do it automatically somehow.
This is where I'm stuck. I can't just define a set of bins from the beginning to the end, since my data is not uniformly separated. Neither can I enter all the bins I want by hand, because this too becomes impractical for longer sets of data, nor can I make them automatically as in this example, because the bins are not regularly spaced.
So just to be clear a dataframe like the above would instead become
4646.06 1.69
4886.33 1.17
4989.14 1.93
4992.14 1.00
5057.03 1.36
(6417.99 + 6418.01 + 6418.02 + 6418.03)/4 (1.15 + 1.26 + 1.04 + 1.34)/4
(6419.01 + 6419.02)/2 (1.20 + 1.09)/2
6422.24 2.01
...... .....
I am really at a loss on how to do this - if it is even possible. Any advice would be greatly appreciated.
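One common pandas pattern for this kind of gap-based grouping is to start a new group whenever the jump between consecutive values exceeds some threshold, then average within each group. A minimal sketch (the 0.5 threshold and the column names are assumptions):

import pandas as pd

df = pd.DataFrame({
    "x": [4646.06, 4886.33, 6417.99, 6418.01, 6418.02, 6418.03, 6419.01, 6419.02, 6422.24],
    "y": [1.69, 1.17, 1.15, 1.26, 1.04, 1.34, 1.20, 1.09, 2.01],
})

# Start a new group whenever the jump from the previous value exceeds the threshold.
threshold = 0.5
group = (df["x"].diff().abs() > threshold).cumsum()

# Average the rows within each group; isolated values stay as they are.
result = df.groupby(group.rename("group_id")).mean().reset_index(drop=True)
print(result)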

Python - Zero-Order Hold Interpolation (Nearest Neighbor)

I will be shocked if there isn't some standard library function for this, especially in numpy or scipy, but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is a timeseries of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond but didn't think that was necessary in the example). The 2nd column would be one of the prices - either the buy or sell price.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector which fit between one timestamp (t0) and the next (t1) then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there's a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use an Nx2 matrix (column 1: times, column 2: price), but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handle multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA
For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It can do a zero-order hold interpolation if you specify kind="zero". It can also simultaneously interpolate multiple columns of a matrix; you just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your regularly sampled data by calling f(np.linspace(start_time, end_time, time_step)).
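A minimal sketch of that suggestion, using the numbers from the question (the regular 1-unit grid is an assumption):

import numpy as np
from scipy.interpolate import interp1d

times = np.array([1.0, 1.01, 10.0004, 124.23])
prices = np.array([[0.0003234], [0.0003233], [0.00033], [0.0003334]])   # one column per price series

# kind="zero" gives a zero-order hold; axis=0 interpolates each column independently.
f = interp1d(times, prices, kind="zero", axis=0)

new_times = np.arange(1.0, 125.0, 1.0)                 # regular grid inside the original range
resampled = np.column_stack([new_times, f(new_times)])
print(resampled[:5])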
