I want to store the coefficients of a statsmodels.api model for future use (so I don't have to fit the model every time). When I get a new dataframe that I want to make predictions on, I want to multiply each row of the dataframe by the coefficients (i.e. model.params) and then sum the products in each row to get that row's prediction. However, it does not seem to work when I try:
preds = []
for _, row in df.iterrows():  # iterrows() yields (index, row) pairs
    preds.append((model.params * row).sum())
Edit: example
df:
Height Weight Color
6 5 3
6 2 4
9 1 9
10 3 3
coefficients:
Height: -1.6403
Weight: 2.0435
Color: 300.4532
I would consider doing something like:
df.dot(model.params)
This computes the dot product of each row of the DataFrame with the coefficient vector.
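As a minimal sketch (assuming the column names of df match the index of model.params; the numbers below are the example values from the question):

import pandas as pd

df = pd.DataFrame({'Height': [6, 6, 9, 10],
                   'Weight': [5, 2, 1, 3],
                   'Color': [3, 4, 9, 3]})
params = pd.Series({'Height': -1.6403, 'Weight': 2.0435, 'Color': 300.4532})

# one dot product per row; alignment is by label, not position
preds = df.dot(params)
print(preds.round(4))
# 0     901.7353
# 1    1196.0580
# 2    2691.3596
# 3     891.0871
# dtype: float64

The same params Series can be saved (e.g. with to_csv or to_pickle) and reloaded later, so the model itself never needs to be refit just to predict.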
It seems like you need:
coeff_dict = {
    'Height': -1.6403,
    'Weight': 2.0435,
    'Color': 300.4532
}
df.assign(prediction=df.assign(**coeff_dict).mul(df).sum(axis=1))
Output:
Height Weight Color prediction
0 6 5 3 901.7353
1 6 2 4 1196.0580
2 9 1 9 2691.3596
3 10 3 3 891.0871
I've trained a machine learning model using sklearn and want to simulate the result by sampling the predictions according to the predict_proba probabilities. So I want to do something like
samples = np.random.choice(a = possible_outcomes, size = (n_data, n_samples), p = probabilities)
Where probabilities would be an (n_data, n_possible_outcomes) array.
But np.random.choice only allows 1D arrays for the p argument. I've currently gotten around this using a for loop, like in the following implementation:
sample_outcomes = np.zeros((len(probs), n_samples))
for i in trange(len(probs)):
    sample_outcomes[i, :] = np.random.choice(outcomes, size=n_samples, p=probs[i])
but that's relatively slow. Any suggestions to speed this up would be much appreciated!
If I understood correctly, you want a vectorized way of applying choice several times, each time with a different probability vector.
You could implement this by hand as follows:
import numpy as np
# for reproducibility
np.random.seed(42)
# number of samples
k = 5
# possible outcomes
outcomes = np.arange(10)
# generate a random probability matrix for 15 runs
probabilities = np.random.random((15, 10))
probs = probabilities / probabilities.sum(1)[:, None]
# generate the choices by picking those probabilities above a random generated number
# the higher the value in probs the higher the probability to pick it
choices = probs - np.random.random((15, 10))
# to pick the top k using argpartition need to multiply by -1
choices = -1 * choices
# pick the top k values
res = outcomes[np.argpartition(choices, k, axis=1)][:, :k]
# flatten to match the expected output
print(res.flatten())
Output
[1 8 2 5 3 6 4 8 7 0 1 5 9 3 7 1 4 9 0 8 5 0 4 3 6 8 5 1 2 6 5 3 2 0 6 5 4
2 3 7 7 9 4 6 1 3 6 4 2 1 4 9 3 0 1 6 9 2 3 8 5 4 7 6 1 5 3 8 2 1 1 0 9 7
4]
In the above example the code samples 5 (k) elements from a population of 10 (outcomes), 15 times, each time with a different probability vector (probs, with shape 15 by 10).
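For comparison, another common vectorized approach, not taken from the answers here and shown only as a sketch, is inverse-CDF sampling: build the cumulative distribution per row and locate uniform random draws in it. Unlike the top-k trick above, this draws with replacement, matching np.random.choice's default behaviour, at the cost of an (n_data, n_samples, n_outcomes) boolean array in memory.

import numpy as np

rng = np.random.default_rng(42)
n_data, n_outcomes, n_samples = 15, 10, 5
outcomes = np.arange(n_outcomes)

# row-normalized probability matrix, as in the question
probabilities = rng.random((n_data, n_outcomes))
probs = probabilities / probabilities.sum(axis=1, keepdims=True)

# inverse-CDF sampling: for every uniform draw, take the first bin whose
# cumulative probability covers it
cdf = probs.cumsum(axis=1)
u = rng.random((n_data, n_samples))
sample_outcomes = outcomes[(u[:, :, None] <= cdf[:, None, :]).argmax(axis=2)]
print(sample_outcomes.shape)  # (15, 5)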
Here is an example of what you can do, if I understand your question correctly:
import numpy as np
#create a list of indices
index_list = np.arange(len(possible_outcomes))
# sample indices based on the probabilities
choice = np.random.choice(a = index_list, size = n_samples, p = probabilities)
# get samples based on randomly chosen indices
samples = possible_outcomes[choice]
I want to make sure I understand your problem correctly. Can you just create samples as an array of size n_data * n_samples and then use the resize method to get it to the right shape?
samples = np.random.choice(a = possible_outcomes, size = n_data * n_samples, p = probabilities)
samples.resize((n_data, n_samples))
I have the following dataset, for 36 fragments in total (36 rows × 3 columns):
Fragment lower upper
0 1 1 5
1 2 2 5
2 3 3 5
3 4 2 5
4 5 1 5
5 6 1 5
I've calculated these lower and upper bounds from this dataset (966 rows × 2 columns):
Fragment Confidence Value
0 33 4
1 26 4
2 23 3
3 16 2
4 36 3
which contains multiple instances of each fragment and an associated Confidence Value.
The confidence values are data from a Likert scale, i.e. 1-5. I want to create an error bar plot, for example like this:
So on the y-axis to have each fragment 1-36 and on the x-axis to show the range/std/mean (?) of the confidence values for each fragment.
I've tried the following, but it's not exactly what I want; I think using the lower and upper bounds isn't the best idea, and maybe I need the std/range instead...
#confpd is the second dataset from above
meanconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].mean()
minconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].min()
maxconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].max()
data_dict = {}
data_dict['Fragment'] = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18',
                         '19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36']
data_dict['lower'] = minconfs['Confidence Value']
data_dict['upper'] = maxconfs['Confidence Value']
dataset = pd.DataFrame(data_dict)
##dataset is the first dataset I show above
for lower, upper, y in zip(dataset['lower'], dataset['upper'], range(len(dataset))):
    plt.plot((lower, upper), (y, y), 'o-', color='orange')
plt.yticks(range(len(dataset)), list(dataset['Fragment']))
The result of this code is this, which is not what I want.
Any help is greatly appreciated!!
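One possible direction, shown only as a minimal sketch (assuming confpd is the second dataset above, with columns 'Fragment' and 'Confidence Value'): plot the mean confidence per fragment and use the standard deviation as the error bar.

import matplotlib.pyplot as plt

# mean and std of the confidence values for each fragment
stats = confpd.groupby('Fragment')['Confidence Value'].agg(['mean', 'std'])

# horizontal error bars: x is the mean, xerr is the std, y is the fragment id
plt.errorbar(stats['mean'], stats.index, xerr=stats['std'],
             fmt='o', color='orange', ecolor='orange', capsize=3)
plt.yticks(stats.index)
plt.xlabel('Confidence Value')
plt.ylabel('Fragment')
plt.show()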
I have a dataframe that looks like this
initial year0 year1
0 0 12
1 1 13
2 2 14
3 3 15
Note that the number of year columns (year0, year1, ..., i.e. year_count) is variable, but it stays constant throughout this code.
I first wanted to apply a function to each of the 'year' columns to generate 'mod' columns, like so:
def mod(year, scalar):
    return (year * scalar)
s = 5
year_count = 2
# Generate new columns
df[[f"mod{y}" for y in range (year_count)]] = df[[f"year{y}" for y in range(year_count)]].apply(mod, scalar=s)
initial year0 year1 mod0 mod1
0 0 12 0 60
1 1 13 5 65
2 2 14 10 70
3 3 15 15 75
All good so far. The problem is that I now want to apply another function to both the year column and its corresponding mod column to generate another set of val columns, so something like
def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale
Then I apply this to each of the columns (year0, mod0), (year1, mod1) etc to generate the next tranche of columns.
With scale = 10 I should end up with
initial year0 year1 mod0 mod1 val0 val1
0 0 12 0 60 0 720
1 1 13 5 65 60 780
2 2 14 10 70 120 840
3 3 15 15 75 180 900
This is where I'm stuck - I don't know how to put two existing df columns together in a function with the same structure as in the first example, and if I do something like
df[['val0', 'val1']] = df['col1', 'col2'].apply(lambda x: sum_and_scale('mod0', 'mod1', scale=10))
I don't know how to generalise this to have arbitrary inputs and outputs and also apply the constant scale parameter. (I know the last piece of code won't work, but it's the other avenue to a solution I've seen.)
The reason I'm asking is that I believe the loop I currently have working is creating performance issues, given the number of columns and the length of each column.
Thanks
IMHO, it's better with a simple for loop:
for i in range(2):
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
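If year_count is larger, the same pattern generalises directly; and if the repeated column insertion itself becomes the bottleneck, the new columns can be built first and attached in one go. A sketch, assuming the year{i}/mod{i} naming convention and the df, sum_and_scale and year_count from the question:

import pandas as pd

# build all val columns at once, then attach them with a single concat
vals = pd.concat(
    {f'val{i}': sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
     for i in range(year_count)},
    axis=1)
df = pd.concat([df, vals], axis=1)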
I was wondering if anyone could help me with how to make a bar chart to show the frequencies of values in a Pandas Series.
I start with a Pandas DataFrame of shape (2000, 7), and from there I extract the last column. The column is shape (2000,).
The entries in the Series that I mentioned vary from 0 to 17, each with different frequencies, and I tried to plot them using a bar chart but faced some difficulties. Here is my code:
# First, I counted the number of occurrences.
count = np.zeros(max(data_val))
for i in range(count.shape[0]):
    for j in range(data_val.shape[0]):
        if (i == data_val[j]):
            count[i] = count[i] + 1
'''
This gives us
count = array([192., 105., ... 19.])
'''
temp = np.arange(0, 18, 1) # Array for the x-axis.
plt.bar(temp, count)
I am getting an error on the last line of code, saying that the objects cannot be broadcast to a single shape.
What I ultimately want is a bar chart where each bar corresponds to an integer value from 0 to 17, and the height of each bar (i.e. the y-axis) represents the frequencies.
Thank you.
UPDATE
I decided to post the fixed code using the suggestions that people were kind enough to give below, in case anyone facing a similar issue finds it useful in the future.
data = pd.read_csv("./data/train.csv") # Original data is a (2000, 7) DataFrame
# data contains 6 feature columns and 1 target column.
# Separate the design matrix from the target labels.
X = data.iloc[:, :-1]
y = data['target']
'''
The next line of code uses pandas.Series.value_counts() on y in order to count
the number of occurrences for each label, and then proceeds to sort these according to
index (i.e. label).
You can also use pandas.DataFrame.sort_values() instead if you're interested in sorting
according to the number of frequencies rather than labels.
'''
y.value_counts().sort_index().plot.bar(x='Target Value', y='Number of Occurrences')
There was no need for the for loops once I used the methods built into the pandas library.
The specific methods mentioned in the answers are pandas.Series.value_counts(), pandas.Series.sort_index(), and pandas.Series.plot.bar().
I believe you need value_counts with Series.plot.bar:
df = pd.DataFrame({
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'd': [1, 1, 6, 1, 6, 5],
})
print (df)
a b c d
0 4 7 1 1
1 5 8 3 1
2 4 9 5 6
3 5 4 7 1
4 5 2 1 6
5 4 3 0 5
df['d'].value_counts(sort=False).plot.bar()
If some values might be missing and you need to set them to 0, add reindex:
df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0).plot.bar()
Detail:
print (df['d'].value_counts(sort=False))
1 3
5 1
6 2
Name: d, dtype: int64
print (df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0))
0 0
1 3
2 0
3 0
4 0
5 1
6 2
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
Name: d, dtype: int64
Here's an approach using Seaborn
import numpy as np
import pandas as pd
import seaborn as sns
s = pd.Series(np.random.choice(17, 10))
s
# 0 10
# 1 13
# 2 12
# 3 0
# 4 0
# 5 5
# 6 13
# 7 9
# 8 11
# 9 0
# dtype: int64
val, cnt = np.unique(s, return_counts=True)
val, cnt
# (array([ 0, 5, 9, 10, 11, 12, 13]), array([3, 1, 1, 1, 1, 1, 2]))
sns.barplot(x=val, y=cnt)
I have built a model for the optimization of my energy consumption.
Some of the variables are given in datasheets.
For simplicity, and because different datasheets are available, I built my model inside a class.
The different energy consumptions are defined in the __init__ method of the class:
def __init__(self, PV, refrigerator, clotheswasher, dishwasher, freezer):
    self.PV = PV
    self.refrigerator = refrigerator
    self.clotheswasher = clotheswasher
    self.dishwasher = dishwasher
    self.freezer = freezer
These values are used in my model for all timestamps of the data (i.e. one day of data at 5-minute intervals):
for t in self.model.T:
    self.model.PV[t] = self.PV
    self.model.refrigerator[t] = self.refrigerator
    self.model.clotheswasher[t] = self.clotheswasher
    self.model.dishwasher[t] = self.dishwasher
    self.model.freezer[t] = self.freezer
I want to add them up to make a plot of my total energy consumption over the day:
self.model.total[t] = self.model.PV[t] + self.model.refrigerator[t] + self.model.clotheswasher[t] + self.model.dishwasher[t] + self.model.freezer[t]
However, by doing so for every t in self.model.total[t] I don't get the result I expect. For example, when adding columns A, B and C:
index A B C
1 3 4 2
2 2 1 4
3 1 3 2
I would like to get a dataframe like:
index tot
1 9
2 7
3 6
but I get:
index tot
1 9
7
6
2 9
7
6
3 9
7
6
Can someone help me out?
If we simplify what you are trying to do by working on your simplified data example, we could use the following code:
import pandas as pd

DF = pd.read_csv("Data.csv")
print(DF)
print()
for i in range(len(DF)):
    print(i, sum(DF.iloc[i]))
which would yield the following output:
A B C
0 3 4 2
1 2 1 4
2 1 3 2
0 9
1 7
2 6
You are probably just making a simple mistake in your class instantiation or data loading. Once you fix that, your results will likely come out right. Start from a small, simple data set until you find the issue, and troubleshoot each step.
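As a side note, if the per-timestep values end up in a DataFrame with one column per appliance, the row-wise total can also be computed directly. A minimal sketch using the example values from the question:

import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1], 'B': [4, 1, 3], 'C': [2, 4, 2]},
                  index=[1, 2, 3])
df['tot'] = df.sum(axis=1)  # row-wise sum: 9, 7, 6
print(df[['tot']])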