Suppose I have the following toy data:
import pandas as pd
from linearmodels.panel import PanelOLS
y = pd.DataFrame(
index=[[1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3]],
data=[70, 60, 50, 30, 33, 27],
columns=["y"],
)
y.index.set_names(["Entity", "Time"], inplace=True)
x = pd.DataFrame(
index=[[1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3]],
data=[[100], [89], [62], [29], [49], [23]],
columns=["X"],
)
x.index.set_names(["Entity", "Time"], inplace=True)
And build a model using PanelOLS with entity_effects=True:
model_within = PanelOLS(dependent=y, exog=x, entity_effects=True).fit()
And then wanted to use the predict() method to see how a new "entity" would be modelled. First creating a new entity with:
new_x = pd.DataFrame(
index=[[3, 3, 3], [1, 2, 3]],
data=[[40], [70], [33]],
columns=["X"],
)
new_x.index.set_names(["Entity", "Time"], inplace=True)
Then predicting with:
model_within.predict(new_x)
To get the following output:
predictions
Entity
Time
3
1
16.136230
2
28.238403
3
13.312390
According to Wooldridge, 2012, pg 485, the within estimator is estimating:
Since this is modelling a change in expected y from the average of past y's for this entity, how should the predictions be interpreted? My intuition is that the prediction is saying:
For this new entity, 3, in time period 1, given these X inputs, y at time 1 should be 16 units higher than it's average y across all time, for this entity. Is this interpretation correct? How might it be improved?
linearmodels .predict() documentation
Posting results from seeking clarification through the repo:
https://github.com/bashtage/linearmodels/issues/465
"The model is always Y=XB + epsilon + (eta_t ) + (nu_i ). The effects are treated as errors, and so when you predict you get new_x # params and so the entity effects are not used."
So the predictions are for actual values of y, not time-demeaned predictions. However, to achieve time-demeaned predictions, one can create the same model using data that has first been time-demeaned, and pass in new time-demeaned data to predict on.
Related
import numpy as np
arr = np.random.random((5, 3))
labels = [1, 1, 2, 2, 3]
arr
Out[136]:
array([[0.20349907, 0.1330621 , 0.78268978],
[0.71883378, 0.24783927, 0.35576746],
[0.17760916, 0.25003952, 0.29058267],
[0.90379712, 0.78134806, 0.49941208],
[0.08025936, 0.01712403, 0.53479622]])
labels
Out[137]: [1, 1, 2, 2, 3]
assume I have this dataset.
I would like, using the labels as indicators, to perform np.mean over the rows.
(The labels here indicates the class of each row.
labels could also be [0, 1, 1, 0, 4, 1, 4] So have no assumptions over them.)
So the output here will be an average over the:
1st and 2nd row.
3rd and 4th row.
5th row.
in the most efficient way numpy offers. like so:
[np.mean(arr[:2], axis=0),
np.mean(arr[2:4], axis=0),
np.mean(arr[4:], axis=0)]
Out[180]:
[array([0.46116642, 0.19045069, 0.56922862]),
array([0.54070314, 0.51569379, 0.39499737]),
array([0.08025936, 0.01712403, 0.53479622])]
(in real life scenario the matrix dimensions could be (100000, 256))
First we would like to sort our label and matrix:
labels = np.array(labels)
# Getting the indices of a sorted array
sorted_indices = np.argsort(labels)
# Use the indices to sort both labels and matrix
sorted_labels = labels[sorted_indices]
sorted_matrix = matrix[sorted_indices]
Then, we calculate the "steps" or pairs of indices, (from, to) we want to calculate average over, We sum them and divide by their count.
# Here we're getting the amount of rows per label to average (over the sorted_matrix).
# Infact, we're getting the start and end indices per label.
label_indices = np.concatenate(([0], np.where(np.diff(sorted_labels) != 0)[0] + 1, [len(sorted_labels)]))
# using add + reduceat to add all rows with regard to the label indices
group_sums = np.add.reduceat(sorted_matrix, label_indices[:-1], axis=0)
# getting count for each group using the diff in label_indices
group_counts = np.diff(label_indices)
# Calculating the mean
group_means = group_sums / group_counts[:, np.newaxis]
Example:
matrix
Out[265]:
array([[0.69524902, 0.22105336, 0.65631557, 0.54823511, 0.25248685],
[0.61675048, 0.45973729, 0.22410694, 0.71403135, 0.02391662],
[0.02559926, 0.41640708, 0.27931808, 0.29139379, 0.76402121],
[0.27166955, 0.79121862, 0.23512671, 0.32568048, 0.38712154],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.01685432, 0.8395658 , 0.73460083, 0.08056013, 0.02522956],
[0.27274409, 0.64602305, 0.05698037, 0.23214598, 0.75130743],
[0.65069115, 0.32383729, 0.86316629, 0.69659358, 0.26667206],
[0.91971818, 0.02011127, 0.91776206, 0.79474582, 0.39678431],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
labels
Out[266]: array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])
group_means
Out[267]:
array([[0.33377274, 0.58170155, 0.79888356, 0.38857686, 0.14595081],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.40602051, 0.36084713, 0.41802017, 0.43942853, 0.63737099],
[0.52788969, 0.49066976, 0.37184974, 0.52931565, 0.221175 ],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
and the results are suited for: np.unique(sorted_labels)
np.unique(sorted_labels)
Out[271]: array([0, 1, 2, 3, 5])
I did not understand the labels part in your question. but there is a way to calculate the mean of each row in a matrix.
use --> np.mean(arr, axis = 1).
If lables to be used, please go through below mentioned script.
import numpy as np
arr = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[1,2,3],
[4,5,6]])
labels =np.array([0, 1, 1, 0, 4])
#print(arr)
#print('LABEL IS :', labels)
#print('MEAN VALUES ARE : ',np.mean(arr[:2], axis = 1))
id = labels.argsort()
eq_lal = labels[id]
print(eq_lal)
print(arr[eq_lal])
print(np.mean(arr[eq_lal], axis = 1))
Getting this error: AttributeError: 'GPT2Tokenizer' object has no
attribute 'train_new_from_iterator'
Very similar to hugging face documentation. I changed the input and that's it (shouldn't affect it). It worked once. Came back to it 2 hrs later and it doesn't... nothing was changed NOTHING. Documentation states train_new_from_iterator only works with 'fast' tokenizers and that AutoTokenizer is supposed to pick a 'fast' tokenizer by default. My best guess is, it is having some trouble with this. I also tried downgrading transformers and reinstalling to no success. df is just one column of text.
from transformers import AutoTokenizer
import tokenizers
def batch_iterator(batch_size=10, size=5000):
for i in range(100): #2264
query = f"select note_text from cmx_uat.note where id > {i * size} limit 50;"
df = pd.read_sql(sql=query, con=cmx_uat)
for x in range(0, size, batch_size):
yield list(df['note_text'].loc[0:5000])[x:x + batch_size]
old_tokenizer = AutoTokenizer.from_pretrained('roberta')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
There are two things for keeping in mind:
First: The train_new_from_iterator works with fast tokenizers only.
(here you can read more)
Second: The training corpus. Should be
a generator of batches of texts, for instance, a list of lists of
texts if you have everything in memory. (official documents)
def batch_iterator(batch_size=3, size=8):
df = pd.DataFrame({"note_text": ['fghijk', 'wxyz']})
for x in range(0, size, batch_size):
yield df['note_text'].to_list()
old_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
training_corpus = batch_iterator()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
print(old_tokenizer( ['fghijk', 'wxyz']))
print(new_tokenizer( ['fghijk', 'wxyz']))
output:
{'input_ids': [[0, 506, 4147, 18474, 2], [0, 605, 32027, 329, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
{'input_ids': [[0, 22, 2], [0, 21, 2]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
I'm trying to mask bad pixels in a dataset taken from a detector. In my attempt to come up with a general way to do this so I can run the same code across different images, I tried a few different methods, but none of them ended up working. I'm pretty new with coding and data analysis in Python, so I could use a hand putting things in terms that the computer will understand.
As an example, consider the matrix
A = np.array([[3,5,50],[30,2,6],[25,1,1]])
What I'm wanting to do is set any element in A that is two standard deviations away from the mean equal to zero. The reason for this is that later in the code, I'm defining a function that only uses the nonzero values for the calculation, since the zeros are part of the mask.
I know this masking technique works, but I tried extending the following code to work with the standard deviation:
mask = np.ones(np.shape(A))
mask.flat[A.flat > 20] = 0
What I tried was:
mask = np.ones(np.shape(A))
for i,j in A:
mask.flat[A[i,j] - 2*np.std(A) < np.mean(A) < A[i,j] + 2*np.std(A)] = 0
Which throws the error:
ValueError: too many values to unpack (expected 2)
If anyone has a better technique to statistically remove bad pixels in an image, I'm all ears. Thanks for the help!
==========
EDIT
After some trial and error, I got to a place that could help clarify my question. The new code is:
for i in A:
for j in i:
mask.flat[ j - 2*np.std(A) < np.mean(A) < j + 2*np.std(A)] = 0
This throws an error saying 'unsupported iterator index'. What I'm wanting to happen is that the for loop iterates across each element in the array, checks if it's less/greater than 2 standard deviations from the mean, and it is, sets it to zero.
Here is an approach that will be sligthly faster on larger images:
import numpy as np
import matplotlib.pyplot as plt
# generate dummy image
a = np.random.randint(1,5, (5,5))
# generate dummy outliers
a[4,4] = 20
a[2,3] = -6
# initialise mask
mask = np.ones_like(a)
# subtract mean and normalise to standard deviation.
# then any pixel in the resulting array that has an absolute value > 2
# is more than two standard deviations away from the mean
cond = (a-np.mean(a))/np.std(a)
# find those pixels and set them to zero.
mask[abs(cond) > 2] = 0
Inspection:
a
array([[ 1, 1, 3, 4, 2],
[ 1, 2, 4, 1, 2],
[ 1, 4, 3, -6, 1],
[ 2, 2, 1, 3, 2],
[ 4, 1, 3, 2, 20]])
np.round(cond, 2)
array([[-0.39, -0.39, 0.11, 0.36, -0.14],
[-0.39, -0.14, 0.36, -0.39, -0.14],
[-0.39, 0.36, 0.11, -2.12, -0.39],
[-0.14, -0.14, -0.39, 0.11, -0.14],
[ 0.36, -0.39, 0.11, -0.14, 4.32]])
mask
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 0]])
You A is three dimensional so you need to unpack using three variables like below.
A = np.array([[3,5,50],[30,2,6],[25,1,1]])
for i in A:
for j in i:
print(j)
I have two datasets, One of which has time array in datetime.datetime form, and x,y,z coordinates array of that time, like time[0]=datetime.datetime(2000,1,21,0,7,25), x[0]=-6.7, etc.
I'd like to calculate something from the coordinates, but that needs another parameter (Ma) which depend on time. Second data set has another time array in same datetime form, and the parameter recorded at that time, like time[0]=datetime.datetime(2000,1,1,0,3), Ma[0]=2.73
The problem is that the time array of two data set is different (though the ranges are similar)
So I want to interpolate the parameter's value at each time of data set 1, like Ma[0], but 0 is not index of time of dataset2, but corresponds to index of dataset 1.
How can I do that?
PS. Can I convert the time form to simpler one? datetime.datetime seems quite cumbersome.
The following is an example of how to interpolate your values. The coord_ and ma_ arrays will be your imported data.
The first thing the script does is build some more sensible data structures from your disparate 1 dimensional arrays. The part that you're actually looking for is the call to np.interp, documented here.
import numpy as np
import datetime
import time
# Numpy cannot interpolate between datetimes
# This function converts a datetime to a timestamp
def to_ts(dt):
return time.mktime(dt.timetuple())
coord_dts = np.array([
datetime.datetime(2000, 1, 1, 12),
datetime.datetime(2000, 1, 2, 12),
datetime.datetime(2000, 1, 3, 12),
datetime.datetime(2000, 1, 4, 12)
])
coord_xs = np.array([3, 5, 8, 13])
coord_ys = np.array([2, 3, 5, 7])
coord_zs = np.array([1, 3, 6, 10])
ma_dts = np.array([
datetime.datetime(2000, 1, 1),
datetime.datetime(2000, 1, 2),
datetime.datetime(2000, 1, 3),
datetime.datetime(2000, 1, 4)
])
ma_vals = np.array([1, 2, 3, 4])
# Handling the data as separate arrays will be painful.
# This builds an array of dictionaries with the form:
# [ { 'time': timestamp, 'x': x coordinate, 'y': y coordinate, 'z': z coordinate }, ... ]
coords = np.array([
{ 'time': to_ts(coord_dts[idx]), 'x': coord_xs[idx], 'y': coord_ys[idx], 'z': coord_zs[idx] }
for idx, _ in enumerate(coord_dts)
])
# Build array of timestamps from ma datetimes
ma_ts = [ to_ts(dt) for dt in ma_dts ]
for coord in coords:
print("ma interpolated value", np.interp(coord['time'], ma_ts, ma_vals))
print("at coordinates:", coord['x'], coord['y'], coord['z'])
I want to build a classifier that provides labels given a time series of vectors. I have the code for a static LSTM-based classifier, but I don't know how I can incorporate the time information:
Training set:
time = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11,12,13,14,15,16,17,18]
f1 = [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2]
f2 = [2, 1, 3, 2, 4, 2, 3, 1, 9, 2, 1, 2, 1, 6, 1, 8, 2, 2]
labels = [a, a, b, b, a, a, b, b, a, a, b, b, a, a, b, b, a, a]
Test set:
time = [1, 2, 3, 4, 5, 6]
f1 = [2, 2, 2, 1, 1, 1]
f2 = [2, 1, 2, 1, 6, 1]
labels = [?, ?, ?, ?, ?, ?]
Following this post, I implemented the following in pybrain:
from pybrain.datasets import SequentialDataSet
from itertools import cycle
import matplotlib.pyplot as plt
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import LSTMLayer
from pybrain.supervised import RPropMinusTrainer
from sys import stdout
data = [1,2,3,4,5,6,7]
ds = SequentialDataSet(1, 1)
for sample, next_sample in zip(data, cycle(data[1:])):
ds.addSample(sample, next_sample)
print ds
net = buildNetwork(2, 5, 1, hiddenclass=LSTMLayer, outputbias=False, recurrent=True)
trainer = RPropMinusTrainer(net, dataset=ds)
train_errors = [] # save errors for plotting later
EPOCHS_PER_CYCLE = 5
CYCLES = 100
EPOCHS = EPOCHS_PER_CYCLE * CYCLES
for i in xrange(CYCLES):
trainer.trainEpochs(EPOCHS_PER_CYCLE)
train_errors.append(trainer.testOnData())
epoch = (i+1) * EPOCHS_PER_CYCLE
print("\r epoch {}/{}".format(epoch, EPOCHS))
stdout.flush()
print()
print("final error =", train_errors[-1])
plt.plot(range(0, EPOCHS, EPOCHS_PER_CYCLE), train_errors)
plt.xlabel('epoch')
plt.ylabel('error')
plt.show()
for sample, target in ds.getSequenceIterator(0):
print(" sample = %4.1f" % sample)
print("predicted next sample = %4.1f" % net.activate(sample))
print(" actual next sample = %4.1f" % target)
print()
This trains a classifier, but I don't know how to incorporate the time information. How can I include the information about the order of the vectors?
This is how I implemented my sequence labeling. I have six classes of labels. I have 20 sample sequence for each class. Each sequence consist of 100 timesteps of datapoints with 10 variables.
input_variable = 10
output_class = 1
trndata = SequenceClassificationDataSet(input_variable,output_label, nb_classes=6)
# input 1st sequence into dataset for class label 0
for i in range(100):
trndata.appendLinked(sequence1_class0[i,:], [0])
trndata.newSequence()
# input 2nd sequence into dataset for class label 0
for i in range(100):
trndata.appendLinked(sequence2_class0[i,:], [0])
trndata.newSequence()
......
......
# input 20th sequence into dataset for class label 5
for i in range(100):
trndata.appendLinked(sequence20_class5[i,:], [5])
trndata.newSequence()
You could put all of them in a for loop eventually. The trndata.newSequence() is called every time a new sample sequence is given as dataset.
The training of the network should be similar to your existing code.