I'm aligning multiple datasets (model and observations), and it would make a lot of sense to me if xarray.align had an option to propagate NaNs/missing data from one dataset to the others. For now, I'm using xr.Dataset.where in combination with np.isfinite, but my attempt to generalize this to more than two arrays in particular feels a bit clumsy. Is there a better way to do this?
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(10).astype(float))
b = xr.DataArray(np.arange(10).astype(float))
a[[4, 5]] = np.nan
print(a.values)
print(b.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Default behaviour
c, d = xr.align(a, b)
print(c.values)
print(d.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Desired behaviour
e, f = xr.align(a.where(np.isfinite(b)), b.where(np.isfinite(a)))
print(e.values)
print(f.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
# Attempt to generalize for multiple arrays
c = b.copy()
c[[1, -1]] = np.nan

def align_better(*dataarrays):
    allvalid = np.all(np.array([np.isfinite(x) for x in dataarrays]), axis=0)
    return xr.align(*[da.where(allvalid) for da in dataarrays])
g, h, i = align_better(a, b, c)
print(g.values)
print(h.values)
print(i.values)
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
From the xarray docs:
Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes and dimension sizes.
Array from the aligned objects are suitable as input to mathematical operators, because along each dimension they have the same index and size.
Missing values (if join != 'inner') are filled with NaN.
Nothing about this function deals with the values in the arrays, just the dimensions and coordinates. This function is used for setting up arrays for operations against each other.
If your desired behavior is a function that returns NaN for all arrays where any arrays are NaN, your align_better function seems like a decent way to do it.
The function in my initial attempt was slow because the DataArrays were cast to NumPy arrays before being aligned. In this modified version, I first align the datasets; then I can safely use .values, which is much faster.
def align_better(*dataarrays):
    """Align datasets and propagate NaNs."""
    aligned = xr.align(*dataarrays)
    allvalid = np.all(np.asarray([np.isfinite(x).values for x in aligned]), axis=0)
    return [da.where(allvalid) for da in aligned]
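A quick sanity check with the arrays defined above (a, b, and c); the NaN pattern should match the earlier output, with NaNs at positions 1, 4, 5 and 9 in all three results:
g, h, i = align_better(a, b, c)
print(g.values)
print(h.values)
print(i.values)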
I have a long array (could be pandas or numpy, as convenient) where some rows have the first two columns identical (x-y position), and the third is unique (time), eg:
x y t
0. 0. 10.
0. 0. 11.
0. 0. 12.
0. 1. 13.
0. 1. 14.
1. 1. 15.
Positions are grouped, but there may be 1, 2 or 3 time values listed for each, meaning there may be 1, 2 or 3 rows with identical x and y. The array needs to be reshaped/condensed such that each position has its own row, with the min and max values of time - i.e., the target is:
x y t1 t2
0. 0. 10. 12.
0. 1. 13. 14.
1. 1. 15. inf
Is there a simple/elegant way of doing this in pandas or numpy? I've tried loops but they're messy and terribly inefficient, and I've tried using np.unique:
target_array = np.unique(initial_array[:, 0:2], axis=0)
That yields
x y
0. 0.
0. 1.
1. 1.
which is a good start, but then I'm stuck on generating the last two columns.
IIUC, you can use
# min and max of t per position; the count column flags positions with a single time
out = (df.groupby(['x', 'y'])['t']
         .agg(t1='min', t2='max', c='count')
         .reset_index()
         .pipe(lambda df: df.assign(t2=df['t2'].mask(df['c'].eq(1), np.inf)))
         .drop(columns='c')
      )
print(out)
x y t1 t2
0 0.0 0.0 10.0 12.0
1 0.0 1.0 13.0 14.0
2 1.0 1.0 15.0 inf
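If the data starts out as a NumPy array rather than a DataFrame (the question allows either), a minimal sketch of wrapping it first so the groupby above applies; the column names are simply the ones used in the question:
import numpy as np
import pandas as pd

# the example rows from the question as a plain NumPy array
initial_array = np.array([[0., 0., 10.],
                          [0., 0., 11.],
                          [0., 0., 12.],
                          [0., 1., 13.],
                          [0., 1., 14.],
                          [1., 1., 15.]])

df = pd.DataFrame(initial_array, columns=['x', 'y', 't'])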
I have a multidimensional array and want to mask all values which are NOT NaN. I know there is a mask for invalid values, which masks the NaNs, but I want the opposite: to keep only the NaN values. I've tried using where, but I'm not sure I'm writing it correctly.
Code (tt and tt2 produce the same thing):
tt = np.ma.array([[[0, 1, 2], [3, np.nan, 5], [6, 7, 8]],
                  [[10, 11, 12], [13, np.nan, 15], [16, 17, 18]],
                  [[20, 21, 22], [23, np.nan, 25], [26, 27, 28]]])
tt2 = np.ma.where(tt == np.nan, tt == np.nan, tt)
[[[ 0. 1. 2.]
[ 3. nan 5.]
[ 6. 7. 8.]]
[[10. 11. 12.]
[13. nan 15.]
[16. 17. 18.]]
[[20. 21. 22.]
[23. nan 25.]
[26. 27. 28.]]]
Desired Result:
All numbers to be masked (--), leaving only the NaNs
I think you want:
tt2 = np.ma.masked_where(~np.isnan(tt), tt)
Note the use of np.isnan (i.e., note that np.NaN == np.NaN is False!), and the not (~) operator. In other words, this does, "mask where the array tt is not NaN". Good luck.
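For reference, a quick check of the result on the array from the question; masked entries print as --, so only the NaNs stay visible (only the first 2-D slice is shown):
tt2 = np.ma.masked_where(~np.isnan(tt), tt)
print(tt2[0])
# [[-- -- --]
#  [-- nan --]
#  [-- -- --]]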
Say I have a large dataframe and some lists of columns, and I want to be able to pass them to patsy's dmatrices without having to write out each name individually. That is, I want to build the formula terms from lists of column names rather than writing out every single column name in my dataframe.
For example take the following df
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8],
                   'c': [8, 4, 5, 3], 'd': [1, 3, 55, 3],
                   'e': [8, 4, 5, 3]})
df
>>
a b c d e
0 1 5 8 1 8
1 2 6 4 3 4
2 3 7 5 55 5
3 4 8 3 3 3
As I understand it, getting this into a design matrix requires me to do the following:
y,x = dmatrices('a~b+c+d+e', data=df)
However I would like to be able to run something more along the lines of:
regress=['b', 'c']
control=['e', 'd']
y, x = dmatrices('a~{}+{}'.format(' '.join(e for e in regress),
                                  ' '.join(c for c in control)), data=df)
However, this was unsuccessful.
I also attempted to use a dictionary with two entries, say regress and control, filled with lists of the column names, and then pass that as the first argument to dmatrices, but it didn't work either.
Does anyone have any suggestions on a more efficient way to get things into patsy's dmatrices rather than writing out each and every column name we would like to include in the matrix?
Thanks in advance and let me know if I was not clear on anything.
Doing it with a for loop here:
for z in regress:
    for t in control:
        y, x = dmatrices('a~{}+{}'.format(z, t), data=df)
        print('a~{}+{}'.format(z, t))
        print(y, x)
a~b+e
[[1.]
[2.]
[3.]
[4.]] [[1. 5. 8.]
[1. 6. 4.]
[1. 7. 5.]
[1. 8. 3.]]
a~c+e
[[1.]
[2.]
[3.]
[4.]] [[1. 8. 8.]
[1. 4. 4.]
[1. 5. 5.]
[1. 3. 3.]]
a~d+e
[[1.]
[2.]
[3.]
[4.]] [[ 1. 1. 8.]
[ 1. 3. 4.]
[ 1. 55. 5.]
[ 1. 3. 3.]]
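If the goal is instead a single design matrix containing all of the listed columns at once, another option (just a sketch, not the approach used above) is to join the names with '+' before handing the formula string to dmatrices:
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8],
                   'c': [8, 4, 5, 3], 'd': [1, 3, 55, 3],
                   'e': [8, 4, 5, 3]})
regress = ['b', 'c']
control = ['e', 'd']

# build 'a~b+c+e+d' from the two lists, then pass it to patsy as one formula
formula = 'a~' + '+'.join(regress + control)
y, x = dmatrices(formula, data=df)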
Currently I can use pandas to create columns and export lists as columns. Ex:
newRatiosDF = pd.DataFrame(newRatios)
entityidC = pd.DataFrame(X['entity_id'])
final = pd.concat([entityidC['entity_id'], newRatiosDF], axis=1)
final.to_csv('../venv/output/letsseeit.csv')
Output Looks Like:
entity_id Ratios
0 3000 5
1 3001 7
...
500 3099 2
However, this is not how I want my output to look. I would much rather have the indices be the entity_ids, so that the entity_ids label both the rows and the columns, and each cell holds the corresponding ratio.
Ex of what I am looking for:
3000 3001 ... 3099
3000 nan 7 12
3001 4 nan 6
...
3099 2 8 nan
The reason I have NaNs is that I do not want to compute the ratio of something against itself. If anyone has any examples of how to code this using pandas (or a different library) it would be much appreciated. My ratios are already in a 2D array, so all I need to do is figure out how to make the indices the entity_ids and then put the ratios in their correct places.
You can assume that the entity_id's are already in order and they match with their corresponding ratio values. All I'm looking for is the "How To" on manipulating pandas (or a different library) to do this.
If you have questions about my code please let me know and I'll do my best to answer them quickly, your help is much appreciated.
EDIT 1
I build the array doing this:
# rows and columns are the running row/column counters, starting at zero
rows = 0
columns = 0
newRatios = np.zeros((xdfSize, xdfSize))
for ep in XDF['EmailPrefix']:
    for ep2 in XDF['EmailPrefix']:
        if rows <= xdfSize:
            if columns != rows:
                newRatios[columns, rows] = fuzz.token_sort_ratio(ep, ep2)
            else:
                newRatios[columns, rows] = None
        rows += 1
    rows = 0
    columns += 1
Then the print out looks like this:
[[0. 1. 1. ... 1. 1. 1.]
[1. 0. 1. ... 1. 1. 1.]
[1. 1. 0. ... 1. 1. 1.]
...
[1. 1. 1. ... 0. 1. 1.]
[1. 1. 1. ... 1. 0. 1.]
[1. 1. 1. ... 1. 1. 0.]]
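For what it's worth, a minimal sketch of the relabelling described above, assuming the entity_ids are in the same order as the rows/columns of newRatios (X['entity_id'] is taken from the question's own code):
import pandas as pd

# assumed: ordered IDs matching the rows/columns of the ratio matrix
entity_ids = X['entity_id'].tolist()
labelled = pd.DataFrame(newRatios, index=entity_ids, columns=entity_ids)
labelled.to_csv('../venv/output/letsseeit.csv')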
I have a pandas dataframe like this, where each ID is an observation with variables attr1, attr2 and attr3:
ID attr1 attr2 attr3
20 2 1 2
10 1 3 1
5 2 2 4
7 1 2 1
16 1 2 3
28 1 1 3
35 1 1 1
40 1 2 3
46 1 2 3
21 3 1 3
and made a distance matrix that I want to use, where the IDs are compared based on the sum of the pairwise attribute differences:
[[ 0. 4. 3. 3. 3. 2. 2. 3. 3. 2.]
[ 4. 0. 5. 1. 3. 4. 2. 3. 3. 6.]
[ 3. 5. 0. 4. 2. 3. 5. 2. 2. 3.]
[ 3. 1. 4. 0. 2. 3. 1. 2. 2. 5.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 4. 3. 3. 1. 0. 2. 1. 1. 2.]
[ 2. 2. 5. 1. 3. 2. 0. 3. 3. 4.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 6. 3. 5. 3. 2. 4. 3. 3. 0.]]
I tried DBSCAN from sklearn to cluster the data, but it seems that only the clusters themselves get labels? I want to be able to look up the ID of each data point in the visualization later. So I only want to cluster on the differences between the IDs, not on the IDs themselves. Is there another algorithm better suited to this kind of data, or a way to keep track of which ID each row of the distance matrix belongs to, so the matrix can be used with DBSCAN or another method?
P.S. the dataset has over 50 attributes and 10,000 observations.
The labels_ attribute will give you an array of labels, one for each of the data points used in training. The first entry of that array is the label of your first training data point, and so on.
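A minimal sketch of what that looks like with the precomputed distance matrix from the question; eps and min_samples are arbitrary values for illustration:
import numpy as np
from sklearn.cluster import DBSCAN

# IDs in the same order as the rows/columns of the distance matrix
ids = np.array([20, 10, 5, 7, 16, 28, 35, 40, 46, 21])

dist = np.array([
    [0., 4., 3., 3., 3., 2., 2., 3., 3., 2.],
    [4., 0., 5., 1., 3., 4., 2., 3., 3., 6.],
    [3., 5., 0., 4., 2., 3., 5., 2., 2., 3.],
    [3., 1., 4., 0., 2., 3., 1., 2., 2., 5.],
    [3., 3., 2., 2., 0., 1., 3., 0., 0., 3.],
    [2., 4., 3., 3., 1., 0., 2., 1., 1., 2.],
    [2., 2., 5., 1., 3., 2., 0., 3., 3., 4.],
    [3., 3., 2., 2., 0., 1., 3., 0., 0., 3.],
    [3., 3., 2., 2., 0., 1., 3., 0., 0., 3.],
    [2., 6., 3., 5., 3., 2., 4., 3., 3., 0.],
])

# metric='precomputed' tells DBSCAN the input is already a distance matrix
labels = DBSCAN(eps=1.5, min_samples=2, metric='precomputed').fit(dist).labels_

# labels[k] is the cluster of ids[k], so zipping them recovers the ID per point
for i, lab in zip(ids, labels):
    print(i, lab)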