What dimensions should my NumPy array be? ObsPy Traces - Python

I currently have seismic data for 175 events, with 3 traces per event (each trace is a NumPy array of seismic samples). I have classification labels indicating whether each of those 175 events is an earthquake or not. I'm looking to format this data into NumPy arrays for modelling. I tried placing it into a dataframe of NumPy arrays with each column being a different trace, so the columns would be 'Trace one', 'Trace two', 'Trace three'. This did not work, and I have tried many different ways of arranging the data for use with Keras.
I'm now looking to create a numpy matrix for the data to go into and to then use for modelling.
I had thought that the shape might be (175, 3, 7501), i.e. (number of events, number of traces, number of samples per trace). However, when I iterate through and try to add the three traces to the NumPy matrix, it fails. I'm used to using dataframes rather than NumPy arrays as input to Keras.
newrow = np.array([[trace_copy_1], [trace_copy_2], [trace_copy_3]])
data = np.vstack([data, newrow])
The data shape is (175, 3, 7501). The newrow shape is (3, 1, 7501), which does not allow me to stack newrow onto data.
I receive the data as ObsPy streams, and each stream holds 3 Trace objects. Each Trace object stores its data in a NumPy array, so I have to access those arrays and append them to a structure suitable for modelling, since I obviously can't feed a Stream or Trace object to a Keras model.
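For illustration, a minimal sketch of pulling the raw NumPy arrays out of one stream (st here is a hypothetical ObsPy Stream holding the three traces for one event):
trace_copy_1 = st[0].data  # each Trace stores its samples as a NumPy array in .data
trace_copy_2 = st[1].data
trace_copy_3 = st[2].data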

If I understand your data correctly, you can try one of the following methods:
If your data shape is (175, 3, 7501), define newrow as newrow = np.array([trace_copy_1, trace_copy_2, trace_copy_3]), with each trace_copy_x being a NumPy array of shape (7501,).
Use the reshape function (either numpy.reshape(newrow, (3, 7501)) or newrow.reshape((3, 7501))).
If you're more familiar with dataframes, you can still use a pandas DataFrame by reducing the dimensionality of your data (for example, concatenate the different traces end to end on the same row, something you often see when working with images). Here that could be something like pandas.DataFrame(data.reshape((175, 3*7501))).
In addition, I recommend using numpy.concatenate instead of numpy.vstack, as it is more general.
I hope it works.
Cheers
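A minimal sketch of the first two suggestions, assuming trace_copy_1, trace_copy_2 and trace_copy_3 are 1-D NumPy arrays of length 7501 (names taken from the question) and data already has shape (N, 3, 7501):
import numpy as np

# Option 1: build the row without the extra brackets, giving shape (3, 7501)
newrow = np.array([trace_copy_1, trace_copy_2, trace_copy_3])

# Option 2: equivalently, reshape an existing (3, 1, 7501) row down to (3, 7501)
# newrow = newrow.reshape((3, 7501))

# Add an event axis and concatenate along the event dimension:
# (N, 3, 7501) stacked with (1, 3, 7501) gives (N + 1, 3, 7501)
data = np.concatenate([data, newrow[np.newaxis, :, :]], axis=0)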

Thanks for the answers. The way I solved this was to first create a NumPy array of the desired shape: (number of events, number of traces per event, number of samples per trace).
I then created a new row, reshaped it, and appended it. Finally, I split the array to remove the original placeholder data that was there before I started appending my new data.
data = np.zeros(shape=(175, 3, 7501))  # placeholder array of the target shape
newrow = np.array([[trace_copy_1], [trace_copy_2], [trace_copy_3]])
newrow = newrow.reshape((1, 3, 7501))  # one event: (1, number of traces, samples per trace)
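A hedged sketch of the remaining steps described above (appending each reshaped row, then dropping the zero-filled placeholder once every event has been appended):
data = np.vstack([data, newrow])  # shape grows to (176, 3, 7501); repeat for each event
data = data[175:]                 # finally, drop the initial all-zero placeholder block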

Related

How to export a 3D array into a single row in Excel using Python

I am attempting to export a large array of 3D points into Excel.
import numpy as np
import pandas as pd
d = np.asarray(data)
df = pd.DataFrame(d)
df.to_csv("C:/Users/Fred/Desktop/test.csv")
This exports the data into rows as below:
3.361490011 -27.39559937 -2.934410095
4.573401244 -26.45699201 -3.845634521
.....
Each line represents the x, y, z coordinates of one point. However, for my analysis I would like the 2nd row to be moved into columns beside the 1st row, and so on, so that all the coordinates for one shape are on one row of the Excel sheet. I tried turning the data into a string, but that returned the same layout as above.
The reason is so I can add some population characteristics to the row for each 3D shape. Thanks for any help that anyone can give.
You can use x = df.to_numpy().flatten() to flatten your data, and then save it to CSV using np.savetxt.
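For example, a small sketch assuming data is a 3-D array of shape (n_shapes, n_points, 3); reshaping to two dimensions (rather than flattening all the way to 1-D) puts each shape on its own CSV row:
import numpy as np

d = np.asarray(data)
flat = d.reshape(d.shape[0], -1)  # one row per shape: x1, y1, z1, x2, y2, z2, ...
np.savetxt("C:/Users/Fred/Desktop/test.csv", flat, delimiter=",")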

Apply the Mann-Whitney U test on a multidimensional array and replace single values of an xarray DataArray in Python?

I'm new to Python and need some help with xarray.
I have two 3-dimensional data arrays (rlon, rlat, time) for future and past climate. I want to compute the Mann-Whitney U test for each grid point to analyse the significance of temperature change in the future compared to the past. I have already got the Mann-Whitney U test working when selecting a time series from one grid point each of the historical and future data. Example:
import numpy as np
import xarray as xr
import scipy.stats as sts
#selecting time period and grid point of past and future data
tp = fileHis['tas']
tf = fileFut['tas']
gridpoint_past=tp.sel(rlon=-6.375, rlat=1.375, time=slice('1999-01-01', '1999-01-31'))
gridpoint_future=tf.sel(rlon=-6.375, rlat=1.375, time=slice('2099-01-01', '2099-01-31'))
# Mann-Whitney U test
result=sts.mannwhitneyu(gridpoint_past, gridpoint_future, alternative='two-sided')
print('pvalue =',result[1])
Output:
pvalue = 0.05922372345359562
My problem now is that I need to do this for each grid point and each month, and in the end I would like to have a data array of p-values with an entry for each grid point and each month of the year.
I was thinking about looping through all rlat, rlon and months and running the Mann-Whitney U test for each, unless there is a better way to do this?
And how can I write the p-values one by one into a new data array with the same rlat, rlon dimensions?
I was trying this, but it does not work:
I created a data array pvalue_mon, which has the same rlat and rlon as tp and tf, and has the 12 months as time steps.
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=th.time.dt.month.isin([1])) = result[1]
SyntaxError: can't assign to function call
or this:
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=pvalue_mon.time.dt.month.isin([1])).update(result[1])
TypeError: 'numpy.float64' object is not iterable
How can I replace a single value of an existing variable?
Instead of using the .sel() function, try using .loc[] as described here:
http://xarray.pydata.org/en/stable/indexing.html#assigning-values-with-indexing
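For example, a minimal sketch of dict-based assignment with .loc, assuming pvalue_mon has dimensions (time, rlat, rlon) with the month numbers 1-12 as its time coordinate, and result is the mannwhitneyu output from the question:
pvalue_mon.loc[dict(rlon=-6.375, rlat=1.375, time=1)] = result[1]
Wrapping this in loops over the rlon and rlat coordinates and the twelve months would fill the whole array one p-value at a time.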

Converting an array structure to a dataframe to get the column names

I have a dataframe which I have converted to an array in order to model the data using a regression algorithm. I used the following code to do it:
X=df.iloc[:, 0:345].values
Y=df.iloc[:,345].values
Hence X and Y are now arrays. There are many columns because the categorical variables have been converted into dummy variables. Next, I create the train and test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
Now, after building the model and making predictions, I want to get back the values of my categorical variables (X and Y were created after converting all the categorical variables to dummy variables). For this, I am trying to convert X_test back into a dataframe with the column names from the original dataframe df. I tried the following code:
dff=df.iloc[:, 0:345]
The above statement gets the first 345 columns of the dataframe.
Then,
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
I get the following error
ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)
I don't understand why the number of rows matters. I have fewer rows because my train and test sets were split 75%-25%, and I perform the split after the data is converted to an array. How do I now convert the array data into a dataframe with the column names from the dff dataframe?
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
with X_test being a numpy.ndarray.
I modified the above statement to just this:
df_new = pd.DataFrame(X_test)
df_new.columns = list(dff.columns)
The new dataframe contains the X_test data, and the column names are assigned from the dff dataframe to the newly created dataframe.
I would recommend using the DataFrame for train_test_split, and then passing in arrays to your algorithm using numpy:
my_algorithm(np.asarray(X_train), np.asarray(y_train))
This way you can look at your data the same way you would for any DataFrame, but still run the model with the array. I'm not sure which library you are using, but I'm pretty sure some of them can now take DataFrames directly for modeling.
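For instance, a hedged sketch that keeps the DataFrame through the split so the column names survive (my_algorithm is a placeholder for whatever estimator is being used):
import numpy as np
from sklearn.model_selection import train_test_split

X_df = df.iloc[:, 0:345]  # features, still a DataFrame
y = df.iloc[:, 345]       # target

X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.25, random_state=0)

# X_test keeps the original column names; convert to arrays only when the estimator needs it
my_algorithm(np.asarray(X_train), np.asarray(y_train))  # my_algorithm is hypothetical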

How do I subset a 2D grid from another 2D grid in python?

I have gridded data over the contiguous United States and I'm trying to select a chunk of it over a specific area.
import numpy as np
from netCDF4 import Dataset
import matplotlib.pyplot as plt
filename = '/Users/me/myfile.nc'
full_data = Dataset(filename,'r')
latitudes = full_data.variables['latitude'][0,:,:]
longitudes = full_data.variables['longitude'][0,:,:]
temperature = full_data.variables['temperature'][0,:,:]
All three variables are 2-dimensional arrays of shape (337, 451). I'm trying the following to get a sub-selection of the data over a specific region.
index = (latitudes>=44.0)&(latitudes<=45.0)&(longitudes>=-91.0)&(longitudes<=-89.0)
temp_subset = temperature[index]
lat_subset = latitudes[index]
lon_subset = longitudes[index]
I would expect all three of these variables to remain 2-dimensional, but instead they all come back as flattened arrays of shape (102,). I've tried another approach:
index2 = np.where((latitudes>=44.0)&(latitudes<=45.0)&(longitudes>=-91.0)&(longitudes<=-89.0))
temp = temperature[index2[0],:]
temp2 = temp[:,index2[1]]
plt.imshow(temp2,origin='lower')
plt.colorbar()
But my data looks quite incorrect. Is there a better way to get a 2D subset grid from a larger grid?
Edub,
I suggest looking at NumPy's indexing documentation, specifically http://docs.scipy.org/doc/numpy-1.10.1/user/basics.indexing.html#other-indexing-options . Currently, you are providing indices for the two dimensions but no slicing information, so you only receive one-dimensional results. I hope this proves useful!
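As a concrete sketch, one way to recover a rectangular 2-D block is to slice between the bounding rows and columns of the boolean mask (this assumes the selected region forms a contiguous block of the grid, which is only approximately true for curvilinear latitude/longitude fields):
rows, cols = np.where(index)
r0, r1 = rows.min(), rows.max() + 1
c0, c1 = cols.min(), cols.max() + 1

temp_subset_2d = temperature[r0:r1, c0:c1]
lat_subset_2d = latitudes[r0:r1, c0:c1]
lon_subset_2d = longitudes[r0:r1, c0:c1]

plt.imshow(temp_subset_2d, origin='lower')
plt.colorbar()
plt.show()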

How to add data to an h5py dataset? [duplicate]

Does anyone have an idea how to update HDF5 datasets from h5py?
Assuming we create a dataset like:
import h5py
import numpy
f = h5py.File('myfile.hdf5', 'a')
dset = f.create_dataset('mydataset', data=numpy.ones((2,2),"=i4"))
new_dset_value=numpy.zeros((3,3),"=i4")
Is it possible to extend the dset to a 3x3 numpy array?
You need to create the dataset with the "extendable" property; it's not possible to change this after the dataset has been created. To do this, you need to use the "maxshape" keyword. A value of None in the maxshape tuple means that dimension can be of unlimited size. So, if f is an HDF5 file:
dset = f.create_dataset('mydataset', (2,2), maxshape=(None,3))
creates a dataset of size (2,2), which may be extended indefinitely along the first dimension and up to 3 along the second. Now, you can extend the dataset with resize:
dset.resize((3,3))
dset[:,:] = numpy.zeros((3,3),"=i4")
The first dimension can be increased as much as you like:
dset.resize((10,3))
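Putting the pieces together, a minimal end-to-end sketch (assuming 'myfile.hdf5' is writable):
import h5py
import numpy

with h5py.File('myfile.hdf5', 'w') as f:
    # create a 2x2 dataset that can grow without bound along axis 0 and up to 3 along axis 1
    dset = f.create_dataset('mydataset', data=numpy.ones((2, 2), "=i4"), maxshape=(None, 3))
    dset.resize((3, 3))                      # grow to 3x3
    dset[:, :] = numpy.zeros((3, 3), "=i4")  # overwrite with new values
    dset.resize((10, 3))                     # extend further along the first dimension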
