Convert a multi-dimensional xarray DataArray into a DataFrame (Python)

I have a big array with 4 dimensions, as follows:
>>> raw_data
<xarray.DataArray 'TRAC04' (time: 3, Z: 34, YC: 588, XC: 2160)>
[129548160 values with dtype=float32]
Coordinates: (12/15)
    iter       (time) int64 ...
  * time       (time) datetime64[ns] 2017-01-30T12:40:00 ... 2017-04-01T09:20:00
  * XC         (XC) float32 0.08333 0.25 0.4167 0.5833 ... 359.6 359.8 359.9
  * YC         (YC) float32 -77.98 -77.95 -77.91 -77.88 ... -30.02 -29.87 -29.72
  * Z          (Z) float32 -2.1 -6.7 -12.15 -18.55 ... -614.0 -700.0 -800.0
    rA         (YC, XC) float32 ...
    ...         ...
    maskC      (Z, YC, XC) bool ...
    maskCtrlC  (Z, YC, XC) bool ...
    rhoRef     (Z) float32 ...
    rLowC      (YC, XC) float32 ...
    maskInC    (YC, XC) bool ...
    rSurfC     (YC, XC) float32 ...
Attributes:
    standard_name:  TRAC04
    long_name:      Variable concentration
    units:          mol N/m^3
I want to transform it into a DataFrame with 5 columns: 'XC', 'YC', 'Z', 'time', 'TRAC04'.
I tried to follow this question like this:
import itertools
data = list(itertools.chain(*raw_data))
df = pd.DataFrame.from_records(data)
It runs, but I do not see anything created in the environment. Furthermore, if I try to look at df with df.head(), it runs forever without returning any output.
In any case I tried to save df, following this question, but that also runs without ending:
np.savetxt(r'c:\data\DF_TRAC04.txt', df.values, fmt='%d')
df.to_csv(r'c:\data\DF_TRAC04.csv', header=None, index=None, sep=' ', mode='a')

I hope my answer can still help.
Let's first create some mock data with space variables x, y, z and a time variable t.
import numpy as np
import xarray as xr
val = np.arange(54).reshape(2,3,3,3)
xc = np.array([10, 20, 30])
yc = np.array([50, 60, 70])
zc = np.array([1000, 2000, 3000])
t = np.array([0, 1])
da = xr.DataArray(
    val,
    coords={'time': t,
            'z': zc,
            'y': yc,
            'x': xc},
    dims=["time", "z", "y", "x"]
)
You will get the following DataArray:
<xarray.DataArray (time: 2, z: 3, y: 3, x: 3)>
array([[[[ 0,  1,  2],
         [ 3,  4,  5],
         [ 6,  7,  8]],

        [[ 9, 10, 11],
         [12, 13, 14],
         [15, 16, 17]],

        [[18, 19, 20],
         [21, 22, 23],
         [24, 25, 26]]],


       [[[27, 28, 29],
         [30, 31, 32],
         [33, 34, 35]],

        [[36, 37, 38],
         [39, 40, 41],
         [42, 43, 44]],

        [[45, 46, 47],
         [48, 49, 50],
         [51, 52, 53]]]])
Coordinates:
  * time     (time) int64 0 1
  * z        (z) int64 1000 2000 3000
  * y        (y) int64 50 60 70
  * x        (x) int64 10 20 30
If you want to have a flat file representation of the DataArray, you can use
da.to_dataframe(name='value').reset_index()
and this is the result:
    time     z   y   x  value
0      0  1000  50  10      0
1      0  1000  50  20      1
2      0  1000  50  30      2
3      0  1000  60  10      3
4      0  1000  60  20      4
..   ...   ...  ..  ..    ...
49     1  3000  60  20     49
50     1  3000  60  30     50
51     1  3000  70  10     51
52     1  3000  70  20     52
53     1  3000  70  30     53
For saving the DataFrame to an ASCII file without the index, use:
da.to_dataframe(name='value').reset_index().to_csv('dump.csv', index=False)
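Back to the original question: the same pattern should work directly on raw_data, since xarray's to_dataframe moves all dimension coordinates into the index. A sketch, assuming the ~130 million values fit in memory (otherwise subset first, e.g. with .isel):
# Sketch for the original TRAC04 DataArray; the non-index coordinates
# (iter, rA, maskC, ...) also become columns, so keep only the five wanted ones.
df = raw_data.to_dataframe().reset_index()[['XC', 'YC', 'Z', 'time', 'TRAC04']]
df.to_csv(r'c:\data\DF_TRAC04.csv', index=False)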

Related

Alternate method to avoid loop in pandas dataframe

I have the following dataframe:
import pandas as pd

table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
And I wrote the following code to generate a new DataFrame with a modified output for each 'sim'
for i in range(1, 3):
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    table2.ix[table2['lv'] < 1, 'lv'] = 1
    table2.ix[table2['lv'] > 5, 'lv'] = 5
    table2.ix[table2['hv'] > 6, 'hv'] = 6
    table2.ix[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
df = table2.filter(regex="sim|Type")
Output:
>>> df
Product Type sim_1 sim_2
0 A 35.0 60.0
1 B -39.0 36.0
2 C 56.0 92.0
3 D 23.0 33.0
I want to run this on 10,000 sims, and currently each loop takes about .25 seconds. Is there any way to modify this code to avoid the loop and be more time efficient?
Edit: If you're curious what this code is trying to accomplish you can see my self-answered somewhat disorganized question here: Pandas DataFrame: Complex linear interpolation
I was able to accomplish this with no loops using the following code.
As a result, my 10k x 200 table ran in 3 minutes instead of the previous 2 hours.
Unfortunately I now need to run it on a 10k x 4k table, and I hit a MemoryError on that one, but that may be out of the scope of this question.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
buckets = df.ix[:,-2:].sub(df['Lower_Bound'],axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'],axis=0),axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:,None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:,None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
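Note that .ix was removed in later pandas versions and applymap is comparatively slow. A hedged sketch of the same vectorized logic with current APIs (assuming, as above, that the sim columns are the last two and the State columns appear in order) could look like:
import numpy as np
import pandas as pd

# df defined as above
sims = df.iloc[:, -2:]                                   # sim_1, sim_2
buckets = (sims.sub(df['Lower_Bound'], axis=0)
               .div(df['Upper_Bound'] - df['Lower_Bound'], axis=0) * 5 + 1)
low = buckets.astype(int).clip(lower=1, upper=5)         # truncate, clamp to 1..5
high = (buckets.astype(int) + 1).clip(lower=2, upper=6)  # clamp to 2..6

states = df.filter(regex="State").values                 # State_1_Value..State_6_Value
rows = np.arange(len(df))[:, None]
low_value = states[rows, low.values - 1]                 # -1: State_1 is column 0 here
high_value = states[rows, high.values - 1]

result = pd.DataFrame((high_value - low_value) * (buckets - low).values + low_value,
                      columns=sims.columns)
result['Product Type'] = df['Product Type']
For the 10k x 4k case that hit MemoryError, the same code can be run over the sim columns in chunks (say a few hundred columns at a time), concatenating the partial results.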

Python pandas: flatten with arrays in column

I have a pandas DataFrame with one column containing arrays. I'd like to "flatten" it by repeating the values of the other columns for each element of the arrays.
I managed to do it by building a temporary list of values while iterating over every row, but it uses "pure Python" and is slow.
Is there a way to do this in pandas/numpy? In other words, I'm trying to improve the flatten function in the example below.
Thanks a lot.
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})
def flatten(df):
    tmp = []
    def backend(r):
        x = r['x']
        y = r['y']
        zz = r['z']
        for z in zz:
            tmp.append({'x': x, 'y': y, 'z': z})
    df.apply(backend, axis=1)
    return pd.DataFrame(tmp)
print(flatten(toConvert).to_string(index=False))
Which gives:
x y z
1 10 101
1 10 102
1 10 103
2 20 201
2 20 202
Here's a NumPy based solution (wrapping map in list(...) so it also works on Python 3) -
np.column_stack((toConvert[['x','y']].values.repeat(
    list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
Sample run -
In [78]: toConvert
Out[78]:
x y z
0 1 10 (101, 102, 103)
1 2 20 (201, 202)
In [79]: np.column_stack((toConvert[['x','y']].values.repeat(
    ...:     list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
Out[79]:
array([[ 1, 10, 101],
[ 1, 10, 102],
[ 1, 10, 103],
[ 2, 20, 201],
[ 2, 20, 202]])
You need numpy.repeat with str.len to create columns x and y; for z, flatten the tuples with itertools.chain:
import pandas as pd
import numpy as np
from itertools import chain

df = pd.DataFrame({
    "x": np.repeat(toConvert.x.values, toConvert.z.str.len()),
    "y": np.repeat(toConvert.y.values, toConvert.z.str.len()),
    "z": list(chain.from_iterable(toConvert.z))})
print(df)
x y z
0 1 10 101
1 1 10 102
2 1 10 103
3 2 20 201
4 2 20 202
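For what it's worth, on pandas 0.25 or newer this flattening is built in as DataFrame.explode, which is probably the most readable option today (a sketch; ignore_index needs pandas 1.1+, on older versions chain .reset_index(drop=True) instead):
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

# explode repeats the other columns once per element of 'z'
print(toConvert.explode('z', ignore_index=True))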

How to get a pandas DataFrame whose columns are the subsequent n elements from another DataFrame's column?

A very simple example just for understanding.
I have the following pandas dataframe:
import pandas as pd
df = pd.DataFrame({'A':pd.Series([1, 2, 13, 14, 25, 26, 37, 38])})
df
A
0 1
1 2
2 13
3 14
4 25
5 26
6 37
7 38
Set n = 3
First example
How to get a new dataframe df1 (in an efficient way), like the following:
D1 D2 D3 T
0 1 2 13 14
1 2 13 14 25
2 13 14 25 26
3 14 25 26 37
4 25 26 37 38
Hint: think of the first n columns as the data (Dx) and the last column as the target (T). In the first example the target (e.g. 25) depends on the preceding n elements (2, 13, 14).
Second example
What if the target is some element ahead (e.g.+3)?
D1 D2 D3 T
0 1 2 13 26
1 2 13 14 37
2 13 14 25 38
Thank you for your help,
Gilberto
P.S. If you think that the title can be improved, please suggest how to modify it.
Update
Thanks to @Divakar and this post, the rolling function can be defined as:
import numpy as np
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = np.arange(1000000000)
b = rolling(a, 4)
In less than 1 second!
Let's see how we can solve it with NumPy tools. So, let's imagine you have the column data as a NumPy array, let's call it a. For such sliding windowed operations, NumPy has a very efficient tool: strides, which give views into the input array without actually making copies.
Let's directly use the methods with the sample data and start with case #1 -
In [29]: a # Input data
Out[29]: array([ 1, 2, 13, 14, 25, 26, 37, 38])
In [30]: m = a.strides[0] # Get strides
In [31]: n = 3 # parameter
In [32]: nrows = a.size - n # Get number of rows in o/p
In [33]: a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,n+1),strides=(m,m))
In [34]: a2D
Out[34]:
array([[ 1, 2, 13, 14],
[ 2, 13, 14, 25],
[13, 14, 25, 26],
[14, 25, 26, 37],
[25, 26, 37, 38]])
In [35]: np.may_share_memory(a,a2D)
Out[35]: True # a2D is a view into a
Case #2 would be similar with an additional parameter for the Target column -
In [36]: n2 = 3 # Additional param
In [37]: nrows = a.size - n - n2 + 1
In [38]: part1 = np.lib.stride_tricks.as_strided(a,shape=(nrows,n),strides=(m,m))
In [39]: part1 # These are D1, D2, D3, etc.
Out[39]:
array([[ 1, 2, 13],
[ 2, 13, 14],
[13, 14, 25]])
In [43]: part2 = a[n+n2-1:] # This is target col
In [44]: part2
Out[44]: array([26, 37, 38])
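To assemble case #2 into the desired DataFrame, one way (a sketch reusing the part1/part2 names from above) is to stack the window columns next to the target column:
import numpy as np
import pandas as pd

# part1 holds the D1..D3 windows, part2 the shifted targets
out = pd.DataFrame(np.column_stack((part1, part2)),
                   columns=['D1', 'D2', 'D3', 'T'])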
I found another method: view_as_windows
import numpy as np
from skimage.util.shape import view_as_windows
window_shape = (4, )
aa = np.arange(1000000000) # 1 billion!
bb = view_as_windows(aa, window_shape)
bb
array([[ 0, 1, 2, 3],
[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
...,
[999999994, 999999995, 999999996, 999999997],
[999999995, 999999996, 999999997, 999999998],
[999999996, 999999997, 999999998, 999999999]])
Around 1 second.
What do you think?
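A follow-up note: since NumPy 1.20 the same safe windowing is available in NumPy itself as sliding_window_view, so no extra dependency is needed:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

aa = np.arange(1000000000)
bb = sliding_window_view(aa, 4)   # a view, no copy; same output as above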

How to use map to check if each list element is inside any set of ranges? (TypeError: len() of unsized object)

I have a list of elements and I want to use mapping functions to generate an element-wise list of whether they are within any ranges in a list of ranges. I already have a solution that uses a for-loop, but for-loops are too slow because both my element and range lists will be much larger.
Here is my code so far:
import pandas as pd
# check element-wise if [1,0,45,60] within ranges 1-10, 21-30, or 41-50
# expected output: true, false, true, false
s = pd.Series([1,0,45,60])
f = lambda x: any((x >= pd.Series([1,20,40])) & (x <= pd.Series([10,30,50])))
print map(f, s)
Error:
elif isinstance(other, (np.ndarray, pd.Index)):
--> if len(self) != len(other):
raise ValueError('Lengths must match to compare')
return self._constructor(na_op(self.values, np.asarray(other)),
TypeError: len() of unsized object
Figured it out. Seems like everything works and is still fast if I convert to numpy. Normally I'd frown on introducing a new library but pandas is built on top of numpy.
import pandas as pd, numpy as np
s = pd.Series([1,0,45,60])
mins = np.array(pd.Series([1,20,40]))
maxes = np.array(pd.Series([10,30,50]))
f = lambda x: np.any((x >= mins) & (x <= maxes))
print(list(map(f, s)))
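A fully vectorized variant (dropping map entirely) is to broadcast the comparison in NumPy, which should scale better as both the elements and the ranges grow:
import numpy as np
import pandas as pd

s = pd.Series([1, 0, 45, 60])
mins = np.array([1, 20, 40])
maxes = np.array([10, 30, 50])

# Compare every element against every range at once: shape (4, 3).
inside = (s.values[:, None] >= mins) & (s.values[:, None] <= maxes)
print(inside.any(axis=1).tolist())   # [True, False, True, False]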
I think you can first create all the ranges, then check membership with isin and convert to a list with tolist:
import pandas as pd

s = pd.Series([1, 0, 45, 60])
print(s)
0     1
1     0
2    45
3    60
dtype: int64

rng = list(range(1, 11)) + list(range(21, 31)) + list(range(41, 51))
print(rng)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25,
 26, 27, 28, 29, 30, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]

print(s.isin(rng))
0     True
1    False
2     True
3    False
dtype: bool

print(s.isin(rng).tolist())
[True, False, True, False]
EDIT:
For creating the ranges you can use numpy.arange and numpy.concatenate:
import numpy as np

rng = np.concatenate((np.arange(1, 11), np.arange(21, 31), np.arange(41, 51)))
print(rng)
[ 1  2  3  4  5  6  7  8  9 10 21 22 23 24 25
 26 27 28 29 30 41 42 43 44 45 46 47 48 49 50]
Another solution for generating the ranges is slicing:
s = list(range(0, 51))
print(s)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
print(s[1:11] + s[21:31] + s[41:51])
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
You can use the cut() function to categorize your values:
In [296]: s[pd.cut(s, bins=range(0, 110, 10), labels=labels).isin(['1 - 10','21 - 30','41 - 50'])]
Out[296]:
0 1
3 23
5 45
dtype: int64
Explanation:
original series:
In [291]: s
Out[291]:
0 1
1 0
2 19
3 23
4 35
5 45
6 60
dtype: int64
labels for categories:
In [292]: labels = [ "{0} - {1}".format(i, i + 9) for i in range(1, 100, 10) ]
In [293]: labels
Out[293]:
['1 - 10',
'11 - 20',
'21 - 30',
'31 - 40',
'41 - 50',
'51 - 60',
'61 - 70',
'71 - 80',
'81 - 90',
'91 - 100']
using cut() for categorizing your series:
In [294]: pd.cut(s, bins=range(0, 110, 10), labels=labels)
Out[294]:
0 1 - 10
1 NaN
2 11 - 20
3 21 - 30
4 31 - 40
5 41 - 50
6 51 - 60
dtype: category
Categories (10, object): [1 - 10 < 11 - 20 < 21 - 30 < 31 - 40 ... 61 - 70 < 71 - 80 < 81 - 90 <
91 - 100]
select only the interesting categories:
In [295]: pd.cut(s, bins=range(0, 110, 10), labels=labels).isin(['1 - 10','21 - 30','41 - 50'])
Out[295]:
0 True
1 False
2 False
3 True
4 False
5 True
6 False
dtype: bool
and finally:
In [296]: s[pd.cut(s, bins=range(0, 110, 10), labels=labels).isin(['1 - 10','21 - 30','41 - 50'])]
Out[296]:
0 1
3 23
5 45
dtype: int64

Iterate over a matrix, sum over some rows and add the result to another array

Hi there, I have the following matrix:
[[ 47 43 51 81 54 81 52 54 31 46]
[ 35 21 30 16 37 11 35 30 39 37]
[ 8 17 11 2 5 4 11 9 17 10]
[ 5 9 4 0 1 1 0 3 9 3]
[ 2 7 2 0 0 0 0 1 2 1]
[215 149 299 199 159 325 179 249 249 199]
[ 27 49 24 4 21 8 35 15 45 25]
[100 100 100 100 100 100 100 100 100 100]]
I need to iterate over the matrix, summing all elements in rows 0, 1, 2, 3 and 4 only.
example: I need
row_0_sum = 47+43+51+81....46
Furthermore, I need to store each row's sum in an array like this:
[row0_sum, row1_sum, row2_sum, row3_sum, row4_sum]
So far I have tried this code, but it's not doing the job:
mu = np.zeros(shape=(1, 6))

# get an average
def standardize_ratings(matrix):
    sum = 0
    for i, eli in enumerate(matrix):
        for j, elj in enumerate(eli):
            if (i < 5):
                sum = sum + matrix[i][j]
            if (j == elj.len - 1):
                mu[i] = sum
                sum = 0
                print "mu[i]="
                print mu[i]
This just gives me an error: 'numpy.int32' object has no attribute 'len'.
So can someone help me? What's the best way to do this, and which type of array in Python should I use to store the result? I'm new to Python but have done programming before.
Thanks!
Make your data, matrix, a numpy.ndarray object, instead of a list of lists, and then just do matrix.sum(axis=1).
>>> import numpy as np
>>> matrix = np.asarray([[ 47, 43, 51, 81, 54, 81, 52, 54, 31, 46],
[ 35, 21, 30, 16, 37, 11, 35, 30, 39, 37],
[ 8, 17, 11, 2, 5, 4, 11, 9, 17, 10],
[ 5, 9, 4, 0, 1, 1, 0, 3, 9, 3],
[ 2, 7, 2, 0, 0, 0, 0, 1, 2, 1],
[215, 149, 299, 199, 159, 325, 179, 249, 249, 199],
[ 27, 49, 24, 4, 21, 8, 35, 15, 45, 25],
[100, 100, 100, 100, 100, 100, 100, 100, 100, 100]])
>>> print(matrix.sum(axis=1))
[ 540 291 94 35 15 2222 253 1000]
To get the first five rows from the result, you can just do:
>>> row_sums = matrix.sum(axis=1)
>>> rows_0_through_4_sums = row_sums[:5]
>>> print(rows_0_through_4_sums)
[540 291 94 35 15]
Or, you can alternatively sub-select only those rows to begin with and only apply the summation to them:
>>> rows_0_through_4 = matrix[:5,:]
>>> print(rows_0_through_4.sum(axis=1))
[540 291 94 35 15]
Some helpful links:
NumPy for Matlab Users, if you are familiar with these things in Matlab/Octave
Slicing/Indexing in NumPy
