Pandas Dataframe or Panel to 3d numpy array - python

Setup:
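import numpy as np
import pandas as pd
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
# use .loc instead of chained indexing so the assignment reliably modifies pdf
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a','b'])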
output:
                          c         d         e
a        b
0.439502 0.115087  0.832546  0.760513  0.776555
         0.609107  0.247642  0.031650  0.727773
0.995370 0.299640  0.053523  0.565753  0.857235
         0.392132  0.832560  0.774653  0.213692
Each data series is grouped by the index ID a, and b is a time index for the other features of a. Is there a way to get pandas to produce a 3d numpy array that reflects the a groupings? Currently the data is read as two-dimensional, so pdf.shape outputs (4, 5). What I would like is an array of variable form, like this:
array([[[-1.38655912, -0.90145951, -0.95106951,  0.76570984],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576]],
       [[ 0.0768149 , -0.7566995 , -2.57770951,  0.70834656],
        [-0.99097395, -0.81592084, -1.21075386,  0.12361382]]])
Is there a native Pandas way to do this? Note that number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?

I just had an extremely similar problem and solved it like this:
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
output:
array([[[ 0.47780308, 0.93422319, 0.00526572, 0.41645868, 0.82089215],
[ 0.47780308, 0.15372096, 0.20948369, 0.76354447, 0.27743855]],
[[ 0.75146799, 0.39133973, 0.25182206, 0.78088926, 0.30276705],
[ 0.75146799, 0.42182369, 0.01166461, 0.00936464, 0.53208731]]])
To verify that it is 3d: a3d.shape gives (2, 2, 5).
Lastly, to make the newly created dimension the last dimension (instead of the first), use:
a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
which has a shape of (2, 5, 2).
For cases where the data is ragged (as brought up by CharlesG in the comments) you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy to deal with missing data varies from case to case. In this example we simply add zeros for the missing rows.
Example setup with ragged shape:
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
# use .loc instead of chained indexing so the assignment reliably modifies pdf
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a','b'])
dataframe:
                          c         d         e
a        b
0.460013 0.577535  0.299304  0.617103  0.378887
         0.167907  0.244972  0.615077  0.311497
0.318823 0.640575  0.768187  0.652760  0.822311
         0.424744  0.958405  0.659617  0.998765
         0.077048  0.407182  0.758903  0.273737
One possible solution:
n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
                      .apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)), 'constant'))))
a3d.shape gives (2, 3, 5)
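On pandas versions where as_matrix is no longer available, the same grouping and zero-padding can be sketched with to_numpy and np.stack. This is a minimal sketch rather than the code from the answer above: the fixed group keys 0.25 and 0.75 and the names blocks and padded are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative ragged data: group 0.25 has 2 rows, group 0.75 has 3 rows.
rng = np.random.default_rng(0)
pdf = pd.DataFrame(rng.random((5, 5)), columns=list("abcde"))
pdf.loc[:1, "a"] = 0.25   # .loc slices are label-based and inclusive
pdf.loc[2:, "a"] = 0.75

# One 2-D block per group, in sorted key order.
blocks = [group.to_numpy() for _, group in pdf.groupby("a")]

# Pad every block with rows of zeros up to the longest group, then stack.
n_max = max(len(block) for block in blocks)
padded = [np.pad(block, ((0, n_max - len(block)), (0, 0)), mode="constant")
          for block in blocks]
a3d = np.stack(padded)

print(a3d.shape)  # (2, 3, 5)
```

As noted above, zero-padding is only one way to handle the missing rows; padding with np.nan instead would only need constant_values=np.nan in the np.pad call.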

as_matrix is deprecated. The method below assumes the first key is a, and it also handles groups in a that have different lengths.
import pandas as pd
import numpy as np
from typing import List
def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a Dataframe
    Args:
        df: Dataframe
        idx_cols: columns defining the dimensions of the cube
    Returns:
        multi-dimensional array
    """
    assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'
    df = df.set_index(keys=idx_cols)  # don't overwrite a parameter, thus copy!
    # one slot per distinct value in each index level (values must be 0-based integer codes)
    idx_dims = [len(level) for level in df.index.levels]
    idx_dims.append(len(df.columns))
    cube = np.empty(idx_dims)
    cube.fill(np.nan)  # positions without data stay NaN
    cube[tuple(np.array(df.index.to_list()).T)] = df.values
    return cube
Test:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
# a and b must be integer codes so they can be used as array indices
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
       )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 2, 3).
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
       )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 3, 3).

panel.values
will return a numpy array directly. By necessity it has the most general dtype that can hold all the data, because everything is packed into a single 3-d numpy array. It will be a new array and not a view of the pandas data (no matter the dtype).

Instead of the deprecated .as_matrix() or, alternatively, the .values attribute, the pandas documentation recommends using .to_numpy():
'Warning: We recommend using DataFrame.to_numpy() instead.'
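As a small illustration of that recommendation (the frame below is made up), to_numpy returns a plain ndarray whose dtype is the common type of all columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer and one float column.
df = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})

# The integer column is upcast so that both columns fit one dtype.
arr = df.to_numpy()
print(type(arr).__name__, arr.shape, arr.dtype)  # ndarray (2, 2) float64
```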

Related

How can the pandas concat function duplicate the behavior of the append function?

I've just inherited some code that uses pandas' append method. This code causes Pandas to issue the following warning:
The frame.append method is deprecated and will be removed from pandas
in a future version. Use pandas.concat instead.
So, I want to use pandas.concat, without changing the behavior the append method gave. However, I can't.
Below I've recreated code that illustrates my problem. It creates an empty DataFrame with 31 columns and shape (0,31). When a new, empty row is appended to this DataFrame, the result has shape (1,31). In the code below, I've tried several ways to use concat and get the same behavior as append.
import pandas as pd
# Create Empty Dataframe With Column Headings
obs = pd.DataFrame(columns=['basedatetime_before', 'lat_before', 'lon_before',
'sog_before',
'cog_before',
'heading_before',
'vesselname_before', 'imo_before',
'callsign_before',
'vesseltype_before', 'status_before',
'length_before', 'width_before',
'draft_before',
'cargo_before',
'basedatetime_after', 'lat_after',
'lon_after',
'sog_after',
'cog_after', 'heading_after',
'vesselname_after', 'imo_after',
'callsign_after',
'vesseltype_after', 'status_after',
'length_after', 'width_after',
'draft_after',
'cargo_after'])
# Put initial values in DataFrame
desired = pd.Timestamp('2016-03-20 00:05:00+0000', tz='UTC')
obs['point'] = desired
obs['basedatetime_before'] = pd.to_datetime(obs['basedatetime_before'])
obs['basedatetime_after'] = pd.to_datetime(obs['basedatetime_after'])
obs.rename(lambda s: s.lower(), axis = 1, inplace = True)
# Create new 'dummy' row
new_obs = pd.Series([desired], index=['point'])
# Get initial Shape Information
print("Orig obs.shape", obs.shape)
print("New_obs.shape", new_obs.shape)
print("--------------------------------------")
# Append new dummy row to Data Frame
obs1 = obs.append(new_obs, ignore_index=True)
# Attempt to duplicate effect of append with concat
obs2 = pd.concat([obs, new_obs])
obs3 = pd.concat([obs, new_obs], ignore_index=True)
obs4 = pd.concat([obs, new_obs.T])
obs5 = pd.concat([obs, new_obs.T], ignore_index=True)
obs6 = pd.concat([new_obs, obs])
obs7 = pd.concat([new_obs, obs], ignore_index=True)
obs8 = pd.concat([new_obs.T, obs])
obs9 = pd.concat([new_obs.T, obs], ignore_index=True)
# Verify original DataFrame hasn't changed and append still works
obs10 = obs.append(new_obs, ignore_index=True)
# Print results
print("----> obs1.shape",obs1.shape)
print("obs2.shape",obs2.shape)
print("obs3.shape",obs3.shape)
print("obs4.shape",obs4.shape)
print("obs5.shape",obs5.shape)
print("obs6.shape",obs6.shape)
print("obs7.shape",obs7.shape)
print("obs8.shape",obs8.shape)
print("obs9.shape",obs9.shape)
print("----> obs10.shape",obs10.shape)
However, every way I've tried to use concat to add a new row to the DataFrame results in a new DataFrame with shape (1,32). This can be seen in the results shown below:
Orig obs.shape (0, 31)
New_obs.shape (1,)
--------------------------------------
----> obs1.shape (1, 31)
obs2.shape (1, 32)
obs3.shape (1, 32)
obs4.shape (1, 32)
obs5.shape (1, 32)
obs6.shape (1, 32)
obs7.shape (1, 32)
obs8.shape (1, 32)
obs9.shape (1, 32)
----> obs10.shape (1, 31)
How can I use concat to add new_obs to the obs DataFrame and get a DataFrame with shape (1, 31) instead of (1, 32)?
new_obs = pd.Series([desired], index=['point'])
new_obs=pd.DataFrame(new_obs)
new_obs.columns=['point']
A Series does not carry a column name. In your original code it is therefore added to the table as a new column with an undefined name. Please give it a column name after converting it to a DataFrame.
You can first transform new_obs as a dataframe, and then use concat:
new_obs2 = pd.DataFrame(new_obs).transpose()
obs11 = pd.concat([obs, new_obs2])
print("obs11.shape",obs11.shape)
Output:
obs11.shape (1, 31)
But maybe there is a more direct way.
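A slightly more direct variant, sketched here with a shortened, made-up column list: Series.to_frame().T turns the Series into a one-row DataFrame whose only column is named point, so concat aligns it with the existing column instead of adding an extra one.

```python
import pandas as pd

# Shortened stand-in for the 31-column frame in the question.
obs = pd.DataFrame(columns=["point", "lat_before", "lon_before"])

desired = pd.Timestamp("2016-03-20 00:05:00", tz="UTC")
new_obs = pd.Series([desired], index=["point"])

# to_frame() gives a single column with row label 'point';
# transposing turns that label into the column name.
row = new_obs.to_frame().T

result = pd.concat([obs, row], ignore_index=True)
print(result.shape)  # (1, 3): one new row, no extra column
```

Passing the bare Series to concat adds a column because a Series is treated as a single column whose index entries become row labels.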

Accessing different level of multi-level index, then convert column for this index to an array, to then pass to function

I have a dataframe. A snippet can be seen below:
import pandas as pd
data = {'EVENT_ID': [112335580,112335580,112335580,112335580,112335580,112335580,112335580,112335580, 112335582,
112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,
112335582,112335582,112335582],
'SELECTION_ID': [6356576,2554439,2503211,6297034,4233251,2522967,5284417,7660920,8112876,7546023,8175276,8145908,
8175274,7300754,8065540,8175275,8106158,8086265,2291406,8065533,8125015],
'BSP': [5.080818565,6.651493872,6.374683435,24.69510797,7.776082305,11.73219964,270.0383021,4,8.294425408,335.3223613,
14.06040142,2.423340019,126.7205863,70.53780982,21.3328554,225.2711962,92.25113066,193.0151362,3.775394142,
95.3786641,17.86333041],
'WIN_LOSE':[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]}
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df.sort_index(level=0, ascending=True, sort_remaining=True)
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df = df.sort_values(["EVENT_ID","BSP"])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df['Win_Percentage'] = 1/df['BSP']
df['Lose_Percentage'] = 1 - df['Win_Percentage']
For each EVENT_ID (index level zero), I would like to fit a linear, an exponential, a power and a logarithmic curve to the Lose_Percentage column.
So the fitted lines for EVENT_ID 112335580 would be based on the points (1, 0.750000), (2, 0.803181), (3, 0.843129), (4, 0.849658), (5, 0.871401), (6, 0.914764), (7, 0.959506), (8, 0.996297). This would then be done for all other EVENT_ID indexes.
To do this, I want to convert the Lose_Percentage column into an array for each EVENT_ID. I have tried the following:
df["Lose_Percentage"][112335580].tolist()
I don't want to access just one EVENT_ID. I want to access the values in the Lose_Percentage column for each EVENT_ID and pass each list to a function.
To fit a line to the data I can use polyfit, so I will need to pass the array to it.
I have also looked for a way to fit log, power and exponential curves, but I cannot find a function that does this.
Any help would be appreciated, cheers.
Sandy
It's not necessary to extract the values. First, define a function that fits the line and evaluates it:
import numpy as np
def fit_eval(df):
    y = df.values
    x = np.arange(0, len(y)) + 1  # x = 1, 2, ..., n
    z = np.polyfit(x, y, 1)       # slope and intercept of the fitted line
    p = np.poly1d(z)
    return p(x)
This function can be used in a groupby:
df['fit'] = df.groupby(level=0)['Lose_Percentage'].transform(fit_eval)
You can select the required list by using loc -
extract = pd.Series(df.loc[112335580]["Lose_Percentage"])
extract.reset_index()
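For the log, power and exponential fits asked about in the question, one common approach is to linearise each model and reuse np.polyfit. The sketch below assumes all y values are positive (true for Lose_Percentage here) and uses x = 1, 2, ..., n as in the question; the function name fit_curves and the layout of the returned dictionary are illustrative. Fitting on log-transformed values minimises the error in log space, which is not identical to a least-squares fit of the original curve.

```python
import numpy as np

def fit_curves(y):
    """Return (a, b) for four models fitted to y at x = 1..n."""
    y = np.asarray(y, dtype=float)
    x = np.arange(1, len(y) + 1)

    # polyfit of degree 1 returns [slope, intercept].
    b_lin, a_lin = np.polyfit(x, y, 1)                      # y = a + b*x
    b_log, a_log = np.polyfit(np.log(x), y, 1)              # y = a + b*ln(x)
    b_exp, ln_a_exp = np.polyfit(x, np.log(y), 1)           # y = a*exp(b*x)
    b_pow, ln_a_pow = np.polyfit(np.log(x), np.log(y), 1)   # y = a*x**b

    return {
        "linear": (a_lin, b_lin),
        "logarithmic": (a_log, b_log),
        "exponential": (np.exp(ln_a_exp), b_exp),
        "power": (np.exp(ln_a_pow), b_pow),
    }

# Points quoted in the question for EVENT_ID 112335580.
y = [0.750000, 0.803181, 0.843129, 0.849658,
     0.871401, 0.914764, 0.959506, 0.996297]
params = fit_curves(y)
print(params["linear"])
```

With the indexed frame from the question, something like {event: fit_curves(s) for event, s in df.groupby(level=0)["Lose_Percentage"]} would give one parameter set per EVENT_ID.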

Numpy get maximum value based on XYZ

I'm trying to read a CSV file with some XYZ data, but gridding it with Python Natgrid raises an error: two input triples have the same x/y coordinates. Here is my array:
np.array([[41.540588, -100.348335, 0.052785],
[41.540588, -100.348335, 0.053798],
[42.540588, -102.348335, 0.021798],
[42.540588, -102.348335, 0.022798],
[43.540588, -103.348335, 0.031798]])
I want to remove XY duplicates and keep the maximum Z value for each pair. For the example above, the result should be this array:
np.array([[41.540588, -100.348335, 0.053798],
[42.540588, -102.348335, 0.022798],
[43.540588, -103.348335, 0.031798]])
I have tried using np.unique, but so far I haven't had any luck because it doesn't work with rows (only columns).
Here is a numpy way: sort by Z in descending order, find the first occurrence of each unique X and Y pair, and index with those positions:
a = np.array([[41.540588, -100.348335, 0.052785],
[41.540588, -100.348335, 0.053798],
[42.540588, -102.348335, 0.021798],
[42.540588, -102.348335, 0.022798],
[43.540588, -103.348335, 0.031798]])
# sort by Z
b = a[np.argsort(a[:,2])[::-1]]
# get first index for each unique x,y pair
u = np.unique(b[:,:2],return_index=True,axis=0)[1]
# index
c = b[u]
>>> c
array([[ 4.15405880e+01, -1.00348335e+02, 5.37980000e-02],
[ 4.25405880e+01, -1.02348335e+02, 2.27980000e-02],
[ 4.35405880e+01, -1.03348335e+02, 3.17980000e-02]])
If you are able to use pandas, you can take advantage of groupby and max
>>> pandas.DataFrame(arr).groupby([0,1], as_index=False).max().values
array([[ 4.15405880e+01, -1.00348335e+02, 5.37980000e-02],
[ 4.25405880e+01, -1.02348335e+02, 2.27980000e-02],
[ 4.35405880e+01, -1.03348335e+02, 3.17980000e-02]])
You can use Pandas via sorting and dropping duplicates:
import pandas as pd
df = pd.DataFrame(arr)
res = df.sort_values(2, ascending=False)\
.drop_duplicates([0, 1])\
.sort_values(0).values
print(res)
array([[ 4.15405880e+01, -1.00348335e+02, 5.37980000e-02],
[ 4.25405880e+01, -1.02348335e+02, 2.27980000e-02],
[ 4.35405880e+01, -1.03348335e+02, 3.17980000e-02]])
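The two pandas answers above refer to an array arr that is never defined. A self-contained sketch of the groupby variant, with illustrative column names x, y and z, could look like this:

```python
import numpy as np
import pandas as pd

arr = np.array([[41.540588, -100.348335, 0.052785],
                [41.540588, -100.348335, 0.053798],
                [42.540588, -102.348335, 0.021798],
                [42.540588, -102.348335, 0.022798],
                [43.540588, -103.348335, 0.031798]])

# Name the columns (names are illustrative), then keep the largest z per (x, y).
df = pd.DataFrame(arr, columns=["x", "y", "z"])
res = df.groupby(["x", "y"], as_index=False)["z"].max().to_numpy()

print(res[:, 2])  # [0.053798 0.022798 0.031798]
```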

Convert numpy array to pandas dataframe

I have a numpy array of size 31x36 and I want to transform it into a pandas dataframe in order to process it. I am trying to convert it using the following code:
pd.DataFrame(data=matrix,
index=np.array(range(1, 31)),
columns=np.array(range(1, 36)))
However, I am receiving the following error:
ValueError: Shape of passed values is (36, 31), indices imply (35, 30)
How can I solve the issue and transform it properly?
As to why what you tried failed, the ranges are off by 1
pd.DataFrame(data=matrix,
index=np.array(range(1, 32)),
columns=np.array(range(1, 37)))
This is because the end value isn't included in the range.
Actually, looking at what you're doing, you could have just done:
pd.DataFrame(data=matrix,
             index=np.arange(1, 32),
             columns=np.arange(1, 37))
Or in pure pandas:
pd.DataFrame(data=matrix,
index=pd.RangeIndex(range(1, 32)),
columns=pd.RangeIndex(range(1, 37)))
Also, if you don't specify the index and columns params, an index and columns are generated automatically, starting from 0. It is unclear why you need them to start from 1.
You could also have not passed the index and column params and just modified them after construction:
In[9]:
df = pd.DataFrame(adaption)
df.columns = df.columns+1
df.index = df.index + 1
df
Out[9]:
1 2 3 4 5 6
1 -2.219072 -1.637188 0.497752 -1.486244 1.702908 0.331697
2 -0.586996 0.040052 1.021568 0.783492 -1.263685 -0.192921
3 -0.605922 0.856685 -0.592779 -0.584826 1.196066 0.724332
4 -0.226160 -0.734373 -0.849138 0.776883 -0.160852 0.403073
5 -0.081573 -1.805827 -0.755215 -0.324553 -0.150827 -0.102148
You get an error because the end argument in range(start, end) is non-inclusive. You have a couple of options to account for this:
Don't pass index and columns
Just use df = pd.DataFrame(matrix). The pd.DataFrame constructor adds integer indices implicitly.
Pass in the shape of the array
matrix.shape gives a tuple of row and column count, so you need not specify them manually. For example:
df = pd.DataFrame(matrix, index=range(matrix.shape[0]),
columns=range(matrix.shape[1]))
If you need to start at 1, remember to add 1:
df = pd.DataFrame(matrix, index=range(1, matrix.shape[0] + 1),
columns=range(1, matrix.shape[1] + 1))
In addition to the above answer, range(1, X) describes the set of numbers from 1 up to X-1 inclusive. You need to use range(1, 32) and range(1, 37) to do what you describe.

Sort numpy string array using positional data

I have a numpy array of strings
names = array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
And corresponding position data
X = array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
How can I sort names using X and Y such that I get out a sorted grid of names with shape (6, 6)? Note that there are essentially 6 unique X and Y positions -- I'm not just arbitrarily choosing 6x6.
names = array([
['p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00'],
['p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01'],
['p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02'],
['p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03'],
['p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04'],
['p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05']])
I realize in this case that I could simply reshape the array, but in general the data will not work out this neatly.
You can use numpy.argsort to get the indexes of the elements of an array after it's sorted. These indices you can then use to sort your names array.
import numpy as np
names = np.array([
'p00x00', 'p01x00', 'p02x00', 'p03x00', 'p04x00', 'p05x00',
'p00x01', 'p01x01', 'p02x01', 'p03x01', 'p04x01', 'p05x01',
'p00x02', 'p01x02', 'p02x02', 'p03x02', 'p04x02', 'p05x02',
'p00x03', 'p01x03', 'p02x03', 'p03x03', 'p04x03', 'p05x03',
'p00x04', 'p01x04', 'p02x04', 'p03x04', 'p04x04', 'p05x04',
'p00x05', 'p01x05', 'p02x05', 'p03x05', 'p04x05', 'p05x05'])
X = np.array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = np.array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])
x_order = np.argsort(X)
y_order = np.argsort(Y)
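# index with a tuple of index arrays; recent NumPy versions no longer treat a list of arrays as one
names_ordered = names.reshape(6,6)[tuple(np.meshgrid(x_order,y_order))]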
print(names_ordered)
gives the following output:
[['p00x05' 'p00x04' 'p00x03' 'p00x02' 'p00x01' 'p00x00']
['p01x05' 'p01x04' 'p01x03' 'p01x02' 'p01x01' 'p01x00']
['p02x05' 'p02x04' 'p02x03' 'p02x02' 'p02x01' 'p02x00']
['p03x05' 'p03x04' 'p03x03' 'p03x02' 'p03x01' 'p03x00']
['p04x05' 'p04x04' 'p04x03' 'p04x02' 'p04x01' 'p04x00']
['p05x05' 'p05x04' 'p05x03' 'p05x02' 'p05x01' 'p05x00']]
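The same reordering can also be written with np.ix_, which builds an open mesh from the two index vectors instead of full index grids. This is an alternative sketch, not part of the answer above; the names are generated in a comprehension only to keep the example short.

```python
import numpy as np

# Same 36 names as above: 'p00x00', 'p01x00', ..., 'p05x05'.
names = np.array([f"p{c:02d}x{r:02d}" for r in range(6) for c in range(6)])
X = np.array([2.102235, 2.094113, 2.086038, 2.077963, 2.069849, 2.061699])
Y = np.array([-7.788431, -7.780364, -7.772306, -7.764247, -7.756188, -7.748114])

x_order = np.argsort(X)
y_order = np.argsort(Y)

# np.ix_ selects rows by x_order and columns by y_order; the transpose
# matches the orientation produced by the meshgrid-based indexing.
names_ordered = names.reshape(6, 6)[np.ix_(x_order, y_order)].T
print(names_ordered[0])
```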
