I've just inherited some code that uses pandas' append method. This code causes Pandas to issue the following warning:
The frame.append method is deprecated and will be removed from pandas
in a future version. Use pandas.concat instead.
So, I want to use pandas.concat, without changing the behavior the append method gave. However, I can't.
Below I've recreated code that illustrates my problem. It creates an empty DataFrame with 31 columns and shape (0,31). When a new, empty row is appended to this DataFrame, the result has shape (1,31). In the code below, I've tried several ways to use concat and get the same behavior as append.
import pandas as pd
# Create Empty Dataframe With Column Headings
obs = pd.DataFrame(columns=['basedatetime_before', 'lat_before', 'lon_before',
'sog_before',
'cog_before',
'heading_before',
'vesselname_before', 'imo_before',
'callsign_before',
'vesseltype_before', 'status_before',
'length_before', 'width_before',
'draft_before',
'cargo_before',
'basedatetime_after', 'lat_after',
'lon_after',
'sog_after',
'cog_after', 'heading_after',
'vesselname_after', 'imo_after',
'callsign_after',
'vesseltype_after', 'status_after',
'length_after', 'width_after',
'draft_after',
'cargo_after'])
# Put initial values in DataFrame
desired = pd.Timestamp('2016-03-20 00:05:00+0000', tz='UTC')
obs['point'] = desired
obs['basedatetime_before'] = pd.to_datetime(obs['basedatetime_before'])
obs['basedatetime_after'] = pd.to_datetime(obs['basedatetime_after'])
obs.rename(lambda s: s.lower(), axis = 1, inplace = True)
# Create new 'dummy' row
new_obs = pd.Series([desired], index=['point'])
# Get initial Shape Information
print("Orig obs.shape", obs.shape)
print("New_obs.shape", new_obs.shape)
print("--------------------------------------")
# Append new dummy row to Data Frame
obs1 = obs.append(new_obs, ignore_index=True)
# Attempt to duplicate effect of append with concat
obs2 = pd.concat([obs, new_obs])
obs3 = pd.concat([obs, new_obs], ignore_index=True)
obs4 = pd.concat([obs, new_obs.T])
obs5 = pd.concat([obs, new_obs.T], ignore_index=True)
obs6 = pd.concat([new_obs, obs])
obs7 = pd.concat([new_obs, obs], ignore_index=True)
obs8 = pd.concat([new_obs.T, obs])
obs9 = pd.concat([new_obs.T, obs], ignore_index=True)
# Verify original DataFrame hasn't changed and append still works
obs10 = obs.append(new_obs, ignore_index=True)
# Print results
print("----> obs1.shape",obs1.shape)
print("obs2.shape",obs2.shape)
print("obs3.shape",obs3.shape)
print("obs4.shape",obs4.shape)
print("obs5.shape",obs5.shape)
print("obs6.shape",obs6.shape)
print("obs7.shape",obs7.shape)
print("obs8.shape",obs8.shape)
print("obs9.shape",obs9.shape)
print("----> obs10.shape",obs10.shape)
However, every way I've tried to use concat to add a new row to the DataFrame results in a new DataFrame with shape (1,32). This can be seen in the results shown below:
Orig obs.shape (0, 31)
New_obs.shape (1,)
--------------------------------------
----> obs1.shape (1, 31)
obs2.shape (1, 32)
obs3.shape (1, 32)
obs4.shape (1, 32)
obs5.shape (1, 32)
obs6.shape (1, 32)
obs7.shape (1, 32)
obs8.shape (1, 32)
obs9.shape (1, 32)
----> obs10.shape (1, 31)
How can I use concat to add new_obs to the obs DataFrame and get a DataFrame with shape (1, 31) instead of (1, 32)?
new_obs = pd.Series([desired], index=['point'])
new_obs=pd.DataFrame(new_obs)
new_obs.columns=['point']
A Series has no column name. In your original code, concat therefore adds it as an extra column without a defined name, which is why the result has 32 columns. Convert the Series to a DataFrame first and give it a column name.
You can first transform new_obs as a dataframe, and then use concat:
new_obs2 = pd.DataFrame(new_obs).transpose()
obs11 = pd.concat([obs, new_obs2])
print("obs11.shape",obs11.shape)
Output:
obs11.shape (1, 31)
But maybe there is a more direct way.
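One somewhat more direct option, offered as a sketch rather than a definitive replacement for append: Series.to_frame() followed by .T turns the Series into a one-row DataFrame whose only column is 'point', so concat aligns it with the existing columns instead of adding a new one. The three-column obs below is a shortened, hypothetical stand-in for the 31-column frame in the question.

```python
import pandas as pd

# Shortened stand-in for the question's 31-column frame (assumption for brevity)
obs = pd.DataFrame(columns=["lat_before", "lon_before", "point"])
desired = pd.Timestamp("2016-03-20 00:05:00", tz="UTC")
new_obs = pd.Series([desired], index=["point"])

# to_frame() gives a one-column frame indexed by 'point';
# .T flips it into a single row whose column is labelled 'point'
row = new_obs.to_frame().T

# The row's columns are a subset of obs.columns, so no extra column appears
result = pd.concat([obs, row], ignore_index=True)
print(result.shape)
```

Recent pandas versions may emit a warning about concatenating empty frames here; the column count is still that of obs.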
I'm trying to apply a function row-by-row which takes 5 inputs, 3 of which are lists. I want these lists to come from each row of 3 corresponding DataFrames.
I've tried using 'apply' and 'lambda' as follows:
sol['tf_dd']=sol.apply(lambda tsol, rfsol, rbsol:
taurho_difdif(xy=xy,
l=l,
t=tsol,
rf=rfsol,
rb=rbsol),
axis=1)
However I get the error <lambda>() missing 2 required positional arguments: 'rfsol' and 'rbsol'
The DataFrame sol and the DataFrames tsol, rfsol and rbsol all have the same length. For each row, I want the entire row from tsol, rfsol and rbsol to be input as three lists.
Here is a much simplified example (first with single lists, which I then want to replicate row by row with DataFrames):
The output with single lists is a single value (120). With dataframes as inputs I want an output dataframe of length 10 where all values are 120.
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
def simple_func(t, rf, rb):
    x=sum(t)
    y=sum(rf)
    z=sum(rb)
    return x+y+z
out=simple_func(t,rf,rb)
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=['output'])
out2['output'] = out2.apply(lambda tsol, rfsol, rbsol:
simple_func(t=tsol.tolist(),
rf=rfsol.tolist(),
rb=rbsol.tolist()),
axis=1)
Try using the name attribute of the row Series to get its index value, then use that value to select the row at the same position in the other DataFrames:
import pandas as pd
import numpy as np
def postional_sum(inot, df1, df2, df3):
    """
    Use the index of the input row to sum the rows at the same position in the other DataFrames
    """
    position = inot.name
    x = df1.iloc[position].sum()
    y = df2.iloc[position].sum()
    z = df3.iloc[position].sum()
    return x + y + z
# dataframe rows as lists
tsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rfsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rbsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = out2.apply(lambda x: postional_sum(x, tsol, rfsol, rbsol), axis=1)
out2
Hope this helps!
When you run df.apply() with axis=1, it does not pass the columns to the function as individual arguments; it passes each row as a single Series object. Assuming tsol, rfsol and rbsol are available as columns of the DataFrame you call apply on, the correct way to do this would be
out2['output'] = out2.apply(lambda row:
simple_func(t=row["tsol"],
rf=row["rfsol"],
rb=row["rbsol"]),
axis=1)
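For the example in the question, where the three inputs are separate DataFrames, one way to make row["tsol"] meaningful is to place the frames side by side under top-level column keys first. This is only a sketch under that assumption; the name combined is made up for illustration.

```python
import pandas as pd

t = [1, 2, 3, 4, 5]
rf = [6, 7, 8, 9, 10]
rb = [11, 12, 13, 14, 15]

def simple_func(t, rf, rb):
    return sum(t) + sum(rf) + sum(rb)

tsol = pd.DataFrame([t] * 10)
rfsol = pd.DataFrame([rf] * 10)
rbsol = pd.DataFrame([rb] * 10)

# Put the three frames side by side; keys become the top column level,
# so row["tsol"] returns that frame's part of the row as a Series
combined = pd.concat([tsol, rfsol, rbsol], axis=1,
                     keys=["tsol", "rfsol", "rbsol"])

out2 = pd.DataFrame(index=combined.index)
out2["output"] = combined.apply(
    lambda row: simple_func(t=row["tsol"].tolist(),
                            rf=row["rfsol"].tolist(),
                            rb=row["rbsol"].tolist()),
    axis=1)
```

Each row reaches the lambda as one Series, which is why the lambda takes a single argument.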
You can eliminate the simple function using this:
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
Here is the complete code:
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
print(out2)
OUTPUT:
output
0 120
1 120
2 120
3 120
4 120
5 120
6 120
7 120
8 120
9 120
I have a DataFrame. A snippet can be seen below:
import pandas as pd
data = {'EVENT_ID': [112335580,112335580,112335580,112335580,112335580,112335580,112335580,112335580, 112335582,
112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,
112335582,112335582,112335582],
'SELECTION_ID': [6356576,2554439,2503211,6297034,4233251,2522967,5284417,7660920,8112876,7546023,8175276,8145908,
8175274,7300754,8065540,8175275,8106158,8086265,2291406,8065533,8125015],
'BSP': [5.080818565,6.651493872,6.374683435,24.69510797,7.776082305,11.73219964,270.0383021,4,8.294425408,335.3223613,
14.06040142,2.423340019,126.7205863,70.53780982,21.3328554,225.2711962,92.25113066,193.0151362,3.775394142,
95.3786641,17.86333041],
'WIN_LOSE':[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]}
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df.sort_index(level=0, ascending=True, sort_remaining=True)  # sortlevel was removed in pandas 1.0
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df = df.sort_values(["EVENT_ID","BSP"])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df['Win_Percentage'] = 1/df['BSP']
df['Lose_Percentage'] = 1 - df['Win_Percentage']
For each EVENT_ID, so index level zero, I would like to fit an equation of a line, exponential, power and log based on Lose_Percentage column.
So the fitted lines for EVENT_ID 112335580 would be based on the points (1, 0.750000), (2, 0.803181), (3, 0.843129), (4, 0.849658), (5, 0.871401), (6, 0.914764), (7, 0.959506), (8, 0.996297). This would then be done for all other EVENT_ID indexes.
To do this, I want to convert the Lose_Percentage column into an array for each EVENT_ID. I have tried the following:
df["Lose_Percentage"][112335580].tolist()
I don't want to access just one EVENT_ID; I want to access the values in the Lose_Percentage column for each EVENT_ID and pass each list to a function.
To fit a line to the data I can use polyfit, so I will need to pass the array to it.
I have also looked for a way to fit log, power and exponential curves, but cannot find a function that does this.
Any help would be appreciated, cheers.
Sandy
It's not necessary to extract the values. First, define a function that fits a line and evaluates it:
import numpy as np
def fit_eval(df):
    y = df.values
    x = np.arange(0, len(y)) + 1
    z = np.polyfit(x, y, 1)
    p = np.poly1d(z)
    return p(x)
This function can be used in a groupby:
df['fit'] = df.groupby(level=0)['Lose_Percentage'].transform(fit_eval)
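For the log, exponential and power curves the question also asks about, one common approach (a sketch, not the only option) is to reuse np.polyfit on transformed data: fit against ln(x), against ln(y), or both. The helper name fit_curves is made up for illustration, and the example assumes all y values are positive. Fitting in log space minimises the error of ln(y), not of y itself, so the result can differ from a true nonlinear least-squares fit.

```python
import numpy as np

def fit_curves(y):
    """Return polyfit coefficients for four simple models of y against x = 1..n."""
    y = np.asarray(y, dtype=float)
    x = np.arange(1, len(y) + 1)
    return {
        # y = m*x + c                      -> [m, c]
        "linear": np.polyfit(x, y, 1),
        # y = a*ln(x) + b                  -> [a, b]
        "log": np.polyfit(np.log(x), y, 1),
        # y = A*exp(b*x), ln(y) = b*x + ln(A)       -> [b, ln(A)]
        "exponential": np.polyfit(x, np.log(y), 1),
        # y = A*x**b, ln(y) = b*ln(x) + ln(A)       -> [b, ln(A)]
        "power": np.polyfit(np.log(x), np.log(y), 1),
    }

# Points for EVENT_ID 112335580 as listed in the question
coeffs = fit_curves([0.750000, 0.803181, 0.843129, 0.849658,
                     0.871401, 0.914764, 0.959506, 0.996297])
```

To run this per event, you could iterate over the groups, for example {event_id: fit_curves(s.to_numpy()) for event_id, s in df.groupby(level=0)["Lose_Percentage"]}.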
You can select the required list by using loc:
extract = pd.Series(df.loc[112335580]["Lose_Percentage"])
extract.reset_index()
I've got a pandas DataFrame object which contains NaNs. I would like to find all blocks of consecutive valid values for each column and, for each of these blocks, the first and the last index.
Example data:
[
[ 1,nan],
[ 2,nan],
[ 3,nan],
[ 4,3.0],
[ 5,1.0],
[ 6,4.0],
[ 7,1.0],
[ 8,5.0],
[ 9,9.0],
[10,2.0],
[11,nan],
[12,nan],
[13,6.0],
[14,5.0],
[15,3.0],
[16,5.0]
]
where the first column is the index and the second column is the value I'd like to filter on. The result of this should be something like
[(4,10), (13,16)]
I would like to avoid manually iterating through the data by means of a for-loop for performance reasons...
Update 1:
Two additional criteria:
The valid values in the value column don't have to be equal. They can take any valid float value between -inf and +inf
I only need the first and the last index of valid blocks, not the NaN blocks in between.
I think you can use:
#set column names and set index by first column
df.columns = ['idx', 'a']
df = df.set_index('idx')
#find groups
df['b'] = (df.a.isnull() != df.a.shift(1).isnull()).cumsum()
#remove NaN
df = df[df.a.notnull()].reset_index()
#aggregate first and last values of column idx
df = df['idx'].groupby(df.b).agg(['first', 'last'])
print(list(zip(df['first'], df['last'])))
[(4, 10), (13, 16)]
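As a self-contained sketch of the same idea on the question's example data: a new block starts wherever the valid/invalid flag changes, and a cumulative sum of those changes gives each block an id to group by. The variable names here are chosen for illustration.

```python
import numpy as np
import pandas as pd

values = ([np.nan] * 3 + [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
          + [np.nan] * 2 + [6.0, 5.0, 3.0, 5.0])
s = pd.Series(values, index=range(1, 17))

valid = s.notna()
# True wherever the flag differs from the previous row, i.e. a block boundary
starts = valid != valid.shift(fill_value=False)
block_id = starts.cumsum()

# Keep only valid rows, then take the first and last index label per block
idx = s.index.to_series()
blocks = idx[valid].groupby(block_id[valid]).agg(["first", "last"])
result = [(int(a), int(b)) for a, b in zip(blocks["first"], blocks["last"])]
print(result)
```

No explicit Python loop over the rows is needed; the only loop is over the (few) resulting blocks.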
Then I tried modifying cggarvey's solution:
#set column names and set index by first column
df.columns = ['idx', 'a']
df = df.set_index('idx')
#find edges
pre = df['a'] - df['a'].diff(-1)
pst = df['a'] - df['a'].diff(1)
a = pre.notnull() & pst.isnull()
z = pre.isnull() & pst.notnull()
print(list(zip(a[a].index, z[z].index)))
[(4, 10), (13, 16)]
Here's an example using NumPy. I'm not sure how it compares to @jezrael's solution, but you mentioned performance as a requirement, so you can compare the two.
Note: This assumes your columns are named "index" and "val"
import numpy as np
pre = np.array(df['val'] - df.diff(-1)['val'])
pst = np.array(df['val'] - df.diff(1)['val'])
a = np.where(~np.isnan(pre) & np.isnan(pst))
z = np.where(np.isnan(pre) & ~np.isnan(pst))
# .ix was removed in pandas 1.0; a[0] and z[0] hold row positions, so use iloc
output = list(zip(df['index'].iloc[a[0]], df['index'].iloc[z[0]]))
Output:
[(4, 10), (13, 16)]
Setup:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a','b'])
output:
c d e
a b
0.439502 0.115087 0.832546 0.760513 0.776555
0.609107 0.247642 0.031650 0.727773
0.995370 0.299640 0.053523 0.565753 0.857235
0.392132 0.832560 0.774653 0.213692
Each data series is grouped by the index ID a, and b represents a time index for the other features of a. Is there a way to get pandas to produce a NumPy 3D array that reflects the a groupings? Currently it reads the data as two-dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to have a variable form like this:
array([[[-1.38655912, -0.90145951, -0.95106951, 0.76570984],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576]],
[[ 0.0768149 , -0.7566995 , -2.57770951, 0.70834656],
[-0.99097395, -0.81592084, -1.21075386, 0.12361382]]])
Is there a native Pandas way to do this? Note that number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?
I just had an extremely similar problem and solved it like this:
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
output:
array([[[ 0.47780308, 0.93422319, 0.00526572, 0.41645868, 0.82089215],
[ 0.47780308, 0.15372096, 0.20948369, 0.76354447, 0.27743855]],
[[ 0.75146799, 0.39133973, 0.25182206, 0.78088926, 0.30276705],
[ 0.75146799, 0.42182369, 0.01166461, 0.00936464, 0.53208731]]])
To verify that it is 3D: a3d.shape gives (2, 2, 5).
Lastly, to make the newly created dimension the last dimension (instead of the first), use:
a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))
which has a shape of (2, 5, 2)
For cases where the data is ragged (as brought up by CharlesG in the comments) you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy to deal with missing data varies from case to case. In this example we simply add zeros for the missing rows.
Example setup with ragged shape:
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a','b'])
dataframe:
c d e
a b
0.460013 0.577535 0.299304 0.617103 0.378887
0.167907 0.244972 0.615077 0.311497
0.318823 0.640575 0.768187 0.652760 0.822311
0.424744 0.958405 0.659617 0.998765
0.077048 0.407182 0.758903 0.273737
One possible solution:
n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
.apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)), 'constant'))))
a3d.shape gives (2, 3, 5)
as_matrix is deprecated. Here we assume the first key is a, and that the groups in a may have different lengths; this method handles both issues.
import pandas as pd
import numpy as np
from typing import List
def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a Dataframe
    Args:
        df: Dataframe
        idx_cols: columns defining the dimensions of the cube
    Returns:
        multi-dimensional array
    """
    assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'
    df = df.set_index(keys=idx_cols)  # don't overwrite a parameter, thus copy!
    # one slot per index value; the index values must be 0-based integer codes
    idx_dims = [len(level) for level in df.index.levels]
    idx_dims.append(len(df.columns))
    cube = np.empty(idx_dims)
    cube.fill(np.nan)
    cube[tuple(np.array(df.index.to_list()).T)] = df.values
    return cube
Test:
pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
# a, b must be integer
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
.assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
)
make_cube(pdf1, ['a', 'b']).shape
gives (2, 2, 3)
pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
.assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
)
make_cube(pdf1, ['a', 'b']).shape
gives (2, 3, 3).
panel.values
will return a NumPy array directly. This will by necessity be the highest acceptable dtype, as everything is smushed into a single 3-D NumPy array. It will be a new array and not a view of the pandas data (no matter the dtype).
Instead of the deprecated .as_matrix() or, alternatively, .values, the pandas documentation recommends using .to_numpy():
'Warning: We recommend using DataFrame.to_numpy() instead.'
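A sketch of the grouping approach written with to_numpy() and a plain loop over the groups, which avoids as_matrix entirely. The small frame and its values are made up for illustration, and shorter groups are padded with NaN rows, which is only one of several possible conventions for ragged data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group a=0.5 has three rows, group a=0.9 has two
pdf = pd.DataFrame({
    "a": [0.5, 0.5, 0.5, 0.9, 0.9],
    "b": [0.1, 0.2, 0.3, 0.1, 0.2],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d": [6.0, 7.0, 8.0, 9.0, 10.0],
})

# One 2-D array per group (all columns, including 'a', are kept)
groups = [g.to_numpy() for _, g in pdf.groupby("a")]
n_max = max(len(g) for g in groups)

# Pad each group at the bottom with NaN rows so they all have n_max rows,
# then stack along a new first axis
a3d = np.stack([
    np.pad(g, ((0, n_max - len(g)), (0, 0)), constant_values=np.nan)
    for g in groups
])
print(a3d.shape)
```

Use np.dstack instead of np.stack if the group axis should come last.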