Suppose that you apply a function to a groupby object, so that every g.apply for every g in the df.groupby(...) gives you a series/dataframe. How do I combine these results into a single dataframe, but with the group names as columns?
I have a dataframe event_df that looks like this:
index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0
I want to create a sampling of the event for every note, and the sampling is done at times as given by t_df:
index t
0 0
1 0.5
2 1.0
So that I'd get something like this.
t C D
0 off off
0.5 on off
1.0 off on
What I've done so far:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx
def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
gb.apply(get_t_for_gb, **kwargs)
So what I get is a number of dataframes for each note, all of the same size (same as t_df):
t event
0 on
0.5 off
t event
0 off
0.5 on
How do I go from here to my desired dataframe, with each group corresponding to a column in a new dataframe, and the index being t?
EDIT: Sorry, I didn't take into account below that you rescale your time column, and I can't present a whole solution right now because I have to leave. I think you could do the rescaling with pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and then try the code below on the merged dataframe. I hope this is what you wanted.
import pandas as pd
import io
sio= io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0""")
df= pd.read_csv(sio, sep=r'\s+', index_col=0)
df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')
agg({'event': 'first'}) takes the first row in each time-note group. unstack(-1) then pivots the innermost index level (note) into columns, so the note values become column headers. Finally, fillna('off') fills every cell for which no data point was found with 'off'.
This outputs:
note C D
0.50 on off
0.75 off on
1.00 off off
You might also want to try min or max in case on/off is ambiguous for a combination of time/note (that is, there are several rows for the same time/note, some with on and some with off) and you prefer one of these values (say, if there is one on, you want an on no matter how many offs there are). If you want something like a majority vote, I would suggest adding a majority-vote column to the aggregated dataframe (before the unstack()).
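The merge_asof idea from the EDIT can be sketched roughly as follows. This is only a sketch on the question's sample data: direction="forward" is chosen to mimic the `t_arr >= time` lookup in the question, and the ffill() step assumes that a note keeps its last state until the next event, which the question does not state explicitly.

```python
import pandas as pd

# Sample data copied from the question.
event_df = pd.DataFrame({
    "event": ["on", "on", "off"],
    "note": ["C", "D", "C"],
    "time": [0.5, 0.75, 1.0],
})
t_df = pd.DataFrame({"t": [0.0, 0.5, 1.0]})

# Snap every event to the first sampling time t with t >= time.
# merge_asof needs both frames sorted on their keys.
snapped = pd.merge_asof(
    event_df.sort_values("time"), t_df,
    left_on="time", right_on="t", direction="forward",
)

sampled = (
    snapped.groupby(["t", "note"])["event"].last()  # last event per (t, note)
    .unstack("note")                                # notes become columns
    .reindex(t_df["t"])                             # one row per sampling time
    .ffill()                                        # assumption: state persists
    .fillna("off")                                  # nothing seen yet -> off
)
print(sampled)
```

On the sample data this reproduces the table asked for in the question (C is on at 0.5 and off at 1.0, D is on at 1.0).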
Oh so I found it! All I had to do was to unstack the groupby results. Going back to generating the groupby result:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx
def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    ## print(group_with_t) ## unnecessary!
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb, **kwargs)
At this point, result is a dataframe with note as an index:
>> print(result)
note t
C 0 off
0.5 on
1.0 off
D 0 off
0.5 off
1.0 on
Doing result = result.unstack('note') does the trick:
>> result = result.unstack('note')
>> print(result)
note C D
0 off off
0.5 on off
1.0 off on
I have a pandas dataframe df which looks like this
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.225660 0.083903
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.029690 0.188627 0.200235 0.224703 0.081434
3 0.009938 0.059595 0.109310 0.069609 0.009970 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009
Then I have a vector dk that looks like this:
What I need to do is:
calculate a new vector which is
psik = [np.log2(dki/1e3) for dki in dk]
calculate the sum of each row multiplied with the psik vector (just as the SUMPRODUCT function of excel)
calculate the log2 of each psik value
expected output should be:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10 psig dg
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083 -5.848002631 0.017361042
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.22566 0.083903 -5.903532822 0.016705502
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.02969 0.188627 0.200235 0.224703 0.081434 -5.908820802 0.016644383
3 0.009938 0.059595 0.10931 0.069609 0.00997 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249 -5.930608559 0.016394906
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009 -5.924408689 0.016465513
I would do that with a for loop cycling over the rows like this
psig = []
for r in rows:
    psig_i = sum([d[i]*ri for i,ri in enumerate(r)])
    psig.append(psig_i)
df['psig'] = psig
df['dg'] = dg
Is there any other way to update the df without iterating through its rows?
EDIT: I found the solution and I am ashamed for how simple it is
df['dg'] = df['psig'].apply(lambda x: np.log2(x))
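For the SUMPRODUCT part, a loop-free sketch is a matrix-vector product of the betasub columns with psik. The numbers below are made up (the dk vector is not shown in the question), and computing dg as 2**psig is my reading of the expected-output table (2**-5.848 is about 0.01736), not something the question states.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real dk vector is not shown above.
df = pd.DataFrame({
    "betasub0": [0.2, 0.1],
    "betasub1": [0.3, 0.4],
    "betasub2": [0.5, 0.5],
    "other": ["x", "y"],          # a non-beta column that must be ignored
})
dk = np.array([2.0, 16.0, 128.0])  # hypothetical values

psik = np.log2(dk / 1e3)

# Keep only the betasub columns, then take one dot product per row.
beta = df.filter(regex=r"^betasub[0-9]")
df["psig"] = beta.to_numpy() @ psik

# Assumption: dg is the inverse transform of psig, as the expected output suggests.
df["dg"] = 2.0 ** df["psig"]
print(df)
```

The `@` product gives the same numbers as summing psik[i]*row[i] over each row, just without the Python-level loop.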
EDIT2: My df now has more entries, so I have to filter it with a regex to find only the columns whose names start with "betasub".
I have my array psik and a new column psig in the df. I would like to calculate for each row (i.e. each value of psig):
I did it like this, but maybe there's a better way?
PsimPsig2 = [[(psik_i-psig_i)**2 for psik_i in psik] for psig_i in list(df['psig'])]
psikmpsigname = ['psikmpsig'+str(i) for i in range(len(psik))]
dfPsimPsig2 = pd.DataFrame(data=PsimPsig2,columns=psikmpsigname)
siggAL = np.power(2,(np.power(pd.DataFrame(df.filter(regex=r'^betasub[0-9]',axis=1).values*dfPsimPsig2.values).sum(axis=1),0.5)))
df['siggAL'] = siggAL
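A possibly simpler variant of the same computation uses NumPy broadcasting instead of building the intermediate dfPsimPsig2 frame. The data here is hypothetical, and the formula is simply the one implemented by the code above: 2 to the power of the square root of the beta-weighted sum of squared differences.

```python
import numpy as np
import pandas as pd

# Hypothetical data, only to make the sketch runnable.
df = pd.DataFrame({
    "betasub0": [0.2, 0.1],
    "betasub1": [0.3, 0.4],
    "betasub2": [0.5, 0.5],
    "psig": [-6.0, -5.5],
})
psik = np.array([-9.0, -6.0, -3.0])  # hypothetical values

beta = df.filter(regex=r"^betasub[0-9]").to_numpy()   # shape (rows, k)
psig = df["psig"].to_numpy()                          # shape (rows,)

# Broadcasting: (1, k) - (rows, 1) -> (rows, k) matrix of squared differences.
sq_diff = (psik[np.newaxis, :] - psig[:, np.newaxis]) ** 2

df["siggAL"] = 2.0 ** np.sqrt((beta * sq_diff).sum(axis=1))
print(df)
```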
I have the following issue:
I have a dataframe with 3 columns :
The first is userID, the second is invoiceType and the third the time of creation of the invoice.
df = pd.read_csv('invoice.csv')
Output: UserID InvoiceType CreateTime
1 a 2018-01-01 12:31:00
2 b 2018-01-01 12:34:12
3 a 2018-01-01 12:40:13
1 c 2018-01-09 14:12:25
2 a 2018-01-12 14:12:29
1 b 2018-02-08 11:15:00
2 c 2018-02-12 10:12:12
I am trying to plot the invoice cycle for each user. I need to create 2 new columns, time_diff and time_diff_wrt_first_invoice. time_diff will represent the time difference between consecutive invoices of each user, and time_diff_wrt_first_invoice will represent the time difference between each invoice and the user's first invoice, which is useful for plotting purposes. This is my code:
********** Exploding a variable that is a list in each dataframe cell
def explode_list(df,x):
return (df[x].apply(pd.Series)
.reset_index(level = 1, drop=True)
****** applying explode_list to all the columns ******
def explode_listDF(df):
exploaded_df = pd.DataFrame()
for x in df.columns.tolist():
exploaded_df = pd.concat([exploaded_df, explode_list(df,x)],
axis = 1)
return exploaded_df
******** Getting the time difference column in pivot table format
def pivoted_diffTime(df, _freq=60):
# _ freq is 1 for minutes frequency
# _freq is 60 for hour frequency
# _ freq is 60*24 for daily frequency
# _freq is 60*24*30 for monthly frequency
df = df.sort_values(['UserID', 'CreateTime'])
df_pivot = df.pivot_table(index = 'UserID',
aggfunc= lambda x : list(v for v in x)
df_pivot['time_diff'] = [[0]]*len(df_pivot)
for user in df_pivot.index:
_list = [0]+[math.floor((x - y).total_seconds()/(60*_freq))
for x,y in zip(df_pivot.loc[user, 'CreateTime'][1:],
df_pivot.loc[user, 'CreateTime'][:-1])]
df_pivot.loc[user, 'time_diff'] = _list
print('There is a prob here :', user)
return df_pivot
***** Pipelining the two functions to obtain an exploaded dataframe
with time difference ******
def get_timeDiff(df, _frequency):
df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))
return df
And once I have time_diff, I am creating time_diff_wrt_first_variable this way:
# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] =
# Then we loop over users and we apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    df_with_timeDiff.loc[df_with_timeDiff.UserID==user,'time_diff_wrt_first_invoice'] = np.cumsum(df_with_timeDiff.loc[df_with_timeDiff.UserID==user,'time_diff'])
The problem is that I have a dataframe with hundreds of thousands of users, and this is very time consuming. I am wondering if there is a solution that better fits my needs.
Check out .loc[] for pandas.
df_1 = pd.DataFrame(some_stuff)
df_2 = df_1.loc[tickers['column'] >= some-condition, 'specific-column']
You can access specific columns and filter rows on a condition; if you add a comma after the condition and put in a specific column name, it'll only return that column.
I'm not 100% sure if that answers whatever question you're asking, cause I didn't actually see one, but it seemed like you were running a lot of for loops and stuff to isolate columns, which is what .loc[] is for.
I have found a better solution. Here's my code :
def next_diff(x):
    return ([0]+[(b-a).total_seconds()/3600 for b,a in zip(x[1:], x[:-1])])
def create_timediff(df):
    df.sort_values(['UserID', 'CreateTime'], inplace=True)
    a = df.groupby('UserID').agg({'CreateTime' :lambda x : list(v for v in x)}).CreateTime.apply(next_diff)
    b = a.apply(np.cumsum)
    a = a.reset_index()
    b = b.reset_index()
    # Here I explode the lists inside the cell
    rows1= []
    _ = a.apply(lambda row: [rows1.append([row['UserID'], nn])
                             for nn in row.CreateTime], axis=1)
    rows2 = []
    __ = b.apply(lambda row: [rows2.append([row['UserID'], nn])
                              for nn in row.CreateTime], axis=1)
    df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
    df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])
    df = df.set_index('UserID')
    df['time_diff']= df1_new['CreateTime']
    df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
    return df
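For comparison, the same two columns can probably be obtained without building and exploding lists, using groupby().diff() and groupby().transform(). This is a sketch on the sample rows from the question, with differences expressed in hours (as in next_diff above); it is not the code from either post.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "UserID": [1, 2, 3, 1, 2, 1, 2],
    "InvoiceType": ["a", "b", "a", "c", "a", "b", "c"],
    "CreateTime": pd.to_datetime([
        "2018-01-01 12:31:00", "2018-01-01 12:34:12", "2018-01-01 12:40:13",
        "2018-01-09 14:12:25", "2018-01-12 14:12:29", "2018-02-08 11:15:00",
        "2018-02-12 10:12:12",
    ]),
})

df = df.sort_values(["UserID", "CreateTime"])
times = df.groupby("UserID")["CreateTime"]

# Hours since the previous invoice of the same user (0 for the first one).
df["time_diff"] = times.diff().dt.total_seconds().div(3600).fillna(0.0)

# Hours since the first invoice of the same user.
first = times.transform("min")
df["time_diff_wrt_first_invoice"] = (df["CreateTime"] - first).dt.total_seconds().div(3600)
print(df)
```

Both steps are vectorized per group, so there is no Python loop over users.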
I'm making my way around GroupBy, but I still need some help. Let's say I have a DataFrame with the columns Group (the object's group number), R (some parameter), and the spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
    'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
    'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 objects of the group closest to it (so I keep 4 objects in each group; we can assume that no group has fewer than 4 objects if needed).
We assume here that we have defined the following functions:
#deg to rad
def d2r(x):
    return x * np.pi / 180.0
#rad to deg
def r2d(x):
    return x * 180.0 / np.pi
#Computes separation on a sphere
def calc_sep(phi1,theta1,phi2,theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1) )
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
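As a quick sanity check of the separation formula, here is a small sketch. It assumes RA and Dec are given in degrees and therefore converts them with d2r before calling calc_sep; the wrapper name sep_deg is made up for this illustration.

```python
import numpy as np

def d2r(x):
    return x * np.pi / 180.0

def r2d(x):
    return x * 180.0 / np.pi

def calc_sep(phi1, theta1, phi2, theta2):
    # Spherical law of cosines; all arguments in radians.
    return np.arccos(np.sin(theta1) * np.sin(theta2) +
                     np.cos(theta1) * np.cos(theta2) * np.cos(phi2 - phi1))

def sep_deg(ra1, dec1, ra2, dec2):
    # Hypothetical helper: degrees in, degrees out.
    return r2d(calc_sep(d2r(ra1), d2r(dec1), d2r(ra2), d2r(dec2)))

# Two points on the equator, a quarter turn apart.
print(sep_deg(0.0, 0.0, 90.0, 0.0))
# Same RA, 30 degrees apart in Dec.
print(sep_deg(10.0, 0.0, 10.0, 30.0))
```

Because calc_sep only uses NumPy functions, the same helper also accepts whole Series for the second pair of coordinates.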
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glosses over a lot of details of the internals.)
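To make the "(group id, DataFrame) pairs" picture concrete, here is a tiny sketch with made-up data. It iterates over the groups by hand and stitches the per-group results together, which is roughly what apply does for you.

```python
import pandas as pd

# Made-up miniature of the question's frame.
df = pd.DataFrame({
    "Group": [1, 1, 2],
    "R": [-21.0, -23.8, -20.4],
})

# Iterating a GroupBy yields (group id, sub-DataFrame) pairs.
pairs = [(key, sub) for key, sub in df.groupby("Group")]
for key, sub in pairs:
    print(key, sub.shape)

# Apply a function to each sub-frame by hand and combine the results;
# the dict keys become the first level of a MultiIndex.
manual = pd.concat(
    {key: sub.loc[[sub["R"].idxmin()]] for key, sub in df.groupby("Group")}
)
print(manual)
```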
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1,Dec1,df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]
Then just apply it with df.groupby('Group').apply(sep_df), and you get a MultiIndex DataFrame where the first index level is the group.
Dec Group R RA
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-Dataframe (this is what GroupBy does under the hood)
    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()] # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec # Relevant 2 scalars within this row
    # Calculate separation for each pair including minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1,Dec1,df['RA'], df['Dec']))
    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index
    # Restrict the result to those 3 index labels + your minimum R
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
I want to built a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
    axis=1)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = merged.sort_values(['Group', 'sep'], ascending=[1,1]
    ).groupby('Group').head(4)[['Dec', 'Group', 'R', 'RA']]
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454
I'm trying to melt certain columns of a pd.DataFrame while preserving the other columns. In this case, I want to melt the sine and cosine columns into a values column, record which column each value came from (i.e. sine or cosine) in a new column called data_type, and preserve the original desc column.
How can I use pd.melt to achieve this without melting and concatenating each component manually?
# Data
a = np.linspace(0,2*np.pi,100)
DF_data = pd.DataFrame([a, np.sin(np.pi*a), np.cos(np.pi*a)], index=["t", "sine", "cosine"], columns=["t_%d"%_ for _ in range(100)]).T
DF_data["desc"] = ["info about this" for _ in DF_data.index]
The round about way I did it:
# Melt each part
DF_melt_A = pd.DataFrame([DF_data["t"],
pd.Series(DF_data.shape[0]*["sine"], index=DF_data.index, name="data_type"),
DF_melt_A.columns = ["idx","t","values","data_type","desc"]
DF_melt_B = pd.DataFrame([DF_data["t"],
pd.Series(DF_data.shape[0]*["cosine"], index=DF_data.index, name="data_type"),
DF_melt_B.columns = ["idx","t","values","data_type","desc"]
# Merge
pd.concat([DF_melt_A, DF_melt_B], axis=0, ignore_index=True)
If I do pd.melt(DF_data), I get a complete meltdown.
In response to the comments:
All right, I had to create a similar df because I did not have access to your a variable. I replaced your a with a range from 0 to 99, so t goes from 0 to 99.
You could do this:
a = range(0, 100)
DF_data = pd.DataFrame([a, [np.sin(x)for x in a], [np.cos(x)for x in a]], index=["t", "sine", "cosine"], columns=["t_%d"%_ for _ in range(100)]).T
DF_data["desc"] = ["info about this" for _ in DF_data.index]
df = pd.melt(DF_data, id_vars=['t','desc'])
this should return what you are looking for.
t desc variable value
0 0.0 info about this sine 0.000000
1 1.0 info about this sine 0.841471
2 2.0 info about this sine 0.909297
3 3.0 info about this sine 0.141120
4 4.0 info about this sine -0.756802
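If you also want the column names from the question (data_type and values) rather than the default variable and value, pd.melt takes var_name and value_name. A minimal sketch on a shortened version of the data (5 rows instead of 100, built directly from a dict):

```python
import numpy as np
import pandas as pd

# Shortened stand-in for DF_data.
t = np.arange(5, dtype=float)
DF_data = pd.DataFrame({"t": t, "sine": np.sin(t), "cosine": np.cos(t)})
DF_data["desc"] = "info about this"

DF_melt = pd.melt(
    DF_data,
    id_vars=["t", "desc"],            # columns to keep as they are
    value_vars=["sine", "cosine"],    # columns to melt
    var_name="data_type",             # name of the "which column" column
    value_name="values",              # name of the melted values column
)
print(DF_melt)
```

Listing value_vars explicitly is optional here, but it keeps the call from melting any extra columns that get added later.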
I have various time series, that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is the greatest.
I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.
The issue I am having with all the numpy/scipy methods, is that they seem to lack awareness of the timeseries nature of my data. When I correlate a time series that starts in say 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020 entries (length of the longer series) array full of nan.
The various Q's on this subject indicate that there should be a way to solve the different length issue, but so far, I have seen no indication on how to use it for specific time periods. I just need to shift by 12 months in increments of 1, for seeing the time of maximum correlation within one year.
Some minimal sample data:
import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq = 'MS')
dfdata1 = (np.random.randint(-30, 31, len(dfdates1))/10.0) #My real data is from measurements, but random between -3 and 3 is fitting
df1 = pd.DataFrame(dfdata1, index = dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq = 'MS')
dfdata2 = (np.random.randint(-30, 31, len(dfdates2))/10.0)
df2 = pd.DataFrame(dfdata2, index = dfdates2)
Due to various processing steps, those dfs end up changed into df that are indexed from 1940 to 2015. this should reproduce this:
bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)
This is what I get when I correlate with pandas and shift one dataset:
In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523
And trying scipy:
In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
array([[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan]])
which according to whos is
scicorr ndarray 1801x1: 1801 elems, type `float64`, 14408 bytes
But I'd just like to have 12 entries.
The idea I have come up with, is to implement a time-lag-correlation myself, like so:
corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on
But this is probably slow, and I am probably trying to reinvent the wheel here. Edit: The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I would still prefer a built-in method.
As far as I can tell, there isn't a built in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:
def autocorr(self, lag=1):
    """
    Lag-N autocorrelation
    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.
    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))
So a simple time-lagged cross-correlation function would be
def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation.
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length
    Returns
    -------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))
Then if you wanted to look at the cross correlations at each month, you could do
xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
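To see how this picks out the lag of maximum correlation, here is a small sketch on synthetic monthly data. The second series is constructed (purely for illustration) as the first one shifted by three months, so the best lag is known in advance.

```python
import numpy as np
import pandas as pd

def crosscorr(datax, datay, lag=0):
    """Lag-N cross correlation of two pandas Series."""
    return datax.corr(datay.shift(lag))

# Synthetic monthly series; y is x moved three months earlier.
rng = np.random.default_rng(0)
index = pd.date_range("1980-01-01", periods=240, freq="MS")
x = pd.Series(rng.normal(size=240), index=index)
y = x.shift(-3)

# Correlation for lags 0..11, indexed by lag.
lags = pd.Series({lag: crosscorr(x, y, lag=lag) for lag in range(12)})
best_lag = lags.idxmax()
print(lags.round(3))
print("best lag:", best_lag)
```

Since corr ignores the NaN rows that shift introduces, series covering different periods can be passed in directly as long as they share a DatetimeIndex with the same frequency.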
There is a better approach: you can create a function that shifts your dataframe first, before calling corr().
Take this dataframe as an example:
d = {'prcp': [0.1,0.2,0.3,0.0], 'stp': [0.0,0.1,0.2,0.3]}
df = pd.DataFrame(data=d)
>>> df
prcp stp
0 0.1 0.0
1 0.2 0.1
2 0.3 0.2
3 0.0 0.3
Your function to shift others columns (except the target):
def df_shifted(df, target=None, lag=0):
    if not lag and not target:
        return df
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return pd.DataFrame(data=new)
Suppose your goal is to compare prcp (the precipitation variable) with stp (atmospheric pressure).
With the data as it stands, the correlation is:
>>> df.corr()
prcp stp
prcp 1.0 -0.2
stp -0.2 1.0
But if you shift all other columns by one period and keep the target (prcp) fixed:
df_new = df_shifted(df, 'prcp', lag=-1)
>>> print(df_new)
prcp stp
0 0.1 0.1
1 0.2 0.2
2 0.3 0.3
3 0.0 NaN
Note that the stp column is now shifted up by one period, so calling corr() gives:
>>> df_new.corr()
prcp stp
prcp 1.0 1.0
stp 1.0 1.0
So you can do the same with a lag of -1, -2, ..., -n.
To build up on Andre's answer - if you only care about (lagged) correlation to the target, but want to test various lags (e.g. to see which lag gives the highest correlations), you can do something like this:
lagged_correlation = pd.DataFrame.from_dict(
{x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})
This way, each row corresponds to a different lag value, and each column corresponds to a different variable (one of them is the target itself, giving the autocorrelation).
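Here is a runnable sketch of that snippet on the small prcp/stp frame from the previous answer. The values target = "prcp" and max_lag = 2 are picked only for the illustration, and the idxmax step at the end is an addition for reading off the best lag per column.

```python
import pandas as pd

# Example frame from the previous answer.
df = pd.DataFrame({
    "prcp": [0.1, 0.2, 0.3, 0.0],
    "stp": [0.0, 0.1, 0.2, 0.3],
})
target = "prcp"   # illustrative choice
max_lag = 2       # illustrative choice: lags 0 and 1

# Row t holds the correlation of the target with each column shifted by -t.
lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)]
     for x in df.columns})
print(lagged_correlation)

# Lag with the strongest (absolute) correlation for every column.
best_lag = lagged_correlation.abs().idxmax()
print(best_lag)
```

For stp the strongest correlation is at lag 1, which matches the df_shifted example above (shifting stp up by one period lines it up with prcp).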