I want to use aggregate to apply some manipulations to a set of matrices, grouped by the customer_id, which is one column of my dataframe, df.
For example, I want to take the subsets of df that correspond to different customer_id's and add some columns to these subsets, and return them all.
In Python, I would use groupby and apply.
How can I do this in R?
The code I wrote looks like:
gr_TILPS = aggregate(df,by=list(df[,"customer_id"]),FUN=kmeansfunction)
Error in TILPSgroup$hour : $ operator is invalid for atomic vectors
The error is coming from the kmeansfunction I guess, which looks something like:
kmeansfunction = function(dfgroup){
Hour =dfgroup$hour
Weekday =TILPSgroup$WeekdayPrime
x <- cbind(Hour, Weekday)
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
clusters = cl$cluster
origclusters = as.factor(clusters)
dfgroup = cbind(dfgroup,origclusters)
return(dfgroup)
}
aggregate applies the same function to each column separately. If you want to work on ensembles of columns, then use this paradigm: lapply(split(df, group), FUN)
Try this:
gr_TILPS <- lapply( split(df, df[,"customer_id"]),
FUN=kmeansfunction)
Sounds like the Python approach might have some similarities to the experimental package 'dplyr'. In a sense, aggregate is only a column-oriented processing strategy within blocks, while the lapply(split(), ...) strategy is more applicable when you are interested in entire rows of data defined by a blocking criterion. If you later want to row-bind those results back together you can always use do.call(rbind, res_from_lapply).
Related
I have a dataframe with columns for IDs, measurements and uncertainties. For some IDs I have more than one value for measurement and uncertainty, so for each ID I need to take the mean of the measurements and add the uncertainties in quadrature. I can use
df["measurement"] = df.groupby("ID")["measurement"].transform("mean")
to get the mean of the measurements, but I can't find a way to get the uncertainties added in quadrature (so a function to perform sqrt((uncertainty_1)**2 + (uncertainty_2)**2 + ...) on the uncertainties _1, _2 and so on in each ID group) to work on grouped data.
I'm looking for something like:
df["uncertainty"] = df.groupby("ID")["uncertainty"].transform("(custom function)").
I looked into using e.g. df.groupby("ID")["uncertainty"].apply(lambda x: custom function) but couldn't get a custom function to work (I tried a few lambda functions but the output values didn't change). Many thanks for your help.
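One route worth trying (a minimal sketch, assuming the columns really are named ID, measurement and uncertainty as in the question) relies on the fact that a lambda passed to transform which returns a single number gets broadcast back to every row of its group, so the quadrature sum can be written directly:
import numpy as np
import pandas as pd

# Toy data with two rows sharing an ID
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "measurement": [10.0, 12.0, 7.0],
    "uncertainty": [0.3, 0.4, 0.2],
})

# Mean of the measurements within each ID group, broadcast back to every row
df["measurement"] = df.groupby("ID")["measurement"].transform("mean")

# Quadrature sum sqrt(u1**2 + u2**2 + ...) within each ID group
df["uncertainty"] = df.groupby("ID")["uncertainty"].transform(lambda u: np.sqrt((u ** 2).sum()))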
I'm new to the library and am trying to figure out how to add columns to a pivot table with the mean and standard deviation of the row data for the last three months of transaction data.
Here's the code that sets up the pivot table:
previousThreeMonths = [prev_month_for_analysis, prev_month2_for_analysis, prev_month3_for_analysis]
dfPreviousThreeMonths = df[df['Month'].isin(previousThreeMonths)]
ptHistoricalConsumption = dfPreviousThreeMonths.pivot_table(dfPreviousThreeMonths,
index=['Customer Part #'],
columns=['Month'],
aggfunc={'Qty Shp':np.sum}
)
ptHistoricalConsumption['Mean'] = ptHistoricalConsumption.mean(numeric_only=True, axis=1)
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.std(numeric_only=True, axis=1)
ptHistoricalConsumption
The resulting pivot table looks like this:
The problem is that the standard deviation column is including the Mean in its calculations, whereas I just want it to use the raw data for the previous three months. For example, the Std Dev of part number 2225 should be 11.269, not 9.2.
I'm sure there's a better way to do this and I'm just missing something.
One way would be to remove the Mean column temporarily before calling .std():
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.drop('Mean', axis=1).std(numeric_only=True, axis=1)
That wouldn't remove it from the pivot table permanently; it would just remove it from the copy fed to .std().
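Another option, just as a sketch, is to capture the raw month columns before any summary columns are appended and compute both statistics from that fixed list, so neither statistic ever sees the other:
# Capture the month columns before 'Mean' and 'Std Dev' are added
month_cols = list(ptHistoricalConsumption.columns)
ptHistoricalConsumption['Mean'] = ptHistoricalConsumption[month_cols].mean(axis=1)
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption[month_cols].std(axis=1)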
I'm trying to filter a Pandas dataframe based on a set of or conditions, but they're all very similar, and I'm wondering if there's a more efficient way to write this.
Specifically, I want to include rows from the dataframe (df) where any of a set of variables is 1:
df.query("Q50r5==1 or Q50r6==1 or Q50r7==1 or Q50r8==1 or Q50r9==1 or Q50r10==1 or Q50r11==1")
This filters correctly to rows where any of these variables is 1.
However, I expect to have a lot more situations where I need to filter my dataframe to something similar, e.g.:
df.query("Q20r1==1 or Q20r2==1 or Q20r3==1")
df.query("Q23r2==1 or Q23r5==1 or Q23r7==1 or Q23r8==1")
The pandas documentation on .query() doesn't specify any more than that you can use and and or like you can elsewhere in Python, so it's possible this is the only way to do this query, but is there some kind of sum or count I could do across these columns within the query? Something like "any(1,Q20r1,Q20r2,Q20r3)" or "sum(Q20r1,Q20r2,Q20r3)>0"?
EDIT: For example, using this small dataframe:
I would want to retrieve ID #s 1,2,4,5,7 and exclude #s 3 and 6, because 3 and 6 do not have any 1's across the columns I'm referring to.
You can use any with axis = 1 to check that at least one value is True in a row.
For example, you can run
df[(df[["Q20r1", "Q20r2", "Q20r3"]] == 1).any(axis = 1)]
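If the column list gets long, as with the Q50 columns in the question, you can build it programmatically (a sketch, assuming the names really are Q50r5 through Q50r11):
cols = [f"Q50r{i}" for i in range(5, 12)]  # Q50r5 ... Q50r11
filtered = df[(df[cols] == 1).any(axis=1)]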
I have a pandas dataframe on which I need to do some data manipulation. The following code gives me the average of the column "Variable" grouped by "key":
df.groupby('key').Variable.transform("mean")
The advantage of using "transform" is that it returns the result back with the same index, which is pretty useful.
Now, I want to define my own custom function and use it within "transform" instead of "mean". Moreover, my function needs two or more columns, something like:
lambda Variable, Variable1, Variable2: (Variable + Variable1) / Variable2
(my actual function is more complicated than this example), and each row of my dataframe has Variable, Variable1 and Variable2.
I am wondering if I can define and use such a customized function within "transform" so that I get the result back with the same index?
Thanks,
Amir
Don't call the grouped operation against Variable alone; group the whole dataframe and reference your variables against the dataframe the function receives as its argument. apply (unlike transform, which hands the function one column at a time) receives the whole group:
df.groupby('key', group_keys=False).apply(lambda x: (x.Variable + x.Variable1)/x.Variable2)
Why not simply use
(df.Variable + df.Variable1) / df.Variable2
There is no need for groupby here. If, for example, you want to divide by df.groupby('key').Variable2.transform("mean") instead, you can still do it with transform as follows:
(df.Variable + df.Variable1) / df.groupby('key').Variable2.transform("mean")
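As a rough sketch of both forms on a toy frame (column names taken from the question, values made up):
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "Variable": [1.0, 2.0, 3.0, 4.0],
    "Variable1": [10.0, 20.0, 30.0, 40.0],
    "Variable2": [2.0, 4.0, 5.0, 5.0],
})

# Plain row-wise arithmetic already keeps the original index; no groupby is needed
result = (df.Variable + df.Variable1) / df.Variable2

# Denominator replaced by the per-key mean of Variable2, broadcast back to each row by transform
result_grouped = (df.Variable + df.Variable1) / df.groupby("key").Variable2.transform("mean")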
I have a dataset that represents reoccurring events at different locations.
df = [Datetime location time event]
Each location can have 8-10 events that repeat. What I'm trying to do is build some information about how long it has been between two events (they may not be the same event).
I am able to do this by splitting the df into sub-dfs and processing each location individually, but it would seem that groupby should be smarter than this. This approach also assumes that I know all the locations, which may vary from file to file.
df1 = df[(df['location'] == "Loc A")]
df1['delta'] = df1['time'] - df1['time'].shift(1)
df2 = df[(df['location'] == "Loc B")]
df2['delta'] = df2['time'] - df2['time'].shift(1)
...
...
What I would like to do is groupby based on location...
dfg = df.groupby(['location'])
Then for each grouped location
Add a delta column
Shift and subtract to get the delta time between events
Questions:
Does groupby maintain the order of events?
Would a for loop that runs over the DF be better? That doesn't seem very Pythonic.
Also, once you have a grouped df, is there a way to transform it back into a regular dataframe? I don't think I need to do this, but thought it may be helpful in the future.
Thank you for any support you can offer.
http://pandas.pydata.org/pandas-docs/dev/groupby.html looks like it provides what you need.
groups = df.groupby('location').groups
or
for name, group in df.groupby('location'):
    # do stuff here
Either form will split the dataframe into groups of rows with matching values in the location column.
Then you can sort the groups based on the time value and iterate through to create the deltas.
It appears that when you group by and select a column to act on, each group's data is returned as a Series, which a function can then be applied to.
deltaTime = lambda x: (x - x.shift(1))
df['delta'] = df.groupby('location')['time'].apply(deltaTime)
This groups by location and returns the time column for each group.
Each sub-series is then passed to the function deltaTime.
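For what it's worth, the same shift-and-subtract is also available directly as diff, which should produce the same deltas (assuming the time column holds datetimes or numbers):
# diff subtracts the previous value within each location group
df['delta'] = df.groupby('location')['time'].diff()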