How to apply custom function to grouped pandas data? - python

I have a dataframe with columns for IDs, measurements and uncertainties. For some IDs I have more than one value for measurement and uncertainty, so for each ID I need to take the mean of the measurements and add the uncertainties in quadrature. I can use
df["measurement"] = df.groupby("ID")["measurement"].transform("mean")
to get the mean of the measurements, but I can't find a way to get the uncertainties added in quadrature (i.e. a function that performs sqrt((uncertainty_1)**2 + (uncertainty_2)**2 + ...) on the uncertainties _1, _2 and so on within each ID group) to work on grouped data.
I'm looking for something like:
df["uncertainty"] = df.groupby("ID")["uncertainty"].transform("(custom function)").
I looked into using e.g. df.groupby("ID")["uncertainty"].apply(lambda x: custom function) but couldn't get a custom function to work (I tried a few lambda functions but the output values didn't change). Many thanks for your help.
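For reference, transform accepts any callable that reduces the group's Series to a scalar, and that scalar is broadcast back to every row of the group, so a NumPy-based lambda should do the quadrature sum. A minimal runnable sketch with made-up values:

import numpy as np
import pandas as pd

# toy data with repeated IDs (made-up values)
df = pd.DataFrame({
    "ID": ["a", "a", "b"],
    "measurement": [1.0, 3.0, 5.0],
    "uncertainty": [0.3, 0.4, 0.1],
})

df["measurement"] = df.groupby("ID")["measurement"].transform("mean")
# quadrature sum: sqrt(u_1**2 + u_2**2 + ...) per ID group
df["uncertainty"] = df.groupby("ID")["uncertainty"].transform(
    lambda x: np.sqrt((x ** 2).sum())
)
print(df)  # both rows of ID "a" get measurement 2.0 and uncertainty 0.5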

Related

How to delete specific columns in large measurement data which do not contain certain values?

I have large measurement data containing 350 columns after filtering (for example A0 to A49, B0 to B49, F0 to F49) with some random numbers.
Now I want to look into B0 to B49 and check whether they have values in a range (say, between 20 and 30). If not, I want to delete those columns from the measurement data.
How can I do this in Python with pandas?
I would also like to know faster methods for this filtering.
sample data:https://docs.google.com/spreadsheets/d/17Xjc81jkjS-64B4FGZ06SzYDRnc6J27m/edit?usp=sharing&ouid=106137353367530025738&rtpof=true&sd=true
(In pandas) You can apply a function to every element of a DataFrame using the applymap method. You can also apply aggregating actions to reduce a whole column to a single value. Put those two things together and you have what you want.
For instance, you want to know whether a given set of columns (the "B" ones) has values in some range (say, 20 to 30). So you want to test the values at the element level, but collect the column names as output.
You can do that with the following code. Execute the steps separately/progressively to understand what they are doing.
>>> b_cols_of_interest_indx = df.filter(regex='^B').applymap(lambda x:20<x<30).any()
>>> b_cols_of_interest_indx[b_cols_of_interest_indx]
B19 True
B21 True
dtype: bool
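If the goal is to actually delete the 'B' columns whose values never fall in the range, one way is to invert that boolean index and drop the names it selects. A sketch reusing the mask above (note that applymap has been renamed to DataFrame.map since pandas 2.1):

in_range = df.filter(regex='^B').applymap(lambda x: 20 < x < 30).any()
# drop the 'B' columns that have no value in (20, 30)
df = df.drop(columns=in_range.index[~in_range])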

Generate a pandas dataframe with for-loop

I have generated a dataframe (called 'sectors') that stores information from my brokerage account (sector/industry, sub sector, company name, current value, cost basis, etc).
I want to avoid hard coding a filter for each sector or sub sector to find specific data. I have achieved this with the following code (I know, not very pythonic, but I am new to coding):
for x in set(sectors_df['Sector']):
    x_filt = sectors_df['Sector'] == x
    # value_in_sect takes the sum of all current values in a given sector
    value_in_sect = round(sectors_df.loc[x_filt]['Current Value'].sum(), 2)
    # pct_in_sect is the % of the sector in the overall portfolio (total equals the total value of all sectors)
    pct_in_sect = round((value_in_sect/total)*100, 2)
    print(x, value_in_sect, pct_in_sect)
for sub in set(sectors_df['Sub Sector']):
    sub_filt = sectors_df['Sub Sector'] == sub
    value_of_subs = round(sectors_df.loc[sub_filt]['Current Value'].sum(), 2)
    pct_of_subs = round((value_of_subs/total)*100, 2)
    print(sub, value_of_subs, pct_of_subs)
My print statements produce most of the information I want, although I am still working out how to compute the % of a sub sector within its own sector. Anyway, I would now like to put this information (value_in_sect, pct_in_sect, etc.) into dataframes of their own. What would be the best, smartest, most pythonic way to go about this? I am thinking a dictionary, and then creating a dataframe from the dictionary, but I'm not sure.
The split-apply-combine process in pandas, specifically aggregation, is the best way to go about this. First I'll explain how this process would work manually, and then I'll show how pandas can do it in one line.
Manual split-apply-combine
Split
First, divide the DataFrame into groups of the same Sector. This involves getting the list of Sectors and figuring out which rows belong to each (just like the first two lines of your code). This code runs through the DataFrame and builds a dictionary whose keys are Sectors and whose values are lists of indices of the rows from sectors_df that correspond to each Sector.
sectors_index = {}
for ix, row in sectors_df.iterrows():
    if row['Sector'] not in sectors_index:
        sectors_index[row['Sector']] = [ix]
    else:
        sectors_index[row['Sector']].append(ix)
Apply
Run the same function, in this case summing of Current Value and calculating its percentage share, on each group. That is, for each sector, grab the corresponding rows from the DataFrame and run the calculations in the next lines of your code. I'll store the results as a dictionary of dictionaries: {'Sector1': {'value_in_sect': 1234.56, 'pct_in_sect': 11.11}, 'Sector2': ... } for reasons that will become obvious later:
sector_total_value = {}
total_value = sectors_df['Current Value'].sum()
for sector, row_indices in sectors_index.items():
    sector_df = sectors_df.loc[row_indices]
    current_value = sector_df['Current Value'].sum()
    sector_total_value[sector] = {
        'value_in_sect': round(current_value, 2),
        'pct_in_sect': round(current_value/total_value * 100, 2),
    }
(see footnote 1 for a note on rounding)
Combine
Finally, collect the function results into a new DataFrame, where the index is the Sector. pandas can easily convert this nested dictionary structure into a DataFrame:
sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
split-apply-combine using groupby
pandas makes this process very simple using the groupby method.
Split
The groupby method splits a DataFrame into groups by a column or multiple columns (or even another Series):
grouped_by_sector = sectors_df.groupby('Sector')
grouped_by_sector is similar to the index we built earlier, but the groups can be manipulated much more easily, as we can see in the following steps.
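As a quick sanity check, the groups attribute exposes essentially the same mapping we built by hand, with Sectors as keys and row labels as values:

# dict-like: {'Sector1': Index([...]), 'Sector2': Index([...]), ...}
print(grouped_by_sector.groups)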
Apply
To calculate the total value in each group, select the column or columns to sum up and use the agg or aggregate method with the function you want to apply:
sector_total_value_df = grouped_by_sector['Current Value'].agg(value_in_sect=sum)
Combine
It's already done! The apply step already creates a DataFrame where the index is the Sector (the groupby column) and the value in the value_in_sect column is the result of the sum operation.
I've left out the pct_in_sect part because a) it can be more easily done after the fact:
sector_total_value_df['pct_in_sect'] = round(sector_total_value_df['value_in_sect'] / total_value * 100, 2)
sector_total_value_df['value_in_sect'] = round(sector_total_value_df['value_in_sect'], 2)
and b) it's outside the scope of this answer.
Most of this can be done easily in one line (see footnote 2 for including the percentage, and rounding):
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(value_in_sect=sum)
For sub sectors, there's one additional consideration: grouping should be done by Sector and Sub Sector rather than just Sub Sector, so that, for example, rows from Utilities/Gas and Energy/Gas aren't combined.
subsector_total_value_df = sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value'].agg(value_in_sect=sum)
This produces a DataFrame with a MultiIndex with levels 'Sector' and 'Sub Sector', and a column 'value_in_sect'. For a final piece of magic, the percentage in Sector can be calculated quite easily:
subsector_total_value_df['pct_within_sect'] = round(subsector_total_value_df['value_in_sect'] / sector_total_value_df['value_in_sect'] * 100, 2)
which works because the 'Sector' index level is matched during division.
Footnote 1. This deviates from your code slightly, because I've chosen to calculate the percentage using the unrounded total value, to minimize the error in the percentage. Ideally though, rounding is only done at display time.
Footnote 2. This one-liner generates the desired result, including percentage and rounding:
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(
    value_in_sect=lambda c: round(sum(c), 2),
    pct_in_sect=lambda c: round(sum(c)/sectors_df['Current Value'].sum() * 100, 2),
)

Lambda Function with IF condition in Python Pandas DataFrame behaves weirdly

I have a CSV File with records (rows) of sales made in a store. Each record contains information about the client and the purchase done in the store (columns).
After opening my file as a DataFrame named sales, I calculate the mean of one of the columns (Amount_Sales), and I want to add a new column (Type_of_Sales) according to the following rule: if the value is lower than the mean of Amount_Sales, assign the string 'Low'; if it is higher, assign the string 'High'.
I tried to use a lambda function:
sales['Type_of_Sales'] = sales['Amount_Sales'].apply(lambda x: 'Low' if x < sales.Amount_Sales.mean() else 'High')
and it doesn't work (the console stops responding, as if it were locked in an infinite loop).
But if I calculate the mean beforehand, assign it to a variable, and then use that variable in the lambda function definition, it works:
sales_mean = sales.Amount_Sales.mean()
sales['Type_of_Sales'] = sales['Amount_Sales'].apply(lambda x: 'Low' if x < sales_mean else 'High')
Does anyone know why one version works and the other doesn't?
Thanks in advance!!!
I believe it could be because, in the first approach, you are recalculating the mean of Amount_Sales for every row of the new column, which is computationally expensive and is probably why your console appears to hang.
In the second approach you calculate the mean only once and reuse the value.
This explanation assumes the dataset you are working with is reasonably large.
Again, I'm not entirely sure this is the cause.
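Either way, computing the mean once is the fix. For what it's worth, a fully vectorized variant avoids apply altogether; a small sketch with made-up data:

import numpy as np
import pandas as pd

sales = pd.DataFrame({"Amount_Sales": [10.0, 55.0, 30.0, 80.0]})  # made-up values

sales_mean = sales["Amount_Sales"].mean()  # computed once, outside any loop
sales["Type_of_Sales"] = np.where(sales["Amount_Sales"] < sales_mean, "Low", "High")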

Apply transform of your own function in Pandas dataframe

I have a pandas dataframe on which I need to do some data manipulation. The following code gives me the average of the column "Variable" grouped by "Key":
df.groupby('key').Variable.transform("mean")
The advantage of using "transform" is that it returns the result with the same index, which is pretty useful.
Now I want to use my own customized function within "transform" instead of "mean". Moreover, my function needs two or more columns, something like:
lambda (Variable, Variable1, Variable2): (Variable + Variable1)/Variable2
(my actual function is more complicated than this example) and each row of my dataframe has Variable, Variable1 and Variable2.
I am wondering if I can define and use such a customized function within "transform", so that the result comes back with the same index?
Thanks,
Amir
Don't call transform on Variable; call it on the grouper and then reference your variables on the dataframe the function receives as its argument:
df.groupby('key').transform(lambda x: (x.Variable + x.Variable1)/x.Variable2)
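Note: depending on the pandas version, DataFrameGroupBy.transform with a user-defined function may pass each column to it separately as a Series, in which case attribute access like x.Variable fails. A per-group apply is one way to keep the original index; a sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "Variable": [1.0, 2.0, 3.0],
    "Variable1": [4.0, 5.0, 6.0],
    "Variable2": [2.0, 2.0, 3.0],
})

# apply receives each group as a DataFrame, so several columns are available;
# group_keys=False keeps the result aligned with the original index
result = df.groupby("key", group_keys=False).apply(
    lambda g: (g.Variable + g.Variable1) / g.Variable2
)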
Why didn't you simply use
(df.Variable + df.Variable1) / df.Variable2
There is no need to groupby: the computation is purely row-wise. (Note the parentheses; without them, the division binds tighter than the addition.) If, for example, you instead want to divide by df.groupby('key').Variable2.transform("mean"), you can still do it with transform as follows:
(df.Variable + df.Variable1) / df.groupby('key').Variable2.transform("mean")

using aggregate in R when returning matrix

I want to use aggregate to apply some manipulations to a set of matrices, grouped by the customer_id, which is one column of my dataframe, df.
For example, I want to take the subsets of df that correspond to different customer_id's and add some columns to these subsets, and return them all.
In Python, I would use groupby and apply.
How can I do this in R?
The code I wrote looks like:
gr_TILPS = aggregate(df, by=list(df[,"customer_id"]), FUN=kmeansfunction)
Error in TILPSgroup$hour : $ operator is invalid for atomic vectors
The error is coming from the kmeansfunction I guess, which looks something like:
kmeansfunction = function(dfgroup){
    Hour = dfgroup$hour
    Weekday = TILPSgroup$WeekdayPrime
    x <- cbind(Hour, Weekday)
    colnames(x) <- c("x", "y")
    (cl <- kmeans(x, 2))
    clusters = cl$cluster
    origclusters = as.factor(clusters)
    dfgroup = cbind(dfgroup, origclusters)
    return(dfgroup)
}
aggregate applies the same function to multiple single columns. If you want to work on ensembles of columns, use this paradigm instead: lapply(split(df, group), function).
Try this:
gr_TILPS <- lapply(split(df, df[,"customer_id"]),
                   FUN = kmeansfunction)
Sounds like Python's groupby/apply has some similarities to the (at the time of writing, experimental) package 'dplyr'. In a sense, aggregate is only a column-oriented processing strategy within blocks, while the lapply(split(df, group), function) strategy is more applicable when you are interested in entire rows of data defined by a blocking criterion. If you later want to row-bind those results back together, you can always use do.call(rbind, res_from_lapply).
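For the comparison the question draws, the pandas counterpart of lapply(split(...), ...) is groupby plus apply. A rough sketch with hypothetical column names:

import pandas as pd

df = pd.DataFrame({"customer_id": [1, 1, 2], "hour": [9, 17, 12]})  # made-up data

# each group arrives as a sub-DataFrame, gets new columns added, and the
# pieces are concatenated back together (the do.call(rbind, ...) step)
def add_columns(group):
    group = group.copy()
    group["hour_rank"] = group["hour"].rank()  # any per-group manipulation
    return group

result = df.groupby("customer_id", group_keys=False).apply(add_columns)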
