I'm new to the library and am trying to figure out how to add columns to a pivot table with the mean and standard deviation of the row data for the last three months of transaction data.
Here's the code that sets up the pivot table:
previousThreeMonths = [prev_month_for_analysis, prev_month2_for_analysis, prev_month3_for_analysis]
dfPreviousThreeMonths = df[df['Month'].isin(previousThreeMonths)]
ptHistoricalConsumption = dfPreviousThreeMonths.pivot_table(
    index=['Customer Part #'],
    columns=['Month'],
    aggfunc={'Qty Shp': np.sum}
)
ptHistoricalConsumption['Mean'] = ptHistoricalConsumption.mean(numeric_only=True, axis=1)
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.std(numeric_only=True, axis=1)
ptHistoricalConsumption
The resulting pivot table looks like this:
The problem is that the standard deviation column is including the Mean in its calculations, whereas I just want it to use the raw data for the previous three months. For example, the Std Dev of part number 2225 should be 11.269, not 9.2.
I'm sure there's a better way to do this and I'm just missing something.
One way would be to remove the Mean column temporarily before calling .std():
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.drop('Mean', axis=1).std(numeric_only=True, axis=1)
That wouldn't remove it from the DataFrame permanently; it would just remove it from the copy fed to .std().
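Alternatively, you can take a snapshot of the raw month columns before adding either summary column, so neither statistic can contaminate the other. A minimal sketch, assuming the pivot table holds only the three month columns at that point:
# Copy of the per-month sums, taken before any summary columns exist
monthly = ptHistoricalConsumption.copy()
ptHistoricalConsumption['Mean'] = monthly.mean(numeric_only=True, axis=1)
ptHistoricalConsumption['Std Dev'] = monthly.std(numeric_only=True, axis=1)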
I'm using the "LGBT_Survey_DailyLife.csv" dataset from Kaggle(Link) without the question_code and notes columns.
I want each question (question_label) and country (CountryCode) combination to be on its own line, and to have each column be a combination of group (subset) and response (answer) with the values being those given in the percentage column.
It seems like this should be pretty straightforward, but when I run the following:
daily_life.pivot(index = ['CountryCode', 'question_label'], columns = ['subset', 'answer'], values = 'percentage')
I get this error:
ValueError: Length of passed values is 34020, index implies 2
You have to first clean up the percentage column, as it contains non-numeric values (':' placeholders), and then use pivot_table:
df.percentage = df.percentage.replace(':', 0).astype('float')
df1 = df.pivot_table(values="percentage", index=["CountryCode", "question_label"], columns=["subset", "answer"])
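For instance, on a toy frame standing in for the CSV (the values here are made up purely for illustration):
import pandas as pd

df = pd.DataFrame({
    'CountryCode':    ['AT', 'AT', 'BE', 'BE'],
    'question_label': ['q1', 'q1', 'q1', 'q1'],
    'subset':         ['Lesbian', 'Gay', 'Lesbian', 'Gay'],
    'answer':         ['Yes', 'Yes', 'Yes', 'Yes'],
    'percentage':     ['60', ':', '55', '40'],   # ':' marks missing data
})
df.percentage = df.percentage.replace(':', 0).astype('float')
df1 = df.pivot_table(values='percentage',
                     index=['CountryCode', 'question_label'],
                     columns=['subset', 'answer'])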
Basically, I am trying to take the previous row for the combination of ['dealer', 'State', 'city']. If I have multiple values in this combination, I will get the shifted value of this combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am taking this ShiftBY_D_S_C column again and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The table below shows what I am trying to do, and it works well. But when all the rows in the ShiftBY_D_S_C column are null, it does not work, since the column has nothing but null values. Any suggestions?
I want the NewColumn values to look like the ones below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C']
                         .transform('count')) + 1
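A branch-free alternative (my suggestion, not part of the original answer): rows whose ShiftBY_D_S_C key is NaN are dropped from the groups, so transform returns NaN for them, and those NaNs can be filled with 0 before adding 1:
df['NewColumn'] = (
    df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C']
      .transform('count')     # NaN for rows whose group key is NaN
      .fillna(0)
      .add(1)
      .astype(int)
)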
I'm wondering if there is a way to add columns based on the common field in pandas.
This is my original dataset
load mapping freq 99th energy
61 175.0k 5CN0-5CN1 1.20GHz 0.937662 18952.056063
19 175.0k 5CN0-5CN1 2.10GHz 0.391280 19051.052048
I want to add the columns 99th-1.20GHz, energy-1.20GHz, 99th-2.10GHz, and energy-2.10GHz, based on the presumption that load and mapping are the same.
This is the desired output
load   mapping    99th-1.20GHz  99th-2.10GHz  energy-1.20GHz  energy-2.10GHz
175.0k 5CN0-5CN1  0.937662      0.39128       18952.05606     19051.05205
I suggest you use MultiIndex columns for this, e.g. via pd.pivot_table. You can flatten the columns as a separate step (sketched below the result), although your data will lose structure.
res = pd.pivot_table(df, index=['load', 'mapping'], columns='freq',
values=['99th', 'energy'], aggfunc='first')
Result:
99th energy
freq 1.20GHz 2.10GHz 1.20GHz 2.10GHz
load mapping
175.0k 5CN0-5CN1 0.937662 0.39128 18952.056063 19051.052048
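If you do want the flat column names from the desired output, one way is to join each (stat, freq) pair afterwards; a sketch:
res.columns = ['-'.join(col) for col in res.columns]   # e.g. '99th-1.20GHz'
res = res.reset_index()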
I want to use aggregate to apply some manipulations to a set of matrices, grouped by the customer_id, which is one column of my dataframe, df.
For example, I want to take the subsets of df that correspond to different customer_id's and add some columns to these subsets, and return them all.
In Python, I would use groupby and apply.
How can I do this in R?
The code I wrote looks like:
gr_TILPS = aggregate(df,by=list(df[,"customer_id"]),FUN=kmeansfunction)
Error in TILPSgroup$hour : $ operator is invalid for atomic vectors
The error is coming from the kmeansfunction I guess, which looks something like:
kmeansfunction = function(dfgroup) {
    Hour = dfgroup$hour
    Weekday = TILPSgroup$WeekdayPrime
    x <- cbind(Hour, Weekday)
    colnames(x) <- c("x", "y")
    cl <- kmeans(x, 2)
    clusters = cl$cluster
    origclusters = as.factor(clusters)
    dfgroup = cbind(dfgroup, origclusters)
    return(dfgroup)
}
aggregate applies the same function to multiple single columns. If you want to work on ensembles of columns, then use this paradigm: lapply(split(df, group), function).
Try this:
gr_TILPS <- lapply( split(df, df[,"customer_id"]),
FUN=kmeansfunction)
Sounds like Python might have some similarities to the (then experimental) package 'dplyr'. In a sense, aggregate is only a column-oriented processing strategy within blocks, while the lapply(split(...), ...) strategy is more applicable when you are interested in entire rows of data, defined by a blocking criterion. If you later want to row-bind those results back together, you can always use do.call(rbind, res_from_lapply).
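For reference, the pandas idiom the asker alludes to; a sketch, where kmeansfunction is a hypothetical Python port of the R function above:
# groupby + apply hands each customer's sub-frame to the function and
# concatenates the returned pieces, much like do.call(rbind, lapply(split(...), ...))
gr_TILPS = df.groupby('customer_id', group_keys=False).apply(kmeansfunction)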
Hi, I have a result set from psycopg2 like so:
(
(timestamp1, val11, val12, val13, val14),
(timestamp2, val21, val22, val23, val24),
(timestamp3, val31, val32, val33, val34),
(timestamp4, val41, val42, val43, val44),
)
I have to return the difference between the values of consecutive rows (except for the timestamp column); each row subtracts the previous row's values.
The first row would be
timestamp, 'NaN', 'NaN' ....
This has to then be returned as a generic object
I.e. something like an array of the following objects:
Group(timestamp=timestamp, rows=[val11, val12, val13, val14])
I was going to use Pandas to do the diff.
Something like below works ok on the values
df = DataFrame.from_records(data=results, columns=headers)
diffs = df.set_index('time', drop=False).diff()
But diff also operates on the timestamp column, and I can't get it to ignore a column while leaving the original timestamp column in place.
Also, I wasn't sure it would be efficient to get the data into my return format, since pandas advises against row-wise access.
What would be a fast way to get the result set differences in my required output format?
Why did you set drop=False? That puts the timestamps in the index (where they will not be touched by diff) but also leaves a copy of the timestamps as a proper column, to be processed by diff.
I think this will do what you want:
diffs = df.set_index('time').diff().reset_index()
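From there, converting to the requested objects can avoid slow row access by using itertuples rather than iterrows; a sketch, where Group is the asker's own class:
# After reset_index() the 'time' column comes first, so each tuple
# unpacks into the timestamp followed by the value columns.
groups = [Group(timestamp=ts, rows=list(vals))
          for ts, *vals in diffs.itertuples(index=False, name=None)]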
Since you mention psycopg2, take a look at the docs for pandas 0.14, released just a few days ago, which features improved SQL functionality, including new support for PostgreSQL. You can read and write directly between the database and pandas DataFrames.
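For example, a sketch assuming a psycopg2 connection named conn and a hypothetical readings table:
import pandas as pd

df = pd.read_sql_query('SELECT * FROM readings ORDER BY time', conn)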