How can I apply the results of groupby command to all rows? - python

I have a dataframe as follows:
I want to take the max of the 'Sell_price' column per 'Date' and 'Product_id', without losing the dimension of the dataframe, like this:
Since my data is very big, a 'for' loop is clearly not a practical option.

I think you are looking for transform:
df['Sell_price'] = df.groupby(['Date', 'Product_id'])['Sell_price'].transform('max')
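For a quick check, here is a minimal, self-contained sketch (the data is made up to mirror the question's columns):

import pandas as pd

# Hypothetical data mirroring the question's columns
df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'Product_id': [1, 1, 1],
    'Sell_price': [10, 15, 12],
})

# transform('max') broadcasts each group's max back onto every row,
# so the dataframe keeps its original number of rows
df['Sell_price'] = df.groupby(['Date', 'Product_id'])['Sell_price'].transform('max')
print(df)
# both 2021-01-01 rows now show 15; the 2021-01-02 row keeps 12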

Related

How to combine rows with same id number using pandas in python?

I have a big csv file and I would like to combine rows with the same id#.
For instance, this is what my csv shows right now.
and I would like it to be like this:
How can I do this using pandas?
Try this:
df = df.groupby('id').agg({'name': 'last',
                           'type': 'last',
                           'date': 'last'}).reset_index()
This way you can use a customized function for handling each column (by changing 'last' to your own function).
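As a runnable sketch with made-up data (using ', '.join on the type column just to show a non-'last' aggregation):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2],
                   'name': ['a', 'b', 'c'],
                   'type': ['x', 'y', 'z'],
                   'date': ['2020-01-01', '2020-01-02', '2020-01-03']})

# 'last' keeps the final value per group; each column can get its own function
df = df.groupby('id').agg({'name': 'last',
                           'type': ', '.join,
                           'date': 'last'}).reset_index()
print(df)
# id=1 -> name 'b', type 'x, y', date '2020-01-02'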
You can read the CSV with the pd.read_csv() function and then use GroupBy.last() to aggregate rows with the same id. Something like:
import pandas as pd

df = pd.read_csv('file_name.csv')
df1 = df.groupby('id').last()
You should also pick the aggregation function deliberately instead of just taking the last row's value.

Operate on a list of columns

I have a list of columns from a dataframe
df_date = [df[var1], df[var2]]
I want to convert the data in those columns to datetime type:
for t in df_date:
    pd.DatetimeIndex(t)
For some reason it's not working. I would also like to understand what a more general solution looks like for applying several operations to several columns.
As an alternative, you can do:
for column_name in ["var1", "var2"]:
    df[column_name] = pd.DatetimeIndex(df[column_name])
(The loop in the question has no effect because pd.DatetimeIndex(t) returns a new object rather than modifying the dataframe in place; the result has to be assigned back.)
You can use pandas.to_datetime and pandas.DataFrame.apply to convert a dataframe's entire content to datetime. You can also select just the columns you need and apply it only to them.
df[['column1', 'column2']] = df[['column1', 'column2']].apply(pd.to_datetime)
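A minimal demonstration (column names are assumed; errors='coerce' is optional and turns unparseable values into NaT):

import pandas as pd

df = pd.DataFrame({'column1': ['2021-01-01', '2021-02-01'],
                   'column2': ['2021-03-01', 'not a date']})

# apply runs pd.to_datetime on each selected column
df[['column1', 'column2']] = df[['column1', 'column2']].apply(
    pd.to_datetime, errors='coerce')
print(df.dtypes)  # both columns are now datetime64[ns]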
Note that a list of series and a DataFrame are not the same thing.
A DataFrame is accessed like this:
df[[columns]]
While a list of series looks like this:
[seriesA, seriesB]
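To make the distinction concrete, a small sketch with assumed column names:

import pandas as pd

df = pd.DataFrame({'var1': ['2021-01-01'], 'var2': ['2021-02-01']})

subset = df[['var1', 'var2']]       # a DataFrame (2D selection from df)
pieces = [df['var1'], df['var2']]   # a plain Python list holding two Series

# converting the list items does not touch df; results must be assigned back:
for name in ['var1', 'var2']:
    df[name] = pd.to_datetime(df[name])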

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com',
                  'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])
My dataframe looks something like this:
[image of dataframe]
For the sake of simplicity, I omitted the other columns while making the example. What I need to do is group by the order_num column, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: [image of expected output]
but I do not wish to group by the other columns. Is there any alternative to performing the list operation? I tried using transform, but it threw a size-mismatch error and I don't think it's the right way to go either.
If there are a lot of other columns and you need to group by order_num only, use Series.map to create a new column filled with lists, then remove duplicates with DataFrame.drop_duplicates on the order_num column, and finally sort if necessary:
df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
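Using the question's data, with a hypothetical timestamp column added (the question mentions one that is not in the snippet):

import pandas as pd

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com',
                  'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data)
df['timestamp'] = pd.date_range('2021-01-01', periods=len(df), freq='D')  # assumed

# map broadcasts each order's list onto every row, then duplicates are dropped
df['product_code'] = df['order_num'].map(
    df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
print(df)
# order 123 -> ['rdcf1', '34fgdf'], email kept as-is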

Pandas `groupby.aggregate` on `df.index.duplicated()`

Scenario. Assume a pd.DataFrame, loaded from an external source, where each row is a line from a sensor. The index is a DateTimeIndex with some rows having df.index.duplicated() == True. This actually means there are lines with the same timestamp from different sensors.
Now applying some logic, like df.loc[df.A>0, 'my_col'] = 1, I ran into ValueError: cannot reindex from a duplicate axis. This can be solved by simply removing the duplicated rows using
df[~df.index.duplicated()]
But I wonder if it would be possible to apply a column-based function during the index de-duplication process, e.g. calculating the mean/max/min of columns A/B/C for the duplicated rows. Is this possible? It's something like a groupby.aggregate on the df.index.duplicated() rows.
Check with describe, grouping on the index level:
df.groupby(level=0).describe()
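If you want specific aggregations rather than a describe overview, a sketch along those lines (sensor data made up):

import pandas as pd

idx = pd.DatetimeIndex(['2021-01-01 00:00', '2021-01-01 00:00',
                        '2021-01-01 00:01'])
df = pd.DataFrame({'A': [1.0, 3.0, 2.0], 'B': [5.0, 1.0, 4.0]}, index=idx)

# level=0 groups by the (duplicated) timestamp index; each column gets its own function
dedup = df.groupby(level=0).agg({'A': 'mean', 'B': 'max'})
print(dedup)  # one row per unique timestamp, so .loc assignments work again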

How to do sorting after groupby and aggregation on a Pandas Dataframe

I have a Pandas DataFrame and I'm doing a groupby on two columns with a couple of aggregate functions on a third column. Here is what my code looks like:
df2 = df[[X, Y, Z]].groupby([X, Y]).agg([np.mean, np.max, np.min]).reset_index()
It computes the aggregate functions on the column Z.
I need to sort by, let's say, the min column (i.e. sort_values('min')), but it keeps complaining that the 'min' column does not exist. How can I do that?
Since you are generating a pd.MultiIndex, you must use a tuple in sort_values.
Try:
df2.sort_values(('Z','amin'))
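Putting it together with made-up column names; note that the exact label of the min column ('amin' vs 'min') depends on the pandas version, so check df2.columns first:

import numpy as np
import pandas as pd

df = pd.DataFrame({'X': ['a', 'a', 'b'], 'Y': [1, 1, 2], 'Z': [3.0, 5.0, 4.0]})
df2 = (df[['X', 'Y', 'Z']]
       .groupby(['X', 'Y'])
       .agg([np.mean, np.max, np.min])
       .reset_index())

print(df2.columns.tolist())  # reveals the exact MultiIndex labels
# on older pandas the numpy functions show up as 'amax'/'amin'; newer
# versions normalize them to 'max'/'min' -- adjust the tuple to match
df2 = df2.sort_values(('Z', 'amin'))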
