I am trying to filter a dataframe in pandas using the groupby function. The aim is to take the earliest (by date) instance of each variable for each id.
Eventually I was able to solve the problem in R using dplyr, like so:
df_mins <- df %>%
  group_by(id, variable) %>%
  slice(which.min(as.Date(date)))
I also achieved something close using pandas which looked like this:
df.groupby(['id', 'variable'])['date'].transform(min) == df['date']
However, the resulting df had more than one (non-unique) entry per variable. Any ideas what I'm doing wrong?
Since you have duplicates for the min date, filter with the mask and then drop the remaining duplicates:
m = df.groupby(['id', 'variable'])['date'].transform('min') == df['date']
df = df[m].drop_duplicates(['id', 'variable'])
Also, in R we can do:
df=df[order(df$date),]
df=df[!duplicated(df[c('id', 'variable')]),]
Same in pandas:
df = df.sort_values(['date']).drop_duplicates(['id', 'variable'])
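For example, with a made-up toy frame (column names as in the question), both routes collapse tied minimum dates to a single row per (id, variable) pair:
import pandas as pd

# hypothetical toy data: id 1 / variable 'A' has two rows tied on the earliest date
df = pd.DataFrame({
    'id':       [1, 1, 1, 2],
    'variable': ['A', 'A', 'A', 'B'],
    'date':     pd.to_datetime(['2020-01-01', '2020-01-01', '2020-02-01', '2020-03-05']),
    'value':    [10, 11, 12, 13],
})

# the mask alone keeps both tied rows, so drop_duplicates is still needed afterwards
m = df.groupby(['id', 'variable'])['date'].transform('min') == df['date']
df_mins = df[m].drop_duplicates(['id', 'variable'])

# or in one line: earliest row per (id, variable), ties broken by keeping the first
df_mins = df.sort_values('date').drop_duplicates(['id', 'variable'])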
Related
I have a dataframe as follows:
I want to take the max of the 'Sell_price' column according to the 'Date' and 'Product_id' columns, without losing the dimensions of the dataframe, as:
Since my data is very big, using a 'for' loop is clearly not practical.
I think you are looking for transform:
df['Sell_price'] = df.groupby(['Date', 'Product_id'])['Sell_price'].transform('max')
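For illustration, a tiny made-up frame showing that transform('max') keeps every row and simply broadcasts each group's maximum back onto it:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'Product_id': [1, 1, 2],
    'Sell_price': [5, 9, 7],
})
# every row of a (Date, Product_id) group gets that group's maximum price
df['Sell_price'] = df.groupby(['Date', 'Product_id'])['Sell_price'].transform('max')
print(df)  # still 3 rows; the first two rows both show 9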
I am trying to winsorize a data set that contains a few hundred columns of data. I'd like to add a new column to the dataframe containing the winsorized result of each row's data. How can I do this with a pandas dataframe without having to specify each column (I'd like to use all columns)?
Edit: I want to use the function 'winsorize(list, limits=[0.1, 0.1])', but I'm not sure how to turn the dataframe rows into lists.
Some tips:
You may use the pandas function apply with axis=1 to apply a function to every row.
The apply function will receive a pandas Series object, but you can easily convert it to a list using the tolist method.
For example:
df.apply(lambda x: winsorize(x.tolist(), limits=[0.1,0.1]), axis=1)
Alternatively, you can work on the numpy version of your dataframe obtained with to_numpy():
from scipy.stats.mstats import winsorize
ma = winsorize(df.to_numpy(), axis=1, limits=[0.1, 0.1])
out = pd.DataFrame(ma.data, index=df.index, columns=df.columns)
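For example, on a made-up toy frame (limits as in your edit), the most extreme value at each end of every row gets clipped while the shape stays the same:
import pandas as pd
from scipy.stats.mstats import winsorize

df = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
                   [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]],
                  columns=[f'c{i}' for i in range(10)])
# clip the lowest and highest 10% of each row (one value per end with 10 columns)
ma = winsorize(df.to_numpy(), axis=1, limits=[0.1, 0.1])
out = pd.DataFrame(ma.data, index=df.index, columns=df.columns)
print(out)  # same 2x10 shape; e.g. the 100 in the first row becomes 9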
I have a dask dataframe, dfs, with a date column, IR_START_DATE. I'd like to create a new dayofweek column using said date column.
I can achieve this using the following code:
ddf.to_datetime(dfs['IR_START_DATE']).dt.dayofweek.compute()
However, I'm having trouble storing this in its own column.
E.g., I've tried:
Assigning it as a column:
dfs['yeah'] = ddf.to_datetime(dfs['IR_START_DATE']).dt.dayofweek.compute()
Using map_partitions():
def compute_dow(df):
    date_time = ddf.to_datetime(df['IR_START_DATE']).dt
    dow = date_time.dayofweek
    return dow
dow = dfs.map_partitions(compute_dow)
Using map():
dfs['IR_START_DATE'].map(lambda x: ddf.to_datetime(x['IR_START_DATE']).dt.dayofweek, meta = ('time', 'datetime64[ns]')).compute()
Obviously I'm missing some fundamental piece of dask knowledge here; please point me in the right direction!
Your first two methods were very close!
This should work:
dfs['yeah'] = ddf.to_datetime(dfs['IR_START_DATE']).dt.dayofweek
Note the lack of compute(): you do not want to make a pandas dataframe, you want the column to refer back to the original data in the normal, lazy dask way.
For map_partitions, you could have done
def compute_dow(df):
    # each partition arrives here as a plain pandas DataFrame, so use pd.to_datetime
    date_time = pd.to_datetime(df['IR_START_DATE']).dt
    df['dow'] = date_time.dayofweek
    return df
Note that we are passing in a dataframe and getting back a dataframe. Also, it would help when calling map_partitions to provide the meta= argument, to reduce the inference dask needs to make (read the method docs).
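For instance, a rough sketch of supplying meta when the partition function returns just the new Series (the helper name partition_dow and the column name 'dow' are made up here):
import pandas as pd

def partition_dow(df):
    # inside map_partitions each df is a plain pandas DataFrame,
    # so pandas' to_datetime is the right call at this level
    return pd.to_datetime(df['IR_START_DATE']).dt.dayofweek

# meta=('dow', 'int64') tells dask the result is an int64 Series named 'dow'
# (newer pandas returns int32 from dayofweek, so adjust the dtype if dask warns)
dfs['dow'] = dfs.map_partitions(partition_dow, meta=('dow', 'int64'))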
I've got a dataframe that contains several columns, including a user ID (id) and a timestamp (startTime). I want to check how many different days my data (df rows) span, per user.
I'm currently doing that by splitting up the df by 'id', and then calculating the following in a loop for each of the subset dfs:
days = len(df.startTime.dt.date.unique())
How do I do this more efficiently, without splitting up the data frame? I'm working with rather large data frames, and I fear this will take way too much time. I've looked at the groupby function, but I didn't get far. I tried something like:
result = df.groupby('id').agg({'days': lambda x: x.startTime.dt.date.unique()})
... but that clearly didn't work.
You can use drop_duplicates before value_counts:
df['New Date'] = df['startTime'].dt.date
result = df.drop_duplicates(['id', 'New Date'])['id'].value_counts()
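Another option (a hedged sketch, using the column names from your question) is to count distinct dates per id directly with nunique:
# distinct calendar days per user, without splitting the frame or deduplicating
days_per_user = (
    df.assign(day=df['startTime'].dt.date)
      .groupby('id')['day']
      .nunique()
)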
Consider this case:
Python pandas equivalent to R groupby mutate
In dplyr:
df = df %>% group_by(a, b) %>%
means the dataframe is grouped first by column a and then by b.
In my case I am trying to group my data first by the group_name column, then by user_name, then by type_of_work. There are more than three columns (which is why I got confused), but I need the data grouped according to these three headers, in that order. I already have an algorithm to work with the columns after this stage; I only need a way to create a dataframe grouped according to these three columns.
It is important in my case that the sequence is preserved like the dplyr function.
Do we have anything similar for a pandas dataframe?
grouped = df.groupby(['a', 'b'])
Read more on "split-apply-combine" strategy in the pandas docs to see how pandas deals with these issues compared to R.
From your comment it seems you want to assign the grouped frames. You can either use the groupby object through its API, e.g. grouped.mean(), or you can iterate through the groupby object; you will get a name and a group in each loop.
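A minimal sketch of that iteration, with sort=False so the groups come back in the order the key combinations first appear (column names taken from your description):
grouped = df.groupby(['group_name', 'user_name', 'type_of_work'], sort=False)
for name, group in grouped:
    # name is a (group_name, user_name, type_of_work) tuple;
    # group is the sub-DataFrame for that combination, rows in original order
    print(name, len(group))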