Custom function applied to dataframe, based on value in id column - python

I've got a dataframe that contains several columns, including a user ID (id) and a timestamp (startTime). I want to check how many different days my data (df rows) span, per user.
I'm currently doing that by splitting up the df by 'id', and then calculating the following in a loop for each of the subset dfs:
days = len(df.startTime.dt.date.unique())
How do I do this more efficiently, without splitting up the data frame? I'm working with rather large data frames, and I fear this will take way too much time. I've looked at the groupby function, but I didn't get far. I tried something like:
result = df.groupby('id').agg({'days': lambda x: x.startTime.dt.date.unique()})
... but that clearly didn't work.

You can use drop_duplicates before value_counts:
df['New Date'] = df['startTime'].dt.date
result = df.drop_duplicates(['id', 'New Date'])['id'].value_counts()
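If you prefer to stay inside groupby, here is a minimal sketch (assuming startTime is already a datetime64 column; the sample rows are made up):
import pandas as pd

# Made-up sample with the question's 'id' and 'startTime' columns.
df = pd.DataFrame({'id': [1, 1, 1, 2],
                   'startTime': pd.to_datetime(['2020-01-01 08:00', '2020-01-01 17:30',
                                                '2020-01-02 09:00', '2020-01-05 12:00'])})

# Normalize each timestamp to midnight, then count unique dates per id.
days = df['startTime'].dt.normalize().groupby(df['id']).nunique()
print(days)  # id 1 -> 2 days, id 2 -> 1 day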

Related

How can I find the most common value from a column based on another column's value in python?

I have a dataframe (DF) that has the following columns: UserID, Country, Arrival_Year, Airport_Code.
Each country is listed several times based on UserID.
I want to know the most common arrival year for each country listed in the data frame. How can I calculate that?
I tried using value_counts but am not getting the right answer.
Try using pandas groupby with mode:
df.groupby('Country')['Arrival_Year'].apply(lambda x: x.mode()[0])
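A quick self-contained check of that pattern (the rows are invented; note that mode() can return several values on ties, and [0] keeps the first):
import pandas as pd

df = pd.DataFrame({'UserID': [1, 2, 3, 4, 5],
                   'Country': ['FR', 'FR', 'FR', 'DE', 'DE'],
                   'Arrival_Year': [2019, 2019, 2020, 2021, 2021],
                   'Airport_Code': ['CDG', 'ORY', 'CDG', 'FRA', 'TXL']})

# Most common arrival year per country.
print(df.groupby('Country')['Arrival_Year'].apply(lambda x: x.mode()[0]))
# DE -> 2021, FR -> 2019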

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com', 'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])
My data frame looks something like this:
[image of the data frame]
For the sake of simplicity, I omitted the other columns while making the example. What I need to do is group by the column called order_num, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: [image of the expected output]
but I do not wish to group by the other columns. Is there an alternative way to perform the list operation? I tried using transform, but it threw a size-mismatch error, and I don't think it's the right way to go either.
If there are a lot of other columns and you need to group by order_num only, use Series.map to fill a new column with the lists, then remove duplicates with DataFrame.drop_duplicates by column order_num, and sort at the end if necessary:
df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
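Run end to end on the sample above, that looks like the following sketch (the timestamp values are invented here, since the sample omits that column):
import pandas as pd

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com',
                  'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data)
# Invented timestamps so the final sort has something to work on.
df['timestamp'] = pd.date_range('2021-01-01', periods=len(df), freq='h')

# Map each order_num to the full list of its product codes.
df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
print(df)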

Groupby in pandas returning too many rows

I am trying to filter a dataframe in pandas, using the groupby function. The aim is to take the earliest (by date) instance of each variable for each id.
Eventually I was able to solve the problem in R using dplyr, like so:
df_mins <- df %>%
  group_by(id, variable) %>%
  slice(which.min(as.Date(date)))
I also achieved something close using pandas which looked like this:
df.groupby(['id', 'variable'])['date'].transform(min) == df['date']
However, the resulting df had more than one (non-unique) entry per variable. Any ideas what I'm doing wrong?
That happens because you have duplicate rows for the min date; keep the boolean mask, then drop the duplicates:
m=df.groupby(['id', 'variable'])['date'].transform(min) == df['date']
df=df[m].drop_duplicates(['id', 'variable'])
Also, in R we can do:
df=df[order(df$date),]
df=df[!duplicated(df[c('id', 'variable')]),]
The same in pandas:
df=df.sort_values(['date']).drop_duplicates(['id', 'variable'])
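Another option under the same tie-breaking rule (keep the first row per group) is groupby with idxmin; a minimal sketch on made-up data with a duplicated minimum date shows it returns exactly one row per group:
import pandas as pd

# Made-up data: id=1 / variable='a' has its minimum date twice.
df = pd.DataFrame({'id': [1, 1, 1, 2],
                   'variable': ['a', 'a', 'a', 'b'],
                   'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-02-01', '2020-03-01']),
                   'value': [10, 11, 12, 13]})

# idxmin returns the label of the first row holding each group's minimum.
df_mins = df.loc[df.groupby(['id', 'variable'])['date'].idxmin()]
print(df_mins)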

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, does anyone have a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
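A quick check with throwaway data (the column name 'maximum' is arbitrary):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5], 'b': [4, 2], 'c': [3, 9], 'd': [0, 7]})

# to_frame turns the row-wise max Series into a named one-column DataFrame.
df2 = df1.max(axis=1).to_frame('maximum')
print(df2['maximum'].dtype)  # the column is now addressable by name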

Pandas Column of Lists to Separate Rows

I've got a dataframe that contains analysed news articles, with each row referencing an article and columns holding some information about that article (e.g. tone).
One column of that df contains a list of FIPS country codes of the locations that were mentioned in that article.
I want to "extract" these country codes such that I get a dataframe in which each mentioned location has its own row, along with the other columns of the original row in which that location was referenced (there will be multiple rows with the same information, but different locations, as the same article may mention multiple locations).
I tried something like this, but iterrows() is notoriously slow, so is there any faster/more efficient way for me to do this?
Thanks a lot.
'events' is the column that contains the locations
'event_cols' are the columns from the original df that I want to retain in the new df.
'df_events' is the new data frame
for i, row in df.iterrows():
    for location in df.events.loc[i]:
        try:
            df_storage = pd.DataFrame(row[event_cols]).T
            df_storage['loc'] = location
            df_events = df_events.append(df_storage)
        except ValueError as e:
            continue
I would group the DataFrame with groupby(), explode the lists with a combination of apply and a lambda function, and then reset the index and drop the level column that is created to clean up the resulting DataFrame.
df_events = df.groupby(['event_col1', 'event_col2', 'event_col3'])['events'] \
    .apply(lambda x: pd.DataFrame(x.values[0])) \
    .reset_index().drop('level_3', axis=1)
In general, I always try to find a way to use apply() before most other methods, because it is often much faster than iterating over each row.
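For reference, on pandas 0.25 or newer, DataFrame.explode does this in one call; a minimal sketch with made-up values ('tone' stands in for the retained per-article columns):
import pandas as pd

# Made-up frame: 'events' holds lists of FIPS codes per article.
df = pd.DataFrame({'tone': [0.5, -0.2],
                   'events': [['US', 'FR'], ['DE']]})

# explode repeats the other columns once per list element.
df_events = df.explode('events').rename(columns={'events': 'loc'})
print(df_events)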
