The picture shows what my dataframe looks like. I have user_name, movie_name and time columns. I want to extract only the rows that fall on a movie's first day. For example, if movie a's first date in the time column is 2018-06-27, I want all the rows from that date, and if movie b's first date in the time column is 2018-06-12, I only want the rows from that date. How would I do that with pandas?
I assume that the time column is of datetime type. If not, convert it by calling pd.to_datetime.
Then run:
df.groupby('movie_name').apply(
    lambda grp: grp[grp.time.dt.date == grp.time.min().date()])
groupby splits the source DataFrame into groups, one per film.
Then grp.time.min().date() computes the minimal (first) date within the current group.
And finally the whole lambda function returns only the rows from that date (again, within the current group).
The same happens for every other group of rows (every other film).
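A quick sanity check on made-up data (the user and movie names below are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'user_name': ['u1', 'u2', 'u3', 'u4'],
    'movie_name': ['a', 'a', 'b', 'b'],
    'time': pd.to_datetime(['2018-06-27 10:00', '2018-06-28 09:00',
                            '2018-06-12 12:00', '2018-06-13 08:00']),
})

# Keeps only u1 (movie a, first day 2018-06-27) and u3 (movie b, first day 2018-06-12).
first_day = df.groupby('movie_name').apply(
    lambda grp: grp[grp.time.dt.date == grp.time.min().date()])
print(first_day)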
I have a dataframe:
The dataframe has three transaction name columns (Transaction_1, Transaction_2 and Transaction_3) and three date columns (Date_1, Date_2 and Date_3), one for each of these transaction name columns. What I need is, firstly, to take the minimum of Date_1, Date_2 and Date_3 and put it in a final_Date column. Secondly, I need another column indicating which of the transaction columns that date comes from. For example, in the first row the minimum date corresponds to the Transaction_2/Date_2 columns, and I want the final_transaction column to indicate that. It is the second part that I am struggling with the most. Could you please suggest a way around this? Please note this is only a sample set; the original data consists of 20-25 million rows.
Thanks!
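One vectorized sketch for this, assuming the column names above and no missing (NaT) dates: argmin gives the position of the row-wise minimum date, which maps straight onto the matching Transaction_* column, and it avoids a row-wise apply, which matters at 20-25 million rows.

import numpy as np
import pandas as pd

date_cols = ['Date_1', 'Date_2', 'Date_3']
txn_cols = ['Transaction_1', 'Transaction_2', 'Transaction_3']

# Make sure the date columns are real datetimes.
df[date_cols] = df[date_cols].apply(pd.to_datetime)

# Row-wise minimum date.
df['final_Date'] = df[date_cols].min(axis=1)

# Position (0, 1 or 2) of the minimum in each row, then pick the
# transaction name from the Transaction_* column at that position.
pos = df[date_cols].to_numpy().argmin(axis=1)
df['final_transaction'] = df[txn_cols].to_numpy()[np.arange(len(df)), pos]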
I have one data frame with start_date and end_date columns (e.g. 01-02-2020). Based on these two dates, a record can be daily (if start and end are one day apart), and similarly quarterly, monthly or yearly.
There is also a Value column (e.g. 3.5).
Now suppose there exists one monthly record with value 2.5, one quarterly record with 4.5, multiple daily records with values like 1.5, and one yearly record with 0.5.
Then, for a single date like 01-01-2020, I need one row that sums those values to give the aggregate (2.5 + 4.5 + 1.5 + 0.5 = 9), so 9 is the total_value on 01-01-2020. Something like below:
There are years of data like this, with multiple records covering the same time period, and I need the aggregated value for each individual date, for all distinct names.
I have been trying to do this in Python with no success so far. Any help is appreciated.
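One sketch for this, assuming columns named name, start_date, end_date and value, and assuming (as in the 2.5 + 4.5 + 1.5 + 0.5 = 9 example) that each record's full value counts on every day it covers: expand every record into one row per day, then sum per name and date.

import pandas as pd

df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)

# One row per calendar day covered by each record.
df['date'] = [pd.date_range(s, e, freq='D')
              for s, e in zip(df['start_date'], df['end_date'])]
daily = df.explode('date')

# Aggregate value per name per date.
total = (daily.groupby(['name', 'date'])['value']
              .sum()
              .reset_index(name='total_value'))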
I have a pandas dataframe and the index column is time with hourly precision. I want to create a new column that compares the value of the column "Sales number" at each hour with the same exact time one week ago.
I know that it can be written using the shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how I can take advantage of the datetime format of the index. I mean, are there any alternatives to shift(7*24) when the index is a datetime index?
Try something like:
df['Sales'].shift(7, freq='D')
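The point of the freq argument is that it shifts the index rather than the rows, so the subtraction aligns on timestamps instead of row positions, and missing hours don't silently misalign the comparison. A minimal sketch on made-up hourly data:

import numpy as np
import pandas as pd

idx = pd.date_range('2023-01-01', periods=24 * 14, freq='h')
df = pd.DataFrame({'Sales': np.arange(len(idx))}, index=idx)

# shift(7, freq='D') moves the index forward 7 days; the subtraction
# then aligns on timestamps, leaving NaN for the first week.
df['compare'] = df['Sales'] - df['Sales'].shift(7, freq='D')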
I have a minute-level dataframe that I grouped by day:
day_grouped_df = minute_df.groupby(pd.Grouper(freq='D'))
Now I want to loop through each group and find the previous group's date inside the loop:
for date, group_row in day_grouped_df:
    # here I want to get the previous group's date
How can we fetch the previous group's date when looping through the grouped rows? Is there any way to get an index, as in normal looping, so that we can do (index - 1)?
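Two minimal sketches for this: either track the previous key in a variable as you go, or materialise the groups into a list so you can index it like a normal sequence.

# Option 1: carry the previous group's date in a variable.
prev_date = None
for date, group_rows in day_grouped_df:
    if prev_date is not None:
        pass  # use prev_date here
    prev_date = date

# Option 2: turn the groups into a list of (date, frame) pairs
# so you can do (index - 1) as in a normal loop.
groups = list(day_grouped_df)
for i, (date, group_rows) in enumerate(groups):
    prev_date = groups[i - 1][0] if i > 0 else None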
I have an Excel data file with thousands of rows and columns.
I am using Python and have started using pandas dataframes to analyze the data.
What I want to do in column D is to calculate the annual change for the values in column C, for each year and each ID.
I can do this in Excel: if the org ID is the same as that in the prior row, calculate the annual change (leaving blank the cells highlighted in blue, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, you can speed things up under the assumption that things are sorted, because it isn't necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you'll need to assign the result to a column or create a new dataframe with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
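On a small hypothetical sample, the first row of each ID comes out as NaN (those are the cells highlighted in blue in the spreadsheet):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Cash': [100.0, 110.0, 99.0, 50.0, 75.0, 75.0],
})

df['AnnChange'] = df.groupby('ID').Cash.pct_change()
# ID 1: NaN, 0.10, -0.10   ID 2: NaN, 0.50, 0.00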