How can I add 'duration' column in the given DataFrame

I am pretty new to Python and am working on a project of my own, so I need a little help understanding a few things.
I have a DataFrame that contains Netflix data.
I need to find the sum of the Duration column for each Profile Name, i.e. I want to know who watches Netflix the most.
How can I add up the Duration column? I don't understand the to_timedelta function.

You can use a combination of to_timedelta and GroupBy.sum:
out = (pd.to_timedelta(df['Duration'])      # convert strings to timedelta
         .groupby(df['Profile Name']).sum() # sum per profile
         .sort_values(ascending=False)      # sort by total duration
      )
print(out)
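For anyone unsure what to_timedelta does here, this is a minimal sketch on made-up data. The profile names and the "HH:MM:SS" duration strings are assumptions for illustration; the real Netflix export may be formatted differently.

```python
import pandas as pd

# Hypothetical sample data; the real Netflix export may differ.
df = pd.DataFrame({
    "Profile Name": ["Ann", "Bob", "Ann", "Bob", "Ann"],
    "Duration": ["00:30:00", "01:15:00", "00:45:30", "00:05:00", "02:00:00"],
})

# Parse "HH:MM:SS" strings into Timedelta values so they can be added up.
durations = pd.to_timedelta(df["Duration"])

# Sum per profile, largest total first.
totals = durations.groupby(df["Profile Name"]).sum().sort_values(ascending=False)
print(totals)
```

Strings cannot be summed as durations, which is why the conversion comes first; after that the column behaves like any numeric column in a groupby.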

Related

pandas computing new column as a average of other two conditions

I have a dataset of temperatures. Each row describes the temperature in Celsius measured at a given hour of a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. In this dataset the city is the column estacao and the month is the column mes.
I'm trying to do this with pandas. This is the line of code I'm trying to use:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal is to store the average temperature for each city and month in a new column, but it doesn't work. If I try the following line instead:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
it runs, but the result is wrong: it averages over every city in the dataset, which adds noise to my data. I need to group the temperatures by both month and city and then calculate the mean.

The result of a groupby aggregation has fewer rows than the original DataFrame, which is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform:
df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new DataFrame dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
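To compare the two approaches side by side, here is a small sketch. The sample values and the output column names avg_transform and avg_merge are made up; only mes, estacao and temp_ar come from the question.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "estacao": ["A", "A", "A", "B", "B", "B"],
    "mes": [1, 1, 2, 1, 1, 2],
    "temp_ar": [20.0, 22.0, 25.0, 10.0, 14.0, 18.0],
})

# Option 1: transform returns one value per original row.
df["avg_transform"] = df.groupby(["mes", "estacao"])["temp_ar"].transform("mean")

# Option 2: aggregate to one row per group, then merge back.
dfn = df.groupby(["mes", "estacao"])["temp_ar"].mean().reset_index(name="avg_merge")
df = pd.merge(df, dfn, on=["mes", "estacao"], how="left")
print(df)
```

Both columns should hold the same values; transform is shorter, while the merge route is handy if you also want the aggregated table on its own.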

You are calling groupby on a single column when you write df2['temp_ar'].groupby(...), and df2['mes', 'estacao'] is not a valid way to select two columns (that would be df2[['mes', 'estacao']]).
Instead, perform the groupby on the DataFrame with the columns you need, and make sure the result is a Series aligned with the original rows rather than a smaller aggregated DataFrame:
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french,
use the first column as an index
(this contains the year and month of the data as a string).
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’ to
contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard
Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m,s_m) the monthly mean and standard
deviation of a return series and returns a tuple (r_a,s_a), the annualised
mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 -1, and
s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and
standard deviation of the new ‘Mkt’ column, storing each in the newly
created DataFrame. Note that the values in the input file are % returns, and
need to be divided by 100 to return decimals (i.e the value for August 2022
represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
# read the file once, using the first column as the index
ff_monthly = pd.read_csv(r"file path", index_col=0)
# market return = excess return + risk-free rate
Mkt = ff_monthly['Mkt-RF'] + ff_monthly['RF']
ff_monthly = ff_monthly.assign(Mkt=Mkt)
df = pd.DataFrame(ff_monthly)

There are a few things to pay attention to.
The Date is the index of your DataFrame. The index is treated differently from normal columns, which is why df.Date raises an AttributeError: Date is not an attribute but the index. Use df.index instead.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and it also contains the day, so this cannot work.
In fact, the format you have doesn't follow any standard. The best way to deal with it is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for the supported format strings.
Once this works, creating the columns is straightforward:
df['year'] = df.index.year
In fact, this part has been asked before
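For the later steps of the problem set, a rough sketch could look like this. It uses the formulae from the question, r_a = (1+r_m)^12 - 1 and s_a = s_m * 12^0.5. The sample rows, the function name annualise, the variable names ff and stats, and the "YYYYMM" index format are all assumptions; adjust the format string to whatever the real file contains.

```python
import pandas as pd

def annualise(r_m, s_m):
    """Return (r_a, s_a) from a monthly mean and standard deviation."""
    r_a = (1 + r_m) ** 12 - 1
    s_a = s_m * 12 ** 0.5
    return r_a, s_a

# Hypothetical data: index strings are assumed to look like "YYYYMM".
ff = pd.DataFrame(
    {"Mkt-RF": [1.0, -2.0, 3.0, 0.5], "RF": [0.1, 0.1, 0.2, 0.2]},
    index=["202111", "202112", "202201", "202202"],
)
ff["Mkt"] = ff["Mkt-RF"] + ff["RF"]

# Parse the index, then read year and month straight from the DatetimeIndex.
ff.index = pd.to_datetime(ff.index, format="%Y%m")
ff["Year"] = ff.index.year
ff["Month"] = ff.index.month

# One row per year; values are % returns, so divide by 100 first.
years = sorted(ff["Year"].unique())
stats = pd.DataFrame(index=years, columns=["Mean", "Standard Deviation"], dtype=float)
for year in years:
    monthly = ff.loc[ff["Year"] == year, "Mkt"] / 100
    r_a, s_a = annualise(monthly.mean(), monthly.std())
    stats.loc[year, "Mean"] = r_a
    stats.loc[year, "Standard Deviation"] = s_a
print(stats)
```

Writing the result out would then be a call to stats.to_csv with a path of your choice.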

Finding duration with time stamp in rows & sum of column elements upto certain order_id in rows (python)

Here's my data:
https://docs.google.com/spreadsheets/d/1Nyvx2GXUFLxrJdRTIKNAqIVvGP7FyiQ9NrjKiHoX3kE/edit?usp=sharing
Dataset
It's a small part of a dataset with hundreds of order_ids.
I want to find the duration in the #timestamp column for each order_id. For example, for order_id 3300400 the duration runs from index 6 to index 0, and similarly for all other order IDs.
I also want the sum of items.quantity and items.price for each order ID. For example, for order_id 3300400, the sum of items.quantity = 2 and the sum of items.price = 499 + 549 = 1048, and similarly for the other order_ids.
I am new to Python, but I think this will need loops. Any help will be highly appreciated.
Thanks and Regards,
Shantanu Jain

You have figured out how to use the groupby() method, which is good. Working out the difference in timestamps takes a little more work.
# Function to get the first and last rows within a group
def get_index(df):
    return df.iloc[[0, -1]]
# apply the function, then use the diff method on ['#timestamp']
df['time_diff'] = df.groupby('order_id').apply(get_index)['#timestamp'].diff()
I haven't tested any of this code, and it will only work if your timestamps are pandas Timestamp objects. It should at least give you an idea of where to start.
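As an alternative that avoids both loops and apply, one possible sketch uses named aggregation to get the per-order duration and sums in one pass. The rows below are invented to mirror the question's example for order_id 3300400, and the output column names are assumptions.

```python
import pandas as pd

# Hypothetical rows modelled on the question's column names.
df = pd.DataFrame({
    "order_id": [3300400, 3300400, 3300400, 3300401, 3300401],
    "#timestamp": pd.to_datetime([
        "2021-01-01 10:00:00", "2021-01-01 10:05:00", "2021-01-01 10:12:00",
        "2021-01-01 11:00:00", "2021-01-01 11:03:00",
    ]),
    "items.quantity": [1, 1, 0, 2, 0],
    "items.price": [499, 549, 0, 100, 0],
})

# Earliest and latest timestamp plus the two sums, one row per order.
summary = df.groupby("order_id").agg(
    first_seen=("#timestamp", "min"),
    last_seen=("#timestamp", "max"),
    total_quantity=("items.quantity", "sum"),
    total_price=("items.price", "sum"),
)

# Duration is the span between the earliest and latest timestamp.
summary["duration"] = summary["last_seen"] - summary["first_seen"]
print(summary)
```

This assumes the #timestamp column has already been converted with pd.to_datetime; if it is still text, convert it first.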

Pandas - select lowest value to date

I'm new to Pandas.
I've got a DataFrame where I want to group by user and then find each user's lowest score up to that date in their speed column.
I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the minimum of all values, not just from the first row to the current one.
What can I use to get what I need?

Without seeing your dataset it's hard to help you directly, but the problem boils down to the following: you need to select the range of data you want to work with (rows for the date range and columns for the user and speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['users', 'speed']]
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't worked with DataFrames for a while, but what you're looking for is how to slice DataFrames.
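Another way to read "lowest score up to that date" is as a running minimum per user, which pandas offers as cummin on a groupby. A minimal sketch, assuming the rows can be ordered by a date column (the date column, the sample values and the name lowest_so_far are made up):

```python
import pandas as pd

# Hypothetical data; assumes rows can be ordered by a 'date' column.
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"] * 2),
    "speed": [5, 3, 4, 7, 8, 6],
})

# Order each user's rows in time, then take the running minimum per user.
df = df.sort_values(["user", "date"])
df["lowest_so_far"] = df.groupby("user")["speed"].cummin()
print(df)
```

Unlike transform('min'), each row only sees its own value and the earlier rows of the same user.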

Using a function to do a %change on a dataset

I am using Python with pandas and am running into an issue. I have a dataset with countries as the columns and dates (months) as the rows. The data consists of the population of an item.
I need to calculate the % change in population month by month. Is there a function I can use to get a dataset with the month-by-month % change in the format attached?
I am trying to apply a function to the dataset, but getting the function to retrieve the previous month's population for the % change is the problem.
Does anyone have a good idea of how to get this done? Thanks

You can use pct_change:
df.pct_change()

First order the data by month (if it isn't already), and then use the .shift() method of pandas DataFrames:
df['pct_change'] = (df.US - df.US.shift(1)) / df.US.shift(1)
.shift() lets you shift rows up or down depending on the argument.
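To see what pct_change returns, here is a short sketch on invented numbers. It computes (current - previous) / previous for every column at once; the country columns, dates and the names changes and manual are assumptions for illustration.

```python
import pandas as pd

# Hypothetical monthly data with countries as columns.
df = pd.DataFrame(
    {"US": [100.0, 110.0, 99.0], "UK": [50.0, 55.0, 66.0]},
    index=pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
)

# Fractional change from the previous row, for every column.
changes = df.pct_change()

# The same calculation written out with shift, for comparison.
manual = (df - df.shift(1)) / df.shift(1)

# Multiply by 100 to show percentages.
print(changes * 100)
```

The first row is NaN because there is no previous month to compare against.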
