Time Diff on vertical dataframe in Python

I have a dataframe, df, that looks like this:
Date Value
10/1/2019 5
10/2/2019 10
10/3/2019 15
10/4/2019 20
10/5/2019 25
10/6/2019 30
10/7/2019 35
I would like to calculate the delta (last value minus first value) over a period of 7 days.
Desired output:
Date Delta
10/1/2019 30
This is what I am doing so far; a user helped me with a variation of the code below:
df = df.assign(Delta=df.iloc[0:, 1].sub(df.iloc[6:, 1]),
               Date=pd.Series(pd.date_range(pd.Timestamp('2019-10-01'),
                                            periods=7, freq='7d')))[['Delta', 'Date']]
Any suggestions are appreciated.

Let us try shift:
s = df.set_index('Date')['Value']  # assumes Date is already datetime64
# shifting the index back 6 days lines each value up with the date 6 days
# earlier; after reindexing, only dates with a partner 6 days ahead are non-NaN
df['New'] = s.shift(freq='-6D').reindex(s.index).values
df['DIFF'] = df['New'] - df['Value']
df
Out[39]:
Date Value New DIFF
0 2019-10-01 5 35.0 30.0
1 2019-10-02 10 NaN NaN
2 2019-10-03 15 NaN NaN
3 2019-10-04 20 NaN NaN
4 2019-10-05 25 NaN NaN
5 2019-10-06 30 NaN NaN
6 2019-10-07 35 NaN NaN
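If the rows are guaranteed to be consecutive daily observations, a purely positional shift gives the same delta without touching the index. A minimal sketch under that assumption:
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2019-10-01', periods=7),
                   'Value': [5, 10, 15, 20, 25, 30, 35]})
# value 6 rows ahead minus the current value
df['Delta'] = df['Value'].shift(-6) - df['Value']
print(df.iloc[0])  # Delta = 30.0; all other rows are NaN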

Related

Combining Rows Based on Column Value

I have a sample similar to the problem I am running into. Here, I have company name and revenue for 3 years. The revenue is given in 3 different datasets. When I concatenate the data, it looks as follows:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 NaN NaN
1 company_2 20.0 NaN NaN
2 company_3 30.0 NaN NaN
3 company_1 NaN 20.0 NaN
4 company_2 NaN 30.0 NaN
5 company_3 NaN 40.0 NaN
6 company_1 NaN NaN 50.0
7 company_2 NaN NaN 60.0
8 company_3 NaN NaN 70.0
9 company_4 NaN NaN 80.0
What I am trying to do is have the company name followed by the actual revenue columns; in a sense, drop the duplicate company_name rows and fold their data into the corresponding company row. My desired output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80
Use melt and pivot_table:
out = (df.melt('company_name').dropna()
         .pivot_table('value', 'company_name', 'variable', fill_value=0)
         .rename_axis(columns=None).reset_index())
print(out)
# Output
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10 20 50
1 company_2 20 30 60
2 company_3 30 40 70
3 company_4 0 0 80
You can try:
df.set_index('company_name').stack().unstack().reset_index()
Or
df.groupby('company_name', as_index=False).first()
Output:
company_name 2020 Revenue 2021 Revenue 2022 Revenue
0 company_1 10.0 20.0 50.0
1 company_2 20.0 30.0 60.0
2 company_3 30.0 40.0 70.0
3 company_4 NaN NaN 80.0
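Note that first() takes the first non-null value per column within each group, which is why the three yearly rows collapse into one. If the zeros from the desired output are wanted instead of NaN, a fillna finishes it off (a small addition to the answer above):
df.groupby('company_name', as_index=False).first().fillna(0)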
I would say concat might not be the join you should be using; instead try a merge: df_merge = pd.merge(df1, df2, how='inner', on='company_name'). Then do the same again with df_merge (your newly merged data) and the next dataframe. This keeps everything in line and only adds the columns the frames do not share. If the frames have more than the two columns you are looking at, you may need a little more data cleaning to get only the results you want, but this should for the most part get you started with your data all in the correct place.
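A runnable sketch of that merge chain, assuming the three yearly frames are named df1, df2 and df3 (hypothetical names) and share the company_name column. Note that how='outer' rather than 'inner' is needed here, because company_4 only appears in the 2022 data:
out = (df1.merge(df2, how='outer', on='company_name')
          .merge(df3, how='outer', on='company_name')
          .fillna(0))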

How to apply a function/impute on an interval in Pandas

I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date         orders
1991-01-01   NaN
1991-02-01   NaN
1991-03-01   24
1991-04-01   NaN
1991-05-01   NaN
1991-06-01   NaN
1991-07-01   NaN
1991-08-01   34
1991-09-01   NaN
1991-10-01   NaN
1991-11-01   22
1991-12-01   NaN
I want to linearly interpolate the values to fill the NaNs. However, the interpolation has to be applied within 6-month blocks (non-rolling). For example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01; within each block I would like forward and backward linear imputation, where any leading or trailing NaNs interpolate toward a value of 0 at the block edge. For the same dataset above, here is how I would like the end result to look:
Date         orders
1991-01-01   8
1991-02-01   16
1991-03-01   24
1991-04-01   18
1991-05-01   12
1991-06-01   6
1991-07-01   17
1991-08-01   34
1991-09-01   30
1991-10-01   26
1991-11-01   22
1991-12-01   11
However, I am lost on how to do this in Pandas. Any ideas?
The idea is to group per 6 months, prepend and append a 0 value to each group, interpolate, and then remove the first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])
# pad each group with 0 at both ends so edge NaNs interpolate toward 0,
# then drop the padding again
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print (df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
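A quick worked check of the lambda on the first 6-month block shows where the 8 and 16 come from:
import numpy as np
import pandas as pd

block = pd.Series([np.nan, np.nan, 24, np.nan, np.nan, np.nan])
padded = pd.Series([0] + block.tolist() + [0])   # [0, nan, nan, 24, nan, nan, nan, 0]
print(padded.interpolate().iloc[1:-1].tolist())
# [8.0, 16.0, 24.0, 18.0, 12.0, 6.0] - climbs to 24 in steps of 8, then falls toward 0 in steps of 6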

Melting Pandas Dataframe and separate the value column based on its data type

Say I have a DataFrame read from CSV which looks roughly like this:
date 1 2 3 4
05-10-2019 20 32 43.5 Auto
06-10-2019 19 Off 54.6 Auto
07-10-2019 Off 45 37 Auto
Each parameter (1, 2, 3, etc.) can have either a float value or a string value. Is there any way I can melt the data so that the value column is split based on each value's data type? When the value is a string, the float column would hold None; when the value is a float, the string column would hold None.
In the end the dataframe would look like this
date parameter value message
05-10-2019 1 20 None
05-10-2019 2 32 None
05-10-2019 3 43.5 None
05-10-2019 4 None Auto
06-10-2019 1 19 None
06-10-2019 2 None Off
06-10-2019 3 54.6 None
................
07-10-2019 4 None Auto
The first step is DataFrame.melt. Then convert the values to numeric with to_numeric and errors='coerce', which creates missing values for the non-numeric entries, so it is possible to build the non-numeric message column with DataFrame.assign and Series.where:
df = df.melt('date', var_name='parameter')
s = pd.to_numeric(df['value'], errors='coerce')  # non-numeric values become NaN
df = df.assign(value=s, message=df['value'].where(s.isna()))
print (df)
date parameter value message
0 05-10-2019 1 20.0 NaN
1 06-10-2019 1 19.0 NaN
2 07-10-2019 1 NaN Off
3 05-10-2019 2 32.0 NaN
4 06-10-2019 2 NaN Off
5 07-10-2019 2 45.0 NaN
6 05-10-2019 3 43.5 NaN
7 06-10-2019 3 54.6 NaN
8 07-10-2019 3 37.0 NaN
9 05-10-2019 4 NaN Auto
10 06-10-2019 4 NaN Auto
11 07-10-2019 4 NaN Auto
If order is important:
df = df.melt('date', var_name='parameter').sort_values(['date', 'parameter'])
s = pd.to_numeric(df['value'], errors='coerce')
df = df.assign(value=s, message=df['value'].where(s.isna()))
print (df)
date parameter value message
0 2019-05-10 1 20.0 NaN
3 2019-05-10 2 32.0 NaN
6 2019-05-10 3 43.5 NaN
9 2019-05-10 4 NaN Auto
1 2019-06-10 1 19.0 NaN
4 2019-06-10 2 NaN Off
7 2019-06-10 3 54.6 NaN
10 2019-06-10 4 NaN Auto
2 2019-07-10 1 NaN Off
5 2019-07-10 2 45.0 NaN
8 2019-07-10 3 37.0 NaN
11 2019-07-10 4 NaN Auto
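If the literal None placeholders from the desired output are wanted instead of NaN, one extra step (a sketch, not part of the answer above) is to cast to object and mask:
df = df.astype(object).where(df.notna(), None)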

Adding extra days for each month in pandas

In a pandas df, I have dates of a given month in the first column and Amount in the second column. How can I add the days that are missing for that month to the first column, and give them the value 0 in the second column?
df = pd.DataFrame({
    'Date': ['5/23/2019', '5/9/2019'],
    'Amount': np.random.choice([10000])
})
I would like the result to look like the following:
Expected Output
Date Amount
0 5/01/2019 0
1 5/02/2019 0
.
.
. 5/23/2019 10000
. 5/24/2019 0
Look at date_range from pandas.
I'm assuming that 5/31/2019 is not in your output, as the comments ask, because you only want the range between the min and max dates?
I convert the Date column to a datetime type, pass the min and max dates to date_range, store that in a dataframe, and then do a left join.
df['Date'] = pd.to_datetime(df['Date'])
# one row per calendar day between the min and max date
date_range = pd.DataFrame(pd.date_range(start=df['Date'].min(), end=df['Date'].max()), columns=['Date'])
final_df = pd.merge(date_range, df, how='left')
Date Amount
0 2019-05-09 10000.0
1 2019-05-10 NaN
2 2019-05-11 NaN
3 2019-05-12 NaN
4 2019-05-13 NaN
5 2019-05-14 NaN
6 2019-05-15 NaN
7 2019-05-16 NaN
8 2019-05-17 NaN
9 2019-05-18 NaN
10 2019-05-19 NaN
11 2019-05-20 NaN
12 2019-05-21 NaN
13 2019-05-22 NaN
14 2019-05-23 10000.0
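The question asked for 0 rather than NaN on the filled-in days, so a final fillna completes it:
final_df['Amount'] = final_df['Amount'].fillna(0)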

pandas group by date, assign value to a column

I have a DataFrame with columns = ['date', 'id', 'value'], where id represents different products. Assume that we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1', ..., 'valueidn'], where each value is assigned to the corresponding date row if it exists, and NaN is assigned if it does not. Many thanks
Assuming you have the following DF:
In [120]: df
Out[120]:
date id value
0 2001-01-01 1 10
1 2001-01-01 2 11
2 2001-01-01 3 12
3 2001-01-02 3 20
4 2001-01-03 1 20
5 2001-01-04 2 30
you can use the pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id 1 2 3
date
2001-01-01 10.0 11.0 12.0
2001-01-02 NaN NaN 20.0
2001-01-03 20.0 NaN NaN
2001-01-04 NaN 30.0 NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id 1 2 3
date
2001-01-01 10 11 12
2001-01-02 0 0 20
2001-01-03 20 0 0
2001-01-04 0 30 0
I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=5),
                   'id': [4, 5, 6, 4, 5],
                   'value': [7, 8, 9, 1, 2]})
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-04 4 1
4 2017-01-05 5 2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-04 1.0 NaN NaN
2017-01-05 NaN 2.0 NaN
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
then it is necessary to use an aggregating function like mean or sum, with groupby or pivot_table:
df = pd.DataFrame({'date': ['2017-01-01', '2017-01-02',
                            '2017-01-03', '2017-01-05', '2017-01-05'],
                   'id': [4, 5, 6, 4, 4],
                   'value': [7, 8, 9, 1, 2]})
df.date = pd.to_datetime(df.date)
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-05 4 1 <- duplicate 2017-01-05, id 4
4 2017-01-05 4 2 <- duplicate 2017-01-05, id 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
# alternative solution (same result as groupby, only slower on a big df)
# df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-05 1.5 NaN NaN <- 1.5 is the mean (1 + 2) / 2
