How to create a column of ceil dates in pandas - python

I want to add a column that is the end-of-the-month date to a pandas dataframe. Based on this answer, I tried the following:
import numpy as np
import pandas as pd
dates = ['2014-06-02', '2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06']
sp500_index = [1924.969971, 1924.23999, 1927.880005, 1940.459961, 1949.439941]
df_sp500 = pd.DataFrame({'Date' : dates, 'Close' : sp500_index})
sp500['Date'] = pd.to_datetime(sp500['Date'], format='%Y-%m-%d')
df_sp500['EOM'] = df_sp500['Date'].dt.ceil('M') # breaks on this line
#df_sp500 = df_sp500[df_sp500['Date'] == df_sp500['EOM']]
df_sp500
but I get this error message:
AttributeError: Can only use .dt accessor with datetimelike values
The reason I want to add this column is to use it to filter out all but the EOM dates, as shown in the commented-out line.

Use the MonthEnd offset from pandas.tseries.offsets; MonthEnd(0) rolls each date forward to the last day of its month:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
dates = ['2014-06-02', '2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06']
sp500_index = [1924.969971, 1924.23999, 1927.880005, 1940.459961, 1949.439941]
df_sp500 = pd.DataFrame({'Date' : dates, 'Close' : sp500_index})
df_sp500['EOM'] = pd.to_datetime(df_sp500['Date'], format='%Y-%m-%d') + MonthEnd(0)
# df_sp500['EOM'] = df_sp500['EOM'].dt.day  # add this if you want only the day
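With the EOM column in place, the question's commented-out filter works once Date itself is also converted to datetime (a sketch; note that with the five early-June sample dates the result is empty, since none of them equals 2014-06-30):
df_sp500['Date'] = pd.to_datetime(df_sp500['Date'], format='%Y-%m-%d')
df_sp500 = df_sp500[df_sp500['Date'] == df_sp500['EOM']]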

This is already built into pandas as pandas.Series.dt.is_month_end. Instead of calculating a new column, just subset with:
df_sp500[df_sp500.Date.dt.is_month_end]
Input Data
dates = ['2014-06-02', '2014-06-03', '2014-06-04', '2014-06-05', '2014-06-06']
sp500_index = [1924.969971, 1924.23999, 1927.880005, 1940.459961, 1949.439941]
df_sp500 = pd.DataFrame({'Date' : dates, 'Close' : sp500_index})
df_sp500['Date'] = pd.to_datetime(df_sp500['Date'], format='%Y-%m-%d')
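Note that with this particular sample none of the five June dates is a month end, so the subset above comes back empty. A minimal sketch with a hypothetical 2014-06-30 row appended (the Close value is made up purely for illustration) shows the filter keeping only that row:
extra = pd.DataFrame({'Date': pd.to_datetime(['2014-06-30']), 'Close': [1960.23]})  # hypothetical row
df_sp500 = pd.concat([df_sp500, extra], ignore_index=True)
df_sp500[df_sp500.Date.dt.is_month_end]  # keeps only the 2014-06-30 row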

Based on the documentation for Series.dt.ceil:
The frequency level to ceil the index to. Must be a fixed frequency
like ‘S’ (second), not ‘ME’ (month end)
So ceil cannot take a month frequency; we can use MonthBegin (or MonthEnd) for your case:
df_sp500['Date'] - pd.offsets.MonthBegin(1)  # or pd.offsets.MonthEnd(1)
0 2014-06-01
1 2014-06-01
2 2014-06-01
3 2014-06-01
4 2014-06-01
Name: Date, dtype: datetime64[ns]
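For the question's actual goal (the last day of the month), the MonthEnd offset works the same way. A small sketch of the difference between MonthEnd(0) and MonthEnd(1), assuming Date is already datetime as in the input data above:
df_sp500['Date'] + pd.offsets.MonthEnd(0)  # 2014-06-02 -> 2014-06-30; a date already at month end stays put
df_sp500['Date'] + pd.offsets.MonthEnd(1)  # 2014-06-02 -> 2014-06-30; but 2014-06-30 -> 2014-07-31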

Related

How to sum a column based on another column's value in Excel

I would like to ask how to do this sum using Python or Excel:
a summation of the "number" column based on the "time" column.
The sum of the Duration for (00:00 am - 00:59 am) is (2+4) 6.
The sum of the Duration for (02:00 am - 02:59 am) is (3+1) 4.
Could you please advise how to do this?
When you have a dataframe you can use groupby to accomplish this:
# import pandas module
import pandas as pd
# Create a dictionary with the values
data = {
'time' : ["12:20:51", "12:40:51", "2:26:35", "2:37:35"],
'number' : [2, 4, 3, 1]}
# create a pandas DataFrame from the dictionary...
df = pd.DataFrame(data)
# ...or load it from a CSV instead (commented out so it does not overwrite the frame above)
# df = pd.read_csv('path/dir/filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
# add values by hour
dff = df.groupby(df['time'].dt.hour)['number'].sum()
print(dff.head(50))
output:
time
2     4
12    6
Name: number, dtype: int64
When you need to group by more than one column, you can pass the columns as a list inside .groupby(). The code will look like this:
import pandas as pd
df = pd.read_csv('filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
df['date'] = df['date'].apply(pd.to_datetime, format='%d/%m/%Y')
# add values by hour
dff = df.groupby([df['date'], df['time'].dt.hour])['number'].sum()
print(dff.head(50))
# save the result to a new file (so the input CSV is not overwritten)
dff.to_csv("filename_sum.csv")
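As a side note (not part of the original answer), pd.Grouper is an equivalent spelling for time-based grouping once the column is a datetime. Unlike .dt.hour it keeps full hourly timestamps as the group keys, and it emits a row for every hour in the span, with 0 where no data falls:
dff = df.groupby(pd.Grouper(key='time', freq='H'))['number'].sum()  # newer pandas prefers freq='h'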

Python - calculating difference between price extracting time

I need to create a new column whose value is:
the current fair_price minus the fair_price 15 minutes ago (or in the closest earlier row).
I need to find the row from 15 minutes before, then calculate the diff.
import numpy as np
import pandas as pd
from datetime import timedelta
df = pd.DataFrame(pd.read_csv('./data.csv'))
def calculate_15min(row):
    end_date = pd.to_datetime(row['date']) - timedelta(minutes=15)
    mask = (pd.to_datetime(df['date']) <= end_date).head(1)
    price_before = df.loc[mask]
    return price_before['fair_price']

def calc_new_val(row):
    return 'show date 15 minutes before, maybe it will be null, nope'

df['15_min_ago'] = df.apply(lambda row: calculate_15min(row), axis=1)
myFields = ['pkey_id', 'date', '15_min_ago', 'fair_price']
print(df[myFields].head(5))
df[myFields].head(5).to_csv('output.csv', index=False)
I did it in Node.js, but Python is not my forte; maybe you have a fast solution...
pkey_id,date,fair_price,15_min_ago
465620,2021-05-17 12:28:30,45080.23,fair_price_15_min_before
465625,2021-05-17 12:28:35,45060.17,fair_price_15_min_before
465629,2021-05-17 12:28:40,45052.74,fair_price_15_min_before
465633,2021-05-17 12:28:45,45043.89,fair_price_15_min_before
465636,2021-05-17 12:28:50,45040.93,fair_price_15_min_before
465640,2021-05-17 12:28:56,45049.95,fair_price_15_min_before
465643,2021-05-17 12:29:00,45045.38,fair_price_15_min_before
465646,2021-05-17 12:29:05,45039.87,fair_price_15_min_before
465650,2021-05-17 12:29:10,45045.55,fair_price_15_min_before
465652,2021-05-17 12:29:15,45042.53,fair_price_15_min_before
465653,2021-05-17 12:29:20,45039.34,fair_price_15_min_before
466377,2021-05-17 12:42:50,45142.74,fair_price_15_min_before
466380,2021-05-17 12:42:55,45143.24,fair_price_15_min_before
466393,2021-05-17 12:43:00,45130.98,fair_price_15_min_before
466398,2021-05-17 12:43:05,45128.13,fair_price_15_min_before
466400,2021-05-17 12:43:10,45140.9,fair_price_15_min_before
466401,2021-05-17 12:43:15,45136.38,fair_price_15_min_before
466404,2021-05-17 12:43:20,45118.54,fair_price_15_min_before
466405,2021-05-17 12:43:25,45120.69,fair_price_15_min_before
466407,2021-05-17 12:43:30,45121.37,fair_price_15_min_before
466413,2021-05-17 12:43:36,45133.71,fair_price_15_min_before
466415,2021-05-17 12:43:40,45137.74,fair_price_15_min_before
466419,2021-05-17 12:43:45,45127.96,fair_price_15_min_before
466431,2021-05-17 12:43:50,45100.83,fair_price_15_min_before
466437,2021-05-17 12:43:55,45091.78,fair_price_15_min_before
466438,2021-05-17 12:44:00,45084.75,fair_price_15_min_before
466445,2021-05-17 12:44:06,45094.08,fair_price_15_min_before
466448,2021-05-17 12:44:10,45106.51,fair_price_15_min_before
466456,2021-05-17 12:44:15,45122.97,fair_price_15_min_before
466461,2021-05-17 12:44:20,45106.78,fair_price_15_min_before
466466,2021-05-17 12:44:25,45096.55,fair_price_15_min_before
466469,2021-05-17 12:44:30,45088.06,fair_price_15_min_before
466474,2021-05-17 12:44:35,45086.12,fair_price_15_min_before
466491,2021-05-17 12:44:40,45065.95,fair_price_15_min_before
466495,2021-05-17 12:44:45,45068.21,fair_price_15_min_before
466502,2021-05-17 12:44:55,45066.47,fair_price_15_min_before
466506,2021-05-17 12:45:00,45063.82,fair_price_15_min_before
466512,2021-05-17 12:45:05,45070.48,fair_price_15_min_before
466519,2021-05-17 12:45:10,45050.59,fair_price_15_min_before
466523,2021-05-17 12:45:16,45041.13,fair_price_15_min_before
466526,2021-05-17 12:45:20,45038.36,fair_price_15_min_before
466535,2021-05-17 12:45:25,45029.72,fair_price_15_min_before
466553,2021-05-17 12:45:31,45016.2,fair_price_15_min_before
466557,2021-05-17 12:45:35,45011.2,fair_price_15_min_before
466559,2021-05-17 12:45:40,45007.04,fair_price_15_min_before
This is the CSV
Firstly, convert your date column to datetime dtype:
df['date'] = pd.to_datetime(df['date'])
Then filter values:
date15min = df['date'] - pd.offsets.DateOffset(minutes=15)
out = df.loc[df['date'].isin(date15min.tolist())]
Finally, do your calculations:
df['price_before_15min'] = df['fair_price'].where(df['date'].isin((out['date'] + pd.offsets.DateOffset(minutes=15)).tolist()))
df['price_before_15min'] = df['price_before_15min'].diff()
df['date_before_15min'] = date15min
Now if you print df you will get your desired output.
Update:
For that purpose, just make a slight change to the method above:
out = df.loc[df['date'].dt.minute.isin(date15min.dt.minute.tolist())]
df['price_before_15min'] = df['fair_price'].where(df['date'].dt.minute.isin((out['date'] + pd.offsets.DateOffset(minutes=15)).dt.minute.tolist()))
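For what it's worth, pd.merge_asof is built for exactly this kind of "closest earlier row" lookup and may be simpler here. A sketch, assuming the CSV columns shown above (pkey_id, date, fair_price):
import pandas as pd

df = pd.read_csv('./data.csv', parse_dates=['date']).sort_values('date')

# Build a lookup table keyed on each timestamp shifted forward by 15 minutes,
# so a backward as-of merge matches the closest row at or before t - 15min.
lookup = df[['date', 'fair_price']].rename(columns={'fair_price': '15_min_ago'})
lookup['date'] = lookup['date'] + pd.Timedelta(minutes=15)

df = pd.merge_asof(df, lookup, on='date', direction='backward')
df['diff'] = df['fair_price'] - df['15_min_ago']  # NaN for the first ~15 minutes of data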

Collect all transactions for each day and report total spent that day

I have a DataFrame that looks like this
date Burned
8/11/2019 7:00 0.0
8/11/2019 7:00 10101.0
8/11/2019 8:16 5.2
I have this code:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv")
# Preview the first 5 lines of the loaded data
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
a=pd.concat([df['DateTime'], df['Burned']], axis=1, keys=['date', 'Burned'])
b=a.groupby(df.index.date).count()
But I get this error: AttributeError: 'RangeIndex' object has no attribute 'date'
Basically I want to group all these rows just by day, since there are timestamps throughout the day. I don't care what time of day different things occurred; I just want the total 'Burned' per day.
First, add parse_dates=['DateTime'] to read_csv to convert the DateTime column:
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
Or first column:
df = pd.read_csv("../example.csv", parse_dates=[0])
In your solution, a has a date column, so you need Series.dt.date with sum:
b = a.groupby(a['date'].dt.date)['Burned'].sum().reset_index(name='Total')
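Equivalently (a sketch, assuming the column names above), you can skip the intermediate frame and aggregate the parsed column directly, either by calendar day or with a daily resample:
daily = df.groupby(df['DateTime'].dt.date)['Burned'].sum()
daily = df.set_index('DateTime')['Burned'].resample('D').sum()  # includes empty days as 0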

Element-wise maximum with date values

I have a dataframe with date values and would like to clamp them to 1 Jan 2000 or later. Since I need to do this element-wise, I use np.maximum(). The code below, however, gives
TypeError: Cannot compare type 'Timestamp' with type 'int'.
What's the appropriate method to deal with this kind of data type?
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': np.arange('1999-12', '2000-02', dtype='datetime64[D]')})
df['corrected_date'] = np.maximum(pd.to_datetime('20000101', format='%Y%m%d'), df['date'])
For me, comparing with a Series works:
s = pd.Series(pd.to_datetime('20000101', format='%Y%m%d'), index=df.index)
df['corrected_date'] = np.maximum(s, df['date'])
Or with a repeated DatetimeIndex:
i = np.repeat(pd.to_datetime(['20000101'], format='%Y%m%d'), len(df))
df['corrected_date'] = np.maximum(i, df['date'])
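Alternatively (not in the original answers), Series.clip accepts a scalar bound on reasonably recent pandas versions, which sidesteps the alignment issue entirely:
df['corrected_date'] = df['date'].clip(lower=pd.Timestamp('2000-01-01'))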

Upsample data and interpolate

I have the following dataframe:
Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426
...
I need to resample this to weekly resolution and interpolate between the points. The latter part, the interpolation, is straightforward. The reindexing part, on the other hand, is a bit tricky, at least for me.
If I use the DataFrame.reindex() method, it only erases all the entries from the dataframe. I have tried doing it manually, using .loc[] to create new NaN entries between each pair of consecutive months, but this method overwrites the entries I already have.
Any clue how to do it? Thanks!
I have to assume a start date; I chose 2009-12-31.
To get resample to work, you need a pd.DatetimeIndex.
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
Replicable code
from io import StringIO
import pandas as pd
text = """Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426"""
df = pd.read_csv(StringIO(text), decimal=',', delim_whitespace=True)
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
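For intuition (a sketch, not part of the original answer), the resample step is doing the reindexing the question struggled with: asfreq() lays down the weekly NaN grid, and interpolate() then fills it linearly, so the chained call above is effectively equivalent to:
weekly = df.resample('W').asfreq()  # new weekly rows, NaN except at the month-end anchors
weekly = weekly.interpolate()       # linear fill between the monthly values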
