I have a df as follows:
Date values
20190101000000 1384.4801224435887
20190101000001 1384.5053056232982
20190101000002 1384.5304889818935
20190101000003 1384.5556725193492
20190101000004 1384.5808562356392
20190101000005 1384.606040130739
20190101000006 1384.631224204622
20190101000007 1384.6564084572635
20190101000008 1384.6815928886372
20190101000009 1384.7067774987179
20190101000010 1384.7319622874802
20190101000011 1384.757147254898
20190101000012 1384.7823324009464
20190101000013 1384.8075177255998
20190101000014 1384.8327032288325
20190101000015 1384.8578889106184
20190101000016 1384.8830747709321
20190101000017 1384.9082608097488
20190101000018 1384.9334470270423
20190101000019 1384.958633422787
20190101000020 1384.9838199969574
20190101000021 1385.0090067495285
20190101000022 1385.034193680474
20190101000023 1385.0593807897685
20190101000024 1385.0845680773864
20190101000025 1385.1097555433028
20190101000026 1385.134943187491
20190101000027 1385.160131009926
20190101000028 1385.1853190105826
20190101000029 1385.2105071894343
20190101000030 1385.2356955464566
where the Date column is in the format %Y%m%d%H%M%S. I take a start date and an end date as user inputs and split the range at a frequency of 1 second.
Now, I would like to take a frequency in seconds from the user and obtain the value from the values column at each such instant.
Example:
If the second resolution is 10secs, then the output must be as follows:
start end value
20190101000000 20190101000010 1384.7319622874802
20190101000011 20190101000020 1384.9838199969574
20190101000021 20190101000030 1385.2356955464566
From the above df, we can see that if the resolution is 10 sec, then the value at every 10th second must be obtained.
If the second resolution is 15mins, then the output must be as follows:
start end values
20190101000000 20190101001500 1407.2142300429964
20190101001501 20190101003000 1416.6996533329484
20190101003001 20190101004500 1424.2467631293005
How can this be done?
My code so far:
import datetime
import pandas as pd
START_DATE = str(input('Enter start date in %Y-%m-%d %H:%M:%S format: '))
END_DATE = str(input('Enter end date in %Y-%m-%d %H:%M:%S format: '))
RESOLUTION = 'S'
dates = pd.date_range(START_DATE, END_DATE, freq = RESOLUTION)
dates = pd.DataFrame(pd.Series(dates).dt.strftime('%Y%m%d%H%M%S'), columns = ['Date'])
Compare the datetime values converted to their underlying integer representation, using modulo by the timedelta; then create a new column with DataFrame.insert and Series.shift, and finally remove the first row with iloc:
import numpy as np

res = '10s'
m = pd.to_datetime(df['Date']).to_numpy().astype(np.int64) % pd.Timedelta(res).value == 0
df = df[m].rename(columns={'Date':'end'})
df.insert(0, 'start', df['end'].shift())
df = df.iloc[1:]
print (df)
start end values
10 20190101000000 20190101000010 1384.7319622874802
20 20190101000010 20190101000020 1384.9838199969574
30 20190101000020 20190101000030 1385.2356955464566
Last, to add 1 second to each start, use:
df.loc[df.index[1:], 'start'] = (pd.to_datetime(df.loc[df.index[1:], 'start']) +
pd.Timedelta('1s')).dt.strftime('%Y%m%d%H%M%S')
print (df)
start end values
10 20190101000000 20190101000010 1384.7319622874802
20 20190101000011 20190101000020 1384.9838199969574
30 20190101000021 20190101000030 1385.2356955464566
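For comparison, the same every-Nth-second selection can be sketched with resample, using right-closed, right-labelled buckets so each label carries the value observed at that exact second. The column names mirror the question; the mock values are assumptions:

```python
import pandas as pd

# mock data: one row per second, mirroring the question's layout
idx = pd.date_range("2019-01-01 00:00:00", periods=31, freq="s")
df = pd.DataFrame({"Date": idx.strftime("%Y%m%d%H%M%S"),
                   "values": range(31)})

ts = pd.to_datetime(df["Date"], format="%Y%m%d%H%M%S")
# bucket into 10-second windows closed and labelled on the right, so the
# label 00:00:10 holds the value at second 10; drop the lone first bucket
out = (df.set_index(ts)["values"]
         .resample("10s", closed="right", label="right")
         .last()
         .iloc[1:])
print(out)
```

With the mock values 0..30 this yields 10, 20, 30 at labels 00:00:10, 00:00:20, 00:00:30, matching the modulo approach above.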
You have to change the data type of the dates:
import pandas as pd
start_date = pd.to_datetime(START_DATE)
end_date = pd.to_datetime(END_DATE)
Resolution = start_date.minute
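A minimal sketch of that conversion, assuming Date strings shaped like the ones in the question:

```python
import pandas as pd

# parse the %Y%m%d%H%M%S strings into real timestamps
dates = pd.Series(["20190101000000", "20190101000010"], name="Date")
parsed = pd.to_datetime(dates, format="%Y%m%d%H%M%S")
# once parsed, datetime components are directly accessible
print(parsed.dt.second)
```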
Related
I had a column in a data frame called startEndDate, for example '10.12-20.05.2019'. I split it into start_date and end_date columns sharing the same year, e.g. start_date '10.12.2019' and end_date '20.05.2019'. But the year in this example is wrong: it should be 2018, because the start date cannot be after the end date. How can I compare the entire dataframe and replace values so that it contains the correct start_dates based on a condition (some start dates should keep the year 2019)?
This will show you the rows in which start_date is greater than end_date:
import numpy as np
import pandas as pd

data = {
    'Start_Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
    'End_Date': ['2020-02-01', '2019-01-02', '2019-01-03', '2020-01-05']
}
df = pd.DataFrame(data)
df['Start_Date'] = pd.to_datetime(df['Start_Date'], infer_datetime_format=True)
df['End_Date'] = pd.to_datetime(df['End_Date'], infer_datetime_format=True)
df['Check'] = np.where(df['Start_Date'] > df['End_Date'], 'Error', 'No Error')
df
Without seeing more of your data, or your intended final data, this is the best we can do to help identify problems in the data.
This method first splits up the date string to two dates and creates start and end date columns. Then it subtracts 1 year from the start date if it is greater than the end date.
import pandas as pd
import numpy as np
# mock data
df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})
# split date string to two dates, convert to datetime and stack to columns
df[["start", "end"]] = np.vstack(
    df.dates.apply(lambda x: pd.to_datetime(
        [x.split("-")[0] + x[-5:],
         x.split("-")[1]], format="%d.%m.%Y")))
# subtract 1 year from start date if greater than end date
df["start"] = np.where(df["start"] > df["end"],
                       df["start"] - pd.DateOffset(years=1),
                       df["start"])
df
# dates start end
#0 10.12-20.05.2019 2018-12-10 2019-05-20
#1 02.04-31.10.2019 2019-04-02 2019-10-31
Although I have used split here for the initial splitting of the string, there will always be 5 characters before the hyphen and the date will always be the last 5 characters (with the .), so split is not actually needed and that line could instead be:
df[["start", "end"]] = np.vstack(
    df.dates.apply(lambda x: pd.to_datetime(
        [x[:5] + x[-5:],
         x[6:]], format="%d.%m.%Y")))
How can the result below be obtained?
Current Month is the column to be calculated. We need to increment the month, starting from Jan-18, for every account id.
Each account's first row/record starts at Jan-18, the second row is Feb-18, and so on; we keep incrementing from Jan-18 until the last observation for that account id.
The above is shown for a sample account; the same has to be applied to multiple account ids.
You could achieve what you are looking for as follows:
import pandas as pd
from datetime import date
acct_id = "123456789"
loan_start_date = date(2018, 1, 31)
current_date = date.today()
dates = pd.date_range(loan_start_date,current_date, freq='M').strftime("%b-%y")
df = pd.DataFrame()
df["current_month"] = dates
df["acct_id"] = acct_id
df["loan_start_date"] = loan_start_date
df = df[["acct_id", "loan_start_date", "current_month"]]
print(df.head())
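Since the question mentions multiple account ids, one hedged sketch is to number the rows within each account with groupby().cumcount() and offset Jan-18 by that count (the column names and mock data here are assumptions):

```python
import pandas as pd

# mock observations: three rows for account A, two for account B
df = pd.DataFrame({"acct_id": ["A", "A", "A", "B", "B"]})

# number rows within each account (0, 1, 2, ...) and shift Jan-18 by that many months
start = pd.Timestamp("2018-01-01")
months = df.groupby("acct_id").cumcount()
df["current_month"] = months.apply(
    lambda m: (start + pd.DateOffset(months=m)).strftime("%b-%y"))
print(df)
```

Each account restarts at Jan-18, so A gets Jan-18/Feb-18/Mar-18 and B gets Jan-18/Feb-18.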
I have a csv file with 2 datetimes (pre-start and pre-end) in each row, as well as a list of datetimes (install_list).
I am trying to iterate through the csv file and add a column that returns the total number of dates from the install_list that are between the pre-start time and the pre-end time in each row.
I am using the code below, but it is returning the total number of items in the list for each row in the csv.
example: File 1 = start time, end time
List 1 = install time
Desired Result for Each Row = IF Install Time >= Start Time AND Install Time <= End Time, SUM(Installs)
Col1 (Start Time): 1/1/21 12:00:00 PM
Col2 (End Time): 1/1/21 12:10:00 PM
Install Time List = [1/1/21 12:05:00 PM, 1/1/21 12:11:00 PM]
Desired Result for Row1/Col3 = 1
Code Below:
import datetime
import pandas as pd
from collections import Counter
df_post_logs = pd.read_csv('logs_merged.csv',index_col=False)
df_installs = pd.read_csv('install_merge.csv',index_col=False)
'''Convert UTC to EST on Installs Add Column'''
df_installs['conversion date'] = pd.to_datetime(df_installs['conversion date'],infer_datetime_format='%Y-%m-%d')
df_installs['conversion time'] = pd.to_datetime(df_installs['conversion time'],infer_datetime_format='%H:%S:%M')
utc_datetime = df_installs['conversion time']
est_datetime = utc_datetime - datetime.timedelta(hours=5)
df_installs['utc datetime'] = utc_datetime
df_installs['est datetime'] = est_datetime
'''Add Column 10 Minutes Pre-Spot Time to Post Logs/10 Minutes Post Time to Spot'''
df_post_logs['Air Date'] = pd.to_datetime(df_post_logs['Air Date'],infer_datetime_format='%Y-%m-%d')
df_post_logs['Air Time'] = pd.to_datetime(df_post_logs['Air Time'],infer_datetime_format='%H:%S:%M')
timestamp = df_post_logs['Air Time']
df_post_logs['timestamp'] = timestamp
df_post_logs['pre spot time start'] = timestamp - datetime.timedelta(minutes=10, seconds=1)
df_post_logs['pre spot time end'] = timestamp - datetime.timedelta(seconds=1)
df_post_logs['post spot time'] = timestamp + datetime.timedelta(minutes=10)
'''SUM of Installs between pre-spot time'''
install_list = pd.to_datetime(df_installs['est datetime']).to_list()
for pre_spot_start in df_post_logs['pre spot time start']:
    pre_spot_start_time = pre_spot_start
for pre_spot_end in df_post_logs['pre spot time end']:
    pre_spot_end_time = pre_spot_end
for pre_spot_end in df_post_logs['pre spot time end']:
    pre_spot_end_time = pre_spot_end
pre_spot_install = 0
for row in df_post_logs:
    for date in install_list:
        if date >= pre_spot_start_time and date <= pre_spot_end_time:
            pre_spot_install = pre_spot_install + 1
df_post_logs['Pre Spot Install'] = pre_spot_install
df_post_logs.to_csv('Test.csv')
The following code will print, for each row, how many values in install_dates lie between the respective values in the start and end columns of the dataframe:
import pandas as pd
df = pd.DataFrame({
    "start": pd.to_datetime(["2018-07-11", "2018-06-10"]),
    "end": pd.to_datetime(["2018-07-20", "2018-06-30"]),
})
install_dates = pd.to_datetime(["2018-06-25", "2018-07-01", "2018-07-15", "2018-07-18"])

def num_install_dates_between_start_and_end(row):
    return len([d for d in install_dates if row["start"] <= d <= row["end"]])
print(df.agg(num_install_dates_between_start_and_end, axis="columns"))
It uses agg to collapse the information of a row to one number.
The way the information is "collapsed" is specified in the function num_install_dates_between_start_and_end, which counts how many elements of install_dates lie between the start and end values in the row.
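For larger frames, the per-row Python loop can be replaced by a NumPy broadcast; a sketch on the same mock data (one vectorized option among several):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2018-07-11", "2018-06-10"]),
    "end": pd.to_datetime(["2018-07-20", "2018-06-30"]),
})
install_dates = pd.to_datetime(["2018-06-25", "2018-07-01", "2018-07-15", "2018-07-18"])

# broadcast a rows-by-dates boolean matrix, then sum each row
d = install_dates.values  # datetime64 array, shape (4,)
counts = ((d >= df["start"].values[:, None]) &
          (d <= df["end"].values[:, None])).sum(axis=1)
df["count"] = counts
print(df)
```

Row 0 spans 2018-07-11..2018-07-20 and contains two install dates; row 1 spans 2018-06-10..2018-06-30 and contains one.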
I have a dataframe with name values and a date range (start/end). I need to expand/replace the dates with the ones generated by the from/to index. How can I do this?
Name date_range
NameOne_%Y%m-%d [-2,1]
NameTwo_%y%m%d [-3,1]
Desired result (Assuming that today's date is 2021-03-09 - 9 of march 2021):
Name
NameOne_202103-10
NameOne_202103-09
NameOne_202103-08
NameOne_202103-07
NameTwo_210310
NameTwo_210309
NameTwo_210308
NameTwo_210307
NameTwo_210306
I've been trying to iterate over the dataframe and then generate the dates, but I still can't make it work:
for index, row in self.config_df.iterrows():
    print(row['source'], row['date_range'])
    days_sub = int(str(self.config_df["date_range"][0]).strip("[").strip("]").split(",")[0].strip())
    days_add = int(str(self.config_df["date_range"][0]).strip("[").strip("]").split(",")[1].strip())
    start_date = date.today() + timedelta(days=days_sub)
    end_date = date.today() + timedelta(days=days_add)
    date_range_df = pd.date_range(start=start_date, end=end_date)
    date_range_df["source"] = row['source']
Any help is appreciated. Thanks!
Convert your date_range from str to list with ast module:
import ast
df = df.assign(date_range=df["date_range"].apply(ast.literal_eval))
Use date_range to create list of dates and explode to chain the list:
today = pd.Timestamp.today().normalize()
offset = pd.tseries.offsets.Day # shortcut
names = pd.Series([pd.date_range(today + offset(end),
                                 today + offset(start),
                                 freq="-1D").strftime(name)
                   for name, (start, end) in df.values]).explode(ignore_index=True)
>>> names
0 NameOne_202103-10
1 NameOne_202103-09
2 NameOne_202103-08
3 NameOne_202103-07
4 NameTwo_210310
5 NameTwo_210309
6 NameTwo_210308
7 NameTwo_210307
8 NameTwo_210306
dtype: object
Alright. From your question I understand you have a starting data frame like so:
config_df = pd.DataFrame({
    'name': ['NameOne_%Y-%m-%d', 'NameTwo_%y%m%d'],
    'str_date_range': ['[-2,1]', '[-3,1]']})
Resulting in this:
name str_date_range
0 NameOne_%Y-%m-%d [-2,1]
1 NameTwo_%y%m%d [-3,1]
To achieve your goal without iterating over rows - which is best avoided in pandas - you can use groupby().apply() like so:
def expand(row):
    # Get the start_date and end_date from the row, by splitting
    # the string and taking the first and last value respectively.
    # .min() is required because row is technically a pd.Series
    start_date = row.str_date_range.str.strip('[]').str.split(',').str[0].astype(int).min()
    end_date = row.str_date_range.str.strip('[]').str.split(',').str[1].astype(int).min()
    # Create a range from start_date to end_date.
    # Note that range() does not include the end_date, therefore add 1
    day_range = range(start_date, end_date + 1)
    # Create a Timedelta series from the day_range
    days_diff = pd.to_timedelta(pd.Series(day_range), unit='days')
    # Create an equally sized Series of today Timestamps
    todays = pd.Series(pd.Timestamp.today()).repeat(len(day_range) - 1).reset_index(drop=True)
    df = todays.to_frame(name='date')
    # Add days_diff to the date column
    df['date'] = df.date + days_diff
    df['name'] = row.name
    # Extract the date format from the name
    date_format = row.name.split('_')[1]
    # Add a column with the formatted date using the date_format string
    df['date_str'] = df.date.dt.strftime(date_format=date_format)
    df['name'] = df.name.str.split('_').str[0] + '_' + df.date_str
    # Optional: drop columns
    return df.drop(columns=['date'])
config_df.groupby('name').apply(expand).reset_index(drop=True)
returning:
name date_str
0 NameOne_2021-03-07 2021-03-07
1 NameOne_2021-03-08 2021-03-08
2 NameOne_2021-03-09 2021-03-09
3 NameTwo_210306 210306
4 NameTwo_210307 210307
5 NameTwo_210308 210308
6 NameTwo_210309 210309
I have this kind of dataframe.
These data represent the value of a consumption index, generally encoded once a month (at the end, or at the beginning of the following month) but sometimes more often. The value can be reset to 0 if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month: the one nearest to the first day of the month AND earlier than the 15th day of the month (because a later day could be the end-of-month reading). Another condition is that if the difference between two values is negative (the counter has been replaced), the value must be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be
The purpose is to calculate only one consumption per month.
A solution is to iterate over the dataframe (as an array) and apply some if statements. However, I wonder if there is a "simpler" alternative.
Thank you
You can normalize the month data with MonthEnd, then drop duplicates based on that column, keeping the entry nearest the start of the month.
from pandas.tseries.offsets import MonthEnd

df['New'] = df.index + MonthEnd(1)
df['Diff'] = abs((df.index - df['New']).dt.days)
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste some sample data into the question if this isn't doing the job.
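Since the snippet above was untested, here is a runnable sketch of the same normalize-and-deduplicate idea on mock data resembling the question, keeping the reading nearest the month start (column and index names are assumptions):

```python
import pandas as pd

# mock index readings, several per month, datetime index
df = pd.DataFrame({"Value": [1254, 1265, 1277, 1301, 1345, 1541]},
                  index=pd.to_datetime(["2019-10-05", "2019-10-29", "2019-10-30",
                                        "2019-11-04", "2019-11-30", "2020-02-03"]))

tmp = df.copy()
tmp["month"] = tmp.index.to_period("M")  # normalize each reading to its month
tmp["day"] = tmp.index.day
# keep, per month, the reading closest to the first day
out = (tmp.sort_values(["month", "day"])
          .drop_duplicates(subset="month", keep="first")
          .drop(columns=["month", "day"]))
print(out)
```

This keeps 2019-10-05, 2019-11-04 and 2020-02-03; the question's extra condition about counter resets would still need separate handling.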
Define the dataframe, convert the index to datetime, define helper columns,
use them with the shift method to conditionally remove rows, and finally remove the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([[1254],
                   [1265],
                   [1277],
                   [1301],
                   [1345],
                   [1541]],
                  columns=["Value"],
                  index=[dt.strptime("05-10-19", '%d-%m-%y'),
                         dt.strptime("29-10-19", '%d-%m-%y'),
                         dt.strptime("30-10-19", '%d-%m-%y'),
                         dt.strptime("04-11-19", '%d-%m-%y'),
                         dt.strptime("30-11-19", '%d-%m-%y'),
                         dt.strptime("03-02-20", '%d-%m-%y')])
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541