How to add a time sequence column based on groups - python

I am modifying a CSV file using python (sample data below). I need to add a time column that increases by one second when the date matches the previous row's date.
If the date changes, the time starts back over at 8:00:00.
Additionally, if the 'PL Seq' changes from G* to H*, the time also starts back over at 8:00:00.
I think I have the logic down, I'm just having a hard time writing it:
add a column 'Time' to the df
set the first 'Time' value to 8:00:00
read each row in the df
if the date equals the previous row's date and the first character of 'PL Seq' equals the previous row's first character, set the time value to the previous time + 1 second
else reset the time value to 8:00:00
*Just a note: I already have the code to change the format of the order #'s and the dates to the goal state. A direct translation of the pseudocode is sketched below.
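A minimal, hedged sketch of that row-by-row logic, assuming df holds the sample rows below, already sorted, with a default RangeIndex (the calendar date inside the datetime is a dummy; only the time-of-day is kept):
import datetime

times = []
current = None
for i in range(len(df)):
    same_date = i > 0 and df.loc[i, 'Prod Date'] == df.loc[i - 1, 'Prod Date']
    same_group = i > 0 and df.loc[i, 'PL Seq'][0] == df.loc[i - 1, 'PL Seq'][0]
    if same_date and same_group:
        current += datetime.timedelta(seconds=1)          # same day, same G/H group: +1 second
    else:
        current = datetime.datetime(2000, 1, 1, 8, 0, 0)  # new day or new group: reset to 8:00:00
    times.append(current.time())
df['Time'] = times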
Current
MODELCHASS,Prod Date,PL Seq
M742-021167,20200917,G0005
M359-020535,20200917,G0010
M742-022095,20200917,G0015
M220-001083,20200918,G0400
M742-022390,20200918,G0405
M907-004747,20200918,H0090
M934-005904,20200918,H0095
Expected
MODELCHASS,Prod Date,PL Seq,Time
M742 021167,2020-09-17T,G0005,8:00:00
M359 020535,2020-09-17T,G0010,8:00:01
M742 022095,2020-09-17T,G0015,8:00:02
M220 001083,2020-09-18T,G0400,8:00:00
M742 022390,2020-09-18T,G0405,8:00:01
M907 004747,2020-09-18T,H0090,8:00:00
M934 005904,2020-09-18T,H0095,8:00:01
@Trenton: can we modify this if H orders have the same date as G orders?
For example:
Current (edited: the H rows now share the date 20200917 with the G rows)
MODELCHASS,Prod Date,PL Seq
M742-021167,20200917,G0005
M359-020535,20200917,G0010
M742-022095,20200917,G0015
M220-001083,20200918,G0400
M742-022390,20200918,G0405
M907-004747,20200917,H0090
M934-005904,20200917,H0095
Expected (edited)
MODELCHASS,Prod Date,PL Seq,Time
M742 021167,2020-09-17T,G0005,8:00:00
M359 020535,2020-09-17T,G0010,8:00:01
M742 022095,2020-09-17T,G0015,8:00:02
M220 001083,2020-09-18T,G0400,8:00:00
M742 022390,2020-09-18T,G0405,8:00:01
M907 004747,2020-09-17T,H0090,8:00:00
M934 005904,2020-09-17T,H0095,8:00:01

Convert the 'Prod Date' column to a datetime.
Sort the dataframe by 'Prod Date' and 'PL Seq' so df will be in the same order as time_seq for joining.
The main part of the answer is to create a time sequence for each group with .groupby and .apply:
.groupby the 'Prod Date' and the first character of 'PL Seq'
df.groupby(['Prod Date', df['PL Seq'].str[0]])
.apply(lambda x: (pd.date_range(start=x.values[0] + pd.Timedelta(hours=8), periods=len(x), freq='s')).time)
For each group, use the first value in x as start: x.values[0].
To this date, add an 8-hour Timedelta to get 08:00:00.
The number of periods is len(x).
The freq is 's', for seconds.
This creates a DatetimeIndex, from which the times are extracted with .time.
Because the grouping key includes the first character of 'PL Seq', H orders that share a date with G orders (the edited example above) also restart at 08:00:00.
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# setup test dataframe
data = {'MODELCHASS': ['M742-021167', 'M359-020535', 'M742-022095', 'M220-001083', 'M742-022390', 'M907-004747', 'M934-005904'],
'Prod Date': [20200917, 20200917, 20200917, 20200918, 20200918, 20200918, 20200918],
'PL Seq': ['G0005', 'G0010', 'G0015', 'G0400', 'G0405', 'H0090', 'H0095']}
df = pd.DataFrame(data)
# convert Prod Date to a datetime column
df['Prod Date'] = pd.to_datetime(df['Prod Date'], format='%Y%m%d')
# sort the dataframe by values so the order will correspond to the groupby order
df = df.sort_values(['Prod Date', 'PL Seq']).reset_index(drop=True)
# groupby Prod Date and the first character of PL Seq
# create a DateRange sequence for each group
# reshape the dataframe
time_seq = (df.groupby(['Prod Date', df['PL Seq'].str[0]])['Prod Date']
.apply(lambda x: (pd.date_range(start=x.values[0] + pd.Timedelta(hours=8), periods=len(x), freq='s')).time)
.reset_index(name='time_seq')
.explode('time_seq', ignore_index=True))
# join the time_seq column to df
df_new = df.join(time_seq.time_seq)
# display(df_new)
MODELCHASS Prod Date PL Seq time_seq
0 M742-021167 2020-09-17 G0005 08:00:00
1 M359-020535 2020-09-17 G0010 08:00:01
2 M742-022095 2020-09-17 G0015 08:00:02
3 M220-001083 2020-09-18 G0400 08:00:00
4 M742-022390 2020-09-18 G0405 08:00:01
5 M907-004747 2020-09-18 H0090 08:00:00
6 M934-005904 2020-09-18 H0095 08:00:01

Related

Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY

I get a file every day with around 15 columns. Some days there are 2 date columns and some days one date column. Also, the date format on some days is YYYY-MM-DD and on some it's DD-MM-YYYY. The task is to convert the 2 (or 1) date columns to MM-DD-YYYY. Sample data in the csv file (a few columns):
Execution_date  Extract_date  Requestor_Name  Count
2023-01-15      2023-01-15    John Smith      7
Sometimes we don't get the second column above (Extract_date):
Execution_date  Requestor_Name  Count
17-01-2023      Andrew Mill     3
Task is to find all the date columns in the file and change the date format to MM-DD-YYYY.
So the sample output of the above 2 files will be:
Execution_date  Extract_date  Requestor_Name  Count
01-15-2023      01-15-2023    John Smith      7

Execution_date  Requestor_Name  Count
01-17-2023      Andrew Mill     3
I am using pandas and can't figure out how to deal with the missing second column on some days and the change in the date value format.
I can hardcode the 2 column names and change the format with:
df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This will only work when the file has 2 columns and the values are in DD-MM-YYYY format.
I'm looking for guidance on how to dynamically find the number of date columns and the date value format so that I can use them in my 2 lines of code above. If not, any other solution would also work for me. I could use powershell if it can't be done in python, but I am guessing there will be a lot more avenues in python than in powershell.
The following loads a CSV file into a dataframe, checks each string value to see if it matches one of the date formats, and, if it does, rearranges the date into the format you're looking for. Other values are left untouched.
import pandas as pd
import re
df = pd.read_csv("today.csv")
# compiling the patterns ahead of time saves a lot of processing power later
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"
def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value

new_df = df.applymap(change_dt)
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
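For instance, a minimal follow-up sketch, assuming the MM-DD-YYYY strings produced above and the Execution_date column from the sample:
# parse the reformatted MM-DD-YYYY strings back into datetimes if needed
df["Execution_date"] = pd.to_datetime(df["Execution_date"], format="%m-%d-%Y")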
You can use a function to check all column names that contain "date" and use .fillna to try other formats (add all possible formats).
import pandas as pd
def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3

Trouble with Finding Date Range in Pandas

I have a dataset which has a list of subjects, a start date, and an end date. I'm trying to write a loop so that for each subject I get a list of dates between the start date and end date. I've tried many approaches based on previous posts but I'm still having issues.
an example of the dataframe:
Participant # Start_Date End_Date
1 23-04-19 25-04-19
An example of the output I want:
Participant # Range
1 23-04-19
1 24-04-19
1 25-04-19
Right now my code looks like this:
subjs_490 = tracksheet_490['Participant #']
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    start = temp_a['Start_Date']
    end = temp_a['End_Date']
    start_dates = pd.to_datetime(pd.Series(start), format='%d-%m-%y')
    end_dates = pd.to_datetime(pd.Series(end), format='%d-%m-%y')
    date_range = pd.date_range(start_dates, end_dates).tolist()
With this method I'm getting the following error:
Cannot convert input [1   2016-05-03 Name: Start_Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
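That error comes from passing a whole Series to pd.date_range, which expects scalar start and end values. A minimal fix inside the loop, assuming one row per participant, would be to take the first element:
start_dates = pd.to_datetime(temp_a['Start_Date'].iloc[0], format='%d-%m-%y')
end_dates = pd.to_datetime(temp_a['End_Date'].iloc[0], format='%d-%m-%y')
date_range = pd.date_range(start_dates, end_dates).tolist()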
Expanding ranges tends to be a slow process. You can create the date_range and then explode it to get what you want. Moving 'Participant #' to the index makes sure it's repeated for all rows that are exploded.
df = (df.set_index('Participant #')
        .apply(lambda x: pd.date_range(x['Start_Date'], x['End_Date']), axis=1)  # :( slow
        .rename('Range')
        .explode()
        .reset_index())
Participant # Range
0 1 2019-04-23
1 1 2019-04-24
2 1 2019-04-25
If you can't use explode another option is to create a separate DataFrame for each row and then concat them all together.
pd.concat([pd.DataFrame({'Participant #': par, 'Range': pd.date_range(start, end)})
           for par, start, end in zip(df['Participant #'], df['Start_Date'], df['End_Date'])],
          ignore_index=True)

How do I iterate an input list over this function and consequently return the output as a csv file

I want to iterate a list with input variables over the following function, to then consequently return the output as a csv file. I have a big csv file which I first import to create a dataframe. I then want to get a certain part of the dataframe, namely, the -10 days and +10 days around a certain date and for a certain stock.
The dataframe looks as follows (this is just a small part; in reality it's 100k+ rows, for every day, all stock tickers, for the period 2011 - 2019):
Date Symbol ShortExemptVolume ShortVolume TotalVolume
2011-01-03 AAWW 0.0 28369 78113.0
2011-01-03 AMD 0.0 3183556 8095093.0
2011-01-03 AMRS 0.0 14196 18811.0
2011-01-03 ARAY 0.0 31685 77976.0
2011-01-03 ARCC 0.0 177208 423768.0
The function is as follows. It filters the dataframe for the stock ticker and then for the dates (-10 and +10 days around a given specific date).
import pandas as pd
from datetime import datetime
import urllib
import datetime

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'C:\Users\name\document.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    x = -10  # set window range
    y = -(x)
    date_1 = datetime.datetime.strptime(issue_date, "%Y-%m-%d")
    before_date = pricing_date = date_1 + datetime.timedelta(days=x)  # x days before issue date
    after_date = date_1 + datetime.timedelta(days=y)
    cond1 = df['Date'] >= before_date
    cond2 = df['Date'] <= after_date
    cond3 = df['Symbol'] == 'stock_ticker'
    short_data = df[cond1 & cond2 & cond3]
    return [short_data]
I have a list with a couple hundred rows that contain a specific stock ticker and issue date, for example like this:
ARAY 4/24/2014
ACET 11/16/2015
ACET 11/16/2015
AEGR 8/15/2014
ATSG 9/29/2017
I would like to iterate the list of stocks and their respective dates over the function and get the output in csv format.
The output should be 20 dates for every row in the input file.
Any tips or help is welcome
Consider building a list of data frames generated from the function and compiling them together with concat. Also, there is no need to separately call to_datetime, as you can use the parse_dates argument in read_csv:
import datetime
from datetime import datetime as dt
import pandas as pd

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'C:\Users\name\document.csv', parse_dates=['Date'])
    x = -10  # set window range
    y = -(x)
    date_1 = dt.strptime(issue_date, "%m/%d/%Y")  # MATCH ACCORDING TO INPUT
    before_date = date_1 + datetime.timedelta(days=x)
    after_date = date_1 + datetime.timedelta(days=y)
    cond1 = df['Date'] >= before_date
    cond2 = df['Date'] <= after_date
    cond3 = df['Symbol'] == stock_ticker  # REMOVE SINGLE QUOTES
    short_data = df[cond1 & cond2 & cond3]
    return short_data  # REMOVE LIST BRACKETS
stock_date_list = [['ARAY', '4/24/2014'],
                   ['ACET', '11/16/2015'],
                   ['ACET', '11/16/2015'],
                   ['AEGR', '8/15/2014'],
                   ['ATSG', '9/29/2017']]
# LIST COMPREHENSION ITERATIVELY CALLING FUNCTION
df_list = [get_data(i[1], i[0]) for i in stock_date_list]
# SINGLE DATA FRAME COMPILATION
final_df = pd.concat(df_list, ignore_index=True)
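To get the CSV output the question asks for, a final step could be (the output path here is hypothetical):
final_df.to_csv(r'C:\Users\name\output.csv', index=False)  # hypothetical output path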

check for date and time between two columns in pandas data frame

I have two data frames:
The first data frame is:
import pandas as pd
df1 = pd.DataFrame({'serialNo': ['aaaa', 'bbbb', 'cccc', 'ffff', 'aaaa', 'bbbb', 'aaaa'],
                    'Name': ['Sayonti', 'Ruchi', 'Tony', 'Gowtam', 'Toffee', 'Tom', 'Sayonti'],
                    'testName': [4402, 3747, 5555, 8754, 1234, 9876, 3602],
                    'moduleName': ['singing', 'dance', 'booze', 'vocals', 'drama', 'paint', 'singing'],
                    'endResult': ['WARNING', 'FAILED', 'WARNING', 'FAILED', 'WARNING', 'FAILED', 'WARNING'],
                    'Date': ['2018-10-5', '2018-10-6', '2018-10-7', '2018-10-8', '2018-10-9', '2018-10-10', '2018-10-8'],
                    'Time_df1': ['23:26:39', '22:50:31', '22:15:28', '21:40:19', '21:04:15', '20:29:11', '19:54:03']})
The second data frame is:
df2 = pd.DataFrame({'serialNo': ['aaaa', 'bbbb', 'aaaa', 'ffff', 'xyzy', 'aaaa'],
                    'Food': ['Strawberry', 'Coke', 'Pepsi', 'Nuts', 'Apple', 'Candy'],
                    'Work': ['AP', 'TC', 'OD', 'PU', 'NO', 'PM'],
                    'Date': ['2018-10-1', '2018-10-6', '2018-10-2', '2018-10-3', '2018-10-5', '2018-10-10'],
                    'Time_df2': ['09:00:00', '10:00:00', '11:00:00', '12:00:00', '13:00:00', '14:00:00']})
I am joining the two based on serial number:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
result = pd.merge(df1,df2,on=['serialNo'],how='inner')
Now I want Date_y to lie within 3 days of Date_x, starting from Date_x; that is, Date_x + (1, 2, or 3 days) should give Date_y. I can get that as below, but I also want to check a time range, which I do not know how to achieve:
result = result[result.Date_x.sub(result.Date_y).dt.days.between(0, 3)]
I want to check the time such that Time_df2 is within 6 hours of Time_df1 (the start time). Please help?
You could have a column within your dataframe that combines the date and the time. Here's an example of combining a single row in the dataframe:
import datetime

# combining Date_x and Time_df1
value_1_x = datetime.datetime.combine(result['Date_x'][0].date(),
                                      datetime.datetime.strptime(result['Time_df1'][0], '%H:%M:%S').time())
# combining Date_y and Time_df2
value_2_y = datetime.datetime.combine(result['Date_y'][0].date(),
                                      datetime.datetime.strptime(result['Time_df2'][0], '%H:%M:%S').time())
Then given two datetime objects, you can simply subtract to find the difference you are looking for:
difference = value_1_x - value_2_y
print(difference)
Which gives the output:
4 days, 14:26:39
My understanding is that you are looking to see if something is within 3 days and 6 hours (or a total of 78 hours). You can convert this to hours easily, and then make the desired comparison:
hours_difference = abs(value_1_x - value_2_y).total_seconds() / 3600.0
print(hours_difference)
Which gives the output:
110.44416666666666
Hope that helps!
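To apply the same 78-hour comparison across the whole merged frame rather than row by row, a vectorized sketch (assuming result from the merge above, with 'Date_x'/'Date_y' already datetimes):
import pandas as pd

# build full timestamps by concatenating each date with its time-of-day string
dt_x = pd.to_datetime(result['Date_x'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df1'])
dt_y = pd.to_datetime(result['Date_y'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df2'])

# keep rows where the two timestamps are within 3 days and 6 hours (78 hours)
result = result[(dt_x - dt_y).abs() <= pd.Timedelta(hours=78)]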

How to automatically add data for the missing days in historical stock prices?

I would like to write a python script that will check if there are any missing days. If there are, it should take the price from the most recent preceding day and create a new row for the missing day, as shown below. My data is in CSV files. Any ideas how it can be done?
Before:
MSFT,5-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,29-May-07,243.31
After:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,253.28
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,249.95
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,243.31
MSFT,29-May-07,243.31
My solution:
import pandas as pd

df = pd.read_csv("path/to/file/file.csv", names=list("abc"))  # read from file
cols = df.columns                                      # store column order
df.b = pd.to_datetime(df.b)                            # convert col b (the date) to datetime
df.set_index("b", inplace=True)                        # set col b as index
df = df.resample("D").ffill().reset_index()            # resample days and fill values
df = df[cols]                                          # revert column order
df.sort_values(by="b", ascending=False, inplace=True)  # sort by date, newest first
df["b"] = df["b"].dt.strftime("%-d-%b-%y")             # revert date format
df.to_csv("data.csv", index=False, header=False)       # specify output file if needed
print(df.to_string())
Using the pandas library, this operation can be done in a single line. But first we need to read your data into the right format:
import io
import pandas as pd
s = u"""name,Date,Close
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,23-Dec-16,789.91
MSFT,16-Dec-16,790.8
MSFT,15-Dec-16,797.85
MSFT,14-Dec-16,797.07"""
#df = pd.read_csv("path/to/file.csv") # read from file
df = pd.read_csv(io.StringIO(s)) # read string as file
cols = df.columns # store column order
df.Date = pd.to_datetime(df.Date) # convert col Date to datetime
df.set_index("Date",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df
Returns:
Date name Close
0 2016-12-14 MSFT 797.07
1 2016-12-15 MSFT 797.85
2 2016-12-16 MSFT 790.80
3 2016-12-17 MSFT 790.80
4 2016-12-18 MSFT 790.80
5 2016-12-19 MSFT 790.80
6 2016-12-20 MSFT 790.80
7 2016-12-21 MSFT 790.80
8 2016-12-22 MSFT 790.80
9 2016-12-23 MSFT 789.91
10 2016-12-24 MSFT 789.91
11 2016-12-25 MSFT 789.91
12 2016-12-26 MSFT 789.91
13 2016-12-27 MSFT 791.55
14 2016-12-28 MSFT 785.05
15 2016-12-29 MSFT 782.79
16 2016-12-30 MSFT 771.82
Write back to csv with:
df = df[cols] # revert order
df.sort_values(by="Date",ascending=False,inplace=True) # sort by date
df["Date"] = df["Date"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv(index=False,header=False) #specify outputfile if needed
Output:
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,26-Dec-16,789.91
MSFT,25-Dec-16,789.91
MSFT,24-Dec-16,789.91
MSFT,23-Dec-16,789.91
...
To do this, you would need to iterate through your dataframe using nested for loops. That would look something like:
for idx, row in df.iterrows():    # iterate rows
    for column in df.columns:     # and columns within each row
        do_something()
To give you an idea, the do_something() part of your code would probably be something like checking if there was a gap between the dates. Then you would copy the other columns from the row above and insert a new row using:
df.loc[row] = [2, 3, 4]   # adding a row
df.index = df.index + 1   # shifting index
df = df.sort_index()      # sorting by index
Hope this helped give you an idea of how you would solve this. Let me know if you want some more code!
This code uses standard routines.
from datetime import datetime, timedelta
Input lines will have to be split on commas, and the dates parsed in two places in the main part of the code. I have therefore put this work in a single function.
def glean(s):
    msft, date_part, amount = s.split(',')
    if date_part.find('-') == 1:
        date_part = '0' + date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount
Similarly, dates will have to be formatted for output with other pieces of data in a number of places in the main code.
def out(date, amount):
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))
with open('before.txt') as before:
I read the initial line of data on its own to establish the first date for comparison with the date in the next line.
    previous_date, previous_amount = glean(before.readline().strip())
    out(previous_date, previous_amount)
    for line in before.readlines():
        date, amount = glean(line.strip())
I calculate the elapsed time between the current line and the previous line, to know how many lines to output in place of missing lines.
        elapsed = previous_date - date
setting_date is decremented from previous_date for the number of days that elapsed without data. One line is emitted for each missing day, if there were any.
        setting_date = previous_date
        for i in range(-1 + elapsed.days):
            setting_date -= timedelta(days=1)
            out(setting_date, previous_amount)
Now the available line of data is output.
        out(date, amount)
Now previous_date and previous_amount are reset to reflect the new values, for use against the next line of data, if any.
        previous_date, previous_amount = date, amount
Output:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,248.71
MSFT,29-May-07,243.31
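For convenience, the interleaved fragments above assemble into one runnable script (a direct consolidation of the answer's own code; only indentation and comments are added):
from datetime import datetime, timedelta

def glean(s):
    # split a CSV line and parse its date, zero-padding single-digit days
    msft, date_part, amount = s.split(',')
    if date_part.find('-') == 1:
        date_part = '0' + date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount

def out(date, amount):
    # format a line for output, stripping the zero-padding again
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))

with open('before.txt') as before:
    previous_date, previous_amount = glean(before.readline().strip())
    out(previous_date, previous_amount)
    for line in before.readlines():
        date, amount = glean(line.strip())
        elapsed = previous_date - date            # days between consecutive rows
        setting_date = previous_date
        for i in range(-1 + elapsed.days):        # one line per missing day
            setting_date -= timedelta(days=1)
            out(setting_date, previous_amount)
        out(date, amount)
        previous_date, previous_amount = date, amount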
