Parsing dates and times from a large string into separate columns - python

I am trying to parse the date and time from a string column.
This is the original column (all one column):
description
4/18/2020 21:05 XXXXXXXXXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY ZZZZZZZZZZZZZZZZZ
my desired output:
spliced_date spliced_time
4/18/2020 21:05
I want to pull the date and time out into their own separate columns.

You could use str.split:
import pandas as pd
df = pd.DataFrame({'description': ['4/18/2020 21:05 XXXXXXXXXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY ZZZZZZZZZZZZZZZZZ']})
df[['Date','Time']] = df['description'].str.split(n=2, expand=True)[[0,1]]
df = df.drop(columns='description')
Output:
Date Time
0 4/18/2020 21:05

You can use named capture groups with str.extract:
pattern = r'(?P<spliced_date>[^\s]+)\s+(?P<spliced_time>[^\s]+)'
df = df['description'].str.extract(pattern)
print(df)
# Output:
spliced_date spliced_time
0 4/18/2020 21:05
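If you also want a real datetime value, here is a minimal sketch that builds on the str.extract idea above. It assumes every timestamp sits at the start of the string in M/D/YYYY H:MM form; the stricter pattern and the extra timestamp column are my own additions, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    "description": [
        "4/18/2020 21:05 XXXXXXXX YYYYYYYY ZZZZZZZZ",
        "no timestamp in this row",
    ]
})

# Assumption: the date/time prefix always looks like M/D/YYYY H:MM.
pattern = r"^(?P<spliced_date>\d{1,2}/\d{1,2}/\d{4})\s+(?P<spliced_time>\d{1,2}:\d{2})\b"
parts = df["description"].str.extract(pattern)

# Rows that do not match the pattern stay NaN and become NaT after parsing.
parts["timestamp"] = pd.to_datetime(
    parts["spliced_date"].str.cat(parts["spliced_time"], sep=" "),
    format="%m/%d/%Y %H:%M",
    errors="coerce",
)
print(parts)
```

Rows without a leading timestamp end up as NaN/NaT instead of picking up the first two words of the text, which is what a plain whitespace split would do.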

Related

Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY

I get a file every day with around 15 columns. Some days there are two date columns and some days only one. The date format also varies: on some days it is YYYY-MM-DD and on others DD-MM-YYYY. The task is to convert the one or two date columns to MM-DD-YYYY. Sample data from the CSV file for a few columns:
Execution_date  Extract_date  Requestor_Name  Count
2023-01-15      2023-01-15    John Smith      7
Sometimes we don't get the second column above (Extract_date):
Execution_date  Requestor_Name  Count
17-01-2023      Andrew Mill     3
The task is to find all the date columns in the file and change the date format to MM-DD-YYYY.
So the expected output for the two files above is:
Execution_date  Extract_date  Requestor_Name  Count
01-15-2023      01-15-2023    John Smith      7
Execution_date  Requestor_Name  Count
01-17-2023      Andrew Mill     3
I am using pandas and can't figure out how to deal with the missing second column on some days and the change of the date value format.
I can hardcode the two column names and change the format with:
df['Execution_date'] = pd.to_datetime(df['Execution_date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_date'] = pd.to_datetime(df['Extract_date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This will only work when the file has 2 columns and the values are in DD-MM-YYYY format.
I am looking for guidance on how to dynamically find the number of date columns and the date format so that I can use them in the two lines of code above. Any other solution would also work for me. I can use PowerShell if it can't be done in Python, but I am guessing Python offers more ways to do this.
The following loads a CSV file into a dataframe, checks each value (that is a str) to see if it matches one of the date formats, and if it does rearranges the date to the format you're looking for. Other values are untouched.
import pandas as pd
import re
df = pd.read_csv("today.csv")
# compile the patterns once, up front, instead of once per cell
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"  # DD-MM-YYYY -> MM-DD-YYYY
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"  # YYYY-MM-DD -> MM-DD-YYYY
def change_dt(value):
    # only strings that fully match one of the two layouts are rewritten
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value
# applymap is deprecated since pandas 2.1; use df.map(change_dt) there
new_df = df.applymap(change_dt)
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
You can use a function to check all column names that contain "date" and use .fillna to try other formats (add all possible formats).
import pandas as pd
def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3
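If the column names can't be trusted to contain "date", another option is to detect date columns from their contents. This is only a sketch under the assumption that the two layouts from the question are the only ones that occur; normalize_date_columns and CANDIDATE_FORMATS are names I made up for illustration.

```python
import pandas as pd

# Assumption: these are the only two layouts that ever appear in the files.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y")


def normalize_date_columns(df, out_format="%m-%d-%Y"):
    """Return a copy in which every column that fully parses as a date is reformatted."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric columns such as Count are skipped
        non_null = out[col].dropna().astype(str)
        if non_null.empty:
            continue
        for fmt in CANDIDATE_FORMATS:
            parsed = pd.to_datetime(non_null, format=fmt, errors="coerce")
            if parsed.notna().all():
                # Every value matched this layout, so treat it as a date column.
                out.loc[non_null.index, col] = parsed.dt.strftime(out_format)
                break
    return out


df1 = pd.DataFrame([{"Execution_date": "2023-01-15", "Extract_date": "2023-01-15",
                     "Requestor_Name": "John Smith", "Count": 7}])
df2 = pd.DataFrame([{"Execution_date": "17-01-2023",
                     "Requestor_Name": "Andrew Mill", "Count": 3}])
result1 = normalize_date_columns(df1)
result2 = normalize_date_columns(df2)
print(result1)
print(result2)
```

A column that mixes both layouts is left untouched by this sketch, and a file that is already in MM-DD-YYYY cannot be told apart from DD-MM-YYYY by inspection alone, so it is worth checking the source of each file if that can happen.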

Add a specific part of one column's values to another column

I have the following dataframe
import pandas as pd
data = {'existing_indiv': ['stac.Altered', 'MASO.MHD'], 'queries': ['modify', 'change']}
df = pd.DataFrame(data)
existing_indiv queries
0 stac.Altered modify
1 MASO.MHD change
I want to take the word before the period, together with the period itself, and prepend it to the values of the queries column.
Expected outcome:
existing_indiv queries
0 stac.Altered stac.modify
1 MASO.MHD MASO.change
Any ideas?
You can use .str.extract with the regex ^([^.]+\.) to extract everything up to and including the first period:
df.queries = df.existing_indiv.str.extract(r'^([^.]+\.)', expand=False) + df.queries
df
existing_indiv queries
0 stac.Altered stac.modify
1 MASO.MHD MASO.change
If you prefer .str.split:
df.existing_indiv.str.split('.').str[0] + '.' + df.queries
0 stac.modify
1 MASO.change
dtype: object
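One thing to watch for: if a value in existing_indiv has no period, the extract returns NaN and the concatenation turns that row's query into NaN as well. A small sketch of one way to guard against that, assuming you would rather keep the original query in such rows (the third row is made-up data for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "existing_indiv": ["stac.Altered", "MASO.MHD", "noperiod"],  # last row is hypothetical
    "queries": ["modify", "change", "update"],
})

# Prefix up to and including the first period; NaN where there is no period.
prefix = df["existing_indiv"].str.extract(r"^([^.]+\.)", expand=False)

# Replace missing prefixes with an empty string so the query is left as-is.
df["queries"] = prefix.fillna("") + df["queries"]
print(df)
```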

Split/Parse Values in One Column and create multiple Columns in Python

I have a column in which multiple fields are concatenated; they are delimited by . and :
Example: Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35 .Booked Date 2021-06-14
I have tried the following:
df["Details"].str.split(r" .|:", expand=True)
But I lose the decimal point, and the Amount doesn't match.
I want to parse the Details Column to:
|Details |Order ID |Record ID |Type |Amount |Booked Date |
|-------------------------------------------------------------------------------------------------------|---------------|-----------------------|-------|---------------|---------------|
|Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35 .Booked Date 2021-06-14 |0001ACW120I |01160000000UAxCCW |Small |4596.35 |2021-06-14 |
Thank you for your help and guidance
Hope this will give you the solution you want.
Original Data:
df = pd.DataFrame({'A': ['Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35 .Booked Date 2021-06-14']})
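Replace each " ." separator with ":" and then split on ":":
df = df['A'].replace(to_replace=r'\s[.]', value=':', regex=True).str.split(':', expand=True)
This gives the final dataset, with field names and values alternating across the columns. Keep the value columns and rename them. Note that "Booked Date 2021-06-14" stays in a single cell because it contains no colon.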
print(df)
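As an alternative that also separates the Booked Date and keeps the decimal in Amount, here is a sketch using str.extract with one named group per field. It assumes the fields always appear in this order and that Type is a single word; the group names are arbitrary and get renamed afterwards because group names cannot contain spaces.

```python
import pandas as pd

df = pd.DataFrame({"Details": [
    "Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small "
    ".Amount:4596.35 .Booked Date 2021-06-14"
]})

# One named group per field; the literal " ." separators are matched explicitly.
pattern = (
    r"Order ID:(?P<order_id>\S+)\s+"
    r"\.Record ID:(?P<record_id>\S+)\s+"
    r"\.Type:(?P<type>\S+)\s+"
    r"\.Amount:(?P<amount>\d+(?:\.\d+)?)\s+"
    r"\.Booked Date\s+(?P<booked_date>\d{4}-\d{2}-\d{2})"
)
parsed = df["Details"].str.extract(pattern)
parsed.columns = ["Order ID", "Record ID", "Type", "Amount", "Booked Date"]
parsed["Amount"] = pd.to_numeric(parsed["Amount"])  # keep the amount as a number

result = pd.concat([df, parsed], axis=1)
print(result)
```

Rows that do not follow the expected layout come back as NaN across all five columns, which makes malformed records easy to spot.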

Trouble with Finding Date Range in Pandas

I have a dataset with a list of subjects, a start date, and an end date. I'm trying to write a loop so that for each subject I get a list of the dates between the start date and the end date. I've tried many ways to do this based on previous posts but I'm still having issues.
An example of the dataframe:
Participant # Start_Date End_Date
1 23-04-19 25-04-19
An example of the output I want:
Participant # Range
1 23-04-19
1 24-04-19
1 25-04-19
Right now my code looks like this:
subjs_490 = tracksheet_490['Participant #']
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    start = temp_a['Start_Date']
    end = temp_a['End_Date']
    start_dates = pd.to_datetime(pd.Series(start), format='%d-%m-%y')
    end_dates = pd.to_datetime(pd.Series(end), format='%d-%m-%y')
    date_range = pd.date_range(start_dates, end_dates).tolist()
With this method I'm getting the following error:
Cannot convert input [1 2016-05-03 Name: Start_Date, dtype: datetime64[ns]] of type to Timestamp
Expanding ranges tends to be a slow process. You can create the date_range and then explode it to get what you want. Moving 'Participant #' to the index makes sure it's repeated for all rows that are exploded.
# assumes Start_Date and End_Date were already converted with pd.to_datetime
df = (df.set_index('Participant #')
        .apply(lambda x: pd.date_range(x.Start_Date, x.End_Date), axis=1)  # :( slow
        .rename('Range')
        .explode()
        .reset_index())
Participant # Range
0 1 2019-04-23
1 1 2019-04-24
2 1 2019-04-25
If you can't use explode another option is to create a separate DataFrame for each row and then concat them all together.
pd.concat([pd.DataFrame({'Participant #': par, 'Range': pd.date_range(start, end)})
           for par, start, end in zip(df['Participant #'], df['Start_Date'], df['End_Date'])],
          ignore_index=True)
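For completeness, here is a self-contained sketch that starts from the string dates in the question. It assumes the DD-MM-YY layout shown there; the second participant is made-up data to show a range that crosses a month boundary.

```python
import pandas as pd

df = pd.DataFrame({
    "Participant #": [1, 2],
    "Start_Date": ["23-04-19", "30-04-19"],  # second row is hypothetical
    "End_Date": ["25-04-19", "01-05-19"],
})

# Parse the strings first; pd.date_range needs scalar timestamps, not Series.
start = pd.to_datetime(df["Start_Date"], format="%d-%m-%y")
end = pd.to_datetime(df["End_Date"], format="%d-%m-%y")

# One list of dates per participant, indexed by participant so explode repeats it.
ranges = pd.Series(
    [pd.date_range(s, e).tolist() for s, e in zip(start, end)],
    index=pd.Index(df["Participant #"], name="Participant #"),
    name="Range",
)
long_df = ranges.explode().reset_index()

# Format back to the DD-MM-YY layout used in the question.
long_df["Range"] = pd.to_datetime(long_df["Range"]).dt.strftime("%d-%m-%y")
print(long_df)
```

The error in the question comes from passing whole Series to pd.date_range; iterating over the parsed values pairwise, as the zip does here, hands it one start and one end at a time.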

Multiple dataframe transformations based on a single column

I looked for a similar question but did not find a solution for what I want to do. Any help is welcome.
Here is the code to build an example of my DataFrame:
import pandas as pd
L = [[0.1998,'IN TIME,IN TIME','19708,19708','MR SD#5 W/Z SD#6 X/Y',20.5],
[0.3983,'LATE,IN TIME','11206,18054','MR SD#4 A/B SD#1 C/D',19.97]]
df = pd.DataFrame(L,columns=['Time','status','F_nom','info','Delta'])
Output:
I would like to create two new rows for each row in my main dataframe, based on the 'info' column.
As the 'info' column shows, each row in my main dataframe contains two different SD# values.
I would like to have only one SD# per row.
I would also like to keep the corresponding values of the columns Time, status, F_nom and Delta.
Finally, I want to create a new column 'type info' that contains the specific string for each SD# (W/Z, A/B, etc.), all while keeping the index of my main dataframe.
Here is the desired result:
I hope I was clear enough. Thank you in advance.
Use:
#split values by comma or whitespace
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
#select values by indexing
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
#reshape to Series
s = df.set_index(['Time','Delta']).stack()
#create new DataFrame and reshape to expected output
df1 = (pd.DataFrame(s.values.tolist(), index=s.index)
         .stack()
         .unstack(2)
         .reset_index(level=2, drop=True)
         .reset_index())
print(df1)
Time Delta status F_nom info type_info
0 0.1998 20.50 IN TIME 19708 SD#5 W/Z
1 0.1998 20.50 IN TIME 19708 SD#6 X/Y
2 0.3983 19.97 LATE 11206 SD#4 A/B
3 0.3983 19.97 IN TIME 18054 SD#1 C/D
Another solution:
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
from itertools import chain
lens = df['status'].str.len()
df = pd.DataFrame({
    'Time': df['Time'].values.repeat(lens),
    'status': list(chain.from_iterable(df['status'].tolist())),
    'F_nom': list(chain.from_iterable(df['F_nom'].tolist())),
    'info': list(chain.from_iterable(df['info'].tolist())),
    'Delta': df['Delta'].values.repeat(lens),
    'type_info': list(chain.from_iterable(df['type_info'].tolist())),
})
print(df)
Time status F_nom info Delta type_info
0 0.1998 IN TIME 19708 SD#5 20.50 W/Z
1 0.1998 IN TIME 19708 SD#6 20.50 X/Y
2 0.3983 LATE 11206 SD#4 19.97 A/B
3 0.3983 IN TIME 18054 SD#1 19.97 C/D
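On pandas 1.3 or later, a shorter route is to explode several list columns at once. This is a sketch of that approach rather than a replacement for the answers above; it assumes each row has the same number of entries in status, F_nom and info, and it keeps the original index (0, 0, 1, 1), which the question asked for.

```python
import pandas as pd

L = [[0.1998, 'IN TIME,IN TIME', '19708,19708', 'MR SD#5 W/Z SD#6 X/Y', 20.5],
     [0.3983, 'LATE,IN TIME', '11206,18054', 'MR SD#4 A/B SD#1 C/D', 19.97]]
df = pd.DataFrame(L, columns=['Time', 'status', 'F_nom', 'info', 'Delta'])

# Split 'info' on whitespace: token 0 is 'MR', then SD# / type pairs alternate.
tokens = df['info'].str.split()

out = df.assign(
    status=df['status'].str.split(','),
    F_nom=df['F_nom'].str.split(','),
    info=tokens.str[1::2],       # the SD# entries
    type_info=tokens.str[2::2],  # the matching W/Z, X/Y, ... entries
).explode(['status', 'F_nom', 'info', 'type_info'])  # multi-column explode: pandas >= 1.3
print(out)
```

Call out.reset_index(drop=True) afterwards if you prefer a fresh 0..n-1 index instead of the repeated original one.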
