Split/Parse Values in One Column and Create Multiple Columns in Python

I have a column in which multiple fields are concatenated; the fields are delimited by . and :
Example: Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35 .Booked Date 2021-06-14
I have tried the following:
df["Details"].str.split(r" .|:", expand=True)
But this loses the decimal point, so the Amount no longer matches.
I want to parse the Details Column to:
| Details | Order ID | Record ID | Type | Amount | Booked Date |
|---------|----------|-----------|------|--------|-------------|
| Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35 .Booked Date 2021-06-14 | 0001ACW120I | 01160000000UAxCCW | Small | 4596.35 | 2021-06-14 |
Thank you for your help and guidance.

Hope this will give you the solution you want.
Original Data:
import pandas as pd

df = pd.DataFrame({'A': ['Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35 .Booked Date 2021-06-14']})
Replace the space-plus-dot delimiter with a colon, then split on the colon:
df = df['A'].replace(to_replace=r'\s[.]', value=':', regex=True).str.split(':', expand=True)
Final dataset; rename the columns as needed.
print(df)
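Alternatively, a single regex with named capture groups parses everything in one pass and never splits on the decimal point. A minimal sketch, assuming every row follows the layout of the sample string (note that Booked Date is separated by a space rather than a colon):

import pandas as pd

df = pd.DataFrame({'Details': ['Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35 .Booked Date 2021-06-14']})

# Each named group becomes a column; the Amount group allows digits and
# the decimal point, so 4596.35 survives intact
pattern = (r'Order ID:(?P<Order_ID>\S+) \.'
           r'Record ID:(?P<Record_ID>\S+) \.'
           r'Type:(?P<Type>\S+) \.'
           r'Amount:(?P<Amount>[\d.]+) \.'
           r'Booked Date (?P<Booked_Date>\S+)')
df = df.join(df['Details'].str.extract(pattern))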

Related

Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY

I get a file every day with around 15 columns. Some days there are two date columns and some days only one. Also, the date format is YYYY-MM-DD on some days and DD-MM-YYYY on others. The task is to convert the one or two date columns to MM-DD-YYYY. Sample data in the csv file for a few columns:
Execution_date  Extract_date  Requestor_Name  Count
2023-01-15      2023-01-15    John Smith      7
Sometimes we don't get the second column above (Extract_date):
Execution_date  Requestor_Name  Count
17-01-2023      Andrew Mill     3
The task is to find all the date columns in the file and change the date format to MM-DD-YYYY.
So the sample output for the two files above will be:
Execution_date  Extract_date  Requestor_Name  Count
01-15-2023      01-15-2023    John Smith      7
Execution_date  Requestor_Name  Count
01-17-2023      Andrew Mill     3
I am using pandas and can't figure out how to deal with the missing second column on some days and the change of the date value format.
I can hardcode the two column names and change the format with:
df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This only works when the file has both columns and the values are in DD-MM-YYYY format.
I'm looking for guidance on how to dynamically find the number of date columns and the date value format, so that I can use them in the two lines above. If not, any other solution would also work for me. I could use PowerShell if it can't be done in Python, but I am guessing Python offers far more avenues for this than PowerShell does.
The following loads a CSV file into a dataframe, checks each string value to see if it matches one of the date formats, and if it does, rearranges the date into the format you're looking for. Other values are left untouched.
import pandas as pd
import re

df = pd.read_csv("today.csv")

# compiling the patterns ahead of time saves a lot of processing power later
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"

def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value

new_df = df.applymap(change_dt)
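Note that DataFrame.applymap was deprecated in pandas 2.1 in favour of the element-wise DataFrame.map, so on recent versions the last line would be new_df = df.map(change_dt).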
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the applymap line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
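For example, a minimal sketch of that conversion, reusing the cols list from the snippet above and assuming the rearranged values are now MM-DD-YYYY strings:

for col in cols:
    if col in df.columns:
        # Parse the rearranged strings into a real datetime dtype
        df[col] = pd.to_datetime(df[col], format='%m-%d-%Y')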
You can use a function that checks every column whose name contains "date" and uses .fillna to try other formats (add all the formats you expect):
import pandas as pd

def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3

How to filter dataframe only by month and year?

I want to select many cells which are filtered only by month and year. For example, there are cells 01.01.2017, 15.01.2017, 03.02.2017 and 15.02.2017. I want to group these cells by looking only at the month and year information: if they are in January, they should be grouped together.
Output Expectation:
01.01.2017 ---- 1
15.01.2017 ---- 1
03.02.2017 ---- 2
15.02.2017 ---- 2
Edit: I have two datasets in different Excel files, as you can see below.
[first data screenshot]
[second data screenshot]
What I'm trying to do is get the 'Su Seviye' data for every 'DH_ID' separately from the first dataset, and then paste that data into the 'Kuyu Yüksekliği' column of the second dataset. But the problems are that every 'DH_ID' is in a different sheet, and that while the first dataset has only month and year information, the second dataset additionally has day information. How can I write this kind of code?
import pandas as pd
df = pd.read_excel('...Gözlem kuyu su seviyeleri- 2017.xlsx', sheet_name= 'GÖZLEM KUYULARI1', header=None)
df2 = pd.read_excel('...YERALTI SUYU GÖZLEM KUYULARI ANALİZ SONUÇLAR3.xlsx', sheet_name= 'HJ-8')
HJ8 = df.iloc[:, [0,5,7,9,11,13,15,17,19,21,23,25,27,29]]
##writer = pd.ExcelWriter('yıllarsuseviyeler.xlsx')
##HJ8.to_excel(writer)
##writer.save()
rb = pd.read_excel('...yıllarsuseviyeler.xlsx')
rb.loc[0,7]='01.2022'
rb.loc[0,9]='02.2022'
rb.loc[0,11]='03.2022'
rb.loc[0,13]='04.2022'
rb.loc[0,15]='05.2021'
rb.loc[0,17]='06.2022'
rb.loc[0,19]='07.2022'
rb.loc[0,21]='08.2022'
rb.loc[0,23]='09.2022'
rb.loc[0,25]='10.2022'
rb.loc[0,27]='11.2022'
rb.loc[0,29]='12.2022'
You can see what I have done above.
First, convert the date column to a datetime object, then get the year and month part with to_period, and finally get the group number with ngroup():
df['group'] = df.groupby(pd.to_datetime(df['date'], format='%d.%m.%Y').dt.to_period('M')).ngroup() + 1
date group
0 01.01.2017 1
1 15.01.2017 1
2 03.02.2017 2
3 15.02.2017 2
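If the goal is to filter rather than just label the groups, the same to_period conversion works as a boolean mask. A minimal sketch, assuming the same date column format:

import pandas as pd

df = pd.DataFrame({'date': ['01.01.2017', '15.01.2017', '03.02.2017', '15.02.2017']})

# Convert once, then compare against the month you want
periods = pd.to_datetime(df['date'], format='%d.%m.%Y').dt.to_period('M')
january_2017 = df[periods == pd.Period('2017-01')]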

Parsing dates and times from a large string into separate columns

I am trying to parse the date and time from a string column.
This is the original column (all one column):
description
4/18/2020 21:05 XXXXXXXXXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY ZZZZZZZZZZZZZZZZZ
my desired output:
spliced_date spliced_time
4/18/2020 21:05
I am looking to pick out the date and time into their own separate columns.
You could use str.split:
import pandas as pd

df = pd.DataFrame({'description': ['4/18/2020 21:05 XXXXXXXXXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY ZZZZZZZZZZZZZZZZZ']})
df[['Date','Time']] = df['description'].str.split(n=2, expand=True)[[0,1]]
df = df.drop(columns='description')
Output:
Date Time
0 4/18/2020 21:05
You can use named capture groups with str.extract:
pattern = r'(?P<spliced_date>[^\s]+)\s+(?P<spliced_time>[^\s]+)'
df = df['description'].str.extract(pattern)
print(df)
# Output:
spliced_date spliced_time
0 4/18/2020 21:05
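Either way, the extracted values are still strings. If you need real datetime and time dtypes, one possible follow-up (a sketch, assuming the column names from the extract above):

# Parse the date strings; %m/%d/%Y accepts the non-padded month
df['spliced_date'] = pd.to_datetime(df['spliced_date'], format='%m/%d/%Y')
# Parse the times and keep only the time-of-day component
df['spliced_time'] = pd.to_datetime(df['spliced_time'], format='%H:%M').dt.time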

How to split a pandas data frame column into multiple columns based on the text value contained

Let's say there is a column like the one below.
import pandas as pd

df = pd.DataFrame(['A-line B-station 9-min C-station 3-min',
                   'D-line E-station 8-min F-line G-station 5-min',
                   'G-line H-station 1-min I-station 6-min J-station 8-min'],
                  columns=['station'])
A, B, C are just arbitrary characters, and there are a whole bunch of rows like this.
station
0 A-line B-station 9-min C-station 3-min
1 D-line E-station 8-min F-line G-station 5-min
2 G-line H-station 1-min I-station 6-min J-stati...
How can we make columns like below?
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station null null null
1 D-line E-station null null F-line G-station
2 G-line H-station I-station J-station null null
StationX-Y means station Y on line X:
Station1-1 means the first station on the first line (Line1)
Station1-2 means the second station on the first line (Line1)
Station2-1 means the first station on the second line (Line2)
I tried to split by delimiter; however, it doesn't work since every row has a different number of lines and stations.
What I may need is to split the column based on the text each value contains. For example, I could store the first '-line' value in Line1 and the first '-station' value in Station1-1.
Does anybody have any ideas how to do this?
Any small thoughts help me!
Thank you!
First, create a Series with Series.str.split and DataFrame.stack:
s = df['station'].str.split(expand=True).stack()
Then remove the values ending with 'min' by boolean indexing with Series.str.endswith:
df1 = s[~s.str.endswith('min')].to_frame('data').rename_axis(('a','b'))
Then create counters for lines and for station rows with filtering and GroupBy.cumcount:
df1['Line'] = (df1[df1['data'].str.endswith('line')]
                  .groupby(level=0)
                  .cumcount()
                  .add(1)
                  .astype(str))
df1['Line'] = df1['Line'].ffill()
df1['station'] = (df1[df1['data'].str.endswith('station')]
                     .groupby(['a','Line'])
                     .cumcount()
                     .add(1)
                     .astype(str))
Join the two counters into one Series, replacing missing values with df1['Line'] via Series.fillna:
df1['station'] = (df1['Line'] + '-' + df1['station']).fillna(df1['Line'])
Reshape with DataFrame.set_index and DataFrame.unstack:
df1 = df1.set_index('station', append=True)['data'].reset_index(level=1, drop=True).unstack()
Rename the columns (not earlier, to avoid a wrong sort order):
df1 = df1.rename(columns = lambda x: 'Station' + x if '-' in x else 'Line' + x)
Remove the column and index names:
df1.columns.name = None
df1.index.name = None
print (df1)
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station NaN NaN NaN
1 D-line E-station NaN NaN F-line G-station
2 G-line H-station I-station J-station NaN NaN
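For reference, here are the steps above assembled into one runnable script:

import pandas as pd

df = pd.DataFrame(['A-line B-station 9-min C-station 3-min',
                   'D-line E-station 8-min F-line G-station 5-min',
                   'G-line H-station 1-min I-station 6-min J-station 8-min'],
                  columns=['station'])

# One token per row, indexed by (original row, token position)
s = df['station'].str.split(expand=True).stack()
df1 = s[~s.str.endswith('min')].to_frame('data').rename_axis(('a', 'b'))

# Number the lines within each original row, then propagate to the stations
df1['Line'] = (df1[df1['data'].str.endswith('line')]
               .groupby(level=0).cumcount().add(1).astype(str))
df1['Line'] = df1['Line'].ffill()

# Number the stations within each (row, line) pair
df1['station'] = (df1[df1['data'].str.endswith('station')]
                  .groupby(['a', 'Line']).cumcount().add(1).astype(str))
df1['station'] = (df1['Line'] + '-' + df1['station']).fillna(df1['Line'])

# Pivot to wide format and build the final column names
df1 = (df1.set_index('station', append=True)['data']
       .reset_index(level=1, drop=True).unstack())
df1 = df1.rename(columns=lambda x: 'Station' + x if '-' in x else 'Line' + x)
df1.columns.name = None
df1.index.name = None
print(df1)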

Compare two columns in two csv files in Python

I have two csv files with the same column names:
In file1 I have all the people who took a test, with their status (passed/missed)
In file2 I only have those who missed the test
I'd like to compare file1.column1 and file2.column1
If they match then compare file1.column4 and file2.column4
If they are different remove item line from file2
I can't figure out how to do that.
I looked into pandas but didn't manage to make anything work.
What I have is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
Doe;25/03/1958;garage;Missed;02/04/2019
What I want to get is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Doe;25/03/1958;garage;Missed;02/04/2019
First, load both files:
import pandas as pd
df1 = pd.read_csv('file1.csv',delimiter=';')
df2 = pd.read_csv('file2.csv',delimiter=';')
Clean the data frames, because of whitespace found in the data:
df1.columns= df1.columns.str.strip()
df2.columns= df2.columns.str.strip()
# Assuming only strings
df1 = df1.apply(lambda column: column.str.strip())
df2 = df2.apply(lambda column: column.str.strip())
The expected solution, assuming that name is unique.
Merge the files:
new_merged_df = df2.merge(df1[['name','test status']],'left',on=['name'],suffixes=('','file1'))
DataFrame Result:
name DOB service test status test date test statusfile1
0 Smith 12/12/2012 compta Missed 01/01/2019 Missed
1 Smith 12/12/2012 compta Missed 01/01/2019 Passed
2 Doe 25/03/1958 garage Missed 02/04/2019 Missed
Filter based on the requirements, removing the rows whose name has a different test status in the two files:
filter = new_merged_df['test status'] != new_merged_df['test statusfile1']
# Check if there are different values
if len(new_merged_df[filter]) > 0:
    drop_names = list(new_merged_df[filter]['name'])
    # Removing the values that we don't want
    new_merged_df = new_merged_df[~new_merged_df['name'].isin(drop_names)]
Remove the extra column and store the result:
# Saving as a file with the same schema as file2
new_merged_df.drop(columns=['test statusfile1'],inplace=True)
new_merged_df.to_csv('file2.csv', sep=';', index=False)
Result
name DOB service test status test date
2 Doe 25/03/1958 garage Missed 02/04/2019
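As an aside: if the requirement boils down to "drop from file2 anyone who has a Passed result anywhere in file1", a shorter anti-join sketch (under the same assumption that name is unique per person):

# Names with at least one Passed result in file1
passed = df1.loc[df1['test status'] == 'Passed', 'name']

# Keep only the file2 rows whose name never passed
df2 = df2[~df2['name'].isin(passed)]
df2.to_csv('file2.csv', sep=';', index=False)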
