Compare two columns in two CSV files in Python

I have two CSV files with the same column names:
In file1 I have everyone who took a test, along with their status (Passed/Missed).
In file2 I only have those who missed the test.
I'd like to compare file1.column1 and file2.column1.
If they match, then compare file1.column4 and file2.column4.
If those are different, remove that line from file2.
I can't figure out how to do that.
I looked at pandas but didn't manage to get anything working.
What I have is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
Doe;25/03/1958;garage;Missed;02/04/2019
What I want to get is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Doe;25/03/1958;garage;Missed;02/04/2019

First, open both files:
import pandas as pd
df1 = pd.read_csv('file1.csv',delimiter=';')
df2 = pd.read_csv('file2.csv',delimiter=';')
Clean up the data frames, since there is stray whitespace:
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
# Assuming all columns contain strings
df1 = df1.apply(lambda column: column.str.strip())
df2 = df2.apply(lambda column: column.str.strip())
The solution below assumes that name is unique per person.
Merging the files
new_merged_df = df2.merge(df1[['name','test status']],'left',on=['name'],suffixes=('','file1'))
DataFrame Result:
name DOB service test status test date test statusfile1
0 Smith 12/12/2012 compta Missed 01/01/2019 Missed
1 Smith 12/12/2012 compta Missed 01/01/2019 Passed
2 Doe 25/03/1958 garage Missed 02/04/2019 Missed
Filter based on the requirement and drop every name whose test status differs between the two files.
filter = new_merged_df['test status'] != new_merged_df['test statusfile1']
# Check if there are any differing values
if len(new_merged_df[filter]) > 0:
    drop_names = list(new_merged_df[filter]['name'])
    # Remove the names we don't want
    new_merged_df = new_merged_df[~new_merged_df['name'].isin(drop_names)]
Remove the helper column and store the result:
# Saving a file with the same schema as file2
new_merged_df.drop(columns=['test statusfile1'], inplace=True)
new_merged_df.to_csv('file2.csv', sep=';', index=False)
Result
name DOB service test status test date
2 Doe 25/03/1958 garage Missed 02/04/2019
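For reference, here is the same idea compressed into a short sketch. It assumes, as above, that name is the key, that the files contain no stray whitespace, and that a row should be dropped from file2 whenever file1 contains a row for that name with a different test status:
import pandas as pd

df1 = pd.read_csv('file1.csv', sep=';')
df2 = pd.read_csv('file2.csv', sep=';')

# Pair every file2 row with every file1 row for the same name
merged = df2.merge(df1[['name', 'test status']], on='name', suffixes=('', '_file1'))

# Names whose status in file1 differs from the status recorded in file2
conflicting = merged.loc[merged['test status'] != merged['test status_file1'], 'name'].unique()

df2[~df2['name'].isin(conflicting)].to_csv('file2.csv', sep=';', index=False)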

Related

Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY

I get a file every day with around 15 columns. Some days there are two date columns and some days only one. Also, the date format is YYYY-MM-DD on some days and DD-MM-YYYY on others. The task is to convert the one or two date columns to MM-DD-YYYY. Sample data in the CSV file for a few columns:
Execution_date  Extract_date  Requestor_Name  Count
2023-01-15      2023-01-15    John Smith      7
Sometimes we don't get the second column above (Extract_date):
Execution_date  Requestor_Name  Count
17-01-2023      Andrew Mill     3
The task is to find all the date columns in the file and change the date format to MM-DD-YYYY.
So the sample output of the above two files will be:
Execution_date  Extract_date  Requestor_Name  Count
01-15-2023      01-15-2023    John Smith      7

Execution_date  Requestor_Name  Count
01-17-2023      Andrew Mill     3
I am using pandas and can't figure out how to deal with the missing second column on some days and the changing date format.
I can hardcode the two column names and change the format with:
df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This only works when the file has both columns and the values are in DD-MM-YYYY format.
I'm looking for guidance on how to dynamically find the number of date columns and the date value format so that I can use them in the two lines above; any other solution would also work. I could use PowerShell if it can't be done in Python, but I'm guessing Python offers far more avenues for this than PowerShell.
The following loads a CSV file into a DataFrame, checks each string value to see whether it matches one of the two date formats, and if it does, rearranges the date into the format you're looking for. Other values are left untouched.
import pandas as pd
import re
df = pd.read_csv("today.csv")
# compiling the patterns ahead of time saves a lot of processing power later
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"
def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value

new_df = df.applymap(change_dt)
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
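For example, a minimal sketch of that conversion, assuming the two column names from the question and that the values are now MM-DD-YYYY strings after the step above:
for col in ("Execution_date", "Extract_date"):
    if col in df.columns:
        # parse the reformatted strings into real datetime64 values
        df[col] = pd.to_datetime(df[col], format="%m-%d-%Y")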
You can use a function that checks all column names containing "date" and uses .fillna to try other formats (add every format that can occur).
import pandas as pd
def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3
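If you would rather infer the incoming format than coerce twice, here is a hedged sketch that guesses the format per column from its first non-null value. It assumes every "date" column contains either YYYY-MM-DD or DD-MM-YYYY strings and has at least one non-null entry; the helper name reformat_dates is just illustrative:
import pandas as pd

def reformat_dates(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns[df.columns.str.contains("date", case=False)]:
        sample = str(df[col].dropna().iloc[0])
        # a leading 4-digit year means YYYY-MM-DD, otherwise assume DD-MM-YYYY
        fmt = "%Y-%m-%d" if sample[:4].isdigit() else "%d-%m-%Y"
        df[col] = pd.to_datetime(df[col], format=fmt).dt.strftime("%m-%d-%Y")
    return df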

In Python, if there is a duplicate, use the date column to choose which duplicate to use

I have code that runs 16 test cases against a CSV, checking for anomalies from poor data entry. A new column, 'Test Case Failed', is created; when a row fails a test, a number corresponding to that test is added to this column. The failed rows are separated from the passed rows and sent back to be corrected before they are uploaded into a database.
There are duplicates in my data, and I would like to add code to check for duplicates, then decide what field to use based on the date, selecting the most updated fields.
Here is my data, with two duplicate IDs: the first row has the most recent address while the second row has the most recent name.
ID   MnLast  MnFist  MnDead?  MnInactive?  SpLast  SpFirst  SPInactive?  SpDead  Addee         Sal       Address    NameChanged  AddrChange
123  Doe     John    No       No           Doe     Jane     No           No      Mr. John Doe  Mr. John  123 place  05/01/2022   11/22/2022
123  Doe     Dan     No       No           Doe     Jane     No           No      Mr. John Doe  Mr. John  789 road   11/01/2022   05/06/2022
Here is a snippet of my code showing the 5th test case, which checks for the following: the record has name information, the spouse has name information, and no one is marked deceased, but the addressee or salutation doesn't contain "&" or "AND". The addressee or salutation needs to be corrected, since this record is married.
import pandas as pd
import numpy as np
data = pd.read_csv("C:/Users/file.csv", encoding='latin-1' )
# Create array to store which test number the row failed
data['Test Case Failed']= ''
data = data.replace(np.nan,'',regex=True)
data.insert(0, 'ID', range(0, len(data)))
# There are several test cases, but they function primarily the same
# Testcase 1
# Testcase 2
# Testcase 3
# Testcase 4
# Testcase 5 - comparing strings in columns
df = data[((data['FirstName']!='') & (data['LastName']!='')) &
((data['SRFirstName']!='') & (data['SRLastName']!='') &
(data['SRDeceased'].str.contains('Yes')==False) & (data['Deceased'].str.contains('Yes')==False)
)]
df1 = df[df['PrimAddText'].str.contains("AND|&")==False]
data_5 = df1[df1['PrimSalText'].str.contains("AND|&")==False]
ids = data_5.index.tolist()
# Assign 5 for each failed
for i in ids:
    data.at[i, 'Test Case Failed'] += ', 5'
# Failed if column 'Test Case Failed' is not empty, Passed if empty
failed = data[(data['Test Case Failed'] != '')]
passed = data[(data['Test Case Failed'] == '')]
failed['Test Case Failed'] = failed['Test Case Failed'].str[1:]
failed = failed[(failed['Test Case Failed'] != '')]
# Clean up
del failed["ID"]
del passed["ID"]
failed['Test Case Failed'].value_counts()
# Print to console
print("There was a total of",data.shape[0], "rows.", "There was" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")
# output two files
failed.to_csv("C:/Users/Failed.csv", index = False)
passed.to_csv("C:/Users/Passed.csv", index = False)
What is the best approach to check for duplicates, choose the most updated fields, drop the outdated fields/row, and perform my test?
First, set up a mapping that associates each update-date column with its corresponding value columns.
date2val = {"AddrChange": ["Address"], "NameChanged": ["MnFist", "MnLast"], ...}
Then, transform date columns into datetime format to be able to compare them (using argmax later).
for key in date2val.keys():
    failed[key] = pd.to_datetime(failed[key])
Then group the duplicates by ID (since ID decides whether a row is a duplicate). Within each group, for each date column, find the row holding the maximum (most recent) date and copy the corresponding value columns from the mapping onto the group's last row, which is kept as the final corrected record (appended to the corrected list).
corrected = list()
for _, grp in failed.groupby("ID"):
    latest = grp.iloc[-1].copy()            # start from the last row of the group
    for key in date2val.keys():
        recent = grp[key].argmax()          # position of the most recent date
        for col in date2val[key]:
            latest[col] = grp.iloc[recent][col]
    corrected.append(latest)
corrected = pd.DataFrame(corrected)
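A more compact alternative sketch with the same intent (it assumes failed still has its ID column, that the date columns were already converted with pd.to_datetime as above, and that the names deduped and latest are just illustrative): sort by each date column, take the last row per ID, and copy the mapped value columns across.
deduped = failed.drop_duplicates(subset="ID", keep="last").set_index("ID")
for date_col, val_cols in date2val.items():
    # row with the most recent date per ID for this particular date column
    latest = failed.sort_values(date_col).drop_duplicates("ID", keep="last").set_index("ID")
    for col in val_cols:
        deduped[col] = latest[col]          # Series assignment aligns on the ID index
    deduped[date_col] = latest[date_col]
deduped = deduped.reset_index()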
Preparing data:
import pandas as pd
c = 'ID MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead Addee Sal Address NameChanged AddrChange'.split()
data1 = '123 Doe John No No Doe Jane No No Mr.JohnDoe Mr.John 123place 05/01/2022 11/22/2022'.split()
data2 = '123 Doe Dan No No Doe Jane No No Mr.JohnDoe Mr.John 789road 11/01/2022 05/06/2022'.split()
data3 = '8888 Brown Peter No No Brwon Peter No No Mr.PeterBrown M.Peter 666Avenue 01/01/2011 01/01/2011'.split()
df = pd.DataFrame(columns = c, data = [data1, data2, data3])
df['AddrChange'] = pd.to_datetime(df['AddrChange'], format='%m/%d/%Y')
df['NameChanged'] = pd.to_datetime(df['NameChanged'], format='%m/%d/%Y')
df
This gives a DataFrame like the example in the question, plus an extra row for ID 8888.
Then take a slice of the DataFrame so the original is not modified. After sorting, adjacent rows share the same ID and the first one holds the most recent name:
df1 = df[['ID', 'MnFist', 'NameChanged']].sort_values(by=['ID', 'NameChanged'], ascending = False)
df1
Then build a dictionary with df.ID as the key and the appropriate name as the value; the goal is to rebuild the whole MnFist column:
d = {}
for id in set(df.ID.values):
    df_mask = df1.ID == id                  # filter only rows with the same id
    filtered_df = df1[df_mask]
    if len(filtered_df) <= 1:
        d[id] = filtered_df.iat[0, 1]       # id has only one row, so no changes
        continue
    for name in filtered_df.MnFist:
        if name in ['unknown', '', ' '] or name is None:  # discard unusable names
            continue
        else:
            d[id] = name                    # found a usable (most recent) name
            break
    if id not in d.keys():
        d[id] = filtered_df.iat[0, 1]       # no usable name, so pick the first
print(d)
The partial output of the dictionary is:
{'8888': 'Peter', '123': 'Dan'}
Then rebuild the whole column:
df.MnFist = [d[id] for id in df.ID]
df
MnFist now holds 'Dan' for both ID 123 rows and 'Peter' for ID 8888.
Then apply the same procedure to the Address column:
df1 = df[['ID', 'Address', 'AddrChange']].sort_values(by=['ID', 'AddrChange'], ascending = False)
df1
d = { id: df1.loc[df1.ID == id, 'Address'].values[0] for id in set(df.ID.values) }
d
df.Address = [d[id] for id in df.ID]
df
The final DataFrame now holds the most recent name and address for each ID.
Edited after the author commented on the possibility of unknown or unusable data.
Let me restate what I understood from the question:
You have a dataset on which you are doing several sanity checks. (It looks like you already have everything in place for this step.)
In the next step you find duplicate rows whose columns were updated on different dates. (I assume you already have this.)
Now you want a new dataset with non-duplicated rows, where the fields come from the latest date entries.
First, define the different date columns and their related value columns in the form of a dictionary:
date_to_cols = {"AddrChange": "Address", "NameChanged": ["MnLast", "MnFist"]}
Next, group by "ID" and get the index of the maximum value of each date column. Once we have those indices, we can pull the related fields for each date from the data.
data[list(date_to_cols.keys())] = data[list(date_to_cols.keys())].astype('datetime64[ns]')
latest_data = data.groupby('ID')[list(date_to_cols.keys())].idxmax().reset_index()
for date_field, cols_to_update in date_to_cols.items():
    latest_data[cols_to_update] = latest_data[date_field].apply(lambda x: data.iloc[x][cols_to_update])
    latest_data[date_field] = latest_data[date_field].apply(lambda x: data.iloc[x][date_field])
Next, you can merge latest_data with the original data (after removing the old columns from it):
cols_to_drop = list(latest_data.columns)
cols_to_drop.remove("ID")
data.drop(columns= cols_to_drop, inplace=True)
latest_data_all_fields = data.merge(latest_data, on="ID", how="left")
latest_data_all_fields.drop_duplicates(inplace=True)

Pandas reading tall data into a DataFrame

I have a text file consisting of tall data. I want to iterate through each line of the text file and create a DataFrame.
The text file looks like this. Note that the same fields don't exist for all users (e.g. some might have an email field and some might not), and that each user record is separated by [User]:
[User]
Field=Data
employeeNo=123
last_name=Toole
first_name=Michael
language=english
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
My issue is as follows:
My code iterates through the data, but for any field that is not present for a user it walks down the list and takes the next piece of data relating to that field.
For example, the first user does not have an email associated with him, so the code assigns him the email of the second user in the list; what I want instead is to return NaN/N/A/blank when no information is available.
## Import Libraries
import pandas as pd
import numpy as np
from pandas import DataFrame
## Import Data
## Set column names so that no lines in the text file are missed
col_names = ['Field', 'Data']
## If you have been sent this script you need to change the file path below, change it to where you have the .txt file saved
textFile = pd.read_csv(r'Desktop\SampleData.txt', delimiter="=", engine='python', names=col_names)
## Get a list of the unique IDs
new_cols = pd.unique(textFile['Field'])
userListing_DF = pd.DataFrame()
## Create a for loop to iterate through the first column and get the unique columns, then concatenate those unique values with data
for col in new_cols:
    tmp = textFile[textFile['Field'] == col]
    tmp.reset_index(inplace=True)
    userListing_DF = pd.concat([userListing_DF, tmp['Data']], axis=1)
userListing_DF.columns = new_cols
Read in the single long column, then form a group indicator by checking where the value is '[User]'. Next, separate the column labels and values with a str.split and join them back to your DataFrame. Finally, pivot to your desired shape.
df = pd.read_csv('test.txt', sep='\n', header=None)
df['Group'] = df[0].eq('[User]').cumsum()
df = df[df[0].ne('[User]')] # No longer need these rows
df = pd.concat([df, df[0].str.split('=', expand=True).rename(columns={0: 'col', 1: 'val'})],
axis=1)
df = df.pivot(index='Group', columns='col', values='val').rename_axis(columns=None)
Field Location department email employeeNo first_name language last_name role
Group
1 Data NaN Marketing NaN 123 Michael english Toole Marketing Lead
2 NaN Spain Data Science juan.ronaldo#sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN NaN damian.lee#email.com 998 Damian english Lee NaN
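If you prefer to avoid the pivot entirely, a plain-Python parse is another hedged option: build one dict per [User] block and let pandas fill the missing fields with NaN. This sketch assumes the file is called 'test.txt' (as above) and that every record starts with a '[User]' line:
import pandas as pd

records, current = [], None
with open('test.txt') as fh:
    for line in fh:
        line = line.strip()
        if line == '[User]':
            if current:                      # close the previous record
                records.append(current)
            current = {}
        elif '=' in line:
            key, _, value = line.partition('=')
            current[key.strip()] = value.strip()
if current:                                  # don't lose the last record
    records.append(current)

df = pd.DataFrame(records)                   # missing fields become NaN automatically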

How to join columns in CSV files using Pandas in Python

I have a CSV file that looks something like this:
# data.csv (this line is not there in the file)
Names, Age, Names
John, 5, Jane
Rian, 29, Rath
And when I read it through Pandas in Python I get something like this:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
And the output of the program is:
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
Is there any way to get:
Names Age
0 John 5
1 Rian 29
2 Jane
3 Rath
First, I'd suggest having unique names for each column. Either go into the CSV file and change a column header, or do so in pandas.
Using 'Names2' as the header of the column with the second occurrence of the same column name, try this:
Starting from
datalist = [['John', 5, 'Jane'], ['Rian', 29, 'Rath']]
df = pd.DataFrame(datalist, columns=['Names', 'Age', 'Names2'])
We have
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
So, use:
dff = (pd.concat([pd.concat([df['Names'], df['Names2']]).reset_index(drop=True),
                  df.iloc[:, 1]], ignore_index=True, axis=1)
         .fillna('')
         .rename(columns=dict(enumerate(['Names', 'Age']))))
to get your desired result.
From the inside out:
The inner pd.concat stacks the two name columns on top of each other into one long column.
The outer pd.concat( ... ) places that combined column next to the Age column.
To discover what the other calls do, I suggest removing them one by one and looking at the results.
Please forgive the formatting of dff; it is split across lines to make each step clear from an educational perspective.
You can use:
usecols, which reads only the selected columns.
low_memory=False, so the file is not processed internally in chunks (this avoids mixed-dtype guessing on large files).
import pandas as pd
data = pd.read_csv("data.csv", usecols = ['Names','Age'], low_memory = False))
print(data)
Note that this assumes the column names in your CSV are unique.
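Another hedged option is to let pandas handle the duplicate header itself (by default it renames the second Names column to 'Names.1') and then stack the two name columns. This sketch assumes the file looks exactly like the sample above; skipinitialspace deals with the stray spaces after the commas:
import pandas as pd

df = pd.read_csv("data.csv", skipinitialspace=True)       # columns: Names, Age, Names.1
names = pd.concat([df["Names"], df["Names.1"]], ignore_index=True)
out = pd.DataFrame({"Names": names, "Age": df["Age"]})     # Age is NaN for the extra rows
print(out.fillna(""))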

Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

Similar to this question, but my CSV has a slightly different format. Here is an example:
id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
I think the double quotation mark in the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:
id employee details createdAt Unnamed: 1 Unnamed: 2
1 John {Country":"USA" Salary:5000 Review:null}" 2018-09-01
2 Sarah {Country":"Australia" Salary:6000 Review:"Hardworking"}" 2018-09-05
My desired output:
id employee details createdAt
1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
2 Sarah {"Country":"Australia","Salary":6000,"Review":"Hardworking"} 2018-09-05
I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?
As an alternative approach, you could read the file in manually, parse each row correctly, and use the resulting data to construct the DataFrame. This works by splitting each row both forwards and backwards to get the non-problematic columns, then taking the remaining part:
import pandas as pd
data = []
with open("e1.csv") as f_input:
for row in f_input:
row = row.strip()
split = row.split(',', 2)
rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
data.append(split[0:2] + rsplit)
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This would display your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
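If you then need the details column as actual dictionaries rather than JSON strings, a small follow-up (assuming the fragments are valid JSON once the outer quotes are stripped, as in the sample) would be:
import json
df["details"] = df["details"].apply(json.loads)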
I reproduced your file and read it with:
df = pd.read_csv('e1.csv', index_col=None )
print (df)
Output
id emp details createdat
0 1 john "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.
Here is a less than optimal option:
df = pd.read_csv('s083838383.csv', sep='##$%^', engine='python')
header = df.columns[0]
print(df)
Why sep='##$%^' ? This is just garbage that allows you to read the file with no sep character. It could be any random character and is just used as a means to import the data into a df object to work with.
df looks like this:
id,employee,details,createdAt
0 1,John,"{"Country":"USA","Salary":5000,"Review...
1 2,Sarah,"{"Country":"Australia", "Salary":6000...
Then you could use str.extract to apply regex and expand the columns:
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
expand=True).applymap(str.strip)
result.columns = header.strip().split(',')
print(result)
result is:
id employee details createdAt
0 1 John "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 Sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
If you need the starting and ending quotes stripped off of the details string values, you could do:
result['details'] = result['details'].str.strip('"')
If the details items need to be dicts instead of strings, you could do:
from json import loads
result['details'] = result['details'].apply(loads)
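For completeness, here is a hedged end-to-end sketch in the same spirit as the regex hint above: instead of fighting with sep, read the lines yourself, isolate the brace-delimited JSON field with a regex, and parse it with json.loads while building the frame. It assumes exactly four columns laid out as in the sample file 'e1.csv':
import json
import re
import pandas as pd

pattern = re.compile(r'^(.+?),(.+?),"(\{.*\})","(.+)"$')

rows = []
with open("e1.csv") as f_input:
    header = next(f_input).strip().split(',')
    for line in f_input:
        match = pattern.match(line.strip())
        if match:
            id_, employee, details, created = match.groups()
            rows.append([id_, employee, json.loads(details), created])

df = pd.DataFrame(rows, columns=header)
print(df)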
