How to compare two CSV files in Python?

I have two CSV files, named file1.csv and file2.csv. file2.csv has only one column, which contains only five records, while file1.csv has three columns and more than a thousand records. I want to get the records from file1.csv that appear in file2.csv.
For example, this is my file1.csv:
'A J1, Jhon1',jhon1#jhon.com, A/B-201 Test1
'A J2, Jhon2',jhon2#jhon.com, A/B-202 Test2
'A J3, Jhon3',jhon3#jhon.com, A/B-203 Test3
'A J4, Jhon4',jhon4#jhon.com, A/B-204 Test4
.......and more records
And inside my file2.csv I have only five records right now, but in the future there can be many:
A/B-201 Test1
A/B-2012 Test12
A/B-203 Test3
A/B-2022 Test22
So I have to match on the value at index [2] (that is, index [-1]) of each row in file1.csv.
This is what I did, but it is not giving me any output; it just returns an empty list:
import csv

file1 = open('file1.csv', 'r')
file2 = open('file2.csv', 'r')
f1 = list(csv.reader(file1))
f2 = list(csv.reader(file2))
new_list = []
for i in f1:
    if i[-1] in f2:
        new_list.append(i)
print('New List : ', new_list)
It gives me output like this:
New List : []
Please help; if I did anything wrong, correct me.

Method 1: pandas
This task can be done with relative ease using pandas. DataFrame documentation here.
Example:
In the example below, the two CSV files are read into two DataFrames. The DataFrames are merged using an inner join on the matching columns.
The output shows the merged result.
import pandas as pd
df1 = pd.read_csv('file1.csv', names=['col1', 'col2', 'col3'], quotechar="'", skipinitialspace=True)
df2 = pd.read_csv('file2.csv', names=['match'])
df = pd.merge(df1, df2, left_on='col3', right_on='match', how='inner').drop(columns='match')
The quotechar and skipinitialspace parameters are needed because the first column in file1 is quoted and contains a comma, and there is leading whitespace after each comma before the last field.
Output:
          col1            col2           col3
0  A J1, Jhon1  jhon1#jhon.com  A/B-201 Test1
1  A J3, Jhon3  jhon3#jhon.com  A/B-203 Test3
If you choose, the output can easily be written back to a CSV file as:
df.to_csv('path/to/output.csv')
For other DataFrame operations, refer to the documentation linked above.
Method 2: Core Python
The method below does not use any libraries, only core Python.
Read the matches from file2 into a list.
Iterate over file1 and search each line to determine if the last value is a match for an item in file2.
Report the output.
Any subsequent data cleaning (if required) will be up to your personal requirements or use-case.
Example:
output = []

# Read the matching values into a list.
with open('file2.csv') as f:
    matches = [i.strip() for i in f]

# Iterate over file1 and place any matches into the output.
with open('file1.csv') as f:
    for i in f:
        match = i.split(',')[-1].strip()
        if any(match == j for j in matches):
            output.append(i)
Output:
["'A J1, Jhon1',jhon1#jhon.com, A/B-201 Test1\n",
"'A J3, Jhon3',jhon3#jhon.com, A/B-203 Test3\n"]

Use sets or dicts for in checks (complexity is O(1) for them, instead of O(N) for lists and tuples).
Have a look at the convtools library (github): it has a Table helper for working with tabular data as streams (table docs).
from convtools import conversion as c
from convtools.contrib.tables import Table

# creating a set of allowed values
allowed_values = {
    item[0] for item in Table.from_csv("input2.csv").into_iter_rows(tuple)
}

result = list(
    # reading a file with a custom quotechar
    Table.from_csv("input.csv", dialect=Table.csv_dialect(quotechar="'"))
    # stripping last column values
    .update(COLUMN_2=c.col("COLUMN_2").call_method("strip"))
    # filtering based on allowed values
    .filter(c.col("COLUMN_2").in_(c.naive(allowed_values)))
    # returning an iterable of tuples
    .into_iter_rows(tuple)
    # # OR outputting csv if needed
    # .into_csv("result.csv")
)
"""
>>> In [36]: result
>>> Out[36]:
>>> [('A J1, Jhon1', 'jhon1#jhon.com', 'A/B-201 Test1'),
>>> ('A J3, Jhon3', 'jhon3#jhon.com', 'A/B-203 Test3')]
"""

Related

Compare one column (vector) from one CSV file with two columns (vector and array) from another CSV file using Python 3.8

I am a beginner and looking for a solution. I am trying to compare columns from two CSV files with no header. The first one has one column and the second one has two.
File_1.csv: #contains 2k rows with random numbers.
1
4
1005
.
.
.
9563
File_2.csv: #Contains 28k rows
0 [81,213,574,697,766,1074,...21622]
1 [0,1,4,10,12,13,1005, ...31042]
2 [35,103,85,1023,...]
3 [4,24,108,76,...]
4 []
.
.
.
28280 [0,1,9,10,32,49,56,...]
First, I want to compare the column of File_1 with the first column of File_2, find the matches, and extract the matching values plus the second column of File_2 into a new CSV file (output.csv), dropping the non-matching rows. For example:
output.csv:
1 [0,1,4,10,12,13,1005, ...31042]
4 []
.
.
.
Second, I want to compare the File_1.csv column (iterating over its 2k rows) with the second column (each array) of output.csv, keep the matching values, and delete the ones that do not match, saving those matching values back into output.csv while keeping the first column of that file. For example, 4 was deleted because it had no values in its second column (array), so there were no numbers to compare to File_1, but others like 1 did have some that match:
output.csv:
1 [1,4,1005]
.
.
.
I found code that works for the first step, but it does not save the second column. I have been looking into how to compare the arrays, but I haven't been able to.
This is what I have so far:
import csv

nodelist = []
node_matches = []

with open('File_1.csv', 'r') as f_rand_node:
    csv_f = csv.reader(f_rand_node)
    for row in csv_f:
        nodelist.append(row[0])

set_node = set(nodelist)

with open('File_2.csv', 'r') as f_tbl:
    with open('output.csv', 'w') as f_out:
        csv_f = csv.reader(f_tbl)
        for row in csv_f:
            set_row = set(' '.join(row).split(' '))
            if set_row.intersection(set_node):
                node_match = list(set_row.intersection(set_node))[0]
                f_out.write(node_match + '\n')
Thank you for the help.
I'd recommend using pandas for this case.
File_1.csv:
1
4
1005
9563
File_2.csv:
0 [81,213,574,697,766,1074]
1 [0,1,4,10,12,13,1005,31042]
2 [35,103,85,1023]
3 [4,24,108,76]
4 []
5 [0,1,9,10,32,49,56]
Code:
import pandas as pd
import csv
file1 = pd.read_csv('File_1.csv', header=None)
file1.columns=['number']
file2 = pd.read_csv('File_2.csv', header=None, delim_whitespace=True, index_col=0)
file2.columns = ['data']
df = file2[file2.index.isin(file1['number'].tolist())] # first step
df = df[df['data'] != '[]'] # second step
df.to_csv('output.csv', header=None, sep='\t', quoting=csv.QUOTE_NONE)
Output.csv:
1 [0,1,4,10,12,13,1005,31042]
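If the second step should also reduce each surviving array to just the values that occur in File_1 (as in the question's second example), the bracketed strings can be parsed and intersected; a sketch in the same vein (assuming the arrays stay formatted as bracketed, comma-separated strings), slotted in before the to_csv call:
numbers = set(file1['number'])

def keep_matches(s):
    # parse a string like "[0,1,4,1005]" and keep only the values present in File_1
    values = [int(v) for v in s.strip('[]').split(',') if v]
    return '[' + ','.join(str(v) for v in values if v in numbers) + ']'

df['data'] = df['data'].apply(keep_matches)
df = df[df['data'] != '[]']  # drop rows whose intersection is empty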
The entire thing is a lot easier with pandas DataFrames:
import pandas as pd

# Read the files into two DataFrames
df1 = pd.read_csv("File_1.csv", header=None, names=["number"])
df2 = pd.read_csv("File_2.csv", header=None, sep=r"\s+", names=["key", "data"], index_col="key")

# Step 1: keep only the rows of File_2 whose first column appears in File_1
df2 = df2[df2.index.isin(df1["number"])]

# Step 2: reduce each array to its intersection with File_1's numbers
numbers = set(df1["number"])
df2["data"] = df2["data"].apply(
    lambda s: [v for v in map(int, s.strip("[]").split(",")) if v in numbers] if s != "[]" else []
)

df2.to_csv("output.csv", header=False, sep="\t")
This should do the trick, quite simply.

How do I append two lists of rows into one new row in python?

I have two CSVs with many columns in each. I am looping through the rows of each and would like to combine the rows as I go into a third CSV that has the columns of both. So far this is the only way I can do it:
ee = csv.reader(open("ab.csv"), delimiter=",")
cc = csv.reader(open("cd.csv"), delimiter=",")
ofilePosts = open('complete.csv', 'ab')
writerPosts = csv.writer(ofilePosts, delimiter=',')
for e in ee:
    for c in cc:
        complete.writerow(e[0], e[1], e[2]...................
This takes a long time, manually writing out e[x] for every column.
How can I just do something like this without getting a runtime crash:
complete.writerow([e+c])
Use pandas: merge them by index, so the missing rows from the file with fewer records will be filled with NA.
import pandas as pd

ee = pd.read_csv('ab.csv')
cc = pd.read_csv('cd.csv')
merged = pd.concat([ee, cc], axis=1)  # merge by index
merged.to_csv('complete.csv')  # dump to a csv
print(merged)
Read the data from file1 and file2 into a list and then append them like so:
l =["row1","row2"] #list1
ll = ["row1","row2"] #list2
a = [[l[x],ll[x]] for x in range(len(l))]
print(a) # [['row1', 'row1'], ['row2', 'row2']]
This will only work properly if the number of rows is the same in both lists.
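For the crash-free writerow the question asks about, plain list concatenation already does it: csv.reader yields each row as a list, so e + c is the combined row. A minimal sketch (assuming both files have the same number of rows; note the nested loop in the original exhausts cc after the first row of ee, which zip avoids):
import csv

with open('ab.csv', newline='') as fa, open('cd.csv', newline='') as fb, \
        open('complete.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for e, c in zip(csv.reader(fa), csv.reader(fb)):
        writer.writerow(e + c)  # concatenate the two row lists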

python functionality to parse a csv file containing tables

I have a CSV file that I need to read and process in python.
The CSV file contains tabular values as follows:
*aa
1 foo1 foo_bar1
2 foo2 foo_bar2
*bb
1.22 bla1 blabla1 blablabla22
1.33 bla2 ' ' blablabla33
Here aa and bb are the names of each table. Wherever table names occur, the name is preceded by a * and the rows below it are the rows of that table.
Note that each table can have:
different number of columns as well as rows.
There can also be empty columns representing missing values. I would like to keep them as ' ' after reading in.
However, we know exactly which tables are present in the csv file (i.e. the table names)
I need to read in the csv file and assign a table's entire content to one variable. I can think of a brute force way of doing this. However, since python has a csv module with read write operations, is there any built in functionality that could make this easier or more efficient for me?
Note: One of the major problems I've faced so far is that after reading in the csv file using csv.reader(), I see that aa's rows have additional empty columns. I believe this is because of the mismatch between the number of aa's and bb's columns. I also want to get rid of these additional empty columns, without deleting the empty columns that represent actual missing values.
The cleanest way is to separate the tables before feeding each group to the csv reader. Here is a rough cut to get you started:
from itertools import takewhile
import csv
# Instead of *s*, you can use an open file object here
s = '''\
*aa
1,foo1,foo_bar1
2,foo2,foo_bar2
*bb
1.22,bla1,blabla1,blablabla22
1.33,bla2, ,blablabla33
'''.splitlines()
it = iter(s)
next(it)  # skip the first '*aa' marker line
for table in ['aa', 'bb']:
    print(f'\nTable: {table}')
    for row in csv.reader(takewhile(lambda r: not r.startswith('*'), it)):
        print(row)
This produces:
Table: aa
['1', 'foo1', 'foo_bar1']
['2', 'foo2', 'foo_bar2']

Table: bb
['1.22', 'bla1', 'blabla1', 'blablabla22']
['1.33', 'bla2', ' ', 'blablabla33']
You could parse your csv file like so, checking whether the first value starts with a '*', and build a dict from it.
import csv
from collections import defaultdict
import pprint

csv_data = defaultdict(list)

with open('data.csv', 'r') as csv_file:
    # filter empty lines
    csv_reader = csv.reader(filter(lambda l: l.strip(',\n'), csv_file))
    header = None
    for row in csv_reader:
        if row[0].startswith('*'):
            header = row[0]
        else:
            # additional row processing if needed
            csv_data[header].append(row)

pprint.pprint(csv_data)
# Output
defaultdict(<class 'list'>,
            {'*aa': [['1', ' foo1', 'foo_bar1', ''],
                     ['2', ' foo2', 'foo_bar2', '']],
             '*bb': [['1.22', ' bla1', 'blabla1', 'blablabla22'],
                     ['1.33', ' bla2', '', 'blablabla33']]})
If you want to remove the excess elements from a table due to another being larger, one option is
csv_data[header].append(row[:col_nums[header]])
where, as you mentioned, you know how many columns each table should have:
col_nums = {'*aa' : 3, '*bb' : 4}
defaultdict(<class 'list'>,
            {'*aa': [['1', ' foo1', 'foo_bar1'],
                     ['2', ' foo2', 'foo_bar2']],
             '*bb': [['1.22', ' bla1', 'blabla1', 'blablabla22'],
                     ['1.33', ' bla2', '', 'blablabla33']]})
If I misread it, and you only know the max number of columns rather than each table's column count, then you could instead trim the trailing empty fields:
def trim_row(row):
    # walk backwards and stop at the first non-empty field
    for i, item in enumerate(reversed(row)):
        if item:
            break
    return row[:len(row) - i]

# use it like so
csv_data[header].append(trim_row(row))
Have you considered using pandas?
import pandas as pd

df = pd.read_csv('foo.csv', sep=r'\s+', header=None)  # if there are table headings, remove header=None
You do not need to add any line to the top of the file.
This reads files with different numbers of rows and columns into a DataFrame, and you can perform all sorts of operations on it from there. For example:
Empty elements are represented by NaN, which means Not a Number. You can replace them with ' ' just by writing:
df = df.fillna(' ')
To fit your use case (from what I understand, you have multiple tables in the same csv file), try this:
df = pd.read_csv("foo.csv", header=None, names=range(3))
table_names = ["*aa", "*bb", "*cc"..]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
This will create a dict of tables, keyed by table name, with the table itself as the value.
for k, v in tables.items():
    print("table:", k)
    print(v)
    print()
You can find more details in the documentation.
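To see why the isin/cumsum trick groups the rows: isin marks each table-name row as True, and the running sum then labels every row with the number of the most recent marker above it. A quick illustration with hypothetical data of the same shape:
import pandas as pd

s = pd.Series(['*aa', '1', '2', '*bb', '1.22', '1.33'])
print(s.isin(['*aa', '*bb']).cumsum().tolist())
# [1, 1, 1, 2, 2, 2] -> rows 0-2 form group 1 (*aa), rows 3-5 form group 2 (*bb)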

Make edits to the original csv file

I have three different columns in my csv file, with their respective values. Column B (the Name column) in the csv file has its values in all caps. I am trying to convert them to first-letter caps, but when I run the code it returns all the columns squished together and in quotes.
The Original File:
Company Name Job Title
xxxxxx JACK NICHOLSON Manager
yyyyyy BRAD PITT Accountant
I am trying to do:
Company Name Job Title
xxxxxx Jack Nicholson Manager
yyyyyy Brad Pitt Accountant
My code:
import csv

with open('C:\\Users\\Data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)

for item in data:
    if len(item) > 1:
        item[1] = item[1].title()

with open('C:\\Users\\Data.csv', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
My result after I run the code: instead of three separate columns with the second column adjusted by title(), it returns all three columns squished together into a single column, in quotes.
"Company","Name","Job Title"
xxxxxx,"JACK NICHOLSON","Manager"
yyyyyy,"BRAD PITT","Accountant"
I do not know what is wrong with my snippet; the result also has absurd markings at the beginning.
A slight change to Mohammed's solution using read_fwf to simplify reading the file.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html
import pandas as pd
df = pd.read_fwf('old_csv_file')
df.Name = df.Name.str.title()
df.to_csv('new_csv_file', index=False, sep='\t')
EDIT:
Changed to use a string method over a lambda. I prefer to use lambdas as a last resort.
You can do something like this with pandas:
import pandas as pd
df = pd.read_csv('old_csv_file', sep=r'\s{3,}')
df.Name = df.Name.apply(lambda x: x.title())
df.to_csv('new_csv_file', index=False, sep='\t')
str.title() converts a string to title case, i.e. the first letter of every word is capitalized and subsequent letters are converted to lower case.
With df.apply you can perform an operation on an entire column or row.
r'\s{3,}' is a regular expression: \s matches a whitespace character, and \s{3,} matches three or more of them.
When you are reading a CSV format you have to specify how the columns are separated. Generally columns are separated by a comma or a tab, but in your case you have five or six spaces between the columns of a row. So by using \s{3,} I am telling the CSV parser that the columns in a row are delimited by three or more spaces.
If I had used only \s, it would have treated First Name and Last Name as two separate columns, because they have one space in between. Requiring three or more spaces keeps First Name and Last Name together as a single column.
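A quick demonstration of both pieces, with illustrative values:
import re

print("JACK NICHOLSON".title())
# Jack Nicholson
print(re.split(r'\s{3,}', 'xxxxxx      JACK NICHOLSON      Manager'))
# ['xxxxxx', 'JACK NICHOLSON', 'Manager']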
Take note that data stores each row as a list containing only one string. Since each item has a length of 1, the statement inside this if block never executes:
if len(item) > 1:
    item[1] = item[1].title()
Aside from that, reading and writing in binary format is unnecessary.
import csv

with open('C:\\Users\\Data.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)

for item in data[1:]:  # excludes headers
    item[0] = item[0].title()  # will capitalize the Company column too
    item[0] = item[0][0].lower() + item[0][1:]  # that's why we need to revert
    print(item)

# see that data contains lists having one element only
# the loop above will output
# ['xxxxxx Jack Nicholson Manager']
# ['yyyyyy Brad Pitt Accountant']

with open('C:\\Users\\Data.csv', 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)

Copying the Matching Columns from CSV File

Input:
I have two csv files (file1.csv and file2.csv).
file1 looks like:
ID,Name,Gender
1,Smith,M
2,John,M
file2 looks like:
name,gender,city,id
Problem:
I want to compare the header of file1 with the header of file2 and copy the data of the matching columns. The header of file1 needs to be lowercased before finding the matching columns in file2.
Output:
the output should be like this:
name,gender,city,id # name,gender,and id are the only matching columns btw file1 and file2
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
I have tried the following code so far:
import csv

file1 = csv.DictReader(open("file1.csv"))  # reading file1.csv
file1_Dict = {}  # dictionary of lists that will store the keys and values as lists
for row in file1:
    for column, value in row.items():
        file1_Dict.setdefault(column, []).append(value)

for key in file1_Dict:  # converting the keys of the dictionary to lowercase
    file1_Dict[key.lower()] = file1_Dict.pop(key)

file2 = open("file2.csv")  # reading file2.csv
file2_Dict = {}  # store the keys into a dictionary with empty values
for row2 in file2:
    row2 = row2.split(",")
    for i in row2:
        file2_Dict[i] = ""
Any idea how to solve this problem?
I had a crack at this problem using Python, without taking performance into consideration. Took me quite a while, phew!
This is my solution.
import csv

csv_data1_filepath = './file1.csv'
csv_data2_filepath = './file2.csv'

def main():
    # read both csv files into memory
    data1 = list(csv.reader(open(csv_data1_filepath, 'r')))
    data2 = list(csv.reader(open(csv_data2_filepath, 'r')))

    file1_header = data1[0][:]  # get f1 header
    file2_header = data2[0][:]  # get f2 header

    lowered_file1_header = [item.lower() for item in file1_header]  # lowercase it
    lowered_file2_header = [item.lower() for item in file2_header]  # do it for header 2 anyway

    col_index_dict = {}

    for column in lowered_file1_header:
        if column in lowered_file2_header:
            col_index_dict[column] = lowered_file1_header.index(column)
        else:
            col_index_dict[column] = -1  # mark as a column that will not be worked on later

    for column in lowered_file2_header:
        if column not in lowered_file1_header:
            col_index_dict[column] = -1  # mark as a column that will not be worked on later

    # build the header
    output = [list(col_index_dict.keys())]

    is_header = True
    for row in data1:
        if is_header is False:
            rowData = []
            for column in col_index_dict:
                column_index = col_index_dict[column]
                if column_index != -1:
                    rowData.append(row[column_index])
                else:
                    rowData.append('')
            output.append(rowData)
        else:
            is_header = False

    print(output)

if __name__ == '__main__':
    main()
This will give you the output of:
[
    ['id', 'name', 'gender', 'city'],
    ['1', 'Smith', 'M', ''],
    ['2', 'John', 'M', '']
]
Note that on older Python versions the output may lose this column ordering, but that should be fixable by using an ordered dictionary instead.
Hope this helps.
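A more compact route with just the csv module is to let csv.DictWriter do the column matching: restval fills in the columns missing from file1, and extrasaction='ignore' drops any columns of file1 that file2 doesn't have. A sketch, assuming file2.csv holds the target header as in the question:
import csv

with open('file1.csv', newline='') as f1, \
        open('file2.csv', newline='') as f2, \
        open('output.csv', 'w', newline='') as out:
    target_fields = next(csv.reader(f2))  # ['name', 'gender', 'city', 'id']
    reader = csv.DictReader(f1)
    reader.fieldnames = [name.lower() for name in reader.fieldnames]  # lowercase file1's header
    writer = csv.DictWriter(out, fieldnames=target_fields, restval=' ', extrasaction='ignore')
    writer.writeheader()
    writer.writerows(reader)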
You don't need Python for this. This is a task for SQL.
SQLite Browser supports CSV Import. Take the below steps to get the desired output:
Download and install SQLite Browser
Create a new Database
Import both CSV's as tables (let's say the table names are file1 and file2, respectively)
Now, you can decide how you want to match the data sets. If you only want to match the files on ID, then you can do something like:
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id
If you want to match on every column, you can do something like:
select *
from file1 f1
inner join file2 f2
on f1.id = f2.id and f1.name = f2.name and f1.gender = f2.gender
Finally, simply export the query results back to a CSV.
I spent a lot of time myself trying to perform tasks like this with scripting languages. The benefit of using SQL is that you simply tell what you want to match on, and then let the database do the optimization for you. Generally, it ends up doing the matching faster than any code I could write.
In case you're interested, Python also has a sqlite3 module that comes out of the box. I've gravitated towards using this as my source for data in Python scripts for the above reason, and I simply import the required CSV's in SQLite Browser before running the Python script.
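For instance, a minimal sketch of that workflow done entirely in Python with sqlite3 (file names, table names, and the on-ID join follow the question and the first query above):
import csv
import sqlite3

# load both CSVs into an in-memory SQLite database and join on id
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE file1 (id TEXT, name TEXT, gender TEXT)')
conn.execute('CREATE TABLE file2 (name TEXT, gender TEXT, city TEXT, id TEXT)')

with open('file1.csv', newline='') as f:
    conn.executemany('INSERT INTO file1 VALUES (?, ?, ?)', list(csv.reader(f))[1:])
with open('file2.csv', newline='') as f:
    conn.executemany('INSERT INTO file2 VALUES (?, ?, ?, ?)', list(csv.reader(f))[1:])

for row in conn.execute('SELECT * FROM file1 f1 INNER JOIN file2 f2 ON f1.id = f2.id'):
    print(row)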
