CSV diff with headers - python
I am taking two csvs and writing the difference to a third csv file. Here's my function.
def diff_csv(NewFile, OldFile):
    with open(OldFile, 'r') as t1, open(NewFile, 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()
    with open('result.csv', 'w') as outFile:
        for line in filetwo:
            if line not in fileone:
                outFile.write(line)
However, since the two input csvs have the same set of headers, the headers don't show up in the output csv. How do I preserve the headers here?
You could use csv.DictReader and csv.DictWriter. Then unconditionally write the header row, but for the other rows first test whether they exist in old_file. For this test, in the example below, I am making a set of rows from old_file to test inclusion, and representing the rows using a tuple of cell values because this is a hashable object that can be stored in a set.
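As a quick standalone illustration of the hashability point (not part of the solution itself): a list of cell values cannot be stored in a set, but a tuple of the same values can:

>>> seen = set()
>>> seen.add(['a', '4'])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'
>>> seen.add(('a', '4'))
>>> ('a', '4') in seen
True

With that in place, the full example: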
import csv

def diff_csv(new_file, old_file, out_file='result.csv'):
    with open(old_file) as t1, open(new_file) as t2:
        c2 = csv.DictReader(t2)
        c1 = csv.DictReader(t1)
        assert c2.fieldnames == c1.fieldnames
        # collect the old rows as hashable tuples so set lookup works
        row_tuples = set()
        for row in c1:
            row_tuples.add(tuple(row.values()))
        # newline='' avoids blank lines in the output on Windows
        with open(out_file, "w", newline='') as fout:
            cout = csv.DictWriter(fout, c2.fieldnames)
            cout.writeheader()
            for row in c2:
                row_tuple = tuple(row.values())
                if row_tuple not in row_tuples:
                    cout.writerow(row)

if __name__ == '__main__':
    diff_csv("b.csv", "a.csv")
with these input files:
a.csv
blah,stuff
a,4
b,6
and b.csv
blah,stuff
x,4
b,6
d,3
gives result.csv:
blah,stuff
x,4
d,3
Related
merging and removing duplicates in two csv's without using pandas
I have a test output csv that looks like the following:

test1      test success
test2      test failed
regtest    failed to build

Column 1 contains unique strings and column 2 contains one of the following three strings: test success, test failed, failed to build. I run this test every so often on new builds and I want to compare the csv from the latest test with the previous test. I would like to produce a new csv containing all the tests whose state (column 2) has changed, preferably in the format:

TestName    OldState    NewState

Here is my current attempt, which gets all the differences between the two files, but its output looks like this:

test1      test success
test2      test failed
regtest    failed to build
test2      test success

I need a way to merge the second test2 with the first one so it looks like this:

test1      test success
test2      test failed        test success
regtest    failed to build

My current code:

import csv
import sys

with open(sys.argv[1], 'r') as t1, open(sys.argv[2], 'r') as t2, \
        open(sys.argv[2], 'r') as t3, open(sys.argv[1], 'r') as t4:
    fileOne = t1.readlines()
    fileTwo = t2.readlines()
    fileThree = t3.readlines()
    fileFour = t4.readlines()

with open(sys.argv[3], 'w') as outFile:
    for line in fileTwo:
        if line not in fileOne:
            outFile.write("From File 2," + line)
    for line in fileFour:
        if line not in fileThree:
            outFile.write("\r\nFrom File 1," + line)
Each row in your csv has a key-value pair, so it makes sense to read your csv into a dictionary; then you can easily compare the values based on keys:

import csv
import sys

with open(sys.argv[1], 'r') as t1, open(sys.argv[2], 'r') as t2:
    # Convert the lines of each csv to a list
    reader1 = list(csv.reader(t1))
    reader2 = list(csv.reader(t2))

# Create a dictionary of key-value pairs for t1
fileOne_dict = {col[0]: [col[1]] for col in reader1}

# Compare values based on keys and append if different
for col in reader2:
    if fileOne_dict.get(col[0]):
        if fileOne_dict[col[0]][0] != col[1]:
            fileOne_dict[col[0]].append(col[1])
    else:
        fileOne_dict[col[0]] = ["", col[1]]

out = [[key] + value for key, value in fileOne_dict.items()]
print(out)

with open('diff.csv', 'w') as outFile:
    writer = csv.writer(outFile)
    writer.writerows(out)

This should leave you with a dictionary containing the information you want.
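A quick sanity check of what this produces: assuming the input files are comma-separated (e.g. `test2,test failed`), running the above on the sample data from the question should give a diff.csv along these lines:

test1,test success
test2,test failed,test success
regtest,failed to build

Changed tests end up with both states on one row, which matches the requested format.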
Find matches in two csv files
I have two csv files: the first one has one column called ID and 5 rows, and the second one has 12 columns (col 5 is called ID) with 100 rows. I am trying to find matching IDs and write the entire row to a new csv file. Thank you for your help! Here is my code:

import csv

input_file1 = "/Desktop/New1/file1.csv"
input_file2 = "/Desktop/New1/file2.csv"
output_file = "/Desktop/New1/results.csv"

with open(input_file1) as t1, open(input_file2) as t2:
    fileone = csv.reader(t1)
    filetwo = csv.reader(t2)
    with open(output_file, 'w') as output_res:
        for line in filetwo:
            if line in fileone:
                output_res.write(line)
You can read the IDs in file1 into a set for more efficient lookup. You should also use csv.writer to output the rows as CSV:

with open(input_file1) as t1, open(input_file2) as t2:
    ids = set(id for id, in csv.reader(t1))
    filetwo = csv.reader(t2)
    with open(output_file, 'w') as output_res:
        writer = csv.writer(output_res)
        for row in filetwo:
            if row[4] in ids:
                writer.writerow(row)
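One detail worth calling out: the trailing comma in `set(id for id, in csv.reader(t1))` unpacks each one-element row from file1, so the set holds plain strings rather than lists. If that idiom looks too cryptic, an equivalent spelling would be:

ids = set(row[0] for row in csv.reader(t1))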
Replacing content of column 'x' from file A with column 'x' in a very large file B
I have two files: "A", which is not too large (2GB), and "B", which is rather large at 60GB. I have a primitive code as follows:

import csv  # imports module csv

filea = "A.csv"
fileb = "B.csv"
output = "Python_modified.csv"

# open csv readers
source1 = csv.reader(open(filea, "r"), delimiter='\t')
source2 = csv.reader(open(fileb, "r"), delimiter='\t')

# prepare changes from file B
source2_dict = {}
for row in source2:
    source2_dict[row[2]] = row[2]

# write new changed rows
with open(output, "w") as fout:
    csvwriter = csv.writer(fout, delimiter='\t')
    for row in source1:
        # needs to check whether there are any changes prepared
        if row[3] in source2_dict:
            # change the item
            row[3] = source2_dict[row[3]]
        csvwriter.writerow(row)

It should read through column 3 from both files and replace column 4 in file A with the contents of column 4 in file B if there's a match. However, since it's reading in the large files, it's very slow. Is there any way to optimize this?
You could try reading file_a in large blocks into memory, and then processing each block. This would mean you are doing groups of reads followed by groups of writes, which should help to reduce disk thrashing. You will need to decide which block_size to use, obviously something that will fit comfortably in memory.

from itertools import islice
import csv

file_a = "A.csv"
file_b = "B.csv"
output = "Python_modified.csv"
block_size = 10000

# prepare changes from file B
source2_dict = {}
with open(file_b, 'rb') as f_source2:
    for row in csv.reader(f_source2, delimiter='\t'):
        source2_dict[row[3]] = row[4]    # just store the replacement value

# write new changed rows
with open(file_a, 'rb') as f_source1, open(output, "wb") as f_output:
    csv_source1 = csv.reader(f_source1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')

    # read input file_a in large groups
    for block in iter(lambda: list(islice(csv_source1, block_size)), []):
        for row in block:
            try:
                row[4] = source2_dict[row[3]]
            except KeyError:
                pass
            csv_output.writerow(row)

Secondly, to reduce memory usage: if you are just replacing one value, then just store that one value in your dictionary.

Tested using Python 2.x. If you are using Python 3.x, you will need to change your file opens, e.g.

with open(file_b, 'r', newline='') as f_source2:
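The `iter(callable, sentinel)` form used above may be unfamiliar: it keeps calling the lambda until it returns the sentinel value (here an empty list), which happens once `islice` has exhausted the reader. A minimal self-contained sketch of the idiom:

from itertools import islice

nums = iter(range(7))
for block in iter(lambda: list(islice(nums, 3)), []):
    print(block)   # prints [0, 1, 2], then [3, 4, 5], then [6]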
How to search for a 'text' or 'number' in a csv file with Python AND if exists print only first and second column values to a new csv file
I want to do the following using Python.

Step-1: Read a specific third column on a csv file using Python.
Step-2: Create a list with the values got from step-1.
Step-3: Take the value of index[0], search in the csv file, and if present print the values of columns 1 and 2 only to a new csv file (there are 6 columns). If not present, just ignore it and go to the next search.

file1.csv:

Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,5,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,

Python script written for this:

import csv
import time
from collections import defaultdict

columns = defaultdict(list)
with open('file1.csv') as f:
    reader = csv.reader(f)
    reader.next()
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)
#print(columns[2])
b = (columns[2])
for x in b[:]:
    time.sleep(1)
    print x

Output of above script:

MacBook-Pro:test_usr$ python csv_file.py
1
1
2
3
5
6
7
8
8
MacBook-Pro:test_usr$

I am able to do steps 1 and 2. Please guide me on doing step 3. That is, how do I search for a text/string in a csv file and, if present, extract only specific column values to a new csv file?

The output file should look like:

a,ab
b,cd
c,ef
d,gh
e,ij
f,kl
g,mn
h,op
i,qr

Note: the search string will come from another csv file. Please don't suggest the direct answer for printing the values of columns 1 and 2 directly.

The FINAL CODE looks like this:

import csv
import time
from collections import defaultdict

columns = defaultdict(list)
with open('file1.csv') as f:
    reader = csv.reader(f)
    reader.next()
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)

b = (columns[2])
for x in b[:]:
    with open('file2.csv') as f, open('file3.csv', 'a') as g:
        reader = csv.reader(f)
        #next(reader, None)  # discard the header
        writer = csv.writer(g)
        for row in reader:
            if row[2] == x:
                writer.writerow(row[:2])

file1.csv:

Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,5,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,

file2.csv:

count,name,number,Type,status,Config Version,,IP1,port
1,bob,1,TRAFFIC,end,1.2,,1.1.1.1,1
2,john,1,TRAFFIC,end,2.1,,1.1.1.2,2
4,foo,2,TRAFFIC,end,1.1,,1.1.1.3,3
5.333333333,test,3,TRAFFIC,end,3.1,,1.1.1.4,4
6.833333333,raa,5,TRAFFIC,end,5.1,,1.1.1.5,5
8.333333333,kaa,6,TRAFFIC,end,7.1,,1.1.1.6,6
9.833333333,thaa,7,TRAFFIC,end,9.1,,1.1.1.7,7
11.33333333,paa,8,TRAFFIC,end,11.1,,1.1.1.8,8
12.83333333,maa,8,TRAFFIC,end,13.1,,1.1.1.9,9

If I run the above script, the output in file3.csv is:

1,bob
2,john
1,bob
2,john
1,bob
2,john
.
.
.

and it goes on like this in a loop. But the output should be like this:

count,name
1,bob,
2,john,
4,foo,
5.333333333,test,
6.833333333,raa,
8.333333333,kaa,
9.833333333,thaa,
11.33333333,paa,
12.83333333,maa,
I think you should reconsider your approach. You can achieve your goal simply by iterating over the CSV file, without creating intermediate dicts and lists, and since you want to work with specific columns, you'll make your life easier and your code more readable by using DictReader and DictWriter:

import csv

search_string = "whatever"

with open('file1.csv') as f, open('file2.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    c1, c2, c3, *_ = reader.fieldnames
    writer = csv.DictWriter(g, fieldnames=(c1, c2))
    for row in reader:
        if row[c3] == search_string:
            writer.writerow({c1: row[c1], c2: row[c2]})

Keep in mind that the csv module will always return strings. You have to handle data-type conversions yourself if you need them (I've left that out of the above).

If you don't want to use DictReader/DictWriter (though I suppose the alternative is a little more verbose) and don't want a header in your output file:

with open('file1.csv') as f, open('file2.csv', 'w', newline='') as g:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    writer = csv.writer(g)
    for row in reader:
        if row[2] == search_string:
            writer.writerow(row[:2])
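Note that `c1, c2, c3, *_ = reader.fieldnames` relies on Python 3's extended unpacking: it binds the first three header names and collects any remaining ones into `_`. With the file1.csv header from the question, for example:

c1, c2, c3, *_ = ['Country', 'Location', 'number', 'letter', 'name', 'pup-name', 'null']
# c1 == 'Country', c2 == 'Location', c3 == 'number'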
"That is, how do I search for a text/string in a csv file and, if present, extract only specific column values to a new csv file?"

This is two questions.

First question: to search for text in a file, the simplest answer would be to read the file text into memory and look for the text. If you want to look for the text in a specific column of the csv you're reading in, you can use a DictReader to make life easy:

for row in reader:
    if search_target in row[header]:
        # found it!

Second question: one way to write specific columns to a new csv would be as follows:

keys = ["Country", "Location"]
new_rows = [{key: row[key] for key in keys} for row in reader]
writer = csv.DictWriter(somefile, keys)
writer.writerows(new_rows)
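If it helps, here is the second snippet stitched into a runnable sketch (assuming the file1.csv layout from the question; out.csv is a hypothetical destination name):

import csv

keys = ["Country", "Location"]
with open('file1.csv') as f, open('out.csv', 'w', newline='') as somefile:
    reader = csv.DictReader(f)
    new_rows = [{key: row[key] for key in keys} for row in reader]
    writer = csv.DictWriter(somefile, keys)
    writer.writeheader()   # optional, if you want a header in the output
    writer.writerows(new_rows)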
This may help you understand better: it reads two csv files, checks whether the matching index values in each row are the same, and if they are, writes the row to another csv.

import csv
import os

output_dir = r"D:\Laneending\data-ars540"
file1 = "3rd_test_rec_road_width_changing_scenarios_250_inference.csv"
file2 = "df_5_signals_1597515776730734.csv"
ars540 = os.path.join(output_dir, file1)
veh_dyn = os.path.join(output_dir, file2)
file3 = "df_5_signals_1597515776730734_processed.csv"
output_file = os.path.join(output_dir, file3)

with open(ars540, 'r') as f1, open(veh_dyn, 'r') as f2, \
        open(output_file, 'w+', newline='') as f3:
    f1_reader = csv.reader(f1)
    f2_reader = csv.reader(f2)
    header_f1 = next(f1_reader)  # read past the header of the first csv file
    header_f2 = next(f2_reader)  # read past the header of the second csv file
    count = 0
    writer = csv.writer(f3)      # prepare file f3 for writing
    writer.writerow(["Timestamp", "no of detections", "velocity", "yawrate", "afdr"])
    for row_f1 in f1_reader:         # look at each row from csv file f1
        for row_f2 in f2_reader:     # scan forward through csv file f2
            if row_f1[1] == row_f2[0]:  # check the condition; worst-case time complexity O(n^2)
                writer.writerow(row_f2)
                count += 1
                break
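One caveat: because f2_reader is consumed sequentially and the inner loop breaks on each match, this pairing only behaves as intended when both files are ordered by the matching key. A sketch of an alternative that avoids both the ordering assumption and the O(n²) scan, reusing the same file variables as above, would be to index the second file by its key column first:

import csv

with open(veh_dyn, 'r') as f2:
    f2_reader = csv.reader(f2)
    next(f2_reader)                            # skip the header
    rows_by_key = {row[0]: row for row in f2_reader}

with open(ars540, 'r') as f1, open(output_file, 'w', newline='') as f3:
    f1_reader = csv.reader(f1)
    next(f1_reader)                            # skip the header
    writer = csv.writer(f3)
    writer.writerow(["Timestamp", "no of detections", "velocity", "yawrate", "afdr"])
    for row_f1 in f1_reader:
        match = rows_by_key.get(row_f1[1])     # O(1) lookup instead of a nested scan
        if match is not None:
            writer.writerow(match)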
Sort a file according to values in first column
I have a file containing results (integers) in several columns (tab separated), two lines of text at the beginning telling me something about the file's contents, and two lines at the end telling me if the file's contents are complete. I have a script to order the file according to the value of the first column, but would like to extend it so it skips the first two and last two lines of the file while also only printing out the ordered first column. How could I do so? This is the script that I currently have:

import csv

file_name = "output1.dat"
new_file_name = "sorted_" + file_name

data = csv.reader(open(file_name), delimiter='\t')
sortedlist = sorted(data, key=lambda x: int(x[0]))

# now write the sorted result into a new CSV file
with open(new_file_name, "wb") as f:
    fileWriter = csv.writer(f, delimiter=',')
    for row in sortedlist:
        fileWriter.writerow(row)

It gets tripped up by the lines of text as they don't contain any columns.
This should skip the first two and last two lines:

sortedlist = sorted(list(data)[2:-2], key=lambda x: int(x[0]))

Write only the first column:

fileWriter.writerow(row[:1])

Full script:

import csv

file_name = "output1.dat"
new_file_name = "sorted_" + file_name

data = csv.reader(open(file_name), delimiter='\t')
sortedlist = sorted(list(data)[2:-2], key=lambda x: int(x[0]))

# now write the sorted result into a new CSV file
with open(new_file_name, "wb") as f:
    fileWriter = csv.writer(f, delimiter=',')
    for row in sortedlist:
        fileWriter.writerow(row[:1])
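One caveat if you are on Python 3: the csv module there expects text-mode files, so the "wb" open only works on Python 2. On Python 3 the output open would instead be:

with open(new_file_name, "w", newline='') as f: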