Merging and removing duplicates in two CSVs without using pandas - Python

I have a test output CSV that looks like the following:
test1 test success
test2 test failed
regtest failed to build
Column 1 contains unique strings and column 2 contains one of the following three strings: test success, test failed, failed to build.
I run this test every so often on new builds, and I want to compare the CSV from the latest test with the one from the previous test.
I would like to produce a new CSV containing all the tests whose state (column 2) has changed, preferably in the format:
TestName OldState NewState
Here is my current attempt, which gets all the differences between the two files, but its output looks like this:
test1 test success
test2 test failed
regtest failed to build
test2 test success
I need a way to merge the second test2 line with the first one so it looks like this:
test1 test success
test2 test failed test success
regtest failed to build
My current code:
import csv
import sys

with open(sys.argv[1], 'r') as t1, open(sys.argv[2], 'r') as t2, open(sys.argv[2], 'r') as t3, open(sys.argv[1], 'r') as t4:
    fileOne = t1.readlines()
    fileTwo = t2.readlines()
    fileThree = t3.readlines()
    fileFour = t4.readlines()

with open(sys.argv[3], 'w') as outFile:
    for line in fileTwo:
        if line not in fileOne:
            outFile.write("From File 2," + line)
    for line in fileFour:
        if line not in fileThree:
            outFile.write("\r\nFrom File 1," + line)

Each row in your CSV is a key-value pair, so it makes sense to read your CSV into a dictionary; then you can easily compare the values for matching keys:
import csv
import sys

with open(sys.argv[1], 'r') as t1, open(sys.argv[2], 'r') as t2:
    # Convert each CSV into a list of rows
    reader1 = list(csv.reader(t1))
    reader2 = list(csv.reader(t2))

# Create a dictionary mapping each test name in t1 to a list holding its state
fileOne_dict = {col[0]: [col[1]] for col in reader1}
# Compare values based on keys and append the new state if it differs
for col in reader2:
    if fileOne_dict.get(col[0]):
        if fileOne_dict[col[0]][0] != col[1]:
            fileOne_dict[col[0]].append(col[1])
    else:
        fileOne_dict[col[0]] = ["", col[1]]

out = [[key] + value for key, value in fileOne_dict.items()]
print(out)

with open('diff.csv', 'w', newline='') as outFile:
    writer = csv.writer(outFile)
    writer.writerows(out)
This should leave you with a dictionary (and a diff.csv) containing the information you want.

Related

Using readlines and somehow skipping the third column from comparison in two csv files

Old.csv:
name,department
leona,IT
New.csv:
name,department
leona,IT
lewis,Tax
With the same two columns, finding the new values in New.csv and updating Old.csv with them works fine with the code below:
feed = []
headers = []
with open("Old.csv", 'r') as t1, open("New.csv", 'r') as t2:
    for header in t1.readline().split(','):
        headers.append(header.rstrip())
    fileone = t1.readlines()
    filetwo = t2.readlines()[1:]  # Skip csv fieldnames
    for line in filetwo:
        if line not in fileone:
            lineItems = {}
            feed.append(line.strip())  # For old file update
New problem:
1/ Add a 3rd column to store timestamp values
2/ Skip the 3rd column (timestamp) in both files; the two files still need to be compared for differences based on the 1st and 2nd columns
3/ The old file will be updated with the new values in all 3 columns
I tried the slicing method split(',')[0:2] but it didn't seem to work at all. I feel only some small updates to the existing code are needed, but I'm not sure how to achieve that.
Expected outcome:
Old.csv:
name,department,timestamp
leona,IT,07/20/2020 <--- Existing value
lewis,Tax,08/25/2020 <--- New value from New.csv
New.csv:
name,department,timestamp
leona,IT,07/20/2020
leona,IT,07/25/2020
lewis,Tax,08/25/2020
You can do it all yourself, but why not use the tools built in to Python?
from csv import reader

feed = []
with open('Old.csv', 'r') as t1, open('New.csv', 'r') as t2:
    old = reader(t1)
    new = reader(t2)
    headers = next(old)
    # skip header in new
    next(new)
    # relevant data is only the first two columns
    old_data = [rec[:2] for rec in old]
    for rec in new:
        if rec[:2] not in old_data:
            feed.append(rec)
print(headers)
print(feed)
Result:
['name', 'department']
[['lewis', 'Tax']]
Note that you'll get this result with the data you provided, but if you add a third column, the code still works as expected and will add that data to the feed result.
To get feed to be a list of dictionaries, which you can easily turn into JSON, you could do something like:
feed.append(dict(zip(headers, rec)))
Turning feed into json is as simple as:
import json
print(json.dumps(feed))
The whole solution:
import json
from csv import reader
feed = []
with open('Old.csv', 'r') as t1, open('New.csv', 'r') as t2:
old = reader(t1)
new = reader(t2)
headers = next(old)
# skip header in new
next(new)
# relevant data is only the first two columns
old_data = [rec[:2] for rec in old]
for rec in new:
if rec[:2] not in old_data:
feed.append(dict(zip(headers, rec)))
print(json.dumps(feed))
With output like:
[{"name": "lewis", "department": "Tax", "timestamp": "08/25/2020"}]

CSV diff with headers

I am taking two csvs and writing the difference to a third csv file. Here's my function.
def diff_csv(NewFile, OldFile):
    with open(OldFile, 'r') as t1, open(NewFile, 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()
    with open('result.csv', 'w') as outFile:
        for line in filetwo:
            if line not in fileone:
                outFile.write(line)
However, since the two input CSVs have the same set of headers, the header row doesn't show up in the output CSV. How do I preserve the headers here?
You could use csv.DictReader and csv.DictWriter. Then unconditionally write the header row, but for the other rows first test whether they exist in old_file. For this test, in the example below, I am making a set of rows from old_file to test inclusion, and representing the rows using a tuple of cell values because this is a hashable object that can be stored in a set.
import csv

def diff_csv(new_file, old_file, out_file='result.csv'):
    with open(old_file) as t1, open(new_file) as t2:
        c2 = csv.DictReader(t2)
        c1 = csv.DictReader(t1)
        assert c2.fieldnames == c1.fieldnames
        row_tuples = set()
        for row in c1:
            row_tuples.add(tuple(row.values()))
        with open(out_file, "w", newline='') as fout:
            cout = csv.DictWriter(fout, c2.fieldnames)
            cout.writeheader()
            for row in c2:
                row_tuple = tuple(row.values())
                if row_tuple not in row_tuples:
                    cout.writerow(row)

if __name__ == '__main__':
    diff_csv("b.csv", "a.csv")
with these input files:
a.csv
blah,stuff
a,4
b,6
and b.csv
blah,stuff
x,4
b,6
d,3
gives result.csv
blah,stuff
x,4
d,3

Joining files by corresponding columns in outside table

I have a .csv file matching sample names to categories, which I want to use to merge (as with cat) the files in a folder whose names correspond to the Sample_Name column of the .csv, grouping them according to Category and naming each merged file after its Category.
The to-be-merged files in the folder are not .csv files; they're a kind of .fasta file.
The .csv is something like the following (it will have more columns, which will be ignored here):
Sample_Name Category
1 a
2 a
3 a
4 b
5 b
After merging, the output should be two files: a (samples 1,2,3 merged) and b (samples 4 and 5).
The idea is to make this work for a large number of files and categories.
Thanks for any help!
Assuming that the files are in order in the input CSV file, this is about as simple as you could get:
from operator import itemgetter

fields = itemgetter(0, 1)  # zero-based field numbers of the fields of interest

with open('sample_categories.csv') as csvfile:
    next(csvfile)  # skip over header line
    for line in csvfile:
        filename, category = fields(line.split())
        with open(filename) as infile, open(category, 'a') as outfile:
            outfile.write(infile.read())
One downside to this is that the output file is reopened for every input file. This might be a problem if there are a lot of files per category. If that works out to be an actual problem then you could try this, which holds the output file open for as long as there are input files in that category.
from operator import itemgetter

fields = itemgetter(0, 1)  # zero-based field numbers of the fields of interest

with open('sample_categories.csv') as csvfile:
    next(csvfile)  # skip over header line
    current_category = None
    outfile = None
    for line in csvfile:
        filename, category = fields(line.split())
        if category != current_category:
            if outfile is not None:
                outfile.close()
            outfile = open(category, 'w')
            current_category = category
        with open(filename) as infile:
            outfile.write(infile.read())
    if outfile is not None:
        outfile.close()  # close the last category's output file
I would build a dictionary with keys of categories and values of lists of corresponding sample names.
d = {'a':['1','2','3'], 'b':['4','5']}
You can achieve this in a straightforward way by reading the csv file and building the dictionary line by line, i.e.
d = {}
with open('myfile.csv') as myfile:
    for line in myfile:
        samp, cat = line.split()
        try:
            d[cat].append(samp)
        except KeyError:  # if there is no entry for cat yet, we get a KeyError
            d[cat] = [samp]
For a more sophisticated way of doing this, have a look at the collections module, as sketched below.
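A minimal sketch with collections.defaultdict, assuming the same whitespace-separated myfile.csv as above:

from collections import defaultdict

d = defaultdict(list)
with open('myfile.csv') as myfile:
    for line in myfile:
        samp, cat = line.split()
        d[cat].append(samp)  # a missing key is created as an empty list automatically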
Once this database is ready, you can create your new files category by category:
for cat in d:
    with open(cat, 'w') as outfile:
        for sample in d[cat]:
            # copy sample file content to outfile
Copying one file's contents to another can be done in several ways; see this thread.
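For example, a minimal sketch of that copy step with shutil.copyfileobj, assuming the sample files are named exactly as they appear in the Sample_Name column:

import shutil

for cat in d:
    with open(cat, 'w') as outfile:
        for sample in d[cat]:
            with open(sample) as infile:
                # stream the sample file's contents into the merged category file
                shutil.copyfileobj(infile, outfile)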

Merging two files by one common set of identifiers with python

I would like to merge two tab-delimited text files that share one common column. I have an 'identifier file' that looks like this (2 columns by 1050 rows):
module 1 gene 1
module 1 gene 2
..
module x gene y
I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):
gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc
I would like to merge the two files based on the gene identifier and get both the matching expression values and the module affiliations in one place. Essentially, I want to take the genes from the identifier file, find them in the target file, and create a new file with module #, gene #, and expression values all together. Any suggestions would be welcome.
My desired output is gene ID tab module affiliation tab sample values separated by tabs.
Here is the script I came up with. It does not produce any error messages, but it gives me an empty file.
expression_values = {}
matches = []
with open("identifiers.txt") as ids, open("target.txt") as target:
    for line in target:
        expression_values = {line.split()[0]: line.split()}
    for line in ids:
        block_idents = line.split()
        for gene in expression_values.iterkeys():
            if gene == block_idents[1]:
                matches.append(block_idents[0] + block_idents[1] + expression_values)

csvfile = "modules.csv"
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in matches:
        writer.writerow([val])
Thanks!
These lines of code are not doing what you are expecting them to do:
for line in target:
    expression_values = {line.split()[0]: line.split()}
for line in ids:
    block_idents = line.split()
    for gene in expression_values.iterkeys():
        if gene == block_idents[1]:
            matches.append(block_idents[0] + block_idents[1] + expression_values)
expression_values and block_idents only ever hold the values from the line currently being read; each assignment replaces the previous contents, so by the time the ids loop runs, the dictionary holds only the final line of the target file. In other words, the dictionary is not "growing" as more lines are read. Also, TSV files can be parsed with less effort using the csv module.
There are a few assumptions I am making in this rough solution:
The "genes" in the first file are the only "genes" that will appear in the second file.
There could be duplicate "genes" in the first file.
First, construct a map of the data in the first file:
import csv
from collections import defaultdict

gene_map = defaultdict(list)
with open(first_file, newline='') as file_one:
    csv_reader = csv.reader(file_one, delimiter='\t')
    for row in csv_reader:
        gene_map[row[1]].append(row[0])  # gene -> list of modules
Read the second file and write to the output file simultaneously.
with open(sec_file, newline='') as file_two, open(op_file, 'w', newline='') as out_file:
    csv_reader = csv.reader(file_two, delimiter='\t')
    csv_writer = csv.writer(out_file, delimiter='\t')
    for row in csv_reader:
        values = gene_map.get(row[0], [])
        op_list = [row[0]]
        op_list.extend(values)   # module affiliation(s) for this gene
        op_list.extend(row[1:])  # expression values (extend op_list, not values,
                                 # or the sample data never reaches the output row)
        csv_writer.writerow(op_list)
There are a number of problems with the existing approach, not least of which is that you are throwing away all data from the files except for the last line in each. The assignment inside each "for line in" loop replaces the contents of the variable, so only the last assignment, for the last line, has any effect.
Assuming each gene appears in only one module, I suggest instead you read the "ids" into a dictionary, saving the module for each gene id:
geneMod = {}
for line in ids:
    fields = line.split()
    geneMod[fields[1]] = fields[0]  # key by gene id (column 2), value is the module
Then you can just go through the target lines and, for each line, split it, get the gene id (gene = targetsplit[0]) and save (or output) the same split fields with the module value inserted, e.g. print(gene, geneMod[gene], targetsplit[1:]).
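Putting that together, a minimal sketch (assuming the whitespace-separated identifiers.txt and target.txt from the question, and the tab-separated output the asker wants):

geneMod = {}
with open("identifiers.txt") as ids, open("target.txt") as target:
    for line in ids:
        fields = line.split()
        geneMod[fields[1]] = fields[0]  # gene id -> module
    for line in target:
        targetsplit = line.split()
        gene = targetsplit[0]
        if gene in geneMod:  # skip genes with no module affiliation
            print(gene, geneMod[gene], *targetsplit[1:], sep='\t')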

How to perform multiple changes to a csv file without intervening writes

I want to perform multiple edits to most rows in a csv file without making multiple writes to the output csv file.
I have a csv that I need to convert and clean up into a specific format for another program to use. For example, I'd like to:
remove all blank rows
remove all rows where the value of column "B" is not a number
with this new data, create a new column and explode the first part of the values in column B into the new column
Here's an example of the data:
"A","B","C","D","E"
"apple","blah","1","","0.00"
"ape","12_fun","53","25","1.00"
"aloe","15_001","51","28",2.00"
I can figure out the logic behind each process, but what I can't figure out is how to perform each process without reading and writing to a file each time. I'm using the CSV module. Is there a better way to perform these steps at once before writing a final CSV?
I would define a set of tests and a set of processes.
If all tests pass, all processes are applied, and the final result is written to output:
import csv

#
# Row tests
#
def test_notblank(row):
    return any(len(i) for i in row)

def test_bnumeric(row):
    # keep only rows whose "B" value starts with a numeric part, e.g. "12_fun"
    return row[1].split('_')[0].isdigit()

def do_tests(row, tests=[test_notblank, test_bnumeric]):
    return all(t(row) for t in tests)

#
# Row processing
#
def process_splitb(row):
    b = row[1].split('_')  # the sample data separates the parts of "B" with '_'
    row[1] = b[0]
    row.append(b[1])
    return row

def do_processes(row, processes=[process_splitb]):
    for p in processes:
        row = p(row)
    return row

def main():
    with open("in.csv", newline='') as inf, open("out.csv", "w", newline='') as outf:
        incsv = csv.reader(inf)
        outcsv = csv.writer(outf)
        outcsv.writerow(next(incsv))  # pass the header row through
        outcsv.writerows(do_processes(row) for row in incsv if do_tests(row))

if __name__ == "__main__":
    main()
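With the sample data from the question (and the underscore split), out.csv would contain roughly the following; note that the header row is passed through unchanged, so it keeps its original five columns:

A,B,C,D,E
ape,12,53,25,1.00,fun
aloe,15,51,28,2.00,001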
Simple for loops.
import csv

csv_file = open('in.csv', newline='')
csv_reader = csv.reader(csv_file)
header = next(csv_reader)
header.append('F')  # add new column
records = [header]

# process records
for record in csv_reader:
    # skip blank records
    if record == []:
        continue
    # make sure column "B" has 2 parts
    try:
        part1, part2 = record[1].split('_')
    except ValueError:
        continue
    # make sure part1 is a digit
    if not part1.isdigit():
        continue
    record[1] = part1     # make column B equal part1
    record.append(part2)  # add data for the new column F to record
    records.append(record)
csv_file.close()

new_csv_file = open('out.csv', 'w', newline='')
csv_writer = csv.writer(new_csv_file, quoting=csv.QUOTE_ALL)
for r in records:
    csv_writer.writerow(r)
new_csv_file.close()
Why use the csv module at all? A CSV is made up of text lines (strings), and you can use Python's string methods (split, join, replace, len) to create your result:
line_cols = line.split(',') and back: line = ','.join(line_cols)
Bear in mind, though, that a plain split(',') mishandles quoted fields that contain commas; handling those cases is exactly what the csv module is for.
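A minimal string-based sketch on the question's sample data (assuming no commas inside the quoted fields; the file names are illustrative):

with open('in.csv') as inf, open('out.csv', 'w') as outf:
    header = inf.readline().strip()
    outf.write(header + ',"F"\n')  # add the new column to the header
    for line in inf:
        cols = [c.strip('"') for c in line.strip().split(',')]
        if len(cols) < 2 or '_' not in cols[1]:
            continue  # skip blank rows and rows without a splittable "B"
        part1, part2 = cols[1].split('_', 1)
        if not part1.isdigit():
            continue  # skip rows whose "B" does not start with a number
        cols[1] = part1
        cols.append(part2)
        outf.write(','.join('"%s"' % c for c in cols) + '\n')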
