Compare 2 csv files and update columns

Compare 2 csv files and update columns - python

I have 2 csv files with 2 rows 3 columns (id, name, value) that I want to compare. If there's a new row added to one of the files, the other one is updated as well. Likewise, if a value in one of the column changes the other file is updated.
Here's what I tried
a = '/path/to/file'
b = 'path/to/file'
with open(a, 'r') as f1, open(b, 'r') as f2:
file1 = csv.DictReader(f1)
file2 = csv.DictReader(f2)
for row_new in file2:
for row_old in file1:
if row_new['id'] == row_old['id']:
for k1, v1 in row_new.items():
for k, v in row_old.items():
if row_old[k1] == row_new[k]:
if v1 != v:
print(f'updated value for col {k}')
v1 = v
else:
print('Nothing to update')
else:
print(f'create row {row_new["id"]}')
I noticed that the iteration takes place only once. Am I doing something wrong here?

I noticed that the iteration takes place only once...?
The inner loop is probably reaching the end of the file before the outer loop has a chance to make its next iteration. Try moving the file object's pointer back to the beginning after the inner loop stops.
with open(a, 'r') as f1, open(b, 'r') as f2:
...
for row_new in file2:
for row_old in file1:
if row_new['id'] == row_old['id']:
...
else:
print(f'create row {row_new["id"]}')
f1.seek(0)
Some would say that the nested for loops is what you are doing wrong. Here are some SO questions/answers to consider.
python update a column value of a csv file according to another csv file
Python Pandas: how to update a csv file from another csv file
search python csv update one csv file based on another site:stackoverflow.com
Basically you should try to just read each file once and use data types that allow for fast membership testing like sets or dicts.
Your DictReaders will give you an {'id':x,'name':y,'value':z} dict for each row - causing you to use nested for loops to compare all the rows from one file to each row in the other. You could create a single dictionary using the id column for the keys and the dictionary values could be lists - {id:[name,value],id:[name,value],...} which may make the processing easier.
You also opened both your files for reading, open(...,'r'), so you'll probably find your files unchanged after you fix other things.

Related

Need to read csv files (when csv file is multiple input files) in Python

I have a school assignment that is asking me to write a program that first reads in the name of an input file and then reads the file using the csv.reader() method. The file contains a list of words separated by commas. The program should output the words and their frequencies (the number of times each word appears in the file) without any duplicates.
I have been able to figure out how to do this somewhat for one specific input file, but the program needs to be able to read multiple input files. This is what I have so far:
with open('input1.csv', 'r') as input1file:
csv_reader = csv.reader(input1file, delimiter = ',')
for row in csv_reader:
new_row = set(row)
for m in new_row:
count = row.count(m)
print(m, count)
This is what I get:
woman 1
man 2
Cat 1
Hello 1
boy 2
cat 2
dog 2
hey 2
hello 1
This works (almost) for the input1 file, except it changes the order each time I run it.
And I need it to work for two other input files?
sample CSV
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy

See the code below for an example, I've commented it so you understand what it does and why.
As for the fact that for your implementation the order is different is due to the usage of set. A set by definition is unordered.
Also note that with your implementation you are passing over the rows twice, once to turn it into a set, and once more to count. Besides this, if the file contains more than one row, your logic would fail, as the counting part only gets reached when the last line of the file is read.
import csv
def count_things(filename):
with open(filename) as infile:
csv_reader = csv.reader(infile, delimiter = ',')
result = {}
for row in csv_reader:
# go over the row by element
for element in row:
# does it exist already?
if element in result:
# if yes, increase count
result[element] += 1
else:
# if no, add and set count to 1
result[element] = 1
# sorting, explained in detail here:
# https://stackoverflow.com/a/613218/9267296
return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
# you could just return unsorted result by using:
# return result
for key, value in count_things("input1.csv").items():
# iterate over items() by key/value pairs
# see this link:
# https://www.w3schools.com/python/python_dictionaries_access.asp
print(key, value)

Correlate data from two CSVs and write the data to the first CSV using Python

I'm having trouble figuring out where to dive in on this personal project and I was hoping this community could help me create a Python script to deal with this data.
I have a CSV file that contains a list of meals fed to dogs at an animal rescue, associated by with the kennel number:
Source CSV - mealsandtreats.csv
blank_column,Kennel_Number,Species,Food,Meal_ID
,1,Dog,Meal,11.2
,5,Dog,Meal,45.2
,3,Dog,Meal,21.4
,4,Dog,Meal,17
,2,Dog,Meal,11.2
,4,Dog,Meal,21.4
,6,Dog,Meal,17
,2,Dog,Meal,45.2
I have a second CSV file that provides a key which maps the meals to what treats come with the meal:
Meal to Treat Key - MealsToTreatsKey.csv
Meals_fed,Treats_fed
10.1,2.4
11.2,2.4
13.5,3
15.6,3.2
17,3.2
20.1,5.1
21.4,5.2
35.7,7.7
45.2,7.9
I need to take every meal type (eg; drop duplicate entries) that was delivered from table 1, find the associated treat type, and then create an individual entry for every time a treat was served to a specific kennel. The final result should look something like this:
Result CSV - mealsandtreats.csv
blank_column,Kennel_Number,Species,Food,Meal_ID
,1,Dog,Meal,11.2
,5,Dog,Meal,45.2
,3,Dog,Meal,21.4
,4,Dog,Meal,17
,2,Dog,Meal,11.2
,4,Dog,Meal,21.4
,6,Dog,Meal,17
,2,Dog,Meal,45.2
,1,Dog,Treat,2.4
,5,Dog,Treat,7.9
,3,Dog,Treat,5.2
,4,Dog,Treat,3.2
,1,Dog,Treat,2.4
,4,Dog,Treat,5.2
Would prefer to do this with the csv module and not Pandas, but I'm open to using Pandas if necessary.
I have a bit of code so far just opening the CSVs, but I'm really stuck on where to go next:
import csv
with open('./meals/results/foodToTreats.csv', 'r') as t1,
open('./results/food.csv', 'r') as t2:
key = t1.readlines()
map = t2.readlines()
with open('./results/food.csv', 'w') as outFileF:
for line in map:
if line not in key:
outFileF.write(line)
with open('./results/foodandtreats.csv', 'w') as outFileFT:
for line in map:
if line not in key:
outFileFT.write(line)
So basically I just need to take every treat entry in the 2nd sheet, search for matching associated food entries in the 1st sheet, look up the kennel number associated with that entry and then write it to the 1st sheet.
Giving it my best shot in pseudo code, something like:
for x in column 0,y:
y,1 = Z
food = x
treat = y
kennel_number = z
when x,z:
writerows('', {'kennel_number"}, 'species', '{food/treat}',
{'meal_id"})
Update: Here is the exact code I'm using, thanks to #wwii. Seeing a minor bug:
import csv
import collections
treats = {}
with open('mealsToTreatsKey.csv') as f2:
for line in f2:
meal,treat = line.strip().split(',')
treats[meal] = treat
new_items = set()
Treat = collections.namedtuple('Treat', ['blank_column','Kennel_Number','Species','Food','Meal_ID'])
with open('foodandtreats.csv') as f1:
reader = csv.DictReader(f1)
for row in reader:
row['Food'] = 'Treat'
row['Meal_ID'] = treats[row['Meal_ID']]
new_items.add(Treat(**row))
fieldnames = reader.fieldnames
with open('foodandtreats.csv', 'a') as f1:
writer = csv.DictWriter(f1, fieldnames)
for row in new_items:
writer.writerow(row._asdict())
This works perfectly except for one small bug. The first new row written isn't starting on its own line:
enter image description here

Make a dictionary mapping meals to treats
treats = {}
with open(treatfile) as f2:
for line in f2:
meal,treat = line.strip().split(',')
treats[meal] = treat
Iterate over the meal file and create set of new entries. Use namedtuples for the new items.
import collections
new_items = set()
Treat = collections.namedtuple('Treat', ['blank_column','Kennel_Number','Species','Food','Meal_ID'])
with open(mealfile) as f1:
reader = csv.DictReader(f1)
for row in reader:
row['Food'] = 'Treat'
row['Meal_ID'] = treats[row['Meal_ID']]
new_items.add(Treat(**row))
fieldnames = reader.fieldnames
Open the meal file (again) for appending and write the new entries
with open(mealfile, 'a') as f1:
writer = csv.DictWriter(f1, fieldnames)
for row in new_items:
writer.writerow(row._asdict())
If the meals file does not end with a newline character, you will need to add one before writing the new treat lines. Since you have control of the files you should just make sure it always ends in a blank line.

Code returns same line multiple times instead of multiple lines

What I'm trying to do is to open two CSV files and print only the lines in which the content of a column in file 1 and file 2 match. I already know that I should end up with 14 results, but instead the first line of the CSV file I'm working with gets printed 14 times. Where did I go wrong?
file1 = open("../dir/file1.csv", "r")
for line in file1:
file1splitted = line.strip().split(",")
file2 = open("../dir/file2.csv", "r")
for line in file2:
file2splitted = line.strip().split(",")
for line in file1:
if file1splitted[0] == file2splitted [2]:
print (file1splitted[0],file1splitted[1], file2splitted[6], file2splitted[10], file2splitted[12])
file1.close()
file2.close()

You should be using the csv module for reading these files because splitting on commas is not reliable; it's fine for a single CSV column to contain values that themselves include commas.
I've added a couple of things to try make this cleaner and to help you move forward in your learning:
I've used the with context manager that automatically closes a file once you're done reading it. No need for .close()
I've packaged the csv reading code into a function. Now we only need to write that part once and we can call the function with any file.
I've used the csv module to read the file. This will return a nested list of rows, each inner list representing a single row.
I've used a list comprehension which is a neater way of writing a for loop that creates a list. In this case, it's a list of all the items in the first column of file_1.
I've converted the list in Point 4 into a set. When we iterate through file_2, we can very quickly check whether a row value has been seen in file_1 (set lookup is O(1) rather than having to iterate through file_1 every single time).
The indices I print are from my own test files, you will need to adapt them to your own use-case.
import csv
def read_csv(file_name):
with open(file_name) as infile: # Context manager to auto-close files at end
reader = csv.reader(infile)
#next(reader) remove the hash if you want to drop the headers
return list(reader)
file_1 = read_csv('file_1.csv')
file_2 = read_csv('file_2.csv')
# Make a set of file_1 column 0 with a list comprehension
file_1_vals = set([item[0] for item in file_1])
# Now iterate through file_2
for row in file_2:
if row[2] in file_1_vals:
print(row[1])

file1 = open("../dir/file1.csv", "r")
file2 = open("../dir/file2.csv", "r")
for line in file1:
file1splitted = line.strip().split(",")
for line in file2:
file2splitted = line.strip().split(",")
if file1splitted[0] == file2splitted [2]:
print (file1splitted[0],file1splitted[1], file2splitted[6], file2splitted[10], file2splitted[12])
file1.close()
file2.close()
if you provide your csv files then I can help you more.

Check for key in csv file if key is matched then add data into different rows of matched column using python

I have csv file like below
I need to search for a key then some values should be added in that key column. for example I need to search for folder and some values should be added in folder column. in the same way I need to search for name and some values should be added in name column.
so the final output looks like below
I have followed the below way but it doesn't work for me
import csv
list1 = [['ab', 'cd', 'ed']]
with open('1.csv', 'a') as f_csv:
data_to_write_list1 = zip(*list1)
writer = csv.writer(f_csv, delimiter=',', dialect='excel')
writer.writerows(data_to_write_list1)

If you want to only use built-in methods, you can get the first row of a file (in the case of a CSV file like yours, the headers) like this:
>>> with open('file_you_need.csv', 'r') as f:
>>> file = f.readline()
In your case the variable file would then be (supposing the delimiter is ","):
folder,name,service
You can now do file.split(",") (eventually replacing "," with whatever your delimiter is) and you'll get back a list of headers. You can then create a list of lists where each list is a row of your file and write back to the file or use a dictionary to link new entries to each header. Depending on your choice you would then in different ways write back to the file, i.e. supposing you go with list of lists:
with open('file_you_need.csv','w') as f:
for list in listoflists:
row = ""
for el, i in enumerate(list):
if i != len(list):
row += el+","
else:
row += el
f.write(row)
As others have mentioned you could also use Pandas and DataFrames to make it cleaner, but I don't think this is too hard to grasp

Connecting similar lines from two files

I have two files, both are very big. The files have mixed up information between themselves and I need to compare two files and connect the lines that intersect.
An example would be:
1st file has
var1:var2:var3
2nd would have
var2:var3:var4
I need to connect these in a third file with output: var1:var2:var3:var4.
Please note that the lines do not match, var4 which should go with var1 (since they have var2 and var3 together). Var2 and Var3 are common for Var1 and Var4. could be far away in these huge files.
I need to find a way to compare each line and connect it to the one in the 2nd file. I can't seem to think of anything of an adequate loop. Any ideas?

Try the following (assuming var2:var3 is always a unique key in both files):
Iterate over all lines in the first file
Add all entries into a dictionary with the value var2:var3 as key (and var1 as value)
Iterate over all entries in the second file
look up if the dictionary from part 1 contains an entry for the key var2:var3 and if it does output var1:var2:var3:var4 into the output file and delete the entry from the dictionary.
This approach can use very big amounts of memory and therefore should probably not be used for very large files.

Based on the specific fields you said that you want to match (2 & 3 from file 1, 1 & 2 from file 2):
#!/usr/bin/python3
# Iterate over every line in file1.
# Iterate over every line in file2.
# If lines intersect, print combined line.
with open('file1') as file1:
for line1 in file1:
u1,h1,s1 = line1.rstrip().split(':')
with open('file2') as file2:
for line2 in file2:
h2,s2,p2 = line2.rstrip().split(':')
if h1 == h2 and s1 == s2:
print(':'.join((u1,h1,s2,p2)))
This is horrendously slow (in theory), but uses a minimum of RAM. If the files aren't absolutely huge, it might not perform too badly.

If memory isn't problem, use dictionary where the key is the same as the value:
#!/usr/bin/python
out_dict = {}
with open ('file1','r') as file_in:
lines = file_in.readlines()
for line in lines:
out_dict[line] = line
with open ('file2','r') as file_in:
lines = file_in.readlines()
for line in lines:
out_dict[line] = line
with open ('output_file','w') as file_out:
for key in out_dict:
file_out.write (key)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Compare 2 csv files and update columns - python

Related

Need to read csv files (when csv file is multiple input files) in Python

Correlate data from two CSVs and write the data to the first CSV using Python

Code returns same line multiple times instead of multiple lines

Check for key in csv file if key is matched then add data into different rows of matched column using python

Connecting similar lines from two files

Categories

Resources