Python: efficient way of combining 2 differently delimited csv files - python

I have written the following code to combine a Tab-separated csv file with another Comma-separated csv file (which also has a header). The final output is a Tab-separated csv without a header.
import csv

with open('train.csv', 'r') as infile1, open('test.csv', 'r') as infile2, open('final.csv', 'a') as outfile:
    reader1 = csv.reader(infile1, delimiter='\t')
    reader2 = csv.reader(infile2)
    next(reader2, None)  # skip the header
    writer = csv.writer(outfile, delimiter='\t')
    for row in reader1:
        writer.writerow(row)
    for row in reader2:
        writer.writerow(row)
Below are sample rows from train.csv and test.csv respectively:
main-captions MSRvid 2012 0001 5.000 A plane. is taking off.
main-captions MSRvid 2012 0004 3.800 A man. is playing a flute.
Domain,Task Name,Year,Index,Score,Sentence 1,Sentence 2
Exp,Exp,2020,1,5,product,damage
Exp,Exp,2020,2,5,product,broken
The above code works fine. But is there a shorter way to achieve this? Say, one that makes use of newer packages or features within the csv module?

Your code is already efficient, but it can be shortened further using writer.writerows:
from itertools import chain
writer.writerows(chain(reader1, reader2))
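Put together, a minimal sketch of the shortened version (same file names and delimiters as in the question):

import csv
from itertools import chain

with open('train.csv') as infile1, open('test.csv') as infile2, open('final.csv', 'a') as outfile:
    reader1 = csv.reader(infile1, delimiter='\t')
    reader2 = csv.reader(infile2)
    next(reader2, None)  # skip the header of test.csv
    writer = csv.writer(outfile, delimiter='\t')
    # writerows accepts any iterable of rows, so the two readers can be
    # chained together lazily without building an intermediate list
    writer.writerows(chain(reader1, reader2))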

Related

large file delete rows python

Need some help with a use case. I have two files: one is about 9GB (test_data) and the other 42MB (master_data). test_data contains data with several columns; one of the columns, i.e. #7, contains an email address. master_data is my master data file, which has just one column, the email address.
What I am trying to achieve is to compare the emails in the master_data file with the emails in test_data; if they match, the whole row is to be deleted from test_data. I need an efficient way to achieve this.
The piece of code below was written to achieve that, but I am stuck at deleting the lines from the test_data file, and I am not sure this is an efficient way of meeting the requirement.
import csv
import time

# open the master file in read mode
filename = open('master_data.csv', 'r')
# create a DictReader object
file = csv.DictReader(filename)

# iterate over each row and append the email values to an empty list
email = []
for col in file:
    email.append(col['EMAIL'])
# print the list
print('Email:', email)

datafile = open('test_data.csv', 'r+')
for line in datafile:
    # print(line)
    # str1, id = line.split(',')
    split_line = line.split(',')
    str1 = split_line[7]  # whatever column holds the email
    id1 = split_line[0]
    for w in email:
        print(w)
        print(str1)
        # time.sleep(2.4)
        if w in str1:
            print(id1)
            datafile.remove(id1)  # stuck here: file objects have no remove() method
Removing lines from a file is difficult. It's a lot easier to write a new file, filtering out rows as you go. Put your existing emails in a set for fast lookup, write to a temporary file and rename it when done. This also has the advantage that you don't lose data if something goes wrong along the way.
You'll need to "normalize" the emails. Most email systems are case-insensitive, and some (notably Gmail) ignore periods in the local part. Addresses can also contain extra name information, as in John Doe <j.doe@gmail.com>. Write a function that puts addresses into a single form and use it on both of your files.
import csv
import os
import email.utils

def email_normalize(val):
    # discard the optional full name; lower-case and remove '.' in the local part
    _, addr = email.utils.parseaddr(val)
    local, domain = addr.lower().split('@', 1)
    local = local.replace('.', '')
    return f'{local}@{domain}'

# create the set of emails whose rows should be removed
with open('master_data.csv', newline='') as file:
    emails = set(email_normalize(row[0]) for row in csv.reader(file))

with open('test_data.csv', newline='') as infile, \
     open('test_data.csv.tmp', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(next(reader))  # write the header
    writer.writerows(row for row in reader
                     if email_normalize(row[7]) not in emails)  # email is column #7

os.rename('test_data.csv', 'test_data.csv.deleteme')
os.rename('test_data.csv.tmp', 'test_data.csv')
os.remove('test_data.csv.deleteme')
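As a side note, os.replace (Python 3.3+) overwrites its destination in a single call on both POSIX and Windows, so the three-step rename at the end could likely be collapsed to:

import os

# swap the filtered file into place, overwriting the original
os.replace('test_data.csv.tmp', 'test_data.csv')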
You could load the master file and store the emails in a dict, then as you iterate the rows of test, you can check if the email for a row is in that (master) dict.
Given these CSVs:
test.csv:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,,,,,,foo@mail.com
2,,,,,,bar@mail.com
3,,,,,,baz@mail.com
4,,,,,,dog@mail.com
5,,,,,,foo@mail.com
master.csv:
Col1
foo@mail.com
cat@mail.com
dog@mail.com
When I run:
import csv

emails: dict[str, None] = {}

with open("master.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        emails[row[0]] = None

out_line = "{:<20} {:>8}"

with open("test.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    print(out_line.format("Email", "Test ID"))
    for row in reader:
        if row[6] in emails:
            print(out_line.format(row[6], row[0]))
I get:
Email                 Test ID
foo@mail.com                1
dog@mail.com                4
foo@mail.com                5
That proves that you can read the emails in from master and compare with them while reading from test.
As others have pointed out, actually deleting anything from a file is difficult; it's far easier to create a new file and just exclude (filter out) the things you don't want:
f_in = open("test.csv", newline="")
reader = csv.reader(f_in)

f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)

for row in reader:
    if row[6] in emails:
        continue  # skip rows whose email appears in master
    writer.writerow(row)

f_in.close()
f_out.close()
Iterating your CSV reader and writing out with your CSV writer is a very efficient way of transforming a CSV (test.csv → output.csv, in this case): you only need enough memory to hold one row at each step of the loop.
When I run that, after having populated the emails dict like before, my output.csv looks like:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
2,,,,,,bar@mail.com
3,,,,,,baz@mail.com
For the real-world performance of your situation, I mocked up a 42MB CSV file for master: 1.35M rows of 32-character-long hex strings. Reading those 1.35M unique strings and saving them in the dict took less than 1 second of real time and used 176 MB of RAM (on my M1 MacBook Air, with dual-channel SSD).
Also, I recommend using the csv module every time you need to read or write a CSV. No matter how simple the CSV looks, the csv readers/writers will be correct, and there's almost zero overhead compared to trying to manually split or join on commas.
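For example, quoted fields are the usual trap for a naive split (a hypothetical row):

import csv
import io

line = '1,"Doe, John",foo@mail.com'

# a naive split breaks the quoted field in two
print(line.split(','))  # ['1', '"Doe', ' John"', 'foo@mail.com']

# the csv module respects the quoting
print(next(csv.reader(io.StringIO(line))))  # ['1', 'Doe, John', 'foo@mail.com']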

Is it possible to combine two csv files into one as mentioned below

I am trying to combine two csv files into one, as mentioned below.
input1:
Name,Age,Department
Birla,52,Welding
Rob,45,Furnace
input2:
YearofService,Audit
14,Y
8,N
My expected output:
Name,Age,Department,YearofService,Audit
Birla,52,Welding,14,Y
Rob,45,Furnace,8,N
My code:
import csv

with open('input1.csv', 'r') as i, open('input2.csv', 'r') as o:
    reader = csv.DictReader(i)
    fieldnames = reader.fieldnames
    reader1 = csv.DictReader(o)
    fieldnames_1 = reader1.fieldnames
    # here I read the fieldnames, but how do I combine the two csv files into one?
I don't need to use pandas. Is it possible to achieve this using the csv library?
Help me on this.
Instead of writing into a separate output file, adding the columns of input2 onto input1 is also fine.
You can simply read both files at the same time and combine the rows to create a new file (or write over one of them). Here is one example of how you can do it, which you can adapt to your needs.
import csv

with open('input1.csv', 'r') as input1, open('input2.csv', 'r') as input2:
    with open('output.csv', 'w') as output:
        writer = csv.writer(output, lineterminator='\n')
        reader1 = csv.reader(input1)
        reader2 = csv.reader(input2)
        rows = []
        for row1, row2 in zip(reader1, reader2):
            rows.append(row1 + row2)
        writer.writerows(rows)
Side note: don't forget that the best way to join CSVs is on a common key column, so the rows stay aligned; zip only works when both files list the same records in the same order. Python's built-in csv library is basic, and ideally a better tool for this kind of join would be used, such as pandas.
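A minimal sketch of such a key-based join with just the csv module, assuming both files carry a shared 'ID' column (a hypothetical column name, not one from the question's files):

import csv

# index input2's rows by the shared key column
with open('input2.csv', newline='') as f2:
    reader2 = csv.DictReader(f2)
    by_id = {row['ID']: row for row in reader2}
    fields2 = [f for f in reader2.fieldnames if f != 'ID']

with open('input1.csv', newline='') as f1, open('output.csv', 'w', newline='') as out:
    reader1 = csv.DictReader(f1)
    writer = csv.DictWriter(out, fieldnames=reader1.fieldnames + fields2)
    writer.writeheader()
    for row in reader1:
        # merge the matching input2 columns; rows align by key, not file order
        match = by_id.get(row['ID'], {})
        writer.writerow({**row, **{k: match.get(k, '') for k in fields2}})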

Reading a CSV and writing the exact file back with one column changed

I'm looking to read a CSV file and make some calculations to the data in there (that part I have done).
After that I need to write all the data into a new file, exactly the way it is in the original, with the exception of one column, which will be changed to the result of the calculations.
I can't show the actual code (confidentiality issues) but here is an example of the code.
import csv

headings = ["lots", "of", "headings"]

with open("read_file.csv", mode="r") as read,\
     open("write.csv", mode='w') as write:
    reader = csv.DictReader(read, delimiter=',')
    writer = csv.DictWriter(write, fieldnames=headings)
    writer.writeheader()
    for row in reader:
        writer.writerows()
At this stage I am just trying to return the same CSV in "write" as I have in "read".
I haven't used CSV much, so I'm not sure if I'm going about it the wrong way; I also understand this example is super simple, but I can't seem to get my head around the logic of it.
You're really close!
headings = ["lots", "of", "headings"]

with open("read_file.csv", mode="r") as read, open("write.csv", mode='w') as write:
    reader = csv.DictReader(read, delimiter=',')
    writer = csv.DictWriter(write, fieldnames=headings)
    writer.writeheader()
    for row in reader:
        # do processing here to change the values in that one column
        processed_row = some_function(row)
        writer.writerow(processed_row)
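Here some_function is a placeholder for your confidential calculation; a purely hypothetical version that recomputes one column might look like:

def some_function(row):
    # hypothetical: recompute the 'price' column as a 10%-discounted value
    row = dict(row)  # copy so the reader's dict is left untouched
    row['price'] = f'{float(row["price"]) * 0.9:.2f}'
    return row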
Why don't you use pandas?
import pandas as pd

csv_file = pd.read_csv('read_file.csv')
# do_something_and_add_a_column_here()
csv_file.to_csv('write_file.csv', index=False)  # index=False avoids writing an extra index column

Python Read Text File Column by Column

So I have a text file that looks like this:
1,989785345,"something 1",,234.34,254.123
2,234823423,"something 2",,224.4,254.123
3,732847233,"something 3",,266.2,254.123
4,876234234,"something 4",,34.4,254.123
...
I'm running this code right here:
file = open("file.txt", 'r')
readFile = file.readline()
lineID = readFile.split(",")
print lineID[1]
This lets me break up the content in my text file by ",", but what I want to do is separate it into columns, because I have a massive number of IDs and other values in each line. How would I go about splitting the text file into columns and working through each column's values one by one?
You have a CSV file, use the csv module to read it:
import csv

with open('file.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # each row is a list of that line's column values
        print(row)
This still gives you data by row, but with the zip() function you can transpose this to columns instead:
import csv

with open('file.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for column in zip(*reader):
        # each column is a tuple holding every value from one column
        print(column)
Do be careful with the latter; the whole file will be read into memory in one go, and a large CSV file could eat up all your available memory in the process.
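If memory is a concern, a single column can also be pulled lazily without transposing the whole file; a minimal sketch (Python 3 style, and column index 1 is just an example):

import csv

with open('file.txt', 'r') as csvfile:
    reader = csv.reader(csvfile)
    ids = (row[1] for row in reader)  # generator: one value per row, no full read
    for value in ids:
        print(value)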

Python: appending/merging multiple csv files respecting headers and write to csv

[Using Python3] I'm very new to (Python) programming, but nonetheless I'm writing a script that scans a folder for certain csv files, then reads them all, appends them, and writes the result into another csv file.
In between, it is required that data is returned only where the values in certain columns match a set of criteria.
All csv files have the same columns, and would look somewhere like this:
header1 header2 header3 header4 ...
string float string float ...
string float string float ...
string float string float ...
string float string float ...
... ... ... ... ...
The code I'm working with right now is the following (below); however, it just keeps overwriting the data from the previous file. That much makes sense to me, I just cannot figure out how to get it working.
Code:
import csv
import datetime
import sys
import glob
import itertools
from collections import defaultdict

# Raw data files have names like '2013-06-04.csv'. To be able to use this script
# during the whole of 2013, the glob searches for the pattern '2013-*.csv'
files = [f for f in glob.glob('2013-*.csv')]

# Output file looks like '20130620-filtered.csv'
outfile = '{:%Y%m%d}-filtered.csv'.format(datetime.datetime.now())

# List of 'Header4' values to be filtered for writing output
header4 = ['string1', 'string2', 'string3', 'string4']

for f in files:
    with open(f, 'r') as f_in:
        dict_reader = csv.DictReader(f_in)
        with open(outfile, 'w') as f_out:
            dict_writer = csv.DictWriter(f_out, lineterminator='\n', fieldnames=dict_reader.fieldnames)
            dict_writer.writeheader()
            for row in dict_reader:
                if row['Header4'] in header4:
                    dict_writer.writerow(row)
I also tried something like readers = list(itertools.chain(*map(lambda f: csv.DictReader(open(f)), files))) and then iterating over the readers, however I cannot figure out how to work with the headers that way. (I get an error because the chained object does not have the fieldnames attribute.)
Any help is very much appreciated!
You keep re-opening the file and overwriting it.
Open outfile once, before your loops start. For the first file you read, write the header and the rows. For rest of the files, just write the rows.
Something like
with open(outfile, 'w') as f_out:
    dict_writer = None
    for f in files:
        with open(f, 'r') as f_in:
            dict_reader = csv.DictReader(f_in)
            if not dict_writer:
                dict_writer = csv.DictWriter(f_out, lineterminator='\n', fieldnames=dict_reader.fieldnames)
                dict_writer.writeheader()
            for row in dict_reader:
                if row['Header4'] in header4:
                    dict_writer.writerow(row)
