Need some help with a use case. I have two files: one is about 9GB (test_data) and the other 42MB (master_data). test_data contains several columns; column #7 contains email addresses. master_data is my master data file, which has just one column: the email address.
What I am trying to achieve is to compare the emails in the master_data file with the emails in test_data; if they match, the whole row is to be deleted. I need an efficient way to achieve this.
The piece of code below was written to achieve this, but I am stuck at deleting the matching lines from the test_data file, and I am not sure this is an efficient way of meeting the requirement.
import csv
import time
# open the file in read mode
filename = open('master_data.csv', 'r')
# creating dictreader object
file = csv.DictReader(filename)
# creating empty lists
email = []
# iterating over each row and append
# values to empty list
for col in file:
    email.append(col['EMAIL'])
# printing lists
print('Email:', email)
datafile = open('test_data.csv', 'r+')
for line in datafile:
    #print(line)
    # str1,id=line.split(',')
    split_line=line.split(',')
    str1=split_line[7] # Whatever columns
    id1=split_line[0]
    for w in email:
        print(w)
        print(str1)
        #time.sleep(2.4)
        if w in str1:
            print(id1)
            datafile.remove(id1)
Removing lines from a file is difficult. It's a lot easier to write a new file, filtering out rows as you go. Put your existing emails in a set for easy lookup, write to a temporary file, and rename when done. This also has the advantage that you don't lose data if something goes wrong along the way.
You'll need to "normalize" the emails. Domains are case-insensitive, and some providers (notably Gmail) ignore periods in the local part. Addresses can also contain extra name information, as in John Doe <j.doe@gmail.com>. Write a function that puts addresses into a single form and use it for both of your files.
import csv
import os
import email.utils
def email_normalize(val):
    # discard optional full name; lower-case; remove '.' in the local name
    _, addr = email.utils.parseaddr(val)
    local, domain = addr.lower().split('@', 1)
    local = local.replace('.', '')
    return f'{local}@{domain}'
# create the set of emails whose rows should be removed
with open('master_data.csv', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # skip the EMAIL header row
    emails = set(email_normalize(row[0]) for row in reader)

with open('test_data.csv', newline='') as infile, \
        open('test_data.csv.tmp', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(next(reader))  # write header
    writer.writerows(row for row in reader
                     if email_normalize(row[7]) not in emails)  # email is column #7
del reader, writer

os.rename('test_data.csv', 'test_data.csv.deleteme')
os.rename('test_data.csv.tmp', 'test_data.csv')
os.remove('test_data.csv.deleteme')
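For example, the function maps different spellings of the same address to a single form:

>>> email_normalize('John Doe <J.Doe@Gmail.com>')
'jdoe@gmail.com'
>>> email_normalize('jdoe@GMAIL.COM')
'jdoe@gmail.com'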
You could load the master file and store the emails in a dict, then as you iterate the rows of test, you can check if the email for a row is in that (master) dict.
Given these CSVs:
test.csv:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,,,,,,foo@mail.com
2,,,,,,bar@mail.com
3,,,,,,baz@mail.com
4,,,,,,dog@mail.com
5,,,,,,foo@mail.com
master.csv:
Col1
foo@mail.com
cat@mail.com
dog@mail.com
When I run:
import csv

emails: dict[str, None] = {}
with open("master.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        emails[row[0]] = None

out_line = "{:<20} {:>8}"
with open("test.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    print(out_line.format("Email", "Test ID"))
    for row in reader:
        if row[6] in emails:
            print(out_line.format(row[6], row[0]))
I get:
Email                 Test ID
foo@mail.com                1
dog@mail.com                4
foo@mail.com                5
That proves that you can read the emails in from master and compare with them while reading from test.
As others have pointed out, actually deleting anything from a file is difficult; it's far easier to create a new file and just exclude (filter out) the things you don't want:
f_in = open("test.csv", newline="")
reader = csv.reader(f_in)

f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)

for row in reader:
    if row[6] in emails:
        continue
    writer.writerow(row)

f_in.close()
f_out.close()
Iterating your CSV reader and writing out with your CSV writer is a very efficient means of transforming a CSV (test.csv → output.csv, in this case): you only need the memory to hold row in each step of the loop.
When I run that, after having populated the emails dict like before, my output.csv looks like:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
2,,,,,,bar@mail.com
3,,,,,,baz@mail.com
For the real-world performance of your situation, I mocked up a 42MB CSV file for master—1.35M rows of 32-character-long hex strings. Reading those 1.35M unique strings and saving them in the dict took less than 1 second real time and used 176 MB of RAM (on my M1 Macbook Air, with dual-channel SSD).
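If you want to reproduce the mock, here is a minimal sketch of one way to generate such a master file (the row count and string length come from above; secrets.token_hex is just one convenient source of random hex strings):

import csv
import secrets

with open("master.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Col1"])  # header, matching the sample above
    for _ in range(1_350_000):  # ~1.35M rows, roughly 42MB on disk
        writer.writerow([secrets.token_hex(16)])  # 16 random bytes -> 32 hex characters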
Also, I recommend using the csv module every time you need to read or write a CSV. No matter how simple the CSV looks, using the csv readers/writers will be 100% correct, and there's almost zero overhead compared to trying to manually split or join on a comma.
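A quick illustration of why: a manual split breaks as soon as a field contains a quoted comma, while the csv module parses it correctly:

import csv

line = '1,"Doe, John",foo@mail.com'
print(line.split(','))           # ['1', '"Doe', ' John"', 'foo@mail.com'] -- wrong, 4 fields
print(next(csv.reader([line])))  # ['1', 'Doe, John', 'foo@mail.com'] -- correct, 3 fields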
Related
I have one CSV file, and I want to extract the first column of it. My CSV file is like this:
Device ID;SysName;Entry address(es);IPv4 address;Platform;Interface;Port ID (outgoing port);Holdtime
PE1-PCS-RANCAGUA;;;192.168.203.153;cisco CISCO7606 Capabilities Router Switch IGMP;TenGigE0/5/0/1;TenGigabitEthernet3/3;128 sec
P2-CORE-VALPO.cisco.com;P2-CORE-VALPO.cisco.com;;200.72.146.220;cisco CRS Capabilities Router;TenGigE0/5/0/0;TenGigE0/5/0/4;128 sec
PE2-CONCE;;;172.31.232.42;Cisco 7204VXR Capabilities Router;GigabitEthernet0/0/0/14;GigabitEthernet0/3;153 sec
P1-CORE-CRS-CNT.entel.cl;P1-CORE-CRS-CNT.entel.cl;;200.72.146.49;cisco CRS Capabilities Router;TenGigE0/5/0/0;TenGigE0/1/0/6;164 sec
For that purpose I use the following code that I saw here:
import csv

makes = []
with open('csvoutput/topologia.csv', 'rb') as f:
    reader = csv.reader(f)
    # next(reader)  # Ignore first row
    for row in reader:
        makes.append(row[0])

print makes
Then I want to replace a particular value in a text file with each one of the values of the first column, and save the result as a new file.
Original textfile:
PLANNED.IMPACTO_ID = IMPACTO.ID AND
PLANNED.ESTADOS_ID = ESTADOS_PLANNED.ID AND
TP_CLASIFICACION.ID = TP_DATA.ID_TP_CLASIFICACION AND
TP_DATA.PLANNED_ID = PLANNED.ID AND
PLANNED.FECHA_FIN >= CURDATE() - INTERVAL 1 DAY AND
PLANNED.DESCRIPCION LIKE '%P1-CORE-CHILLAN%';
Expected output:
PLANNED.IMPACTO_ID = IMPACTO.ID AND
PLANNED.ESTADOS_ID = ESTADOS_PLANNED.ID AND
TP_CLASIFICACION.ID = TP_DATA.ID_TP_CLASIFICACION AND
TP_DATA.PLANNED_ID = PLANNED.ID AND
PLANNED.FECHA_FIN >= CURDATE() - INTERVAL 1 DAY AND
PLANNED.DESCRIPCION LIKE 'FIRST_COLUMN_VALUE';
And so on for every value in the first column, and save it as a separate file.
How can I do this? Thank you very much for your help.
You could just read the file, apply the changes, and write the file back again. There is no efficient way to edit a file in place (inserting characters in the middle is not efficiently possible); you can only rewrite it.
If your file is going to be big, you should not keep the whole table in memory.
import csv

makes = []
with open('csvoutput/topologia.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        makes.append(row)

# Apply changes in makes

with open('csvoutput/topologia.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(makes)
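For the per-value replacement the question asks about, a sketch along these lines could follow (template.txt and the query_N.txt names are assumptions; %P1-CORE-CHILLAN% is the placeholder from the original textfile above):

# reuse `makes` from above; each element is a full row, so take row[0]
with open('template.txt') as f:  # the original textfile, name assumed
    template = f.read()

for i, row in enumerate(makes):
    content = template.replace('%P1-CORE-CHILLAN%', row[0])
    with open('query_{}.txt'.format(i), 'w') as out:  # one output file per value
        out.write(content)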
I have two CSV files. The first file (state_abbreviations.csv) has only state abbreviations and their full state names side by side; the second file (test.csv) has the state abbreviations with additional info.
I want to replace each state abbreviation in test.csv with its associated full state name from the first file.
My approach was to read each file and build a dict from the first file (state_abbreviations.csv), then read the second file (test.csv) and, wherever an abbreviation matches the first file, replace it with the full name.
Any help is appreciated.
import csv

state_initials = ("state_abbr")
state_names = ("state_name")
state_file = open("state_abbreviations.csv", "r")
state_reader = csv.reader(state_file)

headers = None
final_state_initial = []
for row in state_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in state_initials:
                headers.append(i)
    else:
        final_state_initial.append((row[0]))
print final_state_initial

headers = None
final_state_abbre = []
for row in state_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in state_initials:
                headers.append(i)
    else:
        final_state_abbre.append((row[1]))
print final_state_abbre

final_state_initial
final_state_abbre
state_dictionary = dict(zip(final_state_initial, final_state_abbre))
print state_dictionary
You almost got it (the approach, that is): building a dict out of the abbreviations is the easiest way to do this:
with open("state_abbreviations.csv", "r") as f:
# you can use csv.DictReader() instead but lets strive for performance
reader = csv.reader(f)
next(reader) # skip the header
# assuming the first column holds the abbreviation, second the full state name
state_map = {state[0]: state[1] for state in reader}
Now you have state_map containing a map of all your state abbreviations; for example, state_map["FL"] contains "Florida".
To replace the values in your test.csv, though, you'll either have to load the whole file into memory, parse it, do the replacement, and save it, or create a temporary file, stream-write the changes to it, and then overwrite the original file with the temporary file. Assuming that test.csv is not too big to fit into your memory, the first approach is much simpler:
with open("test.csv", "r+U") as f: # open the file in read-write mode
# again, you can use csv.DictReader() for convenience, but this is significantly faster
reader = csv.reader(f)
header = next(reader) # get the header
rows = [] # hold our rows
if "state" in header: # proceed only if `state` column is found in the header
state_index = header.index("state") # find the state column index
for row in reader: # read the CSV row by row
current_state = row[state_index] # get the abbreviated state value
# replace the abbreviation if it exists in our state_map
row[state_index] = state_map.get(current_state, current_state)
rows.append(row) # append the processed row to our `rows` list
# now lets overwrite the file with updated data
f.seek(0) # seek to the file begining
f.truncate() # truncate the rest of the content
writer = csv.writer(f) # create a CSV writer
writer.writerow(header) # write back the header
writer.writerows(rows) # write our modified rows
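The second, streaming approach would look roughly like this (a sketch reusing state_map from above; the temporary file name is an assumption):

import csv
import os

with open("test.csv", newline="") as infile, \
        open("test.csv.tmp", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader)
    writer.writerow(header)
    state_index = header.index("state")
    for row in reader:
        # only one row is held in memory at a time
        row[state_index] = state_map.get(row[state_index], row[state_index])
        writer.writerow(row)

os.replace("test.csv.tmp", "test.csv")  # swap the temporary file in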
It seems like you are trying to go through the file twice? This is not necessary: the first time you go through it you are already reading all the lines, so you can create your dictionary items directly.
In addition, a comprehension can be very useful when creating lists or dictionaries. In this case it might be a bit less readable, though. The alternative would be to create an empty dictionary, start a "real" for-loop, and add all the key:value pairs manually (i.e. with state_dict[row[abbr]] = row[name]).
Finally, I used the with statement when opening the file to ensure it is safely closed when we're done with it. This is good practice when opening files.
import csv

with open("state_abbreviations.csv") as state_file:
    state_reader = csv.DictReader(state_file)
    state_dict = {row['state_abbr']: row['state_name'] for row in state_reader}

print(state_dict)
Edit: note that, like the code you showed, this only creates the dictionary that maps abbreviations to state names. Actually replacing them in the second file would be the next step.
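That next step could look roughly like this (a sketch, assuming test.csv has a header row with a state column holding the abbreviations):

import csv

with open("test.csv", newline="") as infile, \
        open("test_output.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # swap the abbreviation for the full name; leave unknown values alone
        row['state'] = state_dict.get(row['state'], row['state'])
        writer.writerow(row)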
Step 1: have Python remember the full state name for each abbreviation; we use a dictionary for that:
import csv

with open('state_abbreviations.csv', 'r') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header
    abbrevs = {r[0]: r[1] for r in csvreader}  # map abbreviation -> full name
Step 2: replace the abbreviations with the full names and write to an output file; I used "test_output.csv":
with open('test.csv', 'r') as reading:
    csvreader = csv.reader(reading)
    next(csvreader)  # skip the header
    header = ['name', 'gender', 'birthdate', 'address', 'city', 'state']
    with open('test_output.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for a in csvreader:
            writer.writerow([a[0], a[1], a[2], a[3], a[4], abbrevs[a[5]]])
I currently have two CSV files full of objects. The objects in one of the CSV files contain an object ID and various other info fields. The other contains the object IDs that reference the first file, along with other info about the objects.
I'm trying to output a third CSV file that contains all of the information for each object. Looping through these naively is too slow, as there are ~3 million objects in one of the files. Does a Python package or other solution exist that makes this process more efficient?
This only requires that data from the smaller csv file be kept in memory.
import csv

extra_data = {}
with open('smaller.csv', newline='') as fin1:
    reader = csv.reader(fin1)
    for row in reader:
        objid = row[0]  # or whichever field has the object id
        extra_data[objid] = row[1:]

with open('bigger.csv', newline='') as fin2, open('combined.csv', 'w', newline='') as fout:
    reader = csv.reader(fin2)
    writer = csv.writer(fout)
    for row in reader:
        objid = row[0]  # or whichever field has the object id
        new_row = row + extra_data.get(objid, [])
        writer.writerow(new_row)
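One detail worth noting: rows of bigger.csv with no match in smaller.csv get no extra columns, so the output can be ragged. If a uniform column count matters, pad the misses (a sketch; n_extra is a hypothetical count of the fields after the id in smaller.csv):

n_extra = 3  # hypothetical: smaller.csv has 3 columns after the object id
empty = [''] * n_extra
new_row = row + extra_data.get(objid, empty)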
I'm trying to write a script that will open a CSV file and write rows from that file to a new CSV file based on the match criteria of a unique telephone number in column 4 of csv.csv. The phone numbers are always in column 4, and are often duplicated in the file, however the other columns are often unique, thus each row is inherently unique.
A row from the csv file I'm reading looks like this: (the TN is 9259991234)
2,PPS,2015-09-17T15:44,9259991234,9DF51758-A2BD-4F65-AAA2
I hit an error with the code below saying that '_csv.writer' is not iterable and I'm not sure how to modify my code to solve the problem.
import csv
import sys
import os

os.chdir(r'C:\pTest')

with open(r'csv.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    with open(r'new_csv.csv', 'ab') as new_f:
        writer = csv.writer(new_f, delimiter=',')
        for row in reader:
            if row[3] not in writer:
                writer.writerow(new_f)
Your error stems from this expression:
row[3] not in writer
You cannot test for membership against a csv.writer() object. If you wanted to track if you already have processed a phone number, use a separate set() object to track those:
with open(r'csv.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    with open(r'new_csv.csv', 'ab') as new_f:
        writer = csv.writer(new_f, delimiter=',')
        seen = set()
        for row in reader:
            if row[3] not in seen:
                seen.add(row[3])
                writer.writerow(row)
Note that I also changed your writer.writerow() call; you want to write the row, not the file object.
How can I add rows to a csvfile by editing in place? I want to avoid the pattern of writing to a temp file and then replacing the original, (pseudocode):
add_records_to_csv(newdata, infile, tmpfile)
delete(infile)
rename(tmpfile, infile)
Here's the actual function. The lines "# <--" are what I want to get rid of and/or condense into something more straightforward:
def add_records_to_csv(dic, csvfile):
    """ Append a dictionary to a CSV file.
    Adapted from http://pymotw.com/2/csv/
    """
    f_old = open(csvfile, 'rb')                        # <--
    csv_old = csv.DictReader(f_old)                    # <--
    fpath, fname = os.path.split(csvfile)              # <--
    csvfile_new = os.path.join(fpath, 'new_' + fname)  # <--
    print(csvfile_new)                                 # <--
    f = open(csvfile_new, 'wb')                        # <--
    try:
        fieldnames = sorted(set(dic.keys() + csv_old.fieldnames))
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        headers = dict((n, n) for n in fieldnames)
        writer.writerow(headers)
        for row in csv_old:
            writer.writerow(row)
        writer.writerow(dic)
    finally:
        f_old.close()
        f.close()
    return csvfile_new
This is not going to be possible in general. Here is the reason, from your code:
fieldnames = sorted(set(dic.keys() + csv_old.fieldnames))
To me, this says that at least in some cases your new row contains columns that were not in the previous rows. When you add a row like this, you will have to update the header of the file (the first line), in addition to appending new rows at the end. If you need to have the column names in alphabetized order, then you may have to rearrange the fields in all the other rows in order to retain the ordering of the columns.
Because you may need to edit the first line of the file, in addition to appending new lines at the end and possibly editing all the lines in-between, there isn't a reasonable way to make this work in-place.
My suggestion is to try and figure out, ahead of time, all the fields/columns that you may need to include so that you guarantee your program will never have to edit the header and can simply add new rows.
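Concretely, that could look like this (a sketch; ALL_FIELDNAMES and the helper names are illustrative, not part of the original code):

import csv

ALL_FIELDNAMES = ['id', 'name', 'email']  # hypothetical: every column the program may ever write

def create_csv(csvfile):
    # write the full header once, up front
    with open(csvfile, 'w', newline='') as f:
        csv.DictWriter(f, fieldnames=ALL_FIELDNAMES).writeheader()

def append_record(dic, csvfile):
    # the header never changes, so appending in place is safe
    with open(csvfile, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=ALL_FIELDNAMES, restval='')
        writer.writerow(dic)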
If your new row has the same structure as the existing records, the following will work:
import csv

def append_record_to_csv(dic, csvfile):
    with open(csvfile, 'rb') as f:
        # discover order of field names in header row
        fieldnames = next(csv.reader(f))
    with open(csvfile, 'ab') as f:
        # assumes that dic contains only fieldnames in csv file
        dwriter = csv.DictWriter(f, fieldnames=fieldnames)
        dwriter.writerow(dic)
On the other hand, if your new row has a different structure than the existing rows, a csv file is probably the wrong file format. In order to add a new column to a csv file, every row needs to be edited. The performance of this approach is very bad and would be quite noticeable with a large csv file.