Python: Removing dupes from large text file

Python: Removing dupes from large text file - python

I need my code to remove duplicate lines from a file, at the moment it is just reproducing the same file as output. Can anyone see how to fix this? The for loop is not running as I would have liked.
#!usr/bin/python
import os
import sys
#Reading Input file
f = open(sys.argv[1]).readlines()
#printing no of lines in the input file
print "Total lines in the input file",len(f)
#temporary dictionary to store the unique records/rows
temp = {}
#counter to count unique items
count = 0
for i in range(0,9057,1):
if i not in temp: #if row is not there in dictionary i.e it is unique so store it into a dictionary
temp[f[i]] = 1;
count += 1
else: #if exact row is there then print duplicate record and dont store that
print "Duplicate Records",f[i]
continue;
#once all the records are read print how many unique records are there
#u can print all unique records by printing temp
print "Unique records",count,len(temp)
#f = open("C://Python27//Vendor Heat Map Test 31072015.csv", 'w')
#print f
#f.close()
nf = open("C://Python34//Unique_Data.csv", "w")
for data in temp.keys():
nf.write(data)
nf.close()
# Written by Gary O'Neill
# Date 03-08-15

This is a much better way to do what you want:
infile_path = 'infile.csv'
outfile_path = 'outfile.csv'
written_lines = set()
with open(infile_path, 'r') as infile, open(outfile_path, 'w') as outfile:
for line in infile:
if line not in written_lines:
outfile.write(line)
written_lines.add(line)
else:
print "Duplicate record: {}".format(line)
print "{} unique records".format(len(written_lines))
This will read one line at a time, so it works even on large files that don't fit into memory. While it's true that if they're mostly unique lines, written_lines will end up being large anyway, it's better than having two copies of almost every line in memory.

You should test the existence of f[i] in temp not i. Change the line:
if i not in temp:
with
if f[i] not in temp:

Related

How to skip over a certain index position for txt file

I have an assignment that requires me to analyze data for presidential job creation without using dictionaries. I have to open a text file and average data that applies to the democratic and republican presidents. I am having trouble understanding how to skip over certain lines (in my case I don't want to include the first line and index position 0, the years and months). This is what I have so far and a bit of the input file:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1979,14090,14135,14152,14191,14221,14239,14288,14328,14422,14484,14532,14559
1980,14624,14747,14754,14795,14827,14784,14861,14870,14824,14900,14903,14946
1981,14969,14981,14987,14985,14971,14963,14993,15007,14971,15028,15073,15075
1982,15056,15056,15050,15075,15132,15207,15299,15328,15403,15463,15515,15538
def g_avg():
infile = open("government_employment_Windows.txt", 'r')
lines = []
for line in infile:
print(line)
lines.append(line)
infile.close()
print(lines)
mean = 0
for line in lines:
number = float(line)
mean = mean + number
mean = mean / len(lines)
print(mean)

Its a very very pythonic way to calculate this
with open('filename') as f:
lines = f.readlines() #Read the lines of the txt
sum = 0
n = 0
for line in lines[1:]: #Use the [1:] to skip the first row with the months
row = line.split(',') #Use the split to convert the line in a list separated by comma
for element in row[1:]: #Use the [1:] to skip the years
sum += float(element)
n += 1
mean = sum/ n

This also looks like a csv file, in which case you can use the built in csv module
import csv
total = 0
count = 0
with open("government_employment_Windows.txt", 'r') as f:
reader = csv.reader(f)
next(reader) #skips the headers
for line in reader:
for item in line[1:]:
count += 1
total += float(item)
print(total)
print(count)
print('average: ', total/count)

Use a slice to skip over the first line, i.e file.readlines[1:]
# foo.txt
line, that, you, want, to, skip
important, stuff
other, important, stuff
with open('foo.txt') as file:
for line in file.readlines()[1:]:
print(line)
important, stuff
other, important, stuff
Note that since I have used with to open a file, Python will close it for me automatically. If you have just done file = open(..) you will have to also do file.close()

How to convert this text file to csv?

I try analyze text file with data - columns, and records.
My file:
Name Surname Age Sex Grade
Chris M. 14 M 4
Adam A. 17 M
Jack O. M 8
The text file has some empty data. As above.
User want to show Name and Grade:
import csv
with open('launchlog.txt', 'r') as in_file:
stripped = (line.strip() for line in in_file)
lines = (line.split() for line in stripped if line)
with open('log.txt', 'w') as out_file:
writer = csv.writer(out_file)
writer.writerow(('Name', 'Surname', 'Age', 'Sex', 'Grade'))
writer.writerows(lines)
log.txt :
Chris,M.,14,M,4
Adam,A.,17,M
Jack,O.,M,8
How to empty data insert a "None" string?
For example:
Chris,M.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8
What would be the best way to do this in Python?

Use pandas:
import pandas
data=pandas.read_fwf("file.txt")
To get your dictionnary:
data.set_index("Name")["Grade"].to_dict()

Here's something in Pure Python™ that seems to do what you want, at least on the sample data file in your question.
In a nutshell what it does is first determine where each of the field names in column header line start and end, and then for each of the remaining lines of the file, does the same thing getting a second list which is used to determine what column each data item in the line is underneath (which it then puts in its proper position in the row that will be written to the output file).
import csv
def find_words(line):
""" Return a list of (start, stop) tuples with the indices of the
first and last characters of each "word" in the given string.
Any sequence of consecutive non-space characters is considered
as comprising a word.
"""
line_len = len(line)
indices = []
i = 0
while i < line_len:
start, count = i, 0
while line[i] != ' ':
count += 1
i += 1
if i >= line_len:
break
indices.append((start, start+count-1))
while i < line_len and line[i] == ' ': # advance to start of next word
i += 1
return indices
# convert text file with missing fields to csv
with open('name_grades.txt', 'rt') as in_file, open('log.csv', 'wt', newline='') as out_file:
writer = csv.writer(out_file)
header = next(in_file) # read first line
fields = header.split()
writer.writerow(fields)
# determine the indices of where each field starts and stops based on header line
field_positions = find_words(header)
for line in in_file:
line = line.rstrip('\r\n') # remove trailing newline
row = ['None' for _ in range(len(fields))]
value_positions = find_words(line)
for (vstart, vstop) in value_positions:
# determine what field the value is underneath
for i, (hstart, hstop) in enumerate(field_positions):
if vstart <= hstop and hstart <= vstop: # overlap?
row[i] = line[vstart:vstop+1]
break # stop looking
writer.writerow(row)
Here's the contents of the log.csv file it created:
Name,Surname,Age,Sex,Grade
Chris,C.,14,M,4
Adam,A.,17,M,None
Jack,O.,None,M,8

I would use baloo's answer over mine -- but if you just want to get a feel for where your code went wrong, the solution below mostly works (there is a formatting issue with the Grade field, but I'm sure you can get through that.) Add some print statements to your code and to mine and you should be able to pick up the differences.
import csv
<Old Code removed in favor of new code below>
EDIT: I see your difficulty now. Please try the below code; I'm out of time today so you will have to fill in the writer parts where the print statement is, but this will fulfill your request to replace empty fields with None.
import csv
with open('Test.txt', 'r') as in_file:
with open('log.csv', 'w') as out_file:
writer = csv.writer(out_file)
lines = [line for line in in_file]
name_and_grade = dict()
for line in lines[1:]:
parts = line[0:10], line[11:19], line[20:24], line[25:31], line[32:]
new_line = list()
for part in parts:
val = part.replace('/n','')
val = val.strip()
val = val if val != '' else 'None'
new_line.append(val)
print(new_line)

Without using pandas:
Edited based on your comment, I hard coded this solution based on your data. This will not work for the rows doesn't have Surname column.
I'm writing out Name and Grade since you only need those two columns.
o = open("out.txt", 'w')
with open("inFIle.txt") as f:
for lines in f:
lines = lines.strip("\n").split(",")
try:
grade = int(lines[-1])
if (lines[-2][-1]) != '.':
o.write(lines[0]+","+ str(grade)+"\n")
except ValueError:
print(lines)
o.close()

Python: delete element of one txt file from another txt file

I hope you are well.
I have two txt files: data.txt and to_remove.txt
data.txt has many lines, and each line has several integers with spaces in between. One line in data.txt looks like this: 1001 1229 19910
to_remove.txt has many line, each line has one integer. One line in to_remove.txt looks like this: 1229
I would like to write a new txt file which has data.txt without the integers in to_remove.txt
I know the first element of each line of data.txt does not have any of the elements of to_remove.txt; so I need to check all non-first elements of each line with each integer in to_remove.txt
I wrote to code to do this, but my code is far too slow. data.txt has more than a million lines, and to_remove.txt has few hundred thousand lines
Would be useful if you can suggest a faster way to do this.
Here is my code:
with open('new.txt', 'w') as new:
with open('data.txt') as data:
for line in data:
connections = []
currentline = line.split(" ")
for i in xrange(len(currentline)-2):
n = int(currentline[i+1])
connections.append(n)
with open('to_remove.txt') as to_remove:
for ID in to_remove:
ID = int(ID)
if ID in connections:
connections.remove(ID)
d = '%d '
connections.insert(0,int(currentline[0]))
for j in xrange(len(connections)-1):
d = d + '%d '
new.write((d % tuple(connections) + '\n'))

Your code is a bit messy, so I've re-written rather than edited. The main way to improve your speed is store the numbers to remove in a set(), which allows for efficient O(l) membership testing:
with open('data.txt') as data, open('to_remove.txt') as to_remove, open('new.txt', 'w') as new:
nums_to_remove = {item.strip() for item in to_remove} # create a set of strings to check for removing
for line in data:
numbers = line.rstrip().split() # create numbers list (note: these are stored as strings)
if not any(num in nums_to_remove for num in numbers[1:]): # check for the presence of numbers to remove
new.write(line) # write to the new file

I developed a code to answer my question using the code in the some of the answers, and the suggestion in the comment to the question.
def return_nums_remove():
with open('to_remove.txt') as to_remove:
nums_to_remove = {item.strip() for item in to_remove}
return nums_to_remove
with open('data.txt') as data, open('new.txt', 'w') as new:
nums_to_remove = return_nums_remove()
for line in data:
numbers = line.rstrip().split()
for n in numbers:
if n in nums_to_remove:
numbers.remove(n)
if len(numbers) > 1:
s = '%s '
for j in xrange(len(numbers)-1):
s = s + '%s '
new.write((s % tuple(numbers) + '\n'))

Swapping IDs & Python Performance

I'm hoping I can get help making my code run more efficiently. The purpose of my code is to take out the first ID (RUID) and replace it with a de-identified ID (RESPID) based on a key file of ids. The input data file is a large tab-delimited text file at about 2.5GB. The data is very wide, each row has thousands of columns. I have a function that works, but on the actual data it is incredibly slow. My first file has been running for 4 days and is only at 1.4GB. I don't know which part of my code is the most problematic, but I suspect it is where I build the row back together and write each row individually. Any advice on how to improve this would be greatly appreciated, 4 days is way too long for processing! Thank you!
def swap():
#input files
infile1 = open(r"Z:\ped_test.txt", 'rb')
keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
#output file
outfile=open(r"Z:\ped_testRESPID.txt", 'wb')
# create dictionary of RUID-RESPID
COLUMN = 1 #Column containing RUID
RESPID={}
for k in keyfile:
kList = k.rstrip('\r\n').split('\t')
if kList[0] not in RESPID and kList[0] != "":
RESPID[kList[0]]=kList[1]
#print RESPID
print "creating RESPID-RUID xwalk dictionary is done"
print "Start creating new file"
print str(datetime.datetime.now())
count=0
for line in infile1:
#if not re.match('#', line): #if there is a header
sline = line.split()
#slen = len(sline)
RUID = sline[COLUMN]
#print RUID
C0 = sline[0]
#print C0
DAT=sline[2:]
for key in RESPID:
if key==RUID:
NewID=RESPID[key]
row=str(C0+'\t'+NewID)
for a in DAT:
row=row+'\t'+a
#print row
outfile.write(row)
outfile.write('\n')
infile1.close()
keyfile.close()
outfile.close()
print "All Done: RESPID replacement is complete"
print str(datetime.datetime.now())

You have several places you can speed things up. Primarily, its a problem with enumerating all of the keys in RESPID when you can just use the 'get' function to read the value. But since you have very wide lines, there are a couple of other tweeks that will make a difference.
def swap():
#input files
infile1 = open(r"Z:\ped_test.txt", 'rb')
keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
#output file
outfile=open(r"Z:\ped_testRESPID.txt", 'wb')
# create dictionary of RUID-RESPID
COLUMN = 1 #Column containing RUID
RESPID={}
for k in keyfile:
kList = k.split('\t', 2) # minor: jut grab what you need
if kList[0] and kList[0] not in RESPID: # minor: do the cheap test first
RESPID[kList[0]]=kList[1]
#print RESPID
print "creating RESPID-RUID xwalk dictionary is done"
print "Start creating new file"
print str(datetime.datetime.now())
count=0
for line in infile1:
#if not re.match('#', line): #if there is a header
sline = line.split('\t', 2) # minor: just grab what you need
#slen = len(sline)
RUID = sline[COLUMN]
#print RUID
C0 = sline[0]
#print C0
DAT=sline[2:]
# the biggie, just use a lookup
#for key in RESPID:
# if key==RUID:
# NewID=RESPID[key]
rows = '\t'.join([sline[0], RESPID.get(RUID, sline[1]), sline[2]])
#row=str(C0+'\t'+NewID)
#for a in DAT:
# row=row+'\t'+a
#print row
outfile.write(row)
outfile.write('\n')
infile1.close()
keyfile.close()
outfile.close()
print "All Done: RESPID replacement is complete"
print str(datetime.datetime.now())

You do not need to iterate over RESPID.
Replace:
for key in RESPID:
if key==RUID:
NewID=RESPID[key]
with
NewId = RESPID[RUID]
It does the same thing because key is always RUID.
I am pretty sure that that will decrease the running time of the program dramatically because RESPID is huge and you are checking every key as many times as there are lines in "ped_test.txt".

Extracting oddly arranged data from csv and converting to another csv file using python

I have a odd csv file thas has data with header value and its corresponding data in a manner as below:
,,,Completed Milling Job,,,,,, # row 1
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
I need to extract the data from this strucutre create another csv file as per the structure below:
Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row
If you notice, there is a new header column added called 'status" but the value is in the first row of the csv file. rest of the column names in output file are extracted from the original file.
Any thoughts will be greatly appreciated - thanks

Assuming the files are all exactly like that (at least in terms of caps) this should work, though I can only guarantee it on the exact data you have supplied:
#!/usr/bin/python
import glob
from sys import argv
g=open(argv[2],'w')
g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
for fname in glob.glob(argv[1]):
with open(fname) as f:
status=f.readline().strip().strip(',')
f.readline()#extended report not needed
f.readline()#job spec numerical control not needed
s=f.readline()
job_no=s.split('Job Number,')[1].split(',')[0]
op_id=s.split('Operator Id,')[1].strip().strip(',')
s=f.readline()
machine_name=s.split('Coder Machine Name,')[1].split(',')[0]
start_t=s.split('Job Start time,')[1].strip().strip(',')
s=f.readline()
machine_type=s.split('Machine type,')[1].split(',')[0]
end_t=s.split('Job end time,')[1].strip().strip(',')
g.write(",".join([status,job_no,machine_name,machine_type,op_id,start_t,end_t])+"\n")
g.close()
It takes a glob argument (like Job*.data) and an output filename and should construct what you need. Just save it as 'so.py' or something and run it as python so.py <data_files_wildcarded> output.csv

Here is a solution that should work on any CSV files that follow the same pattern as what you showed. That is a seriously nasty format.
I got interested in the problem and worked on it during my lunch break. Here's the code:
COMMA = ','
NEWLINE = '\n'
def _kvpairs_from_line(line):
line = line.strip()
values = [item.strip() for item in line.split(COMMA)]
i = 0
while i < len(values):
if not values[i]:
i += 1 # advance past empty value
else:
# yield pair of values
yield (values[i], values[i+1])
i += 2 # advance past pair
def kvpairs_by_column_then_row(lines):
"""
Given a series of lines, where each line is comma-separated values
organized as key/value pairs like so:
key_1,value_1,key_n+1,value_n+1,...
key_2,value_2,key_n+2,value_n+2,...
...
key_n,value_n,key_n+n,value_n+n,...
Yield up key/value pairs taken from the first column, then from the second column
and so on.
"""
pairs = [_kvpairs_from_line(line) for line in lines]
done = [False for _ in pairs]
while not all(done):
for i in range(len(pairs)):
if not done[i]:
try:
key_value_tuple = next(pairs[i])
yield key_value_tuple
except StopIteration:
done[i] = True
STATUS = "Status"
columns = [STATUS]
d = {}
with open("data.csv", "rt") as f:
# get an iterator that lets us pull lines conveniently from file
itr = iter(f)
# pull first line and collect status
line = next(itr)
lst = line.split(COMMA)
d[STATUS] = lst[3]
# pull next lines and make sure the file is what we expected
line = next(itr)
assert "Extended Report" in line
line = next(itr)
assert "Job Spec numerical control" in line
# pull all remaining lines and save in a list
lines = [line.strip() for line in f]
for key, value in kvpairs_by_column_then_row(lines):
columns.append(key)
d[key] = value
with open("output.csv", "wt") as f:
# write column headers line
line = COMMA.join(columns)
f.write(line + NEWLINE)
# write data row
line = COMMA.join(d[key] for key in columns)
f.write(line + NEWLINE)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Removing dupes from large text file - python

You should test the existence of f[i] in temp not i. Change the line: if i not in temp: with if f[i] not in temp:

Related

How to skip over a certain index position for txt file

How to convert this text file to csv?

Python: delete element of one txt file from another txt file

Swapping IDs & Python Performance

Extracting oddly arranged data from csv and converting to another csv file using python

Categories

Resources