I'm hoping I can get help making my code run more efficiently. The purpose of my code is to take the first ID (RUID) and replace it with a de-identified ID (RESPID), based on a key file of IDs. The input data file is a large tab-delimited text file, about 2.5GB. The data is very wide: each row has thousands of columns. I have a function that works, but on the actual data it is incredibly slow. My first file has been running for 4 days and is only at 1.4GB. I don't know which part of my code is the most problematic, but I suspect it is where I build the row back together and write each row individually. Any advice on how to improve this would be greatly appreciated; 4 days is way too long for processing! Thank you!
import datetime

def swap():
    #input files
    infile1 = open(r"Z:\ped_test.txt", 'rb')
    keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
    #output file
    outfile = open(r"Z:\ped_testRESPID.txt", 'wb')
    # create dictionary of RUID-RESPID
    COLUMN = 1  #Column containing RUID
    RESPID = {}
    for k in keyfile:
        kList = k.rstrip('\r\n').split('\t')
        if kList[0] not in RESPID and kList[0] != "":
            RESPID[kList[0]] = kList[1]
    #print RESPID
    print "creating RESPID-RUID xwalk dictionary is done"
    print "Start creating new file"
    print str(datetime.datetime.now())
    count = 0
    for line in infile1:
        #if not re.match('#', line):  #if there is a header
        sline = line.split()
        #slen = len(sline)
        RUID = sline[COLUMN]
        #print RUID
        C0 = sline[0]
        #print C0
        DAT = sline[2:]
        for key in RESPID:
            if key == RUID:
                NewID = RESPID[key]
        row = str(C0 + '\t' + NewID)
        for a in DAT:
            row = row + '\t' + a
        #print row
        outfile.write(row)
        outfile.write('\n')
    infile1.close()
    keyfile.close()
    outfile.close()
    print "All Done: RESPID replacement is complete"
    print str(datetime.datetime.now())
You have several places you can speed things up. The big one is that you enumerate all of the keys in RESPID for every line, when you can just use the dictionary's get method to look the value up directly. But since your lines are very wide, there are a couple of other tweaks that will also make a difference.
import datetime

def swap():
    #input files
    infile1 = open(r"Z:\ped_test.txt", 'rb')
    keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
    #output file
    outfile = open(r"Z:\ped_testRESPID.txt", 'wb')
    # create dictionary of RUID-RESPID
    COLUMN = 1  #Column containing RUID
    RESPID = {}
    for k in keyfile:
        kList = k.rstrip('\r\n').split('\t', 2)  # minor: just grab what you need
        if kList[0] and kList[0] not in RESPID:  # minor: do the cheap test first
            RESPID[kList[0]] = kList[1]
    #print RESPID
    print "creating RESPID-RUID xwalk dictionary is done"
    print "Start creating new file"
    print str(datetime.datetime.now())
    count = 0
    for line in infile1:
        #if not re.match('#', line):  #if there is a header
        sline = line.split('\t', 2)  # minor: only split out the fields you need
        RUID = sline[COLUMN]
        # the biggie: just use a dictionary lookup instead of
        #for key in RESPID:
        #    if key == RUID:
        #        NewID = RESPID[key]
        row = '\t'.join([sline[0], RESPID.get(RUID, sline[1]), sline[2]])
        outfile.write(row)  # sline[2] is the untouched tail of the line and still ends with '\n'
    infile1.close()
    keyfile.close()
    outfile.close()
    print "All Done: RESPID replacement is complete"
    print str(datetime.datetime.now())
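One more tweak that may be worth trying, since the output is written one very long row at a time (which the question suspects is part of the problem): Python 2's built-in open takes an optional buffer size as its third argument, so a larger write buffer batches many small writes into fewer disk operations. A minimal sketch; the 1 MB figure is just an illustrative guess, not something from the original post:

# Hedged sketch: open the output with a roughly 1 MB buffer so thousands of
# individual write() calls turn into far fewer underlying disk writes.
outfile = open(r"Z:\ped_testRESPID.txt", 'wb', 1024 * 1024)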
You do not need to iterate over RESPID.
Replace:
    for key in RESPID:
        if key == RUID:
            NewID = RESPID[key]
with
    NewID = RESPID[RUID]
It does the same thing, because the only key that can ever match is RUID itself.
I am pretty sure this will decrease the running time of the program dramatically, because RESPID is huge and you are currently checking every one of its keys once for every line in "ped_test.txt".
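One caveat (my addition, not part of the answer above): RESPID[RUID] raises a KeyError if an RUID is missing from the key file, so if that can happen you may prefer the get form with a fallback:

# Hedged sketch: keep the original RUID when it has no mapping,
# instead of raising a KeyError partway through the run.
NewID = RESPID.get(RUID, RUID)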
The title isn't big enough for me to explain this so here it goes:
I have a csv file looking something like this:
Example csv contents:
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
long string with some special characters , number, string, number
I want to go through the first column, and if the length of the string is greater than 20, do this:
LINE 20: long string with som | e special characters  (the string split after 20 characters)
That is: split the string, keep the first part of the string in the first csv, then create a new csv and put the other part on the same line number, leaving the rest of that row just whitespace.
What I have for now is this:
(The code below doesn't do anything right now; it's just what I made to try to explain to myself how I could write the new file from splitString.)
fileName = file name
maxCollumnLength = number of rows in the whole set
lineNum = line number of a string that is greater than 20
splitString = second part of the split string that should be written to another file

import csv

def newopenfile(fileName, maxCollumnLength, lineNum, splitString):
    with open(fileName, 'w', encoding="utf8", newline='') as nf:
        writer = csv.writer(nf, quoting=csv.QUOTE_NONE)
        for i in range(0, maxCollumnLength - 1):
            # write whitespace until reaching lineNum of a string that's longer
            # than 20, then write that part of the string to the csv
            pass
This goes through the first column and checks the length:

fileName = 'uskrs.csv'
firstColList = []  # an empty list to store the first column
splitString = []
i = 0
with open(fileName, 'r', encoding="utf8") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        if len(row[0]) > 20:
            i += 1
            # split the row and pass the other end of it to
            # newopenfile(fileName, len(reader), i, splitString)
            #print(row[0])  # for debugging
            firstColList.append(row[0])
From this point I am stuck on how to actually change the string in the csv and how to split it.
THE STRING COULD ALSO HAVE 60+ chars, so it would need splitting more than 2 times and storing the pieces in more than 2 csvs.
I suck at explaining the problem, so if you have any questions please do ask.
Okay, so I was successful in dividing the first column when its length is greater than 20, and replacing the first column with the first 20 chars:
import csv

def checkLength(column, readFile, writeFile, maxLen):
    counter = 0
    i = 0
    idxSplitItems = []
    final = []
    newSplits = 0
    with open(readFile, 'r', encoding="utf8", newline='') as f:
        reader = csv.reader(f)
        your_list = list(reader)
    final = your_list
    for sublist in your_list:
        #del sublist[-1]  # remove last invisible element
        i += 1
        data = removeUnwanted(sublist[column])
        print(data)
        if len(data) > maxLen:
            counter += 1  # number of large entries
            idxSplitItems.append(split_bylen(i, data, maxLen))
            if len(idxSplitItems) > newSplits:
                newSplits = len(idxSplitItems)
            final[i-1][column] = split_bylen(i, data, maxLen)[1]
            final[i-1][column] = removeUnwanted(final[i-1][column])
            print("After split data: " + data)
            print("After split final: " + final[i-1][column])
    writeSplitToCSV(writeFile, final)
    checkCols(final, 6)
    return final, idxSplitItems

def removeUnwanted(data):
    data = data.replace(',', ' ')
    return data

def split_bylen(index, item, maxLen):
    clean = removeUnwanted(item)
    splitList = [clean[ind:ind+maxLen] for ind in range(0, len(item), maxLen)]
    splitList.insert(0, index)
    return splitList

def writeSplitToCSV(writeFile, data):
    with open(writeFile, 'w', encoding="utf8", newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)

def checkCols(data, columns):
    for sublist in data:
        if len(sublist) - 1 != columns:
            print("[X] This row doesn't have the same number of columns as the others: " + str(sublist))
        else:
            print("All okay")

#len(data)  # how many split items
#print(your_list[0][0])
#print("Number of large: ", counter)
final, idxSplitItems = checkLength(0, 'test.csv', 'final.csv', 30)
print("------------------------")
print(idxSplitItems)
print("-------------------------")
print(final)
Now I have a problem with this part of the code, notice this:
print("After split data: "+ data)
print("After split final: "+ final[i-1][column])
This is to check whether removing the comma worked.
With an example of
"BUTKOVIĆ VESNA , DIPL.IUR."
data returns
BUTKOVIĆ VESNA DIPL.IUR.
but final returns
BUTKOVIĆ VESNA , DIPL.IUR.
Why does my final return the "," again when it is gone in data? It must be something in "split_bylen()" that makes it do that.
Dictionaries are fun!
To overwrite the original csv, see here. You would have to use csv.DictReader & csv.DictWriter. I keep your method of reading just for clarity.
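For reference, since the snippet below sticks with a plain reader, here is a rough sketch of the DictReader/DictWriter round trip mentioned above. This is my own illustration: the file name 'original.csv' and the field names are assumptions, not something from the question.

import csv

# Hedged sketch: read the csv into dicts, modify the rows, write them back out.
# 'original.csv' and the FIELDS list are illustrative only.
FIELDS = ['text', 'num1', 'str1', 'num2']
with open('original.csv', newline='', encoding='utf8') as rf:
    rows = list(csv.DictReader(rf, fieldnames=FIELDS))

# ... modify rows here ...

with open('original.csv', 'w', newline='', encoding='utf8') as wf:
    writer = csv.DictWriter(wf, fieldnames=FIELDS)
    writer.writerows(rows)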
writecsvs = {}  # store each line of each new csv
# e.g. {'csv1': [[row0_split1, row0_num, row0_str, row0_num], [row1_split1, row1_num, row1_str, row1_num], ...],
#       'csv2': [[row0_split2, row0_num, row0_str, row0_num], [row1_split2, row1_num, row1_str, row1_num], ...],
#       ...}

with open(fileName, mode='r', encoding="utf-8-sig") as rf:
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        col1 = row[0]
        # check size & split
        # decide number of new csvs
        # overwrite original csv
        # store new content in writecsvs dict

for csv_name in writecsvs:            # loop over each csv in writecsvs
    writelines = writecsvs[csv_name]  # get its list of lines
    out_file = open(csv_name + '.csv', mode='w')  # use the keys in writecsvs as filenames
    for line in writelines:
        out_file.write(line)
    out_file.close()
Hope this helps.
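To make the skeleton above a bit more concrete, here is one way the splitting step could be filled in. This is only a sketch under my own assumptions: chunks of 20 characters, and extra files named part2.csv, part3.csv, and so on (none of these names come from the question).

import csv

CHUNK = 20  # assumed maximum length for the first column

def split_first_column(fileName):
    with open(fileName, encoding="utf8", newline='') as rf:
        rows = list(csv.reader(rf))
    # break every first-column string into pieces of at most CHUNK characters
    chunked = [[r[0][i:i + CHUNK] for i in range(0, len(r[0]), CHUNK)] or [''] for r in rows]
    most = max(len(c) for c in chunked)  # total number of csvs needed
    # rewrite the original csv, keeping only the first chunk of each string
    with open(fileName, 'w', encoding="utf8", newline='') as wf:
        csv.writer(wf).writerows([[c[0]] + r[1:] for c, r in zip(chunked, rows)])
    # one extra csv per additional chunk, keeping every row on its original line number
    for n in range(1, most):
        with open('part%d.csv' % (n + 1), 'w', encoding="utf8", newline='') as wf:
            csv.writer(wf).writerows(
                [[c[n] if n < len(c) else ''] + [''] * (len(r) - 1)
                 for c, r in zip(chunked, rows)])

Rows whose first column already fits in 20 characters simply get blank cells in the extra files, so line numbers stay aligned across all of them.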
I have different text files and I want to extract the values from them into a csv file.
Each file has the following format
main cost: 30
additional cost: 5
I managed to do that, but the problem is that I want the values of each file inserted into a different column. I also want the number of text files to be a user argument.
This is what I'm doing now:
import sys
import csv
import itertools

numFiles = int(sys.argv[1])
d = [[] for x in xrange(numFiles + 1)]
for i in range(numFiles):
    filename = 'mytext' + str(i) + '.text'
    with open(filename, 'r') as in_file:
        for line in in_file:
            items = line.split(' : ')
            num = items[1].split('\n')
            if i == 0:
                d[i].append(items[0])
            d[i+1].append(num[0])
    grouped = itertools.izip(*d[i] * 1)
    if i == 0:
        grouped1 = itertools.izip(*d[i+1] * 1)
with open(outFilename, 'w') as out_file:
    writer = csv.writer(out_file)
    for j in range(numFiles):
        for val in itertools.izip(d[j]):
            writer.writerow(val)
This is what I'm getting now (everything in one column):
main cost
additional cost
30
5
40
10
And I want it to be
main cost | 30 | 40
additional cost | 5 | 10
You could use a dictionary to do this, where the key is the "header" you want to use and the value is a list.
So it would look like someDict = {'main cost': [30,40], 'additional cost': [5,10]}
edit2: Went ahead and cleaned up this answer so it makes a little more sense.
You can build the dictionary and iterate over it like this:
from collections import OrderedDict

in_file = ['main cost : 30', 'additional cost : 5', 'main cost : 40', 'additional cost : 10']

someDict = OrderedDict()
for line in in_file:
    key, val = line.split(' : ')
    num = int(val)
    if key not in someDict:
        someDict[key] = []
    someDict[key].append(num)

for key in someDict:
    print(key)
    for value in someDict[key]:
        print(value)
The code outputs:
main cost
30
40
additional cost
5
10
Should be pretty straightforward to modify the example to fit your desired output.
I used the example from "append multiple values for one key in Python dictionary", and thanks to @wwii for some suggestions.
I used an OrderedDict since a plain dictionary won't keep the keys in order.
You can run my example at https://ideone.com/myN2ge
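As a rough sketch of that last step (my addition, with an assumed output file name 'out.csv'), writing each key followed by its values as a single csv row gives the layout asked for in the question, e.g. "main cost,30,40" and "additional cost,5,10":

import csv

with open('out.csv', 'w') as out_file:  # 'out.csv' is an assumed name
    writer = csv.writer(out_file)
    for key, values in someDict.items():
        writer.writerow([key] + values)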
This is how I might do it. It assumes the fields are the same in all the files. Make a list of the field names, and a dictionary using those names as keys, with the list of values as the entries. Instead of running on file1.text, file2.text, etc., run the script with file*.text as a command-line argument.
#! /usr/bin/env python
import sys

if len(sys.argv) < 2:
    print "Give file names to process, with wildcards"
else:
    FileList = sys.argv[1:]
    FileNum = 0
    outFilename = "myoutput.dat"
    NameList = []
    ValueDict = {}
    for InfileName in FileList:
        Infile = open(InfileName, 'rU')
        for Line in Infile:
            Line = Line.strip('\n')
            Name, Value = Line.split(":")
            Name = Name.strip()  # strip once so the dict keys match NameList
            if FileNum == 0:
                NameList.append(Name)
            ValueDict[Name] = ValueDict.get(Name, []) + [Value.strip()]
        FileNum += 1  # the last statement in the file loop
        Infile.close()
    # print NameList
    # print ValueDict
    with open(outFilename, 'w') as out_file:
        for N in NameList:
            OutString = "{},{}\n".format(N, ",".join(ValueDict.get(N)))
            out_file.write(OutString)
Output for my four fake files was:
main cost,10,10,40,10
additional cost,25.6,25.6,55.6,25.6
I need my code to remove duplicate lines from a file; at the moment it is just reproducing the same file as output. Can anyone see how to fix this? The for loop is not running as I would have liked.
#!/usr/bin/python
import os
import sys

#Reading input file
f = open(sys.argv[1]).readlines()

#printing the number of lines in the input file
print "Total lines in the input file", len(f)

#temporary dictionary to store the unique records/rows
temp = {}

#counter to count unique items
count = 0

for i in range(0, 9057, 1):
    if i not in temp:  #if the row is not in the dictionary, i.e. it is unique, store it
        temp[f[i]] = 1
        count += 1
    else:  #if the exact row is already there, print the duplicate record and don't store it
        print "Duplicate Records", f[i]
        continue

#once all the records are read, print how many unique records there are
#you can print all unique records by printing temp
print "Unique records", count, len(temp)

#f = open("C://Python27//Vendor Heat Map Test 31072015.csv", 'w')
#print f
#f.close()

nf = open("C://Python34//Unique_Data.csv", "w")
for data in temp.keys():
    nf.write(data)
nf.close()

# Written by Gary O'Neill
# Date 03-08-15
This is a much better way to do what you want:
infile_path = 'infile.csv'
outfile_path = 'outfile.csv'

written_lines = set()

with open(infile_path, 'r') as infile, open(outfile_path, 'w') as outfile:
    for line in infile:
        if line not in written_lines:
            outfile.write(line)
            written_lines.add(line)
        else:
            print "Duplicate record: {}".format(line)

print "{} unique records".format(len(written_lines))
This will read one line at a time, so it works even on large files that don't fit into memory. While it's true that if they're mostly unique lines, written_lines will end up being large anyway, it's better than having two copies of almost every line in memory.
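If memory does become a problem, one variation (my own suggestion, not part of the answer above) is to store a fixed-size hash of each line instead of the line itself. The trade-off is a vanishingly small chance that a hash collision drops a genuinely unique line:

import hashlib

seen = set()
with open(infile_path, 'r') as infile, open(outfile_path, 'w') as outfile:
    for line in infile:
        digest = hashlib.md5(line).digest()  # 16 bytes per line, however long the line is
        if digest not in seen:
            outfile.write(line)
            seen.add(digest)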
You should test for the existence of f[i] in temp, not i. Change the line:
    if i not in temp:
to
    if f[i] not in temp:
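For completeness, here is roughly what the loop looks like with that one change applied; iterating over the lines directly instead of the hard-coded range(0, 9057) is my own small addition:

for line in f:
    if line not in temp:  # test the line itself, not its index
        temp[line] = 1
        count += 1
    else:
        print "Duplicate Records", line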
I'm trying to parse two pipe/comma-separated files, and if a particular field matches between the files, create a new entry in a third file.
Code as follows:
#! /usr/bin/python

fo = open("c-1.txt", "r")
for line in fo:
    #print line
    fields = line.split('|')
    src = fields[0]
    f1 = open("Airport.txt", 'r')
    f2 = open("b.txt", "a")
    #with open('c.csv', 'r') as f1:
    #    line1 = f1.read()
    for line1 in f1:
        reader = line1.split(',')
        hi = False
        target = reader[0]
        if target == src and fields[1] == 'ZHT':
            print target
            hi = True
            f2.write(fields[0])
            f2.write("|")
            f2.write(fields[1])
            f2.write("|")
            f2.write(fields[2])
            f2.write("|")
            f2.write(fields[3])
            f2.write("|")
            f2.write(fields[4])
            f2.write("|")
            f2.write(fields[5])
            f2.write("|")
            f2.write(reader[2])
        if hi == False:
            f2.write(line)
    f2.close()
    f1.close()
fo.close()
The matching field gets printed two times in the new file. What could be the reason?
The problem seems to be that you reset hi to False in each iteration of the inner loop. Let's say the second line matches but the third does not: you set hi to True on the second line, set it back to False on the third, and then write the original line again.
Try like this:
hi = False
for line1 in f1:
    reader = line1.split(',')
    target = reader[0]
    if target == src and fields[1] == 'ZHT':
        hi = True
        f2.write(stuff)
if hi == False:
    f2.write(line)
Or, assuming that only one line will ever match, you could use for/else:
for line1 in f1:
    reader = line1.split(',')
    target = reader[0]
    if target == src and fields[1] == 'ZHT':
        f2.write(stuff)
        break
else:
    f2.write(line)
Also note that you could probably replace that whole series of f2.write statements with this single one, joining the parts with '|':
f2.write('|'.join(fields[0:6] + [reader[2]]))
As mentioned already, you reset the flag within the loop, so you are liable to write multiple lines.
If there is definitely only one row that will match, it might be worth breaking out of the loop once that row has been found.
And finally, check your data to make sure there aren't identical matching rows.
Other than that, I have a couple of other suggestions to clean up your code and make it easier to debug:
1) Use the csv library.
2) If the files can be held in memory, it would be better to hold them in memory instead of constantly opening and closing them.
3) Use with to handle the files (I note you have already tried this in your comments).
Something like the following should work.
#! /usr/bin/python
import csv

data_0 = {}
data_1 = {}

with open("c-1.txt", "r") as fo, open("Airport.txt", "r") as f1:
    fo_reader = csv.reader(fo, delimiter="|")
    f1_reader = csv.reader(f1)  # default delimiter is ','
    for line in fo_reader:
        if line[1] == 'ZHT':
            try:  # add to a list here in case keys are duplicated
                data_0[line[0]].append(line)
            except KeyError:
                data_0[line[0]] = [line]
    for line in f1_reader:
        data_1[line[0]] = line[2]  # we only need the third column of this row to append to the data

with open("b.txt", "a") as f2:
    writer = csv.writer(f2, delimiter="|")  # I would be tempted not to make this a pipe, but it's probably too late if you already have a pre-made file
    for key in data_0:
        if key in data_1:
            for row in data_0[key]:
                writer.writerow(row[:6] + [data_1[key]])  # keep the first 6 columns and append the value from the other file
        else:
            for row in data_0[key]:
                writer.writerow(row)
That should avoid the extra rows, as there is no True/False flag to rely on.
I was hoping someone may be able to point me in the right direction, or give an example of how I can put the following script output into an Excel spreadsheet using xlwt. My script prints the text below on screen as required, but I would like to write this output to Excel in two columns of time and value. Here's the printed output:
07:16:33.354 1
07:16:33.359 1
07:16:33.364 1
07:16:33.368 1
My script so far is below.
import re

f = open(r"C:\Results\16.txt", "r")
searchlines = f.readlines()
searchstrings = ['Indicator']
timestampline = None
timestamp = None
f.close()
a = 0
tot = 0

while a < len(searchstrings):
    for i, line in enumerate(searchlines):
        for word in searchstrings:
            if word in line:
                timestampline = searchlines[i-33]
                for l in searchlines[i:i+1]:  #print timestampline, l,
                    #print
                    for i in line:
                        str = timestampline
                        match = re.search(r'\d{2}:\d{2}:\d{2}.\d{3}', str)
                        if match:
                            value = line.split()
                            print '\t', match.group(), '\t', value[5],
                            print
                        print
                        tot = tot + 1
                        break
    print 'total count for', '"', searchstrings[a], '"', 'is', tot
    tot = 0
    a = a + 1
I have had a few goes using xlwt or the csv writer, but each time I hit a wall and revert back to my above script and try again. I am hoping to print match.group() and value[5] into two different columns on an Excel worksheet.
Thanks for your time...
MikG
What kind of problems do you have with xlwt? Personally, I find it very easy to use once you remember the basic workflow:
1) import xlwt
2) Create your spreadsheet using e.g. my_xls = xlwt.Workbook(encoding=your_char_encoding), which returns a spreadsheet handle you use for adding sheets and saving the whole file.
3) Add a sheet to the created spreadsheet with e.g. my_sheet = my_xls.add_sheet("sheet name")
4) Now, having the sheet object, you can write to its cells using my_sheet.write(row, column, value):
my_sheet.write(0, 0, "First column title")
my_sheet.write(0, 1, "Second column title")
5) Save the whole thing using my_xls.save('file_name.xls'), e.g. my_xls.save("results.xls")
It's the simplest of working examples; your code should of course call sheet.write(row, column, value) inside the loop that prints your data, e.g.:
import re
import xlwt

f = open(r"C:\Results\VAMOS_RxQual_Build_Update_Fri_04-11.08-16.txt", "r")
searchlines = f.readlines()
searchstrings = ['TSC Set 2 Indicator']
timestampline = None
timestamp = None
f.close()
a = 0
tot = 0

my_xls = xlwt.Workbook(encoding="utf-8")     # begin your whole mighty xls thing
my_sheet = my_xls.add_sheet("Results")       # add a sheet to it
row_num = 0                                  # let it be the current row number
my_sheet.write(row_num, 0, "match.group()")  # here go the column headers,
my_sheet.write(row_num, 1, "value[5]")       # change them to your needs
row_num += 1                                 # move to the next row

while a < len(searchstrings):
    for i, line in enumerate(searchlines):
        for word in searchstrings:
            if word in line:
                timestampline = searchlines[i-33]
                for l in searchlines[i:i+1]:  #print timestampline, l,
                    #print
                    for i in line:
                        str = timestampline
                        match = re.search(r'\d{2}:\d{2}:\d{2}.\d{3}', str)
                        if match:
                            value = line.split()
                            print '\t', match.group(), '\t', value[5],
                            # here goes the cell writing:
                            my_sheet.write(row_num, 0, match.group())
                            my_sheet.write(row_num, 1, value[5])
                            row_num += 1
                            # and that's it...
                        print
                        print
                        tot = tot + 1
                        break
    print 'total count for', '"', searchstrings[a], '"', 'is', tot
    tot = 0
    a = a + 1

# don't forget to save your file!
my_xls.save("results.xls")
A catch:
writing native date/time data to xls was a nightmare for me, as Excel doesn't store date/time data internally the way I expected (or at least I couldn't figure it out); see the sketch after these notes for one possible workaround,
be careful about the data types you write into cells. For simple reporting it's enough at the beginning to pass everything as a string,
later you should find the xlwt documentation quite useful.
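Regarding that first point, here is a small sketch of one way it can be handled. This is my own illustration, not from the answer above, and the number format string is just an example: xlwt lets you attach a number format to a cell style via easyxf, and a Python datetime written with that style shows up formatted in Excel.

import datetime
import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet("times")
# a style whose number format tells Excel how to display the timestamp
time_style = xlwt.easyxf(num_format_str='HH:MM:SS.000')
ws.write(0, 0, datetime.datetime.now(), time_style)
wb.save("times.xls")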
Happy XLWTing!