I have the function below, which reads a previously constructed file into a defaultdict. The file it reads is a CSV containing file sizes and file paths.
If more than one file matches the size of another, each of those files is run through a hashing function.
The issue I have is that print is giving me the expected output, whereas writing the output to a file is not.
def loadfiles():
    '''Loads files and identifies potential duplicates'''
    files = defaultdict(list)  # uses defaultdict
    with open(tmpfile) as csvfile:  # reads the file into a dictionary
        reader = csv.DictReader(csvfile)
        for row in reader:
            files[row['size']].append(row['file'])
    for key, value in files.items():
        if len([item for item in value if item]) > 1:
            with open(reportname, 'w') as fr:
                writer = csv.writer(fr, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                writer.writerow(['size', 'filename', 'hash'])
                for value in value:
                    writer.writerow([key, value, str(md5Checksum(value))])
                    print(key, value, str(md5Checksum(value)))
The output to a file is this:
size,filename,hash
43842270,/home/bob/scripts/inprogress_python_scripts/file_dup/testingscript/webwolf-8.0.0.M25.jar,b325dc62d33e2ada19aea07cbcfb237f
43842270,/home/bob/scripts/inprogress_python_scripts/file_dup/testingscript/bkwolf.jar,b325dc62d33e2ada19aea07cbcfb237f
Whereas the output to screen from print is this:
128555 /home/bob/scripts/inprogress_python_scripts/file_dup/testingscript/SN0aaa(1).pdf def426a8dee8f226e40df826fcde9904
128555 /home/bob/scripts/inprogress_python_scripts/file_dup/testingscript/SN0aaa(1) (another copy).pdf def426a8dee8f226e40df826fcde9904
128555 /home/bob/scripts/inprogress_python_scripts/file_dup/testingscript/SN0aaa.pdf def426a8dee8f226e40df826fcde9904
128555 /home/bob/scripts/inprogress_python_scripts/file_dup/testingscript/SN0aaa(1) (copy).pdf def426a8dee8f226e40df826fcde9904
43842270 /home/bob/scripts/inprogress_python_scripts/file_dup/testingscript/webwolf-8.0.0.M25.jar b325dc62d33e2ada19aea07cbcfb237f
43842270 /home/b/scripts/inprogress_python_scripts/file_dup/testingscript/bkwolf.jar b325dc62d33e2ada19aea07cbcfb237f
Any ideas / guidance please as to what's wrong?
Using "w" opens the file in write mode, overwriting anything that already exists in the file. Use "a" for append instead.
This will lead to the problem that you will have your header (size,filename,hash) multiple times in there - consider writing this in the very first line and not in a loop.
See, for example: https://www.w3schools.com/python/python_file_write.asp
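For example, here is a minimal sketch of the restructured function (assuming tmpfile, reportname and md5Checksum are defined as in your script): the report is opened once, the header is written once, and only the duplicate rows are written inside the loop.

import csv
from collections import defaultdict

def loadfiles():
    '''Loads files and identifies potential duplicates'''
    files = defaultdict(list)
    with open(tmpfile) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            files[row['size']].append(row['file'])
    with open(reportname, 'w') as fr:  # opened once, so nothing gets overwritten
        writer = csv.writer(fr, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(['size', 'filename', 'hash'])  # header written exactly once
        for key, value in files.items():
            if len([item for item in value if item]) > 1:
                for filename in value:
                    writer.writerow([key, filename, str(md5Checksum(filename))])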
Update: my file.txt.zp is tab delimited and looks kind of like this:
file.txt.zp
I want to split the first col by : _ /
Original post:
I have a very large zipped tab-delimited file.
I want to open it, scan it one row at a time, split some of the columns, and write the result to a new file.
I kept getting various errors (every time I fix one, another pops up).
This is my code:
import csv
import re
import gzip

f = gzip.open('file.txt.gz')
original = f.readlines()
f.close()
original_l = csv.reader(original)

for row in original_l:
    file_l = re.split('_|:|/', row)
    with open('newfile.gz', 'w', newline='') as final:
        finalfile = csv.writer(final, delimiter=' ')
        finalfile.writerow(file_l)
Thanks!
For this code I got the error:
for row in original_l:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
so based on what I found here I added this after f.close():
original = original.decode('utf8')
and then got the error:
original = original.decode('utf8')
AttributeError: 'list' object has no attribute 'decode'
Update 2
This code should produce the output that you're after.
import csv
import gzip
import re

with gzip.open('file.txt.gz', mode='rt') as f, \
        open('newfile.gz', 'w') as final:
    writer = csv.writer(final, delimiter=' ')
    reader = csv.reader(f, delimiter='\t')
    _ = next(reader)  # skip header row
    for row in reader:
        writer.writerow(re.split(r'_|:|/', row[0]))
Update
Open the gzip file in text mode because str objects are required by the CSV module in Python 3.
f = gzip.open('file.txt.gz', 'rt')
Also specify the delimiter when creating the csv.reader.
original_l = csv.reader(original, delimiter='\t')
This will get you past the first hurdle.
Now you need to explain what the data is, which columns you wish to extract, and what the output should look like.
Original answer follows...
One obvious problem is that the output file is constantly being overwritten by the next row of input. This is because the output file is opened in (over)write mode ('w') once per row.
It would be better to open the output file once outside of the loop.
Also, the CSV file delimiter is not specified when creating the reader. You said that the file is tab delimited so specify that:
original_l = csv.reader(original, delimiter='\t')
Your code then attempts to split each row using other delimiters; however, the rows coming from the csv.reader are represented as lists, not the strings that the re.split() code would require.
Another problem is that the output file is not zipped as the name suggests.
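If you want the output to be genuinely gzip-compressed, as its .gz name suggests, a minimal sketch (reusing the same input file and split logic as in Update 2) is to open the output with gzip.open() in text mode as well:

import csv
import gzip
import re

with gzip.open('file.txt.gz', mode='rt') as f, \
        gzip.open('newfile.gz', mode='wt') as final:  # 'wt' makes the output a real gzip file
    writer = csv.writer(final, delimiter=' ')
    reader = csv.reader(f, delimiter='\t')
    next(reader)  # skip header row
    for row in reader:
        writer.writerow(re.split(r'_|:|/', row[0]))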
I am receiving an error on this code: "TypeError: expected string or buffer". I looked around and found out that the error occurs because I am passing re.sub a list, and it does not take lists. However, I wasn't able to figure out how to change each line from the csv file into something that it would read.
I am trying to change all the periods in a csv file into commas. Here is my code:
import csv
import re

in_file = open("/test.csv", "rb")
reader = csv.reader(in_file)
out_file = open("/out.csv", "wb")
writer = csv.writer(out_file)

for row in reader:
    newrow = re.sub(r"(\.)+", ",", row)
    writer.writerow(newrow)

in_file.close()
out_file.close()
I'm sorry if this has already been answered somewhere. There were certainly a lot of answers regarding this error, but I couldn't make any of them work with my csv file. Also, as a side note, this was originally an .xlsb Excel file that I converted into csv in order to be able to work with it. Was that necessary?
You could use a list comprehension to apply your substitution to each item in row:
for row in reader:
    newrow = [re.sub(r"(\.)+", ",", item) for item in row]
    writer.writerow(newrow)
for row in reader does not return a single element to parse; it returns a list of the elements in that row, so you have to unpack that list and process each item individually, just like @Trii showed you:
[re.sub(r'(\.)+', ',', s) for s in row]
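Putting that fix into the full script, a rough sketch (assuming Python 2, which the "expected string or buffer" message suggests, and the same file paths as in the question):

import csv
import re

in_file = open("/test.csv", "rb")
reader = csv.reader(in_file)
out_file = open("/out.csv", "wb")
writer = csv.writer(out_file)

for row in reader:
    # substitute inside each field; row itself is a list
    writer.writerow([re.sub(r"(\.)+", ",", item) for item in row])

in_file.close()
out_file.close()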
In this case, we are using glob to access all the csv files in the directory.
The code below overwrites the source csv file, so there is no need to create an output file.
NOTE:
If you want the re.sub output in a second file instead, replace write = open(i, 'w') with write = open('secondFile.csv', 'w').
import re
import glob

for i in glob.glob("*.csv"):
    read = open(i, 'r')
    reader = read.read()  # read() already returns a string
    csvRe = re.sub(r"(\.)+", ",", reader)
    write = open(i, 'w')
    write.write(csvRe)
    read.close()
    write.close()
I have some code that is meant to convert CSV files into tab delimited files. My problem is that I cannot figure out how to write the correct values in the correct order. Here is my code:
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir + os.path.basename(file)
    tab_file = open(export_dir + os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write(item['name']+'\t'+item['order_num']...)
        tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
Now, since both my write statements are in the for row in data loop, my headers are being written multiple times over. If I outdent the first write statement, I'll have an obvious formatting error. If I move the second write statement above the first and then outdent, my data will be out of order. What can I do to make sure that the first write statement gets written once as a header, and the second gets written for each line in the CSV file? How do I extract the first 'write' statement outside of the loop without breaking the dictionary? Thanks!
The csv module contains methods for writing as well as reading, making this pretty trivial:
import csv

with open("test.csv") as file, open("test_tab.csv", "w") as out:
    reader = csv.reader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow(row)
No need to do it all yourself. Note my use of the with statement, which should always be used when working with files in Python.
Edit: Naturally, if you want to select specific values, you can do that easily enough. You appear to be making your own dictionary to select the values - again, the csv module provides DictReader to do that for you:
import csv

with open("test.csv") as file, open("test_tab.csv", "w") as out:
    reader = csv.DictReader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow([row["name"], row["order_num"], ...])
As kirelagin points out in the comments, writer.writerows() could also be used, here with a generator expression:
writer.writerows([row["name"], row["order_num"], ...] for row in reader)
Extract the code that writes the headers outside the main loop, in such a way that it only gets written exactly once at the beginning.
Also, consider using the CSV module for writing CSV files (not just for reading), don't reinvent the wheel!
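For instance, a minimal sketch of that structure (column names borrowed from the question's snippet; only two of the columns are shown):

import csv

with open("test.csv") as in_file, open("test_tab.csv", "w") as out:
    reader = csv.DictReader(in_file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    writer.writerow(["name", "order_num"])  # header row, written exactly once
    for row in reader:
        writer.writerow([row["name"].strip(), row["order_num"].strip()])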
OK, so I figured it out, but it's not the most elegant solution. Basically, I just ran the first loop, wrote to the file, then ran it a second time and appended the results. See my code below. I would love any input on a better way to accomplish what I've done here. Thanks!
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir + os.path.basename(file)
    tab_file = open(export_dir + os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write(item['name']+'\t'+item['order_num']...)
    tab_file.close()

for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir + os.path.basename(file)
    tab_file = open(export_dir + os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
    tab_file.close()
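The same output can also be produced in a single pass by buffering the second line of each record and appending the buffer at the end, so each file is only read once. A rough sketch (Python 2 as in the original; import_dir and export_dir assumed defined, and only the named columns shown in place of the elided ones):

import csv
import os

for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    tab_file = open(export_dir + os.path.basename(file), 'w')
    deferred = []  # second line of each record, held back until the end
    for row in data:
        item = dict((name, value.strip()) for name, value in zip(fields, row))
        tab_file.write(item['name'] + '\t' + item['order_num'] + '\n')
        deferred.append(item['amt_due'] + '\t' + item['due_date'] + '\n')
    tab_file.writelines(deferred)
    tab_file.close()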
I'm trying to remove the last row in a csv but I'm getting an error: _csv.Error: string with NUL byte
This is what I have so far:
dcsv = open('PnL.csv', 'a+r+b')
cWriter = csv.writer(dcsv, delimiter=' ')
cReader = csv.reader(dcsv)
for row in cReader:
    cWriter.writerow(row[:-1])
I can't figure out why I keep getting errors.
I would just read in the whole file, drop the last row, and then write the rest back out with the csv module:
import csv

with open("summary.csv") as f:
    rows = list(csv.reader(f))
rows = rows[:-1]  # drop the last row
with open("summary.csv", "w") as f:
    cWriter = csv.writer(f, delimiter=',')
    cWriter.writerows(rows)
This should work
f = open('Pnl.csv', 'r')
lines = f.readlines()
f.close()
lines.pop()  # drop the last line
f = open('Pnl.csv', 'w')
f.writelines(lines)
f.close()
I'm not sure what you're doing with the 'a+r+b' file mode and reading and writing to the same file, so won't provide a complete code snippet, but here's a simple method to skip any lines that contains a NUL byte in them in a file you're reading, whether it's the last, first, or one in the middle being read.
The trick is to realize that the docs say the csvfile argument to a csv.reader() "can be any object which supports the iterator protocol and returns a string each time its next() method is called." This means that you can replace the file argument in the call with a simple filter iterator function defined this way:
def filter_nul_byte_lines(a_file):
    for line in a_file:
        if '\x00' not in line:
            yield line
and use it in a way similar to this:
dcsv = open('Pnl.csv', 'rb+')
cReader = csv.reader(filter_nul_byte_lines(dcsv))
for row in cReader:
    print row
This will cause any lines with a NUL byte in them to be ignored while reading the file. Also this technique works on-the-fly as each line is read, so it does not require reading the entire file into memory at once or preprocessing it ahead of time.
Hi, I was supposed to add a customer name to a customer.txt file:
[1, “Amin Milani Fard”, “Columbia College”, 778]
[2, “Ali”, “Douiglas College”, 77238]
def addingcustomer(file_name, new_name):
    f = open(file_name, 'w')
    for line in f:
        if new_name == line:
            return ("already existed")
        elif new_name != line:
            f.write(str(new_name) + "\n")
            return ("succesfully added")
it gives me this error:
Traceback (most recent call last):
  File "C:\Users\Yuvinng\Desktop\Customer assignment 1\Customer assignment 2", line 77, in <module>
    function(c)
  File "C:\Users\Yuvinng\Desktop\Customer assignment 1\Customer assignment 2", line 26, in function
    print (addingcustomer("customerlist.txt",x))
  File "C:\Users\Yuvinng\Desktop\Customer assignment 1\Customer assignment 2", line 60, in addingcustomer
    for line in f:
io.UnsupportedOperation: not readable
You probably want to close the file too, after you are done.
f.close()
You are opening the file with w, meaning you ask for write-only permissions. Then you try to loop through all lines in the file, which is clearly a read operation. IIRC you should open the file with r+, w+ or a+, depending on what behaviour you want (read here for a description). Furthermore, as mentioned by mh512, it is generally a good idea to close your file with f.close() when you're done with it.
However you might also want to rethink your algorithm.
for line in f:
    if new_name == line:
        return ("already existed")
    elif new_name != line:
        f.write(str(new_name) + "\n")
        return ("succesfully added")
For every line it processes, this will either return "already existed" if the line equals the new name, or write the new name to the file and return. Therefore this loop will always return after the first line. Furthermore, even if the name already exists at a later point, it will still be written to the file again as long as it isn't on the first line. Since this is homework I won't give you the complete solution, but as a hint: you might want to loop through all the lines before you decide what to do.
Here are a couple of things that might have caused the error:
You started by: "i was suppose to add a customer name to a customer.txt file" but the stack trace you posted says that the file you are trying to read is "customerlist.txt".
You are opening the file with 'w', which gives write-only privileges. Try using 'r+' (read and write) instead.
E.
Probably not a direct answer to your specific question, but it solves your problem as a side effect.
I assume that every line in your file is a Python list of the form
line = [id, fullname, establishment, some_integer]
and you (or someone else) stored it just by writing it as a line into the file named file_name. That's not Pythonic at all. You should use a standard format like CSV (Comma-Separated Values) for the file, which Python supports in its standard library with the csv module. As delimiter you can choose a comma, a semicolon or whatever you want. Let's assume you chose a semicolon as delimiter. Then the file would look like this:
id; name; establishment; some_other_id
"1"; "Amin Milani Fard"; "Columbia College"; "778"
"2"; "Ali"; "Douiglas College"; "77238"
etc.
Assuming a list
mylist = ["1", "Amin Milani Fard", "Columbia College", "778"]
you would need a CSV writer to write this list:
import csv
wFile = open(file_name, 'w')
csvWriter = csv.writer(wFile, delimiter=';')
csvWriter.writerow(mylist)  # write list to csv file
wFile.close()
If you want to read the list again you do:
rFile = open(file_name, 'r')
csvReader = csv.reader(rFile, delimiter=';')
for row in csvReader:  # step through the rows; each row is a list
    print ','.join(row)
rFile.close()
I don't think this is a big effort for you, but I suggest the following if you want to clean up the files and your code a little bit: turn the file into a csv file by reading all the lines into a list, turning every line into a valid Python list, and then writing those lists with csvWriter to your file again, which will then be a valid csv file. Then use csvReader and csvWriter to add lines to your file if they do not exist.
In your case (assuming the visible formatting is consistent) I would do:
import csv
old_name = 'customer.txt'
new_name = 'customer.csv'
rFile = open(old_name, 'r')
wFile = open(new_name, 'w')
csvWriter = csv.writer(wFile, delimiter=';')
for line in rFile:
    line = line.strip()[1:-1]  # strip braces "[" and "]" and newline "\n"
    mylist = line.split(', ')  # split the line by ', '
    csvWriter.writerow(mylist)
rFile.close()
wFile.close()
You will have a csv file afterwards. Now you can use the csv readers and writers as described.
Edit:
Perhaps the following code snippet helps you to understand what I meant above. csv readers and writers are really not that complicated. See:
import csv

# Creating a customer file with some entries to start with.
wFile = open('test.csv', 'w')  # 'w' means create a fresh file
csvWriter = csv.writer(wFile, quoting=csv.QUOTE_MINIMAL)
csvWriter.writerow(['1', 'old', '2', 'customer'])
csvWriter.writerow(['1', 'an other old', '2', 'customer'])
wFile.close()  # Don't forget to close the file

new_customers = [                   # List of new customers to add if they don't exist.
    ['1', 'old', '2', 'customer'],  # will not be added.
    ['4', 'new', '2', 'customer']   # will be added.
]

# First we want to eliminate existing customers from the list new_customers.
rFile = open('test.csv', 'r')  # 'r' means read file
csvReader = csv.reader(rFile)
print new_customers
for row in csvReader:
    print row
    if row in new_customers:
        new_customers.remove(row)  # remove customer if it already exists
rFile.close()  # Don't forget to close the file

if new_customers:  # there are new customers left in new_customers
    wFile = open('test.csv', 'a')  # 'a' means append to existing file
    csvWriter = csv.writer(wFile, quoting=csv.QUOTE_MINIMAL)
    for customer in new_customers:
        csvWriter.writerow(customer)  # add them to the file.
    wFile.close()