I'm trying to remove duplicates from a CSV file with a lot of data. The removal works as intended, but I can't seem to figure out how to change the encoding when doing the removal in place. Googling for an answer didn't help. Do any of you have a suggestion?
This is my code:
import fileinput

seen = set()
for line in fileinput.FileInput('Dupes.csv', inplace=1):
    if line in seen:
        continue  # skip duplicated line
    seen.add(line)
    print(line, end='')  # with inplace=1, stdout is redirected into the file
This script works fine for me.
import fileinput
import sys

encoding = 'utf8'
end = '\n'
seen = set()
dupeCount = 0
for line in fileinput.FileInput('Dupes.csv', inplace=1, mode='rU'):
    stripped = line.strip()
    if stripped in seen:
        dupeCount += 1
        continue
    seen.add(stripped)
    # Send the output in the right representation by writing
    # the bytes directly to stdout's underlying buffer.
    sys.stdout.buffer.write(stripped.encode(encoding) + end.encode(encoding))
print('Removed %d dupes' % dupeCount)
The idea is to read the file with the right mode, and then write back to the file through stdout in the correct encoding, which is done by writing everything as UTF-8 bytes.
Tested with accents, seems to work.
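If the inplace machinery keeps getting in the way, another option is to skip fileinput entirely and control the encoding on both ends yourself. A minimal sketch of that approach (not the fileinput API; tempfile plus os.replace is just one way to do the swap):

import os
import tempfile

seen = set()
# Read and write with an explicit encoding instead of relying on inplace=1.
with open('Dupes.csv', encoding='utf8') as src, \
     tempfile.NamedTemporaryFile('w', encoding='utf8', suffix='.tmp',
                                 dir='.', delete=False) as dst:
    for line in src:
        if line in seen:
            continue
        seen.add(line)
        dst.write(line)
os.replace(dst.name, 'Dupes.csv')  # swap the deduplicated copy into place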
Related
I am trying to read a file given on the command line and replace all the commas in that file with nothing. Below is my code:
import sys

datafile = sys.argv[1]
with open(datafile, 'r') as data:
    plaintext = data.read()
plaintext = plaintext.replace(',', '')
print(plaintext)
But when printing the plaintext I am getting one extra blank row at the end. Why is this happening, and how can I get rid of it?
Text files conventionally end with a newline; read() keeps it, and print() then appends one of its own, which shows up as an extra blank row. You can strip it with

plaintext = plaintext.rstrip('\n')

This should remove the extra line.
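Putting it together, a sketch of the whole script with that fix applied:

import sys

datafile = sys.argv[1]
with open(datafile, 'r') as data:
    plaintext = data.read()

plaintext = plaintext.replace(',', '')
# rstrip('\n') drops the file's trailing newline, so print() doesn't
# produce an extra, empty-looking row after the content.
print(plaintext.rstrip('\n'))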
I am using this code to find a string in Python:
buildSucceeded = "Build succeeded."
datafile = r'C:\PowerBuild\logs\Release\BuildAllPart2.log'

with open(datafile, 'r') as f:
    for line in f:
        if buildSucceeded in line:
            print(line)
I am quite sure the string is in the file, but the script does not print anything.
If I just print the file line by line, I get a lot of 'NUL' characters between each "valid" character.
EDIT 1:
The problem was the file's encoding on Windows. I changed the encoding following this post and it worked: Why doesn't Python recognize my utf-8 encoded source file?
Anyway the file looks like this:
Line 1.
Line 2.
...
Build succeeded.
0 Warning(s)
0 Error(s)
...
I am currently testing with the Sublime editor for Windows, which shows a 'NUL' character between each "real" character, which is very odd.
Using python command line I have this output:
C:\Dev>python readFile.py
Traceback (most recent call last):
  File "readFile.py", line 7, in <module>
    print(line)
  File "C:\Program Files\Python35\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>
Thanks for your help anyway...
If your file is not that big you can do a simple find. Otherwise I would check the file to see whether the string is really there, check the location for any spelling mistakes, and try to narrow down the problem.
f = open(datafile, 'r')
lines = f.read()
answer = lines.find(buildSucceeded)
Also note that if it does not find the string, answer will be -1.
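For example, a quick sketch of checking the result:

if answer == -1:
    print("String not found")
else:
    print("Found at character offset", answer)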
As explained, the problem was related to encoding: a NUL byte between every visible character is the classic sign of UTF-16 text being read as an 8-bit encoding. The website below has a very good explanation of how to convert files from one encoding to another.
I used the last example (with Python 3, which is my case) and it worked as expected:
buildSucceeded = "Build succeeded."
datafile = 'C:\\PowerBuild\\logs\\Release\\BuildAllPart2.log'

# Open both input and output streams.
#input = open(datafile, "rt", encoding="utf-16")
input = open(datafile, "r", encoding="utf-16")
output = open("output.txt", "w", encoding="utf-8")

# Stream chunks of unicode data.
with input, output:
    while True:
        # Read a chunk of data.
        chunk = input.read(4096)
        if not chunk:
            break
        # Remove vertical tabs.
        chunk = chunk.replace("\u000B", "")
        # Write the chunk of data.
        output.write(chunk)

with open('output.txt', 'r') as f:
    for line in f:
        if buildSucceeded in line:
            print(line)
Source: http://blog.etianen.com/blog/2013/10/05/python-unicode-streams/
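Since the log turned out to be UTF-16, you could also skip the intermediate output.txt and search it directly by passing the encoding to open() (a sketch of the same idea):

buildSucceeded = "Build succeeded."
datafile = r'C:\PowerBuild\logs\Release\BuildAllPart2.log'

with open(datafile, 'r', encoding='utf-16') as f:
    for line in f:
        if buildSucceeded in line:
            print(line.strip())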
I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data in MongoDB.
Currently I have code like this.
with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')
Somehow I still get an error:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8:
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin
I don't get it. Is there some simple way to do it?
UPDATE: it seems like Python and Mongo don't agree about the definition of a valid UTF-8 string.
Try the line below instead of the last two lines. Hope it helps:
line = line.decode('utf-8', 'ignore').encode("utf-8")
For Python 3, as mentioned in a comment in this thread, you can do:
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
The 'ignore' parameter prevents an error from being raised if any characters are unable to be decoded.
If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').
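For illustration, a minimal round trip (the byte string here is made up; \xff is not valid UTF-8 and is dropped by 'ignore'):

raw = b"soccorin\xe2\x86\x90ta0\xffmore"  # \xe2\x86\x90 decodes to an arrow, \xff is invalid
clean = raw.decode("utf-8", "ignore")     # invalid bytes are silently dropped
print(clean)                              # soccorin<-ta0more, with the invalid byte gone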
Example of handling non-UTF-8 characters:
import string
test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
print ''.join(x for x in test if x in string.printable)
with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('cp1252').encode('utf-8')
I want to read some quite large files (to be precise: the Google ngram 1-word dataset) and count how many times a character occurs. Now I wrote this script:
import fileinput

files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]

charcounts = {}
lastfile = ''
for line in fileinput.input(files):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:  # '!=' rather than 'is not': compare values, not identity
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)
which works fine until it reaches approximately line 700,000 of the first file; then I get this error:
../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
  File "charactercounter.py", line 5, in <module>
    for line in fileinput.input(files):
  File "C:\Python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: character maps to <undefined>
To solve this I searched the web a bit, and came up with this code:
import fileinput

files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]

charcounts = {}
lastfile = ''
for line in fileinput.input(files, False, '', 0, 'r', fileinput.hook_encoded('utf-8')):
    line = line.strip()
    data = line.split('\t')
    for character in data[0]:
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)
but the hook I now use tries to read the entire 990 MB file into memory at once, which more or less crashes my PC. Does anyone know how to rewrite this code so that it actually works?
P.S.: the code hasn't even run all the way through yet, so I don't even know if it does what it has to do, but for that to happen I first need to fix this bug.
Oh, and I am using Python 3.2.
I do not know why fileinput does not work as expected.
I suggest you use the open function instead. The return value can be iterated over and will return lines, just like fileinput.
The code will then be something like:
for filename in files:
    print(filename)
    for filelineno, line in enumerate(open(filename, encoding="utf-8")):
        line = line.strip()
        data = line.split('\t')
        # ...
Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).
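Slotted into the original script, that might look like the following sketch (combining this loop with the counting logic from the question):

charcounts = {}
for filename in files:
    print(filename)
    with open(filename, encoding="utf-8") as f:
        for filelineno, line in enumerate(f, start=1):
            data = line.strip().split('\t')
            for character in data[0]:
                charcounts[character] = charcounts.get(character, 0) + int(data[1])
            if filelineno % 100000 == 0:
                print(filelineno)
print(charcounts)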
The problem is that fileinput doesn't use file.xreadlines(), which reads line by line, but file.readlines(bufsize), which reads bufsize bytes at once (and turns them into a list of lines). You are providing 0 for the bufsize parameter of fileinput.input() (which is also the default value). A bufsize of 0 means that the whole file is buffered.
Solution: provide a reasonable bufsize.
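For example, a sketch of what that would look like (assuming the bufsize parameter of Python 3.2's fileinput.input(), which was removed in later versions; whether it helps with the encoded hook is exactly what's in question here):

import fileinput

# Buffer roughly 1 MB at a time instead of the whole file.
for line in fileinput.input(files, bufsize=1 << 20,
                            openhook=fileinput.hook_encoded('utf-8')):
    pass  # process the line here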
This works for me: you can use "utf-8" in the hook definition. I used it on a 50 GB / 200M-line file with no problem.
fi = fileinput.FileInput(openhook=fileinput.hook_encoded("iso-8859-1"))
Could you try reading not the whole file, but a part of it as binary, then decode(), then process it, then call the function again to read another part?
I don't know if the one I have is the latest version (and I don't remember how I read them), but...
$ file -i googlebooks-eng-1M-1gram-20090715-0.csv
googlebooks-eng-1M-1gram-20090715-0.csv: text/plain; charset=us-ascii
Have you tried fileinput.hook_encoded('ascii') or fileinput.hook_encoded('latin_1')? Not sure why this would make a difference, since I think these are just subsets of Unicode with the same mapping, but it's worth a try.
EDIT: I think this might be a bug in fileinput; neither of these works.
If you are worried about the memory usage, why not read line by line using readline()? This will get rid of the memory issues you are running into. Currently you are reading the full file before performing any actions on the file object; with readline() you are not saving the data, merely searching it on a per-line basis.
def charCount1(_file, _char):
    # Reads the whole file into memory in one call.
    result = []
    file = open(_file, encoding="utf-8")
    data = file.read()
    file.close()
    for index, line in enumerate(data.split("\n")):
        if _char in line:
            result.append(index)
    return result

def charCount2(_file, _char):
    # Reads one line at a time, keeping memory usage flat.
    result = []
    count = 0
    file = open(_file, encoding="utf-8")
    while True:
        line = file.readline()
        if not line:
            break
        if _char in line:
            result.append(count)
        count += 1
    file.close()
    return result
I didn't have a chance to really look over your code, but the above samples should give you an idea of how to make the appropriate changes to your structure. charCount1() demonstrates your method, which caches the entire file in a single call to read(). I tested it on a 400+ MB text file and the python.exe process went as high as 900+ MB. When you run charCount2(), the python.exe process shouldn't exceed more than a few MB (provided you haven't bulked up the size with other code). ;)
I have a file in UTF-8, where some lines contain the U+2028 Line Separator character (http://www.fileformat.info/info/unicode/char/2028/index.htm). I don't want it to be treated as a line break when I read lines from the file. Is there a way to exclude it from separators when I iterate over the file or use readlines()? (Besides reading the entire file into a string and then splitting by \n.) Thank you!
I can't duplicate this behaviour in Python 2.5, 2.6 or 3.0 on Mac OS X - U+2028 is always treated as non-endline. Could you go into more detail about where you see this error?
That said, here is a subclass of the "file" class that might do what you want:
#!/usr/bin/python
# -*- coding: utf-8 -*-

class MyFile(file):
    def __init__(self, *arg, **kwarg):
        file.__init__(self, *arg, **kwarg)
        self.EOF = False

    def next(self, catchEOF=False):
        if self.EOF:
            raise StopIteration("End of file")
        try:
            nextLine = file.next(self)
        except StopIteration:
            self.EOF = True
            if not catchEOF:
                raise
            return ""
        # If the line "ends" with U+2028 it was split on the line
        # separator: glue the next physical line onto it.
        if nextLine.decode("utf8")[-1] == u'\u2028':
            return nextLine + self.next(catchEOF=True)
        else:
            return nextLine

A = MyFile("someUnicode.txt")
for line in A:
    print line.strip("\n").decode("utf8")
I couldn't reproduce that behavior, but here's a naive solution that just merges readline results until they don't end with U+2028.
#!/usr/bin/env python
from __future__ import with_statement

def my_readlines(f):
    buf = u""
    for line in f.readlines():
        uline = line.decode('utf8')
        buf += uline
        if uline[-1] != u'\u2028':
            yield buf
            buf = u""
    if buf:
        yield buf

with open("in.txt", "rb") as fin:
    for l in my_readlines(fin):
        print l
Thanks to everyone for answering.
I think I know why you might not have been able to replicate this. I just realized that it happens if I decode the file when opening it, as in:
import codecs

f = codecs.open(filename, encoding='utf-8')
for line in f:
    print line
The lines are not separated on U+2028 if I open the file first and then decode individual lines:
f = open(filename)
for line in f:
    print line.decode("utf8")
(I'm using Python 2.6 on Windows. The file was originally UTF-16LE and then it was converted to UTF-8.)
This is very interesting, I guess I won't be using codecs.open much from now on :-).
If you use Python 3.0 (note that I don't, so I can't test), according to the documentation you can pass an optional newline parameter to open() to specify which line separator to use. However, the documentation doesn't mention U+2028 at all (it only mentions \r, \n, and \r\n as line separators), so it's actually a surprise to me that this even occurs (although I can confirm it even with Python 2.6).
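For what it's worth, a minimal Python 3 sketch of the expected behaviour (assuming Python 3's io layer, which splits text-mode lines only on \n, \r, and \r\n):

with open("in.txt", encoding="utf-8") as f:
    for line in f:
        # U+2028 stays embedded inside the line rather than starting a new one.
        print(repr(line))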
The codecs module is doing the RIGHT thing. U+2028 is named "LINE SEPARATOR" with the comment "may be used to represent this semantic unambiguously". So treating it as a line separator is sensible.
Presumably the creator would not have put the U+2028 characters there without good reason ... does the file have u"\n" as well? Why do you want lines not to be split on U+2028?