Strange special characters appearing - python

https://docs.google.com/file/d/0B1sEqo7wNB1-TlNEeXh6QldLT2c/edit
I am trying to write a program that will remove the special characters in the txt file linked above.
I already have a remover like this:
import sys
import codecs

chars = [u'\u001A', u'\u001C', u'\u001D', u'\u001E', u'\u0085']
input_file = sys.argv[1]
output_file = sys.argv[2]
ifile = codecs.open(input_file, encoding='utf-8', mode="rb")
ofile = codecs.open(output_file, encoding='utf-8', mode="wb")
for line in ifile:
    for ch in chars:
        if ch in line:
            line = line.replace(ch, '')
    ofile.write(line)
ifile.close()
ofile.close()
But it can't remove those characters from that txt file; instead, it crashes. What should I do?

I would try with:
input_file = sys.argv[1]
output_file = sys.argv[2]
# Open the input as raw bytes so one bad byte does not crash the whole loop;
# note that this also drops any byte that is not valid UTF-8 on its own (i.e. anything non-ASCII).
ifile = open(input_file, mode="rb")
ofile = codecs.open(output_file, encoding='utf-8', mode="wb")
for line in ifile:
    for ch in line:
        try:
            ofile.write(ch.decode('utf-8'))
        except UnicodeDecodeError:
            pass
ifile.close()
ofile.close()
As a little hint: to make the code more Pythonic, look at with statements.
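For example, a minimal sketch of the remover from the question rewritten with with (assuming the same chars list and UTF-8 files; this is not part of the original answer):
import sys
import codecs

chars = [u'\u001A', u'\u001C', u'\u001D', u'\u001E', u'\u0085']

# with closes both files automatically, even if an exception is raised
with codecs.open(sys.argv[1], encoding='utf-8', mode='rb') as ifile, \
        codecs.open(sys.argv[2], encoding='utf-8', mode='wb') as ofile:
    for line in ifile:
        for ch in chars:
            line = line.replace(ch, '')
        ofile.write(line)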

Related

How can I convert a UTF-16-LE txt file to an ANSI txt file and remove the header in Python?

I have a .txt file in UTF-16-LE encoding.
I want to remove the header (1st row) and save it in ANSI.
I can do it manually, but I need to do that for 150 txt files EVERY day, so I wanted to use Python to do it automatically.
But I am stuck; I have tried this code, but it is not working and produces an error:
"return mbcs_encode(input, self.errors)[0]
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character"
filename = "filetochangecodec.txt"
path = "C:/Users/fallen/Desktop/New folder/"
pathfile = path + filename
coding1 = "utf-16-le"
coding2 = "ANSI"
f= open(pathfile, 'r', encoding=coding1)
content= f.read()
f.close()
f= open(pathfile, 'w', encoding=coding2)
f.write(content)
f.close()
A noble contributor helped me with the solution, and I now post it so everyone can benefit and save time.
Instead of trying to write all the content at once, we make a list with every line of the txt file and then write them to a new file one by one with a for loop.
import os

inpath = r"C:/Users/user/Desktop/insert/"
expath = r"C:/Users/user/Desktop/export/"
encoding1 = "utf-16"
encoding2 = "ansi"
input_filename = "text.txt"
input_pathfile = os.path.join(inpath, input_filename)
output_filename = "new_text.txt"
output_pathfile = os.path.join(expath, output_filename)

with open(input_pathfile, 'r', encoding=encoding1) as file_in:
    lines = []
    for line in file_in:
        lines.append(line)

with open(output_pathfile, 'w', encoding=encoding2) as f:
    for line in lines:
        f.write(line)
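As a side note, the intermediate list is not strictly needed, and the question also asked to drop the header row. A minimal sketch of a streaming variant (reusing the same variable names as above; skipping the first row is my addition, based on what the question asked for):
with open(input_pathfile, 'r', encoding=encoding1) as file_in, \
        open(output_pathfile, 'w', encoding=encoding2) as file_out:
    next(file_in)  # skip the header (1st row)
    for line in file_in:
        file_out.write(line)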

Text mining UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1671718: character maps to <undefined>

I have written code to create a frequency table, but it is breaking at the line text_string = document_text.read().lower(). I even put a try and except to catch the error, but it is not helping.
import re
import string
frequency = {}
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    try:
        count = frequency.get(word, 0)
        frequency[word] = count + 1
    except UnicodeDecodeError:
        pass
frequency_list = frequency.keys()
for words in frequency_list:
    print (words, frequency[words])
You are opening your file twice, the second time without specifying the encoding:
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
You should open the file as follows:
frequencies = {}
with open('EVG_text mining.txt', encoding="utf8", mode='r') as f:
    text = f.read().lower()
    match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
    ...
The second time you opened your file, you did not specify which encoding to use, which is probably why it errored.
The with statement helps perform certain tasks linked with I/O for a file. You can read more about it here: https://www.pythonforbeginners.com/files/with-statement-in-python
You should probably have a look at error handling as well, since your try/except was not enclosing the line that was actually causing the error: https://www.pythonforbeginners.com/error-handling/
The code ignoring all decoding issues:
import re
import string  # Do you need this?
with open('EVG_text mining.txt', mode='rb') as f:  # The 'b' in mode makes open() read raw bytes.
    data = f.read()
text = data.decode('utf-8', 'ignore')  # Change 'ignore' to 'replace' to insert a replacement character wherever it finds an undecodable byte.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
frequencies = {}
for word in match_pattern:  # Your error handling wasn't doing anything here, as the error didn't occur here but when reading the file.
    count = frequencies.setdefault(word, 0)
    frequencies[word] = count + 1
for word, freq in frequencies.items():
    print (word, freq)
To read a file with some special characters, you can also try opening it with encoding='latin1' or 'unicode_escape'.
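For example, a minimal sketch ('latin1' maps every possible byte to a character, so it never raises a decode error, but accented characters may come out wrong if the file is not really Latin-1):
with open('EVG_text mining.txt', mode='r', encoding='latin1') as f:
    text = f.read().lower()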

Python 2.7 CSV file read/write \xef\xbb\xbf code

I have a question about reading/writing a CSV file with the 'utf-8-sig' codec in Python 2.7. My CSV header is
['\xef\xbb\xbfID;timestamp;CustomerID;Email']
There are some extra bytes ("\xef\xbb\xbfID") that I read from file A.csv, and I want to write the same bytes and header to file B.csv.
My print log shows:
['\xef\xbb\xbfID;timestamp;CustomerID;Email']
But the actual output file header looks like
ÔªøID;timestamp
Here is the code:
import csv
import shutil
import tempfile

# header_list, s3core, SPARKY_S3 and DESTINATION_PATH come from elsewhere in the original script.
def remove_gdpr_info_from_csv(file_path, file_name, temp_folder, original_header):
    new_temp_folder = tempfile.mkdtemp()
    new_temp_file = new_temp_folder + "/" + file_name
    # Blanked new file
    with open(new_temp_file, 'wb') as outfile:
        writer = csv.writer(outfile, delimiter=";")
        print original_header
        writer.writerow(original_header)
        # File from SFTP
        with open(file_path, 'r') as infile:
            reader = csv.reader(infile, delimiter=";")
            first_row = next(reader)
            email = first_row.index('Email')
            contract_detractor1 = first_row.index('Contact Detractor (Q21)')
            contract_detractor2 = first_row.index('Contact Detractor (Q20)')
            contract_detractor3 = first_row.index('Contact Detractor (Q43)')
            contract_detractor4 = first_row.index('Contact Detractor(Q26)')
            contract_detractor5 = first_row.index('Contact Detractor(Q27)')
            contract_detractor6 = first_row.index('Contact Detractor(Q44)')
            indexes = []
            for column_name in header_list:
                ind = first_row.index(column_name)
                indexes.append(ind)
            for row in reader:
                output_row = []
                for ind in indexes:
                    data = row[ind]
                    if ind == email:
                        data = ''
                    elif ind == contract_detractor1:
                        data = ''
                    elif ind == contract_detractor2:
                        data = ''
                    elif ind == contract_detractor3:
                        data = ''
                    elif ind == contract_detractor4:
                        data = ''
                    elif ind == contract_detractor5:
                        data = ''
                    elif ind == contract_detractor6:
                        data = ''
                    output_row.append(data)
                writer.writerow(output_row)
    s3core.upload_files(SPARKY_S3, DESTINATION_PATH, new_temp_file)
    shutil.rmtree(temp_folder)
    shutil.rmtree(new_temp_folder)
'\xef\xbb\xbf' is the UTF-8 encoded version of the Unicode ZERO WIDTH NO-BREAK SPACE U+FEFF. It is often used as a Byte Order Mark at the beginning of Unicode text files:
when you have the 3 bytes '\xef\xbb\xbf', the file is UTF-8 encoded
when you have the 2 bytes '\xff\xfe', the file is in UTF-16 little endian
when you have the 2 bytes '\xfe\xff', the file is in UTF-16 big endian
The 'utf-8-sig' encoding explicitly asks for writing this BOM at the beginning of the file.
To handle it automatically when reading a CSV file in Python 2, you can use the codecs module:
with open(file_path, 'r') as infile:
    reader = csv.reader(codecs.EncodedFile(infile, 'utf-8', 'utf-8-sig'), delimiter=";")
EncodedFile wraps the original file object, decoding it as utf-8-sig (which skips the BOM) and re-encoding it as utf-8 with no BOM.
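As a quick illustration of where those three bytes come from (a minimal check in a Python 2 shell, not part of the original answer):
>>> u'ID'.encode('utf-8-sig')   # encoding with utf-8-sig prepends the BOM
'\xef\xbb\xbfID'
>>> u'ID'.encode('utf-8')       # plain utf-8 writes no BOM
'ID'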
You want to use the EncodedFile function from the codecs library, as in Serge Ballesta's answer.
However, with Python 2.7 the spelling utf-8-sig is not a supported alias for the UTF-8-sig encoding; you need to use utf_8_sig. Additionally, the argument order puts the output data encoding first and the file encoding second: codecs.EncodedFile(file, datacodec, filecodec=None, errors='strict')
Here's the full result:
import codecs

with open(file_path, 'r') as infile:
    reader = csv.reader(codecs.EncodedFile(infile, 'utf8', 'utf_8_sig'), delimiter=";")

Having trouble writing output to a file

Below is my code. The problem I'm working on is to have the output of my program "written to a file whose name is obtained by appending the string _output to the input file name".
What is the correct way of going about doing this?
fileName = raw_input('Enter the HTML file name:') + '.html'
f = open(fileName, 'r')
myList = f.readlines()
for i in range(0, len(myList)):
    toString = ''.join(myList)
    newString = toString.replace('<span>', '')
    newString = newString.replace('</span>', '')
    print newString  # testing the output
f.close()
Here is revised code. Something like this?
fileName = raw_input('Enter the HTML file name:') + '.html'
f = open(fileName, 'r')
fnew = open(fileName, 'w')
myList = f.readlines()
for i in range(0, len(myList)):
    toString = ''.join(myList)
    newString = toString.replace('<span>', '')
    newString = newString.replace('</span>', '')
    fnew.write(newString)
f.close()
Try:
fileName = raw_input('Enter the HTML file name:') + '.html'
f = open(fileName, 'r+')
toString = f.read()
newString = toString.replace('<span>', '')
newString = newString.replace('</span>', '')
print newString  # testing the output
f.seek(0)      # move back to the start of the file before truncating/rewriting
f.truncate()   # clean all existing content from the file
f.write(newString)  # write to the file
f.close()
Please refer to this post: In Python, is read() or readlines() faster?
If you want to print the output to a new file, then:
new_file = open(new_file_path, 'w')  # If the file does not exist, this creates a new file for writing
new_file.write(newString)
new_file.close()
In that case there is no need to open the first HTML file in read/write mode; just use
f = open(fileName, 'r')
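Since the assignment asks for the output file name to be the input name with _output appended, here is a minimal sketch of building that name (reusing fileName and newString from above; putting _output before the .html extension is my interpretation):
import os

base, ext = os.path.splitext(fileName)   # e.g. 'page.html' -> ('page', '.html')
newName = base + '_output' + ext         # -> 'page_output.html'
new_file = open(newName, 'w')
new_file.write(newString)
new_file.close()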

Problems with Python's file.write() method and string handling

The problem I am having at this point in time (being new to Python) is writing strings to a text file. The issue I'm experiencing is that either the strings don't have line breaks between them, or there is a line break after every character. Code to follow:
import string, io

FileName = input("Arb file name (.txt): ")
MyFile = open(FileName, 'r')
TempFile = open('TempFile.txt', 'w', encoding='UTF-8')
for m_line in MyFile:
    m_line = m_line.strip()
    m_line = m_line.split(": ", 1)
    if len(m_line) > 1:
        del m_line[0]
    #print(m_line)
    MyString = str(m_line)
    MyString = MyString.strip("'[]")
    TempFile.write(MyString)
MyFile.close()
TempFile.close()
My input looks like this:
1 Jargon
2 Python
3 Yada Yada
4 Stuck
My output when I do this is:
JargonPythonYada YadaStuck
I then modify the source code to this:
import string, io

FileName = input("Arb File Name (.txt): ")
MyFile = open(FileName, 'r')
TempFile = open('TempFile.txt', 'w', encoding='UTF-8')
for m_line in MyFile:
    m_line = m_line.strip()
    m_line = m_line.split(": ", 1)
    if len(m_line) > 1:
        del m_line[0]
    #print(m_line)
    MyString = str(m_line)
    MyString = MyString.strip("'[]")
    #print(MyString)
    TempFile.write('\n'.join(MyString))
MyFile.close()
TempFile.close()
Same input and my output looks like this:
J
a
r
g
o
nP
y
t
h
o
nY
a
d
a
Y
a
d
aS
t
u
c
k
Ideally, I would like each of the words to appear on a separate line without the numbers in front of them.
Thanks,
MarleyH
You have to write the '\n' after each line, since you're stripping the original '\n'.
Your idea of using '\n'.join() doesn't work because it will use '\n' to join the string, inserting it between each character of the string. You need a single '\n' after each name instead.
import string, io

FileName = input("Arb file name (.txt): ")
with open(FileName, 'r') as MyFile:
    with open('TempFile.txt', 'w', encoding='UTF-8') as TempFile:
        for line in MyFile:
            line = line.strip().split(": ", 1)
            TempFile.write(line[1] + '\n')
fileName = input("Arb file name (.txt): ")
tempName = 'TempFile.txt'
with open(fileName) as inf, open(tempName, 'w', encoding='UTF-8') as outf:
    for line in inf:
        line = line.strip().split(": ", 1)[-1]
        #print(line)
        outf.write(line + '\n')
Problems:
the result of str.split() is a list (this is why, when you cast it to str, you get ['my item']).
write does not add a newline; if you want one, you have to add it explicitly.
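A quick illustration of both points (a minimal interactive sketch using a made-up input line):
>>> line = "1: Jargon"
>>> line.strip().split(": ", 1)   # split returns a list
['1', 'Jargon']
>>> str(['Jargon'])               # casting that list to str keeps the brackets and quotes
"['Jargon']"
>>> # write() does not add a newline, so append one yourself, e.g. outf.write(line + '\n') as in the code above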
