Python: Write encrypted data to file

I've made a chat app for school, and some people just write straight into the database. So my next project on it is to encrypt the resources, and I've written an encrypt function.
It works fine, but when I try to write the encrypted data to a file, I get this error message:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x94' in position 0:
character maps to <undefined>
How can I fix that problem?
Complete code:
def encrypts(data, step):
    newdata = ""
    i = 0
    while (len(data) > len(step)):
        step += step[i]
        i += 1
    if (len(data) < len(step)):
        step = step[:len(data)]
    for i in range(len(data)):
        a = ord(data[i])
        b = ord(step[i])
        newdata += chr(a+b)
    return newdata

file = open("C:/Users/David/Desktop/file.msg","wb")
file.write(encrypts("12345","code"))
I finally solved my problem: some of the generated characters didn't exist in the target encoding. So I changed my function:
def encrypts(data, step):
    newdata = ""
    i = 0
    while (len(data) > len(step)):
        step += step[i]
        i += 1
    if (len(data) < len(step)):
        step = step[:len(data)]
    for i in range(len(data)):
        a = ord(data[i])
        b = ord(step[i])
        newdata += chr(a+b-100)  # The "-100" fixed the problem.
    return newdata

When opening a file for writing or saving, try adding the 'b' character to the open mode. So instead of:
open("encryptedFile.txt", 'w')
use
open("encryptedFile.txt", 'wb')
This opens the file as binary, which is necessary when you modify characters the way you are, because you're sometimes shifting those characters to values outside of the ASCII range.
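A minimal Python 3 sketch (my own example, not from the answer) of why the mode matters: with 'wb' you pass bytes that you encoded yourself, so the platform's default codec never sees the out-of-range characters.

```python
# Hypothetical example: '\x94' is the character from the asker's traceback.
# U+0094 has no mapping in cp1252, so text mode would fail on it.
data = "".join(chr(c) for c in (148, 233, 65))  # '\x94', 'é', 'A'

# Binary mode: we choose the codec; latin-1 maps code points 0-255 one-to-one.
with open("encryptedFile.txt", "wb") as f:
    f.write(data.encode("latin-1"))

with open("encryptedFile.txt", "rb") as f:
    restored = f.read().decode("latin-1")

print(restored == data)  # True
```

Opening the same file with open("encryptedFile.txt", "w") would instead route the string through the platform default codec (cp1252 on the asker's machine), which is exactly where '\x94' fails.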

Your problem is in the encoding of the file.
Try this:
import codecs

inputFile = codecs.open('input.txt', 'rb', 'cp1251')
outFile = codecs.open('output.txt', 'wb', 'cp1251')
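For what it's worth, in Python 3 the built-in open() accepts an encoding argument directly, so codecs.open is no longer needed — a small sketch (file names are just placeholders):

```python
# Python 3: pass the codec straight to open() in text mode.
with open("output.txt", "w", encoding="cp1251") as out_file:
    out_file.write("привет")  # Cyrillic text, representable in cp1251

with open("output.txt", "r", encoding="cp1251") as in_file:
    round_tripped = in_file.read()

print(round_tripped == "привет")  # True
```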

Related

Emojis in the dataset cause problems when editing the irregular dataset and exporting it to txt file

I'm stuck while editing an irregular data set and exporting it to a txt file.
Here are my code and data set:
import re

liste = []
with open("magaza_yorumlari_duygu_analizi (1).csv", "r", encoding='utf-16') as f:
    liste = f.readlines()

metin = ""
for satir in liste:
    metin += satir

metin = metin.replace("\n", "")
metin = metin.replace(",Olumsuz", "#0#")
metin = metin.replace(",Olumlu", "#1#")
metin = metin.replace(",Tarafsız", "#2#")
# print(metin)
spt = re.split("#[0-2]#", metin)
etiketler = []
for gorus in spt:
    inx = metin.find(gorus)
    etiketler.append(metin[inx+len(gorus):inx+len(gorus)+3])

with open("sinav_veri_seti.txt", "a") as f:
    for k in range(len(spt)):
        f.write(spt[k]+";;"+etiketler[k]+"\n")
I can't use "utf-8" because it's not supporting the emoji (yes, there are emoji in the data set).
Error:
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f44d' in position 42: character maps to <undefined>
I can't upload the .csv file here because it's over 3 MB, so I uploaded the data set to another site:
https://s6.dosya.tc/server11/97vwee/magaza_yorumlari_duygu_analizi__1_.csv.html
Can anyone help me with this?
I tried editing the data set file (.csv).
I want to remove the emoji from the data set so I can edit it without any problems.
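Two hedged suggestions (mine, not from the post). First, UTF-8 can in fact encode emoji; a 'charmap' error usually means the *output* file was opened without an encoding argument, so the Windows default codec was used. Second, if the goal really is to drop the emoji, a rough regex over the common emoji blocks works; the pattern below is illustrative, not exhaustive:

```python
import re

text = "Great product \U0001f44d highly recommend"

# Option 1: open the output file as UTF-8, which encodes emoji fine.
with open("sinav_veri_seti.txt", "w", encoding="utf-8") as f:
    f.write(text + "\n")

# Option 2: strip emoji with a rough, illustrative character-class pattern.
emoji_pattern = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
cleaned = emoji_pattern.sub("", text)
print(cleaned)  # 'Great product  highly recommend'
```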

How to convert a file's encoding from ANSI to Unicode

When I use the CountVectorizer in sklearn, it needs the file encoded in Unicode, but my data file is encoded in ANSI.
I tried converting the encoding to Unicode using Notepad++, but then readlines could not read all the lines; it read only the last line. After that, I tried to read each line of the data file and write it into a new file as Unicode, but I failed.
import os
import codecs

def merge_file():
    root_dir = "d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
    resname = 'resule_final.txt'
    if os.path.exists(resname):
        os.remove(resname)
    result = codecs.open(resname, 'w', 'utf-8')
    num = 1
    for back_name in os.listdir(r'd:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
        current_dir = root_dir + str(back_name)
        for filename in os.listdir(current_dir):
            print num, ":", str(filename)
            num = num + 1
            path = current_dir + "\\" + str(filename)
            source = open(path, 'r')
            line = source.readline()
            line = line.strip('\n')
            line = line.strip('\r')
            while line != "":
                line = unicode(line, "gbk")
                line = line.replace('\n', ' ')
                line = line.replace('\r', ' ')
                result.write(line + ' ')
                line = source.readline()
            else:
                print 'End file :' + str(filename)
                result.write('\n')
            source.close()
    print 'End All.'
    result.close()
The error message is: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence
Oh, I found the way.
First, use chardet to detect the string's encoding.
Second, use codecs to read from or write to the file in that specific encoding.
Here is the code:
import chardet
import codecs
import os

root_dir = "d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []
for back_name in os.listdir("d:\\workspace\\minibatchk-means\\data\\20_newsgroups"):
    current_dir = root_dir + str(back_name)
    for filename in os.listdir(current_dir):
        print num, ":", str(filename)
        num = num + 1
        path = current_dir + "\\" + str(filename)
        content = open(path, 'r').read()
        source_encoding = chardet.detect(content)['encoding']
        if source_encoding == None:
            print '??', filename
            failed.append(filename)
        elif source_encoding != 'utf-8':
            content = content.decode(source_encoding, 'ignore')
            codecs.open(path, 'w', encoding='utf-8').write(content)
print failed
Thanks for all your help.
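If chardet isn't available, a dependency-free fallback (my sketch, in Python 3) is to try a short list of candidate codecs in order and only then decode lossily:

```python
def decode_best_effort(raw, candidates=("utf-8", "gbk", "cp1252")):
    """Try each candidate codec in order; fall back to a lossy decode."""
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"

# GBK bytes are rejected by utf-8, then decoded successfully by gbk.
text, used = decode_best_effort("中文".encode("gbk"))
print(text, used)  # 中文 gbk
```

The candidate order matters: utf-8 is strict and rarely mis-accepts foreign bytes, while cp1252 accepts almost anything, so it belongs last.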

Detect strange chars in a CSV with Python

I have a CSV with a million lines or so, and some of the lines are mixed with some of these chars (meaning the line can still be read but is mixed with baloney):
ªïÜܵ+>&\ôowó¨ñø4(½;!|Nòdd¼Õõ¿¨W[¦¡¿p\,¶êÕMÜÙ;!ÂeãYÃ3S®´øÂÃ
The input file is ISO-8859-1; each line is filtered and written to a new UTF-8 file.
Can this be filtered? And how?
Here is what it looks like (the entire line):
Foo;Bar;24/01/2019-13:06;24/01/2019-12:55.01;;!
ù:#ªïÜܵ+>&\ôowó¨ñø4(½;!|Nòdd¼Õõ¿¨W[¦¡¿p\,¶êÕMÜÙ;!
ÂeãYÃ3ÃS®´øÂÃç~ÂÂÂÂÃýì¯ãm;!ÃvÂ
´Ã¼ÂÂ9¬u»/"ÂFÃ|b`ÃÃõà±ÃÃÂ8ÃÂ;Baz
This is how I read it,
the encoding for fileobject being ISO-8859-1:
def tee(self, rules=None):
    _good = open(self.good, "a")
    _bad = open(self.bad, "a")
    with open(self.temp, encoding=rules["encoding"]) as fileobject:
        cpt = 0
        _csv = csv.reader(fileobject, **self.dialect)
        for row in _csv:
            _len = len(row)
            _reconstructed = ";".join(row)
            self.count['original'] += 1
            if _len == rules["columns"]:
                _good.write("{}\n".format(_reconstructed))
            else:
                _bad.write("{}\n".format(_reconstructed))
                # print("[{}] {}:{}{}{}".format(_len, cpt, fg("red"), _reconstructed, attr("reset")))
                self.count['errors'] += 1
            cpt += 1
    _good.close()
    _bad.close()
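One way to filter (my heuristic, not from the post): mojibake lines like the sample above are dense in characters outside printable ASCII, so a simple ratio test separates them from normal Foo;Bar;... rows:

```python
def looks_garbled(line, threshold=0.3):
    """Flag a line when too many characters fall outside printable ASCII."""
    if not line:
        return False
    suspicious = sum(1 for ch in line if not (32 <= ord(ch) < 127))
    return suspicious / len(line) > threshold

print(looks_garbled("Foo;Bar;24/01/2019-13:06;OK"))      # False
print(looks_garbled("ªïÜܵ+>&\\ôowó¨ñø4(½;!|Nòdd¼Õõ"))  # True
```

This could be slotted into tee() next to the column-count check, sending flagged rows to _bad; the 0.3 threshold is a guess and deserves tuning against real data, since legitimate accented French or Spanish text also contains some non-ASCII.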

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0446' in position 32: ordinal not in range(128)

I'm trying to debug some code a previous intern wrote, and I'm having difficulty resolving this issue with the answers from other Unicode error posts.
The error comes from the last line of this function:
def dumpTextPacket(self, header, bugLog, offset, outfile):
    bugLog.seek(offset)
    data = bugLog.read(header[1])  # header[1] = size of the packet
    outString = data.decode("utf-8", "ignore")
    if header[3] == 8:  # Removing ugly characters from the packet that has bTag = 8.
        outString = outString[1:]
    outString = outString.strip('\0')  # Remove all 'null' characters from the text
    outString = "{:.3f}".format(header[5]) + ' ms: ' + outString  # Append the timestamp to the beginning of the line
    outfile.write(outString)
I don't have much experience with Unicode, so I would really appreciate any pointers on this issue!
Edit: I'm using Python 2.7, and below is the entire file. Another thing I should mention is that the code does work when parsing some files, but I think it errors on other files when the timestamp gets too big.
In the main.py file we call LogInterpreter.execute(), and the traceback reports the error shown in the title on the line "outfile.write(outString)", the last line of the dumpTextPacket method, which is called from the execute method:
import sys
import os
from struct import unpack

class LogInterpreter:
    def __init__(self):
        self.RTCUpdated = False
        self.RTCOffset = 0.0
        self.LastTimeStamp = 0.0
        self.TimerRolloverCount = 0
        self.ThisTimeStamp = 0.0
        self.m_RTCSeconds = 0.0
        self.m_StartTimeInSec = 0.0

    def GetRTCOffset(self):
        return self.m_RTCSeconds - self.m_StartTimeInSec

    def convertTimeStamp(self, uTime, LogRev):
        TicsPerSecond = 24000000.0
        self.ThisTimeStamp = uTime
        self.RTCOffset = self.GetRTCOffset()
        if int(LogRev) == 2:
            if self.RTCUpdated:
                self.LastTimeStamp = 0.0
            if self.LastTimeStamp > self.ThisTimeStamp:
                self.TimerRolloverCount += 1
        self.LastTimeStamp = self.ThisTimeStamp
        ULnumber = (-1 & 0xffffffff)
        return ((ULnumber/TicsPerSecond)*self.TimerRolloverCount + (uTime/TicsPerSecond) + self.RTCOffset) * 1000.0

    ##########################################################################
    # Information about the header for the current packet we are looking at. #
    ##########################################################################
    def grabHeader(self, bugLog, offset):
        '''
        s_PktHdrRev1
        /*0*/ u16 StartOfPacketMarker; # uShort 2
        /*2*/ u16 SizeOfPacket;        # uShort 2
        /*4*/ u08 LogRev;              # uChar  1
        /*5*/ u08 bTag;                # uChar  1
        /*6*/ u16 iSeq;                # uShort 2
        /*8*/ u32 uTime;               # uLong  4
        '''
        headerSize = 12   # Header size in bytes
        bType = 'HHBBHL'  # struct codes for our byte types
        bugLog.seek(offset)
        data = bugLog.read(headerSize)
        if len(data) < headerSize:
            print('Error in the format of BBLog file')
            sys.exit()
        headerArray = unpack(bType, data)
        convertedTime = self.convertTimeStamp(headerArray[5], headerArray[2])
        headerArray = headerArray[:5] + (convertedTime,)
        return headerArray

    ################################################################
    # bTag = 8 or bTag = 16 --> just write the data to LogMsgs.txt #
    ################################################################
    def dumpTextPacket(self, header, bugLog, offset, outfile):
        bugLog.seek(offset)
        data = bugLog.read(header[1])  # header[1] = size of the packet
        outString = data.decode("utf-8", "ignore")
        if header[3] == 8:  # Removing ugly characters from the packet that has bTag = 8.
            outString = outString[1:]
        outString = outString.strip('\0')  # Remove all 'null' characters from the text
        outString = "{:.3f}".format(header[5]) + ' ms: ' + outString  # Append the timestamp to the beginning of the line
        outfile.write(outString)

    def execute(self):
        path = './Logs/'
        for fn in os.listdir(path):
            fileName = fn
            print fn
            if fileName.endswith(".bin"):
                # if fileName.split('.')[1] == "bin":
                print("Parsing " + fileName)
                outfile = open(path + fileName.split('.')[0] + ".txt", "w")  # Open a file for output
                fileSize = os.path.getsize(path + fileName)
                packetOffset = 0
                with open(path + fileName, 'rb') as bugLog:
                    while packetOffset < fileSize:
                        currHeader = self.grabHeader(bugLog, packetOffset)  # Grab the header for the current packet
                        packetOffset = packetOffset + 12  # Increment the pointer by 12 bytes (size of a header packet)
                        if currHeader[3] == 8 or currHeader[3] == 16:  # Look at the bTag and see if it is a text packet
                            self.dumpTextPacket(currHeader, bugLog, packetOffset, outfile)
                        packetOffset = packetOffset + currHeader[1]  # Move on to the next packet by incrementing the pointer by the size of the current packet
                outfile.close()
                print(fileName + " completed.")
When you add together two strings with one of them being Unicode, Python 2 will coerce the result to Unicode too.
>>> 'a' + u'b'
u'ab'
Since you used data.decode, outString will be Unicode.
When you write to a binary file, you must have a byte string. Python 2 will attempt to convert your Unicode string to a byte string, but it uses the most generic codec it has: 'ascii'. This codec fails on many Unicode characters, specifically those with a codepoint above '\u007f'. You can encode it yourself with a more capable codec to get around this problem:
outfile.write(outString.encode('utf-8'))
Everything changes in Python 3, which won't let you mix byte strings and Unicode strings nor attempt any automatic conversions.
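A Python 3 sketch of the same discipline (my addition, file name is just an example): the bytes/str split is enforced, so the explicit encode the answer recommends becomes mandatory when the file is opened in binary mode:

```python
out_string = "\u0446 = 1.234 ms"  # contains Cyrillic 'ц' (U+0446), as in the title

with open("LogMsgs.txt", "wb") as outfile:
    outfile.write(out_string.encode("utf-8"))  # explicit codec, no implicit 'ascii'

with open("LogMsgs.txt", "rb") as f:
    round_tripped = f.read().decode("utf-8")

print(round_tripped == out_string)  # True
```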

'ascii' codec can't encode character

I am trying to parse an HTML link in the code and take its source code as a list of strings. Since I have to get some relevant data out of it, I am decoding everything with the UTF-8 scheme.
I am also using beautifulsoup4, which extracts the text in decoded form.
This is the code I have used:
def do_underline(line, mistakes):
    last = u'</u></font>'
    first = u"<u><font color='red'>"
    a = [i.decode(encoding='UTF-8', errors='ignore') for i in line]
    lenm = len(mistakes)
    for i in range(lenm):
        a.insert(mistakes[lenm-i-1][2], last)
        a.insert(mistakes[lenm-i-1][1], first)
    b = u''
    return b.join(a)

def readURL(u):
    """
    URL -> List
    Opens a webpage's source code and extracts its text
    along with blank and new lines.
    Enumerates all lines (including blank and new lines).
    """
    global line_dict, q
    line_dict = {}
    p = opener.open(u)
    p1 = p.readlines()
    q = [i.decode(encoding='UTF-8', errors='ignore') for i in p1]
    q1 = [BeautifulSoup(i).get_text() for i in q]
    q2 = list(enumerate(q1))
    line_dict = {i: j for (i, j) in enumerate(q)}
    return q2

def process_file(f):
    """
    (.html file) -> List of Spelling Mistakes
    """
    global line_dict
    re = readURL(f)
    de = del_blankempty(re)
    fd = form_dict(de)
    fflist = []
    chklst = []
    for i in fd:
        chklst = chklst + list_braces(i, line_dict)
        fflist = fflist + find_index_mistakes(i, fd)
    final_list = list(set(is_inside_braces_or_not(chklst, fflist)))
    final_dict = {i: sorted(list(set([final_list[j] for j in range(len(final_list)) if final_list[j][0] == i])), key=lambda student: student[1]) for i in fd}
    for i in line_dict:
        if i in fd:
            line_dict[i] = do_underline(line_dict[i], final_dict[i])
        else:
            line_dict[i] = line_dict[i]
    create_html_file(line_dict)
    print "Your Task is completed"

def create_html_file(a):
    import io
    fl = io.open('Spellcheck1.html', 'w', encoding='UTF-8')
    for i in a:
        fl.write(a[i])
    print "Your HTML text file is created"
I am getting the following error every time I run the script:
Traceback (most recent call last):
File "checker.py", line 258, in <module>
process_file('https://www.fanfiction.net/s/9421614/1/The-Night-Blooming-Flower')
File "checker.py", line 243, in process_file
line_dict[i] = do_underline(line_dict[i],final_dict[i])
File "checker.py", line 89, in do_underline
a = [i.decode(encoding='UTF-8', errors='ignore') for i in line]
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 0: ordinal not in range(128)
Any suggestions on how I can remove this error?
If there is a way to decode everything coming from the given link into UTF-8, then I think it would solve the problem.
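My reading of the traceback (an educated guess, not confirmed by the post): line_dict holds strings that readURL already decoded, and in Python 2 calling .decode on an already-decoded unicode string first *encodes* it with the implicit 'ascii' codec, which is precisely the UnicodeEncodeError shown. The fix is to decode exactly once, at the byte boundary, as sketched here in Python 3 syntax:

```python
raw = "fianc\xe9 \xf3".encode("utf-8")       # bytes as they arrive from the URL
line = raw.decode("utf-8", errors="ignore")  # decode once, at the boundary

# From here on, work only with the decoded string; nothing is decoded again,
# so no implicit ascii step can fire on characters like '\xf3'.
chars = list(line)  # per-character list, no per-character .decode calls
print("".join(chars) == line)  # True
```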
