About encode/decode in ftplib in Python 3

I need to handle a list of filenames on an FTP server, but the filenames include Asian characters and other unknown characters. So I need to determine which filenames can be decoded with gb2312 and which with iso-8859-1; that is, if a filename in the list cannot be read using gb2312, fall back to iso-8859-1. I don't know how to write the code in the following function from ftplib:
def retrlines(self, cmd, callback = None):
    """Retrieve data in line mode. A new port is created for you.

    Args:
      cmd: A RETR, LIST, NLST, or MLSD command.
      callback: An optional single parameter callable that is called
                for each line with the trailing CRLF stripped.
                [default: print_line()]

    Returns:
      The response code.
    """
    if callback is None: callback = print_line
    resp = self.sendcmd('TYPE A')
    ################## I need to update here ############################
    with self.transfercmd(cmd) as conn, \
            conn.makefile('r', encoding='iso-8859-1') as fp:
    #####################################################################
        while 1:
            line = fp.readline()
            print(line)
            if self.debugging > 2: print('*retr*', repr(line))
            if not line:
                break
            if line[-2:] == CRLF:
                line = line[:-2]
            elif line[-1:] == '\n':
                line = line[:-1]
            callback(line)
        return self.voidresp()

You aren't including much of the code, so it's hard to tell exactly what is going on. But as a general rule, if the data you are interacting with isn't consistent in its use of encodings, you will have to interact with it in binary mode.
So try not passing in an encoding at all. Hopefully that will give you bytes data back, and you can then encode/decode according to the needs of each file.
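Rather than patching retrlines, one workaround is to set the connection's encoding to iso-8859-1, which round-trips every byte losslessly, then re-encode each returned name and try gb2312 first. A minimal sketch (ftp.example.com and the anonymous login are placeholders):

import ftplib

def decode_name(raw: bytes) -> str:
    # Try GB2312 first; fall back to ISO-8859-1, which accepts any byte.
    try:
        return raw.decode('gb2312')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-1')

ftp = ftplib.FTP('ftp.example.com')  # placeholder host
ftp.encoding = 'iso-8859-1'          # make ftplib's own decoding lossless
ftp.login()

for name in ftp.nlst():
    raw = name.encode('iso-8859-1')  # recover the original on-the-wire bytes
    print(decode_name(raw))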

Related

Message sent over socket missing the \n

I am building a protocol for a TCP/IP socket between Python and MATLAB. While trying to set up the protocol, I ran into a problem. It has to do with the code below:
FPath = Path('c:/test/dogg.jpg')
HASH = Commons.get_file_md5_hash((FPath))
msg = ('IDINFO'+FPath.name+'HASH'+ HASH+'\n')
generates the message
IDINFOdogg.jpgHASH7ad1a930dab3c099939b66267b5c57f8
The message contains IDINFO, which tells the server the name of the file, and HASH, which gives the file's details.
After this I open up the file using
f = open(FPath,"rb")
chunk = f.read(1020)
and build a package with the tag DATA in front
msg = b'DATA' + chunk + b'\n'
The problem is that the b'\n' is not the same as the '\n' in the first message, so MATLAB cannot find the delimiter and won't continue grabbing data chunks.
MATLAB code below for reference. This isn't the entire object, just the part that is potentially causing trouble.
To set up a callback:
set(gh.tcpipServer, 'BytesAvailableFcnMode', 'Terminator');
set(gh.tcpipServer, 'BytesAvailableFcn', @(h,e)gh.Serverpull(h,e));
The function for looking at the bytes:
function Serverpull(gh,h,e)
    gh.msg = fread(gh.tcpipServer, gh.tcpipServer.BytesAvailable);
    gh.msgdecode = char(transpose(gh.msg));
    if strfind(gh.msgdecode, 'IDINFO')
        Hst = strfind(gh.msgdecode, 'HASH');
        gh.Fname = gh.msgdecode(7:Hst-1);
        gh.HASH = gh.msgdecode(Hst+4:end);
        fwrite(gh.tcpipServer, 'GoodToGo');
        gh.PrepareforDataAq()
    elseif strfind(gh.msgdecode, 'DATA')
        fwrite(gh.fileID, gh.msg(5:end), 'double');
    elseif strfind(gh.msgdecode, 'EOF')
        fclose(gh.fileID);
        display('File Transfer Complete')
    end
end

function PrepareforDataAq(gh)
    path = fullfile('c:\temp\', gh.Fname);
    gh.fileID = fopen(path, 'w');
end
For the TL;DR:
How do I make the string '\n' the same as b'\n' when building a TCP message from binary data instead of encoding strings?
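One way to get identical terminators (a sketch, assuming a connected socket sock and reusing the Commons helper from above) is to build both messages as bytes, encoding the header instead of mixing str and bytes:

import socket
from pathlib import Path

sock = socket.create_connection(('localhost', 5000))  # hypothetical peer
FPath = Path('c:/test/dogg.jpg')
HASH = Commons.get_file_md5_hash(FPath)  # helper from the question

# Encode the header up front so both messages are bytes and end in b'\n'.
sock.sendall(('IDINFO' + FPath.name + 'HASH' + HASH).encode('utf-8') + b'\n')

with open(FPath, 'rb') as f:
    while True:
        chunk = f.read(1020)
        if not chunk:
            break
        sock.sendall(b'DATA' + chunk + b'\n')
sock.sendall(b'EOF\n')

Note that raw file bytes can themselves contain 0x0A, so a terminator-based protocol can fire early on binary chunks; a length-prefixed frame is more robust if that becomes a problem.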

Errors when splitting a large 2GB XML file - UnicodeErrors: 'charmap' codec... character maps to <undefined>

I have been wrestling without much success with a 2GB XML file on Windows 10 64-bit. I am using some code found on Github here and managed to get it going but have been getting UnicodeErrors on a particular character \u0126 which is a Ħ (a letter used in the Maltese alphabet). The script executes but after the first chunk is saved and the second started, the error comes up.
Edit: The XML file is a Disqus dump from a local portal.
I have followed the advice found in this SO question, ran chcp 65001 and setx PYTHONIOENCODING utf-8 in the Windows command prompt, and the echo check confirms both settings.
I have tried many of the solutions found in the "Questions that may already have your answer" list, but I still get the UnicodeError on the same letter. I have also tried a crude data.replace('Ħ', 'H') and also data.replace('\\u1026', 'H'), but the error still comes up in the same position. Every time I test something new it takes around 5 minutes until the error comes up, and I've been struggling for over a day with this nuisance.
I tried reading the file in Notepad++ 64-bit but the program ends up Not responding when I do a search as my 16GB RAM are being eaten and system becomes sluggish.
I have had to change the following part of the whole code's first line to read:
cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt', encoding='utf-8')
and also the second line to read:
with open(filename, 'rt', encoding='utf-8') as xml_file:
but still no juice. I also used errors='replace' and errors='ignore' but to no avail.
cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt')
with open(filename, 'rt') as xml_file:
    while True:
        # Read a chunk
        chunk = xml_file.read(CHUNK_SIZE)
        if len(chunk) < CHUNK_SIZE:
            # End of file
            # tell the parser we're done
            p.Parse(chunk, 1)
            # exit the loop
            break
        # process the chunk
        p.Parse(chunk)
# Don't forget to close our handle
cur_file.close()
Another line I had to edit from the original code is cur_file.write(data.encode('utf-8')), which I had to change to:
cur_file.write(data) # .encode('utf-8')) #*
as otherwise the execution was stopping with TypeError: write() argument must be str, not bytes
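A minimal illustration of that mismatch (with throwaway file names): a file opened in text mode accepts only str and encodes it for you, while a file opened in binary mode accepts only bytes.

# Text mode: pass str, the file object encodes it for you.
with open('out.txt', 'wt', encoding='utf-8') as f:
    f.write('Ħ')                    # OK
    # f.write('Ħ'.encode('utf-8'))  # TypeError: write() argument must be str, not bytes

# Binary mode: pass bytes you encoded yourself.
with open('out.bin', 'wb') as f:
    f.write('Ħ'.encode('utf-8'))    # OK

For reference, the parser callback in question: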
def char_data(data):
    """ Called by the parser when it meets character data """
    global cur_size, start
    wroteStart = False
    if start is not None:
        # The data belongs to an element, we should write the start part first
        cur_file.write('<%s%s>' % (start[0], attrs_s(start[1])))
        start = None
        wroteStart = True
    # ``escape`` is too much for us, only & and < need to be escaped here ...
    data = data.replace('&', '&amp;')
    data = data.replace('<', '&lt;')
    if data == '>':
        data = '&gt;'
    cur_file.write(data.encode('utf-8')) #*
    cur_size += len(data)
    if not wroteStart:
        # The data was outside of an element, it could be the right moment to
        # make the split
        next_file()
Any help would be greatly appreciated.
EDIT: added traceback
The problem is always when trying to write the file.
Traceback (most recent call last):
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 249, in <module>
    main(args[0], options.output_dir)
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 229, in main
    p.Parse(chunk)
  File "..\Modules\pyexpat.c", line 282, in CharacterData
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 180, in char_data
    cur_file.write(data) # .encode('utf-8'))
  File "C:\Users\myself\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in position 6: character maps to <undefined>
Edit: I have tried replacing the offending characters in Notepad++, but another one, '\u200e', cropped up, so replacing characters is not robust at all.
I have been a total noob. I modified the write command to use a try/except block that just changes any unwanted character to the empty string. I know the file will lose some information this way, but at least I can split it and look inside!
This is what I did:
try:
    cur_file.write(data)  # .encode('utf-8')) # this was part of the original line
except UnicodeEncodeError:
    data = ''
    cur_file.write(data)
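A gentler variant (a sketch along the same lines) drops only the characters the file's codec cannot represent instead of discarding the whole chunk:

try:
    cur_file.write(data)
except UnicodeEncodeError:
    # Keep everything the target codec can represent; drop only the rest.
    codec = cur_file.encoding or 'utf-8'
    cur_file.write(data.encode(codec, errors='ignore').decode(codec))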

How to check for and discard invalid multi-line JSON log requests in log files?

I'm writing a script to parse some of our requests, and I need to be able to handle malformed or incomplete requests. For example, a typical request comes in with the following format:
log-prefix: {JSON request data}\n
all on a single line, etc...
Then I found out that they have a character buffer limit of 1024 in their writer, so the requests could be spread across many lines, like so:
log-prefix: {First line of data
log-prefix: Second line of requests data
log-prefix: Final line of log data}\n
I'm able to handle this by calling next on the iterator I'm using, then removing the prefix, concatenating the request fragments, and passing the result to json.loads to get the dictionary I need for writing to a file.
I'm doing that in the following way:
lines = (line.strip('\n') for line in inf.readlines())
for line in lines:
    if not line.endswith('}'):
        bad_lines = [line]
        while not line.endswith('}'):
            line = next(lines)
            bad_lines.append(line)
        form_line = malformed_data_handler(bad_lines)
    else:
        form_line = parse_out_json(line)
And my functions used in the above code are:
import json
from typing import Sequence

# logger is assumed to be configured elsewhere in the script.

def malformed_data_handler(lines: Sequence) -> dict:
    """
    Takes n malformed lines of bridge log data (where the JSON response has
    been split across n lines, all containing prefixes) and correctly
    delegates the parsing to parse_out_json before returning the concatenated
    result as a dictionary.

    :param lines: An iterable with malformed lines as the elements
    :return: A dictionary ready for writing.
    """
    logger.debug('Handling malformed data.')
    parsed = ''
    logger.debug(lines)
    print(lines)
    for line in lines:
        logger.info('{}'.format(line))
        parsed += parse_out_malformed(line)
    logger.debug(parsed)
    return json.loads(parsed, encoding='utf8')

def parse_out_json(line: str) -> dict:
    """
    Parses out the JSON response returned from the Apache Bridge logs. Takes a
    line and removes the prefix, returning a dictionary.

    :param line:
    :return:
    """
    data = slice(line.find('{'), None)
    return json.loads(line[data], encoding='utf8')

def parse_out_malformed(line: str) -> str:
    prefix = 'bridge-rails: '
    data = slice(line.find(prefix), None)
    parsed = line[data].replace(prefix, '')
    return parsed
So now to my problem, I've now found instances where the log data can look like this:
log-prefix: {First line of data
....
log-prefix: Last line of data (No closing brace)
log-prefix: {New request}
My first thought was to add some sort of a check like '{' in line. But since I'm using a generator to process the lines scalably, I don't know that I have found one of these requests until I have already called next and pulled the line out of the generator; at that point I can't re-append it, and I'm not sure how to efficiently tell my process to start from that line and continue normally.
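One way around the re-append problem (a sketch, reusing the parse_out_malformed helper from above) is to stop calling next yourself and instead accumulate fragments in a buffer, resetting it whenever a new '{' arrives before the current request has closed:

import json

def parse_requests(lines):
    # Yield parsed requests, discarding any that never close.
    buf = ''
    for line in lines:
        payload = parse_out_malformed(line)  # prefix-stripper from above
        if buf and payload.startswith('{'):
            buf = ''                         # previous request never closed: drop it
        buf += payload
        if buf.endswith('}'):
            try:
                yield json.loads(buf)
            except json.JSONDecodeError:
                continue                     # the '}' was internal: keep accumulating
            buf = ''

Because the loop never advances the iterator out of band, there is nothing to re-append: the first line of the new request is handled in the same iteration that reveals the truncated one.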

Python Urllib Urlopen won't return after new line?

I'm trying to get authenticated by an API I'm attempting to access. I'm using urllib.parse.urlencode to encode the parameters which go in my URL. I'm using urllib.request.urlopen to fetch the content.
This should return 3 values from the server, such as:
SID=AAAAAAAAAAA
LSID=BBBBBBBBBBB
AUTH=CCCCCCCCCCC
The problem is it only returns the first value, and the trailing new line character.
import urllib.request
import urllib.parse

Emailparamx = 'Email'
Emailparam = Emailparamx.encode('utf-8')
email = 'myemail@stackoverflow.com'
email = email.encode('utf-8')
Passwdparam = 'Passwd'
Passwdparam = Passwdparam.encode('utf-8')
password = 'hidden'
password = password.encode('utf-8')
Accounttypeparam = 'accountType'
Accounttypeparam = Accounttypeparam.encode('utf-8')
accounttype = 'GOOGLE'
accounttype = accounttype.encode('utf-8')
Serviceparam = 'service'
Serviceparam = Serviceparam.encode('utf-8')
service = 'adwords'
service = service.encode('utf-8')
url = 'https://accounts.google.com/ClientLogin?'
urlen = url.encode('utf-8')
data = [(Emailparamx, email), (Passwdparam, password),
        (Accounttypeparam, accounttype), (Serviceparam, service)]
auth = ''
dataurl = urllib.parse.urlencode(data)
accessurl = (url + "%s" % dataurl)
fh = urllib.request.urlopen(accessurl)
equals = '='
eqenc = equals.encode('utf-8')
try:
    msg = fh.readline().split(eqenc)
    print(msg)
And then msg prints
[b'SID', b'AAAAAAAAAAAAAAAAA\n']
I know that's some seriously ugly code, I'm about a week old in Python. Any help would be greatly appreciated.
The problem is that you're only calling readline once, so it only reads one line. If you want to read the lines one by one, you have to keep calling readline in a loop until done:
while True:
    msg = fh.readline()
    if not msg:
        break
    msg = msg.split(eqenc)
    print(msg)
However, there's really no good reason to call readline here, because any file-like object (including a urlopen object) is already an iterable full of lines, so you can just do this:
for msg in fh:
    print(msg)
Meanwhile, your original code has a try without an except or a finally, which will just raise a SyntaxError. Presumably you wanted something like this:
try:
    for msg in fh:
        print(msg)
except Exception as e:
    print('Exception: {}'.format(e))
While we're at it, we can simplify your code a bit.
If you look at the examples:
Here is an example session that uses the GET method to retrieve a URL containing parameters:
That's exactly what you want to do here (except for the last line). All the extra stuff you're doing with encoding the strings is not only unnecessary, but incorrect. UTF-8 is the wrong encoding to use for URLs (you get away with it because all of your strings are pure ASCII); urlopen requires a string rather than an encoded byte string (although, at least in CPython 3.0-3.3, it happens to work if you give it byte strings that happen to be encoded properly); urlencode can take byte strings but may not do the right thing (you want to give it the original Unicode so it can quote things properly); etc.
Also, you probably want to decode the result (which is sent as ASCII—for more complicated examples, you'll have to either parse the fh.getheader('Content-Type'), or read the documentation for the API), and strip the newlines.
You also may want to build a structure you can use in your code instead of just printing it out. For example, if you store the results in login_info, and you need the SID in a later request, it's just login_info['SID'].
So, let's wrap things up in a function, then call that function:
import urllib.request
import urllib.parse

def client_login(email, passwd, account_type, service):
    params = {'Email': email,
              'Passwd': passwd,
              'accountType': account_type,
              'service': service}
    qs = urllib.parse.urlencode(params)
    url = 'https://accounts.google.com/ClientLogin?'
    with urllib.request.urlopen(url + qs) as fh:
        return dict(line.strip().decode('ascii').split('=', 1) for line in fh)

email = 'myemail@stackoverflow.com'
password = 'hidden'
accounttype = 'GOOGLE'
service = 'adwords'

try:
    results = client_login(email, password, accounttype, service)
    for key, value in results.items():
        print('key "{}" is "{}"'.format(key, value))
except Exception as e:
    print('Exception: {}'.format(e))

Python urlopen return value

I'm trying to pass existing URLs as parameters to load their HTML into a single txt file:
for line in open('C:\Users\me\Desktop\URLS-HERE.txt'):
    if line.startswith('http') and line.endswith('html\n'):
        fichier = open("C:\Users\me\Desktop\other.txt", "a")
        allhtml = urllib.urlopen(line)
        fichier.write(allhtml)
        fichier.close()
but I get the following error:
TypeError: expected a character buffer object
The value returned by urllib.urlopen() is a file-like object; once you have opened it, you should read it with the read() method, as shown in the following snippet:
for line in open('C:\Users\me\Desktop\URLS-HERE.txt'):
    if line.startswith('http') and line.endswith('html\n'):
        fichier = open("C:\Users\me\Desktop\other.txt", "a")
        allhtml = urllib.urlopen(line)
        fichier.write(allhtml.read())
        fichier.close()
Hope this helps!
The problem here is that urlopen returns a reference to a file object from which you should retrieve HTML.
for line in open(r"C:\Users\me\Desktop\URLS-HERE.txt"):
    if line.startswith('http') and line.endswith('html\n'):
        fichier = open(r"C:\Users\me\Desktop\other.txt", "a")
        allhtml = urllib2.urlopen(line)
        fichier.write(allhtml.read())
        fichier.close()
Please note that the urllib.urlopen function has been marked as deprecated since Python 2.6. It's recommended to use urllib2.urlopen instead.
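As an aside, in Python 3 the two modules were merged into urllib.request, so the equivalent call there is urllib.request.urlopen. A rough sketch of the same loop body, reusing line and fichier from the snippet above:

import urllib.request

# Python 3 equivalent (sketch): urlopen lives in urllib.request and
# returns bytes, so decode before writing to a text-mode file.
# The page encoding is assumed to be UTF-8 here.
with urllib.request.urlopen(line) as allhtml:
    fichier.write(allhtml.read().decode('utf-8', errors='replace'))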
Additionally, you have to be careful when working with Windows paths in your code. You should either escape each backslash:
"C:\\Users\\me\\Desktop\\other.txt"
or use the r prefix before the string. When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change.
r"C:\Users\me\Desktop\other.txt"
