I have some data that is base64 encoded that I want to convert back to binary even if there is a padding error in it. If I use
base64.decodestring(b64_string)
it raises an 'Incorrect padding' error. Is there another way?
UPDATE: Thanks for all the feedback. To be honest, all the methods mentioned sounded a bit hit
and miss so I decided to try openssl. The following command worked a treat:
openssl enc -d -base64 -in b64string -out binary_data
It seems you just need to add padding to your bytes before decoding. There are many other answers on this question, but I want to point out that (at least in Python 3.x) base64.b64decode will truncate any extra padding, provided there is enough in the first place.
So, something like: b'abc=' works just as well as b'abc==' (as does b'abc=====').
What this means is that you can just add the maximum number of padding characters that you would ever need—which is two (b'==')—and base64 will truncate any unnecessary ones.
This lets you write:
base64.b64decode(s + b'==')
which is simpler than:
base64.b64decode(s + b'=' * (-len(s) % 4))
Note that if the string s already has some padding (e.g. b"aGVsbG8="), this approach will only work if the validate keyword argument is set to False (which is the default). If validate is True this will result in a binascii.Error being raised if the total padding is longer than two characters.
From the docs:
If validate is False (the default), characters that are neither in the normal base-64 alphabet nor the alternative alphabet are discarded prior to the padding check. If validate is True, these non-alphabet characters in the input result in a binascii.Error.
However, if validate is False (or left blank to be the default) you can blindly add two padding characters without any problem. Thanks to eel ghEEz for pointing this out in the comments.
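As a rough illustration of the behaviour described above, here is a minimal sketch. The sample values are my own and not from the question; it assumes Python 3.

```python
import base64
import binascii

# Illustrative inputs (assumed): "hello" encoded, with and without its padding.
stripped = b"aGVsbG8"   # padding removed
padded = b"aGVsbG8="    # padding intact

# With the default validate=False, surplus '=' characters are tolerated.
decoded_stripped = base64.b64decode(stripped + b"==")
decoded_padded = base64.b64decode(padded + b"==")

# With validate=True, the over-padded input is rejected instead.
try:
    base64.b64decode(padded + b"==", validate=True)
    strict_rejected = False
except binascii.Error:
    strict_rejected = True
```

So blindly appending b'==' is only safe when validate is left at its default.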
As said in other responses, there are various ways in which base64 data could be corrupted.
However, as Wikipedia says, removing the padding (the '=' characters at the end of base64 encoded data) is "lossless":
From a theoretical point of view, the padding character is not needed,
since the number of missing bytes can be calculated from the number
of Base64 digits.
So if this is really the only thing "wrong" with your base64 data, the padding can just be added back. I came up with this to be able to parse "data" URLs in WeasyPrint, some of which were base64 without padding:
import base64
import re

def decode_base64(data, altchars=b'+/'):
    """Decode base64, padding being optional.

    :param data: Base64 data as an ASCII byte string
    :returns: The decoded byte string.
    """
    data = re.sub(rb'[^a-zA-Z0-9%s]+' % altchars, b'', data)  # normalize
    missing_padding = len(data) % 4
    if missing_padding:
        data += b'=' * (4 - missing_padding)
    return base64.b64decode(data, altchars)
Tests for this function: weasyprint/tests/test_css.py#L68
Just add padding as required. Heed Michael's warning, however.
b64_string += "=" * ((4 - len(b64_string) % 4) % 4) #ugh
"Incorrect padding" can mean not only "missing padding" but also (believe it or not) "incorrect padding".
If suggested "adding padding" methods don't work, try removing some trailing bytes:
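import base64
import binascii

lens = len(strg)
lenx = lens - (lens % 4 if lens % 4 else 4)  # drop the trailing 1 to 4 characters
try:
    result = base64.decodestring(strg[:lenx])
except binascii.Error:
    pass  # decoding still failed; handle or report it here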
Update: Any fiddling around adding padding or removing possibly bad bytes from the end should be done AFTER removing any whitespace, otherwise length calculations will be upset.
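To make the ordering concrete, here is a small sketch for Python 3. The helper name pad_after_stripping_whitespace and the sample value are hypothetical, made up for illustration: whitespace goes first, and only then is the length used to work out the padding.

```python
import base64

def pad_after_stripping_whitespace(data: bytes) -> bytes:
    """Remove all ASCII whitespace, then restore any missing '=' padding."""
    # bytes.split() with no arguments splits on runs of ASCII whitespace.
    compact = b"".join(data.split())
    return compact + b"=" * (-len(compact) % 4)

# Illustrative input: "hello", line-wrapped and with its padding stripped.
wrapped = b"aGVs\nbG8\n"
result = base64.b64decode(pad_after_stripping_whitespace(wrapped))
```

Computing the padding on the raw 9-byte input would give the wrong count, because the two newlines are included in the length.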
It would be a good idea if you showed us a (short) sample of the data that you need to recover. Edit your question and copy/paste the result of print repr(sample).
Update 2: It is possible that the encoding has been done in an url-safe manner. If this is the case, you will be able to see minus and underscore characters in your data, and you should be able to decode it by using base64.b64decode(strg, '-_')
If you can't see minus and underscore characters in your data, but can see plus and slash characters, then you have some other problem, and may need the add-padding or remove-cruft tricks.
If you can see none of minus, underscore, plus and slash in your data, then you need to determine the two alternate characters; they'll be the ones that aren't in [A-Za-z0-9]. Then you'll need to experiment to see which order they need to be used in the 2nd arg of base64.b64decode()
Update 3: If your data is "company confidential":
(a) you should say so up front
(b) we can explore other avenues in understanding the problem, which is highly likely to be related to what characters are used instead of + and / in the encoding alphabet, or by other formatting or extraneous characters.
One such avenue would be to examine what non-"standard" characters are in your data, e.g.
from collections import defaultdict
import string

d = defaultdict(int)
s = set(string.ascii_letters + string.digits)
for c in your_data:
    if c not in s:
        d[c] += 1
print(d)
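For Python 3, the same character census could be sketched with collections.Counter. The function name unusual_characters and the sample string are assumptions for illustration only.

```python
from collections import Counter
import string

def unusual_characters(data: str) -> Counter:
    """Count every character that is not an ASCII letter or digit."""
    standard = set(string.ascii_letters + string.digits)
    return Counter(c for c in data if c not in standard)

# Illustrative input: URL-safe alphabet characters, padding and a newline.
counts = unusual_characters("ab-c_d-==\n")
```

Seeing - and _ in the result would point to the URL-safe alphabet, while whitespace or other characters would point to formatting debris.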
Use
string += '=' * (-len(string) % 4) # restore stripped '='s
Credit goes to a comment somewhere here.
>>> import base64
>>> enc = base64.b64encode('1')
>>> enc
'MQ=='
>>> base64.b64decode(enc)
'1'
>>> enc = enc.rstrip('=')
>>> enc
'MQ'
>>> base64.b64decode(enc)
...
TypeError: Incorrect padding
>>> base64.b64decode(enc + '=' * (-len(enc) % 4))
'1'
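The session above is from Python 2. A rough Python 3 equivalent, where b64encode and b64decode work on bytes and the failure is a binascii.Error rather than a TypeError, might look like this:

```python
import base64

enc = base64.b64encode(b"1")          # includes the '==' padding
stripped = enc.rstrip(b"=")           # padding removed, as in the session above

# -len(x) % 4 is the number of '=' needed to reach a multiple of four.
restored = stripped + b"=" * (-len(stripped) % 4)
decoded = base64.b64decode(restored)
```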
If there's a padding error it probably means your string is corrupted; base64-encoded strings should have a length that is a multiple of four. You can try adding the padding character (=) yourself to make the length a multiple of four, but it should already be one unless something is wrong.
An "Incorrect padding" error can also occur when metadata is present in front of the base64 payload in the encoded string.
If your string looks something like: 'data:image/png;base64,...base 64 stuff....'
then you need to remove the first part before decoding it.
For example, if you have a base64-encoded image string, try the snippet below:
from PIL import Image
from io import BytesIO
from base64 import b64decode
imagestr = 'data:image/png;base64,...base 64 stuff....'
im = Image.open(BytesIO(b64decode(imagestr.split(',')[1])))
im.save("image.png")
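If you only need the raw bytes and not a PIL image, the prefix handling can be sketched without any third-party dependency. The helper name decode_data_url and the sample URL are hypothetical, chosen just for this illustration.

```python
import base64

def decode_data_url(url: str) -> bytes:
    """Decode the payload of a base64 'data:' URL (hypothetical helper)."""
    header, sep, payload = url.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    # Restore padding in case the producer stripped it.
    return base64.b64decode(payload + "=" * (-len(payload) % 4))

# Illustrative value with the padding stripped.
sample = "data:text/plain;base64,aGVsbG8"
text = decode_data_url(sample)
```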
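You can use base64.urlsafe_b64decode(data) if you are trying to decode a web image, since such data often uses the URL-safe alphabet (- and _ in place of + and /). Note that it does not restore missing padding for you; a string with stripped padding still raises an "Incorrect padding" error.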
Check the documentation of the data source you're trying to decode. Is it possible that you meant to use base64.urlsafe_b64decode(s) instead of base64.b64decode(s)? That's one reason you might have seen this error message.
Decode string s using a URL-safe alphabet, which substitutes - instead
of + and _ instead of / in the standard Base64 alphabet.
This is for example the case for various Google APIs, like Google's Identity Toolkit and Gmail payloads.
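Such APIs often strip the trailing = as well, so a sketch that combines the URL-safe alphabet with padding restoration may be useful. The helper name urlsafe_decode and the sample bytes are assumptions of mine, not taken from any of those APIs.

```python
import base64

def urlsafe_decode(data: bytes) -> bytes:
    """Decode URL-safe base64 whose '=' padding may have been stripped."""
    return base64.urlsafe_b64decode(data + b"=" * (-len(data) % 4))

# Illustrative bytes that encode to '+/8=' in the standard alphabet.
raw = b"\xfb\xff"
token = base64.urlsafe_b64encode(raw).rstrip(b"=")
roundtrip = urlsafe_decode(token)
```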
Adding the padding is rather... fiddly. Here's the function I wrote with the help of the comments in this thread as well as the wiki page for base64 (it's surprisingly helpful) https://en.wikipedia.org/wiki/Base64#Padding.
import base64
import binascii
import logging

def base64_decode(s):
    """Add missing padding to string and return the decoded base64 string."""
    log = logging.getLogger()
    s = str(s).strip()
    try:
        return base64.b64decode(s)
    # TypeError on Python 2, binascii.Error on Python 3
    except (TypeError, binascii.Error):
        padding = len(s) % 4
        if padding == 1:
            log.error("Invalid base64 string: {}".format(s))
            return ''
        elif padding == 2:
            s += '=='
        elif padding == 3:
            s += '='
        return base64.b64decode(s)
There are two ways to correct the input data described here, or, more specifically and in line with the OP, to make Python module base64's b64decode method able to process the input data to something without raising an un-caught exception:
1. Append == to the end of the input data and call base64.b64decode(...)
2. If that raises an exception, then
   i. Catch it via try/except,
   ii. (R?)Strip any = characters from the input data (N.B. this may not be necessary),
   iii. Append A== to the input data (A== through P== will work),
   iv. Call base64.b64decode(...) with those A==-appended input data
The result from Item 1. or Item 2. above will yield the desired result.
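The two steps above can be sketched as follows for Python 3. The function name lenient_b64decode and the sample inputs are my own, and this is not the code from the linked wrapper.

```python
import base64
import binascii

def lenient_b64decode(data: bytes) -> bytes:
    """Decode base64 input that may be truncated, without raising."""
    try:
        # Step 1: surplus '=' is ignored, so '==' covers 0, 1 or 2 missing pads.
        return base64.b64decode(data + b"==")
    except binascii.Error:
        # Step 2: one dangling character; 'A' supplies six zero bits so that
        # the last group can be completed and then padded.
        return base64.b64decode(data.rstrip(b"=") + b"A==")

complete = lenient_b64decode(b"aGVsbG8")   # "hello" with padding stripped
truncated = lenient_b64decode(b"aGVsb")    # the same data cut off mid-group
```

The second result is only a best effort: its last byte is built partly from the zero bits that A contributes.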
Caveats
This does not guarantee the decoded result will be what was originally encoded, but it will (sometimes?) give the OP enough to work with:
"Even with corruption I want to get back to the binary because I can still get some useful info from the ASN.1 stream".
See What we know and Assumptions below.
TL;DR
From some quick tests of base64.b64decode(...)
it appears that it ignores non-[A-Za-z0-9+/] characters; that includes ignoring =s unless they are the last character(s) in a parsed group of four, in which case the =s terminate the decoding (a=b=c=d= gives the same result as abc=, and a==b==c== gives the same result as ab==).
It also appears that all characters appended are ignored after the point where base64.b64decode(...) terminates decoding e.g. from an = as the fourth in a group.
As noted in several comments above, there are either zero, or one, or two, =s of padding required at the end of input data for when the [number of parsed characters to that point modulo 4] value is 0, or 3, or 2, respectively. So, from items 3. and 4. above, appending two or more =s to the input data will correct any [Incorrect padding] problems in those cases.
HOWEVER, decoding cannot handle the case where the [total number of parsed characters modulo 4] is 1, because it takes at least two encoded characters to represent the first decoded byte in a group of three decoded bytes. In uncorrupted encoded input data, this [N modulo 4]=1 case never happens, but as the OP stated that characters may be missing, it could happen here. That is why simply appending =s will not always work, and why appending A== will work when appending == does not. N.B. Using [A] is all but arbitrary: it adds only cleared (zero) bits to the decoded output, which may or may not be correct, but then the object here is not correctness but completion by base64.b64decode(...) sans exceptions.
What we know from the OP and especially subsequent comments is
It is suspected that there are missing data (characters) in the
Base64-encoded input data
The Base64 encoding uses the standard 64 place-values plus padding:
A-Z; a-z; 0-9; +; /; = is padding. This is confirmed, or at least
suggested, by the fact that openssl enc ... works.
Assumptions
The input data contain only 7-bit ASCII data
The only kind of corruption is missing encoded input data
The OP does not care about decoded output data at any point after that corresponding to any missing encoded input data
Github
Here is a wrapper to implement this solution:
https://github.com/drbitboy/missing_b64
I got this error without using base64 at all. In my case the error only occurred on localhost; it works fine on 127.0.0.1.
In my case Gmail Web API was returning the email content as a base64 encoded string, but instead of encoded with the standard base64 characters/alphabet, it was encoded with the "web-safe" characters/alphabet variant of base64. The + and / characters are replaced with - and _. For python 3 use base64.urlsafe_b64decode().
This can be done in one line - no need to add temporary variables:
b64decode(f"{s}{'=' * (-len(s) % 4)}")
In case this error came from a web server: try URL-encoding your POST value. I was POSTing via "curl" and discovered I wasn't URL-encoding my base64 value, so characters like "+" were not escaped, and the web server's URL-decode logic converted each + to a space.
"+" is a valid base64 character and perhaps the only character which gets mangled by an unexpected url-decode.
You should use
base64.b64decode(b64_string, ' /')
By default, the altchars are '+/'.
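To see how an unescaped + ends up as a space, and how percent-encoding avoids it, here is a small sketch using urllib.parse from Python 3. The sample value is made up for illustration.

```python
from urllib.parse import quote, unquote_plus

# Illustrative base64-like text containing '+'.
value = "a+b/c="

# What form decoding on the server does to an unescaped value.
mangled = unquote_plus(value)

# Percent-encode everything that is not unreserved before sending.
escaped = quote(value, safe="")
restored = unquote_plus(escaped)
```

Escaping on the client side is usually the cleaner fix; passing ' /' as altchars only repairs data that has already been mangled.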
I ran into this problem as well and nothing worked.
I finally managed to find the solution which works for me. I had zipped content in base64 and this happened to 1 out of a million records...
This is a version of the solution suggested by Simon Sapin.
In case the padding is missing 3 then I remove the last 3 characters.
Instead of "0gA1RD5L/9AUGtH9MzAwAAA=="
We get "0gA1RD5L/9AUGtH9MzAwAA"
missing_padding = len(data) % 4
if missing_padding == 3:
    data = data[0:-3]
elif missing_padding != 0:
    print ("Missing padding : " + str(missing_padding))
    data += '=' * (4 - missing_padding)
data_decoded = base64.b64decode(data)
According to this answer Trailing As in base64 the reason is nulls. But I still have no idea why the encoder messes this up...
def base64_decode(data: str) -> str:
    data = data.encode("ascii")
    rem = len(data) % 4
    if rem > 0:
        data += b"=" * (4 - rem)
    return base64.urlsafe_b64decode(data).decode('utf-8')
Simply add additional characters like "=" or any other and make it a multiple of 4 before you try decoding the target string value. Something like;
if len(value) % 4 != 0:  # check if multiple of 4
    while len(value) % 4 != 0:
        value = value + "="
    req_str = base64.b64decode(value)
else:
    req_str = base64.b64decode(value)
In my case I faced that error while parsing an email. I got the attachment as a base64 string and extracted it via re.search. It turned out there was a strange additional substring at the end.
dHJhaWxlcgo8PCAvU2l6ZSAxNSAvUm9vdCAxIDAgUiAvSW5mbyAyIDAgUgovSUQgWyhcMDAyXDMz
MHtPcFwyNTZbezU/VzheXDM0MXFcMzExKShcMDAyXDMzMHtPcFwyNTZbezU/VzheXDM0MXFcMzEx
KV0KPj4Kc3RhcnR4cmVmCjY3MDEKJSVFT0YK
--_=ic0008m4wtZ4TqBFd+sXC8--
When I deleted --_=ic0008m4wtZ4TqBFd+sXC8-- and stripped the string, the parsing worked.
So my advice is to make sure that you are decoding a correct base64 string.
Clear your browser cookie and recheck again, it should work.
In my case I faced this error after deleting the venv for the particular project, and it showed the error for every field. I tried changing the browser (Chrome to Edge), and it actually worked.
I'm trying to work with:
softScheck/tplink-smartplug
I'm stuck in a loop of errors. The fix for the first, causes the other, and the fix for the other, causes the first. The code is all found in tplink-smartplug.py at the git link.
cmd = "{\"system\":{\"set_relay_state\":{\"state\":0}}}"
sock_tcp.send(encrypt(cmd))
def encrypt(string):
    key = 171
    result = "\0\0\0\0"
    for i in string:
        a = key ^ ord(i)
        key = a
        result += chr(a)
    return result
As it is, result = 'Ðòøÿ÷Õï¶Å Ôùðè·Ä°Ñ¥ÀâØ£òçöÔîÞ£Þ' and I get the error (on line 92 in original file: sock_tcp.send(encrypt(cmd)):
a bytes-like object is required, not 'str'
so I change the function call to:
sock_tcp.send(encrypt(cmd.encode('utf-8')))
and my error changes too:
ord() expected string of length 1, but int found
I understand what ord() is trying to do, and I understand the encoding. But what I don't understand is: how am I supposed to send this encrypted message to my smart plug if I can't give the compiler what it wants? Is there a workaround? I'm pretty sure the original git was written in Python 2 or earlier. So maybe I'm not converting to Python 3 correctly?
Thanks for reading, I appreciate any help.
In Python 2, the result of encode is a str byte-string, which is a sequence of 1-byte str values. So, when you do for i in string:, each i is a str, and you have to call ord(i) to turn it into a number from 0 to 255.
In Python 3, the result of encode is a bytes byte-string, which is a sequence of 1-byte integers. So when you do for i in string:, each i is already an int from 0 to 255, so you don't have to do anything to convert it. (And, if you try to do it anyway, you'll get the TypeError you're seeing.)
Meanwhile, you're building result up as a str. In Python 2, that's fine, but in Python 3, that means it's Unicode, not bytes. Hence the other TypeError you're seeing.
For more details on how to port Python 2 string-handling code to Python 3, you should read the Porting Guide, especially Text versus binary data, and maybe the Unicode HOWTO if you need more background.
One way you can write the code to work the same way for both Python 2 and 3 is to use a bytearray for both values:
def encrypt(string):
    key = 171
    result = bytearray(b"\0\0\0\0")
    for i in bytearray(string):
        a = key ^ i
        key = a
        result.append(a)
    return result
cmd = u"{\"system\":{\"set_relay_state\":{\"state\":0}}}"
sock_tcp.send(encrypt(cmd.encode('utf-8')))
Notice the u prefix on cmd, which makes sure it's Unicode even in Python 2, and the b prefix on result, which makes sure it's bytes even in Python 3. Although since you know cmd is pure ASCII, it might be simpler to just do this:
cmd = b"{\"system\":{\"set_relay_state\":{\"state\":0}}}"
sock_tcp.send(encrypt(cmd))
If you don't care about Python 2, you can just do for i in string: without converting it to a bytearray, but you still probably want to use one for result. (Being able to append an int directly to it makes for simpler code, and it's even more efficient, as a nice bonus.)
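For a Python 3-only version, a minimal sketch could look like the following. The decrypt function is not part of the question or the original script; I am adding it here only as a hypothetical inverse so the round trip can be checked.

```python
def encrypt(plaintext: bytes) -> bytes:
    """XOR each byte with the previous ciphertext byte, starting from 171."""
    key = 171
    result = bytearray(b"\0\0\0\0")   # four zero bytes prefix, as in the question
    for byte in plaintext:            # iterating bytes yields ints in Python 3
        key ^= byte
        result.append(key)
    return bytes(result)

def decrypt(ciphertext: bytes) -> bytes:
    """Hypothetical inverse of encrypt, used here only for checking."""
    key = 171
    result = bytearray()
    for byte in ciphertext[4:]:       # skip the four-byte prefix
        result.append(key ^ byte)
        key = byte
    return bytes(result)

cmd = b'{"system":{"set_relay_state":{"state":0}}}'
payload = encrypt(cmd)
```

The payload is a bytes object, so it can be passed to sock_tcp.send(payload) directly.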
for i in magicList:
    enemyName = myfont.render(enemy.name,1,(255,255,255))
    number = []
    number.append(i)
    pos = len(number)
    mDisplayText=myfont.render((str(pos, i)),1,(255,255,255))
I am trying to display on screen every item from the list 'magicList', with a number in front of each indicating its position, so it would look something like this:
1) Fireball
2) Explosion
3) Heal
I have been able to display it with just the text, but I can't seem to do it with the numbers. Is my use of len flawed, or is it something else? Every time I try it, it returns this error:
mDisplayText=myfont.render((str(pos, i)),1,(255,255,255))
TypeError: text must be a unicode or bytes
It's weird that it behaves like this considering pos should be an integer, but if anyone knows what's wrong here I'd love to know.
For anyone asking why I don't just do it without the numbers: I have to use the numbers because that is how the player will select an item. Doing it this way means I can append a potentially infinite number of items to the list without having to pre-program them all.
I don't see any problem with your use of len. It is hard to say more without seeing your data, but the documentation has this to say:
Null characters ('\x00') raise a TypeError. Both Unicode and char (byte) strings are accepted. For Unicode strings only UCS-2 characters ('\u0001' to '\uFFFF') are recognized. Anything greater raises a UnicodeError.
See this link for more info https://www.pygame.org/docs/ref/font.html
One more thing: I also suspect the call str(pos, i). Running the lines below to narrow down the cause gives the error "TypeError: coercing to str: need a bytes-like object, int found":
i ='Some text'
pos =2
print(str(pos, i))
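One possible way around this, sketched here without pygame so that it runs on its own, is to build each label as a single string before rendering it. The list contents come from the question; the names magic_list and labels are my own.

```python
# Items taken from the example in the question.
magic_list = ["Fireball", "Explosion", "Heal"]

# enumerate(..., start=1) yields the position alongside each item, so no
# separate counter list is needed.
labels = ["{}) {}".format(pos, name) for pos, name in enumerate(magic_list, start=1)]

for label in labels:
    # Each label is a plain str, which is what myfont.render(label, 1, (255, 255, 255))
    # expects as its first argument.
    print(label)
```

Note that str(pos, i) treats the second argument as an encoding name, which is why it fails; formatting both values into one string avoids that.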
I have a program in Python which analyses file headers and decides which file type it is. (https://github.com/LeoGSA/Browser-Cache-Grabber)
The problem is the following:
I read first 24 bytes of a file:
with open (from_folder+"/"+i, "rb") as myfile:
    header=str(myfile.read(24))
then I look for pattern in it:
if y[1] in header:
    shutil.move (from_folder+"/"+i,to_folder+y[2]+i+y[3])
where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']
y[1] is the pattern and = r'\x47\x40\x00'
the file has it inside, as you can see from the picture below.
the program does NOT find this pattern (r'\x47\x40\x00') in the file header.
so, I tried to print header:
You see? Python sees it as 'G@' instead of '\x47\x40'
and if I search for 'G@'+r'\x00' in header, everything is OK. It finds it.
Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it. Not for some strange 'G@'+r'\x00'.
OR
why does Python see the first two numbers as 'G@' and not as '\x47\x40', though it shows the rest of the header in hex? Is there a way to fix it?
with open (from_folder+"/"+i, "rb") as myfile:
    header=myfile.read(24)
    header = str(binascii.hexlify(header))[2:-1]
the result I get is:
4740001b0000b00d0001c100000001efff3690e23dffffff
And I can work with it.
P.S. But anyway, if anybody will explain what was the problem with 2 first bytes - I would be grateful.
In Python 3 you'll get bytes from a binary read, rather than a string.
No need to convert it to a string by str.
Print will try to convert bytes to something human readable.
If you don't want that, convert your bytes to e.g. hex representations of the integer values of the bytes by:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (''.join ([hex (aByte) for aByte in aBytes]))
Output as redirected from the console:
b'\x00G@\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
You can't search in aBytes for a str pattern with the in operator, since aBytes isn't a string but a sequence of bytes; the pattern would have to be a bytes object too.
If you want to apply a string search on '\x00\x47\x40', use:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (r'\x'.join ([''] + ['%0.2x'%aByte for aByte in aBytes]))
Which will give you:
b'\x00G@\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0
So there are a number of separate issues at play here:
print tries to print something human readable, which succeeds only for the bytes that map to printable ASCII characters (here 'G' and '@').
You can't directly search bytes for a str pattern with in, so either search with a bytes pattern or convert the bytes to a string containing fixed-length hex representations as substrings, as shown.
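As a sketch of the bytes-pattern alternative in Python 3: the header value below is rebuilt from the hex dump in the question, and the variable names are my own.

```python
# Header bytes reconstructed from the hex dump shown in the question.
header = bytes.fromhex("4740001b0000b00d0001c100000001efff3690e23dffffff")

# A bytes literal (b prefix, not a raw str) holds the three actual byte values.
signature = b"\x47\x40\x00"
found = signature in header

# str() on bytes gives the repr, where printable bytes appear as characters.
as_text = str(header)

# bytes.hex() gives a fixed-width hex string if a text search is preferred.
hex_text = header.hex()
```

The in operator works here because both operands are bytes; the original pattern r'\x47\x40\x00' is a 12-character str of backslashes and letters instead.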
I'm using pyserial in python 2.7.5 which according to the docs:
read(size=1)
Parameters:
size – Number of bytes to read.
Returns:
Bytes read from the port.
Read size bytes from the serial port. If a timeout is set it may return less characters as requested. With no timeout it will block until the requested number of bytes is read.
Changed in version 2.5: Returns an instance of bytes when available (Python 2.6 and newer) and str otherwise.
Mostly I want to use the values as hex values and so when I use them I use the following code:
ch = ser.read()
print ch.encode('hex')
This works no problem.
But now I'm trying to read just ONE value as an integer. Because it's read in as a string from serial.read, I'm encountering error after error as I try to get an integer value.
For example:
print ch
prints nothing because it's an invisible character (in this case chr(0x02)).
print int(ch)
raises an error
ValueError: invalid literal for int() with base 10: '\x02'
trying print int(ch,16), ch.decode(), ch.encode('dec'), ord(ch), unichr(ch) all give errors (or nothing).
In fact, the only way I have got it to work is converting it to hex, and then back to an integer:
print int(ch.encode('hex'),16)
this returns the expected 2, but I know I am doing it the wrong way. How do I convert a chr(0x02) value to a 2 more simply?
Believe me, I have searched and am finding ways to do this in python 3, and work-arounds using imported modules. Is there a native way to do this without resorting to importing something to get one value?
edit: I have tried ord(ch) but it is returning 90 and I KNOW the value is 2, 1) because that's what I'm expecting, and 2) because when I get an error, it tells me (as above)
Here is the code I am using that generates 90
count = ser.read(1)
print "count:",ord(ch)
the output is count: 90
and as soon as I cut and pasted that code above I saw the error count != ch!!!!
Thanks
Use the ord function. What you have in your input is a chr(2) (which, as a constant, can also be expressed as '\x02').
i = ord(chr(2))
i = ord('\x02')
would both store the integer 2 in variable i.
I think you want to use ord()
ch = '\x02'
print ord(ch)
=> 2
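The question is about Python 2.7, but for anyone on Python 3, where a serial read returns a bytes object, a rough sketch of the same conversion might be as follows. The value of ch is hard-coded here as an assumption in place of ser.read().

```python
# Stand-in for the single byte returned by ser.read() (assumed value).
ch = b"\x02"

by_index = ch[0]                            # indexing bytes yields an int
by_ord = ord(ch)                            # ord() also accepts length-1 bytes
by_from_bytes = int.from_bytes(ch, "big")   # handy for multi-byte values too
```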