Check if a string is encoded in base64 using Python - python

Is there a good way to check if a string is encoded in base64 using Python?

I was looking for a solution to the same problem, then a very simple one just struck me in the head. All you need to do is decode, then re-encode. If the re-encoded string is equal to the encoded string, then it is base64 encoded.
Here is the code:
import base64
def isBase64(s):
try:
return base64.b64encode(base64.b64decode(s)) == s
except Exception:
return False
That's it!
Edit: Here's a version of the function that works with both the string and bytes objects in Python 3:
import base64
def isBase64(sb):
try:
if isinstance(sb, str):
# If there's any unicode here, an exception will be thrown and the function will return false
sb_bytes = bytes(sb, 'ascii')
elif isinstance(sb, bytes):
sb_bytes = sb
else:
raise ValueError("Argument must be string or bytes")
return base64.b64encode(base64.b64decode(sb_bytes)) == sb_bytes
except Exception:
return False

import base64
import binascii
try:
base64.decodestring("foo")
except binascii.Error:
print "no correct base64"

This isn't possible. The best you could do would be to verify that a string might be valid Base 64, although many strings consisting of only ASCII text can be decoded as if they were Base 64.

The solution I used is based on one of the prior answers, but uses more up to date calls.
In my code, the my_image_string is either the image data itself in raw form or it's a base64 string. If the decode fails, then I assume it's raw data.
Note the validate=True keyword argument to b64decode. This is required in order for the assert to be generated by the decoder. Without it there will be no complaints about an illegal string.
import base64, binascii
try:
image_data = base64.b64decode(my_image_string, validate=True)
except binascii.Error:
image_data = my_image_string

Using Python RegEx
import re
txt = "VGhpcyBpcyBlbmNvZGVkIHRleHQ="
x = re.search("^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$", txt)
if (x):
print("Encoded")
else:
print("Non encoded")

Before trying to decode, I like to do a formatting check first as its the lightest weight check and does not return false positives thus following fail-fast coding principles.
Here is a utility function for this task:
RE_BASE64 = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"
def likeBase64(s:str) -> bool:
return False if s is None or not re.search(RE_BASE64, s) else True

if the length of the encoded string is the times of 4, it can be decoded
base64.encodestring("whatever you say").strip().__len__() % 4 == 0
so, you just need to check if the string can match something like above, then it won't throw any exception(I Guess =.=)
if len(the_base64string.strip()) % 4 == 0:
# then you can just decode it anyway
base64.decodestring(the_base64string)

#geoffspear is correct in that this is not 100% possible but you can get pretty close by checking the string header to see if it matches that of a base64 encoded string (re: How to check whether a string is base64 encoded or not).
# check if a string is base64 encoded.
def isBase64Encoded(s):
pattern = re.compile("^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$")
if not s or len(s) < 1:
return False
else:
return pattern.match(s)
Also not that in my case I wanted to return false if the string is empty to avoid decoding as there's no use in decoding nothing.

I know I'm almost 8 years late but you can use a regex expression thus you can verify if a given input is BASE64.
import re
encoding_type = 'Encoding type: '
base64_encoding = 'Base64'
def is_base64():
element = input("Enter encoded element: ")
expression = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$"
matches = re.match(expression, element)
if matches:
print(f"{encoding_type + base64_encoding}")
else:
print("Unknown encoding type.")
is_base64()

def is_base64(s):
s = ''.join([s.strip() for s in s.split("\n")])
try:
enc = base64.b64encode(base64.b64decode(s)).strip()
return enc == s
except TypeError:
return False
In my case, my input, s, had newlines which I had to strip before the comparison.

x = 'possibly base64 encoded string'
result = x
try:
decoded = x.decode('base64', 'strict')
if x == decoded.encode('base64').strip():
result = decoded
except:
pass
this code put in the result variable decoded string if x is really encoded, and just x if not. Just try to decode doesn't always work.

Related

MD5 Hash Cracker -- Unicode Objects must be encoded before hashing [duplicate]

I have this error:
Traceback (most recent call last):
File "python_md5_cracker.py", line 27, in <module>
m.update(line)
TypeError: Unicode-objects must be encoded before hashing
when I try to execute this code in Python 3.2.2:
import hashlib, sys
m = hashlib.md5()
hash = ""
hash_file = input("What is the file name in which the hash resides? ")
wordlist = input("What is your wordlist? (Enter the file name) ")
try:
hashdocument = open(hash_file, "r")
except IOError:
print("Invalid file.")
raw_input()
sys.exit()
else:
hash = hashdocument.readline()
hash = hash.replace("\n", "")
try:
wordlistfile = open(wordlist, "r")
except IOError:
print("Invalid file.")
raw_input()
sys.exit()
else:
pass
for line in wordlistfile:
# Flush the buffer (this caused a massive problem when placed
# at the beginning of the script, because the buffer kept getting
# overwritten, thus comparing incorrect hashes)
m = hashlib.md5()
line = line.replace("\n", "")
m.update(line)
word_hash = m.hexdigest()
if word_hash == hash:
print("Collision! The word corresponding to the given hash is", line)
input()
sys.exit()
print("The hash given does not correspond to any supplied word in the wordlist.")
input()
sys.exit()
It is probably looking for a character encoding from wordlistfile.
wordlistfile = open(wordlist,"r",encoding='utf-8')
Or, if you're working on a line-by-line basis:
line.encode('utf-8')
EDIT
Per the comment below and this answer.
My answer above assumes that the desired output is a str from the wordlist file. If you are comfortable in working in bytes, then you're better off using open(wordlist, "rb"). But it is important to remember that your hashfile should NOT use rb if you are comparing it to the output of hexdigest. hashlib.md5(value).hashdigest() outputs a str and that cannot be directly compared with a bytes object: 'abc' != b'abc'. (There's a lot more to this topic, but I don't have the time ATM).
It should also be noted that this line:
line.replace("\n", "")
Should probably be
line.strip()
That will work for both bytes and str's. But if you decide to simply convert to bytes, then you can change the line to:
line.replace(b"\n", b"")
You must have to define encoding format like utf-8,
Try this easy way,
This example generates a random number using the SHA256 algorithm:
>>> import hashlib
>>> hashlib.sha256(str(random.getrandbits(256)).encode('utf-8')).hexdigest()
'cd183a211ed2434eac4f31b317c573c50e6c24e3a28b82ddcb0bf8bedf387a9f'
import hashlib
string_to_hash = '123'
hash_object = hashlib.sha256(str(string_to_hash).encode('utf-8'))
print('Hash', hash_object.hexdigest())
The error already says what you have to do. MD5 operates on bytes, so you have to encode Unicode string into bytes, e.g. with line.encode('utf-8').
To store the password (PY3):
import hashlib, os
password_salt = os.urandom(32).hex()
password = '12345'
hash = hashlib.sha512()
hash.update(('%s%s' % (password_salt, password)).encode('utf-8'))
password_hash = hash.hexdigest()
encoding this line fixed it for me.
m.update(line.encode('utf-8'))
Please take a look first at that answer.
Now, the error message is clear: you can only use bytes, not Python strings (what used to be unicode in Python < 3), so you have to encode the strings with your preferred encoding: utf-32, utf-16, utf-8 or even one of the restricted 8-bit encodings (what some might call codepages).
The bytes in your wordlist file are being automatically decoded to Unicode by Python 3 as you read from the file. I suggest you do:
m.update(line.encode(wordlistfile.encoding))
so that the encoded data pushed to the md5 algorithm are encoded exactly like the underlying file.
You could open the file in binary mode:
import hashlib
with open(hash_file) as file:
control_hash = file.readline().rstrip("\n")
wordlistfile = open(wordlist, "rb")
# ...
for line in wordlistfile:
if hashlib.md5(line.rstrip(b'\n\r')).hexdigest() == control_hash:
# collision
If it's a single line string. wrapt it with b or B. e.g:
variable = b"This is a variable"
or
variable2 = B"This is also a variable"
This program is the bug free and enhanced version of the above MD5 cracker that reads the file containing list of hashed passwords and checks it against hashed word from the English dictionary word list. Hope it is helpful.
I downloaded the English dictionary from the following link
https://github.com/dwyl/english-words
# md5cracker.py
# English Dictionary https://github.com/dwyl/english-words
import hashlib, sys
hash_file = 'exercise\hashed.txt'
wordlist = 'data_sets\english_dictionary\words.txt'
try:
hashdocument = open(hash_file,'r')
except IOError:
print('Invalid file.')
sys.exit()
else:
count = 0
for hash in hashdocument:
hash = hash.rstrip('\n')
print(hash)
i = 0
with open(wordlist,'r') as wordlistfile:
for word in wordlistfile:
m = hashlib.md5()
word = word.rstrip('\n')
m.update(word.encode('utf-8'))
word_hash = m.hexdigest()
if word_hash==hash:
print('The word, hash combination is ' + word + ',' + hash)
count += 1
break
i += 1
print('Itiration is ' + str(i))
if count == 0:
print('The hash given does not correspond to any supplied word in the wordlist.')
else:
print('Total passwords identified is: ' + str(count))
sys.exit()

Encoding a file with ord function

I'm trying to encode a file and output the encode into a new file, but I got this error:
TypeError: ord() expected string of length 1, but int found
My code:
from sys import argv, exit
def encode(data):
encoded = ''
while data:
current = data[0]
count = 1
for i in data[1:]:
if i == current:
count += 1
else:
break
if count == 255:
break
encoded += '{}{}'.format(chr(ord(current) & 255), chr(count & 255)) #error occurs here.
data = data[count:]
return encoded
if __name__ == '__main__':
if len(argv) < 2:
print('Please specify input file!')
exit(0)
with open(argv[1], 'rb') as (f):
data = f.read()
with open(argv[1] + '.out', 'wb') as (f):
f.write(encode(data))
Additional question: How do I decode the encoded file?
You are reading bytes (open(..., 'rb')), so when you take one element of the byte string, you get a byte, ie. a number. This number already is the character code, so just leave out the ord. Alternatively, you could open the file without the b modifier (open(..., 'r')), which will return a string; I would advise to keep it as a byte string though (or you could run into encoding issues if you are parsing something non-ascii).
You will run into a similar problem saving your file: you cannot write a string into a file opened with the b modifier. Since you have characters outside the ascii range (>128), writing as a string is not a good idea, since python will try to encode your characters (eg. in UTF-8), and you will end up with completely different bytes. Therefore, the best solution probably is not to concat your data to a string in your loop (the part where you do '{}{}'.format(...), but to have a list (encoded = [], concat with encoded.append(current)) and convert that to a byte string using bytes(encoded) after your loop. You can then pass that to write without a problem.
As for how to decode your file, you can just open the file like you do for encoding, read two bytes b1 and b2, and append [b1]*b2 to your output (again, as a list), and convert that to a byte string with bytes().

How to read UTF16-BE encoded bytes with length header

I want to decode a series of strings of variable length which have been encoded in UTF16-BE preceded by a two-bytes long big-endian integer indicating the half the byte-length of the following string. e.g:
Length String (encoded) Length String (encoded) ...
\x00\x05 \x00H\x00e\x00l\x00l\x00o \x00\x06 \x00W\x00o\x00r\x00l\x00d\x00! ...
All these strings and their length headers are concatenated in one big bytestring.
I have the encoded bytestring as bytes object in memory. I would like to have an iterable function which would yield strings until it reaches the end of the ByteString.
Not a huge improvement, but your code can be streamlined a bit.
def decode_strings(byte_string: ByteString) -> Generator[str]:
with io.BytesIO(byte_string) as stream:
while (s := stream.read(2)):
length = int.from_bytes(s, byteorder="big")
yield bytes.decode(stream.read(length), encoding="utf_16_be")
Currently I do it like this, but somehow I'm imagining Raymond Hettinger's "There must be a better way!".
import io
import functools
from typing import ByteString
from typing import Iterable
# Decoders
int_BE = functools.partial(int.from_bytes, byteorder="big")
utf16_BE = functools.partial(bytes.decode, encoding="utf_16_be")
encoded_strings = b"\x00\x05\x00H\x00e\x00l\x00l\x00o\x00\x06\x00W\x00o\x00r\x00l\x00d\x00!"
header_length = 2
def decode_strings(byte_string: ByteString) -> Iterable[str]:
stream = io.BytesIO(byte_string)
while True:
length = int_BE(stream.read(header_length))
if length:
text = utf16_BE(stream.read(length * 2))
yield text
else:
break
stream.close()
if __name__ == "__main__":
for text in decode_strings(encoded_strings):
print(text)
Thanks for any suggestions.

How do you use a variable argument on base64.b64encode but when i am not using the prompt window?

This question is similar to this one here but if I put this into this code like so:
import base64
theone = input('Enter your plaintext: ')
encoded = str(base64.b64encode(theone))
encoded = base64.b64encode(encoded.encode('ascii'))
encoded = encoded[2:]
o = len(encoded)
o = o-1
encoded = encoded[:o]
print(encoded)
it raises this problem:
line 58, in b64encode
encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'
And then if I remove this line of code:
encoded = base64.b64encode(encoded.encode('ascii'))
then it raises the same error. I'm not sure what to do from here and I would be grateful for any help.
You seem to be having problems with bytes and strings. The value returned by input is a string (str), but base64.b64encode expects bytes (bytes).
If you print a bytes instance you see something like
b'spam'
To remove the leading 'b' you need to decode back to a str.
To make your code work, pass bytes to base64.b64encode, and decode the result to print it.
>>> theone = input('Enter your plaintext: ')
Enter your plaintext: Hello World!
>>> encoded = base64.b64encode(theone.encode())
>>> encoded
b'SGVsbG8gV29ybGQh'
>>> print(encoded.decode())
SGVsbG8gV29ybGQh

Python: How do I compare unicode to ascii text?

I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hangaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison doesn't ever return true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
def main():
if len(sys.argv) > 1:
filename = sys.argv[1]
else:
print("Please specify a file to check.")
return
try:
f = open(filename, 'r')
except IOError as e:
print("Sorry! The file doesn't exist.")
return
filecontent = f.read()
f.close()
#y = zk[29]
#print y.decode('utf-8')
for f in filecontent:
for z in zk:
if f == z.decode('utf-8'):
print f
print filename
if __name__ == "__main__":
main()
Am I missing a step?
Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.
Make sure that zk and hk lists contain Unicode strings. Either use Unicode literals e.g., u'a' or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
translation_table = dict(zip(map(ord,zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
for line in f:
print line.translate(translation_table),
You need to convert everything to the same form, and the form is Unicode strings. Unicode strings have no encoding in the sense .encode() or .decode(). When having a non-unicode string, it is actually a stream of bytes that expresses the value in some encoding. When converting to Unicode, you have to .decode(). When storing Unicode string to a sequence of bytes, you have to .encode() the abstraction to concrete bytes.
This way, when loading Unicode strings from an UTF-8 encoded file, or you have to read it into the old strings (non Unicode, sequences of bytes) and then .decode('utf-8'), or you can use `codecs.open(..., encoding='utf-8') -- then you get Unicode strings automatically.
The form # coding=utf-8 is not the usual, but it is OK... if the editor (I mean the tool that you use to write the text) also thinks this way. Then the old strings are displayed by the editor correctly. In the case they should be .decode('utf-8')d to get Unicode. Old strings with ASCII characters only in the same source can also be converted to Unicode using the .decode('utf-8').
To summarize: you are de coding from bytes to Unicode, and you are en coding the Unicode strings into sequence of bytes. It seems from the question that you are doing the opposite.
The following is completely wrong:
for f in filecontent:
for z in zk:
if f == z.decode('utf-8'):
print f
because the filecontent is the result of f.read(). This way it is a sequence of bytes. The f in the loop is one byte. The z.decode('utf-8') returns one Unicode character. They cannot be compared. (By the way, the f is a kind of misleading name for a byte value.)

Categories

Resources