Converting words into Unicode - python

Can anyone please explain this block of code to me? I don't really understand it.
Why do I need to declare uniMessage = ""? What is it used for here? Sometimes when I code I realise that I need to declare a variable beforehand, but sometimes I don't.
Why do I need to use += and also convert the result to str? Isn't += the same as uniMessage = uniMessage + str(ord(char))? I don't see the effect of += here; why can't I just use =? And I thought the user's input was already a string, so why do I need to convert it to str again?
Also, is it really necessary to convert to an integer in convMessage += chr(int(alphabet))?
message = input("Enter a word ")
uniMessage = ""
for char in message:
    uniMessage += str(ord(char))
print("Unicode message:", uniMessage)
convMessage = ""
for i in range(0, len(uniMessage)-1, 2):
    alphabet = uniMessage[i] + uniMessage[i+1]
    convMessage += chr(int(alphabet))
print("Original message:", convMessage)

Important clarification
The code is not a real Unicode encoding/decoding, because it assumes that every character you input has a code with exactly two decimal digits. You can test that yourself by entering def as input: those characters have the codes 100, 101 and 102, so the decoding loop splits the digits in the wrong places.
(1) uniMessage = "" is needed because the first time through the loop the variable appears on the right-hand side of an assignment.
uniMessage += str(ord(char))
is "equivalent" to:
uniMessage = uniMessage + str(ord(char))
and in Python a name must be bound to a value before it is read, and the = operator evaluates its right operand first.
(2) += is just syntactic sugar, so yes, you can use = with the full expression, but some people would say that is less Pythonic and "harder" to read ;) I recommend using += when you can. You have to convert to a string with str because ord turns each character into a number (an int), and an int cannot be concatenated to a string.
(3) The int conversion is necessary because alphabet, like uniMessage, is a string: a string full of digit characters, but still a string, and chr expects an integer.

That code presumably intends to convert characters into their numeric codes and back to a string. But it only round-trips characters whose code has exactly two decimal digits: it fails for anything below line feed (0x0A = 10 decimal) and above "c" (0x63 = 99 decimal).
Besides that, in Python 3 every str is a sequence of Unicode code points rather than text in one particular encoding. Converting to a specific encoding is possible with str.encode() (UTF-8 by default), but the result is a bytes object, not a str.
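One way to make the round trip work for any character is to stop relying on a fixed number of digits and put a separator between the codes instead. The sketch below is only an illustration: the function names encode_message and decode_message and the choice of a space as separator are assumptions, not part of the original code.

```python
def encode_message(message, sep=" "):
    """Return the decimal code point of every character, joined by sep."""
    return sep.join(str(ord(char)) for char in message)


def decode_message(encoded, sep=" "):
    """Rebuild the text from the separator-delimited code points."""
    if not encoded:
        return ""
    return "".join(chr(int(token)) for token in encoded.split(sep))


uni_message = encode_message("def")
print("Unicode message:", uni_message)                   # 100 101 102
print("Original message:", decode_message(uni_message))  # def
```

Because each code is delimited explicitly, three-digit codes such as 100 and larger code points such as 8226 decode without ambiguity.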

Related

How I can use regex to remove repeated characters from string

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb";
for i in input :
    input = re.sub("(.)\\1+", "",input);
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb";
k=3
for i in input :
    input= re.sub("(.)\\1+{"+(k-1)+"}", "",input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
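Wrapped in a function, the same idea might look like the sketch below. The name remove_runs and the check on k are assumptions added for illustration; the pattern is the one the f-string above produces.

```python
import re


def remove_runs(text, k):
    """Delete each group of k identical consecutive characters, left to right."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # (.) captures one character; \1{k-1} requires k-1 more copies of it.
    pattern = re.compile(r"(.)\1{%d}" % (k - 1))
    return pattern.sub("", text)


print(remove_runs("abccbcbbb", 3))  # abccbc
print(remove_runs("abccbcbbb", 2))  # abbcb
```

Note that a run longer than k is consumed k characters at a time, so with k=3 a run of four identical characters leaves one behind.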
If I were you, I would do it as suggested in the other answer. But since I've already spent time on this question, here is my handmade solution.
The pattern below creates a named group called "letter" that matches a single lowercase letter: first a, then b, and so on as the scan moves through the string. A lookahead then checks whether that letter is immediately followed by one or more copies of itself.
So every letter that is followed by a copy of itself is replaced with the empty string, which collapses each run of repeated letters to a single letter.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question right: you mean to turn "abccbcbbb" into "abcbcb", removing only consecutive duplicate characters? Is there a reason you need to use regex? You could likely do it with a simple loop. This is a really quick and dirty way to do it, but you could just write
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
and with a method like this, you could make it easier to read and faster with a bit of modification, I'd assume.
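If collapsing every run to a single character is the goal, the standard library's itertools.groupby offers a compact alternative to a manual loop. This is a sketch rather than the answerer's method; the function name squeeze is made up for the example.

```python
from itertools import groupby


def squeeze(text):
    """Collapse each run of identical consecutive characters to one character."""
    # groupby yields one (key, group) pair per run of equal items.
    return "".join(key for key, _group in groupby(text))


print(squeeze("abccbcbbb"))  # abcbcb
```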

How do I make this program work with spaces? (Text to ASCII and ASCII to text)

I wrote a program that turns a text into ASCII numbers, and then it turns the ASCII numbers back into the original text. Right now it works with both lowercase letters and uppercase letters but it doesn't work with spaces. This is how my code looks right now:
message_hidden = input("Enter a message that will be hidden: ")
hidden = ""
norm_message = ""
for i in message_hidden:
    hidden = hidden + str(ord(i)-23)
print(hidden)
for i in range(0, len(hidden), 2):
    code = hidden[i] + hidden[i+1]
    norm_message = norm_message + (chr(int(code)+23))
print("The first message was: ", norm_message)
My first attempt was to rewrite the first loop like this:
for i in message_hidden:
    if i.isalpha():
        hidden = hidden + str(ord(i)-23)
    else:
        hidden = hidden + i
print(hidden)
And from here I don't know how I should write the second loop to make it work. Can anyone give me some suggestions about how I should go from here?
ord(' ') == 32 and 32-23 == 9, which is a single digit. You are assuming that your numerical codes all have 2 digits. If you want to keep that assumption, you need to find a different way to encrypt the space character. To do this, find a 2-digit number that isn't one of the numbers obtained from a-zA-Z, and use an explicit if to encrypt the space character to this number. When decrypting, you will also need an explicit if to handle this case.
Alternatively, find a different function (other than subtracting 23) to apply to ord(letter): one which gives 2-digit numbers for all the ord() values that you are interested in. There are infinitely many functions which satisfy this property. Whether or not you can find one which would require less code than simply putting a band-aid on the space character is another question.
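The first suggestion can be sketched as follows. With an offset of 23, the ASCII letters map to 42-67 (A-Z) and 74-99 (a-z), so 10 is free to stand for the space. The names hide, reveal, OFFSET and SPACE_CODE, and the choice of 10, are assumptions made for this example.

```python
OFFSET = 23
SPACE_CODE = "10"  # assumed free: letters only produce 42-67 and 74-99


def hide(message):
    """Encode ASCII letters and spaces as a string of two-digit codes."""
    codes = []
    for char in message:
        if char == " ":
            codes.append(SPACE_CODE)
        elif char.isascii() and char.isalpha():
            codes.append(str(ord(char) - OFFSET))
        else:
            raise ValueError("unsupported character: %r" % char)
    return "".join(codes)


def reveal(hidden):
    """Decode a string produced by hide() back into the original text."""
    chars = []
    for i in range(0, len(hidden), 2):
        code = hidden[i:i + 2]
        if code == SPACE_CODE:
            chars.append(" ")
        else:
            chars.append(chr(int(code) + OFFSET))
    return "".join(chars)


print(hide("a b"))            # 741075
print(reveal(hide("a b")))    # a b
```

Rejecting unsupported characters up front keeps the two-digit assumption from being broken silently.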

How to replace unicode characters in string with something else python?

I have a string that I got from reading a HTML webpage with bullets that have a symbol like "•" because of the bulleted list. Note that the text is an HTML source from a webpage using Python 2.7's urllib2.read(webaddress).
I know the unicode character for the bullet character as U+2022, but how do I actually replace that unicode character with something else?
I tried doing
str.replace("•", "something")
but it does not appear to work... how do I do this?
(1) Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
(2) Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "*")
(3) Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")
(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, mind you that calling a string str shadows the built-in type str.)
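In Python 3, where str is already Unicode, decoding happens once at the input boundary and replace then works on the text directly. A minimal sketch, assuming the page bytes are UTF-8 (the sample bytes and the function name replace_bullets are invented for the example):

```python
def replace_bullets(html_bytes, replacement="*", encoding="utf-8"):
    """Decode raw page bytes, then swap U+2022 BULLET for the replacement."""
    text = html_bytes.decode(encoding)
    return text.replace("\u2022", replacement)


raw = "\u2022 first item".encode("utf-8")  # stand-in for bytes read from a web page
print(replace_bullets(raw))  # * first item
```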
Make the string a unicode string (here with u'' literals):
>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'
import re
regex = re.compile(u"\u2022", re.UNICODE)
newstring = regex.sub(something, yourstring)
Try this one. You will get the output as a normal string:
str.encode().decode('unicode-escape')
After that, you can perform any replacement:
str.replace('•','something')
str1 = "This is Python\u500cPool"
Encode the string to ASCII and replace all the non-ASCII characters with '?'.
str1 = str1.encode("ascii", "replace")
Decode the byte stream to string.
str1 = str1.decode(encoding="utf-8", errors="ignore")
Replace the question mark with the desired character.
str1 = str1.replace("?"," ")
Funny, the answer is hidden among the answers.
str.replace("•", "something")
would work if you use the right semantics.
str.replace(u"\u2022","something")
works wonders ;) , thnx to RParadox for the hint.
If you want to remove all \u characters, the code below is for you:
def replace_unicode_character(self, content: str):
    content = content.encode('utf-8')
    if "\\x80" in str(content):
        count_unicode = 0
        i = 0
        while i < len(content):
            if "\\x" in str(content[i:i + 1]):
                if count_unicode % 3 == 0:
                    content = content[:i] + b'\x80\x80\x80' + content[i + 3:]
                i += 2
                count_unicode += 1
            i += 1
        content = content.replace(b'\x80\x80\x80', b'')
    return content.decode('utf-8')
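If the aim is simply to drop every non-ASCII character, a shorter route is to let the codec discard them. This sketch is an alternative, not a rewrite of the function above; strip_non_ascii is a hypothetical name.

```python
def strip_non_ascii(content):
    """Return content with every character outside the ASCII range removed."""
    # errors="ignore" silently drops characters the ASCII codec cannot encode.
    return content.encode("ascii", errors="ignore").decode("ascii")


print(strip_non_ascii("ABC\u2022def"))  # ABCdef
```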

Unescape/unquote binary strings in (extended) url encoding in python

For analysis I'd have to unescape URL-encoded binary strings (non-printable characters most likely). The strings sadly come in the extended URL-encoding form, e.g. "%u616f". I want to store them in a file that then contains the raw binary values, eg. 0x61 0x6f here.
How do I get this into binary data in python? (urllib.unquote only handles the "%HH"-form)
The strings sadly come in the extended URL-encoding form, e.g. "%u616f"
Incidentally that's not anything to do with URL-encoding. It's an arbitrary made-up format produced by the JavaScript escape() function and pretty much nothing else. If you can, the best thing to do would be to change the JavaScript to use the encodeURIComponent function instead. This will give you a proper, standard URL-encoded UTF-8 string.
e.g. "%u616f". I want to store them in a file that then contains the raw binary values, eg. 0x61 0x6f here.
Are you sure 0x61 0x6f (the letters "ao") is the byte stream you want to store? That would imply UTF-16BE encoding; are you treating all your strings that way?
Normally you'd want to turn the input into Unicode then write it out using an appropriate encoding, such as UTF-8 or UTF-16LE. Here's a quick way of doing it, relying on the hack of making Python read '%u1234' as the string-escaped format u'\u1234':
>>> ex= 'hello %e9 %u616f'
>>> ex.replace('%u', r'\u').replace('%', r'\x').decode('unicode-escape')
u'hello \xe9 \u616f'
>>> print _
hello é 慯
>>> _.encode('utf-8')
'hello \xc3\xa9 \xe6\x85\xaf'
I guess you will have to write the decoder function by yourself. Here is an implementation to get you started:
def decode(file):
    while True:
        c = file.read(1)
        if c == "":
            # End of file
            break
        if c != "%":
            # Not an escape sequence
            yield c
            continue
        c = file.read(1)
        if c != "u":
            # One hex-byte
            yield chr(int(c + file.read(1), 16))
            continue
        # Two hex-bytes
        yield chr(int(file.read(2), 16))
        yield chr(int(file.read(2), 16))
Usage:
input = open("/path/to/input-file", "r")
output = open("/path/to/output-file", "wb")
output.writelines(decode(input))
output.close()
input.close()
Here is a regex-based approach:
import re
# the replace function concatenates the two matches after
# converting them from hex to ascii
repfunc = lambda m: chr(int(m.group(1), 16))+chr(int(m.group(2), 16))
# the last parameter is the text you want to convert
result = re.sub('%u(..)(..)', repfunc, '%u616f')
print result
gives
ao
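The snippets above are Python 2. A Python 3 sketch of the same regex idea, producing a bytes object that can be written to a file opened in "wb" mode, might look like this. It assumes that each %uXXXX should become two raw bytes in the order written (as the question asks) and that any text outside the escapes fits in Latin-1; the name unescape_to_bytes is made up for the example.

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits)
_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def unescape_to_bytes(text):
    """Turn %XX and %uXXXX escapes into the raw bytes they spell out."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Copy the literal text before this escape unchanged.
        out += text[pos:match.start()].encode("latin-1")
        # Exactly one of the two groups is set; both hold plain hex digits.
        out += bytes.fromhex(match.group(1) or match.group(2))
        pos = match.end()
    out += text[pos:].encode("latin-1")
    return bytes(out)


print(unescape_to_bytes("%u616f"))            # b'ao'
print(unescape_to_bytes("hello %e9 %u616f"))  # b'hello \xe9 ao'
```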

Get str repr with double quotes Python

I'm using a small Python script to generate some binary data that will be used in a C header.
This data should be declared as a char[], and it will be nice if it could be encoded as a string (with the pertinent escape sequences when they are not in the range of ASCII printable chars) to keep the header more compact than with a decimal or hexadecimal array encoding.
The problem is that when I print the repr of a Python string, it is delimited by single quotes, and C doesn't like that. The naive solution is to do:
'"%s"'%repr(data)[1:-1]
but that doesn't work when one of the bytes in the data happens to be a double quote, so I'd need them to be escaped too.
I think a simple replace('"', '\\"') could do the job, but maybe there's a better, more pythonic solution out there.
Extra point:
It would also be convenient to split the data into lines of approximately 80 characters, but again the simple approach of splitting the source string into chunks of size 80 won't work, as each non-printable character takes 2 or 3 characters in the escape sequence. Splitting the result into chunks of 80 after getting the repr won't help either, as it could split an escape sequence.
Any suggestions?
You could try json.dumps:
>>> import json
>>> print(json.dumps("hello world"))
"hello world"
>>> print(json.dumps('hëllo "world"!'))
"h\u00ebllo \"world\"!"
I don't know for sure whether JSON strings are compatible with C, but at least they have a pretty large common subset and are guaranteed to be compatible with JavaScript ;).
Better not to hack the repr() but to use the right encoding from the beginning. You can get the repr's escaping directly with the string_escape codec (Python 2 only; it was removed in Python 3):
>>> "naïveté".encode("string_escape")
'na\\xc3\\xafvet\\xc3\\xa9'
>>> print _
na\xc3\xafvet\xc3\xa9
For escaping the "-quotes I think using a simple replace after escape-encoding the string is a completely unambiguous process:
>>> '"%s"' % 'data:\x00\x01 "like this"'.encode("string_escape").replace('"', r'\"')
'"data:\\x00\\x01 \\"like this\\""'
>>> print _
"data:\x00\x01 \"like this\""
If you're asking a python str for its repr, I don't think the type of quote is really configurable. From the PyString_Repr function in the python 2.6.4 source tree:
/* figure out which quote to use; single is preferred */
quote = '\'';
if (smartquotes &&
    memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
    !memchr(op->ob_sval, '"', Py_SIZE(op)))
    quote = '"';
So it uses double quotes only when the string contains a single quote and does not contain a double quote; in every other case it uses single quotes.
I would try something like writing my own class to contain the string data instead of using the built in string to do it. One option would be deriving a class from str and writing your own repr:
class MyString(str):
    __slots__ = []
    def __repr__(self):
        return '"%s"' % self.replace('"', r'\"')
print repr(MyString(r'foo"bar'))
Or, don't use repr at all:
def ready_string(string):
    return '"%s"' % string.replace('"', r'\"')
print ready_string(r'foo"bar')
This simplistic quoting might not do the "right" thing if there's already an escaped quote in the string.
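A byte-oriented approach sidesteps the already-escaped-quote problem, because every byte is escaped exactly once. The Python 3 sketch below is an illustration under stated assumptions: it takes a bytes object, uses three-digit octal escapes (C reads at most three octal digits, whereas a \x escape absorbs every hex digit that follows it), and wraps lines by emitting adjacent string literals, which a C compiler concatenates. The function name bytes_to_c_literal and the default width of 78 are invented for the example.

```python
def bytes_to_c_literal(data, width=78):
    """Render bytes as one or more adjacent C string literals, one per line."""
    lines = []
    current = ""
    for byte in data:
        char = chr(byte)
        if char in ('"', "\\"):
            token = "\\" + char          # escape quote and backslash
        elif 32 <= byte < 127:
            token = char                 # printable ASCII passes through
        else:
            token = "\\%03o" % byte      # fixed-width octal escape
        # +2 accounts for the surrounding quotes; never split inside a token.
        if current and len(current) + len(token) + 2 > width:
            lines.append('"' + current + '"')
            current = ""
        current += token
    lines.append('"' + current + '"')
    return "\n".join(lines)


print(bytes_to_c_literal(b'say "hi"\n'))  # "say \"hi\"\012"
```

Because the line break is decided per token, an escape sequence is never divided between two lines.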
repr() isn't what you want. There's a fundamental problem: repr() can use any representation of the string that can be evaluated as Python to produce the string. That means, in theory, that it might decide to use any number of other constructs which wouldn't be valid in C, such as """long strings""".
This code is probably the right direction. I've used a default of wrapping at 140, which is a sensible value for 2009, but if you really want to wrap your code to 80 columns, just change it.
If unicode=True, it outputs a L"wide" string, which can store Unicode escapes meaningfully. Alternatively, you might want to convert Unicode characters to UTF-8 and output them escaped, depending on the program you're using them in.
def string_to_c(s, max_length = 140, unicode=False):
    ret = []
    # Try to split on whitespace, not in the middle of a word.
    split_at_space_pos = max_length - 10
    if split_at_space_pos < 10:
        split_at_space_pos = None
    position = 0
    if unicode:
        position += 1
        ret.append('L')
    ret.append('"')
    position += 1
    for c in s:
        newline = False
        if c == "\n":
            to_add = "\\\n"
            newline = True
        elif ord(c) < 32 or 0x80 <= ord(c) <= 0xff:
            to_add = "\\x%02x" % ord(c)
        elif ord(c) > 0xff:
            if not unicode:
                raise ValueError, "string contains unicode character but unicode=False"
            to_add = "\\u%04x" % ord(c)
        elif "\\\"".find(c) != -1:
            to_add = "\\%c" % c
        else:
            to_add = c
        ret.append(to_add)
        position += len(to_add)
        if newline:
            position = 0
        if split_at_space_pos is not None and position >= split_at_space_pos and " \t".find(c) != -1:
            ret.append("\\\n")
            position = 0
        elif position >= max_length:
            ret.append("\\\n")
            position = 0
    ret.append('"')
    return "".join(ret)
print string_to_c("testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing", max_length = 20)
print string_to_c("Escapes: \"quote\" \\backslash\\ \x00 \x1f testing \x80 \xff")
print string_to_c(u"Unicode: \u1234", unicode=True)
print string_to_c("""New
lines""")
