I have some strings that use subscript and superscript characters.
Is there any way I can remove them while keeping the rest of the string?
Here is an example: ¹ºUnless otherwise indicated. How can I remove the superscript ¹º?
Thanks in advance!
The only sure way is to enumerate all the superscript and subscript symbols that might occur and remove the characters that match this set.
If your strings are not too unusual, you can instead filter on the Unicode categories "Letter, other" (Lo) and "Number, other" (No). These cover other characters in addition to super- and subscripts, so check that nothing you need falls into them. For example:
import unicodedata
s = "¹ºUnless otherwise indicated"
cleaned = "".join(c for c in s if unicodedata.category(c) not in ["No", "Lo"])
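As a middle ground between enumerating every symbol by hand and filtering whole categories, the character database itself can say which characters are super- or subscript forms. This is only a sketch: the helper name strip_scripts is made up for illustration, and it assumes that the compatibility decomposition reported by unicodedata.decomposition is tagged <super> or <sub> for every character you care about.

```python
import unicodedata

def strip_scripts(text):
    """Drop characters whose compatibility decomposition is tagged <super> or <sub>."""
    kept = []
    for ch in text:
        # e.g. decomposition('¹') is '<super> 0031', decomposition('₂') is '<sub> 0032'
        tag = unicodedata.decomposition(ch)
        if tag.startswith(("<super>", "<sub>")):
            continue
        kept.append(ch)
    return "".join(kept)

print(strip_scripts("¹ºUnless otherwise indicated"))  # Unless otherwise indicated
print(strip_scripts("H₂O and x²"))                    # HO and x
```

Unlike the category filter, this should leave ordinary letters from other scripts alone, since they carry no <super> or <sub> tag.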
ASCII characters have ordinal values in range(128), that is, 0 through 127 (the upper bound is excluded, and the lower bound defaults to 0). Subscript and superscript characters are not in the ASCII table, so you can strip out any characters outside this range:
>>> x = '¹ºUnless otherwise indicated'
>>> y = ''.join([i for i in x if ord(i) < 128])
>>> y
'Unless otherwise indicated'
This iterates over all of the characters of x, excludes any which are not in the ASCII range, and then joins the resulting list of characters back into a str.
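The same filter can be written with the codec machinery: encoding to ASCII with errors="ignore" silently drops everything outside the ASCII range. A minimal sketch (the function name is only illustrative). Note that, like the comprehension, it also removes accented letters and any other non-ASCII text, not just super- and subscripts.

```python
def strip_non_ascii(text):
    # Characters that cannot be encoded as ASCII are dropped by errors="ignore".
    return text.encode("ascii", errors="ignore").decode("ascii")

print(strip_non_ascii("¹ºUnless otherwise indicated"))  # Unless otherwise indicated
print(strip_non_ascii("café"))                          # caf
```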
Create a list from generator expression:
V = [('\\u26' + str(x)) for x in range(63,70)]
First issue: if you try to use just "\u" + str(...) it gives a decoder error right away. Seems like it tries to decode immediately upon seeing the \u instead of when a full chunk is ready. I am trying to work around that with double backslash.
Second, that creates something promising but still cannot actually print them as unicode to console:
>>> print([v[0:] for v in V])
['\\u2663', '\\u2664', '\\u2665', .....]
>>> print(V[0])
\u2663
What I would expect to see is a list of symbols that look identical to when using commands like '\u0123' such as:
>>> print('\u2663')
♣
Any way to do that from a generated list? Or is there a better way to print them instead of the '\u0123' format?
This is NOT what I want. I want to see the actual symbols drawn, not the Unicode values:
>>> print(['{}'.format(v[0:]) for v in V])
['\\u2663', '\\u2664', '\\u2665', '\\u2666', '\\u2667', '\\u2668', '\\u2669']
Unicode assigns a code point (a number) to each character; escape sequences such as '\u2663' are only a way to write those code points in source code. Python 3 strings are Unicode. To get the character that corresponds to a Unicode code point, use chr:
chr(i)
Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().
The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.
Passing 2663 and 2670 as decimal numbers does not give the expected characters:
>>> [chr(x) for x in range(2663,2670)]
['੧', '੨', '੩', '੪', '੫', '੬', '੭']
Escape sequences use hexadecimal notation though. 0x2663 is 9827 in decimal, and 0x2670 becomes 9840.
>>> [chr(x) for x in range(9827,9840)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
You can also use hex numeric literals:
>>> [chr(x) for x in range(0x2663,0x2670)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
Or, to use exactly the same logic as the question:
>>> [chr(0x2600 + x) for x in range(0x63,0x70)]
['♣', '♤', '♥', '♦', '♧', '♨', '♩', '♪', '♫', '♬', '♭', '♮', '♯']
The reason the original code doesn't work is that escape sequences are used to represent a single character in a string literal when we can't or don't want to type the character itself. The interpreter or compiler replaces them with the corresponding character immediately, when the literal is parsed, so they cannot be assembled from pieces at run time. The literal '\\u26' is an escaped \ followed by u, 2 and 6:
>>> len('\\u26')
4
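If the code points really do arrive as text (for instance '26' plus two more hex digits, as in the question), they can be converted at run time instead of relying on escape sequences in a literal. A sketch of two options: parse the hex digits with int(..., 16) and pass the result to chr, or decode the backslash text with the unicode_escape codec. The variable names are illustrative, and the trick of using str(x) only works while the decimal digits of x are meant to be read as hex digits.

```python
# Option 1: build the hex digits as text, parse them as base 16, then call chr.
symbols = [chr(int("26" + str(x), 16)) for x in range(63, 70)]
print(symbols)

# Option 2: turn the text backslash-u-2663 into the character it names.
V = ["\\u26" + str(x) for x in range(63, 70)]
decoded = [v.encode("ascii").decode("unicode_escape") for v in V]
print(decoded)
```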
I am trying to create a program that can get a string and return the string with the characters sorted by their ASCII Values (Ex: "Hello, World!" should return "!,HWdellloor").
I've tried this and it worked:
text = "Hello, World!"
cop = []
for i in range(len(text)):
    cop.append(ord(text[i]))
cop.sort()
ascii_text = "".join([str(chr(item)).strip() for item in cop])
print(ascii_text)
But I was curious to see if something like this is possible with string manipulation functions alone.
ASCII is a subset of Unicode. The first 128 code points of Unicode (0-127) are identical to ASCII (though when ASCII was invented in 1963 the word "code point" was not in use). So if your data is all ASCII and you just call the built-in sorted() function (or the sort() method of a list of characters), you will get it in ASCII order.
Like this:
>>> text = "Hello, World!"
>>> ''.join(sorted(text))
' !,HWdellloor'
The result is sorted by the Unicode codepoint of the character, but since that is the same as the ascii codepoint, there is no difference.
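The expected output in the question has no leading space, so the space apparently should be dropped. One way to sketch that with string methods alone, assuming only the plain space character needs removing:

```python
text = "Hello, World!"
# Remove spaces first, then sort the remaining characters by code point.
result = "".join(sorted(text.replace(" ", "")))
print(result)  # !,HWdellloor
```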
As you can see, the first character in the sorted result is a space (code point 32), the printable character with the lowest code point (yes, you can argue about whether it is printable). After it come the digits (code points 48-57), then the uppercase letters (65-90), then the lowercase letters (97-122), with punctuation and symbols scattered in the gaps between. Control codes like carriage return and linefeed occupy the range 0-31, plus 127.
You can confirm this by sorting string.ascii_letters:
>>> import string
>>> ''.join(sorted(string.ascii_letters))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
and then checking the sequence against one of the many ascii tables available.
If you are wondering about the placement of the various blocks of characters, uppercase and lowercase were arranged in such a way that there was a one-bit difference between an uppercase character (example A: 0x41) and the corresponding lowercase character (a: 0x61), which simplified case-insensitive matching; and there were similar arguments for placement of the other blocks. For some of the control codes there were strong legacy imperatives, from punched paper tape patterns.
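The one-bit relationship between the cases can be checked directly. A small sketch, valid only for the ASCII letters A-Z and a-z (the names CASE_BIT and same_letter_ignoring_case are made up here):

```python
CASE_BIT = 0x20  # the single bit that differs between 'A' (0x41) and 'a' (0x61)

print(hex(ord("A") ^ ord("a")))   # 0x20
print(chr(ord("A") | CASE_BIT))   # a  (set the bit: lowercase)
print(chr(ord("a") & ~CASE_BIT))  # A  (clear the bit: uppercase)

def same_letter_ignoring_case(x, y):
    # Case-insensitive comparison for ASCII letters: force the case bit on both sides.
    return (ord(x) | CASE_BIT) == (ord(y) | CASE_BIT)

print(same_letter_ignoring_case("Q", "q"))  # True
```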
Why do we even need two characters that look like a space?
And why is space chr(32) and not chr(0)?
Also, is chr(160) a half space?
chr(0) isn't actually a space, it's the NULL character. chr(n) returns the character whose code point is n; for n below 128 that is the ASCII character.
When you print(chr(0)), the NULL character is written to the output, but most terminals display nothing for it.
Observe:
>>> print('hi'+chr(0)+'hello')
hihello
>>> print('hi'+chr(32)+'hello')
hi hello
Note that NULL is not None, nor is it even an empty string:
>>> chr(0) is None
False
>>> chr(0) == ''
False
It is a real character of length 1; it just has no visible representation.
chr(0) is the NULL character, which is very significant, and chr(32) is ' '. One use of the NULL character is to terminate strings, as in C. There, what you write as x = "abcd" is actually stored as "abcd\0", where \0 is the same as chr(0). Without the NULL character you could not determine where such a string ends: the string is read byte by byte, and right after "abcd" something else may be stored in memory, y = "efgh" for example. If there were no NULL character at the end of x, printing x would print 'abcdefgh' and maybe even more garbage that is not part of x, because the computer would not know when to stop.
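In Python itself, a str object records its own length, so it does not need a terminator, and chr(0) is just another character that can sit in the middle of a string. A short sketch to check this:

```python
s = "abcd" + chr(0) + "efgh"

print(len(s))           # 9: the NULL character counts like any other
print(repr(s))          # 'abcd\x00efgh'
print(s.split(chr(0)))  # ['abcd', 'efgh']
```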
If I'm not mistaken, chr(int) converts the int (decimal value) to the corresponding character in the ASCII table:
chr(0) is NULL
chr(32) is space
Actually, chr(n) returns the character for the Unicode code point n, not just ASCII. The first 128 Unicode code points happen to be the same as the ASCII ones.
Try it yourself: chr(15265) returns '㮡' in Python 3.6
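To see how the three characters from the question differ, the unicodedata module can report each one's name and general category. A sketch; the fallback string passed to unicodedata.name is arbitrary, and is needed because control characters such as chr(0) have no name in the database.

```python
import unicodedata

for n in (0, 32, 160):
    ch = chr(n)
    name = unicodedata.name(ch, "<no name: control character>")
    print(n, repr(ch), unicodedata.category(ch), name)
```

This reports chr(32) as SPACE and chr(160) as NO-BREAK SPACE (both in category Zs), so chr(160) is a space that forbids a line break rather than a half-width one, while chr(0) is a control character (category Cc).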
Consider, I have a string which has some binary data of the following form:
n\xe1v\u011bsy a p\u0159\xedv\u011bsy Tlumi\u010de pro autobusy
Now I want to identify whether a string has binary data or not. I am trying the following code:
def isBinary(line):
    print(line)
    return "xe" in line
But this does not work. How can I effectively identify whether a string contains binary data or not?
You can't look for the substring 'xe' because '\xe[0-9]' is just the written representation of a single special character; the string does not contain a literal 'x' followed by 'e'.
Instead, you could check whether the ordinal value of each character is within the desired ranges, e.g. if I only wanted alphabetical characters:
upper = range(65, 91)   # 'A'-'Z'
lower = range(97, 123)  # 'a'-'z'
for c in input_str:
    ascii_val = ord(c)
    if ascii_val not in upper and ascii_val not in lower:
        print("NON-ALPHABETICAL CHARACTER FOUND!")
        break
You could use "\xe1" in line, which will look for the byte value 0xE1 in the line. But you really have to define "binary data"; what constitutes text data and what is binary? Let's suppose your definition is "ASCII" - that is to say, anything 0x80 or above marks it as binary. In that case:
def is_binary(line):  # PEP 8 naming - snake_case, not mixedCase
    return any(ord(x) >= 0x80 for x in line)
You might also want to check if there's a "\x00" in the line, as that often signifies binary data.
def is_binary(line):
    return "\x00" in line or any(ord(x) >= 0x80 for x in line)
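In Python 3 the question often becomes "is this str pure ASCII?" or "do these bytes look like text?". A sketch under the same definition as above (ASCII only, no NULL characters); both function names are hypothetical.

```python
def is_ascii_text(line):
    """True if a str contains only ASCII characters and no NULL character."""
    try:
        line.encode("ascii")
    except UnicodeEncodeError:
        return False
    return "\x00" not in line

def looks_binary(data):
    """True if a bytes object has a NULL byte or any byte 0x80 or above."""
    # Iterating over bytes yields integers in Python 3.
    return b"\x00" in data or any(b >= 0x80 for b in data)

print(is_ascii_text("Tlumice pro autobusy"))                   # True
print(is_ascii_text("n\xe1v\u011bsy a p\u0159\xedv\u011bsy"))  # False
print(looks_binary("návěsy".encode("utf-8")))                  # True
print(looks_binary(b"plain text"))                             # False
```

By this definition the Czech sample from the question counts as non-ASCII even though it is ordinary text, which is why the definition of "binary" matters.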
character = (%.,-();'0123456789-—:`’)
character.replace(" ")
character.delete()
I want to delete or replace all the special characters and numbers in my program. I know it can be done with one string, but I'm not sure how to quote all the special characters. Somehow I'm supposed to put all the special characters from the parentheses into one variable, but I'm not sure how to break them up while keeping them all stored in that variable.
The translate method is my preferred way of doing this. Create a mapping between the chars you want mapped and then apply that table to your input string.
# Python 3: str.maketrans replaces the Python 2-only string.maketrans
special = r"%.,-();'0123456789-—:`’"
blanks = " " * len(special)
table = str.maketrans(special, blanks)
input_string.translate(table)
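If the characters should be deleted rather than replaced with blanks, Python 3's str.maketrans accepts a third argument listing the characters to remove. A sketch, with sample_text as a made-up input:

```python
special = r"%.,-();'0123456789-—:`’"

# Empty first two arguments: no replacements; third argument: characters to delete.
delete_table = str.maketrans("", "", special)

sample_text = "It's 100% done (really)."
print(sample_text.translate(delete_table))
```

The spaces around the deleted "100%" both survive, so the result contains a double space; collapse whitespace afterwards if that matters.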
Seems like a good application for filter (in Python 3 it returns an iterator, so join the result):
>>> s = 'This is a test! It has #1234 and letters?'
>>> ''.join(filter(lambda i: i.isalpha(), s))
'ThisisatestIthasandletters'
You can write a function with an optional fill value: if it is not set, the function just removes the non-alphabetic characters; otherwise it replaces each of them with the value you pass:
def delete_replace(s, fill_char=""):
    return "".join([x if x.isalpha() else fill_char for x in s])
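For illustration, here is how the function behaves with and without a fill character, next to a regular-expression version. The regex variant (delete_replace_re, a made-up name) is an alternative, not part of the answer above, and it matches only ASCII letters, whereas str.isalpha also accepts letters from other alphabets.

```python
import re

def delete_replace(s, fill_char=""):
    return "".join([x if x.isalpha() else fill_char for x in s])

def delete_replace_re(s, fill_char=""):
    # [^A-Za-z] matches any single character that is not an ASCII letter.
    return re.sub(r"[^A-Za-z]", fill_char, s)

s = "This is a test! It has #1234 and letters?"
print(delete_replace(s))       # ThisisatestIthasandletters
print(delete_replace(s, "_"))  # every non-letter becomes an underscore
print(delete_replace_re(s) == delete_replace(s))  # True for this ASCII-only input
```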