Index strings by letter including diacritics

I'm not sure how to formulate this question, but I'm looking for a magic function that makes this code
for x in magicfunc("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
    print(x)
Behave like this:
H̶
e̕
l̛
l͠
o͟
̨
w̡
o̷
r̀
l҉
ḑ
!͜
Basically, is there a built in unicode function or method that takes a string and outputs an array per glyph with all their respective unicode decorators and diacritical marks and such? The same way that a text editor moves the cursor over to the next letter instead of iterating all of the combining characters.
If not, I'll write the function myself, no help needed. Just wondering if it already exists.

You can use unicodedata.combining to find out if a character is combining:
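import unicodedata
from typing import Iterable
def combine(s: str) -> Iterable[str]:
    buf = ""
    for x in s:
        if unicodedata.combining(x) != 0:
            # combining character: attach it to the current base character
            buf += x
        else:
            # new base character: emit the previous cluster, if any
            if buf:
                yield buf
            buf = x
    if buf:
        yield buf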
Result:
>>> for x in combine("H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"):
...     print(x)
...
H̶
e̕
l̛
l͠
o͟
̨
w̡
o̷
r̀
l
ḑ
!͜
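The catch is that COMBINING CYRILLIC MILLIONS SIGN (U+0489) is not recognized as combining: unicodedata.combining returns the canonical combining class, which is 0 for enclosing marks such as this one. You could instead test whether "COMBINING" is in unicodedata.name(x, ""), or whether unicodedata.category(x) starts with "M"; either should solve it.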
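A minimal sketch of the category-based check, assuming it is acceptable to attach every character whose general category starts with "M" (Mn, Mc, Me) to the preceding character. The name split_clusters is made up for this example, and this is only an approximation of full grapheme clustering (it does not handle emoji sequences, for instance):

```python
import unicodedata
from typing import Iterator


def split_clusters(s: str) -> Iterator[str]:
    """Yield each base character together with the marks that follow it."""
    buf = ""
    for ch in s:
        # Categories Mn, Mc and Me cover nonspacing, spacing and enclosing marks.
        if buf and unicodedata.category(ch).startswith("M"):
            buf += ch
        else:
            if buf:
                yield buf
            buf = ch
    if buf:
        yield buf


# "e" + COMBINING ACUTE ACCENT, then "l" + COMBINING CYRILLIC MILLIONS SIGN
clusters = list(split_clusters("e\u0301l\u0489!"))
print(len(clusters))
```

A mark at the very start of the string has nothing to attach to, so this sketch emits it as the start of its own cluster.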

The 3rd party regex module can search by glyph:
>>> import regex
>>> s="H̶e̕l̛l͠o͟ ̨w̡o̷r̀l҉ḑ!͜"
>>> for x in regex.findall(r'\X',s):
...     print(x)
...
H̶
e̕
l̛
l͠
o͟
̨
w̡
o̷
r̀
l҉
ḑ
!͜

Related

How to read values one whitespace separated value at a time?

In C++ you can read one value at a time like this:
//from console
cin >> x;
//from file:
ifstream fin("file name");
fin >> x;
I would like to emulate this behaviour in Python. It seems, however, that the ordinary ways to get input in Python read either whole lines, the whole file, or a set number of characters.
I would like a function, let's call it one_read(), that reads from a file until it encounters either a white-space or a newline character, then stops. Also, on subsequent calls to one_read() the input should begin where it left off.
Examples of how it should work:
# file input.in is:
# 5 4
# 1 2 3 4 5
n = int(one_read())
k = int(one_read())
a = []
for i in range(n):
    a.append(int(one_read()))
# n = 5 , k = 4 , a = [1,2,3,4,5]
How can I do this?

I think the following should get you close. I admit I haven't tested the code carefully. It sounds like itertools.takewhile should be your friend, and a generator like yield_characters below will be useful.
from itertools import takewhile
import re
# this function yields characters from a file one at a time.
def yield_characters(file):
    with open(file, 'r') as f:
        for line in f:
            for char in line:
                yield char
# True for any single non-whitespace character.
def not_whitespace(char):
    return bool(re.match(r"\S", char))
# this skips leading whitespace, then uses takewhile to collect the rest of each word
def read_one(file):
    chars = yield_characters(file)
    for first in chars:
        if not_whitespace(first):
            yield first + ''.join(takewhile(not_whitespace, chars))
The read_one above is a generator, so you will need to call next() on it to get one value at a time, or call list on it to get them all.

Normally you would just read a line at a time, then split this and work with each part. However if you can't do this for resource reasons, you can implement your own reader which will read one character at a time, and then yield a word each time it reaches a delimiter (or in this example also a newline or the end of the file).
This implementation uses a context manager to handle the file opening/reading, though this might be overkill:
from functools import partial
class Words():
    def __init__(self, fname, delim):
        self.delims = ['\n', delim]
        self.fname = fname
        self.fh = None
    def __enter__(self):
        self.fh = open(self.fname)
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.fh.close()
    def one_read(self):
        chars = []
        # read one character at a time; '' is the sentinel returned at end of file
        for char in iter(partial(self.fh.read, 1), ''):
            if char in self.delims:
                # delimiter signifies end of word
                word = ''.join(chars)
                chars = []
                yield word
            else:
                chars.append(char)
        if chars:
            # end of file without a trailing delimiter: emit the last word
            yield ''.join(chars)
# Assuming x.txt contains 12 34 567 8910
with Words('/tmp/x.txt', ' ') as w:
    print(next(w.one_read()))
    # 12
    print(next(w.one_read()))
    # 34
    print(list(w.one_read()))
    # ['567', '8910']

More or less anything that operates on files in Python can operate on the standard input and standard output. The sys standard library module defines stdin and stdout which give you access to those streams as file-like objects.
Reading a line at a time is considered idiomatic in Python because the other way is quite error-prone (just one C++ example question on Stack Overflow). But if you insist: you will have to build it yourself.
As you've found, .read(n) will read at most n text characters (technically, Unicode code points) from a stream opened in text mode. You can't tell where the end of the word is until you read the whitespace, but you can .seek back one spot - though not on the standard input, which isn't seekable.
You should also be aware that the built-in input will ignore any existing data on the standard input before prompting the user:
>>> import sys
>>> sys.stdin.read(1) # blocks
foo
'f'
>>> # the `foo` is our input, the `'f'` is the result
>>> sys.stdin.read(1) # data is available; doesn't block
'o'
>>> input()
bar
'bar'
>>> # the second `o` from the first input was lost
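If reading a line at a time internally is acceptable, a generator can still hand out one whitespace-separated value per call, which is close to what cin >> x does. This is only a sketch: the name tokens is made up for the example, and io.StringIO stands in for an open file or sys.stdin so the snippet runs on its own.

```python
import io
from typing import Iterator, TextIO


def tokens(stream: TextIO) -> Iterator[str]:
    """Yield whitespace-separated values, reading the stream lazily line by line."""
    for line in stream:
        # str.split() with no argument splits on any run of whitespace
        yield from line.split()


# Stand-in for open("input.in") or sys.stdin
stream = io.StringIO("5 4\n1 2 3 4 5\n")
reader = tokens(stream)

n = int(next(reader))
k = int(next(reader))
a = [int(next(reader)) for _ in range(n)]
print(n, k, a)
```

Each next(reader) resumes where the previous call stopped, and next raises StopIteration once the input is exhausted.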

Try creating a class to remember where the operation left off.
The __init__ function takes the filename; you could modify this to take a list or other iterable.
read_one checks if there is anything left to read, and if there is, removes and returns the item at index 0 in the list, that being everything up to the first whitespace. Once the list is empty it returns None.
class Reader:
    def __init__(self, filename):
        # read everything up front and split on any whitespace
        with open(filename) as f:
            self.file_contents = f.read().split()
    def read_one(self):
        if self.file_contents != []:
            return self.file_contents.pop(0)
Initialise the class as follows and adapt to your liking:
reader = Reader(filepath)
reader.read_one()

How can I use .replace() on a .txt file with accented characters?

So I have a code that takes a .txt file and adds it to a variable as a string.
Then, I try to use .replace() on it to change the character "ó" to "o", but it is not working! The console prints the same thing.
Code:
def normalize(filename):
    # Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error.
    # File says: "Es una rubrica de evaluación." (among many emojis)
    txt_raw = open(filename, "r", errors="ignore")
    txt_read = txt_raw.read()
    # Here, only the "o" is replaced. In the real code, I use a for loop to iterate through all chrs.
    rem_accent_txt = txt_read.replace("ó", "o")
    print(rem_accent_txt)
    return
Expected output:
"Es una rubrica de evaluacion."
Current Output:
"Es una rubrica de evaluación."
It does not print an error or anything, it just prints it as it is.
I believe the problem lies in the fact that the string comes from a file: when I create a string literal and run the same code on it, it works, but it does not work when I get the string from a file.
EDIT: SOLUTION!
Thanks to @juanpa.arrivillaga and @das-g I came up with this solution:
from unidecode import unidecode
def get_txt(filename):
    txt_raw = open(filename, "r", encoding="utf8")
    txt_read = txt_raw.read()
    txt_decode = unidecode(txt_read)
    print(txt_decode)
    return txt_decode

Almost certainly, what is occurring is that you have unnormalized Unicode strings. Essentially, there are two ways to create "ó" in Unicode:
>>> combining = 'ó'
>>> composed = 'ó'
>>> len(combining), len(composed)
(2, 1)
>>> list(combining)
['o', '́']
>>> list(composed)
['ó']
>>> import unicodedata
>>> list(map(unicodedata.name, combining))
['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
>>> list(map(unicodedata.name, composed))
['LATIN SMALL LETTER O WITH ACUTE']
Just normalize your strings:
>>> composed == combining
False
>>> composed == unicodedata.normalize("NFC", combining)
True
Although, taking a step back, do you really want to remove accents? Or do you just want to normalize to composed, like the above?
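If the goal really is to drop the accents, one standard-library-only option is to decompose the text and discard the nonspacing marks. This is a sketch, not the asker's code: strip_accents is a made-up helper name, and the strings use explicit escapes so that copy and paste cannot silently change the normalization form.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


decomposed_input = "evaluacio\u0301n"  # "o" followed by COMBINING ACUTE ACCENT
composed_input = "evaluaci\u00f3n"     # LATIN SMALL LETTER O WITH ACUTE

# A composed "ó" pattern only matches after normalizing the text to NFC.
naive = decomposed_input.replace("\u00f3", "o")
fixed = unicodedata.normalize("NFC", decomposed_input).replace("\u00f3", "o")

print(naive == decomposed_input)
print(fixed)
print(strip_accents(composed_input))
```

Unlike unidecode, this only removes marks that decompose away; characters without a decomposition (for example "ø") are left untouched.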
As an aside, you shouldn't ignore the errors when reading your text file. You should use the correct encoding. I suspect what is happening is that you are writing your text file using an incorrect encoding, because you should be able to handle emojis just fine, they aren't anything special in unicode.
>>> emoji = "😀"
>>> print(emoji)
😀
>>>
>>> unicodedata.name(emoji)
'GRINNING FACE'
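To illustrate the point about encodings, here is a small round trip through a file opened with an explicit encoding instead of errors="ignore". It assumes the file is UTF-8 (the real export may use something else), and it writes a temporary file only so the example is self-contained.

```python
import os
import tempfile

text = "Es una rubrica de evaluaci\u00f3n. \U0001F600"

# Write a sample file as UTF-8 (assumed encoding for this sketch).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    # Reading with the matching encoding needs no errors="ignore".
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(round_tripped == text)
```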

stripping the correct float value out of my string

I am using Python to process pcap files and write the processed values to a text file. The text file has around 8000 rows, and sometimes it contains a string such as 7.70.582. In further processing of the text file I split the file into lines and extract each of the float values in every line. Then I get this error:
ValueError: invalid literal for float(): 7.70.582
In such cases I am interested only in 7.70, and I need to drop the second decimal point and everything after it. Is there any trick to extract only the part of the string up to the second decimal point?
I was searching for an answer for this and it seems there has been no such situation asked before.
Or is there a method where I can skip those lines where this kind of errors are happening?

I'm not a huge fan of this approach, but the simplest might be something like:
strs = [
    "7",
    "7.70",
    "7.70.582",
    "7.70.582.123"
]
def parse(s):
    s += ".."
    return float(s[:s.index(".", s.index(".")+1)])
for s in strs:
    print(s, parse(s))
A more legible approach might be to use something like:
def parse(s):
    if s.count('.') <= 1:
        return float(s)
    return float(s[:s.index(".", s.index(".")+1)])
Or, based off Ajax1234's answer:
def parse(s):
    return float('.'.join(s.split('.')[:2]))
All versions output:
7 7.0
7.70 7.7
7.70.582 7.7
7.70.582.123 7.7

You can use a regular expression, like this one:
https://pythex.org/?regex=%5E(%5B0-9%5D%2B%5C.%5B0-9%5D%2B).*&test_string=7.70.582&ignorecase=0&multiline=0&dotall=0&verbose=0
If your line is like '7.70.582' this regex will extract the 7.70 into the first group:
^([0-9]+\.[0-9]+).*
https://docs.python.org/2/library/re.html
>>> import re
>>> line = "7654 16.317 8.651 7.70.582 17.487"
>>> val = line.split(" ")[3]
>>> m = re.search(r'^([0-9]+\.[0-9]+).*', val)
>>> m.group(1)
'7.70'
>>> float(m.group(1))
7.7

You can use str.split() and '.'.join:
s = "7654 16.317 8.651 7.70.582 17.487"
final_data = list(map(float, ['.'.join(i.split('.')[:2]) for i in s.split()]))
Output:
[7654.0, 16.317, 8.651, 7.7, 17.487]
Regarding the single string:
s = ["7.70.582"]
final_data = list(map(float, ['.'.join(i.split('.')[:2]) for i in s]))
Output:
[7.7]
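The question also asks whether such lines can simply be skipped. A sketch combining both ideas: trim each token to its first two dot-separated parts, and report a line as unusable if any token still fails to parse. The helper names parse_float and parse_line are invented for this example.

```python
from typing import List, Optional


def parse_float(token: str) -> Optional[float]:
    """Keep only the part before the second '.', or return None if unparsable."""
    try:
        return float(".".join(token.split(".")[:2]))
    except ValueError:
        return None


def parse_line(line: str) -> Optional[List[float]]:
    """Return the floats on a line, or None so the caller can skip the line."""
    values = [parse_float(tok) for tok in line.split()]
    if any(v is None for v in values):
        return None
    return values


lines = ["7654 16.317 8.651 7.70.582 17.487", "1.5 not-a-number 2.5"]
parsed = [parse_line(line) for line in lines]
kept = [values for values in parsed if values is not None]
print(kept)
```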

Ignore newline character in binary file with Python?

I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
The thing is, I want to read this data in 4-byte chunks (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n, which is then used as the first byte of the next int I read. I don't want to simply use .splitlines() because some data could have a \n inside and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)

(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just misinterpreted values. For example, writing the integer 10 to the file produces the byte 0x0A, which is displayed as \n.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.
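A small demonstration of that point, assuming the file holds little-endian 32-bit integers (the question unpacks with "i", which uses the machine's native byte order). io.BytesIO stands in for the real file, and read_ints is a made-up helper name.

```python
import io
import struct

# The integer 10 packs to a byte that merely looks like a newline.
packed_ten = struct.pack("<i", 10)
print(packed_ten)

values = [4, 18, 10, 180]
raw = b"".join(struct.pack("<i", v) for v in values)


def read_ints(stream):
    """Read fixed-size 4-byte records; never use readline() on binary data."""
    while True:
        chunk = stream.read(4)
        if len(chunk) < 4:
            break  # end of file (or a truncated trailing record)
        yield struct.unpack("<i", chunk)[0]


decoded = list(read_ints(io.BytesIO(raw)))
print(decoded)
```

Reading in fixed-size chunks from the start keeps every record aligned, so the 0x0A byte is decoded as the value 10 rather than treated as a line ending.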

You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'')  # \r\n is the byte pair b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0:  # You can probably make this more sophisticated
...     d += f.read(diff).replace(b'\x0d\x0a', b'')
...     diff = 4 - len(d)
...
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")

python Incorrect formatting Cyrillic

def inp(text):
    tmp = str()
    arr = ['.' for x in range(1, 40 - len(text))]
    tmp += text + ''.join(arr)
    print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
Output:
tester.................................
om.....................................
sup....................................
jope...................................
тестер...........................
ом...................................
суп.................................
жопа...............................
Why is Python not handling Cyrillic properly? The ends of the lines are ragged instead of aligned. Using string formatting gives the same result. How can this be corrected? Thanks.

Read this:
http://docs.python.org/2/howto/unicode.html
Basically, what you have in the text parameter of the inp function is a string. In Python 2.7, strings are bytes by default. Cyrillic characters are not mapped 1-1 to bytes when encoded in e.g. UTF-8, but require more than one byte (usually 2 in UTF-8), so when you do len(text) you don't get the number of characters, but the number of bytes.
In order to get the number of characters, you need to know your encoding. Assuming it's utf-8, you can decode text to that encoding and it will print right:
#!/usr/bin/python
# coding=utf-8
def inp(text):
    tmp = str()
    utext = text.decode('utf-8')
    l = len(utext)
    arr = ['.' for x in range(1, 40 - l)]
    tmp += text + ''.join(arr)
    print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
The important lines are these two:
utext = text.decode('utf-8')
l = len(utext)
where you first decode the text, which results in a unicode string. After that, you can use the built-in len to get the length in characters, which is what you want.
Hope this helps.
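For comparison, in Python 3 a str is already a sequence of Unicode code points, so len counts characters and no decoding step is needed. A minimal sketch, assuming Python 3 and the same total width of 39 that the original loop produces; pad_with_dots is a made-up name.

```python
def pad_with_dots(text: str, width: int = 39) -> str:
    """Pad text on the right with dots up to a fixed number of characters."""
    return text.ljust(width, ".")


latin = "tester"
cyrillic = "\u0442\u0435\u0441\u0442\u0435\u0440"  # the Cyrillic word from the question

# Same number of characters, different number of UTF-8 bytes.
print(len(latin), len(latin.encode("utf-8")))
print(len(cyrillic), len(cyrillic.encode("utf-8")))

for word in (latin, cyrillic):
    print(pad_with_dots(word))
```

The byte counts show why the Python 2 version was misaligned: the padding was computed from the encoded length rather than the character count.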
