How can I update QTextEdit as user writes on it? (on Python) - python

I'm working on a Python+Qt WebSMS app/script. It asks for a number and a message, and sends it to Vodafone via mechanize. Since Vodafone in my country doesn't support UTF-8 (at least for WebSMS) and every SMS must be shorter than 160 characters, I'm using this setup:
def setMesaj():
    global mesaj
    mesaj = unicode(self.textEdit.toPlainText().toUtf8(), "utf-8")
    mesaj = mesaj.encode("ascii", "ignore")
    if (len(mesaj)) > 159:
        # Turkish: "Is the message longer than 160 characters?"
        print "[WARN-1] Mesaj 160 karakterden fazla?"
        i = len(mesaj) - 159
        mesaj = mesaj [:-i]
    print mesaj
QtCore.QObject.connect(self.textEdit, QtCore.SIGNAL("textChanged()"), setMesaj)
Well, it works. If the message goes over 160 chars, the last letter is automatically removed, and if the user tries to type any "weird" character, it's not accepted.
Here's my question: the variable 'mesaj' works perfectly, but it doesn't update the QTextEdit itself, so when a character is rejected (because it goes past 160 chars or is not ASCII), it still looks accepted to the user. So, how can I update the QTextEdit as the user writes in it, so that the changes stay synchronized?
Thanks,

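def setMesaj(self):
    mesaj = unicode(self.toPlainText().toUtf8(), "utf-8")
    # Drop every character that cannot be encoded as ASCII
    ascii = mesaj.encode("ascii", "ignore")
    if ascii != mesaj:
        self.setPlainText(ascii)
        # Continue with the cleaned text so the length check below does not restore the dropped characters
        mesaj = ascii
    if (len(mesaj)) > 159:
        QtGui.QMessageBox.warning(self, 'warning', "[WARN-1] Mesaj 160 karakterden fazla?")
        i = len(mesaj) - 159
        mesaj = mesaj [:-i]
        self.setPlainText(mesaj)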
This would be my quick and dirty approach, however you still have to put the text cursor in the correct position after making the edits.

One way to detect the right position for the text cursor would be to use codecs.register_error to define a custom error function, one that duplicates "ignore", but also remembers how many characters in front of the cursor were deleted, and to shift the cursor that many positions to the left after encoding.
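
A minimal sketch of that idea, written for Python 3 rather than the Python 2 of the question. The handler name `recording_ignore` and the helper `strip_non_ascii` are made up for illustration, and moving the QTextEdit cursor to the returned position is left out.

```python
import codecs

# Positions (in the original text) of the characters dropped by the last encode call
_removed_positions = []


def _recording_ignore(err):
    # Behaves like "ignore", but remembers which characters were skipped
    _removed_positions.extend(range(err.start, err.end))
    return ("", err.end)


codecs.register_error("recording_ignore", _recording_ignore)


def strip_non_ascii(text, cursor_pos):
    """Return (ascii_text, new_cursor_pos) for a cursor placed before text[cursor_pos]."""
    _removed_positions.clear()
    ascii_text = text.encode("ascii", "recording_ignore").decode("ascii")
    # Shift the cursor left by the number of characters removed in front of it
    shift = sum(1 for pos in _removed_positions if pos < cursor_pos)
    return ascii_text, cursor_pos - shift


cleaned, cursor = strip_non_ascii("héllo wörld", 8)
```

In this example the cursor sits just before the "r"; two characters in front of it are dropped, so it moves two positions to the left and still sits before the "r" in the cleaned text.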

Related

Get rid of unicode decimal character

I have a huge file which looks like this :
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
As you can see, the file contains some decimal Unicode character references. I would like to replace all of them with the characters they stand for before using the file. Even when I open it with the utf-8 encoding, the errors are not suppressed.
Do you know a way to do it? I want to create a dictionary and retrieve the numbers at index 2.
For 6883;jumarre;83;295;0 I get 83.
For 6887;kh&#432;&#7899;u;62;325;0 I get &#7899, which is wrong; I should get 62.
import codecs

with codecs.open('JeuxdeMotsPolarise_test.txt', 'r', 'utf-8', errors = 'ignore') as text_file:
    text_file =(text_file.read())
    #print(text_file)
    dico_lexique = ({i.split(";")[1]:i.split(";")[2:]for i in text_file.split("\n") if i})
This is the result I get when trying @serge's proposal, but it leaves blank spaces between lines.
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hi âu;81;294;0
6819;hi cu;64;338;0
6820;hi yn;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hu cao c;152;298;0
6854;huyn ;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kn kn;73;303;0
6886;khoang;64;323;0
6887;khu;62;325;0
Edit: I re-downloaded the original file, and the missing ";" error has been corrected.
For example:
=> 6850;hổ mang;54;298;0 (this is how it appears in the updated file)
Thank you, everybody.
@PanagiotisKanavos has correctly guessed that html.unescape can replace the XML character references with their Unicode characters. The hard part is that some references are correctly ended with their terminating semicolon (;) while others are not. In the latter case, if a reference is immediately followed by a semicolon field separator, the separator is eaten by the conversion, shifting the following fields.
So the only reliable way is to:
process the file line by line as a CSV file with ; as the delimiter
if necessary, concatenate the middle fields, from the second one to the fourth one counting from the end
unescape that middle field
If you want to convert the file, you could do:
import csv
import html

with open('file.csv') as fd, open('fixed.csv', 'w', newline='') as fdout:
    rd = csv.reader(fd, delimiter=';')
    wr = csv.writer(fdout, delimiter=';')
    for row in rd:
        if len(row)> 5:
            # Terminating semicolons split the word: glue its pieces back together
            row[1] = ';'.join(row[1:len(row)-3])
            del row[2:len(row)-3]
        row[1] = html.unescape(row[1])
        wr.writerow(row)
If you only want to build a mapping from field 0 to field 2:
import csv

values = {}
with open('file.csv') as fd:
    rd = csv.reader(fd, delimiter=';')
    for row in rd:
        # row[-3] is the first number, even when the word itself was split on ';'
        values[row[0]] = row[-3]
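
The dictionary the question asks for (word to numbers) could be built the same way. This is only a sketch: `parse_lexicon` is a hypothetical helper name, and it reads from an in-memory string instead of the real file so that it can be run as-is.

```python
import csv
import html
import io


def parse_lexicon(text):
    """Map each unescaped word to the list of its three numeric fields."""
    lexicon = {}
    for row in csv.reader(io.StringIO(text), delimiter=';'):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        # The last three fields are the numbers; everything between the id
        # and those numbers belongs to the word, even if ';' split it.
        word = html.unescape(';'.join(row[1:-3]))
        lexicon[word] = row[-3:]
    return lexicon


sample = "6883;jumarre;83;295;0\n6887;kh&#432;&#7899;u;62;325;0\n"
lexicon = parse_lexicon(sample)
```

With this sample, `lexicon["jumarre"][0]` is `'83'` and the second word is stored under its unescaped form, khướu.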
This text isn't UTF8 or Unicode in general. It's HTML-encoded text, most likely Vietnamese. Those escape sequences correspond to Vietnamese characters, for example &#432 is ư - in fact, I just typed the escape sequence in the SO edit box and the correct character appeared. &#7899 is ớ.
Copying the entire text outside a code block produces
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
Googling for Họ Khướu returns this Wikipedia page about Họ Khướu.
I think it's safe to assume this is HTML-encoded Vietnamese text. To convert it to Unicode you can use html.unescape :
import html
line='6887;kh&#432;&#7899;u;62;325;0'
properLine=html.unescape(line)
UPDATE
The text posted above is just the original text with an extra newline per page. It's SO's markdown renderer that converts the escape sequences to the corresponding glyphs.
The funny thing is that this line :
6853;h&#432&#417u cao cổ152;298;0
Can't be rendered because the HTML entities aren't properly terminated. html.unescape on the other hand will convert the characters. Clearly, html.unescape is far more forgiving than SO's markdown renderer.
Either of these lines:
html.unescape('6853;h&#432;&#417;u cao c&#7893;152;298;0')
html.unescape('6853;h&#432&#417u cao c&#7893;152;298;0')
returns:
6853;h\u01b0\u01a1u cao c\u1ed5152;298;0
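
A small runnable check of that behaviour, using only the standard library. The variable names are illustrative; the last call shows how a field separator that directly follows an unterminated reference is consumed as if it were the terminator.

```python
import html

# Properly terminated and unterminated numeric references decode the same way
terminated = html.unescape('h&#432;&#417;u')
unterminated = html.unescape('h&#432&#417u')

# Here the ';' was meant as a field separator, but it is taken as the
# terminator of the reference, so the word and the next field get fused.
eaten = html.unescape('c&#7893;152;298;0')
```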
Repair the file first before you load it into a CSV parser.
Assuming Maarten in the comments is right, change the encoding:
iconv -f cp1252 -t utf-8 < JeuxdeMotsPolarise_test.txt > JeuxdeMotsPolarise_test.utf8.txt
Then replace the escapes with proper characters.
perl -C -i -lpe'
    s/&#([0-9]+);?/chr $1/eg;   # replace entities
    s/;?(\d+;\d+;\d+)$/;$1/;    # put back semicolon
                                # if it was consumed accidentally
' JeuxdeMotsPolarise_test.utf8.txt
Contents of JeuxdeMotsPolarise_test.utf8.txt after running the substitutions:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ;152;298;0
6854;huyền đề;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0

WM_GETTEXT returns text separated with nulls

import time
import win32gui
import win32con

while True:
    time.sleep(1)
    buf = win32gui.PyMakeBuffer(255)
    window = win32gui.GetForegroundWindow()
    title = win32gui.GetWindowText(window)
    control = win32gui.FindWindowEx(window, 0, 'Edit', None)
    length = win32gui.SendMessage(control, win32con.WM_GETTEXT, 255, buf)
    result = buf[:length]
    print('Title: ', win32gui.GetWindowText(window))
    print(str(buf[:length*2], "UTF_8"))
Why does it return a string separated with nulls? When I tried just buf[:length], I got only half of my string because of those nulls.
bytearray(b'H\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00!\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x9dL\x03E\x888P\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xe0\xedL\x03\xa9\xc4\xffb\xa0\tO\x00j\x8c\x1bZ\xa04\xc6\x02IP\x12\x8d\x00\x00\x00\x00\x00\x00\x00\x00\xa0X?\x03\xed`\x05\x89\xa0n\xfb\x02.\x02\xea\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xc0*X\x00\xf4b\x9c\xf9\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xd6\x8d\x02\x98?n\xb2\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00D\xcc\x02\xbey\xee\x08\x00\x00\x00\x00\x00\x00\x00')
Edit:
result = buf.tobytes()[:length*2:2]
print(result.decode("UTF-8"))
This code works as I wanted, but I'm not sure it is written correctly.
What you are getting back from the Win32 API is a UTF-16 string. Each character is 16 bits wide, which is why a null byte appears between the ASCII characters when the buffer is viewed as a byte array.
This is the correct way to interpret that string:
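length = win32gui.SendMessage(control, win32con.WM_GETTEXT, 255, buf)
# length counts characters; each one takes two bytes in the buffer
result = buf[0:length*2]
# bytes() copies the slice, so this works whether buf is a bytearray or a memoryview
text = bytes(result).decode("utf-16")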
Your solution manages to work with a utf-8 decode because you are skipping over all the null bytes. That works fine for plain ASCII, but it will generate weird results (and possibly throw an exception) as soon as non-ASCII characters are typed into that edit control.
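
The difference can be seen without any Win32 call. The sketch below only simulates the buffer: the sample text and the trailing junk bytes are made up, and `length` stands in for the character count that WM_GETTEXT returns.

```python
# Simulated buffer: UTF-16-LE text followed by leftover bytes
text = "Héllo wörld €"
buf = bytearray(text.encode("utf-16-le")) + bytearray(b"\x00\x00\x80\x9d")
length = len(text)  # number of characters copied, as WM_GETTEXT would report

# Correct: take two bytes per character and decode them as UTF-16
decoded = bytes(buf[:length * 2]).decode("utf-16-le")

# Skipping every second byte keeps only the low byte of each character,
# which matches the text only while every character is plain ASCII
skipped = bytes(buf[:length * 2:2])
try:
    skipped.decode("utf-8")
    skip_failed = False
except UnicodeDecodeError:
    skip_failed = True
```

Here `decoded` equals the original text, while the byte-skipping variant fails to decode as soon as the accented letters are involved.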

ID3v1 Null Byte parsing in python

I am writing a tool to parse ID3 tags from files and edit them in a GUI. Up until now everything has been great. However, I am trying to remove the null byte terminators when displaying the info and then add them back when the user saves, to preserve the ID3v1 format. When I check for the null terminator, I get nothing.
This is the portion of the code related to the handling of the tag:
if(bytes.decode(check) == "TAG"):
    title = self.__clean(bytes.decode(f.read(30)))
    artist = self.__clean(bytes.decode(f.read(30)))
    album = self.__clean(bytes.decode(f.read(30)))
    year = bytes.decode(f.read(4))
    comment = self.__clean(bytes.decode(f.read(30)))
    tmp_gen = bytes.decode(f.read(1))
    genre = self.__clean(Utils.genreByteToString(tmp_gen))
    return TagV1(title, artist, album, year, comment, genre)
return None
The clean method is here:
def __clean(self, string):
    counter = 0
    for i in range(0, len(string)):
        w = string[i]
        if(not w.strip()) or b"\00" == w or w == b"00" or w == bytes.decode(b"\00"):
            counter+=1
        else:
            counter = 0
        if(counter == 2):
            return string[0:i-1]
    return string
I've tried every possible combination I know of for the null byte: either not w or not w.split(). I even tried converting it to bytes and then looping through that looking for a null byte, but still nothing. My counter always stays at 0 in the debugger. Also, when I try to copy the value from the debugger, it comes out as an empty space; in the debugger itself it appears as an empty square. I would appreciate any input.
Using PyCharm 2017.1.4
I figured out that the only solution that works is to use
w == bytes.decode(b"\00") or rstrip("\0")
as noted by Marteen.
Everything else seems not to work. However, there are still some places where it doesn't work. For example, the comment in the file I am testing doesn't contain null bytes until the last one.
Upon further inspection with a hex editor I found some odd characters. The comment is padded with the 0x20 (space) character until position 29, where there is a null character (denoting that a track indicator follows); the next character is 0x01 for the track. Oddly, the genre indicator is 0x0C, which displays as (cannot paste it, it's a box with zeros in it).
EDIT: Using the __clean() method to check for the decoded null terminator as well as w.isspace() seemed to fix the issue in both other cases.
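
An alternative sketch that works on the raw bytes before decoding, which avoids comparing bytes with str. It assumes the standard 128-byte ID3v1 layout and the ID3v1.1 track convention described above; `parse_id3v1`, the returned dict, and the latin-1 decoding are choices made for this illustration, not part of the original tool. The genre is kept as a number because it is an index into the genre list, not text, which would explain why 0x0C shows up as an unprintable box when decoded.

```python
import struct


def parse_id3v1(block):
    """Parse a 128-byte ID3v1 block; return a dict, or None if there is no tag."""
    if len(block) != 128 or block[:3] != b"TAG":
        return None
    # "<" disables alignment; 3x skips the "TAG" marker
    title, artist, album, year, comment, genre = struct.unpack(
        "<3x30s30s30s4s30sB", block)

    track = None
    # ID3v1.1: a zero byte at comment[28] followed by a non-zero byte is a track number
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]

    def clean(raw):
        # Cut at the first null byte, then drop the space padding
        return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()

    return {
        "title": clean(title), "artist": clean(artist), "album": clean(album),
        "year": clean(year), "comment": clean(comment),
        "track": track, "genre": genre,
    }


# Hand-built sample tag: null padding, space padding, and an ID3v1.1 track byte
sample = (b"TAG" + b"Song".ljust(30, b"\x00") + b"Artist".ljust(30, b" ")
          + b"Album".ljust(30, b"\x00") + b"1999"
          + b"Nice".ljust(28, b" ") + b"\x00\x01" + bytes([0x0C]))
tag = parse_id3v1(sample)
```

When saving, the fields would be padded back to their fixed widths, for example with `value.encode("latin-1").ljust(30, b"\x00")`.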

python for loop to find and print names from blizzard api behaving erratically

I am working on a script to find and print the ilvls of my guild in python through the blizzard api. I realize that my code is ugly/horrible/not optimized. This is my first python project and I am learning as I go.
The problem I am having is that when I run the script, it gives me erratic results when printing the names of characters in my guild. Some it will print multiple times, others it will work as intended and just print once. I am more than likely going about the process entirely wrong, but what I have come up with has worked until now.
Here is the code:
import urllib.request

def guild_list(glink):
    with urllib.request.urlopen(glink) as url:
        gSource = url.read()
    gSourceDecoded = gSource.decode(encoding='UTF-8')
    gSource2 = str.replace(gSourceDecoded,"\"",' ')
    #finds total number of "characters"
    stringo = gSource2.split()
    nameCount = str(stringo.count('character'))
    gFirstSpace = gSource2.find('character')
    nameC = 0
    while nameC != int(nameCount):
        nextName = gSource2.find('character', gFirstSpace + 1)
        spaceBeforeName = gSource2.find(' ', nextName + 18)
        spaceAfterName = gSource2.find(' ', spaceBeforeName + 1)
        nameLen = spaceAfterName - spaceBeforeName
        cName = gSource2[spaceBeforeName + 1:spaceBeforeName + nameLen]
        gFirstSpace = gFirstSpace + nameC
        print(spaceBeforeName,'space before character name.')
        print(spaceAfterName,'space after character name.')
        nameC = nameC + 1
        print(cName)
        print(nameC,'number of instance \"character\" found.')

guild_list('http://us.battle.net/api/wow/guild/mugthol/license%20and%20registration?fields=members')
The results I get start out repeating the same name a few times. Then gradually start listing each name only once. This is where I am confused.
Results:
708 space before character name.
717 space after character name.
Euphoria
1 number of "character" found.
708 space before character name.
717 space after character name.
Euphoria
2 number of "character" found.
708 space before character name.
717 space after character name.
...
255 number of "character" found.
32740 space before character name.
32748 space after character name.
Bawbity
256 number of "character" found.
32997 space before character name.
33009 space after character name.
Kilikinilei
Thank you for the help, and again I apologize if my code is horrible to read. I am learning as I go.
You shouldn't be using string operations to parse that data. It is in JSON format so just use the builtin json library to parse it into native Python dicts.
import json
import requests

def guild_info(guild_name, fields="members"):
    url = "http://us.battle.net/api/wow/guild/{guild_name}/license%20and%20registration?fields={fields}".format(guild_name=guild_name, fields=fields)
    pg = requests.get(url).content
    return json.loads(pg)

info = guild_info("mugthol")
gets you a dict containing the data. You can then use it like
for member in info["members"]:
    char = member["character"]
    print("{name:<15} Level {level} ".format(**char))
which gets you
Introvert Level 88
Euphoria Level 86
Timid Level 90
Intricacy Level 87
Obscurity Level 90
Silhouette Level 40
Ragingfupa Level 87
Enragedfupa Level 90
Ragingticks Level 90
# ... etc
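
The same parsing can be tried without network access. In this sketch the JSON string is a hand-written stand-in that copies only the structure used above (a "members" list whose items hold a "character" dict with "name" and "level"); the real response contains more fields.

```python
import json

# Hand-written sample that mimics the shape of the guild response
sample = '''
{"members": [
    {"character": {"name": "Euphoria", "level": 86}},
    {"character": {"name": "Timid", "level": 90}}
]}
'''

info = json.loads(sample)

# Each name appears exactly once because the parser walks the list itself,
# instead of searching the raw text for the word "character"
names = [member["character"]["name"] for member in info["members"]]
levels = {member["character"]["name"]: member["character"]["level"]
          for member in info["members"]}
```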

python - determining the end of a field reading binary file

I'm working with a binary save game file. The file contains a number of fields; most are fixed, but there are several variable-length fields which I'm having issues parsing because I don't know their length. What I am trying to do is read from a known offset until I reach either a null byte or the read returns nothing; with that I would then be able to work out the offset of the next field.
The file I'm working with is www.retro-gaming-world.com/SAVE.DAT
The beginning of the field is at 0x8C30; I'm having issues figuring out where it ends, though.
I tried doing this with the following code, but I don't think I'm going about this right.
while catch:
    if "0" in temp2:
        print "found it"
        print temp2
        print hex(infile.tell())
        break
    temp = infile.read(1)
    temp2 += temp
You should use '\0' to represent null character:
>>> ord('0')
48
>>> ord('\0')
0
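
Putting that together, a minimal Python 3 sketch of reading one null-terminated field and getting the offset of the next one. `read_until_null` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the save file; with the real file you would open it in binary mode ("rb") and pass the known offset.

```python
import io


def read_until_null(stream, offset):
    """Return (field_bytes, next_offset) for a null-terminated field at offset."""
    stream.seek(offset)
    field = bytearray()
    while True:
        byte = stream.read(1)
        # b"" means end of file, b"\x00" is the terminator
        if byte == b"" or byte == b"\x00":
            break
        field += byte
    # The stream is now positioned just past the terminator
    return bytes(field), stream.tell()


stream = io.BytesIO(b"xxHELLO\x00WORLD\x00")
first, next_offset = read_until_null(stream, 2)
second, end_offset = read_until_null(stream, next_offset)
```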
