Related
I have a script that allows me to extract the info obtained from excel to a list, this list contains str values that contain phrases such as: "I like cooking", "My dog´s name is Doug", etc.
So I've tried this code that I found on the Internet, knowing that the int function has a way to transform an actual phrase into numbers.
The code I used is:
lista=["I like cooking", "My dog´s name is Doug", "Hi, there"]
test_list = [int(i, 36) for i in lista]
Running the code I get the following error:
builtins.ValueError: invalid literal for int() with base 36: "I like
cooking"
But I´ve tried the code without the spaces or punctuation, and i get an actual value, but I do need to take those characters into consideration.
To expand on the bytearray approach you could use int.to_bytes and int.from_bytes to actually get an int back, although the integers will be much longer than you show in your example.
def to_int(s):
return int.from_bytes(bytearray(s, 'utf-8'), 'big', signed=False)
def to_str(s):
return s.to_bytes((s.bit_length() +7 ) // 8, 'big').decode()
lista = ["I like cooking",
"My dog´s name is Doug",
"Hi, there"]
encoded = [to_int(s) for s in lista]
decoded = [to_str(s) for s in encoded]
encoded:
[1483184754092458833204681315544679,
28986146900667755422058678317652141643897566145770855,
1335744041264385192549]
decoded:
['I like cooking',
'My dog´s name is Doug',
'Hi, there']
As noted in the comments, converting phrases to integers with int() won't work if the phrase contains whitespace or most non-alphanumeric characters with a few exceptions.
If your phrases all use a common encoding, then you might get something closer to what you want by converting your strings to bytearrays. For example:
s = 'My dog´s name is Doug'
b = bytearray(s, 'utf-8')
print(list(b))
# [77, 121, 32, 100, 111, 103, 194, 180, 115, 32, 110, 97, 109, 101, 32, 105, 115, 32, 68, 111, 117, 103]
From there you would have to figure out whether or not you want to preserve the list of integers representing each phrase or combine them in some way depending on what you intend to do with these numerical string representations.
Since you want to convert your text for an AI, you should do something like this:
import re
def clean_text(text, vocab):
'''
normalizes the string
'''
chars = {'\'':[u"\u0060", u"\u00B4", u"\u2018", u"\u2019"], 'a':[u"\u00C0", u"\u00C1", u"\u00C2", u"\u00C3", u"\u00C4", u"\u00C5", u"\u00E0", u"\u00E1", u"\u00E2", u"\u00E3", u"\u00E4", u"\u00E5"],
'e':[u"\u00C8", u"\u00C9", u"\u00CA", u"\u00CB", u"\u00E8", u"\u00E9", u"\u00EA", u"\u00EB"],
'i':[u"\u00CC", u"\u00CD", u"\u00CE", u"\u00CF", u"\u00EC", u"\u00ED", u"\u00EE", u"\u00EF"],
'o':[u"\u00D2", u"\u00D3", u"\u00D4", u"\u00D5", u"\u00D6", u"\u00F2", u"\u00F3", u"\u00F4", u"\u00F5", u"\u00F6"],
'u':[u"\u00DA", u"\u00DB", u"\u00DC", u"\u00DD", u"\u00FA", u"\u00FB", u"\u00FC", u"\u00FD"]}
for gud in chars:
for bad in chars[gud]:
text = text.replace(bad, gud)
if 'http' in text:
return ''
text = text.replace('&', ' and ')
text = re.sub(r'\.( +\.)+', '..', text)
#text = re.sub(r'\.\.+', ' ^ ', text)
text = re.sub(r',+', ',', text)
text = re.sub(r'\-+', '-', text)
text = re.sub(r'\?+', ' ? ', text)
text = re.sub(r'\!+', ' ! ', text)
text = re.sub(r'\'+', "'", text)
text = re.sub(r';+', ':', text)
text = re.sub(r'/+', ' / ', text)
text = re.sub(r'<+', ' < ', text)
text = re.sub(r'>+', ' > ', text)
text = text.replace('%', '% ')
text = text.replace(' - ', ' : ')
text = text.replace(' -', " - ")
text = text.replace('- ', " - ")
text = text.replace(" '", " ")
text = text.replace("' ", " ")
#for c in ".,:":
# text = text.replace(c + ' ', ' ' + c + ' ')
text = re.sub(r' +', ' ', text.strip(' '))
for i in text:
if i not in vocab:
text = text.replace(i, '')
return text
def arr_to_vocab(arr, vocabDict):
'''
returns a provided array converted with provided vocab dict, all array elements have to be in the vocab, but not all vocab elements have to be in the input array, works with strings too
'''
try:
return [vocabDict[i] for i in arr]
except Exception as e:
print (e)
return []
def str_to_vocab(vocab):
'''
generates vocab dicts
'''
to_vocab = {}
from_vocab = {}
for index, i in enumerate(vocab):
to_vocab[index] = i
from_vocab[i] = index
return to_vocab, from_vocab
vocab = sorted([chr(i) for i in range(32, 127)]) # a basic vocab for your model
vocab.insert(0, None)
toVocab, fromVocab = str_to_vocab(vocab) #converting vocab into usable form
your_data_str = ["I like cooking", "My dog´s name is Doug", "Hi, there"] #your data, a list of strings
X = []
for i in your_data_str:
X.append(arr_to_vocab(clean_text(i, vocab), fromVocab)) # normalizing and converting to "ints" each string
# your data is now almost ready for your model, just pad it to the size of your input with zeros and it's done
print (X)
If you want to know how convert an "int" string back to a string, tell me.
Using a Korean Input Method Editor (IME), it's possible to type 버리 + 어 and it will automatically become 버려.
Is there a way to programmatically do that in Python?
>>> x, y = '버리', '어'
>>> z = '버려'
>>> ord(z[-1])
47140
>>> ord(x[-1]), ord(y)
(47532, 50612)
Is there a way to compute that 47532 + 50612 -> 47140?
Here's some more examples:
가보 + 아 -> 가봐
끝나 + ㄹ -> 끝날
I'm a Korean. First, if you type 버리 + 어, it becomes 버리어 not 버려. 버려 is an abbreviation of 버리어 and it's not automatically generated. Also 가보아 cannot becomes 가봐 automatically during typing by the same reason.
Second, by contrast, 끝나 + ㄹ becomes 끝날 because 나 has no jongseong(종성). Note that one character of Hangul is made of choseong(초성), jungseong(중성), and jongseong. choseong and jongseong are a consonant, jungseong is a vowel. See more at Wikipedia. So only when there's no jongseong during typing (like 끝나), there's a chance that it can have jongseong(ㄹ).
If you want to make 버리 + 어 to 버려, you should implement some Korean language grammar like, especially for this case, abbreviation of jungseong. For example ㅣ + ㅓ = ㅕ, ㅗ + ㅏ = ㅘ as you provided. 한글 맞춤법 chapter 4. section 5 (I can't find English pages right now) defines abbreviation like this. It's possible, but not so easy job especially for non-Koreans.
Next, if what you want is just to make 끝나 + ㄹ to 끝날, it can be a relatively easy job since there're libraries which can handle composition and decomposition of choseong, jungseong, jongseong. In case of Python, I found hgtk. You can try like this (nonpractical code):
# hgtk methods take one character at a time
cjj1 = hgtk.letter.decompose('나') # ('ㄴ', 'ㅏ', '')
cjj2 = hgtk.letter.decompose('ㄹ') # ('ㄹ', '', '')
if cjj1[2]) == '' and cjj2[1]) == '':
cjj = (cjj1[0], cjj1[1], cjj2[0])
cjj2 = None
Still, without proper knowledge of Hangul, it will be very hard to get it done.
You could use your own Translation table.
The drawback is you have to input all pairs manual or you have a file to get it from.
For instance:
# Sample Korean chars to map
k = [[('버리', '어'), ('버려')], [('가보', '아'), ('가봐')], [('끝나', 'ㄹ'), ('끝날')]]
class Korean(object):
def __init__(self):
self.map = {}
for m in k:
key = m[0][0] + m[0][1]
self.map[hash(key)] = m[1]
def __getitem__(self, item):
return self.map[hash(item)]
def translate(self, s):
return [ self.map[hash(token)] for token in s]
if __name__ == '__main__':
k_map = Korean()
k_chars = [ m[0][0] + m[0][1] for m in k]
print('Input: %s' % k_chars)
print('Output: %s' % k_map.translate(k_chars))
one_char_3 = k[0][0][0] + k[0][0][1]
print('%s = %s' % (one_char_3, k_map[ one_char_3 ]) )
Input: ['버리어', '가보아', '끝나ㄹ']
Output: ['버려', '가봐', '끝날']
버리어 = 버려
Tested with Python:3.4.2
Can we use regex to detect text within a pdf (using pdfquery or another tool)?
I know we can do this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:contains("Cash")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
cash = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % \
(left_corner, bottom_corner-30, \
left_corner+150, bottom_corner)).text()
print cash
'179,000.00'
But we need something like this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:regex("\d{1,3}(?:,\d{3})*(?:\.\d{2})?")')
cash = str(label.attr('x0'))
print cash
'179,000.00'
This is not exactly a lookup for a regex, but it works to format/filter the possible extractions:
def regex_function(pattern, match):
re_obj = re.search(pattern, match)
if re_obj != None and len(re_obj.groups()) > 0:
return re_obj.group(1)
return None
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pattern = ''
pdf.extract( [
('with_parent','LTPage[pageid=1]'),
('with_formatter', 'text'),
('year', 'LTTextLineHorizontal:contains("Form 1040A (")',
lambda match: regex_function(SOME_PATTERN_HERE, match)))
])
I didn't test this next one, but it might work also:
def some_regex_function_feature():
# here you could use some regex.
return float(this.get('width',0)) * float(this.get('height',0)) > 40000
pdf.pq('LTPage[page_index="1"] *').filter(regex_function_filter_here)
[<LTTextBoxHorizontal>, <LTRect>, <LTRect>]
I want to implement a for loop in the Tkinter tkMessageBox.showinfo() in python
I need to print a list of lists in the box.
What I currently have is:
tkMessageBox.showinfo(
"Help INFORMATION",
"help1 help2 \n help3 help4 \n help5 help6"
)
What I want is:
Something like below..
my_list=[['help1','help2'],['help3','help4'],['help5','help6']]
tkMessageBox.showinfo(
"Help INFORMATION",
for i in my_list:
i + "\n" #cant use print as I want to display it in the dialog box and not in the console.
)
So that the output in the dialog box should be like this :
help1 help2
help3 help4
help5 help6
But what I get is:
Syntax Error -> for i in my_list:
How about this:
my_list=[['help1','help2'],['help3','help4'],['help5','help6']]
tkMessageBox.showinfo(
"Help INFORMATION",
'\n'.join(map(' '.join, my_list))
)
I did not test it but should ideally do the job.
ok you can try this , I know it's not the most efficient one but it works !
my_list=[['help1','help2'],['help3','help4'],['help5','help6']]
def to_tuples(list):
tuples = []
for sublist in list :
tuples.append(tuple(sublist))
return tuples
def dialog_info(tuples):
res = ""
for element in tuples :
res += ' '.join(element)
res += '\n'
return res
print dialog_info(my_list)
now you can just use :
my_list = [['help1', 'help2'], ['help3', 'help4'], ['help5', 'help6']]
tkMessageBox.showinfo(
"Help INFORMATION",
dialog_info(my_list)
)
You can use
'\n'.join(map(' '.join, my_list))
instead of the for loop.
I am quiet new to regular expressions. I have a string that looks like this:
str = "abc/def/([default], [testing])"
and a dictionary
dict = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
and using Python RE, I want str in this form, after comparisons of each element in dict to str:
str = "abc/def/(2.7, 2.1)"
Any help how to do it using Python RE?
P.S. its not the part of any assignment, instead it is the part of my project at work and I have spent many hours to figure out solution but in vain.
import re
st = "abc/def/([default], [testing], [something])"
dic = {'abc/def/[default]' : '2.7',
'abc/def/[testing]' : '2.1',
'bcd/xed/[something]' : '3.1'}
prefix_regex = "^[\w*/]*"
tag_regex = "\[\w*\]"
prefix = re.findall(prefix_regex, st)[0]
tags = re.findall(tag_regex, st)
for key in dic:
key_prefix = re.findall(prefix_regex, key)[0]
key_tag = re.findall(tag_regex, key)[0]
if prefix == key_prefix:
for tag in tags:
if tag == key_tag:
st = st.replace(tag, dic[key])
print st
OUTPUT:
abc/def/(2.7, 2.1, [something])
Here is a solution using re module.
Hypotheses :
there is a dictionary whose keys are composed of a prefix and a variable part, the variable part is enclosed in brackets ([])
the values are strings by which the variable parts are to be replaced in the string
the string is composed by a prefix, a (, a list of variable parts and a )
the variable parts in the string are enclosed in []
the variable parts in the string are separated by a comma followed by optional spaces
Python code :
import re
class splitter:
pref = re.compile("[^(]+")
iden = re.compile("\[[^]]*\]")
def __init__(self, d):
self.d = d
def split(self, s):
m = self.pref.match(s)
if m is not None:
p = m.group(0)
elts = self.iden.findall(s, m.span()[1])
return p, elts
return None
def convert(self, s):
p, elts = self.split(s)
return p + "(" + ", ".join((self.d[p + elt] for elt in elts)) + ")"
Usage :
s = "abc/def/([default], [testing])"
d = {'abc/def/[default]' : '2.7', 'abc/def/[testing]' : '2.1'}
sp = splitter(d)
print(sp.convert(s))
output :
abc/def/(2.7, 2.1)
Regex is probably not required here. Hope this helps
lhs,rhs = str.split("/(")
rhs1,rhs2 = rhs.strip(")").split(", ")
lhs+="/"
print "{0}({1},{2})".format(lhs,dict[lhs+rhs1],dict[lhs+rhs2])
output
abc/def/(2.7,2.1)