Python / Get unique tokens from a file with an exception - python

I want to find the number of unique tokens in a file. For this purpose I wrote the code below:
splittedWords = open('output.txt', encoding='windows-1252').read().lower().split()
uniqueValues = set(splittedWords)
print(uniqueValues)
The output.txt file is like this:
Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl
club+Noun toplanti+Noun+A3pl+P3sg
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj
With this code I can get unique tokens like Türkiye+Noun and Türkiye+Noun+Gen. But I want to keep only the part before the first + sign, so that, for example, Türkiye+Noun and Türkiye+Noun+Gen both reduce to Türkiye and are treated as a single unique token. I think I need to write a regex for this purpose.

It seems the word you want is always the first element in a list of '+'-joined parts:
Split each whitespace-separated token at '+' and take the 0th part:
text = """Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl
club+Noun toplanti+Noun+A3pl+P3sg
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj """
splittedWords = text.lower().replace("\n"," ").split()
uniqueValues = set(s.split("+")[0] for s in splittedWords)
print(uniqueValues)
Output:
{'imha', 'çaba', 'ülke', 'arzula', 'terörizm', 'olus', 'daha', 'istikrar', 'küresel',
'sagla', 'önle', 'üzere', 'nisbi', 'türkiye', 'gelis', 'bir', 'karar', 'hedef', '2',
've', 'silah', 'kur', 'alan', 'club', 'boyut', '-', 'anlasma', 'iliski',
'izafi', 'kurumsal', 'karsi', 'ankara', 'ortaklik', 'obur', 'kitle', 'güven',
'uygula', 'ol', 'düzey', 'konsey', 'teknik', 'rejim', 'komite', 'gümrük', 'samimi',
'gel', 'yay', 'toplanti', '.', 'asama', 'mahiyet', 'ab', '69', 'için',
'paylas', '6', '/', 'nispi', 'dünya', 'at', 'sayili', 'görece', 'isbirlik', 'birlik',
',', 'tüm', 'ile', 'düzen', 'uyar', 'göster', 'tehdit', 'madde'}
You might need to do some additional cleanup to remove things like
',' '6' '/'
Split and remove anything that's just numbers or punctuation:
from string import digits, punctuation
remove = set(digits + punctuation)
splittedWords = text.lower().split()
uniqueValues = set(s.split("+")[0] for s in splittedWords)
# remove from the set anything that consists only of numbers or punctuation
uniqueValues = uniqueValues - set(x for x in uniqueValues if all(c in remove for c in x))
print(uniqueValues)
to get it as:
{'teknik', 'yay', 'göster','hedef', 'terörizm', 'ortaklik','ile', 'daha', 'ol', 'istikrar',
'paylas', 'nispi', 'üzere', 'sagla', 'tüm', 'önle', 'asama', 'uygula', 'güven', 'kur',
'türkiye', 'gel', 'dünya', 'gelis', 'sayili', 'ab', 'club', 'küresel', 'imha', 'çaba',
'olus', 'iliski', 'izafi', 'mahiyet', 've', 'düzey', 'anlasma', 'tehdit', 'bir', 'düzen',
'obur', 'samimi', 'boyut', 'ülke', 'arzula', 'rejim', 'gümrük', 'karar', 'at', 'karsi',
'nisbi', 'isbirlik', 'alan', 'toplanti', 'ankara', 'birlik', 'kurumsal', 'için', 'kitle',
'komite', 'silah', 'görece', 'uyar', 'madde', 'konsey'}
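Alternatively, if you want to stay with the regex idea from the question, the same result can be obtained by matching only the part of each whitespace-delimited token that precedes its first '+'. A small sketch, assuming text holds the file contents as above:
import re
# capture the characters between a token boundary (start of text or whitespace) and the first '+'
uniqueValues = set(m.lower() for m in re.findall(r'(?:^|(?<=\s))([^\s+]+)\+', text))
print(uniqueValues)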

You can split each of the tokens you have now on "+" and take only the first part.
uniqueValues = set(map(lambda x: x.split('+')[0], splittedWords))
Here I use map. map applies the function (the lambda part) to every element of splittedWords.
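For illustration, a minimal sketch of what that line does on a couple of tokens (the sample values here are just made-up examples):
>>> splittedWords = ['Türkiye+Noun', 'terörizm+Noun+Gen', 'Türkiye+Noun+Gen']
>>> set(map(lambda x: x.split('+')[0], splittedWords))
{'Türkiye', 'terörizm'}
A generator expression, set(x.split('+')[0] for x in splittedWords), does the same thing and is often considered more readable than map with a lambda.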

Related

How to use Python to parse an SVG document from a URL (get points of a polyline)

I'm looking for a Python extension to parse an SVG's "points" values from the <polyline> elements and print them. Possibly parse it from the URL? Or I could save the SVG and do it locally.
I just need it to parse the points values and print them separately for each polyline element. So it will print something like this for each points value of the current <polyline> element.
[[239,274],[239,274],[239,274],[239,275],[239,275],[238,276],[238,276],[237,276],[237,276],[236,276],[236,276],[236,277] [236,277],[235,277],[235,277],[234,278],[234,278],[233,279],[233,279],[232,280] [232,280],[231,280],[231,280],[230,280],[230,280],[230,280],[229,280],[229,280]]
So after the first polyline element gets parsed and printed, it would parse the next polyline element and get the value for points and print it just like the first one until there is no more to be printed.
The SVG's URL: http://colorillo.com/bx0l.inline.svg
Here is an HTML example of a polyline element from the SVG:
<polyline points="239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280" style="fill: none; stroke: #000000; stroke-width: 1; stroke-linejoin: round; stroke-linecap: round; stroke-antialiasing: false; stroke-antialias: 0; opacity: 0.8"/>
I'm just looking for some quick help, and a example.. If you're able to help me out that would be neat.
I believe there is an HTML extraction package somewhere, but this is the kind of task I would do with core Python and the regular expressions module. Let txt be the text you presented (<polyline ...), so:
Importing regular expression module
In [22]: import re
Performing the search:
In [24]: g = re.search('polyline points="(.*?)"', txt)
In the above regex I use polyline points=" as an anchor (I omitted the < because it has a meaning in regex) and capture everything up to the next quotation mark.
The text you want is achieved by:
In [25]: g.group(1)
Out[25]: '239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280'
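Since the file contains many <polyline> elements and re.search only returns the first match, re.findall with the same pattern gives you one points string per polyline. A minimal sketch, assuming txt holds the whole SVG source:
>>> import re
>>> all_points = re.findall('polyline points="(.*?)"', txt)  # one captured string per <polyline>
>>> for pts in all_points:
...     print(pts)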
Update
It's safer to use XML parsing for the data; here is one way to do it (xml.etree is included in the standard library):
In [32]: import xml.etree.ElementTree as ET
In [33]: root = ET.fromstring(txt)
Since your snippet already has a single root tag, you don't need further extraction:
In [35]: root.tag
Out[35]: 'polyline'
And all the properties are actually XML attributes, converted to a dictionary:
In [37]: root.attrib
Out[37]:
{'points': '239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280', 'style': 'fill: none; stroke: #000000; stroke-width: 1; stroke-linejoin: round; stroke-linecap: round; stroke-antialiasing: false; stroke-antialias: 0; opacity: 0.8'}
So here you have it:
In [38]: root.attrib['points']
Out[38]: '239,274 239,274 239,274 239,275 239,275 238,276 238,276 237,276 237,276 236,276 236,276 236,277 236,277 235,277 235,277 234,278 234,278 233,279 233,279 232,280 232,280 231,280 231,280 230,280 230,280 230,280 229,280 229,280'
If you'd like to further split this into groups according to commas and spaces, I would do this:
Get all groups separated by spaces using split with no arguments:
>>> p = g.group(1).split()
>>> p
['239,274', '239,274', '239,274', '239,275', '239,275', '238,276', '238,276', '237,276', '237,276', '236,276', '236,276', '236,277', '236,277', '235,277', '235,277', '234,278', '234,278', '233,279', '233,279', '232,280', '232,280', '231,280', '231,280', '230,280', '230,280', '230,280', '229,280', '229,280']
Now for each string, split it at the comma which will return a list of strings. I use map to convert each such list to a list of ints:
>>> p2 = [list(map(int, numbers.split(','))) for numbers in p]
>>> p2
[[239, 274], [239, 274], [239, 274], [239, 275], [239, 275], [238, 276], [238, 276], [237, 276], [237, 276], [236, 276], [236, 276], [236, 277], [236, 277], [235, 277], [235, 277], [234, 278], [234, 278], [233, 279], [233, 279], [232, 280], [232, 280], [231, 280], [231, 280], [230, 280], [230, 280], [230, 280], [229, 280], [229, 280]]
And this will shed some more light:
>>> '123,456'.split(',')
['123', '456']
>>> list(map(int, '123,456'.split(',')))
[123, 456]
Below is a complete example that downloads the SVG and prints the points of every polyline:
import xml.etree.ElementTree as ET
from collections import namedtuple
import requests
import re

Point = namedtuple('Point', 'x y')
all_points = []

r = requests.get('http://colorillo.com/bx0l.inline.svg')
if r.status_code == 200:
    # strip the default namespace so findall('.//polyline') works without prefixes
    data = re.sub(' xmlns="[^"]+"', '', r.content.decode('utf-8'), count=1)
    root = ET.fromstring(data)
    poly_lines = root.findall('.//polyline')
    for poly_line in poly_lines:
        tmp = []
        _points = poly_line.attrib['points'].split(' ')
        for _p in _points:
            tmp.append(Point(*[int(z) for z in _p.split(',')]))
        all_points.append(tmp)

for points in all_points:
    # format each polyline as [[x,y],[x,y],...] with no spaces
    tmp = [str([p.x, p.y]).replace(' ', '') for p in points]
    line = ','.join(tmp)
    print('[' + line + ']')

Get a part of a token after a specific character

I want to get part of each token in a text file. So far I have written the code below:
from collections import Counter
import re
freq_dist = set()
words = re.findall(r'[\w+]+', open('output.txt').read())
freq_dist = Counter(words).most_common(10)
print(freq_dist)
My output.txt is as follows:
Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl
club+Noun toplanti+Noun+A3pl+P3sg
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj
I want to get the parts after the first + sign and save them in a list in descending order of frequency. For example, from Türkiye+Noun I want to get +Noun, from terörizm+Noun+Gen I want +Noun+Gen, and from isbirlik+Noun+P3sg I want +Noun+P3sg. After this I want to list them by their count in descending order, i.e. how many times +Noun or +Noun+Gen appears in the text.
How about splitting your input on whitespace?
from collections import Counter
words = [word.split('+', 1)[1].strip() for word in open('output.txt').read().split() if word]
freq_dist = Counter(words).most_common(10)
print(freq_dist)
This would give you:
[('Noun', 16), ('Punc', 8), ('Adj', 8), ('Noun+P3sg', 6), ('Num', 5), ('Conj', 4), ('Noun+Gen', 3), ('Noun+P3sg+Gen', 3), ('Noun+Loc', 2), ('Verb+PastPart+P3pl', 2)]
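The trick is the maxsplit argument: split('+', 1) cuts each token only at the first '+', so everything after it stays together as one Counter key. For example:
>>> 'isbirlik+Noun+P3sg'.split('+', 1)
['isbirlik', 'Noun+P3sg']
>>> 'Türkiye+Noun'.split('+', 1)
['Türkiye', 'Noun']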

How to remove 'Synset', '( )', and the '.pos_tag.number' parts from hypernyms and hyponyms?

I wrote some code for finding hypernyms and hyponyms using the NLTK WordNet interface.
Here is my code (this is the example for hyponyms):
from nltk.corpus import wordnet as wn

word1 = ['learn']
word2 = ['study']

def getSynonyms(words):
    synonymList1 = []
    wordnetSynset1 = wn.synsets(words)
    tempList1 = []
    for synset1 in wordnetSynset1:
        synLemmas = synset1.hyponyms()
        for i in xrange(len(synLemmas)):
            word = synLemmas[i]  # .replace('_',' ')
            if word not in tempList1:
                tempList1.append(word)
    synonymList1.append(tempList1)
    return synonymList1

def cekSynonyms(word1, word2):
    tmp = 0
    for i in xrange(len(word1)):
        for j in xrange(len(word2)):
            getsyn1 = getSynonyms(word1[i])
            getsyn2 = getSynonyms(word2[j])
            ds1 = [x for y in getsyn1 for x in y]
            ds2 = [x for y in getsyn2 for x in y]
            print ds1, "\n", ds2, "\n\n"
            for k in xrange(len(ds1)):
                for l in xrange(len(ds2)):
                    if ds1[k] == ds2[l]:
                        tmp = 1
    return tmp

print cekSynonyms(word1, word2)
print
And here's the output:
[Synset('absorb.v.02'), Synset('catch_up.v.02'), Synset('relearn.v.01'), Synset('study.v.05'), Synset('ascertain.v.04'), Synset('discover.v.04'), Synset('get_the_goods.v.01'), Synset('trip_up.v.01'), Synset('wise_up.v.01'), Synset('understudy.v.01'), Synset('audit.v.02'), Synset('drill.v.03'), Synset('train.v.02'), Synset('catechize.v.02'), Synset('coach.v.01'), Synset('condition.v.01'), Synset('drill.v.04'), Synset('enlighten.v.01'), Synset('ground.v.04'), Synset('indoctrinate.v.01'), Synset('induct.v.05'), Synset('lecture.v.01'), Synset('mentor.v.01'), Synset('reinforce.v.02'), Synset('spoonfeed.v.02'), Synset('train.v.01'), Synset('tutor.v.01'), Synset('unteach.v.01'), Synset('unteach.v.02'), Synset('test.v.06')]
[Synset('resurvey.n.01'), Synset('assay.n.03'), Synset('blue_book.n.01'), Synset('case_study.n.01'), Synset('green_paper.n.01'), Synset('medical_report.n.01'), Synset('position_paper.n.01'), Synset('progress_report.n.01'), Synset('white_book.n.01'), Synset('allometry.n.01'), Synset('architecture.n.02'), Synset('bibliotics.n.01'), Synset('communications.n.01'), Synset('engineering.n.02'), Synset('escapology.n.01'), Synset('frontier.n.03'), Synset('futurology.n.01'), Synset('genealogy.n.02'), Synset('graphology.n.01'), Synset('humanistic_discipline.n.01'), Synset('major.n.04'), Synset('military_science.n.01'), Synset('numerology.n.01'), Synset('occultism.n.01'), Synset('ology.n.01'), Synset('protology.n.01'), Synset('science.n.01'), Synset('theogony.n.01'), Synset('theology.n.01'), Synset('design.n.06'), Synset('draft.n.03'), Synset('vignette.n.03'), Synset('lucubration.n.02'), Synset('anatomize.v.02'), Synset('assay.v.01'), Synset('audit.v.01'), Synset('check.v.01'), Synset('compare.v.01'), Synset('diagnose.v.01'), Synset('diagnose.v.02'), Synset('investigate.v.01'), Synset('review.v.01'), Synset('screen.v.02'), Synset('sieve.v.02'), Synset('survey.v.01'), Synset('survey.v.05'), Synset('trace.v.01'), Synset('view.v.02'), Synset('major.v.01'), Synset('compare.v.03'), Synset('factor.v.03'), Synset('audit.v.02'), Synset('drill.v.03'), Synset('train.v.02'), Synset('cram.v.03'), Synset('memorize.v.01')]
1
My question is how to remove the Synset wrapper, the parentheses, and the .pos_tag.number parts from the hypernyms and hyponyms,
so that only the words are shown, like ['train', 'memorize'].
I've tried synLemmas = synset1.lemma_names() with word = synLemmas[i].replace('_',' ') and it works. Here's the output:
[u'learn', u'larn', u'acquire', u'hear', u'get word', u'get wind', u'pick up', u'find out', u'get a line', u'discover', u'see', u'memorize', u'memorise', u'con', u'study', u'read', u'take', u'teach', u'instruct', u'determine', u'check', u'ascertain', u'watch']
[u'survey', u'study', u'work', u'report', u'written report', u'discipline', u'subject', u'subject area', u'subject field', u'field', u'field of study', u'bailiwick', u'sketch', u'cogitation', u'analyze', u'analyse', u'examine', u'canvass', u'canvas', u'consider', u'learn', u'read', u'take', u'hit the books', u'meditate', u'contemplate']
Programmatically, Synset objects are not strings ;P
You can check the type of any Python object using the built-in type function:
>>> x = 'Foo bar'
>>> type(x)
<class 'str'>
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> type(wn.synsets('dog'))
<class 'list'>
>>> type(wn.synsets('dog')[0])
<class 'nltk.corpus.reader.wordnet.Synset'>
Linguistically, Synsets are concepts/meanings/ideas.
One word can have multiple meanings, so multiple synsets.
One meaning can be expressed in different words/lemmas.
If we look at the word dog, we see that it points to multiple synsets, each with a different definition:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> for ss in wn.synsets('dog'):
... print (ss, ':', ss.definition())
...
Synset('dog.n.01') : a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Synset('frump.n.01') : a dull unattractive unpleasant girl or woman
Synset('dog.n.03') : informal term for a man
Synset('cad.n.01') : someone who is morally reprehensible
Synset('frank.n.02') : a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
Synset('pawl.n.01') : a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
Synset('andiron.n.01') : metal supports for logs in a fireplace
Synset('chase.v.01') : go after with the intent to catch
And each synset can be expressed as different words/lemmas:
>>> for ss in wn.synsets('dog'):
... print (ss, ':', ss.lemma_names())
...
Synset('dog.n.01') : ['dog', 'domestic_dog', 'Canis_familiaris']
Synset('frump.n.01') : ['frump', 'dog']
Synset('dog.n.03') : ['dog']
Synset('cad.n.01') : ['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel']
Synset('frank.n.02') : ['frank', 'frankfurter', 'hotdog', 'hot_dog', 'dog', 'wiener', 'wienerwurst', 'weenie']
Synset('pawl.n.01') : ['pawl', 'detent', 'click', 'dog']
Synset('andiron.n.01') : ['andiron', 'firedog', 'dog', 'dog-iron']
Synset('chase.v.01') : ['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'dog', 'go_after', 'track']
Since each word can map to multiple synsets in WordNet, you CAN'T access the hyper/hyponyms directly from a word/lemma.
To access the hyper/hyponyms, you first need to disambiguate the meaning of the word within its context.
Sentence: I ate a dog for breakfast.
Ambiguous Word: dog
Disambiguated Synset: Synset('frank.n.02')
Only after you know which synset is the correct meaning of the word in context can you access the hypernyms of that Synset, e.g.
>>> wn.synsets('dog')[4]
Synset('frank.n.02')
>>> wn.synsets('dog')[4].hypernyms()
[Synset('sausage.n.01')]
>>> wn.synsets('dog')[4].hypernyms()[0]
Synset('sausage.n.01')
>>> wn.synsets('dog')[4].hypernyms()[0].lemma_names()
['sausage']
If you only want the name inside, you can use the .name() method.
For example, Synset('frump.n.01').name() returns 'frump.n.01'.
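And if you want just the word itself (e.g. 'train' or 'memorize') rather than 'train.v.02', one simple option (a sketch, not the only way) is to keep only the part of the name before the first dot:
>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synsets('dog')[0]
>>> ss.name()
'dog.n.01'
>>> ss.name().split('.')[0]
'dog'
>>> [s.name().split('.')[0] for s in wn.synsets('dog')]
['dog', 'frump', 'dog', 'cad', 'frank', 'pawl', 'andiron', 'chase']
Note that this collapses different senses (two 'dog' entries above); lemma_names(), as you already tried, gives the full list of surface forms for each synset instead.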

Python unicode decode not working for outlook exported csv

Hi, I've exported an Outlook contacts CSV file and loaded it into a Python shell.
I have a number of European names in the list. For example, the following
tmp = 'Fern\xc3\x9fndez'
tmp.encode("latin-1")
results in an error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
while
tmp.decode('latin-1')
gives me
u'Fern\xc3\x9fndez'
How do I get the text to read as Fernandez? (not too worried about the accents - but happy to have them)
You must be using Python 2.x. Here is one way to print out the character (depending on which encoding you are working with):
>>> tmp = 'Fern\xc3\x9fndez'
>>> print tmp.decode('utf-8') # print formats the string for stdout
Fernßndez
>>> print tmp.decode('latin1')
FernÃndez
Are you sure you have the right character? Is it utf-8? And another way:
>>> print unicode(tmp, 'latin1')
FernÃndez
>>> print unicode(tmp, 'utf-8')
Fernßndez
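If the whole exported CSV really is UTF-8, it's usually easier to decode each line while reading the file than to patch individual strings afterwards. A minimal Python 2 sketch, assuming a hypothetical file name contacts.csv and UTF-8 content:
>>> import io
>>> with io.open('contacts.csv', encoding='utf-8') as f:
...     for line in f:
...         print line.strip()  # each line is decoded to unicode as it is read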
Interesting. So none of these options worked for you? Incidentally, I ran the string through a few other encodings just to see if any of them had a character more in line with what I would expect. Unfortunately, I don't see any that look quite right:
>>> for encoding in ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8']:
...     try:
...         print encoding + ': ' + tmp.decode(encoding)
...     except:
...         pass
...
cp037: ãÁÊ>C¤>ÀÁ:
cp437: Fernßndez
cp500: ãÁÊ>C¤>ÀÁ:
cp737: Fern├θndez
cp775: Fern├¤ndez
cp850: Fernßndez
cp852: Fern├čndez
cp855: Fern├Ъndez
cp857: Fern├şndez
cp860: Fern├Óndez
cp861: Fernßndez
cp862: Fernßndez
cp863: Fernßndez
cp865: Fernßndez
cp866: Fern├Яndez
cp869: Fern├ίndez
cp875: ΖΧΈ>Cμ>ΦΧ:
cp932: Fernテ殤dez
cp949: Fern횩ndez
cp1006: Fernﺣndez
cp1026: ãÁÊ>C¤>ÀÁ:
cp1140: ãÁÊ>C€>ÀÁ:
cp1250: FernĂźndez
cp1251: FernГџndez
cp1252: Fernßndez
cp1254: Fernßndez
cp1256: Fernأںndez
cp1258: FernĂŸndez
gbk: Fern脽ndez
gb18030: Fern脽ndez
latin_1: FernÃndez
iso8859_2: FernĂndez
iso8859_4: FernÃndez
iso8859_5: FernУndez
iso8859_6: Fernأndez
iso8859_7: FernΓndez
iso8859_9: FernÃndez
iso8859_10: FernÃndez
iso8859_13: FernĆndez
iso8859_14: FernÃndez
iso8859_15: FernÃndez
koi8_r: Fernц÷ndez
koi8_u: Fernц÷ndez
mac_cyrillic: Fern√Яndez
mac_greek: FernΟündez
mac_iceland: Fernßndez
mac_latin2: Fernßndez
mac_roman: Fernßndez
mac_turkish: Fernßndez
ptcp154: FernГҹndez
shift_jis: Fernテ殤dez
shift_jis_2004: Fernテ殤dez
shift_jisx0213: Fernテ殤dez
utf_16: 敆湲鿃摮穥
utf_16_be: 䙥牮쎟湤敺
utf_16_le: 敆湲鿃摮穥
utf_8: Fernßndez

PyParsing: Is this correct use of setParseAction()?

I have strings like this:
"MSE 2110, 3030, 4102"
I would like to output:
[("MSE", 2110), ("MSE", 3030), ("MSE", 4102)]
This is my way of going about it, although I haven't quite gotten it yet:
def makeCourseList(str, location, tokens):
    print "before: %s" % tokens
    for index, course_number in enumerate(tokens[1:]):
        tokens[index + 1] = (tokens[0][0], course_number)
    print "after: %s" % tokens

course = Group(DEPT_CODE + COURSE_NUMBER)  # .setResultsName("Course")
course_data = (course + ZeroOrMore(Suppress(',') + COURSE_NUMBER)).setParseAction(makeCourseList)
This outputs:
>>> course.parseString("CS 2110")
([(['CS', 2110], {})], {})
>>> course_data.parseString("CS 2110, 4301, 2123, 1110")
before: [['CS', 2110], 4301, 2123, 1110]
after: [['CS', 2110], ('CS', 4301), ('CS', 2123), ('CS', 1110)]
([(['CS', 2110], {}), ('CS', 4301), ('CS', 2123), ('CS', 1110)], {})
Is this the right way to do it, or am I totally off?
Also, the output isn't quite correct - I want course_data to emit a list of course symbols that all have the same format. Right now, the first course is different from the others. (It has a {}, whereas the others don't.)
This solution memorizes the department when parsed, and emits a (dept,coursenum) tuple when a number is found.
from pyparsing import Suppress, Word, ZeroOrMore, alphas, nums, delimitedList

data = '''\
MSE 2110, 3030, 4102
CSE 1000, 2000, 3000
'''

def memorize(t):
    memorize.dept = t[0]

def token(t):
    return (memorize.dept, int(t[0]))

course = Suppress(Word(alphas).setParseAction(memorize))
number = Word(nums).setParseAction(token)
line = course + delimitedList(number)
lines = ZeroOrMore(line)

print lines.parseString(data)
Output:
[('MSE', 2110), ('MSE', 3030), ('MSE', 4102), ('CSE', 1000), ('CSE', 2000), ('CSE', 3000)]
Is this the right way to do it, or am I totally off?
It's one way to do it, though of course there are others (e.g. use as parse actions two bound methods -- so the instance the methods belong to can keep state -- one for the dept code and another for the course number).
The return value of the parseString call is harder to bend to your will (though I'm sure sufficiently dark magic will do it and I look forward to Paul McGuire explaining how;-), so why not go the bound-method route as in...:
from pyparsing import *

DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber")

class MyParse(object):
    def __init__(self):
        self.result = None

    def makeCourseList(self, str, location, tokens):
        print "before: %s" % tokens
        dept = tokens[0][0]
        newtokens = [(dept, tokens[0][1])]
        newtokens.extend((dept, tok) for tok in tokens[1:])
        print "after: %s" % newtokens
        self.result = newtokens

course = Group(DEPT_CODE + COURSE_NUMBER).setResultsName("Course")
inst = MyParse()
course_data = (course + ZeroOrMore(Suppress(',') + COURSE_NUMBER)
               ).setParseAction(inst.makeCourseList)

ignore = course_data.parseString("CS 2110, 4301, 2123, 1110")
print inst.result
this emits:
before: [['CS', '2110'], '4301', '2123', '1110']
after: [('CS', '2110'), ('CS', '4301'), ('CS', '2123'), ('CS', '1110')]
[('CS', '2110'), ('CS', '4301'), ('CS', '2123'), ('CS', '1110')]
which seems to be what you require, if I read your specs correctly.
data = '''\
MSE 2110, 3030, 4102
CSE 1000, 2000, 3000'''

def get_courses(data):
    for row in data.splitlines():
        department, *numbers = row.replace(",", "").split()
        for number in numbers:
            yield department, number
This would give a generator for the course codes. A list can be made with list() if need be, or you can iterate over it directly.
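For example (the numbers stay strings here, since the generator never converts them):
>>> list(get_courses(data))
[('MSE', '2110'), ('MSE', '3030'), ('MSE', '4102'), ('CSE', '1000'), ('CSE', '2000'), ('CSE', '3000')]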
Sure, everybody loves PyParsing. For easy stuff like this split is sooo much easier to grok:
data = '''\
MSE 2110, 3030, 4102
CSE 1000, 2000, 3000'''

all = []
for row in data.split('\n'):
    klass, num_l = row.split(' ', 1)
    all.extend((klass, int(num)) for num in num_l.split(','))
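This leaves all as a flat list of (dept, number) tuples, with the numbers converted to int:
>>> all
[('MSE', 2110), ('MSE', 3030), ('MSE', 4102), ('CSE', 1000), ('CSE', 2000), ('CSE', 3000)]
(As a side note, all shadows the built-in function of the same name, so renaming it to something like courses avoids surprises later.)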
