Tokenizing: how to avoid splitting on punctuation like `^* in Python for NLP - python

I want to tokenize a string on punctuation, except for the characters `*^.
I've tried, but in the result every type of punctuation is separated, including the ones I don't want separated.
When I use:
text = "hai*ini^ema`il saya lunar!?"
tokenizer = TweetTokenizer()
nltk_tokens = tokenizer.tokenize(text)
nltk_tokens
I get:
['hai', '*', 'ini', '^', 'ema', '`', 'il', 'saya', 'lunar', '!', '?']
What I want is:
['hai*ini^ema`il', 'saya', 'lunar', '!', '?']
In short: tokenize, but do not split on *^`.

Try this:
import re

def phrasalize(tokens):
    s = " ".join(tokens)
    # Repeatedly collapse "word SEP word" runs where SEP is one of * ^ `
    match = re.match(r"((\w*\s[*^`]\s\w*)+)", s)
    while match:
        s = s.replace(match.group(1), match.group(1).replace(' ', ''))
        match = re.match(r"((\w*\s[*^`]\s\w*)+)", s)
    return s
tokens = ['hai', '*', 'ini', '^', 'ema', '`', 'il', 'saya', 'lunar', '!', '?']
phrasalize(tokens)
[out]:
'hai*ini^ema`il saya lunar ! ?'
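Alternatively, you can avoid the post-processing step by telling a regexp-based tokenizer to treat *, ^, and backtick as word characters. A minimal sketch using nltk's RegexpTokenizer (the pattern here is an assumption tuned to the example above):
from nltk.tokenize import RegexpTokenizer

# Word characters plus * ^ ` stay together; any other non-space punctuation is its own token
tokenizer = RegexpTokenizer(r"[\w*^`]+|[^\w\s]")
print(tokenizer.tokenize("hai*ini^ema`il saya lunar!?"))
# ['hai*ini^ema`il', 'saya', 'lunar', '!', '?']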

Related

Select all strings in a list and append a string - python

I want to add "." to the end of any item of the doing variable,
and I want output like this => I am watching.
import random
main = ['I', 'You', 'He', 'She']
main0 = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
rd = random.choice(main)
rd0 = random.choice(doing)
result = []
'''
if rd == main[0]:
    result.append(rd)
    result.append(main0[0])
if rd == main[1]:
    result.append(rd)
    result.append(main0[1])
if rd == main[2]:
    result.append(rd)
    result.append(main0[2])
if rd == main[3]:
    result.append(rd)
    result.append(main0[3])
'''
result.append(rd0)
print(result)
Well, I tried these:
'.'.append(doing)
'.'.append(doing[0])
'.'.append(rd0)
but none of them works; each only returns this error:
Traceback (most recent call last):
File "c:\Users\----\Documents\Codes\Testing\s.py", line 21, in <module>
'.'.append(rd0)
AttributeError: 'str' object has no attribute 'append'
Why not just select a random string from each list and then build the output as you want:
import random
main = ['I', 'You', 'He', 'She']
verb = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
p1 = random.choice(main)
p2 = random.choice(verb)
p3 = random.choice(doing)
output = ' '.join([p1, p2, p3]) + '.'
print(output) # You is playing.
Bad English, but the logic seems to be what you want here.
Just add a period after making the string:
import random
main = ['I', 'You', 'He', 'She']
main0 = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
' '.join([random.choice(i) for i in [main, main0, doing]]) + '.'
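Since random.choice draws from each list independently, the subject and verb can disagree ("You is playing."). If agreement matters, a minimal sketch that draws one shared index so main and main0 stay paired:
import random

main = ['I', 'You', 'He', 'She']
main0 = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
i = random.randrange(len(main))  # one shared index keeps subject and verb paired
print(' '.join([main[i], main0[i], random.choice(doing)]) + '.')  # e.g. "You are reading."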

I want to find the length of each word in my text file

I'm trying to find the length of each word in my text file. I tried the following code, but it gives me the count of each word, i.e. how many times the word occurs in the file.
text = open(r"C:\Users\israr\Desktop\counter\Bigdata.txt")
d = dict()
for line in text:
line = line.strip()
line = line.lower()
words = line.split(" ")
for word in words:
if word in d:
d[word] = d[word] + 1
else:
# Add the word to dictionary with count 1
d[word] = 1
for key in list(d.keys()):
print(key, ":", d[key])
And the output is something like this
china : 14
emerged : 1
as : 16
one : 5
of : 44
the : 108
world's : 7
first : 2
civilizations, : 1
in : 26
fertile : 1
basin : 1
yellow : 1
river : 1
north : 1
plain. : 1
Basically I want lists of words grouped by length, for example china, first, world : 5, where 5 is the length of all those words, and then words of a different length in another list, and so on.
If you need each word's total length separately, you can find it with this formula:
len(word) * count(word) for each word in words
The equivalent in Python:
d[key] * len(key)
Replace the last two lines with:
for key in list(d.keys()):
    print(key, ":", d[key] * len(key))
----EDIT----
This is what you asked for in the comments, I guess. The code below gives you groups whose members all have the same length.
for word in words:
    if len(word) in d:
        if word not in d[len(word)]:
            d[len(word)].append(word)
    else:
        # Start a new group for this length
        d[len(word)] = [word]
for key in list(d.keys()):
    print(key, ":", d[key])
Output of this code:
3 : ['the', 'bc,', '(c.', 'who', 'was', '100', 'bc)', 'and', 'xia', 'but', 'not', 'one', 'due', '8th', '221', 'qin', 'shi', 'for', 'his', 'han', '220', '206', 'has', 'war', 'all', 'far']
8 : ['earliest', 'describe', 'writings', 'indicate', 'commonly', 'however,', 'cultural', 'history,', 'regarded', 'external', 'internal', 'culture,', 'troubled', 'imperial', 'selected', 'replaced', 'republic', 'mainland', "people's", 'peoples,', 'multiple', 'kingdoms', 'xinjiang', 'present.', '(carried']
5 : ['known', 'china', 'early', 'shang', 'texts', 'grand', 'ruled', 'river', 'which', 'along', 'these', 'arose', 'years', 'their', 'rule.', 'began', 'first', 'those', 'huang', 'title', 'after', 'until', '1912,', 'tasks', 'elite', 'young', '1949.', 'unity', 'being', 'civil', 'parts', 'other', 'world', 'waves', 'basis']
7 : ['written', 'records', 'history', 'dynasty', 'ancient', 'century', 'mention', 'writing', 'period,', 'xia.[5]', 'valley,', 'chinese', 'various', 'centers', 'yangtze', "world's", 'cradles', 'concept', 'mandate', 'justify', 'central', 'country', 'smaller', 'period.', 'another', 'warring', 'created', 'himself', 'huangdi', 'marking', 'systems', 'enabled', 'emperor', 'control', 'routine', 'handled', 'special', 'through', "china's", 'between', 'periods', 'culture', 'western', 'foreign']
2 : ['of', 'as', 'wu', 'by', 'no', 'is', 'do', 'in', 'to', 'be', 'at', 'or', 'bc', '21', 'ad']
4 : ['date', 'from', '1250', 'bc),', 'king', 'such', 'book', '11th', '(296', 'held', 'both', 'with', 'zhou', 'into', 'much', 'qin,', 'fell', 'soon', '(206', 'ad).', 'that', 'vast', 'were', 'men,', 'last', 'qing', 'then', 'most', 'whom', 'eras', 'have', 'some', 'asia', 'form']
9 : ['1600–1046', 'mentioned', 'documents', 'chapters,', 'historian', '2070–1600', 'existence', 'neolithic', 'millennia', 'thousands', '(1046–256', 'pressures', 'following', 'developed', 'conquered', '"emperor"', 'beginning', 'dynasties', 'directly.', 'centuries', 'carefully', 'difficult', 'political', 'dominated', 'stretched', 'contact),']
6 : ['during', "ding's", '(early', 'bamboo', 'annals', 'before', 'shang,', 'yellow', 'cradle', 'river.', 'shang.', 'oldest', 'heaven', 'weaken', 'states', 'spring', 'autumn', 'became', 'warred', 'times.', 'china.', 'death,', 'peace,', 'failed', 'recent', 'steppe', 'china;', 'tibet,', 'modern']
12 : ['reign,[1][2]', 'twenty-first', 'longer-lived', 'bureaucratic', 'calligraphy,', '(1644–1912),', '(1927–1949).', 'occasionally', 'immigration,']
11 : ['same.[3][4]', 'independent', 'traditional', 'territories', 'well-versed', 'literature,', 'philosophy,', 'assimilated', 'population.', 'warlordism,']
10 : ['historical', 'originated', 'continuous', 'supplanted', 'introduced', 'government', 'eventually', 'splintered', 'literature', 'philosophy', 'oppressive', 'successive', 'alternated', 'influences', 'expansion,']
1 : ['a', '–']
13 : ['civilization.', 'civilizations', 'examinations.', 'statehood—the', 'assimilation,']
17 : ['civilizations,[6]']
16 : ['civilization.[7]']
0 : ['']
14 : ['administrative']
18 : ['scholar-officials.']
Below is the full version of the code.
text = open("bigdata.txt")
d = dict()
for line in text:
    line = line.strip()
    line = line.lower()
    words = line.split(" ")
    for word in words:
        if len(word) in d:
            if word not in d[len(word)]:
                d[len(word)].append(word)
        else:
            d[len(word)] = [word]
for key in list(d.keys()):
    print(key, ":", d[key])
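For reference, the same grouping can be written more compactly with collections.defaultdict; a minimal sketch, assuming the same input file:
from collections import defaultdict

groups = defaultdict(set)  # maps length -> set of distinct words
with open("bigdata.txt") as f:
    for line in f:
        for word in line.strip().lower().split(" "):
            groups[len(word)].add(word)
for length, words in sorted(groups.items()):
    print(length, ":", sorted(words))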
When you look at the code that deals with each word, you will see your problem:
for word in words:
    if word in d:
        d[word] = d[word] + 1
    else:
        # Add the word to dictionary with count 1
        d[word] = 1
Here you are checking whether a word is in the dictionary. If it is, you add 1 to its count; if it is not, you initialize it at 1. This is the core pattern for counting repetitions.
If you want the length of each word instead, you could simply do:
for word in words:
    if word not in d:
        d[word] = len(word)
And to output your dict, you can do:
for k, v in d.items():
    print(k, ":", v)
You can create a list of word lengths and then process them through Python's built-in Counter:
from collections import Counter

with open("mytext.txt", "r") as f:
    words = f.read().split()
words_lengths = [len(word) for word in words]
counter = Counter(words_lengths)
The output would be something like:
In[1]:counter
Out[1]:Counter({7: 146, 9: 73, 5: 73, 4: 146, 1: 73})
Here the keys are word lengths and the values are the number of times they occur.
You can work with it like a regular dictionary.
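Counter also makes follow-up queries cheap; for example, its most_common method lists the most frequent word lengths:
for length, n in counter.most_common(3):  # the three most common lengths
    print(n, "words have length", length)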

AttributeError: 'list' object has no attribute 'isdigit'. Specifying POS of each and every word in sentences list efficiently?

Suppose I have a list of sentences (in a large corpus), each a list of tokenized words. The format of tokenized_raw_data is as follows:
[['arxiv', ':', 'astro-ph/9505066', '.'], ['seds', 'page', 'on', '``',
'globular', 'star', 'clusters', "''", 'douglas', 'scott', '``', 'independent',
'age', 'estimates', "''", 'krysstal', '``', 'the', 'scale', 'of', 'the',
'universe', "''", 'space', 'and', 'time', 'scaled', 'for', 'the', 'beginner',
'.'], ['icosmos', ':', 'cosmology', 'calculator', '(', 'with', 'graph',
'generation', ')', 'the', 'expanding', 'universe', '(', 'american',
'institute', 'of', 'physics', ')']]
I want to apply pos_tag.
What I have tried so far is as follows.
import os, nltk, re
from nltk.corpus import stopwords
from unidecode import unidecode
from nltk.tokenize import word_tokenize, sent_tokenize, wordpunct_tokenize
from nltk.tag import pos_tag

def read_data():
    global tokenized_raw_data
    with open("path//merge_text_results_pu.txt", 'r', encoding='utf-8', errors='replace') as f:
        raw_data = f.read()
    tokenized_raw_data = '\n'.join(nltk.line_tokenize(raw_data))

read_data()

def function1():
    tokens_sentences = sent_tokenize(tokenized_raw_data.lower())
    unfiltered_tokens = [[word for word in word_tokenize(word)] for word in tokens_sentences]
    tagged_tokens = nltk.pos_tag(unfiltered_tokens)
    nouns = [word.encode('utf-8') for word, pos in tagged_tokens
             if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]
    joined_nouns_text = (' '.join(map(bytes.decode, nouns))).strip()
    noun_tokens = [t for t in wordpunct_tokenize(joined_nouns_text)]
    stop_words = set(stopwords.words("english"))

function1()
I am getting the following error:
> AttributeError: 'list' object has no attribute 'isdigit'
How can I overcome this error in a time-efficient manner? Where am I going wrong?
Note: I am using Python 3.7 on Windows 10.
Try this:
word_list = []
for i in range(len(unfiltered_tokens)):
    word_list.append([])
for i in range(len(unfiltered_tokens)):
    for word in unfiltered_tokens[i]:
        if word[1:].isalpha():  # drop the first character, keep purely alphabetic remainders
            word_list[i].append(word[1:])
Then do:
tagged_tokens = []
for token in word_list:
    tagged_tokens.append(nltk.pos_tag(token))
You will get your desired results! Hope this helped.
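As an aside, the traceback arises because pos_tag expects a flat list of tokens, while unfiltered_tokens is a list of token lists. nltk also provides pos_tag_sents for tagging many sentences in one call; a minimal sketch:
import nltk

sentences = [['arxiv', ':', 'astro-ph/9505066', '.'],
             ['the', 'expanding', 'universe']]
tagged = nltk.pos_tag_sents(sentences)  # one (word, tag) list per sentence
nouns = [word for sent in tagged for word, pos in sent if pos.startswith('NN')]
print(nouns)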

Python Identifier Identification

I'm reading a Python file in a Python program and I want to get a list of all identifiers, literals, separators, and terminators in the file being read. Using identifiers as an example:
one_var = "something"
two_var = "something else"
other_var = "something different"
Assuming the variables above are in the file being read, the result should be:
list_of_identifiers = [one_var, two_var, other_var]
The same goes for literals, terminators, and separators. Thanks.
I already wrote code for all operators and keywords:
import keyword, operator

list_of_operators = []
list_of_keywords = []
more_operators = ['+', '-', '/', '*', '%', '**', '//', '==', '!=', '>', '<', '>=', '<=', '=', '+=', '-=', '*=', '/=', '%=', '**=', '//=', '&', '|', '^', '~', '<<', '>>', 'in', 'not in', 'is', 'is not', 'not', 'or', 'and']
with open('file.py') as data_source:
    for each_line in data_source:
        new_string = str(each_line).split(' ')
        for each_word in new_string:
            if each_word in keyword.kwlist:
                list_of_keywords.append(each_word)
            elif each_word in operator.__all__ or each_word in more_operators:
                list_of_operators.append(each_word)
print("Operators found:\n", list_of_operators)
print("Keywords found:\n", list_of_keywords)
import ast

with open('file.py') as data_source:
    ast_root = ast.parse(data_source.read())
identifiers = set()
for node in ast.walk(ast_root):  # visit every node in the syntax tree
    if isinstance(node, ast.Name):  # ast.Name nodes are identifier uses
        identifiers.add(node.id)
print(identifiers)
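The same walk can collect literals via ast.Constant nodes, while separators and terminators are not AST nodes at all, so the standard tokenize module is the usual tool for those. A minimal sketch (ast.Constant assumes Python 3.8+):
import ast, io, tokenize

with open('file.py') as f:
    source = f.read()

# Literals: every constant node in the AST
literals = [node.value for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.Constant)]

# Separators/terminators: operator-class tokens such as ( ) , : ;
seps = [tok.string for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.OP]
print(literals, seps)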

how to select specific words and put them into a list of tuples?

I got a long string as a result of using BeautifulSoup.
It is shaped something like this:
<a href="link1"><span>title1</span></a>
<a href="link2"><span>title2</span></a>
<a href="link3"><span>title3</span></a>
<a href="link4"><span>title4</span></a>
I want to select the "link#" and "title" parts specifically and put them in a list of tuples like the one below:
[(link1,title1),(link2,title2),(link3,title3),(link4,title4)]
Due to my lack of understanding of Python, I don't even know what to search for.
I've been trying to do this for about 6 hours and still couldn't find a way.
The BeautifulSoup code I used:
def extract(self):
    self.url = "http://aetoys.tumblr.com"
    self.source = requests.get(self.url)
    self.text = self.source.text
    self.soup = BeautifulSoup(self.text)
    for self.div in self.soup.findAll('li', {'class': 'has-sub'}):
        for self.li in self.div.find_all('a'):
            print(self.li)
You just need to extract the href attributes:
out = []  # one list of hrefs per matching li
for self.div in self.soup.findAll('li', {'class': 'has-sub'}):
    out.append([x["href"] for x in self.div.find_all('a', href=True)])
    print([x["href"] for x in self.div.find_all('a', href=True)])
['#', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '/kingdom', '/tera', '/torico', '/titan', '/seven', '/fairytail', '/soma', '/amsal', '/berserk', '/ghoul', '/kaizi', '/piando']
['#', '/onepiece_book', '/onepiece']
['#', '/naruto_book', '/naruto']
['#', '/bleach_book', '/bleach']
['#', '/conan', '/silver', '/hai', '/nise', '/hunterbyhunter', '/baku', '/unhon', '/souleater', '/liargame', '/kenichi', '/dglayman', '/magi', '/suicide', '/pedal']
['#', '/dobaku', '/gisei', '/dragonball', '/hagaren', '/gantz', '/doctor', '/dunk', '/susi', '/reborn', '/airgear', '/island', '/crows', '/beelzebub', '/zzang', '/akira', '/tennis', '/kuroco', '/claymore', '/deathnote']
To get a single list:
url = "http://aetoys.tumblr.com"
source = requests.get(url)
text = source.text
soup = BeautifulSoup(text)
print([x["href"] for div in soup.findAll('li', {'class': 'has-sub'}) for x in div.find_all('a', href=True)])
['#', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '/kingdom', '/tera', '/torico', '/titan', '/seven', '/fairytail', '/soma', '/amsal', '/berserk', '/ghoul', '/kaizi', '/piando', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '#', '/conan', '/silver', '/hai', '/nise', '/hunterbyhunter', '/baku', '/unhon', '/souleater', '/liargame', '/kenichi', '/dglayman', '/magi', '/suicide', '/pedal', '#', '/dobaku', '/gisei', '/dragonball', '/hagaren', '/gantz', '/doctor', '/dunk', '/susi', '/reborn', '/airgear', '/island', '/crows', '/beelzebub', '/zzang', '/akira', '/tennis', '/kuroco', '/claymore', '/deathnote']
If you really want tuples:
out = []
for div in soup.findAll('li', {'class': 'has-sub'}):
    out.append(tuple(x["href"] for x in div.find_all('a', href=True)))
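To get the (link, title) pairs the question actually asks for, pair each anchor's href with its text; a minimal sketch, assuming each anchor's visible text is the title:
pairs = [(a["href"], a.get_text(strip=True))
         for div in soup.findAll('li', {'class': 'has-sub'})
         for a in div.find_all('a', href=True)]
print(pairs)  # e.g. [('link1', 'title1'), ('link2', 'title2'), ...]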
