Select all strings in a list and append a string - python

I want to add "." to the end of whichever item is picked from the doing variable, so that the output looks like this: I am watching.
import random

main = ['I', 'You', 'He', 'She']
main0 = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
rd = random.choice(main)
rd0 = random.choice(doing)
result = []
'''
if rd == main[0]:
    result.append(rd)
    result.append(main0[0])
if rd == main[1]:
    result.append(rd)
    result.append(main0[1])
if rd == main[2]:
    result.append(rd)
    result.append(main0[2])
if rd == main[3]:
    result.append(rd)
    result.append(main0[3])
'''
result.append(rd0)
print(result)
Well, I tried these:
'.'.append(doing)
'.'.append(doing[0])
'.'.append(rd0)
but none of them works; each one just raises this error:
Traceback (most recent call last):
  File "c:\Users\----\Documents\Codes\Testing\s.py", line 21, in <module>
    '.'.append(rd0)
AttributeError: 'str' object has no attribute 'append'

Why not just select a random string from each list and then build the output as you want:
import random
main = ['I', 'You', 'He', 'She']
verb = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
p1 = random.choice(main)
p2 = random.choice(verb)
p3 = random.choice(doing)
output = ' '.join([p1, p2, p3]) + '.'
print(output) # You is playing.
Bad English, but the logic seems to be what you want here.
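As a side note, the AttributeError in the question comes from calling a list method on a string: str has no append, and strings are immutable, so you build a new string with + instead. A minimal sketch:
rd0 = 'watching'
sentence = rd0 + '.'  # + returns a new string; nothing is modified in place
print(sentence)       # watching.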

Just add a period after making the string:
import random
main = ['I', 'You', 'He', 'She']
main0 = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
' '.join([random.choice(i) for i in [main, main0, doing]]) + '.'
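If the goal from the title is literally to append "." to every string in the list, rather than to one random pick, a list comprehension does it in one line (a minimal sketch reusing the doing list from above):
doing = ['playing', 'watching', 'reading', 'listening']
doing_with_period = [word + '.' for word in doing]
print(doing_with_period)  # ['playing.', 'watching.', 'reading.', 'listening.']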

Related

tokenizing: how to not tokenize punctuation like `^* in python for NLP

I want to tokenize the punctuation in a string, except for `*^.
I've tried, but in the result every type of punctuation gets separated, including the ones I don't want split off.
When I use:
from nltk.tokenize import TweetTokenizer

text = "hai*ini^ema`il saya lunar!?"
tokenizer = TweetTokenizer()
nltk_tokens = tokenizer.tokenize(text)
nltk_tokens
I get:
['hai', '*', 'ini', '^', 'ema', '`', 'il', 'saya', 'lunar', '!', '?']
What I want is:
['hai*ini^ema`il', 'saya', 'lunar', '!', '?']
I want to tokenize, but without splitting on *^`.
Try this:
import re

def phrasalize(tokens):
    s = " ".join(tokens)
    match = re.match(r"((\w*\s[\*\^\`]\s\w*)+)", s)
    while match:
        s = s.replace(match.group(1), match.group(1).replace(' ', ''))
        match = re.match(r"((\w*\s[\*\^\`]\s\w*)+)", s)
    return s

tokens = ['hai', '*', 'ini', '^', 'ema', '`', 'il', 'saya', 'lunar', '!', '?']
phrasalize(tokens)
[out]:
'hai*ini^ema`il saya lunar ! ?'
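Note that re.match only matches at the start of the string, which happens to work here because the glued span begins at the first token. If you would rather avoid the regex round-trip through a joined string, a plain loop over the tokens does the same merge (my own sketch, not from the original answer):
def merge_symbols(tokens, symbols=set("*^`")):
    merged = []
    attach_next = False
    for tok in tokens:
        if tok in symbols and merged:
            merged[-1] += tok   # glue the symbol onto the previous token
            attach_next = True  # and glue the following token on as well
        elif attach_next:
            merged[-1] += tok
            attach_next = False
        else:
            merged.append(tok)
    return merged

merge_symbols(['hai', '*', 'ini', '^', 'ema', '`', 'il', 'saya', 'lunar', '!', '?'])
# ['hai*ini^ema`il', 'saya', 'lunar', '!', '?']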

I want to find the length of each word in my text file

I'm trying to find the length of each individual word in my text file. I tried the following code, but it gives me the count of each word, i.e. how many times the word occurs in the file.
text = open(r"C:\Users\israr\Desktop\counter\Bigdata.txt")
d = dict()
for line in text:
    line = line.strip()
    line = line.lower()
    words = line.split(" ")
    for word in words:
        if word in d:
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
for key in list(d.keys()):
    print(key, ":", d[key])
And the output is something like this:
china : 14
emerged : 1
as : 16
one : 5
of : 44
the : 108
world's : 7
first : 2
civilizations, : 1
in : 26
fertile : 1
basin : 1
yellow : 1
river : 1
north : 1
plain. : 1
Basically I want lists of words grouped by length, for example china, first, world : 5, where 5 is the length of all those words, with words of other lengths in other lists.
If you need each word's total length separately, you can compute it with this formula:
len(word) * count(word) for every word in words
Equivalent in Python:
d[key] * len(key)
Change the last two lines to:
for key in list(d.keys()):
    print(key, ":", d[key] * len(key))
----EDIT----
It is what you asked for in the comments, I guess. The code below gives you groups whose members all have the same length.
for word in words:
    if len(word) in d:
        if word not in d[len(word)]:
            d[len(word)].append(word)
    else:
        # Start a new group (list) for this word length
        d[len(word)] = [word]
for key in list(d.keys()):
    print(key, ":", d[key])
Output of this code:
3 : ['the', 'bc,', '(c.', 'who', 'was', '100', 'bc)', 'and', 'xia', 'but', 'not', 'one', 'due', '8th', '221', 'qin', 'shi', 'for', 'his', 'han', '220', '206', 'has', 'war', 'all', 'far']
8 : ['earliest', 'describe', 'writings', 'indicate', 'commonly', 'however,', 'cultural', 'history,', 'regarded', 'external', 'internal', 'culture,', 'troubled', 'imperial', 'selected', 'replaced', 'republic', 'mainland', "people's", 'peoples,', 'multiple', 'kingdoms', 'xinjiang', 'present.', '(carried']
5 : ['known', 'china', 'early', 'shang', 'texts', 'grand', 'ruled', 'river', 'which', 'along', 'these', 'arose', 'years', 'their', 'rule.', 'began', 'first', 'those', 'huang', 'title', 'after', 'until', '1912,', 'tasks', 'elite', 'young', '1949.', 'unity', 'being', 'civil', 'parts', 'other', 'world', 'waves', 'basis']
7 : ['written', 'records', 'history', 'dynasty', 'ancient', 'century', 'mention', 'writing', 'period,', 'xia.[5]', 'valley,', 'chinese', 'various', 'centers', 'yangtze', "world's", 'cradles', 'concept', 'mandate', 'justify', 'central', 'country', 'smaller', 'period.', 'another', 'warring', 'created', 'himself', 'huangdi', 'marking', 'systems', 'enabled', 'emperor', 'control', 'routine', 'handled', 'special', 'through', "china's", 'between', 'periods', 'culture', 'western', 'foreign']
2 : ['of', 'as', 'wu', 'by', 'no', 'is', 'do', 'in', 'to', 'be', 'at', 'or', 'bc', '21', 'ad']
4 : ['date', 'from', '1250', 'bc),', 'king', 'such', 'book', '11th', '(296', 'held', 'both', 'with', 'zhou', 'into', 'much', 'qin,', 'fell', 'soon', '(206', 'ad).', 'that', 'vast', 'were', 'men,', 'last', 'qing', 'then', 'most', 'whom', 'eras', 'have', 'some', 'asia', 'form']
9 : ['1600–1046', 'mentioned', 'documents', 'chapters,', 'historian', '2070–1600', 'existence', 'neolithic', 'millennia', 'thousands', '(1046–256', 'pressures', 'following', 'developed', 'conquered', '"emperor"', 'beginning', 'dynasties', 'directly.', 'centuries', 'carefully', 'difficult', 'political', 'dominated', 'stretched', 'contact),']
6 : ['during', "ding's", '(early', 'bamboo', 'annals', 'before', 'shang,', 'yellow', 'cradle', 'river.', 'shang.', 'oldest', 'heaven', 'weaken', 'states', 'spring', 'autumn', 'became', 'warred', 'times.', 'china.', 'death,', 'peace,', 'failed', 'recent', 'steppe', 'china;', 'tibet,', 'modern']
12 : ['reign,[1][2]', 'twenty-first', 'longer-lived', 'bureaucratic', 'calligraphy,', '(1644–1912),', '(1927–1949).', 'occasionally', 'immigration,']
11 : ['same.[3][4]', 'independent', 'traditional', 'territories', 'well-versed', 'literature,', 'philosophy,', 'assimilated', 'population.', 'warlordism,']
10 : ['historical', 'originated', 'continuous', 'supplanted', 'introduced', 'government', 'eventually', 'splintered', 'literature', 'philosophy', 'oppressive', 'successive', 'alternated', 'influences', 'expansion,']
1 : ['a', '–']
13 : ['civilization.', 'civilizations', 'examinations.', 'statehood—the', 'assimilation,']
17 : ['civilizations,[6]']
16 : ['civilization.[7]']
0 : ['']
14 : ['administrative']
18 : ['scholar-officials.']
Below is the full version of the code.
text = open("bigdata.txt")
d = dict()
for line in text:
    line = line.strip()
    line = line.lower()
    words = line.split(" ")
    for word in words:
        if len(word) in d:
            if word not in d[len(word)]:
                d[len(word)].append(word)
        else:
            d[len(word)] = [word]
for key in list(d.keys()):
    print(key, ":", d[key])
When you look at the code that deals with each word, you will see your problem:
for word in words:
    if word in d:
        d[word] = d[word] + 1
    else:
        # Add the word to dictionary with count 1
        d[word] = 1
Here you are checking whether a word is already in the dictionary. If it is, you add 1 to its count; if it is not, you initialize its count at 1. This is the core pattern for counting repetitions.
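The same counting pattern is often written with dict.get, which folds the membership test and the initialization into a single line:
for word in words:
    d[word] = d.get(word, 0) + 1  # 0 is the default when word is not yet a key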
If you want to record the length of each word instead, you could simply do:
for word in words:
    if word not in d:
        d[word] = len(word)
And to output your dict, you can do:
for k, v in d.items():
    print(k, ":", v)
You can create a list of word lengths and then process it with Python's built-in Counter:
from collections import Counter

with open("mytext.txt", "r") as f:
    words = f.read().split()
words_lengths = [len(word) for word in words]
counter = Counter(words_lengths)
The output would be something like:
In[1]:counter
Out[1]:Counter({7: 146, 9: 73, 5: 73, 4: 146, 1: 73})
where the keys are word lengths and the values are the number of times they occur. You can work with it like a normal dictionary.
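If you also want the words themselves grouped by length, which is what the question asks for, a defaultdict keyed by length is a compact way to get it. A minimal sketch, reusing the words list from the snippet above:
from collections import defaultdict

groups = defaultdict(set)
for word in words:
    groups[len(word)].add(word)  # e.g. groups[5] -> {'china', 'first', ...}
for length in sorted(groups):
    print(length, ":", sorted(groups[length]))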

How to plot per minute word frequency from a python dataframe

I have a dataframe constructed from voice transcriptions of multiple audio files (one per person):
# Name      Start_Time            Duration  Transcript
# Person A  12:12:2018 12:12:00   3.5       Transcript from Person A
# Person B  12:12:2018 12:14:00   5.5       Transcript from Person B
# ...
# Person N  12:12:2018 13:00:00   9.0       Transcript from Person N
Is there a way to:
1. Find the 'n' most frequent words spoken per 'x' minutes of the complete conversation?
2. Plot the 'n' most frequent words per 'x' minutes of the complete duration of the conversation?
For part 2, would a per-'x'-minutes bar plot, with the height of each bar scaled to the sum of the occurrences of the 'n' most common words, work? Is there a more intuitive way of showing this information graphically?
EDIT:
I am attaching a basic minimal IPython notebook of what I have right now:
Ipython notebook
Problems:
Resampling the dataframe per 60 s doesn't do the trick, since some of the conversations are more than 60 seconds long, e.g. the first and fourth rows in the dataframe below last 114 seconds. I'm not sure whether these can be split into exact 60 s pieces; even if they can, the split may cascade into the next one-minute slot and make its duration exceed 60 s too, as with the first and second rows below.
Start Time Name Start Time End Time Duration Transcript
2019-04-13 18:51:22.567532 Person A 2019-04-13 18:51:22.567532 2019-04-13 18:53:16.567532 114 A dude meows on this cool guy my gardener met yesterday for no apparent reason.
2019-04-13 18:53:24.567532 Person D 2019-04-13 18:53:24.567532 2019-04-13 18:54:05.567532 41 Your homie flees from the king for a disease.
2019-04-13 18:57:14.567532 Person B 2019-04-13 18:57:14.567532 2019-04-13 18:57:55.567532 41 The king hacks some guy because the sky is green.
2019-04-13 18:59:32.567532 Person D 2019-04-13 18:59:32.567532 2019-04-13 19:01:26.567532 114 A cat with rabies spies on a cat with rabies for a disease.
As each one-minute interval has a different set of top-'n' words, a bar plot didn't seem possible. I settled for a dot plot, which looks confusing because of the y-axis placement of the different words. Is there a better plot for visualising this data?
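On the first problem, one workaround (my own sketch, not part of the notebook) is to skip resample entirely and instead tag each row with every minute bucket it overlaps. The text itself is not split, since there are no word-level timestamps, so a 114 s utterance simply counts toward each minute it touches:
import pandas as pd

def minute_buckets(df):
    rows = []
    for _, row in df.iterrows():
        minute = row['Start Time'].floor('min')  # bucket holding the start
        while minute < row['End Time']:
            rows.append({'Minute': minute, 'Transcript': row['Transcript']})
            minute += pd.Timedelta(minutes=1)
    return pd.DataFrame(rows)

# per_minute = minute_buckets(df).groupby('Minute')['Transcript'].apply(' '.join)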
Also listing the complete code here:
import pandas as pd
import numpy as np
import random
import urllib
import plotly
import plotly.graph_objs as go
from datetime import datetime, timedelta
from collections import Counter
from IPython.core.display import display, HTML
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode()
def printdfhtml(df):
    old_width = pd.get_option('display.max_colwidth')
    pd.set_option('display.max_colwidth', -1)
    display(HTML(df.to_html(index=True)))
    pd.set_option('display.max_colwidth', old_width)

def removeStopwords(wordlist, stopwords):
    return [w for w in wordlist if w not in stopwords]

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

def sortFreqDict(freqdict):
    aux = [(freqdict[key], key) for key in freqdict]
    aux.sort()
    aux.reverse()
    return aux

def sortDictKeepTopN(freqdict, keepN):
    return dict(Counter(freqdict).most_common(keepN))

def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist, wordfreq))
s_nouns = ["A dude", "My bat", "The king", "Some guy", "A cat with rabies", "A sloth", "Your homie", "This cool guy my gardener met yesterday", "Superman"]
p_nouns = ["These dudes", "Both of my cars", "All the kings of the world", "Some guys", "All of a cattery's cats", "The multitude of sloths living under your bed", "Your homies", "Like, these, like, all these people", "Supermen"]
s_verbs = ["eats", "kicks", "gives", "treats", "meets with", "creates", "hacks", "configures", "spies on", "retards", "meows on", "flees from", "tries to automate", "explodes"]
p_verbs = ["eat", "kick", "give", "treat", "meet with", "create", "hack", "configure", "spy on", "retard", "meow on", "flee from", "try to automate", "explode"]
infinitives = ["to make a pie.", "for no apparent reason.", "because the sky is green.", "for a disease.", "to be able to make toast explode.", "to know more about archeology."]
people = ["Person A","Person B","Person C","Person D"]
start_time = datetime.now() - timedelta(minutes = 10)
complete_transcript = pd.DataFrame(columns=['Name','Start Time','End Time','Duration','Transcript'])
for i in range(1, 10):
    start_time = start_time + timedelta(seconds=random.randint(10, 240))  # random delay bw ppl talking, 10 sec to 4 mins
    curr_transcript = " ".join([random.choice(s_nouns), random.choice(s_verbs), random.choice(s_nouns).lower() or random.choice(p_nouns).lower(), random.choice(infinitives)])
    talk_duration = random.randint(5, 120)  # 5 sec to 2 min talk
    end_time = start_time + timedelta(seconds=talk_duration)
    complete_transcript.loc[i] = [random.choice(people),
                                  start_time,
                                  end_time,
                                  talk_duration,
                                  curr_transcript]
df = complete_transcript.copy()
df = df.sort_values(['Start Time'])
df.index = df['Start Time']
printdfhtml(df)
re_df = df.copy()
re_df = re_df.drop("Name", axis=1)
re_df = re_df.drop("End Time", axis=1)
re_df = re_df.drop("Start Time", axis=1)
re_df = re_df.resample('60S').sum()
printdfhtml(re_df)
stopwords = ['a', 'about', 'above', 'across', 'after', 'afterwards']
stopwords += ['again', 'against', 'all', 'almost', 'alone', 'along']
stopwords += ['already', 'also', 'although', 'always', 'am', 'among']
stopwords += ['amongst', 'amoungst', 'amount', 'an', 'and', 'another']
stopwords += ['any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere']
stopwords += ['are', 'around', 'as', 'at', 'back', 'be', 'became']
stopwords += ['because', 'become', 'becomes', 'becoming', 'been']
stopwords += ['before', 'beforehand', 'behind', 'being', 'below']
stopwords += ['beside', 'besides', 'between', 'beyond', 'bill', 'both']
stopwords += ['bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant']
stopwords += ['co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de']
stopwords += ['describe', 'detail', 'did', 'do', 'done', 'down', 'due']
stopwords += ['during', 'each', 'eg', 'eight', 'either', 'eleven', 'else']
stopwords += ['elsewhere', 'empty', 'enough', 'etc', 'even', 'ever']
stopwords += ['every', 'everyone', 'everything', 'everywhere', 'except']
stopwords += ['few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first']
stopwords += ['five', 'for', 'former', 'formerly', 'forty', 'found']
stopwords += ['four', 'from', 'front', 'full', 'further', 'get', 'give']
stopwords += ['go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her']
stopwords += ['here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers']
stopwords += ['herself', 'him', 'himself', 'his', 'how', 'however']
stopwords += ['hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed']
stopwords += ['interest', 'into', 'is', 'it', 'its', 'itself', 'keep']
stopwords += ['last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made']
stopwords += ['many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine']
stopwords += ['more', 'moreover', 'most', 'mostly', 'move', 'much']
stopwords += ['must', 'my', 'myself', 'name', 'namely', 'neither', 'never']
stopwords += ['nevertheless', 'next', 'nine', 'no', 'nobody', 'none']
stopwords += ['noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of']
stopwords += ['off', 'often', 'on','once', 'one', 'only', 'onto', 'or']
stopwords += ['other', 'others', 'otherwise', 'our', 'ours', 'ourselves']
stopwords += ['out', 'over', 'own', 'part', 'per', 'perhaps', 'please']
stopwords += ['put', 'rather', 're', 's', 'same', 'see', 'seem', 'seemed']
stopwords += ['seeming', 'seems', 'serious', 'several', 'she', 'should']
stopwords += ['show', 'side', 'since', 'sincere', 'six', 'sixty', 'so']
stopwords += ['some', 'somehow', 'someone', 'something', 'sometime']
stopwords += ['sometimes', 'somewhere', 'still', 'such', 'system', 'take']
stopwords += ['ten', 'than', 'that', 'the', 'their', 'them', 'themselves']
stopwords += ['then', 'thence', 'there', 'thereafter', 'thereby']
stopwords += ['therefore', 'therein', 'thereupon', 'these', 'they']
stopwords += ['thick', 'thin', 'third', 'this', 'those', 'though', 'three']
stopwords += ['three', 'through', 'throughout', 'thru', 'thus', 'to']
stopwords += ['together', 'too', 'top', 'toward', 'towards', 'twelve']
stopwords += ['twenty', 'two', 'un', 'under', 'until', 'up', 'upon']
stopwords += ['us', 'very', 'via', 'was', 'we', 'well', 'were', 'what']
stopwords += ['whatever', 'when', 'whence', 'whenever', 'where']
stopwords += ['whereafter', 'whereas', 'whereby', 'wherein', 'whereupon']
stopwords += ['wherever', 'whether', 'which', 'while', 'whither', 'who']
stopwords += ['whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with']
stopwords += ['within', 'without', 'would', 'yet', 'you', 'your']
stopwords += ['yours', 'yourself', 'yourselves','']
x_trace = np.linspace(1, len(re_df.index), len(re_df.index))
n_top_words = 3
y_trace1 = []
y_trace2 = []
y_trace3 = []
for index, row in re_df.iterrows():
    str_to_check = str(row['Transcript']).lower()
    if (str_to_check != '0') and (str_to_check != ''):
        print('-----------------------------')
        wordlist = stripNonAlphaNum(str_to_check)
        wordlist = removeStopwords(wordlist, stopwords)
        dictionary = wordListToFreqDict(wordlist)
        print('text: ', str_to_check)
        print('words dropped dict: ', dictionary)
        sorteddict = sortDictKeepTopN(dictionary, n_top_words)
        cnt = 0
        for s in sorteddict:
            print(str(s))
            if cnt == 0:
                y_trace1.append(s)
            elif cnt == 1:
                y_trace2.append(s)
            elif cnt == 2:
                y_trace3.append(s)
            cnt += 1
trace1 = {"x": x_trace,
          "y": y_trace1,
          "marker": {"color": "pink", "size": 12},
          "mode": "markers",
          "name": "1st",
          "type": "scatter"}
trace2 = {"x": x_trace,
          "y": y_trace2,
          "marker": {"color": "blue", "size": 12},
          "mode": "markers",
          "name": "2nd",
          "type": "scatter"}
trace3 = {"x": x_trace,
          "y": y_trace3,
          "marker": {"color": "grey", "size": 12},
          "mode": "markers",
          "name": "3rd",
          "type": "scatter"}
data = [trace3, trace2, trace1]
layout = {"title": "Most Frequent Words per Minute",
          "xaxis": {"title": "Time (in Minutes)"},
          "yaxis": {"title": "Words"}}
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

AttributeError: 'list' object has no attribute 'isdigit'. Specifying POS of each and every word in sentences list efficiently?

Suppose I have lists of lists of sentences (from a large corpus) as collections of tokenized words. The format of tokenized_raw_data is as follows:
[['arxiv', ':', 'astro-ph/9505066', '.'], ['seds', 'page', 'on', '``',
'globular', 'star', 'clusters', "''", 'douglas', 'scott', '``', 'independent',
'age', 'estimates', "''", 'krysstal', '``', 'the', 'scale', 'of', 'the',
'universe', "''", 'space', 'and', 'time', 'scaled', 'for', 'the', 'beginner',
'.'], ['icosmos', ':', 'cosmology', 'calculator', '(', 'with', 'graph',
'generation', ')', 'the', 'expanding', 'universe', '(', 'american',
'institute', 'of', 'physics', ')']]
I want to apply pos_tag. What I have tried so far is as follows.
import os, nltk, re
from nltk.corpus import stopwords
from unidecode import unidecode
from nltk.tokenize import word_tokenize, sent_tokenize, wordpunct_tokenize
from nltk.tag import pos_tag

def read_data():
    global tokenized_raw_data
    with open("path//merge_text_results_pu.txt", 'r', encoding='utf-8', errors='replace') as f:
        raw_data = f.read()
    tokenized_raw_data = '\n'.join(nltk.line_tokenize(raw_data))
read_data()

def function1():
    tokens_sentences = sent_tokenize(tokenized_raw_data.lower())
    unfiltered_tokens = [[word for word in word_tokenize(sentence)] for sentence in tokens_sentences]
    tagged_tokens = nltk.pos_tag(unfiltered_tokens)
    nouns = [word.encode('utf-8') for word, pos in tagged_tokens
             if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]
    joined_nouns_text = (' '.join(map(bytes.decode, nouns))).strip()
    noun_tokens = [t for t in wordpunct_tokenize(joined_nouns_text)]
    stop_words = set(stopwords.words("english"))
function1()
I am getting the following error.
> AttributeError: 'list' object has no attribute 'isdigit'
Please help me overcome this error in a time-efficient manner. Where am I going wrong?
Note: I am using Python 3.7 on Windows 10.
Try this:
word_list = []
for i in range(len(unfiltered_tokens)):
    word_list.append([])
for i in range(len(unfiltered_tokens)):
    for word in unfiltered_tokens[i]:
        if word[1:].isalpha():
            word_list[i].append(word[1:])
and then do:
tagged_tokens = []
for token in word_list:
    tagged_tokens.append(nltk.pos_tag(token))
You will get your desired results! Hope this helped.
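For reference, the AttributeError itself comes from passing a list of token lists to nltk.pos_tag, which expects one flat list of strings. NLTK also provides pos_tag_sents, which tags a list of tokenized sentences in a single call, and the NLTK docs recommend it as the efficient way to tag more than one sentence. A minimal sketch:
import nltk

sentences = [['arxiv', ':', 'astro-ph/9505066', '.'],
             ['the', 'expanding', 'universe']]
tagged = nltk.pos_tag_sents(sentences)  # one list of (word, tag) pairs per sentence
nouns = [word for sent in tagged
         for word, pos in sent
         if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
print(nouns)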

Python Identifier Identification

I'm reading a Python file in a Python program and I want to get a list of all identifiers, literals, separators and terminators in the file being read. Using identifiers as an example:
one_var = "something"
two_var = "something else"
other_var = "something different"
Assuming the variables above are in the file being read, the result should be:
list_of_identifiers = [one_var, two_var, other_var]
Same thing goes for literals, terminators and separators. Thanks
I already wrote code for all operators and keywords:
import keyword, operator

list_of_operators = []
list_of_keywords = []
more_operators = ['+', '-', '/', '*', '%', '**', '//', '==', '!=', '>', '<', '>=', '<=', '=', '+=', '-=', '*=', '/=', '%=', '**=', '//=', '&', '|', '^', '~', '<<', '>>', 'in', 'not in', 'is', 'is not', 'not', 'or', 'and']
with open('file.py') as data_source:
    for each_line in data_source:
        new_string = str(each_line).split(' ')
        for each_word in new_string:
            if each_word in keyword.kwlist:
                list_of_keywords.append(each_word)
            elif each_word in operator.__all__ or each_word in more_operators:
                list_of_operators.append(each_word)
print("Operators found:\n", list_of_operators)
print("Keywords found:\n", list_of_keywords)
You can let the ast module do the parsing instead of splitting lines yourself; every identifier appears as an ast.Name node:
import ast

with open('file.py') as data_source:
    ast_root = ast.parse(data_source.read())
identifiers = set()
for node in ast.walk(ast_root):
    if isinstance(node, ast.Name):  # covers both loads and stores of names
        identifiers.add(node.id)
print(identifiers)
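For the three assignments in the question, this prints {'one_var', 'two_var', 'other_var'} (set order varies). The same walk can collect string literals too; on Python 3.8+ they appear as ast.Constant nodes, so a sketch along the same lines would be:
literals = {node.value for node in ast.walk(ast_root)
            if isinstance(node, ast.Constant) and isinstance(node.value, str)}
print(literals)  # {'something', 'something else', 'something different'}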
