How to plot per-minute word frequency from a Python dataframe

I have a dataframe constructed from the voice transcriptions of multiple audio files (one per person):
# Name       Start_Time            Duration   Transcript
# Person A   12:12:2018 12:12:00   3.5        Transcript from Person A
# Person B   12:12:2018 12:14:00   5.5        Transcript from Person B
# ...
# Person N   12:12:2018 13:00:00   9.0        Transcript from Person N
Is there a way to:
1. Find the 'n' most frequent words spoken per 'x' minutes of the complete conversation?
2. Plot the 'n' most frequent words per 'x' minutes over the complete duration of the conversation?
For part 2, would a per-'x'-minute bar plot work, with the height of each bar scaled to the sum of the occurrences of the 'n' most common words? Or is there a more intuitive way of showing this information graphically?
EDIT:
I am attaching a basic, minimal IPython notebook of what I have right now:
IPython notebook
Problems:
Resampling the dataframe per 60 s doesn't do the trick, as some of the conversations can be longer than 60 seconds; e.g. the first and fourth rows in the dataframe below last 114 seconds each. I am not sure whether these can be split into exact 60 s chunks, and even if they can, the split may cascade into the next one-minute time slot and push its duration over 60 s, as with the first and second rows in the dataframe below. (A rough sketch of the proportional split I have in mind is shown after the table.)
Start Time (index)          Name      Start Time                  End Time                    Duration  Transcript
2019-04-13 18:51:22.567532  Person A  2019-04-13 18:51:22.567532  2019-04-13 18:53:16.567532  114       A dude meows on this cool guy my gardener met yesterday for no apparent reason.
2019-04-13 18:53:24.567532  Person D  2019-04-13 18:53:24.567532  2019-04-13 18:54:05.567532  41        Your homie flees from the king for a disease.
2019-04-13 18:57:14.567532  Person B  2019-04-13 18:57:14.567532  2019-04-13 18:57:55.567532  41        The king hacks some guy because the sky is green.
2019-04-13 18:59:32.567532  Person D  2019-04-13 18:59:32.567532  2019-04-13 19:01:26.567532  114       A cat with rabies spies on a cat with rabies for a disease.
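Here is the rough sketch of the proportional split I mean; this is an assumption on my part, not something I have working. Each row is exploded into one piece per minute bin it overlaps, and the transcript's words are allotted to the pieces in proportion to the seconds spent inside each bin. Column names match my dataframe above; the word-allocation rule is an arbitrary choice, and it assumes a positive duration.

import pandas as pd

def explode_into_minutes(row):
    # Split one utterance into per-minute pieces, allotting words in
    # proportion to the seconds spent inside each minute bin.
    words = str(row['Transcript']).split()
    start = pd.Timestamp(row['Start Time'])
    end = pd.Timestamp(row['End Time'])
    total = (end - start).total_seconds()
    pieces, cursor, used = [], start, 0
    while cursor < end:
        # end of the current minute bin, clipped to the utterance end
        bin_end = min(cursor.floor('min') + pd.Timedelta(minutes=1), end)
        elapsed = (bin_end - start).total_seconds()
        take = round(len(words) * elapsed / total)  # words spoken up to this point
        pieces.append({'Name': row['Name'],
                       'Start Time': cursor,
                       'End Time': bin_end,
                       'Duration': (bin_end - cursor).total_seconds(),
                       'Transcript': ' '.join(words[used:take])})
        used, cursor = take, bin_end
    return pieces

# exploded = pd.DataFrame([p for _, r in df.iterrows()
#                          for p in explode_into_minutes(r)])
# exploded.set_index('Start Time').resample('60S')['Transcript'].agg(' '.join)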
As each minute interval has a different set of top-'n' frequency words, the bar plot didn't seem possible. I settled for a dot plot, which looks confusing because different words end up at different y-axis heights. Is there a better plot to visualise this data? (One idea I had is a heatmap of word counts per minute; a sketch follows the code below.)
Also listing the complete code here:
import pandas as pd
import numpy as np   # np is used below for np.linspace; this import was missing
import random
import urllib
import plotly
import plotly.graph_objs as go
from datetime import datetime, timedelta
from collections import Counter
from IPython.core.display import display, HTML
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()

def printdfhtml(df):
    old_width = pd.get_option('display.max_colwidth')
    pd.set_option('display.max_colwidth', -1)
    display(HTML(df.to_html(index=True)))
    pd.set_option('display.max_colwidth', old_width)

def removeStopwords(wordlist, stopwords):
    return [w for w in wordlist if w not in stopwords]

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

def sortFreqDict(freqdict):
    aux = [(freqdict[key], key) for key in freqdict]
    aux.sort()
    aux.reverse()
    return aux

def sortDictKeepTopN(freqdict, keepN):
    return dict(Counter(freqdict).most_common(keepN))

def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist, wordfreq))
s_nouns = ["A dude", "My bat", "The king", "Some guy", "A cat with rabies", "A sloth", "Your homie", "This cool guy my gardener met yesterday", "Superman"]
p_nouns = ["These dudes", "Both of my cars", "All the kings of the world", "Some guys", "All of a cattery's cats", "The multitude of sloths living under your bed", "Your homies", "Like, these, like, all these people", "Supermen"]
s_verbs = ["eats", "kicks", "gives", "treats", "meets with", "creates", "hacks", "configures", "spies on", "retards", "meows on", "flees from", "tries to automate", "explodes"]
p_verbs = ["eat", "kick", "give", "treat", "meet with", "create", "hack", "configure", "spy on", "retard", "meow on", "flee from", "try to automate", "explode"]
infinitives = ["to make a pie.", "for no apparent reason.", "because the sky is green.", "for a disease.", "to be able to make toast explode.", "to know more about archeology."]
people = ["Person A","Person B","Person C","Person D"]
start_time = datetime.now() - timedelta(minutes = 10)
complete_transcript = pd.DataFrame(columns=['Name','Start Time','End Time','Duration','Transcript'])
for i in range(1, 10):
    start_time = start_time + timedelta(seconds=random.randint(10, 240))  # random delay between people talking: 10 sec to 4 min
    curr_transcript = " ".join([random.choice(s_nouns),
                                random.choice(s_verbs),
                                random.choice(s_nouns).lower() or random.choice(p_nouns).lower(),
                                random.choice(infinitives)])
    talk_duration = random.randint(5, 120)  # 5 sec to 2 min talk
    end_time = start_time + timedelta(seconds=talk_duration)
    complete_transcript.loc[i] = [random.choice(people),
                                  start_time,
                                  end_time,
                                  talk_duration,
                                  curr_transcript]
df = complete_transcript.copy()
df = df.sort_values(['Start Time'])
df.index=df['Start Time']
printdfhtml(df)
re_df = df.copy()
re_df = re_df.drop("Name", axis=1)
re_df = re_df.drop("End Time", axis=1)
re_df = re_df.drop("Start Time", axis=1)
re_df = re_df.resample('60S').sum()
printdfhtml(re_df)
stopwords = ['a', 'about', 'above', 'across', 'after', 'afterwards']
stopwords += ['again', 'against', 'all', 'almost', 'alone', 'along']
stopwords += ['already', 'also', 'although', 'always', 'am', 'among']
stopwords += ['amongst', 'amoungst', 'amount', 'an', 'and', 'another']
stopwords += ['any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere']
stopwords += ['are', 'around', 'as', 'at', 'back', 'be', 'became']
stopwords += ['because', 'become', 'becomes', 'becoming', 'been']
stopwords += ['before', 'beforehand', 'behind', 'being', 'below']
stopwords += ['beside', 'besides', 'between', 'beyond', 'bill', 'both']
stopwords += ['bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant']
stopwords += ['co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de']
stopwords += ['describe', 'detail', 'did', 'do', 'done', 'down', 'due']
stopwords += ['during', 'each', 'eg', 'eight', 'either', 'eleven', 'else']
stopwords += ['elsewhere', 'empty', 'enough', 'etc', 'even', 'ever']
stopwords += ['every', 'everyone', 'everything', 'everywhere', 'except']
stopwords += ['few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first']
stopwords += ['five', 'for', 'former', 'formerly', 'forty', 'found']
stopwords += ['four', 'from', 'front', 'full', 'further', 'get', 'give']
stopwords += ['go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her']
stopwords += ['here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers']
stopwords += ['herself', 'him', 'himself', 'his', 'how', 'however']
stopwords += ['hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed']
stopwords += ['interest', 'into', 'is', 'it', 'its', 'itself', 'keep']
stopwords += ['last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made']
stopwords += ['many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine']
stopwords += ['more', 'moreover', 'most', 'mostly', 'move', 'much']
stopwords += ['must', 'my', 'myself', 'name', 'namely', 'neither', 'never']
stopwords += ['nevertheless', 'next', 'nine', 'no', 'nobody', 'none']
stopwords += ['noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of']
stopwords += ['off', 'often', 'on','once', 'one', 'only', 'onto', 'or']
stopwords += ['other', 'others', 'otherwise', 'our', 'ours', 'ourselves']
stopwords += ['out', 'over', 'own', 'part', 'per', 'perhaps', 'please']
stopwords += ['put', 'rather', 're', 's', 'same', 'see', 'seem', 'seemed']
stopwords += ['seeming', 'seems', 'serious', 'several', 'she', 'should']
stopwords += ['show', 'side', 'since', 'sincere', 'six', 'sixty', 'so']
stopwords += ['some', 'somehow', 'someone', 'something', 'sometime']
stopwords += ['sometimes', 'somewhere', 'still', 'such', 'system', 'take']
stopwords += ['ten', 'than', 'that', 'the', 'their', 'them', 'themselves']
stopwords += ['then', 'thence', 'there', 'thereafter', 'thereby']
stopwords += ['therefore', 'therein', 'thereupon', 'these', 'they']
stopwords += ['thick', 'thin', 'third', 'this', 'those', 'though', 'three']
stopwords += ['three', 'through', 'throughout', 'thru', 'thus', 'to']
stopwords += ['together', 'too', 'top', 'toward', 'towards', 'twelve']
stopwords += ['twenty', 'two', 'un', 'under', 'until', 'up', 'upon']
stopwords += ['us', 'very', 'via', 'was', 'we', 'well', 'were', 'what']
stopwords += ['whatever', 'when', 'whence', 'whenever', 'where']
stopwords += ['whereafter', 'whereas', 'whereby', 'wherein', 'whereupon']
stopwords += ['wherever', 'whether', 'which', 'while', 'whither', 'who']
stopwords += ['whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with']
stopwords += ['within', 'without', 'would', 'yet', 'you', 'your']
stopwords += ['yours', 'yourself', 'yourselves','']
x_trace = np.linspace(1,len(re_df.index),len(re_df.index))
n_top_words = 3
y_trace1 = []
y_trace2 = []
y_trace3 = []
for index, row in re_df.iterrows():
    str_to_check = str(row['Transcript']).lower()
    if (str_to_check != '0') and (str_to_check != ''):
        print('-----------------------------')
        wordlist = stripNonAlphaNum(str_to_check)
        wordlist = removeStopwords(wordlist, stopwords)
        dictionary = wordListToFreqDict(wordlist)
        print('text: ', str_to_check)
        print('words dropped dict: ', dictionary)
        sorteddict = sortDictKeepTopN(dictionary, n_top_words)
        cnt = 0
        for s in sorteddict:
            print(str(s))
            if cnt == 0:
                y_trace1.append(s)
            elif cnt == 1:
                y_trace2.append(s)
            elif cnt == 2:
                y_trace3.append(s)
            cnt += 1
trace1 = {"x": x_trace,
"y": y_trace1,
"marker": {"color": "pink", "size": 12},
"mode": "markers",
"name": "1st",
"type": "scatter"
}
trace2 = {"x": x_trace,
"y": y_trace2,
"marker": {"color": "blue", "size": 12},
"mode": "markers",
"name": "2nd",
"type": "scatter",
}
trace3 = {"x": x_trace,
"y": y_trace3,
"marker": {"color": "grey", "size": 12},
"mode": "markers",
"name": "3rd",
"type": "scatter",
}
data = [trace3, trace2, trace1]
layout = {"title": "Most Frequent Words per Minute",
"xaxis": {"title": "Time (in Minutes)", },
"yaxis": {"title": "Words"}}
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
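And here is the heatmap idea mentioned above, as a rough sketch rather than a finished plot. It reuses re_df, the stopword list, and the helper functions from the code above; rows are words, columns are minute bins, and cell colour encodes the count, so the per-minute top words can be read off without the confusing y-axis heights of the dot plot.

all_counts = {}  # word -> {minute index -> count}
for minute, (_, row) in enumerate(re_df.iterrows(), start=1):
    text = str(row['Transcript']).lower()
    if text not in ('0', ''):
        wordlist = removeStopwords(stripNonAlphaNum(text), stopwords)
        for word, count in wordListToFreqDict(wordlist).items():
            all_counts.setdefault(word, {})[minute] = count

minutes = list(range(1, len(re_df.index) + 1))
words_axis = sorted(all_counts)
z = [[all_counts[w].get(m, 0) for m in minutes] for w in words_axis]
heat_fig = go.Figure(data=[go.Heatmap(z=z, x=minutes, y=words_axis,
                                      colorscale='Blues')],
                     layout={"title": "Word Frequency per Minute",
                             "xaxis": {"title": "Time (in Minutes)"},
                             "yaxis": {"title": "Words"}})
iplot(heat_fig)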

Related

Select all strings in a list and append a string

I want to add "." to the end of any item of the doing variable,
and I want output like this => I am watching.
import random
main = ['I', 'You', 'He', 'She']
main0 = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
rd = random.choice(main)
rd0 = random.choice(doing)
result = []
'''
if rd == main[0]:
    result.append(rd)
    result.append(main0[0])
if rd == main[1]:
    result.append(rd)
    result.append(main0[1])
if rd == main[2]:
    result.append(rd)
    result.append(main0[2])
if rd == main[3]:
    result.append(rd)
    result.append(main0[3])
'''
result.append(rd0)
print(result)
Well, I tried these:
'.'.append(doing)
'.'.append(doing[0])
'.'.append(rd0)
but none of them works; each just returns this error:
Traceback (most recent call last):
File "c:\Users\----\Documents\Codes\Testing\s.py", line 21, in <module>
'.'.append(rd0)
AttributeError: 'str' object has no attribute 'append'
Why not just select a random string from each list and then build the output as you want:
import random
main = ['I', 'You', 'He', 'She']
verb = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
p1 = random.choice(main)
p2 = random.choice(verb)
p3 = random.choice(doing)
output = ' '.join([p1, p2, p3]) + '.'
print(output) # You is playing.
Bad English, but the logic seems to be what you want here.
Just add a period after making the string:
import random
main = ['I', 'You', 'He', 'She']
main0 = ['am', 'are', 'is', 'is']
doing = ['playing', 'watching', 'reading', 'listening']
' '.join([random.choice(i) for i in [main, main0, doing]]) + '.'

Find most common word in a list of sets

I'm currently working on my university project in NLP. I'd like to display the most common words contained in this list of sets:
[{'allow', 'feel', 'fear', 'situat', 'properti', 'despit', 'face', 'ani'}, {'unpleas', 'someth', 'fear', 'make', 'abil', 'face', 'scar', 'us', 'feel'}]
This is what I've accomplished until now:
def word_list(sent):
    if isinstance(sent, str):
        tokens = set(word_tokenize(sent.lower()))
    else:
        tokens = set([t for s in sent for t in word_tokenize(s.lower())])
    tokens = set([stemmer.stem(t) for t in tokens])
    for w in stopword_final:
        tokens.discard(w)
    return tokens

def get_most_relevant_words(definitions):
    list_of_words = list()
    most_common_word_dict = dict()
    for d1 in definitions:
        list_of_words.append(word_list(d1))
    for elem in list_of_words:
        for word in elem:
            print(word)
            word_counter = Counter(word)
            most_occurrences = word_counter.most_common(3)
            most_common_word_dict.update({word: most_occurrences})
    return most_common_word_dict
The desired output should be: {fear: 2, feel: 2}
The output that it prints is: {'feel': [('e', 2), ('f', 1), ('l', 1)]}
Use collections.Counter:
from collections import Counter
list_of_sets = [{'allow', 'feel', 'fear', 'situat', 'properti', 'despit', 'face', 'ani'}, {'unpleas', 'someth', 'fear', 'make', 'abil', 'face', 'scar', 'us', 'feel'}]
words = [word for my_set in list_of_sets for word in my_set]
c = Counter(words)
print(c)
output:
Counter({
'fear': 2,
'face': 2,
'feel': 2,
'properti': 1,
'despit': 1,
'allow': 1,
'situat': 1,
'ani': 1,
'someth': 1,
'unpleas': 1,
'make': 1,
'abil': 1,
'us': 1,
'scar': 1
})
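If you only want the words that occur more than once (matching your desired output, plus 'face'), you can filter the Counter afterwards, e.g.:
result = {word: count for word, count in c.items() if count > 1}
print(result)  # {'fear': 2, 'face': 2, 'feel': 2}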
You can simply iterate through the 2 sets, find common terms, and update the count in a dictionary. By the way, 'face' should also be included in your result.
lst = [{'allow', 'feel', 'fear', 'situat', 'properti', 'despit', 'face', 'ani'}, {'unpleas', 'someth', 'fear', 'make', 'abil', 'face', 'scar', 'us', 'feel'}]
dic = {}
for word1 in lst[0]:
    for word2 in lst[1]:
        if word1 == word2:
            dic[word1] = dic.get(word1, 0) + 2
print(dic)
# {'fear': 2, 'feel': 2, 'face': 2}

I want to find the length of each word in a text file

I'm trying to find the length of each word in my text file. I tried the following code, but it shows the count of each word, i.e. how many times the word is used in the file.
text = open(r"C:\Users\israr\Desktop\counter\Bigdata.txt")
d = dict()
for line in text:
line = line.strip()
line = line.lower()
words = line.split(" ")
for word in words:
if word in d:
d[word] = d[word] + 1
else:
# Add the word to dictionary with count 1
d[word] = 1
for key in list(d.keys()):
print(key, ":", d[key])
And the output is something like this
china : 14
emerged : 1
as : 16
one : 5
of : 44
the : 108
world's : 7
first : 2
civilizations, : 1
in : 26
fertile : 1
basin : 1
yellow : 1
river : 1
north : 1
plain. : 1
Basically, I want lists of words grouped by length. For example china, first, world : 5, where 5 is the length of all of these words, and so on, with words of a different length in another list.
If you need all words' total lengths separately, you can find them using this formula:
len(word) * count(word) for every word in words
equivalent in Python:
d[key] * len(key)
Change the last 2 lines to:
for key in list(d.keys()):
    print(key, ":", d[key] * len(key))
----EDIT----
This is what you asked in the comments, I guess. The code below gives you groups whose members are the same length.
for word in words:
    if len(word) in d:
        if word not in d[len(word)]:
            d[len(word)].append(word)
    else:
        # Start a new length group containing this word
        d[len(word)] = [word]
for key in list(d.keys()):
    print(key, ":", d[key])
Output of this code:
3 : ['the', 'bc,', '(c.', 'who', 'was', '100', 'bc)', 'and', 'xia', 'but', 'not', 'one', 'due', '8th', '221', 'qin', 'shi', 'for', 'his', 'han', '220', '206', 'has', 'war', 'all', 'far']
8 : ['earliest', 'describe', 'writings', 'indicate', 'commonly', 'however,', 'cultural', 'history,', 'regarded', 'external', 'internal', 'culture,', 'troubled', 'imperial', 'selected', 'replaced', 'republic', 'mainland', "people's", 'peoples,', 'multiple', 'kingdoms', 'xinjiang', 'present.', '(carried']
5 : ['known', 'china', 'early', 'shang', 'texts', 'grand', 'ruled', 'river', 'which', 'along', 'these', 'arose', 'years', 'their', 'rule.', 'began', 'first', 'those', 'huang', 'title', 'after', 'until', '1912,', 'tasks', 'elite', 'young', '1949.', 'unity', 'being', 'civil', 'parts', 'other', 'world', 'waves', 'basis']
7 : ['written', 'records', 'history', 'dynasty', 'ancient', 'century', 'mention', 'writing', 'period,', 'xia.[5]', 'valley,', 'chinese', 'various', 'centers', 'yangtze', "world's", 'cradles', 'concept', 'mandate', 'justify', 'central', 'country', 'smaller', 'period.', 'another', 'warring', 'created', 'himself', 'huangdi', 'marking', 'systems', 'enabled', 'emperor', 'control', 'routine', 'handled', 'special', 'through', "china's", 'between', 'periods', 'culture', 'western', 'foreign']
2 : ['of', 'as', 'wu', 'by', 'no', 'is', 'do', 'in', 'to', 'be', 'at', 'or', 'bc', '21', 'ad']
4 : ['date', 'from', '1250', 'bc),', 'king', 'such', 'book', '11th', '(296', 'held', 'both', 'with', 'zhou', 'into', 'much', 'qin,', 'fell', 'soon', '(206', 'ad).', 'that', 'vast', 'were', 'men,', 'last', 'qing', 'then', 'most', 'whom', 'eras', 'have', 'some', 'asia', 'form']
9 : ['1600–1046', 'mentioned', 'documents', 'chapters,', 'historian', '2070–1600', 'existence', 'neolithic', 'millennia', 'thousands', '(1046–256', 'pressures', 'following', 'developed', 'conquered', '"emperor"', 'beginning', 'dynasties', 'directly.', 'centuries', 'carefully', 'difficult', 'political', 'dominated', 'stretched', 'contact),']
6 : ['during', "ding's", '(early', 'bamboo', 'annals', 'before', 'shang,', 'yellow', 'cradle', 'river.', 'shang.', 'oldest', 'heaven', 'weaken', 'states', 'spring', 'autumn', 'became', 'warred', 'times.', 'china.', 'death,', 'peace,', 'failed', 'recent', 'steppe', 'china;', 'tibet,', 'modern']
12 : ['reign,[1][2]', 'twenty-first', 'longer-lived', 'bureaucratic', 'calligraphy,', '(1644–1912),', '(1927–1949).', 'occasionally', 'immigration,']
11 : ['same.[3][4]', 'independent', 'traditional', 'territories', 'well-versed', 'literature,', 'philosophy,', 'assimilated', 'population.', 'warlordism,']
10 : ['historical', 'originated', 'continuous', 'supplanted', 'introduced', 'government', 'eventually', 'splintered', 'literature', 'philosophy', 'oppressive', 'successive', 'alternated', 'influences', 'expansion,']
1 : ['a', '–']
13 : ['civilization.', 'civilizations', 'examinations.', 'statehood—the', 'assimilation,']
17 : ['civilizations,[6]']
16 : ['civilization.[7]']
0 : ['']
14 : ['administrative']
18 : ['scholar-officials.']
Below is the full version of the code.
text = open("bigdata.txt")
d = dict()
for line in text:
line = line.strip()
line = line.lower()
words = line.split(" ")
for word in words:
if len(word) in d:
if word not in d[len(word)]:
d[len(word)].append(word)
else:
d[len(word)] = [word]
for key in list(d.keys()):
print(key, ":", d[key])
When you look at the code for dealing with each word, you will see your problem:
for word in words:
    if word in d:
        d[word] = d[word] + 1
    else:
        # Add the word to dictionary with count 1
        d[word] = 1
Here you are checking if a word is in the dictionary. If it is, 1 is added to its value each time the word is found; if it is not, it is initialized at 1. This is the core concept for counting repetitions.
If you instead want to store the length of each word, you could simply do:
for word in words:
    if word not in d:
        d[word] = len(word)
And to output your dict, you can do:
for k, v in d.items():
    print(k, ":", v)
You can create a list of word lengths and then process them through Python's built-in Counter:
from collections import Counter

with open("mytext.txt", "r") as f:
    words = f.read().split()
words_lengths = [len(word) for word in words]
counter = Counter(words_lengths)
The output would be something like:
In[1]:counter
Out[1]:Counter({7: 146, 9: 73, 5: 73, 4: 146, 1: 73})
Where keys are words lengths, and values are the number of times they occurred.
You can work with it like a usual dictionary.

AttributeError: 'list' object has no attribute 'isdigit'. Specifying POS of each and every word in sentences list efficiently?

Suppose I have lists of lists of sentences (in a large corpus) as collections of tokenized words. The format of tokenized_raw_data is as follows:
[['arxiv', ':', 'astro-ph/9505066', '.'], ['seds', 'page', 'on', '``',
'globular', 'star', 'clusters', "''", 'douglas', 'scott', '``', 'independent',
'age', 'estimates', "''", 'krysstal', '``', 'the', 'scale', 'of', 'the',
'universe', "''", 'space', 'and', 'time', 'scaled', 'for', 'the', 'beginner',
'.'], ['icosmos', ':', 'cosmology', 'calculator', '(', 'with', 'graph',
'generation', ')', 'the', 'expanding', 'universe', '(', 'american',
'institute', 'of', 'physics', ')']]
I want to apply pos_tag. What I have tried up to now is as follows:
import os, nltk, re
from nltk.corpus import stopwords
from unidecode import unidecode
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag

def read_data():
    global tokenized_raw_data
    with open("path//merge_text_results_pu.txt", 'r', encoding='utf-8', errors='replace') as f:
        raw_data = f.read()
    tokenized_raw_data = '\n'.join(nltk.line_tokenize(raw_data))

read_data()

def function1():
    tokens_sentences = sent_tokenize(tokenized_raw_data.lower())
    unfiltered_tokens = [[word for word in word_tokenize(word)] for word in tokens_sentences]
    tagged_tokens = nltk.pos_tag(unfiltered_tokens)
    nouns = [word.encode('utf-8') for word, pos in tagged_tokens
             if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]
    joined_nouns_text = (' '.join(map(bytes.decode, nouns))).strip()
    noun_tokens = [t for t in wordpunct_tokenize(joined_nouns_text)]
    stop_words = set(stopwords.words("english"))

function1()
I am getting the following error.
> AttributeError: 'list' object has no attribute 'isdigit'
Please help me overcome this error in a time-efficient manner. Where am I going wrong?
Note: I am using Python 3.7 on Windows 10.
Try this:
word_list = []
for i in range(len(unfiltered_tokens)):
    word_list.append([])
for i in range(len(unfiltered_tokens)):
    for word in unfiltered_tokens[i]:
        if word[1:].isalpha():
            word_list[i].append(word[1:])
then after that do:
tagged_tokens = []
for token in word_list:
    tagged_tokens.append(nltk.pos_tag(token))
You will get your desired results! Hope this helped.
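As a side note (not tested against your file), NLTK also provides pos_tag_sents for tagging a whole list of tokenized sentences in one call, which is usually faster than calling pos_tag once per sentence; the original error arose because nltk.pos_tag was given a list of lists instead of a flat list of tokens:
import nltk

# unfiltered_tokens is a list of token lists, one per sentence
tagged_tokens = nltk.pos_tag_sents(unfiltered_tokens)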

Find the start index of a phrase in a list

Assuming I have this list:
text = ['Malte', 'ex', 'precio', 'empcionis', 'bovum', 'septem', 'laborancium', 'et', 'unius', 'thaurj', 'et', 'unius', 'vacce', 'cum', 'vitulo', 'sequenti', 'et', 'pecudum', 'fetancium', 'sexdecim', 'et', 'duarum', 'caprarum', 'cum', 'duobus', 'et', 'cum', 'vitulo']
And I want to find every index of the beginning of 'cum vitulo', i.e. 13 and 26.
At the moment I am getting the start of 'cum', but sometimes it is followed by another word, e.g. 'duobus' in this case.
One way to do it would be like this:
text = ['Malte', 'ex', 'precio', 'empcionis', 'bovum', 'septem', 'laborancium', 'et', 'unius', 'thaurj', 'et', 'unius', 'vacce', 'cum', 'vitulo', 'sequenti', 'et', 'pecudum', 'fetancium', 'sexdecim', 'et', 'duarum', 'caprarum', 'cum', 'duobus', 'et', 'cum', 'vitulo']
target = 'cum vitulo'
target = tuple(target.split())
hits = [i for i, x in enumerate(zip(text, text[1:])) if x == target]
print(hits) # -> [13, 26]
This is the simplest way IMHO (Python 2.7 and 3):
text = ['Malte', 'ex', 'precio', 'empcionis', 'bovum', 'septem', 'laborancium', 'et', 'unius', 'thaurj', 'et', 'unius', 'vacce', 'cum', 'vitulo', 'sequenti', 'et', 'pecudum', 'fetancium', 'sexdecim', 'et', 'duarum', 'caprarum', 'cum', 'duobus', 'et', 'cum', 'vitulo']
result = [i for i, item in enumerate(text[:-1]) if item == 'cum' and text[i+1] == 'vitulo']
print(result) # >>> [13, 26]
text = ['Malte', 'ex', 'precio', 'empcionis', 'bovum', 'septem', 'laborancium', 'et', 'unius', 'thaurj', 'et', 'unius', 'vacce', 'cum', 'vitulo', 'sequenti', 'et', 'pecudum', 'fetancium', 'sexdecim', 'et', 'duarum', 'caprarum', 'cum', 'duobus', 'et', 'cum', 'vitulo']
result = []
for index, value in enumerate(text):
    if value == 'cum':
        try:
            if text[index + 1] == 'vitulo':
                result.append(index)
        except IndexError:  # reached the end of the list
            break
output
result == [13, 26]
A fancy way:
from itertools import tee

# recipe from itertools
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

[index for (index, (first, second)) in enumerate(pairwise(text)) if first == 'cum' and second == 'vitulo']
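If the target phrase may be longer than two words, the same idea generalizes by comparing list slices (a small sketch; find_phrase is my own helper name):
def find_phrase(text, phrase):
    # Return the start index of every occurrence of `phrase`
    # (a list of words) inside `text` (a list of words).
    n = len(phrase)
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == phrase]

print(find_phrase(text, ['cum', 'vitulo']))  # [13, 26]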
