Split file every three words & create lists of triplets - python

I have a txt file with text that I import into Python, and I want to split it up at every 3 words.
For example,
Python is an interpreted, high-level and general-purpose programming language
should become
[['Python', 'is', 'an'],['interpreted,', 'high-level','and'],['general-purpose','programming','language']].
My code so far,
lines = [word.split() for word in open(r"c:\\python\4_TRIPLETS\Sample.txt", "r")]
print(lines)
gives me this output,
[['Python', 'is', 'an', 'interpreted,', 'high-level', 'and', 'general-purpose', 'programming', 'language.', "Python's", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace.', 'Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear,', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects.']]
Any ideas?

Use a list comprehension to convert the list into chunks of n items:
with open(r'c:\python\4_TRIPLETS\Sample.txt', 'r') as file:
    data = file.read().replace('\n', ' ').split()  # replace with a space so words on adjacent lines don't merge
lines = [data[i:i + 3] for i in range(0, len(data), 3)]
print(lines)
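The same idea generalized into a reusable helper, since the comprehension works for any chunk size, not just 3 (a small sketch; the name chunk is my own):
def chunk(seq, n):
    """Split seq into consecutive chunks of n items; the last chunk may be shorter."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

words = 'Python is an interpreted, high-level and general-purpose programming language'.split()
print(chunk(words, 3))
# [['Python', 'is', 'an'], ['interpreted,', 'high-level', 'and'], ['general-purpose', 'programming', 'language']]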

You can split the string to separate each word and then go through the list, grouping the words into chunks of 3.
words = text.split()  # text holds the file contents as one string
final = []
temp = []
for x in words:
    temp.append(x)
    if len(temp) == 3:  # every third word, close the current group
        final.append(temp)
        temp = []
if temp:  # keep any trailing group of fewer than 3 words
    final.append(temp)

Related

How can I split a txt file into a list by word but including commas on the elements

I have a big txt file and I want to split it into a list where every word is an element of the list. I want the commas to be included in the elements, like in the example.
txt file
Hi, my name is Mick and I want to split this with commas included, like this.
list ['Hi,','my','name','is','Mick' etc. ]
Thank you very much for the help
Just use str.split() without any arguments; it splits on runs of whitespace:
value = 'Hi, my name is Mick and I want to split this with commas included, like this.'
res = value.split()
print(res) # ['Hi,', 'my', 'name', 'is', 'Mick', 'and', 'I', 'want', 'to', 'split', 'this', 'with', 'commas', 'included,', 'like', 'this.']
res = [r for r in value.split() if ',' not in r]
print(res) # ['my', 'name', 'is', 'Mick', 'and', 'I', 'want', 'to', 'split', 'this', 'with', 'commas', 'like', 'this.']
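Applied to the text file from the question instead of a string literal (a sketch; the filename is a placeholder):
with open('input.txt') as f:  # placeholder filename
    words = f.read().split()  # commas stay attached to their words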

Structure sentences from words of triplets saved in a 2D list

I currently have a text whose words are saved as triplets in a 2D list.
The code so far:
with open(r'c:\python\4_TRIPLETS\Sample.txt', 'r') as file:
    data = file.read().replace('\n', ' ').split()
lines = [data[i:i + 3] for i in range(0, len(data), 3)]
print(lines)
My 2D List:
[['Python', 'is', 'an'], ['interpreted,', 'high-level', 'and'], ['general-purpose', 'programming', 'language.'], ["Python's", 'design', 'philosophy'], ['emphasizes', 'code', 'readability'], ['with', 'its', 'notable'], ['use', 'of', 'significant'], ['whitespace.', 'Its', 'language'], ['constructs', 'and', 'object-oriented'], ['approach', 'aim', 'to'], ['help', 'programmers', 'write'], ['clear,', 'logical', 'code'], ['for', 'small', 'and'], ['large-scale', 'projects.']]
I want to write Python code which picks one random triplet from these, then tries to create a new random text: using the last 2 words written, it chooses a triplet that starts with those two words. The program ends when 200 words have been written or when no further triplet can be chosen.
Any ideas?
To pick a random triplet:
import random
triplet = random.choice(lines)
last_two = triplet[1:3]
To then continue picking:
while True:
    candidates = [t for t in lines if t[0:2] == last_two]
    if not candidates:
        break
    triplet = random.choice(candidates)
    last_two = triplet[1:3]
I'll leave the saving of output and the length stopping criterion to you.
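For reference, a minimal sketch that puts both pieces together with the 200-word cap; the helper name generate and the choice to append only each matching triplet's third word are my own assumptions:
import random

def generate(lines, max_words=200):
    # start from one random triplet (lines is the 2D list from above)
    words = list(random.choice(lines))
    while len(words) < max_words:
        last_two = words[-2:]
        # full triplets whose first two words match the last two written
        candidates = [t for t in lines if len(t) == 3 and t[:2] == last_two]
        if not candidates:
            break  # no continuation possible
        words.append(random.choice(candidates)[2])  # add only the new word
    return ' '.join(words[:max_words])

print(generate(lines))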

removing common words from a text file

I am trying to remove common words from a text. For example, take the sentence
"It is not a commonplace river, but on the contrary is in all ways remarkable."
I want to reduce it to just the uncommon words. This means removing "it", "but", "a", etc. I have one text file that contains all the common words and another text file that contains a paragraph. How can I delete the common words from the paragraph text file?
For example:
['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
How do I remove the common words from the file efficiently? I have a text file called common.txt that has all the common words listed. How do I use that list to remove identical words in the sentence above? The end output I want is:
['commonplace', 'river', 'contrary', 'remarkable']
Does that make sense?
Thanks.
You would want to use set objects in Python.
If order and number of occurrences are not important:
str_list = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
common_words = ['It', 'is', 'not', 'a', 'but', 'on', 'the', 'in', 'all', 'ways','other_words']
set(str_list) - set(common_words)
>>> {'contrary', 'commonplace', 'river', 'remarkable'}
If both are important:
#Using "set" is so much faster
common_set = set(common_words)
[s for s in str_list if not s in common_set]
>>> ['commonplace', 'river', 'contrary', 'remarkable']
Here's an example that you can use; it keeps only the words that occur fewer than 3 times:
l = text.replace(",", "").replace(".", "").split(" ")
occurs = {}
for word in l:
    occurs[word] = l.count(word)
resultx = ''
for word in occurs.keys():
    if occurs[word] < 3:
        resultx += word + " "
resultx = resultx[:-1]
You can change 3 to whatever threshold you think suits, or base it on the average count using:
sum(occurs.values()) / len(occurs)
Additionally, if you want it to be case insensitive, change the first line to:
l = text.replace(",", "").replace(".", "").lower().split(" ")
The simplest method would be to just read() your common.txt, split it into a list of words, and then use a list comprehension to keep only the words that are not in that list:
with open('common.txt') as f:
    content = f.read().split()
s = ['It', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
res = [i for i in s if i not in content]
print(res)
# ['commonplace', 'river', 'contrary', 'remarkable']
filter also works here
res = list(filter(lambda x: x not in content, s))
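If common.txt is large, building a set of the common words first makes each membership test constant-time, in line with the set-based answer above (same filename assumed):
with open('common.txt') as f:
    common = set(f.read().split())
res = [w for w in s if w not in common]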

Python - Count elements of a list within a range of specified values

I have a large list of words:
my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
I would like to count the number of elements between (and including) the [tag] elements across the whole list. The goal is to be able to see the frequency distribution.
Can I use range() to start and stop on a string match?
First, find all the indices of [tag]; the difference between adjacent indices is the number of words in each group.
my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
indices = [i for i, x in enumerate(my_list) if x == "[tag]"]
nums = []
for i in range(1, len(indices)):
    nums.append(indices[i] - indices[i - 1])
A faster way to find all the indices is with numpy:
import numpy as np
values = np.array(my_list)
searchval = '[tag]'
ii = np.where(values == searchval)[0]
print(ii)
Another way to get the differences between adjacent indices is with zip:
diffs = [y - x for x, y in zip(indices, indices[1:])]
You can use .index(value, [start, [stop]]) to search through the list.
my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
my_list.index('[tag]') # will return 0, as it occurs at the zeroth element
my_list.index('[/tag]') # will return 6
That will get you your first group length. On each following iteration you just need to remember the last closing tag's index and use that index plus 1 as the start point:
my_list.index('[tag]', 7) # will return 7
my_list.index('[/tag]', 7) # will return 11
Do that in a loop until you've reached the last closing tag in your list.
Also remember that .index will raise a ValueError if the value is not present, so you'll need to handle that exception when it occurs.
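Assembling those steps into the loop described above (a sketch of my own, including the ValueError handling):
my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
counts = []
start = 0
while True:
    try:
        open_i = my_list.index('[tag]', start)
        close_i = my_list.index('[/tag]', open_i)
    except ValueError:
        break  # no more complete tag pairs
    counts.append(close_i - open_i + 1)  # +1 to include both tags
    start = close_i + 1
print(counts)  # [7, 5, 4]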
Solution using list comprehension and string manipulation.
my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
# string together your list
my_str = ','.join(my_list)
# split the giant string at each [tag]; drop the empty leading piece
my_tags = [t for t in my_str.split('[tag]') if t]
# split each chunk back into words, dropping the empty strings the commas leave behind
my_words = [[w for w in t.split(',') if w] for t in my_tags]
# count each chunk, adding one for the [tag] that the split removed
my_cnt = [1 + len(w) for w in my_words]  # [7, 5, 4]
Or in one line:
# all as one list comprehension, starting from the joined string
[1 + len([w for w in t.split(',') if w]) for t in my_str.split('[tag]') if t]
This should allow you to find the number of words between and including your tags:
MY_LIST = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]',
           'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']

def main():
    ranges = find_ranges(MY_LIST, '[tag]', '[/tag]')
    for index, pair in enumerate(ranges, 1):
        print('Range {}: Start = {}, Stop = {}'.format(index, *pair))
        start, stop = pair
        print(' Size of Range =', stop - start + 1)

def find_ranges(iterable, start, stop):
    range_start = None
    for index, value in enumerate(iterable):
        if value == start:
            if range_start is None:
                range_start = index
            else:
                raise ValueError('a start was duplicated before a stop')
        elif value == stop:
            if range_start is None:
                raise ValueError('a stop was seen before a start')
            else:
                yield range_start, index
                range_start = None

if __name__ == '__main__':
    main()
This example will print out the following text so you can see how it works:
Range 1: Start = 0, Stop = 6
Size of Range = 7
Range 2: Start = 7, Stop = 11
Size of Range = 5
Range 3: Start = 12, Stop = 15
Size of Range = 4
I would go with the following since the OP wants to count the actual values. (No doubt he has figured out how to do that by now.)
i = [k for k, v in enumerate(my_list) if v == '[tag]']
j = [k for k, v in enumerate(my_list) if v == '[/tag]']
for z in zip(i, j):
    print(z[1] - z[0] + 1)  # +1 so both tags are included in the count
Borrowing and slightly modifying the generator code from the selected answer to this question:
my_list = ['[tag]', 'there', 'are', 'many', 'words', 'here', '[/tag]', '[tag]', 'some', 'more', 'here', '[/tag]', '[tag]', 'and', 'more', '[/tag]']
def group(seq, sep):
    g = []
    for el in seq:
        g.append(el)
        if el == sep:
            yield g
            g = []
counts = [len(x) for x in group(my_list, '[/tag]')]
I changed the generator they gave in that answer so it does not return the empty list at the end and so it includes the separator in the current list instead of putting it in the next one. Note that this assumes there will always be a matching '[tag]' '[/tag]' pair in that order, and that all the elements in the list are between a pair.
After running this, counts will be [7, 5, 4]

How do I remove duplicate words from a list in python without using sets?

I have the following Python code which almost works for me (I'm SO close!). I have a text file from one of Shakespeare's plays that I'm opening:
Original text file:
"But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"
And the result the code I wrote gives me is this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and',
'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill',
'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the',
'through', 'what', 'window', 'with', 'yonder']
So this is almost what I want: it's already in a list sorted the way I want it, but how do I remove the duplicate words? I'm trying to create a new ResultList and append the words to it, but it gives me the above result without getting rid of the duplicate words. If I print ResultList it just dumps a ton of words out. The way I have it now is close, but I want to get rid of the extra "and"s, "is"s, "sun"s and "the"s. I want to keep it simple and use append(), but I'm not sure how to get it to work. I don't want to do anything crazy with the code. What simple thing am I missing from my code in order to remove the duplicate words?
fname = raw_input("Enter file name: ")
fhand = open(fname)
NewList = list() #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
    line.rstrip() #strip white space
    words = line.split() #split lines of words and make list
    NewList.extend(words) #make the list from 4 lists to 1 list
    for word in line.split(): #for each word in line.split()
        if words not in line.split(): #if a word isn't in line.split
            NewList.sort() #sort it
            ResultList.append(words) #append it, but this doesn't work.
print NewList
#print ResultList (doesn't work the way I want it to)
mylist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(set(mylist), key=lambda x:mylist.index(x))
print(newlist)
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist contains a list of the set of unique values from mylist, sorted by each item's index in mylist.
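Since the title asks to avoid sets altogether: in Python 3.7+ a plain dict preserves insertion order, so dict.fromkeys gives the same first-occurrence dedupe without any set (a variant of my own):
newlist = list(dict.fromkeys(mylist))  # dedupe, keeping first occurrence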
You did have a couple of logic errors in your code. I fixed them; hope it helps.
fname = "stuff.txt"
fhand = open(fname)
AllWords = list() #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
    line.rstrip() #strip white space
    words = line.split() #split lines of words and make list
    AllWords.extend(words) #make the list from 4 lists to 1 list
AllWords.sort() #sort list
for word in AllWords: #for each word in the combined, sorted list
    if word not in ResultList: #if the word isn't already in the results
        ResultList.append(word) #append it.
print(ResultList)
Tested on Python 3.4, no importing.
The function below might help.
def remove_duplicate_from_list(temp_list):
    if temp_list:
        my_list_temp = []
        for word in temp_list:
            if word not in my_list_temp:
                my_list_temp.append(word)
        return my_list_temp
    else:
        return []
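For example (a sample call of my own):
words = ['the', 'sun', 'the', 'sun', 'and', 'moon']
print(remove_duplicate_from_list(words))  # ['the', 'sun', 'and', 'moon']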
This should work; it walks the list and appends an element to the new list only if it differs from the last element added.
def unique(lst):
    """ Assumes lst is already sorted """
    unique_list = []
    for el in lst:
        # the 'not unique_list' check guards the very first element
        if not unique_list or el != unique_list[-1]:
            unique_list.append(el)
    return unique_list
You could also use itertools.groupby, which works similarly:
from itertools import groupby
# lst must already be sorted
unique_list = [key for key, _ in groupby(lst)]
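For example, with sample data of my own:
lst = sorted(['and', 'is', 'and', 'sun', 'is'])
print(unique(lst))                       # ['and', 'is', 'sun']
print([key for key, _ in groupby(lst)])  # ['and', 'is', 'sun']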
A good alternative to using a set would be to use a dictionary. The collections module contains a class called Counter, which is a specialized dictionary for counting the number of times each of its keys is seen. Using it you could do something like this:
from collections import Counter
wordlist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and',
            'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is',
            'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun',
            'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(Counter(wordlist),
                 key=lambda w: w.lower())  # case insensitive sort
print(newlist)
Output:
['already', 'and', 'Arise', 'breaks', 'But', 'east', 'envious', 'fair',
'grief', 'is', 'It', 'Juliet', 'kill', 'light', 'moon', 'pale', 'sick',
'soft', 'sun', 'the', 'through', 'what', 'Who', 'window', 'with', 'yonder']
There is a problem with your code. I think you mean:
for word in line.split(): #for each word in line.split()
    if word not in ResultList: #if the word isn't in ResultList
Use plain old lists. Almost certainly not as efficient as Counter.
fname = raw_input("Enter file name: ")
Words = []
with open(fname) as fhand:
    for line in fhand:
        line = line.strip()
        # lines probably not needed
        #if line.startswith('"'):
        #    line = line[1:]
        #if line.endswith('"'):
        #    line = line[:-1]
        Words.extend(line.split())
UniqueWords = []
for word in Words:
    if word.lower() not in UniqueWords:
        UniqueWords.append(word.lower())
print Words
UniqueWords.sort()
print UniqueWords
This always checks against the lowercase version of the word, to ensure that the same word in a different case is not counted as two different words.
I added checks to remove the double quotes at the start and end of the file, but if they are not present in the actual file, those (commented-out) lines can be disregarded.
This should do the job:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
lst.sort()
print(lst)
