I have a text file that I read in Python, and I want to split its text into groups of 3 words.
For example,
Python is an interpreted, high-level and general-purpose programming language
I want the result to be,
[['Python', 'is', 'an'],['interpreted,', 'high-level','and'],['general-purpose','programming','language']].
My code so far,
lines = [word.split() for word in open(r"c:\\python\4_TRIPLETS\Sample.txt", "r")]
print(lines)
gives me this output,
[['Python', 'is', 'an', 'interpreted,', 'high-level', 'and', 'general-purpose', 'programming', 'language.', "Python's", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace.', 'Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear,', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects.']]
Any ideas?
Use a list comprehension to split the list into chunks of n items:
with open(r'c:\python\4_TRIPLETS\Sample.txt', 'r') as file:
    # split() with no arguments splits on any whitespace, including newlines
    data = file.read().split()
lines = [data[i:i + 3] for i in range(0, len(data), 3)]
print(lines)
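The slicing idea above works for any group size. Here is a minimal sketch, assuming the text is already held in a string rather than read from a file; the function name `chunk_words` and the second sample sentence are illustrative and not part of the original answer:

```python
def chunk_words(text, n):
    """Split text on whitespace and group the words into lists of n."""
    words = text.split()  # split() with no arguments also handles newlines
    return [words[i:i + n] for i in range(0, len(words), n)]


sample = "Python is an interpreted, high-level and general-purpose programming language"
groups = chunk_words(sample, 3)
print(groups)

# When the word count is not a multiple of n, the last group is simply shorter.
print(chunk_words("one two three four", 3))
```

Slicing past the end of a list does not raise an error in Python, which is why no special handling is needed for the final, shorter group.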
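You can split the string into words, then go through the list and collect the words into groups of 3.
text = "Python is an interpreted, high-level and general-purpose programming language"
words = text.split()
final = []
temp = []
for x in words:
    temp.append(x)
    if len(temp) == 3:  # a full group of 3 words
        final.append(temp)
        temp = []
if temp:  # keep a trailing group of fewer than 3 words
    final.append(temp)
print(final)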
I would like to separate a list in different lists at '\n'. For example, if I have a list like this one:
l = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']
I'd like to separate the items this way:
l = [['hi', 'my', 'name', 'is', 'john'], ['nice', 'to', 'meet', 'you']]
Can someone help me?
Some code that I tried to write:
l = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']
lst = []
ls = []
for word in l:
    if word != '\n':
        ls.append(l)
    else:
        lst.append(ls)
print(lst)
I think you just wanted to append word to the list ls. Also, clear the partial list at the newlines like so:
lst = []
ls = []
for word in l:
    if word != '\n':
        ls.append(word)
    else:
        if len(ls) > 0:
            lst.append(ls)
        ls = []
if len(ls) > 0:
    lst.append(ls)
print(lst)
resulting in
[['hi', 'my', 'name', 'is', 'john'], ['nice', 'to', 'meet', 'you']]
You could use itertools.groupby:
>>> from itertools import groupby
>>> l = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']
>>> l = [list(group) for key, group in groupby(l, lambda s: s != '\n') if key]
>>> l
[['hi', 'my', 'name', 'is', 'john'], ['nice', 'to', 'meet', 'you']]
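To see why the `if key` filter is enough, it can help to look at the intermediate pairs that `groupby` produces. The sketch below uses the same list and key function; the names `runs` and `result` are only for illustration:

```python
from itertools import groupby

l = ['hi', 'my', 'name', 'is', 'john', '\n', '\n', 'nice', 'to', 'meet', 'you']

# groupby yields one (key, group) pair per run of consecutive items that
# share the same key; here the key is True for words and False for '\n'.
runs = [(key, list(group)) for key, group in groupby(l, lambda s: s != '\n')]
for key, group in runs:
    print(key, group)

# Keeping only the runs whose key is True drops the runs of newlines.
result = [group for key, group in runs if key]
print(result)
```

Because consecutive '\n' items fall into a single run, two newlines in a row do not produce an empty sublist.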
I am having a hard time merging the elements of a Python list in groups of a given size.
I already found a solution that works for one fixed number, but I want it to work for an arbitrary given number N.
For example, when I have the list
['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
Result,
When N = 2
['There was', 'a farmer', 'who had', 'a dog', 'and cat', '.']
or N = 3
['There was a', 'farmer who had', 'a dog and', 'cat .']
I would much prefer that it modify the existing list directly and not use any module or library.
Any help is greatly appreciated!!
Here's the sensible way to do it. It does create a new list, since that is more efficient than trying to modify the original.
a = ['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
n = 2
print([' '.join(a[i:i+n]) for i in range(0, len(a), n)])
n = 3
print([' '.join(a[i:i+n]) for i in range(0, len(a), n)])
Output:
['There was', 'a farmer', 'who had', 'a dog', 'and cat', '.']
['There was a', 'farmer who had', 'a dog and', 'cat .']
This is the simplest way I can think of:
my_list = ['a','b','c','d','e','f','g','h','i']
N = 3
my_list[0:N] = [''.join(my_list[0:N])]
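The slice assignment above merges only the first N items. Repeating it at each successive position merges the whole list in place, which is what the question asked for. This is a sketch under some assumptions: the function name `merge_in_place` is hypothetical, it joins with a space (as in the question's expected output) rather than with an empty string, and it assumes n is at least 1:

```python
def merge_in_place(words, n):
    """Merge every n consecutive items of words into one string, in place."""
    i = 0
    while i < len(words):
        # Replace the slice of up to n items with a single joined string.
        words[i:i + n] = [' '.join(words[i:i + n])]
        i += 1  # the merged string now sits at index i, so move past it
    return words


a = ['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
merge_in_place(a, 3)
print(a)
```

Each slice assignment shifts the rest of the list, so this is not faster than building a new list; it only avoids creating a second list object.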
word_list = ['There', 'was', 'a', 'farmer', 'who', 'had', 'a', 'dog', 'and', 'cat', '.']
new_word_list = []
N = int(input())
for i in range(0, len(word_list), N):
    string = ''
    # The last group may contain fewer than N words.
    for j in range(min(N, len(word_list) - i)):
        string = string + word_list[i + j] + ' '
    new_word_list.append(string.rstrip())  # drop the trailing space
print(new_word_list)
Here I implemented it using basic for loops to iterate over the list, although a new list had to be created.
I am currently doing a data analysis project involving text mining. As of now, I am stuck on filtering out certain phrases.
Suppose I have this tokenized array of words
arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld']
(hello, how is your day going? #HelloWorld)
and I want to remove the #HelloWorld from the sentence.
My original logic was to traverse the array and check for the #. Once the # has been found, I would replace the # and the element after it with a blank space, as follows:
N = 0
for index in arr:
    if arr[N] == '#':
        arr[N] = ' '
        arr[N+1] = ' '
    N += 1
Unfortunately, I got the error "list assignment index out of range" at line 5. I tried to use .append(), but it only allows modification at N.
Is there another approach to this?
This should work. As the others said, you need to check whether you are at the end of the list.
EDIT: simplified!
arr = ['a', 'b', '#', 'aa']
indices = [idx for idx, elt in enumerate(arr) if elt == '#']
for idx in indices:
    if idx + 1 < len(arr):  # Check that '#' is not the last element
        arr[idx + 1] = ' '
    arr[idx] = ' '
Your code will try to access outside the array when the last element is #, so you need to check for that.
There's also no need to use a separate variable for iteration and indexing, just iterate over the range of indexes.
for i in range(len(arr)):
    if arr[i] == '#':
        arr[i] = ' '
        if i < len(arr) - 1:  # only touch the next element if it exists
            arr[i+1] = ' '
The root cause is that 'N+1' goes out of range when the loop reaches the end of the list.
If every '#' is guaranteed to be followed by another element, try the code below:
arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld']
for index in range(0, len(arr)):
    if arr[index] == '#':
        arr[index:index+2] = ['', '']
print(arr)
Output:
['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '', '']
If the array ends with '#', the code still replaces the final '#' with ['', ''], so the list grows by one element (I am not sure whether this result is what you expect).
arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld', '#']
for index in range(0, len(arr)):
    if arr[index] == '#':
        arr[index:index+2] = ['', '']
print(arr)
Output:
['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '', '', '', '']
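The answers above blank out the '#' and the token after it. If the goal is instead to remove those tokens from the list entirely, one possible sketch is to build a new list and skip them, which also sidesteps the index-out-of-range problem because no index arithmetic is involved. The function name `drop_hashtags` is hypothetical:

```python
def drop_hashtags(tokens):
    """Return a new list without '#' tokens and the token right after each."""
    result = []
    skip_next = False
    for token in tokens:
        if token == '#':
            skip_next = True   # drop the '#' and flag the following token
            continue
        if skip_next:
            skip_next = False  # this token followed a '#', so drop it too
            continue
        result.append(token)
    return result


arr = ['hello', ',', 'how', 'is', 'your', 'day', 'going', '?', '#', 'HelloWorld']
cleaned = drop_hashtags(arr)
print(cleaned)
```

A trailing '#' with nothing after it is simply dropped, since the flag is set but never used.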
I am trying to split a list into a new list. Here's the initial list:
initList =['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
'title', 'PTE427', 'how', 'are', 'you']
I want to split the list at each PTExyz item into a new list that looks like this:
newList = ['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']
How should I develop a proper algorithm for the general case with repeated PTExyz items?
Thank You!
The algorithm will be something like this.
Iterate over the list with a temp string that is initialized as an empty string. When you find a string s that starts with PTE, append temp to the result list if it is not empty, then set temp to s. Add every following string to temp until the next string that starts with PTE. After the loop, append the last temp to the result list.
ls = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
result = []
temp = ''
for s in ls:
    if s.startswith('PTE'):
        if temp != '':
            result.append(temp)
        temp = s
    else:
        # Skip empty items (they would add a double space) and anything before the first PTE
        if temp == '' or s == '':
            continue
        temp += ' ' + s
if temp != '':
    result.append(temp)
print(result)
Edit
To handle the pattern PTExyz you can use a regular expression. In that case, import re and replace the s.startswith('PTE') check with:
re.match(r'PTE\w{3}$', s)
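Put together, the regular-expression variant might look like the sketch below. It is one way to arrange the same idea, not the answer's exact code: the names `MARKER` and `group_by_marker` are hypothetical, and it collects the words of each group in a list and joins them at the end instead of concatenating strings.

```python
import re

# Same pattern as the re.match call above: 'PTE' followed by exactly
# three word characters, and nothing after them.
MARKER = re.compile(r'PTE\w{3}$')


def group_by_marker(items):
    """Join each PTExyz marker with the non-empty items that follow it."""
    result = []
    current = []
    for s in items:
        if MARKER.match(s):
            if current:
                result.append(' '.join(current))
            current = [s]  # start a new group at this marker
        elif current and s:
            # Skip empty strings and anything before the first marker.
            current.append(s)
    if current:
        result.append(' '.join(current))
    return result


initList = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
            'title', 'PTE427', 'how', 'are', 'you']
print(group_by_marker(initList))
```

Note that `\w{3}` also accepts letters and underscores; if the suffix is always three digits, `\d{3}` would be the stricter choice.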
I think this will work:
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
resultlist = []
s = ' '.join(word for word in l if word)  # skip empty strings to avoid double spaces
for part in s.split('PTE'):
    if part:  # the text before the first 'PTE' is empty
        resultlist.append('PTE' + part.strip())
print(resultlist)
It works with a regular expression for PTExyz:
import re
l = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word',
     'title', 'PTE427', 'how', 'are', 'you']
pattern = re.compile(r'PTE\d{3}')
k = []
for i in l:
    if pattern.match(i) is not None:
        k.append(i)
s = ' '.join(w for w in l if w)  # skip empty strings to avoid double spaces
parts = pattern.split(s)[1:]  # drop the empty text before the first PTExyz
for i in range(len(k)):
    parts[i] = k[i] + ' ' + parts[i].strip()
print(parts)
>>> words = ['PTE123', '', 'I', 'am', 'programmer', 'PTE345', 'based', 'word', 'title', 'PTE427', 'how', 'are', 'you']
>>> index_list = [i for i, item in enumerate(words) if item.startswith('PTE')]
>>> index_list.append(len(words))
>>> index_list
[0, 5, 9, 13]
>>> [' '.join(w for w in words[start:end] if w) for start, end in zip(index_list, index_list[1:])]
Output
['PTE123 I am programmer', 'PTE345 based word title', 'PTE427 how are you']