Group list of 4-strings into list of pairs - python

I have the following list of strings:
['word1 word2 word3 word4', 'word5 word6 word7 word8']
(I have shown only two strings, but there can be many.)
I want to create a new list that looks like this:
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']
I tried the following:
lines = ['word1 word2 word3 word4', 'word5 word6 word7 word8']
[[word1 + ' ' + word2, word3 + ' ' + word4] for line in lines for word1, word2, word3, word4 in line.split()]
But it gives the following error:
ValueError: too many values to unpack (expected 4)
How do I do this in the most Pythonic way?

With short regex matching:
import re
lst = ['word1 word2 word3 word4', 'word5 word6 word7 word8']
res = [pair for words in lst for pair in re.findall(r'\S+ \S+', words)]
\S+ \S+ - matches 2 consecutive "words"
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']

A modified version of @jsbueno's earlier answer, which was slightly incorrect:
>>> words = [item for line in lines for item in line.split()]
>>> words
['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8']
>>> [words[i] + ' ' + words[i+1] for i in range(0, len(words), 2)]
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']
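The index-based version above reads words[i+1], so it raises IndexError when the total number of words is odd. A minimal sketch of the same idea with slices and zip, assuming it is acceptable to drop a trailing unpaired word:

```python
lines = ['word1 word2 word3 word4', 'word5 word6 word7 word8']

# Flatten every line into one list of words.
words = [word for line in lines for word in line.split()]

# words[::2] holds the even-indexed words and words[1::2] the odd-indexed ones.
# zip stops at the shorter slice, so a trailing unpaired word is dropped
# instead of raising IndexError.
pairs = [f'{first} {second}' for first, second in zip(words[::2], words[1::2])]
print(pairs)
```

The two slices each copy half of the list, so this trades a little memory for not having to do any index arithmetic by hand.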

Pythonic doesn't mean "fewer lines". This is easily done with a simple for loop:
result = []
for line in lines:
    words = line.split()
    result.append(' '.join(words[:2]))
    result.append(' '.join(words[2:]))
Which gives your desired result:
['word1 word2', 'word3 word4', 'word5 word6', 'word7 word8']
If you want to make this more general for strings with more words, you can write a function that will yield chunks of the desired size, and use that with str.join:
def chunks(iterable, chunk_size):
    c = []
    for item in iterable:
        c.append(item)
        if len(c) == chunk_size:
            yield c
            c = []
    if c: yield c
result = []
for line in lines:
    words = line.split()
    for chunk in chunks(words, 2):
        result.append(' '.join(chunk))
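The same chunking idea can also be sketched with itertools.islice, which pulls a fixed number of items from an iterator at a time. This is an alternative to the hand-written generator, not a replacement the answer itself proposes, and the fifth word in the second line is added here only to show what happens with a leftover item:

```python
from itertools import islice

def chunks(iterable, chunk_size):
    """Yield successive lists of up to chunk_size items from iterable."""
    iterator = iter(iterable)
    while True:
        # islice takes at most chunk_size items; the last chunk may be shorter.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

lines = ['word1 word2 word3 word4', 'word5 word6 word7 word8 word9']
result = [' '.join(chunk) for line in lines for chunk in chunks(line.split(), 2)]
print(result)
```

As in the generator above, a final chunk that is shorter than chunk_size is still yielded, so the unpaired word ends up as its own entry.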

An optimized solution that pushes all the per-item work to the C layer:
from itertools import chain
lines = ['word1 word2 word3 word4', 'word5 word6 word7 word8']
words = chain.from_iterable(map(str.split, lines))
paired = list(map('{} {}'.format, words, words))
print(paired)
chain.from_iterable(map(str.split, lines)) creates an iterator of the individual words. map('{} {}'.format, words, words) maps the same iterator twice to put them back together in pairs (map(' '.join, zip(words, words)) would get the same effect, but with an additional intermediate product; feel free to test which is faster in practice). The list wrapper consumes it to produce the final result.
This beats the existing answers by avoiding all per-item work at the Python layer (no additional bytecode executed as the input grows), and avoids one of the weirdly high overhead aspects of Python (indexing and simple integer math).
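The pairing trick depends on passing the same iterator object twice. A small sketch of why that works, and why it does not work with a plain list (the variable names here are illustrative only):

```python
words = iter(['word1', 'word2', 'word3', 'word4'])

# Both arguments are the *same* iterator, so each step of zip consumes
# two consecutive items: one for each position in the resulting tuple.
pairs = list(zip(words, words))
print(pairs)

# With a list, zip creates two independent iterators over it,
# so every item is simply paired with itself.
items = ['word1', 'word2']
not_pairs = list(zip(items, items))
print(not_pairs)
```

map with two iterable arguments advances them in the same lockstep fashion, which is what lets '{} {}'.format receive consecutive words.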

Related

Dividing the string with multiple matches in python

I have a string that has to be split for words that are present in "words"
words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 " # a single lined string
This is the code I'm using. Is there a simpler way to do this?
for l in words:
    if l == "word1": t1 = text.split(l)
    if l == "word2": t2 = str(t1[1]).split(l)
    if l == "word3": t3 = str(t2[1]).split(l)
print(t1[0])
print(t2[0])
print(t3[0])
The output is like:
statement
statement1
statement2
statement3
How about using itertools.groupby:
from itertools import groupby
words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 "
delimiters = set(words)
statements = [
    ' '.join(g) for k, g in groupby(text.split(), lambda w: w in delimiters)
    if not k
]
print(statements)
Output:
['long statement', 'statement1', 'statement2', 'statement3']
You could also use a regex to solve this:
import re
words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 "
print(*re.split('|'.join(words),text), sep="\n")
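The regex split above leaves the surrounding spaces in each piece and would also split inside a longer token such as word10. A hedged variant, assuming every delimiter starts and ends with a word character (so that \b is meaningful) and that surrounding whitespace should be stripped:

```python
import re

words = ['word1', 'word2', 'word3']
text = " long statement word1 statement1 word2 statement2 word3 statement3 "

# re.escape guards against delimiters that contain regex metacharacters;
# \b keeps 'word1' from matching inside a longer token such as 'word10'.
splitter = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')

statements = [part.strip() for part in splitter.split(text)]
print(statements)
```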

Pythonic and efficient usage of logical and membership operators

What is the best way to check whether some words, combined with logical operators (or, and), exist in a list of strings?
Say you have a list of strings:
list_of_str = ['some phrase with word1','another phrase','other phrase with word2']
I have two cases, 1) and 2), where I would like to get the strings that do not contain, or do contain, some words. However, I would prefer not to repeat myself as I do now with if 'word1' not in i and 'word2' not in i and 'word3' not in i
I would like to get 1)
list_1 = [i for i in list_of_str if 'word1' not in i and 'word2' not in i and 'word3' not in i]
output: ['another phrase']
and 2)
list_2 = [i for i in list_of_str if 'word1' in i or 'word2' in i or 'word3' in i]
output: ['some phrase with word1', 'other phrase with word2']
I found that I can do this for case 2), but couldn't work out how to use all for case 1):
list_2 = [i for i in list_of_str if any(word in ['word1','word2','word3'] for word in i.split())]
output: ['some phrase with word1', 'other phrase with word2']
Also, is this the most efficient way of doing things?
You can use:
words = ['word1', 'word2', 'word23']
list_1 = [i for i in list_of_str if all(w not in i for w in words)]
list_2 = [i for i in list_of_str if any(w in i for w in words)]
I think this is a good use-case for regex alternation, if efficiency matters:
>>> import re
>>> words = ['word1', 'word2', 'word23']
>>> regex = re.compile('|'.join([re.escape(w) for w in words]))
>>> regex
re.compile('word1|word2|word23')
>>> list_of_str = ['some phrase with word1','another phrase','other phrase with word2']
>>> [phrase for phrase in list_of_str if not regex.search(phrase)]
['another phrase']
>>> [phrase for phrase in list_of_str if regex.search(phrase)]
['some phrase with word1', 'other phrase with word2']
>>>
If you think about it in sets, you want sentences from that list where the set of search words and the set of words in the sentence are either disjoint or intersect.
E.g.:
set('some phrase with word1'.split()).isdisjoint({'word1', 'word2', 'word23'})
not set('some phrase with word1'.split()).isdisjoint({'word1', 'word2', 'word23'})
# or:
set('some phrase with word1'.split()) & {'word1', 'word2', 'word23'}
So:
search_terms = {'word1', 'word2', 'word23'}
list1 = [i for i in list_of_str if set(i.split()).isdisjoint(search_terms)]
list2 = [i for i in list_of_str if not set(i.split()).isdisjoint(search_terms)]
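Note that the set-based test is not equivalent to the substring test used in the earlier answers: splitting first means only whole words match. A small sketch of the difference, using sample data chosen here for illustration rather than taken from the question:

```python
list_of_str = ['some phrase with word1', 'another phrase', 'other phrase with word23']
search_terms = {'word1', 'word2'}

# Substring test: 'word2' is found inside 'word23'.
by_substring = [s for s in list_of_str if any(w in s for w in search_terms)]

# Whole-word test: each phrase is split into tokens first, so 'word23'
# does not count as a match for 'word2'. isdisjoint accepts any iterable.
by_whole_word = [s for s in list_of_str if not search_terms.isdisjoint(s.split())]

print(by_substring)
print(by_whole_word)
```

Which behaviour is right depends on whether the search terms are meant as whole words or as fragments.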

Column lists into string

I have a dataset that looks like this:
id keyPhrases
0 [word1, word2]
1 [word4, word 5 and 6, word7]
2 [word8, etc, etc
Each value in 'keyPhrases' is a list.
I'd like to expand each list into new rows, one string per row.
The 'id' column is not important right now.
I have already tried df.values, from_records, etc.
Expected:
keyPhrases
word1
word2
word3
word4
You can use itertools.chain in combination with dataframe column selection:
import itertools
import pandas as pd
df = pd.DataFrame({
    'keyPhrases': [
        ['word1', 'word2'],
        ['word4', 'word5', 'word7'],
        ['word8', 'word9']
    ],
    'id': [1,2,3]
})
for elem in itertools.chain.from_iterable(df['keyPhrases'].values):
    print(elem)
will print:
word1
word2
word4
word5
word7
word8
word9
Using np.concatenate():
import numpy as np
np.concatenate(df.keyPhrases) #data courtesy vurmux
array(['word1', 'word2', 'word4', 'word5', 'word7', 'word8', 'word9'],
      dtype='<U5')
Another way:
import functools
import operator
functools.reduce(operator.iadd, df.keyPhrases, [])
#['word1', 'word2', 'word4', 'word5', 'word7', 'word8', 'word9']
A fun way, but not recommended:
df.keyPhrases.sum()
Out[520]: ['word1', 'word2', 'word4', 'word5', 'word7', 'word8', 'word9']
Or with reduce:
from functools import reduce
keyPhrases = df.keyPhrases.tolist()
reduce(lambda x, y: x+y, keyPhrases)
Both the numpy and the itertools methods worked pretty fine.
I ended up using the itertools method and used the for loop to write each line to a file.
It saved me a lot of time and code.
Thanks a lot!!
for elem in itertools.chain.from_iterable(df['keyPhrases'].values):
    textfile.write(elem + "\n")
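The write loop above can be sketched end to end without pandas. This is only an illustration: plain nested lists stand in for df['keyPhrases'], and io.StringIO stands in for the output file, both of which are assumptions made so the snippet runs on its own.

```python
import io
from itertools import chain

# Plain nested lists stand in for df['keyPhrases'] (an assumption for this sketch).
key_phrases = [['word1', 'word2'], ['word4', 'word5', 'word7'], ['word8', 'word9']]

# io.StringIO stands in for a real file; a handle from
# open('some_file.txt', 'w') would be used in exactly the same way.
with io.StringIO() as textfile:
    # writelines does not add newlines itself, so append one per phrase.
    textfile.writelines(phrase + '\n' for phrase in chain.from_iterable(key_phrases))
    written = textfile.getvalue()

print(written, end='')
```

Using a with block closes the file automatically, even if writing fails part-way through.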
I am not aware of an existing function that does this in a single line of code. The workaround below should solve your requirement. If a built-in function can do this more easily, I would be glad to know.
import pandas as pd
#Existing DF where the data is in the form of list
df = pd.DataFrame(columns=['ID', 'value_list'])
#New DF where the data should be atomic
df_new = pd.DataFrame(columns=['ID', 'value_single'])
#Sample Data
row_1 = ['A', 'B', 'C', 'D']
row_2 = ['D', 'E', 'F']
row_3 = ['F', 'G']
row_4 = ['H', 'I']
row_5 = ['J']
#Data Push to existing DF
row_ = "row_"
for i in range(5):
    df.loc[i, 'ID'] = i
    df.loc[i, 'value_list'] = eval(row_+str(i+1))
#Data Push to new DF where list is pushed as atomic data
counter = 0
i=0
while(i<len(df)):
    j=0
    while(j<len(df['value_list'][i])):
        df_new.loc[counter, 'ID'] = df['ID'][i]
        df_new.loc[counter, 'value_single'] = df['value_list'][i][j]
        counter = counter + 1
        j = j+1
    i = i+1
print(df_new)
I found another way to do it:
df['keyPhrases'] = df['keyPhrases'].str.split(',') #to make arrays
df['keyPhrases'] = df['keyPhrases'].astype(str) #back to strings
s=''.join(df.keyPhrases).replace('[','').replace(']','\n').replace(',','\n') #replace magic
print(s)
word1
word2
word4
word 5 and 6
word7
word8
etc
etc
The numpy answer above is very good, but I'll add a plain nested loop: it is not the fastest, but it is the simplest to understand.
import pandas as pd
lista = [[['word1', 'word2']], [['word4', 'word5', 'word6', 'word7']], [['word8', 'word9', 'word10']]]
df = pd.DataFrame(lista, columns=['keyPhrases'])
phrases = []  # not named `list`, which would shadow the built-in
for key in df.keyPhrases:
    for element in key:
        phrases.append(element)
phrases

List Comprehension for items in list

list comprehension to check for presence of any of the items.
I have some text and would like to check on some keywords. It should return me the sentence if it contains any of the keywords.
An example:
text = [t for t in string.split('. ')
if 'drink' in t or 'eat' in t
or 'sleep' in t]
This works. However, I am wondering whether there is a better way, as the list of keywords may grow.
I tried putting the keywords in a list but it would not work in this list comprehension.
OR using if any
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any (l in pattern for l in t)]
You were almost there:
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any(word in t for word in pattern)]
The key is to check, for each word in pattern, whether that word is inside the sentence:
any(word in t for word in pattern)
Your use of any is backwards. This is what you want:
[t for t in string.split('. ') if any(l in t for l in pattern)]
An alternative approach is using a regex:
import re
regex = re.compile('|'.join(map(re.escape, pattern)))
[t for t in string.split('. ') if regex.search(t)]
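Both the any(...) version and a plain alternation regex match substrings, so 'eat' is also found inside words like 'weather' or 'great'. If whole-word matches are wanted, one option is to wrap the alternation in \b word boundaries. A minimal sketch, with a sample string made up for illustration:

```python
import re

pattern = ['drink', 'eat', 'sleep']
string = 'I like to eat. The weather is great. We sleep late'

# \b anchors the alternation at word boundaries, so 'eat' does not
# match inside 'weather' or 'great'.
keyword_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, pattern)) + r')\b')

whole_word = [t for t in string.split('. ') if keyword_re.search(t)]
substring = [t for t in string.split('. ') if any(word in t for word in pattern)]

print(whole_word)
print(substring)
```

Compiling the pattern once also avoids rebuilding it for every sentence as the keyword list grows.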

How to return the count of words from a list of words that appear in a list of lists?

I have a very large list of strings like this:
list_strings = ['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
and a very large list of lists like this:
lis_of_lis = [['the storm was good blight'],['this is overcloud stormicide'],...,['there was a plague']]
How can I return a list of the counts of all the words from list_strings that appear in each sub-list of lis_of_lis? For the above example, the desired output is: [2,1,1]
For example:
['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
['the storm was good blight']
The count is 2, since storm and blight appear in the first sublist (lis_of_lis)
['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
['this is overcloud stormicide']
The count is 1, since overcloud appears in the second sublist of lis_of_lis
and stormicide doesn't appear in list_strings
['storm', 'squall', 'overcloud',...,'cloud_up', 'cloud_over', 'plague', 'blight', 'fog_up', 'haze']
['there was a plague']
The count is 1, since plague appears in the third sublist of lis_of_lis
Hence the desired output is [2,1,1]
The problem with all the answers is that they count every substring match inside a word instead of matching only full words
You can use the sum function within a list comprehension:
[sum(1 for i in list_strings if i in sub[0]) for sub in lis_of_lis]
result = []
for sentence in lis_of_lis:
    result.append(0)
    for word in list_strings:
        if word in sentence[0]:
            result[-1]+=1
print(result)
which is the long version of
result = [sum(1 for word in list_strings if word in sentence[0]) for sentence in lis_of_lis]
This will return [2,2,1] for your example.
If you want only whole words, add spaces before and after the words / sentences:
result = []
for sentence in lis_of_lis:
    result.append(0)
    for word in list_strings:
        if ' '+word+' ' in ' '+sentence[0]+' ':
            result[-1]+=1
print(result)
or short version:
result = [sum(1 for word in list_strings if ' '+word+' ' in ' '+sentence[0]+' ') for sentence in lis_of_lis]
This will return [2,1,1] for your example.
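With a very large list_strings, scanning the whole word list for every sentence gets slow. A sketch of an alternative that splits each sentence once and intersects it with a set of the words, assuming whitespace-separated tokens with no punctuation attached (the shortened word list below is for illustration only):

```python
list_strings = ['storm', 'squall', 'overcloud', 'plague', 'blight']
lis_of_lis = [['the storm was good blight'],
              ['this is overcloud stormicide'],
              ['there was a plague']]

# Set membership is a hash lookup, so the cost per sentence depends on the
# sentence length rather than on the size of list_strings.
vocabulary = set(list_strings)

# Each listed word is counted at most once per sentence, matching the
# space-padded version above.
counts = [len(vocabulary & set(sentence[0].split())) for sentence in lis_of_lis]
print(counts)
```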
This creates a dictionary with the words in list_strings as keys, and the values starting at 0. It then iterates through lis_of_lis, splits each phrase into a list of words, iterates through those, and checks whether each one is in the dictionary. If it is, 1 is added to the corresponding value.
word_count = dict()
for word in list_strings:
    word_count[word] = 0
for phrase in lis_of_lis:
    # each sub-list holds a single string, so split its first element
    words_in_phrase = phrase[0].split()
    for word in words_in_phrase:
        if word in word_count:
            word_count[word] += 1
This will create a dictionary with the words as keys, and the frequency as values. I'll leave it to you to get the correct output out of that data structure.
