Splitting a string based on a certain set of words

Splitting a string based on a certain set of words - python

I have a list of strings like such,
['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
Given a keyword list like ['for', 'or', 'and'] I want to be able to parse the list into another list where if the keyword list occurs in the string, split that string into multiple parts.
For example, the above set would be split into
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Currently I've split each inner string by underscore and have a for loop looking for an index of a key word, then recombining the strings by underscore. Is there a quicker way to do this?

>>> [re.split(r"_(?:f?or|and)_", s) for s in l]
[['happy_feet'],
['happy_hats', 'cats'],
['sad_fox', 'mad_banana'],
['sad_pandas', 'happy_cats', 'people']]
To combine them into a single list, you can use
result = []
for s in l:
result.extend(re.split(r"_(?:f?or|and)_", s))

>>> pat = re.compile("_(?:%s)_"%"|".join(sorted(split_list,key=len)))
>>> list(itertools.chain(pat.split(line) for line in data))
will give you the desired output for the example dataset provided
actually with the _ delimiters you dont really need to sort it by length so you could just do
>>> pat = re.compile("_(?:%s)_"%"|".join(split_list))
>>> list(itertools.chain(pat.split(line) for line in data))

You could use a regular expression:
from itertools import chain
import re
pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
result = list(chain.from_iterable(pattern.split(w) for w in input_list))
The pattern is dynamically created from your list of keywords. The string 'happy_hats_for_cats' is split on '_for_':
>>> re.split(r'_for_', 'happy_hats_for_cats')
['happy_hats', 'cats']
but because we actually produced a set of alternatives (using the | metacharacter) you get to split on any of the keywords:
>>> re.split(r'_(?:for|or|and)_', 'sad_pandas_and_happy_cats_for_people')
['sad_pandas', 'happy_cats', 'people']
Each split result gives you a list of strings (just one if there was nothing to split on); using itertools.chain.from_iterable() lets us treat all those lists as one long iterable.
Demo:
>>> from itertools import chain
>>> import re
>>> keywords = ['for', 'or', 'and']
>>> input_list = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
>>> pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
>>> list(chain.from_iterable(pattern.split(w) for w in input_list))
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']

Another way of doing this, using only built-in method, is to replace all occurrence of what's in ['for', 'or', 'and'] in every string with a replacement string, say for example _1_ (it could be any string), then at then end of each iteration, to split over this replacement string:
l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
replacement_s = '_1_'
lookup = ['for', 'or', 'and']
lookup = [x.join('_'*2) for x in lookup] #Changing to: ['_for_', '_or_', '_and_']
results = []
for i,item in enumerate(l):
for s in lookup:
if s in item:
l[i] = l[i].replace(s,'_1_')
results.extend(l[i].split('_1_'))
OUTPUT:
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']

Related

I am able to parse the log file but not getting output in correct format in python [duplicate]

How do I concatenate a list of strings into a single string?
For example, given ['this', 'is', 'a', 'sentence'], how do I get "this-is-a-sentence"?
For handling a few strings in separate variables, see How do I append one string to another in Python?.
For the opposite process - creating a list from a string - see How do I split a string into a list of characters? or How do I split a string into a list of words? as appropriate.

Use str.join:
>>> words = ['this', 'is', 'a', 'sentence']
>>> '-'.join(words)
'this-is-a-sentence'
>>> ' '.join(words)
'this is a sentence'

A more generic way (covering also lists of numbers) to convert a list to a string would be:
>>> my_lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> my_lst_str = ''.join(map(str, my_lst))
>>> print(my_lst_str)
12345678910

It's very useful for beginners to know
why join is a string method.
It's very strange at the beginning, but very useful after this.
The result of join is always a string, but the object to be joined can be of many types (generators, list, tuples, etc).
.join is faster because it allocates memory only once. Better than classical concatenation (see, extended explanation).
Once you learn it, it's very comfortable and you can do tricks like this to add parentheses.
>>> ",".join("12345").join(("(",")"))
Out:
'(1,2,3,4,5)'
>>> list = ["(",")"]
>>> ",".join("12345").join(list)
Out:
'(1,2,3,4,5)'

Edit from the future: Please don't use the answer below. This function was removed in Python 3 and Python 2 is dead. Even if you are still using Python 2 you should write Python 3 ready code to make the inevitable upgrade easier.
Although #Burhan Khalid's answer is good, I think it's more understandable like this:
from str import join
sentence = ['this','is','a','sentence']
join(sentence, "-")
The second argument to join() is optional and defaults to " ".

list_abc = ['aaa', 'bbb', 'ccc']
string = ''.join(list_abc)
print(string)
>>> aaabbbccc
string = ','.join(list_abc)
print(string)
>>> aaa,bbb,ccc
string = '-'.join(list_abc)
print(string)
>>> aaa-bbb-ccc
string = '\n'.join(list_abc)
print(string)
>>> aaa
>>> bbb
>>> ccc

We can also use Python's reduce function:
from functools import reduce
sentence = ['this','is','a','sentence']
out_str = str(reduce(lambda x,y: x+"-"+y, sentence))
print(out_str)

We can specify how we join the string. Instead of '-', we can use ' ':
sentence = ['this','is','a','sentence']
s=(" ".join(sentence))
print(s)

If you have a mixed content list and want to stringify it, here is one way:
Consider this list:
>>> aa
[None, 10, 'hello']
Convert it to string:
>>> st = ', '.join(map(str, map(lambda x: f'"{x}"' if isinstance(x, str) else x, aa)))
>>> st = '[' + st + ']'
>>> st
'[None, 10, "hello"]'
If required, convert back to the list:
>>> ast.literal_eval(st)
[None, 10, 'hello']

If you want to generate a string of strings separated by commas in final result, you can use something like this:
sentence = ['this','is','a','sentence']
sentences_strings = "'" + "','".join(sentence) + "'"
print (sentences_strings) # you will get "'this','is','a','sentence'"

def eggs(someParameter):
del spam[3]
someParameter.insert(3, ' and cats.')
spam = ['apples', 'bananas', 'tofu', 'cats']
eggs(spam)
spam =(','.join(spam))
print(spam)

Without .join() method you can use this method:
my_list=["this","is","a","sentence"]
concenated_string=""
for string in range(len(my_list)):
if string == len(my_list)-1:
concenated_string+=my_list[string]
else:
concenated_string+=f'{my_list[string]}-'
print([concenated_string])
>>> ['this-is-a-sentence']
So, range based for loop in this example , when the python reach the last word of your list, it should'nt add "-" to your concenated_string. If its not last word of your string always append "-" string to your concenated_string variable.

nltk how to give multiple separated sentences

I have list of sentences (each sentence is a list) in English and I would like to fetch ngrams.
For example:
sentences = [['this', 'is', 'sentence', 'one'], ['hello','again']]
In order to run
nltk.utils.ngram
I need to flat the list to:
sentences = ['this','is','sentence','one','hello','again']
But then I get a fault bgram in
('one','hello')
.
What is the best way to deal with it?
Thanks!

Try this:
from itertools import chain
sentences = list(chain(*sentences))
chain return a chain object whose .__next__() method returns elements from the first iterable until it is exhausted, then elements from the next
iterable, until all of the iterables are exhausted.
or you can do:
sentences = [i for s in sentences for i in s]

you can also use list comprehension
f = []
[f.extend(_l) for _l in sentences]
f = ['this', 'is', 'sentence', 'one', 'hello', 'again']

Splitting string using different scenarios using regex

I have 2 scenarios so split a string
scenario 1:
"##$hello?? getting good.<li>hii"
I want to be split as 'hello','getting','good.<li>hii (Scenario 1)
'hello','getting','good','li,'hi' (Scenario 2)
Any ideas please??

Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']

This might be what your looking for \w+ it matches any digit or letter from 1 to n times as many times as possible. Here is a working Java-Script
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+ which matches word characters as many times as possible. You cound also use [A-Za-b] to match only letters which not numbers. As show here.
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what are in the brackets 1 to n timeas as many as possible. In this case the range a-z of lower case charactors and the range of A-Z uppder case characters. Hope this is what you want.

For first scenario just use regex to find all words that are contain word characters and <>.:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For second one you need to repleace the repeated characters only if they are not valid english words, you can do this using nltk corpus, and re.sub regex:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']

In case you are looking for solution without regex. string.punctuation will give you list of all special characters.
Use this list with list comprehension for achieving your desired result as:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: Below is the step by step instruction regarding how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']

Finding groups of letters, numbers, or symbols

How can I split a string into substrings based on the characters contained in the substrings. For example, given a string "ABC12345..::", I would like to get a list like ['ABC', '12345', '..::']. I know the valid characters for each substring, but I don't know the lengths. So the string could also look like "CC123:....:", in which case I would like to have ['CC', '123', ':....:'] as the result.

By your example you don't seem to have anything to split with (e.g. nothing between C and 1), but what you do have is a well-formed pattern that you can match. So just simply create a pattern that groups the strings you want matched:
>>> import re
>>> s = "ABC12345..::"
>>> re.match('([A-Z]*)([0-9]*)([\.:]*)', s).groups()
('ABC', '12345', '..::')
Alternative, compile the pattern into a reusable regex object and do this:
>>> patt = re.compile('([A-Z]*)([0-9]*)([\.:]*)')
>>> patt.match(s).groups()
('ABC', '12345', '..::')
>>> patt.match("CC123:....:").groups()
('CC', '123', ':....:')

Match each group with the following regex
[0-9]+|[a-zA-Z]+|[.:]+
[0-9]+ any digits repeated any times, or
[a-zA-Z]+ any letters repeated any times, or
[.:]+ any dots or colons repeated any times
This will allow you to match groups in any order, ie: "123...xy::ab..98765PQRS".
import re
print(re.findall( r'[0-9]+|[a-zA-Z]+|[.:]+', "ABC12345..::"))
# => ['ABC', '12345', '..::']
ideone demo

If you want a non-regex approach:
value = 'ABC12345..::'
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use splicing to build list
Output:
['ABC', '12345', '..::']
Another string:
value = "CC123:....:"
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use splicing to build list
Output:
['CC', '123', ':....:']
EDIT:
Just did a benchmark, metatoaster's method is slightly faster than this :)

How to concatenate (join) items in a list to a single string

How do I concatenate a list of strings into a single string?
For example, given ['this', 'is', 'a', 'sentence'], how do I get "this-is-a-sentence"?
For handling a few strings in separate variables, see How do I append one string to another in Python?.
For the opposite process - creating a list from a string - see How do I split a string into a list of characters? or How do I split a string into a list of words? as appropriate.

Use str.join:
>>> words = ['this', 'is', 'a', 'sentence']
>>> '-'.join(words)
'this-is-a-sentence'
>>> ' '.join(words)
'this is a sentence'

A more generic way (covering also lists of numbers) to convert a list to a string would be:
>>> my_lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> my_lst_str = ''.join(map(str, my_lst))
>>> print(my_lst_str)
12345678910

It's very useful for beginners to know
why join is a string method.
It's very strange at the beginning, but very useful after this.
The result of join is always a string, but the object to be joined can be of many types (generators, list, tuples, etc).
.join is faster because it allocates memory only once. Better than classical concatenation (see, extended explanation).
Once you learn it, it's very comfortable and you can do tricks like this to add parentheses.
>>> ",".join("12345").join(("(",")"))
Out:
'(1,2,3,4,5)'
>>> list = ["(",")"]
>>> ",".join("12345").join(list)
Out:
'(1,2,3,4,5)'

Edit from the future: Please don't use the answer below. This function was removed in Python 3 and Python 2 is dead. Even if you are still using Python 2 you should write Python 3 ready code to make the inevitable upgrade easier.
Although #Burhan Khalid's answer is good, I think it's more understandable like this:
from str import join
sentence = ['this','is','a','sentence']
join(sentence, "-")
The second argument to join() is optional and defaults to " ".

list_abc = ['aaa', 'bbb', 'ccc']
string = ''.join(list_abc)
print(string)
>>> aaabbbccc
string = ','.join(list_abc)
print(string)
>>> aaa,bbb,ccc
string = '-'.join(list_abc)
print(string)
>>> aaa-bbb-ccc
string = '\n'.join(list_abc)
print(string)
>>> aaa
>>> bbb
>>> ccc

We can also use Python's reduce function:
from functools import reduce
sentence = ['this','is','a','sentence']
out_str = str(reduce(lambda x,y: x+"-"+y, sentence))
print(out_str)

We can specify how we join the string. Instead of '-', we can use ' ':
sentence = ['this','is','a','sentence']
s=(" ".join(sentence))
print(s)

If you have a mixed content list and want to stringify it, here is one way:
Consider this list:
>>> aa
[None, 10, 'hello']
Convert it to string:
>>> st = ', '.join(map(str, map(lambda x: f'"{x}"' if isinstance(x, str) else x, aa)))
>>> st = '[' + st + ']'
>>> st
'[None, 10, "hello"]'
If required, convert back to the list:
>>> ast.literal_eval(st)
[None, 10, 'hello']

If you want to generate a string of strings separated by commas in final result, you can use something like this:
sentence = ['this','is','a','sentence']
sentences_strings = "'" + "','".join(sentence) + "'"
print (sentences_strings) # you will get "'this','is','a','sentence'"

def eggs(someParameter):
del spam[3]
someParameter.insert(3, ' and cats.')
spam = ['apples', 'bananas', 'tofu', 'cats']
eggs(spam)
spam =(','.join(spam))
print(spam)

Without .join() method you can use this method:
my_list=["this","is","a","sentence"]
concenated_string=""
for string in range(len(my_list)):
if string == len(my_list)-1:
concenated_string+=my_list[string]
else:
concenated_string+=f'{my_list[string]}-'
print([concenated_string])
>>> ['this-is-a-sentence']
So, range based for loop in this example , when the python reach the last word of your list, it should'nt add "-" to your concenated_string. If its not last word of your string always append "-" string to your concenated_string variable.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Splitting a string based on a certain set of words - python

>>> [re.split(r"_(?:f?or|and)_", s) for s in l] [['happy_feet'], ['happy_hats', 'cats'], ['sad_fox', 'mad_banana'], ['sad_pandas', 'happy_cats', 'people']] To combine them into a single list, you can use result = [] for s in l: result.extend(re.split(r"_(?:f?or|and)_", s))

Related

I am able to parse the log file but not getting output in correct format in python [duplicate]

nltk how to give multiple separated sentences

Splitting string using different scenarios using regex

Finding groups of letters, numbers, or symbols

How to concatenate (join) items in a list to a single string

Categories

Resources