Suppose you have a string:
text = "coding in python is a lot of fun"
And character positions:
positions = [(0,6),(10,16),(29,32)]
These are intervals, which cover certain words within text, i.e. coding, python and fun, respectively.
Using the character positions, how could you split the text on those words, to get this output:
['coding','in','python','is a lot of','fun']
This is just an example, but it should work for any string and any list of character positions.
I'm not looking for this:
[text[i:j] for i,j in positions]
I'd flatten positions to be [0,6,10,16,29,32] and then do something like
positions.append(-1)
prev_positions = [0] + positions
words = []
for begin, end in zip(prev_positions, positions):
    words.append(text[begin:end])
This exact code produces ['', 'coding', ' in ', 'python', ' is a lot of ', 'fun', ''], so it needs some additional work to strip the whitespace and drop the empty strings.
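One way to finish this idea is to strip each slice and drop the ones that end up empty. The following is a minimal sketch rather than the answerer's code: the function name split_on_spans is made up for illustration, and it assumes the intervals are sorted and do not overlap.

```python
def split_on_spans(text, spans):
    # Flatten [(0, 6), (10, 16)] into cut points and add both ends of the text.
    cuts = [0] + [point for span in spans for point in span] + [len(text)]
    # Slice between consecutive cut points and trim surrounding whitespace.
    pieces = (text[begin:end].strip() for begin, end in zip(cuts, cuts[1:]))
    # Drop slices that are empty after trimming.
    return [piece for piece in pieces if piece]


text = "coding in python is a lot of fun"
positions = [(0, 6), (10, 16), (29, 32)]
print(split_on_spans(text, positions))
# ['coding', 'in', 'python', 'is a lot of', 'fun']
```

Using len(text) as the last cut point, instead of -1, keeps the final character when the last interval stops before the end of the string.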
The code below works as expected:
text = "coding in python is a lot of fun"
positions = [(0,6),(10,16),(29,32)]
textList = []
lastIndex = 0
for n, indexes in enumerate(positions):
    if n > 0:
        # text between the previous interval and this one
        textList.append(text[lastIndex: indexes[0]])
    # the interval itself
    textList.append(text[indexes[0]: indexes[1]])
    # skip the single character (a space here) that follows the interval
    lastIndex = indexes[1] + 1
print(textList)
Output: ['coding', 'in ', 'python', 'is a lot of ', 'fun']
Note: If the spaces are not needed, you can strip them.
Related
I have a list of strings in python, where I need to preserve order and split some strings.
The condition to split a string is that, after the first match of :, there is a character that is not a space, newline, or tab.
For example, this must be split:
example: Test to ['example:', 'Test']
While these stay the same: example: , IGNORE_ME_EXAMPLE
Given an input like this:
['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']
I'm expecting:
['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
Please note that the split parts must stay next to each other and follow the original order.
Also, once I split a string, I don't want to check the resulting parts again. In other words, I don't want to check 'Test' after I split it off.
To make it clearer, given an input like this:
['example: Test::YES']
I'm expecting:
['example:', 'Test::YES']
You can use regular expressions for that:
import re
lines = ['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']
pattern = re.compile(r"(.+:)\s+([^\s].+)")
result = []
for line in lines:
    match = pattern.match(line)
    if match:
        result.append(match.group(1))
        result.append(match.group(2))
    else:
        result.append(line)
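Because .+ is greedy, the pattern above prefers the last colon that is followed by whitespace, and it needs at least two characters after that whitespace. A hedged variant, assuming the split should happen at the first colon whenever anything other than whitespace follows it (the names FIRST_COLON and split_at_first_colon are only illustrative):

```python
import re

# Lazy ".*?" stops at the earliest colon that works;
# "\S" demands a non-whitespace character somewhere after it.
FIRST_COLON = re.compile(r"(.*?:)\s*(\S.*)", re.DOTALL)


def split_at_first_colon(lines):
    result = []
    for line in lines:
        match = FIRST_COLON.match(line)
        if match:
            # Keep both halves next to each other, in the original order.
            result.extend(match.groups())
        else:
            result.append(line)
    return result


print(split_at_first_colon(['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']))
# ['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
print(split_at_first_colon(['example: Test::YES']))
# ['example:', 'Test::YES']
```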
You can use a nested list comprehension for the input list:
l = ['example: Test::YES']
l1 = [j.strip() for i in l for j in i.split(":", 1) if j.strip() != '']
print(l1)
Output:
['example', 'Test::YES']
Note that this drops the colon from the first part.
You need to iterate over your list of words and check each word for a :. If one is present, split the word into two parts, the part up to and including the : and the part after it, and append the non-empty parts to the final list. If there is no : in the word, append the word to the result list unchanged and skip the rest of the loop body for that word.
wordlist = ['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
result = []
for word in wordlist:
    if ':' not in word:
        # nothing to split, keep the word as it is
        result.append(word)
        continue
    index = word.index(':')
    part1, part2 = word[:index+1], word[index+1:]
    if part1:
        result.append(part1)
    if part2:
        result.append(part2)
print(result)
Output:
['example:', 'Test', 'example:', ' ', 'IGNORE_ME_EXAMPLE']
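The loop described above can also be written with str.partition, which splits at the first colon only. This sketch additionally keeps a string whole when nothing but whitespace follows the colon, which is the behaviour the question asks for. The function name split_after_first_colon is an assumption for illustration, and stripping the second part is a choice made here, not something the question mandates.

```python
def split_after_first_colon(items):
    result = []
    for item in items:
        # partition returns (head, ":", tail), or (item, "", "") when there is no colon.
        head, sep, tail = item.partition(":")
        if sep and tail.strip():
            result.append(head + sep)
            result.append(tail.strip())
        else:
            # No colon, or only whitespace after it: keep the string untouched.
            result.append(item)
    return result


print(split_after_first_colon(['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']))
# ['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
```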
I am trying to split a string such as the one below, with all of the delimiters below, but only once.
string = 'it; seems; like\ta good\tday to watch\va\vmovie.'
delimiters = '\t \v ;'
The output, in this case, would be:
['it', ' seems; like', 'a good\tday to watch', 'a\vmovie.']
Obviously the example above is a nonsense example, but I am trying to learn whether or not this is possible. Would a fairly involved regex be in order?
Apologies if this question has been asked before. I did a fair bit of searching and could not find something quite like my example. Thanks for your time!
This should do the trick:
import re
def split_once_by(s, delims):
    delims = set(delims)
    parts = []
    while delims:
        delim_re = '({})'.format('|'.join(re.escape(d) for d in delims))
        result = re.split(delim_re, s, maxsplit=1)
        if len(result) == 3:
            first, delim, s = result
            parts.append(first)
            delims.remove(delim)
        else:
            break
    parts.append(s)
    return parts
Example:
>>> split_once_by('it; seems; like\ta good\tday to watch\va\vmovie.', '\t\v;')
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
Burning Alcohol's answer inspired me to write this (IMO) better function:
def split_once_by(s, delims):
    split_points = sorted((s.find(d), -len(d), d) for d in delims)
    start = 0
    for stop, _longest_first, d in split_points:
        if stop < start:
            continue
        yield s[start:stop]
        start = stop + len(d)
    yield s[start:]
with usage:
>>> list(split_once_by('it; seems; like\ta good\tday to watch\va\vmovie.', '\t\v;'))
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
A simple algorithm would do:
test_string = r'it; seems; like\ta good\tday to watch\va\vmovie.'
delimiters = [r'\t', r'\v', ';']
# find the index of each first occurrence and sort by it
delimiters = sorted(delimiters, key=lambda delimiter: test_string.find(delimiter))
splitted_string = [test_string]
# perform split with option maxsplit
for index, delimiter in enumerate(delimiters):
    if delimiter in splitted_string[-1]:
        splitted_string += splitted_string[-1].split(delimiter, maxsplit=1)
        splitted_string.pop(index)
print(splitted_string)
# ['it', ' seems; like', 'a good\\tday to watch', 'a\\vmovie.']
Just create a list of patterns and apply them once:
string = 'it; seems; like\ta good\tday to watch\va\vmovie.'
patterns = ['\t', '\v', ';']
for pattern in patterns:
    string = '*****'.join(string.split(pattern, maxsplit=1))
print(string.split('*****'))
Output:
['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
So, what is "*****"?
Each call to split() returns a list, and you can't call split() on a list in the next iteration. So you join the pieces back together with a placeholder that does not occur anywhere in the text, such as "*****", "###" or "^^^^^^^", and apply split() to the resulting string again.
In the end, the string contains one "*****" for each pattern in the list that was found, so a final split on "*****" gives the result.
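The placeholder trick breaks if the placeholder already occurs in the input. A small sketch of the same idea that checks for this first and uses str.replace with a count of 1 instead of split followed by join; the function name split_once_each and the default sentinel are assumptions, not part of the answer above.

```python
def split_once_each(text, patterns, sentinel="\0"):
    # Refuse to run if the placeholder could be confused with real content.
    if sentinel in text or any(sentinel in p for p in patterns):
        raise ValueError("sentinel occurs in the input; pick another one")
    for pattern in patterns:
        # Replace only the first occurrence of each pattern with the placeholder.
        text = text.replace(pattern, sentinel, 1)
    return text.split(sentinel)


string = 'it; seems; like\ta good\tday to watch\va\vmovie.'
print(split_once_each(string, ['\t', '\v', ';']))
# ['it', ' seems; like', 'a good\tday to watch', 'a\x0bmovie.']
```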
I have an arbitrarily nested array of values that looks like this:
['"multiply"', 'ALAssn ', ['ACmp ', ['Ge ', ['Var "n"'], ' ', ['Num 0']]], ['ALAssn ', ['ACmp ', ['Eq ', ['Var "p"'], ' ', ['Mul ', ['Var "n"'], ' ', ['Var "m"']]]]]]
and I need to figure out a way to go through every value in the array and format it so that:
Each array of length 1 is split into two separate values:
-- Example: ['Var "n"'] should now become ["Var", "n"] and ['Num 0'] now becomes ["Num", 0].
All empty (whitespace-only) string values are removed.
-- Example: ['Ge ', ['Var "n"'], ' ', ['Num 0']] now becomes ['Ge ', ['Var "n"'], ['Num 0']]
The whitespace in any string is removed.
-- Example: 'Ge ' now becomes 'Ge'
The given snippet is a portion of a much larger string that needs to be parsed. I understand what needs to be done at a high level, i.e.:
Once I get to a list of length 1, split its single string on " " to get two separate elements, then trim arr[1] to get rid of the extra quotation marks
If el is an empty string, for every element in the list, list.remove(el)
Check isinstance(el, str) for every element when traversing, and if true, el.replace(" ", "") to get rid of the whitespace.
My only issue comes when traversing through every single element in the list. I've tried doing so recursively and iteratively, but so far haven't been able to crack it.
Ideally, I traverse through every single element, and then once I hit an element that meets the criteria, set that element equal to the change that I want to make on it. This is only really the case for points 1 and 3.
EDIT:
Thank you so much for the answers given. I have one more addition I would like to make.
Assume, too, that I have nested identifiers like 'Read "a"' as the first value of an array, with the possibility of having additional identifiers like 'Write "a"' at the same level. These also need to be converted to the format ["Read", "a"]. See the change in the large list below. How would I go about doing this?
['Read "a"', ['Add', ['Var', 'i'], ['Num', '1']], 'Write "a"', ['Add', ['Var', 'i'], ['Num', '1']], ['Var', 't']]
The point of these values 'Read' and 'Write' is that, when traversing the list, we know the "type" of the next n elements of the list corresponding to that identifier. We can distinguish them by the fact that they are the only values in the nested list that are not lists themselves.
For example: ['identifier', [], [], []]
Assume it is known that the identifier type contains 3 lists, first, second, third. The goal is to read identifier and then store first, second, and third as nodes in a tree, for example.
This problem seems like it would be easiest to deal with by constructing a new list with the fixed-up items, rather than trying to modify the existing list in place. This would let you use recursion to deal with the nesting, while using iteration over the flat parts of each list.
I'd structure the code like this:
def process(lst):
    if len(lst) == 1: # special case for one-element lists
        result = lst[0].split()
        result[1] = result[1].strip('"') # strip quotation marks
        return result
    result = []
    for item in lst:
        if isinstance(item, list):
            result.append(process(item)) # recurse on nested lists
        else: # item is a string
            stripped = item.strip() # remove leading and trailing whitespace
            if stripped:
                result.append(stripped) # keep only non-empty strings
    return result
Seems you can collapse 1 and 3 into one operation:
def sanitize(item):
    if isinstance(item, list):
        if len(item) == 1:
            item = item[0].split()
        return [output for i in item if (output := sanitize(i))]
    return item.strip('" ') # Strips both '"' and ' '.
item = ['"multiply"', 'ALAssn ', ['ACmp ', ['Ge ', ['Var "n"'], ' ', ['Num 0']]], ['ALAssn ', ['ACmp ', ['Eq ', ['Var "p"'], ' ', ['Mul ', ['Var "n"'], ' ', ['Var "m"']]]]]]
sanitize(item)
# Returns: ['multiply', 'ALAssn', ['ACmp', ['Ge', ['Var', 'n'], ['Num', '0']]], ['ALAssn', ['ACmp', ['Eq', ['Var', 'p'], ['Mul', ['Var', 'n'], ['Var', 'm']]]]]]
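Neither answer covers the case from the EDIT, where strings such as 'Read "a"' sit next to nested lists. One possible extension of sanitize is sketched below. It assumes an identifier always has the shape name, whitespace, quoted argument; the IDENTIFIER pattern is that assumption spelled out, not something confirmed by the question.

```python
import re

# Assumed shape of an identifier string: a name, whitespace, then a quoted argument.
IDENTIFIER = re.compile(r'(\w+)\s+"([^"]*)"')


def sanitize(item):
    if isinstance(item, list):
        if len(item) == 1 and isinstance(item[0], str):
            # One-element list such as ['Var "n"']: split it into its two parts.
            item = item[0].split()
        cleaned = (sanitize(child) for child in item)
        # Drop whatever is empty after cleaning.
        return [child for child in cleaned if child]
    text = item.strip()
    match = IDENTIFIER.fullmatch(text)
    if match:
        # 'Read "a"' becomes ['Read', 'a'].
        return [match.group(1), match.group(2)]
    return text.strip('"')


data = ['Read "a"', ['Add', ['Var', 'i'], ['Num', '1']], 'Write "a"', ['Var', 't']]
print(sanitize(data))
# [['Read', 'a'], ['Add', ['Var', 'i'], ['Num', '1']], ['Write', 'a'], ['Var', 't']]
```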
I want to edit my text like this:
arr = []
# arr is full of tokenized words from my text
For example:
"Abraham Lincoln Hotel is very beautiful place and i want to go there with
Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
Edit: Basically I want to detect proper names and group them by using istitle() and isalpha() in a for statement like:
for i in range(len(arr)):
    if arr[i].istitle() and arr[i].isalpha():
        ...
In the example, words keep being appended to the current group until the next word does not start with an upper-case letter:
arr[0] = arr[0] + arr[1] + arr[2]
# Abraham Lincoln Hotel
This is what I want with my new arr:
['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with ['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'], ['Reebok'].
"Also" is not a problem for me; it will be useful when I try to match against my dataset.
You could do something like this:
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
last_word_index = -100
proper_nouns = []
for idx, word in enumerate(all_words):
    if word.istitle() and word.isalpha():
        if last_word_index == idx-1:
            proper_nouns[-1] = proper_nouns[-1] + " " + word
        else:
            proper_nouns.append(word)
        last_word_index = idx
print(proper_nouns)
This code will:
Split all the words into a list
Iterate over all of the words and, for each capitalized alphabetic word:
  If the last capitalized word was the previous word, append it to the last entry in the list
  Else store the word as a new entry in the list
  Record the index at which this capitalized word was found
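Since isalpha() is False for tokens with attached punctuation such as "Palvin." or "Adidas,", those words are skipped by the loop above. A sketch that separates punctuation into its own tokens before grouping, so that punctuation ends a group as well; the regex tokenization and the name group_proper_nouns are assumptions, not part of the answer.

```python
import re


def group_proper_nouns(sentence):
    groups, current = [], []
    # Tokens are either runs of word characters or single punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        if token.isalpha() and token.istitle():
            current.append(token)
        else:
            # Any other token closes the group collected so far.
            if current:
                groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    return groups


sentence = ("Abraham Lincoln Hotel is very beautiful place and i want to go there "
            "with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok.")
print(group_proper_nouns(sentence))
# ['Abraham Lincoln Hotel', 'Barbara Palvin', 'Also', 'Adidas', 'Nike', 'Reebok']
```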
Is this what you are asking?
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
chars = ".!?," # Characters you want to remove from the words in the array
table = str.maketrans(chars, " " * len(chars)) # Create a table that maps each of those characters to a space
sentence = sentence.translate(table) # Replace characters with spaces
arr = sentence.split() # Split the string into an array wherever whitespace occurs
print(arr)
The output is:
['Abraham',
'Lincoln',
'Hotel',
'is',
'very',
'beautiful',
'place',
'and',
'i',
'want',
'to',
'go',
'there',
'with',
'Barbara',
'Palvin',
'Also',
'there',
'are',
'stores',
'like',
'Adidas',
'Nike',
'Reebok']
Note about this code: any character that is in the chars variable will be removed from the strings in the array. The explanation is in the code comments.
To remove the non-names just do this:
import string
new_arr = []
for i in arr:
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)
This code will include ALL words that start with a capital letter.
To fix that you will need to change chars to:
chars = ","
And change the above code to:
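import string
new_arr = []
end = ".!?"
for idx, word in enumerate(arr):
    # a word starts a sentence if the previous word ends with ".", "!" or "?"
    starts_sentence = idx > 0 and arr[idx-1][-1] in end
    if word[0] in string.ascii_uppercase and not starts_sentence:
        new_arr.append(word)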
And that will output:
['Abraham',
'Lincoln',
'Hotel',
'Barbara',
'Palvin.',
'Adidas',
'Nike',
'Reebok.']
I have a complicated string and would like to extract multiple substrings from it.
The string consists of a set of items, separated by commas. Each item has an identifier (id-n) followed by a pair of words enclosed in brackets. I want to get only the word inside the brackets that has a number attached to its end (e.g. 'This-1'). The number indicates the position the word should take when the words are arranged after extraction.
#Example of how the individual items would look like
id1(attr1, is-2) #The number 2 here indicates word 'is' should be in position 2
id2(attr2, This-1) #The number 1 here indicates word 'This' should be in position 1
id3(attr3, an-3) #The number 3 here indicates word 'an' should be in position 3
id4(attr4, example-4) #The number 4 here indicates word 'example' should be in position 4
id5(attr5, example-4) #This is a duplicate of the word 'example'
#Example of string - this is how the string with the items looks like
string = "id1(attr1, is-1), id2(attr2, This-2), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
#This is how the result should look after extraction
result = 'This is an example'
Is there an easier way to do this? Regex doesn't work for me.
A trivial/naive approach:
>>> from collections import defaultdict
>>> s = "id1(attr1, is-1), id2(attr2, This-2), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
>>> z = [x.split(',')[1].strip().strip(')') for x in s.split('),')]
>>> d = defaultdict(list)
>>> for i in z:
...     b = i.split('-')
...     d[b[1]].append(b[0])
...
>>> ' '.join(' '.join(d[t]) for t in sorted(d.keys(), key=int))
'is This an example example'
You have duplicated positions for example in your sample string, which is why example is repeated in the output.
However, your sample string does not match your stated requirements either. This result follows your description, with the words arranged by their position indicators.
Now, if you want to get rid of duplicates:
>>> ' '.join(e for t in sorted(d.keys(), key=int) for e in set(d[t]))
'is This an example'
Why not regex? This works.
In [44]: s = "id1(attr1, is-2), id2(attr2, This-1), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
In [45]: z = [(m.group(2), m.group(1)) for m in re.finditer(r'(\w+)-(\d+)\)', s)]
In [46]: [x for y, x in sorted(set(z))]
Out[46]: ['This', 'is', 'an', 'example']
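Note that sorted(set(z)) compares the positions as strings, so a position of '10' would sort before '2'. A sketch that converts the positions to integers and keeps the first word seen for each position; the function name rebuild_sentence is illustrative only.

```python
import re


def rebuild_sentence(s):
    by_position = {}
    # Each match is (word, position) taken from items like "This-1)".
    for word, pos in re.findall(r"(\w+)-(\d+)\)", s):
        # setdefault keeps the first word for a position and ignores duplicates.
        by_position.setdefault(int(pos), word)
    return " ".join(by_position[p] for p in sorted(by_position))


s = ("id1(attr1, is-2), id2(attr2, This-1), id3(attr3, an-3), "
     "id4(attr4, example-4), id5(atttr5, example-4)")
print(rebuild_sentence(s))
# This is an example
```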
OK, how about this:
sample = ("id1(attr1, is-2), id2(attr2, This-1), "
          "id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)")
def make_cryssie_happy(s):
    words = {} # maps position -> word
    # every second comma-separated piece is an item like This-1, an-3, etc.
    ll = s.split(',')[1::2]
    for item in ll:
        tt = item.replace(')','').lstrip()
        (word, pos) = tt.split('-')
        # there can only be one word at a particular position;
        # using a dict keyed by position is an alternative to using sets
        words[int(pos)] = word
    # sort the positions numerically and collect the words in that order
    res = [words[i] for i in sorted(words)]
    # return a nice string
    return ' '.join(res)
print(make_cryssie_happy(sample))