I have a complicated string and would like to try to extract multiple substring from it.
The string consists of a set of items, separated by commas. Each item has an identifier (id-n) for a pair of words inside which is enclosed by brackets. I want to get only the word inside the bracket which has a number attached to its end (e.g. 'This-1'). The number actually indicates the position of how the words should be arrannged after extraction.
#Example of how the individual items would look like
id1(attr1, is-2) #The number 2 here indicates word 'is' should be in position 2
id2(attr2, This-1) #The number 1 here indicates word 'This' should be in position 1
id3(attr3, an-3) #The number 3 here indicates word 'an' should be in position 3
id4(attr4, example-4) #The number 4 here indicates word 'example' should be in position 4
id5(attr5, example-4) #This is a duplicate of the word 'example'
#Example of string - this is how the string with the items looks like
string = "id1(attr1, is-1), id2(attr2, This-2), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
#This is how the result should look after extraction
result = 'This is an example'
Is there an easier way to do this? Regex doesn't work for me.
A trivial/naive approach:
>>> z = [x.split(',')[1].strip().strip(')') for x in s.split('),')]
>>> d = defaultdict(list)
>>> for i in z:
... b = i.split('-')
... d[b[1]].append(b[0])
...
>>> ' '.join(' '.join(d[t]) for t in sorted(d.keys(), key=int))
'is This an example example'
You have duplicated positions for example in your sample string, which is why example is repeated in the code.
However, your sample is not matching your requirements either - but this results is as per your description. Words arranged as per their position indicators.
Now, if you want to get rid of duplicates:
>>> ' '.join(e for t in sorted(d.keys(), key=int) for e in set(d[t]))
'is This an example'
Why not regex? This works.
In [44]: s = "id1(attr1, is-2), id2(attr2, This-1), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
In [45]: z = [(m.group(2), m.group(1)) for m in re.finditer(r'(\w+)-(\d+)\)', s)]
In [46]: [x for y, x in sorted(set(z))]
Out[46]: ['This', 'is', 'an', 'example']
OK, how about this:
sample = "id1(attr1, is-2), id2(attr2, This-1),
id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
def make_cryssie_happy(s):
words = {} # we will use this dict later
ll = s.split(',')[1::2]
# we only want items like This-1, an-3, etc.
for item in ll:
tt = item.replace(')','').lstrip()
(word, pos) = tt.split('-')
words[pos] = word
# there can only be one word at a particular position
# using a dict with the numbers as positions keys
# is an alternative to using sets
res = [words[i] for i in sorted(words)]
# sort the keys, dicts are unsorted!
# create a list of the values of the dict in sorted order
return ' '.join(res)
# return a nice string
print make_cryssie_happy(sample)
Related
I have a dataframe and a list of words. Now I want to count how often all words in the list occur in each cell of a dataframe.
text
this is a test sentence
another sentence
list = ["this", "test", "break"]
Result:
text
occurence_count
this is a test sentence
2
another sentence
0
My code does not work:
df["occurence_count"] = [df["text"].count(x) for x in list]
Perhaps you can do this:
a = ['this', 'test', 'break'] # 'list' shouldn't be used as a variable name
df['occurence_count'] = (
df['text'].str.split().explode()
.isin(set(a)).groupby(level=0).sum()
)
>>> df
text occurence_count
0 this is a test sentence 2
1 another sentence 0
You can do :
import re
l = ['this', 'test', 'break']
s = set(l)
df['occurence_count'] =df['text'].apply(
lambda x:len(set(re.split('\s+',x)).intersection(s)))
So you split them into words, get a set and look for an intersection in your list and get len
(BTW don't use list as variable name, its a keyword in python)
output:
text occurence_count
0 this is a test sentence 2
1 another sentence 0
I have a list of strings in python, where I need to preserve order and split some strings.
The condition to split a string is that after first match of : there is a none space/new line/tab char.
For example, this must be split:
example: Test to ['example':, 'Test']
While this stays the same: example: , IGNORE_ME_EXAMPLE
Given an input like this:
['example: Test', 'example: ', 'IGNORE_ME_EXAMPLE']
I'm expecting:
['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
Please Note that split strings are yet stick to each other and follow original order.
Plus, whenever I split a string I don't want to check split parts again. In other words, I don't want to check 'Test' after I split it.
To make it more clear, Given an input like this:
['example: Test::YES']
I'm expecting:
['example:', 'Test::YES']
You can use regular expressions for that:
import re
pattern = re.compile(r"(.+:)\s+([^\s].+)")
result = []
for line in lines:
match = pattern.match(line)
if match:
result.append(match.group(1))
result.append(match.group(2))
else:
result.append(line)
You can use nested loop comprehension for the input list:
l = ['example: Test::YES']
l1 = [j.lower().strip() for i in l for j in i.split(":", 1) if j.strip().lower() != '']
print(l1)
Output:
['example', 'Test::YES']
you need to iterate over your list of words, for each word, you need to check if : present or not. if present the then split the word in 2 parts, pre : and post part. append these pre and post to final list and if there is no : in word add that word in the result list and skip other operation for that word
# your code goes here
wordlist = ['example:', 'Test', 'example: ', 'IGNORE_ME_EXAMPLE']
result = []
for word in wordlist:
index = -1
part1, part2 = None, None
if ':' in word:
index = word.index(':')
else:
result.append(word)
continue
part1, part2 = word[:index+1], word[index+1:]
if part1 is not None and len(part1)>0:
result.append(part1)
if part2 is not None and len(part2)>0:
result.append(part2)
print(result)
output
['example:', 'Test', 'example:', ' ', 'IGNORE_ME_EXAMPLE']
Input a given string and check if any word in that string matches with its reverse in the same string then print that word else print $
I split the string and put the words in a list and then I reversed the words in that list. After that, I couldn't able to compare both the lists.
str = input()
x = str.split()
for i in x: # printing i shows the words in the list
str1 = i[::-1] # printing str1 shows the reverse of words in a new list
# now how to check if any word of the new list matches to any word of the old list
if(i==str):
print(i)
break
else:
print('$)
Input: suman is a si boy.
Output: is ( since reverse of 'is' is present in the same string)
You almost have it, just need to add another loop to compare each word against each inverted word. Try using the following
str = input()
x = str.split()
for i in x:
str1 = i[::-1]
for j in x: # <-- this is the new nested loop you are missing
if j == str1: # compare each inverted word against each regular word
if len(str1) > 1: # Potential condition if you would like to not include single letter words
print(i)
Update
To only print the first occurrence of a match, you could, in the second loop, only check the elements that come after. We can do this by keeping track of the index:
str = input()
x = str.split()
for index, i in enumerate(x):
str1 = i[::-1]
for j in x[index+1:]: # <-- only consider words that are ahead
if j == str1:
if len(str1) > 1:
print(i)
Note that I used index+1 in order to not consider single word palindromes a match.
a = 'suman is a si boy'
# Construct the list of words
words = a.split(' ')
# Construct the list of reversed words
reversed_words = [word[::-1] for word in words]
# Get an intersection of these lists converted to sets
print(set(words) & set(reversed_words))
will print:
{'si', 'is', 'a'}
Another way to do this is just in a list comprehension:
string = 'suman is a si boy'
output = [x for x in string.split() if x[::-1] in string.split()]
print(output)
The split on string creates a list split on spaces. Then the word is included only if the reverse is in the string.
Output is:
['is', 'a', 'si']
One note, you have a variable name str. Best not to do that as str is a Python thing and could cause other issues in your code later on.
If you want word more than one letter long then you can do:
string = 'suman is a si boy'
output = [x for x in string.split() if x[::-1] in string.split() and len(x) > 1]
print(output)
this gives:
['is', 'si']
Final Answer...
And for the final thought, in order to get just the 'is':
string = 'suman is a si boy'
seen = []
output = [x for x in string.split() if x[::-1] not in seen and not seen.append(x) and x[::-1] in string.split() and len(x) > 1]
print(output)
output is:
['is']
BUT, this is not necessarily a good way to do it, I don't believe. Basically you are storing information in seen during the list comprehension AND referencing that same list. :)
This answer wouldn't show you 'a' and won't output 'is' with 'si'.
str = input() #get input string
x = str.split() #returns list of words
y = [] #list of words
while len(x) > 0 :
a = x.pop(0) #removes first item from list and returns it, then assigns it to a
if a[::-1] in x: #checks if the reversed word is in the list of words
#the list doesn't contain that word anymore so 'a' that doesn't show twice wouldn't be returned
#and 'is' that is present with 'si' will be evaluated once
y.append(a)
print(y) # ['is']
I have a list of lists, in which I store sentences as strings. What I want to do is to get only the words starting with #. In order to do that, I split the sentences into words and now trying to pick only the words that start with # and exclude all the other words.
# to create the empty list:
lst = []
# to iterate through the columns:
for i in range(0,len(df)):
lst.append(df['col1'][i].split())
If I am mistaken you just need flat list containing all words starting with particular character. For doing that I would employ list flattening (via itertools):
import itertools
first = 'f' #look for words starting with f letter
nested_list = [['This is first sentence'],['This is following sentence']]
flat_list = list(itertools.chain.from_iterable(nested_list))
nested_words = [i.split(' ') for i in flat_list]
words = list(itertools.chain.from_iterable(nested_words))
lst = [i for i in words if i[0]==first]
print(lst) #output: ['first', 'following']
If i have a list strings:
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
how would i append the numbers flanking the .. to their corresponding lists ? to get:
first = [1,34,407]
last = [23,405,4032]
i wouldn't mind strings either because i can convert to int later
first = ['1','34','407']
last = ['23','405','4032']
Use re.search to match the numbers between .. and store them in two different groups:
import re
first = []
last = []
for s in my_list:
match = re.search(r'(\d+)\.\.(\d+)', s)
first.append(match.group(1))
last.append(match.group(2))
DEMO.
I'd use a regular expression:
import re
num_range = re.compile(r'(\d+)\.\.(\d+)')
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
for entry in my_list:
match = num_range.search(entry)
if match is not None:
f, l = match.groups()
first.append(int(f))
last.append(int(l))
This outputs integers:
>>> first
[1, 34, 407]
>>> last
[23, 405, 4032]
One more solution.
for string in my_list:
numbers = string.split(" ")[-1]
first_num, last_num = numbers.split("..")
first.append(first_num)
last.append(last_num)
It will throw a ValueError if there is a string with no spaces in my_list or there is no ".." after the last space in some of the strings (or there is more than one ".." after the last space of the string).
In fact, this is a good thing if you want to be sure that values were really obtained from all the strings, and all of them were placed after the last space. You can even add a try…catch block to do something in case the string it tries to process is in an unexpected format.
first=[(i.split()[1]).split("..")[0] for i in my_list]
second=[(i.split()[1]).split("..")[1] for i in my_list]