extract substring between multiple words in a pandas dataframe

I have a pandas DataFrame and need to extract a substring from each row of a column, based on the following conditions:
We have a start_list ('one', 'once I', 'he') and an end_list ('fine', 'one', 'well').
The substring must be preceded by one of the elements of start_list.
The substring may be followed by one of the elements of end_list.
Whenever an element of start_list is present, the text that follows it should be extracted, whether or not an element of end_list is present.
Example Problem:
df = pd.DataFrame({'a' : ['one was fine today', 'we had to drive', ' ',
                          'I think once I was fine eating ham ',
                          'he studies really well and is polite ',
                          'one had to live well and prosper',
                          '43948785943one by onej89044809', '827364hjdfvbfv',
                          '&^%$&*+++===========one kfnv dkfjn uuoiu fine',
                          'they is one who makes me crazy'],
                   'b' : ['11', '22', '33', '44', '55', '66', '77', '', '88',
                          '99']})
Expected Result:
df = pd.DataFrame({'a' : ['was', '', '', 'was ', 'studies really', 'had to live',
                          'by', '', 'kfnv dkfjn uuoiu', 'who makes me crazy'],
                   'b' : ['11', '22', '33', '44', '55', '66', '77', '', '88',
                          '99']})

I think this should work for you. This solution requires pandas, of course, and also the standard-library module functools.
Function: remove_preceders
This function takes as input a collection of words, start_list, and a string, string. It checks each item of start_list in turn; whenever an item occurs in string, only the part of string after the first occurrence of that item is kept. If no item is found, the original string is returned unchanged.
def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string
Function: remove_succeeders
This function is very similar to the first, except that it returns only the piece of string that occurs before the items in end_list.
def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            string = string[:string.find(word)]
    return string
Function: to_apply
How do you actually run the above functions? The apply method lets you run arbitrary functions on a DataFrame or Series, but it passes the function only one argument: a full row (or column) when called on a DataFrame, or a single value when called on a Series.
The helper below takes a function to run and a collection of words to check, and returns a one-argument function that apply can call. We can use it to run the two functions above:
import functools

def to_apply(func, words_to_check):
    return functools.partial(func, words_to_check)
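To make the role of functools.partial concrete, here is a minimal sketch on plain strings, with no pandas involved. The name strip_start is only illustrative and does not appear in the answer; remove_preceders is repeated so the snippet runs on its own.

```python
import functools

def remove_preceders(start_list, string):
    # Keep only the text after each start word that is found.
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string

# partial() fixes the first argument (start_list), leaving a function
# of one argument, which is the shape that Series.apply expects.
strip_start = functools.partial(remove_preceders, ('one', 'once I', 'he'))

print(repr(strip_start('one was fine today')))  # ' was fine today'
print(repr(strip_start('we had to drive')))     # 'we had to drive'
```

The same effect could be had with a lambda; partial just avoids writing one for each word list.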
How to Run
df['no_preceders'] = df.a.apply(
    to_apply(remove_preceders,
             ('one', 'once I', 'he'))
)
df['no_succeders'] = df.a.apply(
    to_apply(remove_succeeders,
             ('fine', 'one', 'well'))
)
df['substring'] = df.no_preceders.apply(
    to_apply(remove_succeeders,
             ('fine', 'one', 'well'))
)
And then there's one final step to remove the items from the substring column that were not affected by the filtering:
def final_cleanup(row):
    if len(row['a']) == len(row['substring']):
        return ''
    else:
        return row['substring']

df['substring'] = df.apply(final_cleanup, axis=1)
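The steps above can also be sketched end to end on plain strings, which makes them easy to check without a DataFrame. The names extract, strip_start and strip_end are illustrative only. Two details are assumptions rather than part of the answer: the result is stripped of surrounding whitespace, and the "nothing matched" test is done right after the first step instead of by comparing lengths at the end.

```python
from functools import partial

START_WORDS = ('one', 'once I', 'he')
END_WORDS = ('fine', 'one', 'well')

def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string

def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            string = string[:string.find(word)]
    return string

strip_start = partial(remove_preceders, START_WORDS)
strip_end = partial(remove_succeeders, END_WORDS)

def extract(text):
    after_start = strip_start(text)
    if after_start == text:
        # No start word was found, so there is nothing to extract.
        return ''
    # Assumption: trim the whitespace left around the substring.
    return strip_end(after_start).strip()

samples = ['one was fine today', 'we had to drive',
           'he studies really well and is polite ',
           '43948785943one by onej89044809',
           'they is one who makes me crazy']
results = [extract(s) for s in samples]
print(results)
```

On a DataFrame, the same function could be applied with df['a'].apply(extract).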
Results
Hope this works.

Related

Sort values for both str and int by ranking appearance in a string

I have to sort keywords and values in a string.
This is my attempt:
import re
phrase='$1000 is the price of the car, it is 10 years old. And this sandwish cost me 10.34£'
list1 = re.findall(r'\d*\.?\d+', phrase)  # all the numbers in the phrase, in order of appearance: 1000, 10, 10.34
list2 = ['car', 'year', 'sandwish']  # the keywords I need to find in the phrase
joinedlist = list1 + list2  # the numbers and keywords together (the key elements of the sentence)
filter1 = sorted(joinedlist, key=phrase.find)  # sort the key elements by order of appearance in the phrase
print(filter1)
Unfortunately, in some cases the numbers are printed in the wrong order; I believe this is because the sorted function works by lexical sorting. In this case, the output is:
['1000', '10', 'car', 'year', 'sandwish', '10.34']
instead of:
['1000', 'car', '10', 'year', 'sandwish', '10.34']
since car appears before 10 in the initial phrase.

Lexical sorting has nothing to do with it, because your sorting key is the position in the original phrase; all the sorting is done by numeric values (the indices returned by find). The reason that the '10' is appearing "out of order" is that phrase.find returns the first occurrence of it, which is inside the 1000 part of the string!
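The collision is easy to confirm by printing the sort keys. A minimal check, using the phrase and number pattern from the question (the positions dictionary is only for illustration):

```python
import re

phrase = '$1000 is the price of the car, it is 10 years old. And this sandwish cost me 10.34£'
numbers = re.findall(r'\d*\.?\d+', phrase)
keywords = ['car', 'year', 'sandwish']

# str.find returns the index of the FIRST occurrence of the substring,
# so '10' is located inside '1000' rather than at the standalone 10.
positions = {item: phrase.find(item) for item in numbers + keywords}
for item, pos in positions.items():
    print(f'{item!r:>10} -> {pos}')
```

Both '1000' and '10' get the same key, and because sorted is stable they keep their original relative order, ahead of 'car'.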
Rather than breaking the sentence apart into two lists and then trying to reassemble them with a sort, why not just use a single regex that selects the different kinds of things you want to keep? That way you don't need to re-sort them at all:
>>> re.findall(r'\d*\.?\d+|car|year|sandwish', phrase)
['1000', 'car', '10', 'year', 'sandwish', '10.34']

The issue is that phrase.find returns the same index for both 10 and 1000: since 10 is a substring of 1000, both are first found at the same position near the start of the string.
You can keep the approach you are attempting by replacing phrase.find with a regex lookup that uses \b word boundaries, so that 10 only matches the standalone 10 in your string:
def finder(s):
    if m:=re.search(rf'\b{s}\b', phrase):
        return m.span()[0]
    elif m:=re.search(rf'\b{s}', phrase):
        return m.span()[0]
    return -1
Test it:
>>> sorted(joinedlist, key=finder)
['1000', 'car', '10', 'year', 'sandwish', '10.34']
It is easier, however, to turn phrase into a lookup list of words and search that for your keywords. You will need some treatment for year as a keyword vs. years in phrase; you can use r'\d+\.\d+|\w+' as the regex to find the words and then str.startswith() to test whether a word is close enough:
pl=re.findall(r'\d+\.\d+|\w+', phrase)
def finder2(s):
    try: # first try an exact match
        return pl.index(s)
    except ValueError:
        pass # not found; now try .startswith()
    try:
        return next(i for i,w in enumerate(pl) if w.startswith(s))
    except StopIteration:
        return -1
>>> sorted(joinedlist, key=finder2)
['1000', 'car', '10', 'year', 'sandwish', '10.34']

Regex "AND" in an expression extract this and that

I'm struggling to write a regex that extracts the following numbers in bold below.
I set up 3 different regex for each value, but since the last value might have a space in between I don't know how to accommodate an "AND" here.
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
I have tried this and it is working for the first 2 but not for the last one. I'd like to have the last one in a single regex.
p1 = re.compile(r'(\d+)/')
p2 = re.compile(r'/(\d+)')
p3 = re.compile(r'(?=.*[R](\d+))(?=.*[R]\s(\d+))')
I've tried several variations; the code above is the last one I tried, without success.
If I do this:
p1.findall(tire), p2.findall(tire), p3.findall(tire)
I would like to see this:
(['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17'])

You were almost there! You don't need three separate regular expressions.
Instead, use multiple capturing groups in a single regex.
(\d{3})\/(\d{2})R\s?(\d{2})
Try it: https://regex101.com/r/Xn6bry/1
Explanation:
(\d{3}): Capture three digits
\/: Match a forward-slash
(\d{2}): Capture two digits
R\s?: Match an R followed by an optional whitespace
(\d{2}): Capture two digits.
In Python, do:
import re

p1 = re.compile(r'(\d{3})\/(\d{2})R\s?(\d{2})')
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = re.findall(p1, tire)
Now if you look at matches, you get
[('275', '65', '18'), ('275', '65', '18'), ('265', '70', '17')]
Rearranging this to the format you want should be pretty straightforward:
# Make an empty list-of-list with three entries - one per group
groups = [[], [], []]
for match in matches:
    for groupnum, item in enumerate(match):
        groups[groupnum].append(item)
Now groups is [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
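The nested loop above can also be written as a transposition with zip. This is an alternative sketch, not part of the original answer; the name pattern is illustrative.

```python
import re

tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
pattern = re.compile(r'(\d{3})/(\d{2})R\s?(\d{2})')

# One tuple per tire size: (width, aspect ratio, rim diameter).
matches = pattern.findall(tire)

# zip(*matches) regroups the tuples column by column.
groups = [list(column) for column in zip(*matches)]
print(groups)
```

One difference to keep in mind: if nothing matches, zip(*matches) yields an empty list rather than three empty lists.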

removing multiple pipes from a list

I have some data that I have been trying to clean up. It's a list, and it looks like this:
a = [\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain]
I have tried to clean it up by doing this:
a.replace("\n", "|")
The output turns out like this:
[london||18||20||30||||japan||6||80||2|||Spain]
If I do this:
a.replace("\n","")
I get this:
[london,"", "", 18,"","",20"","",30,"","","",""japan,"",""6,"","",80,"","",2"","","","",Spain]
Can anyone explain why I am getting multiple pipes and spaces, and what the best way to clean the data is?

Assuming that your input is:
s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
The issue is that there are multiple '\n' characters between the data items, so replacing each '\n' with another character (say '|') gives you exactly as many of the new characters as there were '\n'.
The simplest approach is to use str.split(), which splits on runs of whitespace, to get the non-blank data:
l = s.split()
print(l)
# ['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
or, combine it with str.join(), if you want to have it separated by '|':
t = '|'.join(s.split())
print(t)
# london|18|20|30|japan|6|80|2|Spain
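If only newlines should act as separators (for example because an item may itself contain spaces), a regex can collapse each run of '\n' into a single '|'. This is a sketch of an alternative, not part of the answer above; the names naive and collapsed are illustrative.

```python
import re

s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'

# One-for-one replacement: every '\n' becomes its own '|'.
naive = s.replace('\n', '|')

# Strip the leading/trailing newlines, then replace each RUN of
# newlines with a single '|'.
collapsed = re.sub(r'\n+', '|', s.strip('\n'))

print(naive.count('|') == s.count('\n'))
print(collapsed)
```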

I tried it and got this:
a = ['\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain']
print(a[0].replace("\n", ""))
Output:
london182030japan6802Spain
Could you please clarify the exact input and the expected output? The input as posted does not look like valid Python, so I have taken some liberties.
If your input was a string you can use split():
a = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
print(a.split())
Output:
['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']

Extract alphabetic strings values from a list in Python

I have a list of different types of strings, and I want to combine all the alphabetic strings in the list into one single value.
For example:
['000000001', 'Aaron', 'Appindangoye', '26', '183', '84.8']
Here, I want to get Aaron Appindangoye together.

You can access the two names by index:
items = ['000000001', 'Aaron', 'Appindangoye', '26', '183', '84.8']
name = ' '.join(items[1:3])
print(name)
-> Aaron Appindangoye
See list in the doc
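If the positions of the names are not known in advance, the alphabetic entries can be selected by content instead. A minimal sketch using str.isalpha(), assuming that "alphabetic" means the string consists only of letters:

```python
items = ['000000001', 'Aaron', 'Appindangoye', '26', '183', '84.8']

# str.isalpha() is True only when every character is a letter,
# so the numeric strings and '84.8' are skipped.
name = ' '.join(item for item in items if item.isalpha())
print(name)
```

Note that names containing hyphens, apostrophes or spaces would be skipped by this test.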

Arrange list of strings that are divided into 4 parts by the different parts?

I have a list comprised of strings that all follow the same format 'Name%Department%Age'
I would like to order the list by age, then name, then department.
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
after sorting would output:
['Sarah%English%50', 'John%English%31', 'George%Maths%30', 'John%English%30', 'John%Maths%30']
The closest I have found to what I want is the following (found here: How to sort a list by Number then Letter in python?)
import re
def sorter(s):
    match = re.search(r'([a-zA-Z]*)(\d+)', s)
    return int(match.group(2)), match.group(1)
sorted(alist, key=sorter)
Out[13]: ['1', 'A1', '2', '3', '12', 'A12', 'B12', '17', 'A17', '25', '29', '122']
However, with my input format this only sorted the list alphabetically.
Any help appreciated,
Thanks.

You are on the right track.
Personally, I:
would first use str.split() to chop the string up into its constituent parts;
would then make the sort key produce a tuple that reflects the desired sort order.
For example:
def key(name_dept_age):
    name, dept, age = name_dept_age.split('%')
    return -int(age), name, dept
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
print(sorted(alist, key=key))

Use name, department, age = item.split('%') on each item.
Make a dict out of them {'name': name, 'department': department, 'age': age}
Then sort them using this code
https://stackoverflow.com/a/1144405/277267
sorted_items = multikeysort(items, ['-age', 'name', 'department'])
Experiment once with that multikeysort function, you will see that it will come in handy in a couple of situations in your programming career.
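The multikeysort function itself lives in the linked answer and is not reproduced here. As a rough stand-in, the sketch below sorts a list of dicts by several keys using successive stable sorts. The name multikeysort_sketch and its behaviour are assumptions for illustration, not the linked implementation.

```python
from operator import itemgetter

def multikeysort_sketch(records, columns):
    """Sort dicts by several keys; a leading '-' means descending."""
    result = list(records)
    # Apply the keys from least to most significant. list.sort is
    # stable (also with reverse=True), so earlier passes break ties.
    for column in reversed(columns):
        descending = column.startswith('-')
        result.sort(key=itemgetter(column.lstrip('-')), reverse=descending)
    return result

alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30',
         'John%English%31', 'George%Maths%30']

records = []
for item in alist:
    name, department, age = item.split('%')
    # Store age as an int so that it sorts numerically.
    records.append({'name': name, 'department': department, 'age': int(age)})

ordered = multikeysort_sketch(records, ['-age', 'name', 'department'])
result = ['%'.join([r['name'], r['department'], str(r['age'])]) for r in ordered]
print(result)
```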
