Using the .split() function based on conditions? - python

How would you be able to use the .split() function based on conditions?
Let's say I have the raw data:
Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli
My intended result is:
['Apples','Oranges','Strawberries','Green beans','Tomatoes','Broccoli']
Would it be possible to have it split at commas, and also wherever a space is followed by a capital letter?

The literal interpretation of what you asked for, using re.split:
import re
pat = re.compile(r'\s(?=[A-Z])|,')
pat.split(my_str)
This is more simply done, in your case:
pat = re.compile(r'.(?=[A-Z])')
Basically, split on any character that is followed by a capital letter.
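For instance, a quick check of that simpler pattern against the sample data (assigning the raw data to a variable here for illustration) gives the requested list:
import re

# The raw data from the question, assigned to a variable for this example.
my_str = "Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli"

pat = re.compile(r'.(?=[A-Z])')
print(pat.split(my_str))
# ['Apples', 'Oranges', 'Strawberries', 'Green beans', 'Tomatoes', 'Broccoli']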

Using regex will make the code simpler than a complicated split statement.
import re
...
re.findall(", [A-Z]",data)
Note you asked for a split on a comma, space, capital, but in your example there are no spaces after the commas.
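As a hedged alternative sketch (the pattern below is my own, not part of the original answer): instead of locating the split points, a findall can capture the items themselves, treating each item as a capitalized word optionally followed by lowercase words:
import re

data = "Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli"
# Each item starts with a capital letter and lowercase letters,
# optionally followed by further lowercase words (e.g. "Green beans").
items = re.findall(r'[A-Z][a-z]*(?: [a-z]+)*', data)
print(items)
# ['Apples', 'Oranges', 'Strawberries', 'Green beans', 'Tomatoes', 'Broccoli']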

Related

Extract names from a sentence with regex

I'm very new to the syntax of regex; I have already read a bit about the library. I'm trying to extract names from a simple sentence, but I found myself in trouble. Below I show an example of what I've done.
x = 'Fred used to play with his brother, Billy, both are 10 and their parents Jude and Edde have two more kids.'
import re
re.findall('^[A-Za-z ]+$',x)
Can anyone explain to me what is wrong and how to proceed?
Use
re.findall(r'\b[A-Z]\w*', x)
It matches words starting with an uppercase letter, followed by any number of letters, digits or underscores.
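For example, applied to the sentence from the question:
import re

x = 'Fred used to play with his brother, Billy, both are 10 and their parents Jude and Edde have two more kids.'
print(re.findall(r'\b[A-Z]\w*', x))
# ['Fred', 'Billy', 'Jude', 'Edde']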
I think your regex has two problems.
You want to extract names from the sentence, so you need to remove ^ (start of line) and $ (end of line).
A name starts with an uppercase letter and does not contain a space, so the space should be removed from your character class.
You could use the following regex:
\b[A-Z][A-Za-z]+\b
I also tested the result in Python:
x = 'Fred used to play with his brother, Billy, both are 10 and their parents Jude and Edde have two more kids.'
import re
result = re.findall(r'\b[A-Z][A-Za-z]+\b', x)
print(result)
Result:
['Fred', 'Billy', 'Jude', 'Edde']

Python: Replacing alphanumeric values in Dataframe

I have words with \t and \r at the beginning of the words that I am trying to strip out without stripping the actual words.
For example "\tWant to go to the mall.\rTo eat something."
I have tried a few things from SO over three days. It's a pandas DataFrame, so I thought this answer was the most relevant:
Pandas DataFrame: remove unwanted parts from strings in a column
But adapting that into my own solution is not working.
i = df['Column'].replace(regex=False,inplace=False,to_replace='\t',value='')
I did not want to use regex, since the expression has been difficult to write given that I am attempting to strip out '\t' and, if possible, also '\r'.
Here is my regular expression: https://regex101.com/r/92CUV5/5
Try the following code:
import re

def remove_chars(text):
    # Strip every tab and carriage return character from the string.
    return re.sub(r'[\t\r]', '', text)

i = df['Column'].map(remove_chars)
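As a quick sanity check, a minimal sketch with a single-column DataFrame built from the example sentence (the column name 'Column' is taken from the question's snippet):
import re
import pandas as pd

def remove_chars(text):
    return re.sub(r'[\t\r]', '', text)

df = pd.DataFrame({'Column': ["\tWant to go to the mall.\rTo eat something."]})
i = df['Column'].map(remove_chars)
print(i.iloc[0])
# Want to go to the mall.To eat something.
On a recent pandas version, df['Column'].str.replace(r'[\t\r]', '', regex=True) should do the same without a helper function.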

Detect abbreviations in the text in python

I want to find abbreviations in the text and remove them. What I am currently doing is identifying consecutive capital letters and removing them.
But I see that it does not remove abbreviations such as MOOCs, M.O.O.C, M.O.O.Cs. Is there an easy way of doing this in Python? Or are there any libraries that I can use instead?
The re regex library is probably the tool for the job.
In order to remove every string of consecutive uppercase letters, the following code can be used:
import re
mytext = "hello, look an ACRONYM"
mytext = re.sub(r"\b[A-Z]{2,}\b", "", mytext)
Here, the regex "\b[A-Z]{2,}\b" searches for multiple consecutive (indicated by [...]{2,}) capital letters (A-Z), forming a complete word (\b...\b). It then replaces them with the second string, "".
The convenient thing about regex is how easily it can be modified for more complex cases. For example:
mytext = re.sub(r"\b[A-Z\.]{2,}\b", "", mytext)
Will replace consecutive uppercase letters and full stops, removing acronyms like A.B.C.D. as well as ABCD. The \ before the . is not strictly required inside a character class, where . is already treated literally, but it makes the intent explicit; outside a character class an unescaped . acts as a wildcard.
The ? specifier could also be used to remove acronyms that end in s, for example:
mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "", mytext)
This regex will remove acronyms like ABCD, A.B.C.D, and even A.B.C.Ds. If other forms of acronym need to be removed, the regex can easily be modified to accommodate them.
The re library also includes functions like findall, or the match function, which allow for programs to locate and process each acronym individually. This might come in handy if you want to, for example, look at a list of the acronyms being removed and check there are no legitimate words there.
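A hedged sketch putting those pieces together (the sample text below is made up for illustration):
import re

text = "Students took MOOCs offered by NASA; the M.O.O.C model is popular."
pattern = r"\b[A-Z\.]{2,}s?\b"

# Inspect which acronyms the pattern would remove.
print(re.findall(pattern, text))   # ['MOOCs', 'NASA', 'M.O.O.C']

# Remove them from the text.
print(re.sub(pattern, "", text))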
An intuitive way would be to use a regex.
This regular expression does the job: ([A-Z]\.*){2,}s?
Which gives, in Python:
import re
re.sub(r"([A-Z]\.*){2,}s?", "", your_text)
Please see the regex documentation in case of doubt:
https://docs.python.org/2/library/re.html#re.sub

Splitting on regex without removing delimiters

So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is, I want to split the text on the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my problem has various delimiters (.?!) which complicates the problem.
You can use re.findall with regex .*?[.!\?]; the lazy quantifier *? makes sure each pattern matches up to the specific delimiter you want to match on:
import re
s = """You! Are you Tom? I am Danny."""
re.findall('.*?[.!\?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
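If the leading spaces on the second and third matches are unwanted, stripping each match is a simple follow-up:
import re

s = "You! Are you Tom? I am Danny."
print([m.strip() for m in re.findall(r'.*?[.!\?]', s)])
# ['You!', 'Are you Tom?', 'I am Danny.']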
Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
>>> import re
>>> re.split(r'(?<=[\.\!\?])\s*', s)
['You!', 'Are you Tom?', 'I am Danny.']
This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
If your Python version supports splitting on zero-length matches (Python 3.7 and later does), you can achieve this by matching an empty string preceded by one of the delimiters:
(?<=[.!?])
Demo: https://regex101.com/r/ZLDXr1/1
On older Python versions, re.split does not split on zero-length matches, yet the lookbehind approach is still useful in other languages that support it.
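A minimal check of the zero-length split on Python 3.7 or newer (note the trailing empty string produced by the split after the final period):
import re

s = "You! Are you Tom? I am Danny."
print(re.split(r'(?<=[.!?])', s))
# ['You!', ' Are you Tom?', ' I am Danny.', '']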
However, based on your input/output data samples, you rather need to split on spaces preceded by one of the delimiters. So the regex would be:
(?<=[.!?])\s+
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
If the spaces are optional, the re.findall solution suggested by #Psidom is the best one, I believe.
If you prefer the split method rather than findall, one solution is to split with a capturing group:
splitted = filter(None, re.split( r'(.*?[\.!\?])', s))
filter removes the empty strings, if any.
This will work even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation sign, such as a Unicode ellipsis, or that has no punctuation at all.
It is even possible to keep your regex as is (with an escaping correction and added parentheses):
splitted = filter(None, re.split( r'([\.!\?])', s))
Then merge the even and odd elements and remove the extra spaces, as shown in the sketch below.
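A minimal sketch of that merge step, using the sample string from the question:
import re

s = "You! Are you Tom? I am Danny."
parts = [p for p in re.split(r'([\.!\?])', s) if p]
# Even indices hold the sentence bodies, odd indices hold the delimiters;
# pair them back up and strip the extra spaces.
sentences = [(body + delim).strip() for body, delim in zip(parts[0::2], parts[1::2])]
print(sentences)
# ['You!', 'Are you Tom?', 'I am Danny.']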
Easiest way is to use nltk.
import nltk
nltk.sent_tokenize(s)
It will return a list of all your sentences without losing the delimiters.
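Depending on your nltk version, the Punkt sentence model may need to be downloaded once before sent_tokenize will run:
import nltk

nltk.download('punkt')  # one-time download of the sentence tokenizer model

s = "You! Are you Tom? I am Danny."
print(nltk.sent_tokenize(s))
# expected: ['You!', 'Are you Tom?', 'I am Danny.']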

python regex and replace

I am trying to learn Python and regex at the same time, and I am having some trouble finding out how to match up to the end of a string and make a replacement on the fly.
So, I have a string like so:
ss="this_is_my_awesome_string/mysuperid=687y98jhAlsji"
What I'd want is to first find 687y98jhAlsji (I do not know this content beforehand) and then replace it with myreplacedstuff, like so:
ss="this_is_my_awesome_string/mysuperid=myreplacedstuff"
Ideally, I'd want to do a regex and replace by first finding the contents after mysuperid= (till the end of string) and then perform a .replace or .sub if this makes sense.
I would appreciate any guidance on this.
You can try this:
re.sub(r'[^=]+$', 'myreplacedstuff', ss)
The idea is to use a character class that excludes the delimiter (here =) and to anchor the pattern with $.
explanation:
[^=] is a character class and means all characters that are not =
[^=]+ one or more characters from this class
$ end of the string
Since the regex engine works from left to right and the pattern is anchored at the end of the string, only the characters after the last = are matched.
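For example, applied to the string from the question:
import re

ss = "this_is_my_awesome_string/mysuperid=687y98jhAlsji"
print(re.sub(r'[^=]+$', 'myreplacedstuff', ss))
# this_is_my_awesome_string/mysuperid=myreplacedstuff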
You can use regular expressions:
>>> import re
>>> mymatch = re.search(r'mysuperid=(.*)', ss)
>>> ss.replace(mymatch.group(1), 'replacing_stuff')
'this_is_my_awesome_string/mysuperid=replacing_stuff'
You should probably use #Casimir's answer though. It looks cleaner, and I'm not that good at regex :p.
