Insert string in pandas column using regex if pattern is found - python

I have a string column in a dataframe and I'd like to insert a # at the beginning of my pattern.
For example:
My pattern is the letters 'pr' followed by any number of digits. If my column contains the value 'problem in pr123', I want to change it to 'problem in #pr123'.
I've tried a bunch of code snippets but nothing is working for me.
I also tried adapting a solution that produces 'pr#123' instead, but that didn't work either.
df['desc_clean'] = df['desc_clean'].str.replace(r'([p][r])(\d+)', r'\1#\2', regex=True)
What's the best way I can replace all values in this column when I find this pattern?

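If you need pr#123 you can use
df['desc_clean'] = df['desc_clean'].str.replace(r'(pr)(\d+)', r'\1#\2', regex=True)
To get #pr123, you can use
df['desc_clean'] = df['desc_clean'].str.replace(r'pr\d+', r'#\g<0>', regex=True)
To match pr as a whole word, you can add a word boundary, \b, in front of pr:
df['desc_clean'] = df['desc_clean'].str.replace(r'\bpr\d+', r'#\g<0>', regex=True)
Pass regex=True explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default.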
See the regex demo.
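The three replacements above can be checked without a dataframe. The following is a minimal sketch that uses re.sub from the standard library as a stand-in for str.replace with regex=True, since both accept the same pattern and replacement syntax; the sample strings are made up for illustration.

```python
import re

# Hypothetical sample values, not taken from the question's dataframe.
samples = ["problem in pr123", "cpr42 and pr42", "no match here"]

# \g<0> re-inserts the whole match, so '#' lands in front of it.
prefixed = [re.sub(r"pr\d+", r"#\g<0>", s) for s in samples]

# \b requires 'pr' to start a word, so the 'pr42' inside 'cpr42' is skipped.
bounded = [re.sub(r"\bpr\d+", r"#\g<0>", s) for s in samples]

# Two capture groups let '#' go between the letters and the digits.
infixed = [re.sub(r"(pr)(\d+)", r"\1#\2", s) for s in samples]

print(prefixed)
print(bounded)
print(infixed)
```

Note that 'problem' is left alone in every variant, because its 'pr' is not followed by a digit.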

Related

Why does this pandas str.extract pattern work?

I have a dataframe "movies" with column "title", which contains movie titles and their release year in the following format:
The Pirates (2014)
I'm testing different ways to extract just the title portion, which in the example above would be "The Pirates", into a new column.
I used pandas Series.str.extract() and found a regex pattern that works, but I'm not sure why it works.
movies['title_only'] = movies['title'].str.extract('(.*)[\s]', expand=True)
The above code correctly extracts "The Pirates" into a new column, but why doesn't it extract only "The" (everything before the first whitespace)?
.* uses the greedy quantifier *, meaning it will match as far into the string as possible and then back up only as far as needed for the trailing \s to match. To match only the first word, you can switch to the lazy quantifier *?. Also, note that you don't need square brackets around the \s: [\s] is equivalent to \s.
According to CAustin
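The difference between the greedy and lazy quantifier can be seen on the title from the question. This is a small sketch using re.search, which finds the first match in the string much as str.extract does; the third pattern, anchored on the year, is an illustrative alternative and not part of the original answer.

```python
import re

title = "The Pirates (2014)"

# Greedy: .* grabs everything, then backtracks to the LAST whitespace.
greedy = re.search(r"(.*)\s", title).group(1)

# Lazy: .*? grows one character at a time and stops at the FIRST whitespace.
lazy = re.search(r"(.*?)\s", title).group(1)

# Illustrative alternative: tie the match to the trailing "(YYYY)" explicitly.
anchored = re.search(r"(.*)\s\(\d{4}\)$", title).group(1)

print(repr(greedy), repr(lazy), repr(anchored))
```

The anchored form does not rely on the year being the only thing after the last space, which may matter for titles with other trailing text.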

Regex lookbehind and lookahead doesn't find any match

I have a lot of data that I need to parse and output in different format. The data looks something like this:
tag="001">utb20181009818<
tag="003">CZ PrNK<
...
Now I want to extract 'utb20181009818', which comes after 'tag="001">' and before the last '<'.
This is my code in Python:
regex_pattern = re.compile(r'''(?=(tag="001(.*?)">)).*?(?<=[<])''')
ID = regex_pattern.match(one_line)
print(ID)
My variable one_line already contains the necessary data and I just need to extract the value, but the pattern doesn't seem to match no matter what I do. I've looked at it for hours, but can't work out what I'm doing wrong.
Try the regex tag=\"001\">(.*?)< and read the value from the first capture group, ID.group(1).
The issue is that lookaheads don't move the match position to the right, because they are zero-width: they only check what follows and don't consume any characters.
Obviously, utilizing a match group as suggested would be the simplest way to go here, as you wouldn't have to take pains to avoid matching the parts you don't want.
But if your "001" isn't variable length, I think what you want is actually a lookbehind/lookahead (not lookahead/lookbehind):
(?<=tag="001">).*(?=<)
https://regex101.com/r/rMQnna/3/
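Both suggestions can be compared side by side. The sketch below assumes a hypothetical one_line in which the tag is not at the very start of the string (the question does not show the full line); under that assumption re.match, which only matches at the beginning, returns None, while re.search scans the whole line.

```python
import re

# Hypothetical input: the surrounding '<controlfield ...>' markup is an assumption.
one_line = '<controlfield tag="001">utb20181009818</controlfield>'

# re.match anchors at position 0, so it fails when the line starts with other text.
anchored = re.match(r'tag="001">(.*?)<', one_line)

# Capture-group approach: match the delimiters, keep only group 1.
captured = re.search(r'tag="001">(.*?)<', one_line).group(1)

# Lookbehind/lookahead approach: the delimiters are checked but not consumed.
lookaround = re.search(r'(?<=tag="001">).*(?=<)', one_line).group(0)

print(anchored, captured, lookaround)
```

Python's re module requires lookbehind patterns to have a fixed width, which is why the lookbehind variant only works when the text before the value does not vary in length.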

Python: Replacing alphanumeric values in Dataframe

I have words with \t and \r at the beginning of the words that I am trying to strip out without stripping the actual words.
For example "\tWant to go to the mall.\rTo eat something."
I have tried a few things from SO over three days. It's a Pandas DataFrame, so I thought this answer was the most relevant:
Pandas DataFrame: remove unwanted parts from strings in a column
But formulating from that for my own solution is not working.
i = df['Column'].replace(regex=False,inplace=False,to_replace='\t',value='')
I did not want to use regex, since the expression has been difficult to write. I am attempting to strip out '\t' and, if possible, '\r' as well.
Here is my regular expression: https://regex101.com/r/92CUV5/5
Try the following code:
import re
def remove_chars(text):
    # Delete every tab and carriage return, wherever it occurs in the string
    return re.sub(r'[\t\r]', '', text)
i = df['Column'].map(remove_chars)
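The character class can be tried on the sample string from the question. This is a minimal sketch on a plain string; the str.translate line is an extra regex-free alternative that the answer does not mention.

```python
import re

text = "\tWant to go to the mall.\rTo eat something."

# [\t\r] matches a single tab or carriage return anywhere in the string.
cleaned = re.sub(r"[\t\r]", "", text)

# Regex-free alternative: delete the same two characters with a translation table.
translated = text.translate(str.maketrans("", "", "\t\r"))

print(repr(cleaned))
```

The attempt in the question left the data unchanged because Series.replace with regex=False compares whole cell values against '\t' rather than looking for it inside each string.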

alternative regex to match all text in between first two dashes

I'm trying to use the regex \-(.*?)-|\-(.*?)* which seems to work fine on regexr, but Python says there's "nothing to repeat".
I'm trying to match all text between the first two dashes or, if there is no second dash, all text from the first - onwards.
Also, the regex above includes the dashes, but I would prefer to exclude them so I don't have to do an extra replace afterwards.
You can use re.search with this pattern:
-([^-]*)
Note that - doesn't need to be escaped.
Another way is to find the positions of the first two dashes and extract the substring between them. Or you can use split:
>>> 'aaaaa-bbbbbb-ccccc-ddddd'.split('-')[1]
'bbbbbb'
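The pattern covers both cases from the question because [^-]* stops either at the second dash or at the end of the string. Here is a short sketch; the helper name between_dashes is made up for illustration.

```python
import re

def between_dashes(text):
    """Return the text after the first '-' up to the next '-' or end of string."""
    m = re.search(r"-([^-]*)", text)
    # re.search returns None when the string contains no dash at all.
    return m.group(1) if m else None

print(between_dashes("aaaaa-bbbbbb-ccccc-ddddd"))
print(between_dashes("aaaaa-bbbbbb"))
print(between_dashes("aaaaa"))
```

Unlike split('-')[1], which raises IndexError when there is no dash, this version returns None in that case.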

Python regular expression to pull text inside of HTML quotation marks

I'm attempting to pull ticker symbols from corporations' 10-K filings on EDGAR. The ticker symbol typically appears between a pair of HTML quotation marks, e.g., "‘" or "’". An example of a typical portion of relevant text:
Our common stock has been listed on the New York Stock Exchange (“NYSE”) under the symbol “RXN”
At this point I am just trying to figure out how to deal with the occurrence of one or more of a variety of quotation marks. I can write a regex that matches one particular type of quotation mark:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*[^<]*\n',fileText)
However, I can't write a regex that looks for more than one type of quotation mark. This regex produces nothing:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*‘*’*“*[^<]*\n',fileText)
Any help would be appreciated.
Your regex looks for all of the quotes occurring together, one after another. If you're looking for any one of the possibilities, you need to put parentheses around each string and OR them together with |:
(?:“)*|(?:‘)*|(?:’)*|(?:“)*
The ?: makes the paren groups non-capturing. I.e., the parser won't save each one as important text. As an aside, you'll probably want to use group-capturing to save the ticker symbol -- what you're actually looking for. Very quick-and-dirty (and ugly) expression that will return ['NYSE', 'RXN'] from the given string:
re.findall(r'(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))(.+?)(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))', fileText)
You'd probably want to only include left-quotes in the first group and right-quotes in the last group. Plus either-or quotes in both.
You can use
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))), text)
This works because re.sub accepts a callable as the replacement. The number after "#" is the Unicode code point of the character, and Python's chr function converts it to text.
For example:
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))),
       "this is a &#8220;test&#8220;")
results in
'this is a “test“'
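Decoding the entities first and matching the quotes afterwards keeps the regex short. The sketch below uses html.unescape from the standard library in place of the re.sub callable; it assumes the filing encodes the quotation marks as &#8220; and &#8221;, which is a guess about the raw text rather than something the question shows.

```python
import html
import re

# Assumed raw form of the sentence from the question (entity encoding is a guess).
raw = ("Our common stock has been listed on the New York Stock Exchange "
       "(&#8220;NYSE&#8221;) under the symbol &#8220;RXN&#8221;")

# Turn numeric character references into the characters they stand for.
decoded = html.unescape(raw)

# Opening curly quote (double or single), lazy capture, closing curly quote.
quoted = re.findall(r"[\u201c\u2018](.+?)[\u201d\u2019]", decoded)

print(quoted)
```

html.unescape also maps the legacy references &#145; to &#148; to the corresponding curly quotes, so the same character classes cover both numbering schemes.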
