Extracting only characters from list items REGEX

I am practising regex and I would like to extract only the letters from the items in this list:
text=['aQx12', 'aub 6 5']
I want to ignore the numbers and the whitespace and keep only the letters. The desired output is as follows:
text=['aQx', 'aub']
I tried the code below, but it is not working properly:
import re
text=['aQx12', 'aub 6 5']
r = re.compile("\D")
newlist = list(filter(r.match, text))
print(newlist)
Can someone tell me what I need to fix?

You're testing each entire string, not its individual characters. You need to filter the characters within the strings.
Also, \D matches anything that isn't a digit, so it would include whitespace in the result. You want to match only letters, which is [a-z] with the re.I (ignore case) flag.
r = re.compile(r'[a-z]', re.I)
newlist = ["".join(filter(r.match, s)) for s in text]
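To see why the original attempt returned the list unchanged, the sketch below contrasts filtering whole strings with filtering the characters inside each string. The variable names are illustrative only.

```python
import re

text = ['aQx12', 'aub 6 5']

# r.match anchors at the start of its argument, so filtering the list only
# asks "does this whole string start with a non-digit?" and keeps both items.
non_digit = re.compile(r"\D")
unchanged = list(filter(non_digit.match, text))

# Iterating over a string yields one-character strings, so the same kind of
# filter applied per string keeps or drops individual characters.
letter = re.compile(r"[a-z]", re.I)
letters_only = ["".join(filter(letter.match, s)) for s in text]

print(unchanged)     # ['aQx12', 'aub 6 5']
print(letters_only)  # ['aQx', 'aub']
```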

You can use re.findall and then join the matches instead of using re.match and filter. Also use [a-zA-Z] to get only the letters.
>>> [''.join(re.findall('[a-zA-Z]', t)) for t in text]
['aQx', 'aub']

You can do this without a regex as well:
from string import ascii_letters
text=['aQx12', 'aub 6 5']
>>> [''.join([c for c in sl if c in ascii_letters]) for sl in text]
['aQx', 'aub']
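One point worth checking with non-regex approaches is which characters count as letters. The sketch below compares ascii_letters with str.isalpha (another common non-regex choice) on made-up sample data that contains accented letters.

```python
from string import ascii_letters

# Hypothetical sample data; the escapes are e-acute and i-diaeresis.
sample = ['aQx12', 'caf\u00e9 7', 'na\u00efve_1']

# ascii_letters contains only a-z and A-Z, so accented letters are dropped.
ascii_only = [''.join(c for c in s if c in ascii_letters) for s in sample]

# str.isalpha is true for any Unicode letter, so accented letters survive.
any_letter = [''.join(filter(str.isalpha, s)) for s in sample]

print(ascii_only)  # ['aQx', 'caf', 'nave']
print(any_letter)  # ['aQx', 'café', 'naïve']
```

Which behaviour is right depends on whether the input is expected to be plain ASCII.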

You can remove any chars other than letters in a list comprehension.
No regex solution:
print( [''.join(filter(str.isalpha, s)) for s in ['aQx12', 'aub 6 5']] )
Here is a regex-based solution:
import re
text=['aQx12', 'aub 6 5']
newlist = [re.sub(r'[^a-zA-Z]+', '', x) for x in text]
print(newlist)
# => ['aQx', 'aub']
If you need to handle any Unicode letters, use
re.sub(r'[\W\d_]+', '', x)
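As a quick check of the Unicode-aware pattern, here is a minimal sketch on made-up sample strings. It assumes Python 3, where str patterns are Unicode-aware by default.

```python
import re

# Hypothetical sample data; the escapes are sharp s and e-acute.
text = ['aQx12', 'Stra\u00dfe 6 5', '\u00e9t\u00e9_2024']

# \W covers non-word characters; \d and _ cover the word characters
# that are not letters, so the class matches everything except letters.
non_letters = re.compile(r'[\W\d_]+')
cleaned = [non_letters.sub('', x) for x in text]

print(cleaned)  # ['aQx', 'Straße', 'été']
```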

Related

Can string replace be written in list comprehension?

I have a text and a list.
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
I want to remove every element of the list from the string. I can do this:
for element in remove_list:
    text = text.replace(element, '')
I can also use a regex.
But can this be done in a list comprehension or any other one-liner?

You can use functools.reduce:
from functools import reduce
text = reduce(lambda x, y: x.replace(y, ''), remove_list, text)
# 'Some texts  that I want to  replace'

I would do this with re.sub to remove all the substrings in one pass:
>>> import re
>>> regex = '|'.join(map(re.escape, remove_list))
>>> re.sub(regex, '', text)
'Some texts  that I want to  replace'
Note that the result has two spaces instead of one where each part was removed. If you want each occurrence to leave just one space, you can use a slightly more complicated regex:
>>> re.sub(r'\s*(' + regex + r')', '', text)
'Some texts that I want to replace'
There are other ways to write similar regexes; this one will remove the space preceding a match, but you could alternatively remove the space following a match instead. Which behaviour you want will depend on your use-case.

You can do this with a regex by building a regex from an alternation of the words to remove, taking care to escape the strings so that the [ and ] in them don't get treated as special characters:
import re
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
regex = re.compile('|'.join(re.escape(r) for r in remove_list))
text = regex.sub('', text)
print(text)
Output:
Some texts  that I want to  replace
Since this may result in double spaces in the result string, you can remove them with replace, e.g.
text = regex.sub('', text).replace('  ', ' ')
Output:
Some texts that I want to replace
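The regex approach can be wrapped in a small helper. This is only a sketch: remove_all is a hypothetical name, and sorting the items longest first is an added precaution (not part of the answers above) so that a shorter item cannot pre-empt a longer one that starts with it.

```python
import re

def remove_all(text, remove_list):
    """Remove every string in remove_list from text in one regex pass."""
    # re.escape keeps characters such as [ and ] literal.
    parts = sorted(remove_list, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(p) for p in parts))
    # Collapse the runs of spaces left behind where a match was removed.
    return re.sub(r' {2,}', ' ', regex.sub('', text))

text = "Some texts [remove me] that I want to [and remove me] replace"
result = remove_all(text, ["[remove me]", "[and remove me]"])
print(result)  # Some texts that I want to replace
```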

Python Regex: Remove optional characters

I have a regex pattern with optional characters; however, I want to remove those optional characters from the output. Example:
string = 'a2017a12a'
pattern = re.compile("((20[0-9]{2})(.?)(0[1-9]|1[0-2]))")
result = pattern.search(string)
print(result)
I can have a match like this but what I want as an output is:
desired output = '201712'
Thank you.

You've already captured the intended data in groups, so you can use re.sub to replace the whole match with just the contents of group 1 and group 2.
Try this modified version of your code:
import re
string = 'a2017a12a'
pattern = re.compile(".*(20[0-9]{2}).?(0[1-9]|1[0-2]).*")
result = re.sub(pattern, r'\1\2', string)
print(result)
Notice how I've added .* around the pattern so that any extra characters around your data are matched and removed. I also removed the extra parentheses that were not needed. This will also work with strings that have other digits surrounding that text, such as hello123 a2017a12a some other 99 numbers
Output:
201712
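An alternative to rewriting the whole string with re.sub is to keep the search call from the question and concatenate the captured groups. A minimal sketch, with the optional separator left outside any group:

```python
import re

string = 'a2017a12a'
pattern = re.compile(r"(20[0-9]{2}).?(0[1-9]|1[0-2])")

match = pattern.search(string)
# search returns None when nothing matches, so guard before using groups.
result = match.group(1) + match.group(2) if match else None
print(result)  # 201712
```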

You can just use re.sub with the pattern \D (any character that is not a digit):
>>> import re
>>> string = 'a2017a12a'
>>> re.sub(r'\D', '', string)
'201712'

Try this one:
import re
string = 'a2017a12a'
pattern = re.findall(r"\d+", string)  # find every run of digits
print("".join(pattern))  # combine all digits
Output:
201712

If you want to remove all letters from the string, you can do this:
import re
string = 'a2017a12a'
re.sub('[A-Za-z]+','',string)
Output:
'201712'

You can use the re module's functions to get the required output, like this:
import re
#method 1
string = 'a2017a12a'
print (re.sub(r'\D', '', string))
#method 2
pattern = re.findall(r"\d+", string)
print("".join(pattern))
You can also refer to the documentation below for more details.
https://docs.python.org/3/library/re.html

Substring regex from characters to end of word

I am looking for a regex that will capture a substring beginning with a certain sequence of characters (http in my case) and running up to the next whitespace.
I am doing this in Python, working over a list of strings and replacing each 'bad' substring with ''.
The difficulty is that the sequence does not necessarily start a word. In the example below, the parts I want to capture are the runs that begin with http:
"Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous "
Thank you

Use findall:
>>> text = '''Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous '''
>>> import re
>>> re.findall(r'http\S+', text)
['httpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php', 'httpswwwgooglecomsilvous']
For substitution (if memory not an issue):
>>> rep = re.compile(r'http\S+')
>>> rep.sub('', text)
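Applied to a list of strings, as in the question, the substitution might look like the sketch below. The second list item and the whitespace clean-up step are additions for illustration.

```python
import re

strings = [
    "Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous ",
    "no link here",  # hypothetical item without any http run
]

url_like = re.compile(r'http\S+')

# Drop each http... run, then normalise the whitespace left behind.
cleaned = [' '.join(url_like.sub('', s).split()) for s in strings]
print(cleaned)  # ['Pasforcément merci', 'no link here']
```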

You can try this:
strings = [] #your list of strings goes here
import re
new_strings = [re.sub(r"https?.*?php|https?.*?$", '', i) for i in strings]

Regex to remove specific words in python

I want to do some manipulation using regex in Python.
So input is +1223,+12_remove_me,+222,+2223_remove_me
and
output should be +1223,+222
The output should contain only the comma-separated words that don't contain _remove_me, with a single comma between each word.
Note: I tried the regexes \+([0-9|+]*)_ and \+([0-9|+]*), along with some other combinations, but none of them gave the required output.
Note 2: I can't use a loop; I need to do this with a regex only.

Your regex seems incomplete, but you were on the right track. Note that a pipe symbol inside a character class is treated as a literal, so your [0-9|+] matches a digit, a | or a +.
You may use
,?\+\d+_[^,]+
Explanation:
,? - an optional , (optional because the "word" may be at the beginning of the string, where no comma precedes it)
\+ - a literal +
\d+ - 1+ digits
_ - a literal underscore
[^,]+ - 1+ characters other than ,
Python demo:
import re
p = re.compile(r',?\+\d+_[^,]+')
test_str = "+1223,+12_remove_me,+222,+2223_remove_me"
result = p.sub("", test_str)
print(result)
# => +1223,+222
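One edge case: when the very first item is removed, the comma that followed it stays at the front of the result. A minimal sketch of one way to handle that, where drop_marked is a hypothetical helper name:

```python
import re

p = re.compile(r',?\+\d+_[^,]+')

def drop_marked(s):
    # Removing the first item leaves its trailing comma at the front of
    # the result, so strip any leading comma afterwards.
    return p.sub('', s).lstrip(',')

print(drop_marked("+1223,+12_remove_me,+222,+2223_remove_me"))  # +1223,+222
print(drop_marked("+12_remove_me,+222"))                         # +222
```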

A non-regex approach would involve using str.split() and excluding items ending with _remove_me:
>>> s = "+1223,+12_remove_me,+222,+2223_remove_me"
>>> items = [item for item in s.split(",") if not item.endswith("_remove_me")]
>>> items
['+1223', '+222']
Or, if _remove_me can be present anywhere inside each item, use not in:
>>> items = [item for item in s.split(",") if "_remove_me" not in item]
>>> items
['+1223', '+222']
You can then use str.join() to join the items into a string again:
>>> ",".join(items)
'+1223,+222'

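In your case you need a regex with negation. Note, however, that
[^(_remove_me)]
is a negated character class: it matches any single character other than (, ), _, r, e, m, o and v, so it cannot exclude the word _remove_me as a whole. Excluding a whole word requires a negative lookahead, (?!...).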
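A negated character class excludes single characters, so rejecting a whole word is usually done with a negative lookahead instead. The following is a minimal sketch of that idea; the pattern is one possible formulation, not the only one.

```python
import re

s = "+1223,+12_remove_me,+222,+2223_remove_me"

# (?:^|,)              an item starts at the string start or after a comma
# (?![^,]*_remove_me)  reject it if _remove_me occurs before the next comma
# [^,]+                the item itself (captured by the surrounding group)
keep = re.compile(r'(?:^|,)((?![^,]*_remove_me)[^,]+)')

result = ','.join(keep.findall(s))
print(result)  # +1223,+222
```

findall returns only the captured items, and str.join rebuilds the comma-separated string without an explicit loop.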

You could perform this without a regex, just using string manipulation. The following can be written as a one-liner, but has been expanded for readability.
my_string = '+1223,+12_remove_me,+222,+2223_remove_me' #define string
my_list = my_string.split(',') #create a list of words
my_list = [word for word in my_list if '_remove_me' not in word] #stop here if you want a list of words
output_string = ','.join(my_list)

Why can’t I get rid of the L with this python regular expression?

I’m trying to get rid of the Ls at the ends of integers with a regular expression in python:
import re
s = '3535L sadf ddsf df 23L 2323L'
s = re.sub(r'\w(\d+)L\w', '\1', s)
However, this regex doesn't even change the string. I've also tried s = re.sub(r'\w\d+(L)\w', '', s) since I thought that maybe the L could be captured and deleted, but that didn't work either.

I'm not sure what you're trying to do with those \ws in the first place, but to match a string of digits followed by an L, just use \d+L, and to remove the L you just need to put the \d+ part in a capture group so you can sub it for the whole thing:
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> re.sub(r'(\d+)L', r'\1', s)
'3535 sadf ddsf df 23 2323'
Of course this will also convert, e.g., 123LBQ into 123BQ, but I don't see anything in your examples or in your description of the problem that indicates that this is possible, or which possible result you want for that, so…

\w = [a-zA-Z0-9_]
In other words, \w does not include whitespace characters. Each L is at the end of the word and therefore doesn't have any "word characters" following it. Perhaps you were looking for word boundaries?
re.sub(r'\b(\d+)L\b', r'\1', s)
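To illustrate what the word boundaries change, here is a small sketch on an extended sample string. The extra tokens 123LBQ and x23L are made up for the comparison.

```python
import re

s = '3535L sadf 123LBQ x23L 2323L'

# \b requires a word/non-word transition on both sides, so only standalone
# digits-plus-L tokens are rewritten. The replacement must be a raw string:
# '\1' without the r prefix is the control character \x01, not group 1.
result = re.sub(r'\b(\d+)L\b', r'\1', s)
print(result)  # 3535 sadf 123LBQ x23L 2323
```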

You can use a lookbehind assertion:
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> s = re.sub(r'(?<=\d)L\b', '', s)
>>> s
'3535 sadf ddsf df 23 2323'
(?<=\d)L asserts that the L is preceded by a digit, in which case it is replaced with the empty string ''.

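Try this:
re.sub(r'(?<=\d)L', '', s)
This uses a lookbehind to find an "L" that is preceded by a digit. Because the lookbehind does not capture anything, the replacement is the empty string rather than \1.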
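The lookbehind and capture-group forms give the same result here. The sketch below puts them side by side; the trailing \b is an optional extra so that an L in the middle of a word is left alone.

```python
import re

s = '3535L sadf ddsf df 23L 2323L'

# The lookbehind checks for a digit without consuming it, so only the L is
# part of the match and the replacement can be the empty string.
with_lookbehind = re.sub(r'(?<=\d)L\b', '', s)

# Same result with a capture group that is written back.
with_group = re.sub(r'(\d)L\b', r'\1', s)

print(with_lookbehind)  # 3535 sadf ddsf df 23 2323
print(with_group)       # 3535 sadf ddsf df 23 2323
```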

Why not use a - IMO more readable - generator expression?
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> ' '.join(x.rstrip('L') if x[-1:] =='L' and x[:-1].isdigit() else x for x in s.split())
'3535 sadf ddsf df 23 2323'
