python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length

You don't have to use regular expressions, or re at all. Assuming you want only the letters to remain, you could do something like this:
a = "A13:F20"
a = ''.join(filter(str.isalpha, a))  # 'AF'; the join is needed on Python 3, where filter returns an iterator

I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
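
If the references can contain multi-letter columns (for example AA13:BC20, which is an assumption that goes beyond the question), matching runs of letters instead of single letters keeps each column together. A minimal Python 3 sketch; the function name column_letters is hypothetical:

```python
import re

def column_letters(cell_range):
    # Each run of ASCII letters is returned as one item,
    # so the digits may be of any length.
    return re.findall(r'[A-Za-z]+', cell_range)

print(column_letters("A13:F20"))    # ['A', 'F']
print(column_letters("AA13:BC20"))  # ['AA', 'BC']
```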

Use a simple list comprehension as a filter to keep only the alphabetic characters from the string.
print([char for char in input_string if char.isalpha()])
# ['A', 'F']

You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>

If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?

Related

How can i keep the dot in a string while removing letters from the alphabet

I have a string: lst = 'sbs1.23444nroen'
I'm using lst2 = ''.join(filter(str.isdigit, lst)) to remove all the letters, so the result is: lst2 = '123444'
Is there any way to include the "." so that the result would be '1.23444', without the letters but keeping the dot?

A solution that is easier on the eye, and easy to extend if you want to keep more characters:
s = 'sbs1.23444nroen'
toKeep = set('0123456789.')
s = ''.join(ch for ch in s if ch in toKeep)
print(s)

lst2 = ''.join(filter(lambda x: str.isdigit(x) or x=='.', lst))

An alternative solution would be to use a regular expression, although the set solution is the best so far.
>>> re.findall(r"\d+\.\d*", lst)
['1.23444']
With the added benefit of grabbing other groups of numbers as well:
>>> re.findall(r"\d+\.?\d*", "sbs1.23444nroe631n")
['1.23444', '631']
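
The two approaches behave differently once the input holds more than one number, which may matter when choosing between them. A small Python 3 sketch of the difference, reusing the example string from the answer above:

```python
import re

lst = 'sbs1.23444nroe631n'

# Character filter: keeps every digit and dot, concatenated into one string.
allowed = set('0123456789.')
filtered = ''.join(ch for ch in lst if ch in allowed)

# Regex: returns each number as a separate item.
numbers = re.findall(r'\d+\.?\d*', lst)

print(filtered)  # 1.23444631
print(numbers)   # ['1.23444', '631']
```

The filter is the simpler choice when the string is known to contain a single number; the regex keeps separate numbers apart.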

Splitting string using different scenarios using regex

I have two scenarios for splitting this string:
"##$hello?? getting good.<li>hii"
Scenario 1: I want it split as 'hello', 'getting', 'good.<li>hii'
Scenario 2: I want it split as 'hello', 'getting', 'good', 'li', 'hi'
Any ideas, please?

Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
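
Both results start with an empty string because the input begins with delimiter characters. One way to drop it, sketched in Python 3 (the variable names are only for illustration), is to filter out empty pieces after splitting:

```python
import re

s = "##$hello?? getting good.<li>hii"

# Scenario 1: keep word characters plus '<', '>' and '.' together.
scenario_1 = [part for part in re.split(r"[^\w<>.]+", s) if part]

# Scenario 2: split on every run of non-word characters.
scenario_2 = [part for part in re.split(r"\W+", s) if part]

print(scenario_1)  # ['hello', 'getting', 'good.<li>hii']
print(scenario_2)  # ['hello', 'getting', 'good', 'li', 'hii']
```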

This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscore), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not digits, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches the characters in the brackets one or more times, as many as possible; in this case the lowercase range a-z and the uppercase range A-Z. Hope this is what you want.

For the first scenario, just use a regex to find all runs made up of word characters and the characters <, > and .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one, you need to collapse the repeated characters only if the word is not a valid English word. You can do this using the nltk corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']

In case you are looking for a solution without regex: string.punctuation gives you a string containing all ASCII punctuation characters.
Use it with a list comprehension to achieve your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: here is how it works, step by step:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']
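
The same replace-then-split idea can also be written with a translation table. This is a Python 3 sketch using str.maketrans, not part of the answer above:

```python
import string

my_string = '##$hello?? getting good.<li>hii'

# Map each punctuation character to a space, then split on whitespace.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
words = my_string.translate(table).split()

print(words)  # ['hello', 'getting', 'good', 'li', 'hii']
```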

Split a unicode string into components containing numbers and letters

I'd like to split the string u'123K' into 123 and K. I've tried re.match("u'123K", "\d+") to match the number and re.match("u'123K", "K") to match the letter but they don't work. What is a Pythonic way to do this?

Use re.findall() to find all numbers and characters:
>>> s = u'123K'
>>> re.findall(r'\d+|[a-zA-Z]+', s) # or use r'\d+|\D+' as mentioned in comment in order to match all numbers and non-numbers.
['123', 'K']
If you are just dealing with this string, or if you only want to split off the last character, you can simply use indexing:
num, character = s[:-1], s[-1:]

You can also use itertools.groupby method, grouping digits:
>>> import itertools as it
>>> for _, v in it.groupby(s, key=str.isdigit):
...     print(''.join(v))
...
123
K
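
To collect the pieces in a list instead of printing them, the groupby call can be wrapped in a list comprehension. A minimal Python 3 sketch, where every str is already Unicode so str.isdigit works as the key:

```python
from itertools import groupby

s = '123K'

# Consecutive characters with the same isdigit() result form one group.
parts = [''.join(group) for _, group in groupby(s, key=str.isdigit)]

print(parts)  # ['123', 'K']
```

The same expression also handles longer alternations, so 'A13F20' would be split at every switch between digits and non-digits.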

python: keep char only if it is within this list

I have a list:
a = ['a','b','c'.........'A','B','C'.........'Z']
and I have a string:
string1= 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
I want to keep ONLY those characters in string1 that exist in a.
What is the most efficient way to do this? Perhaps instead of having a be a list, I should just make it a string, like this: a='abcdefg..........ABC..Z'?

This should be faster.
>>> import re
>>> string1 = 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
>>> a = ['E', 'i', 'W']
>>> r = re.compile('[^%s]+' % ''.join(a))
>>> print r.sub('', string1)
EiiWW
This is even faster than that.
>>> all_else = ''.join( chr(i) for i in range(256) if chr(i) not in set(a) )
>>> string1.translate(None, all_else)
'EiiWW'
44 microsec vs 13 microsec on my laptop.
How about that?
(Edit: turned out, translate yields the best performance.)
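
One caveat with building a character class from a list: if the list ever contains regex metacharacters such as ] or ^, the pattern breaks or changes meaning. A hedged Python 3 sketch that escapes the characters first; the extended list a here is hypothetical:

```python
import re

string1 = 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
a = ['E', 'i', 'W', ']', '^']  # hypothetical list that includes metacharacters

# re.escape makes every special character literal inside the class.
pattern = re.compile('[^%s]+' % re.escape(''.join(a)))
result = pattern.sub('', string1)

print(result)  # EiiWW
```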

''.join([s for s in string1 if s in a])
Explanation:
[s for s in string1 if s in a]
creates a list of all characters in string1, but only if they are also in the list a.
''.join([...])
turns it back into a string by joining it with nothing ('') in between the elements of the given list.

List comprehension to the rescue!
wanted = ''.join(letter for letter in string1 if letter in a)
(Note that when passing a list comprehension to a function you can omit the brackets so that the full list isn't generated prior to being evaluated. While semantically the same as a list comprehension, this is called a generator expression.)
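
Membership tests against a list scan the list for every character. Converting the allowed characters to a set first is a common tweak; this Python 3 sketch assumes the list a holds the ASCII letters, as the question implies:

```python
import string

string1 = 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'

# Assumption: the allowed characters are the ASCII letters.
allowed = set(string.ascii_letters)

# Generator expression: no intermediate list is built by the caller.
wanted = ''.join(ch for ch in string1 if ch in allowed)

print(wanted)
```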

If you are going to do this with large strings, there is a faster solution using translate; see this answer.

@katrielalex: To spell it out:
import string
string1= 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
non_letters= ''.join(chr(i) for i in range(256) if chr(i) not in string.letters)
print string1.translate(None,non_letters)
print 'Simpler, but possibly less correct'
print string1.translate(None, string.punctuation+string.digits+string.whitespace)
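
The two-argument translate(None, ...) form and string.letters exist only in Python 2. A rough Python 3 equivalent, offered as a sketch, passes a single mapping in which a code point mapped to None is deleted:

```python
import string

string1 = 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'

# Characters that occur in the input but are not ASCII letters.
unwanted = set(string1) - set(string.ascii_letters)

# In Python 3, str.translate deletes code points that map to None.
table = {ord(ch): None for ch in unwanted}
letters_only = string1.translate(table)

print(letters_only)
```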

How do I coalesce a sequence of identical characters into just one?

Suppose I have this:
My---sun--is------very-big---.
I want to replace all multiple hyphens with just one hyphen.

import re
astr='My---sun--is------very-big---.'
print(re.sub('-+','-',astr))
# My-sun-is-very-big-.

If you want to replace any run of consecutive characters, you can use
>>> import re
>>> a = "AA---BC++++DDDD-EE$$$$FF"
>>> print(re.sub(r"(.)\1+",r"\1",a))
A-BC+D-E$F
If you only want to coalesce non-word-characters, use
>>> print(re.sub(r"(\W)\1+",r"\1",a))
AA-BC+DDDD-EE$FF
If it's really just hyphens, I recommend unutbu's solution.

If you really only want to coalesce hyphens, use the other suggestions. Otherwise you can write your own function, something like this:
>>> def coalesce(x):
...     n = []
...     for c in x:
...         if not n or c != n[-1]:
...             n.append(c)
...     return ''.join(n)
...
>>> coalesce('My---sun--is------very-big---.')
'My-sun-is-very-big-.'
>>> coalesce('aaabbbccc')
'abc'

As usual, there's a nice itertools solution, using groupby:
>>> from itertools import groupby
>>> s = 'aaaaa----bbb-----cccc----d-d-d'
>>> ''.join(key for key, group in groupby(s))
'a-b-c-d-d-d'

How about:
>>> import re
>>> re.sub("-+", "-", "My---sun--is------very-big---.")
'My-sun-is-very-big-.'
the regular expression "-+" will look for 1 or more "-".
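
The same pattern can be generalized to any single character. A minimal Python 3 sketch; coalesce_char is a hypothetical helper, and re.escape is used so that characters like + or . are treated literally:

```python
import re

def coalesce_char(text, ch):
    # Build "one or more of ch"; re.escape makes '+' or '.' literal.
    pattern = re.escape(ch) + '+'
    # A function replacement avoids backslash handling in the result.
    return re.sub(pattern, lambda match: ch, text)

print(coalesce_char('My---sun--is------very-big---.', '-'))  # My-sun-is-very-big-.
print(coalesce_char('a++b+++c', '+'))                        # a+b+c
```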

re.sub('-+', '-', "My---sun--is------very-big---")

How about an alternate without the re module:
'-'.join(filter(lambda w: len(w) > 0, 'My---sun--is------very-big---.'.split("-")))
Or going with Tim and FogleBird's previous suggestion, here's a more general method:
def coalesce_factory(x):
    return lambda sent: x.join(filter(lambda w: len(w) > 0, sent.split(x)))
hyphen_coalesce = coalesce_factory("-")
hyphen_coalesce('My---sun--is------very-big---.')
Though personally, I would use the re module first :)
mcpeterson

Another simple solution is the String object's replace function.
while '--' in astr:
    astr = astr.replace('--','-')

if you don't want to use regular expressions:
my_string = my_string.split('-')
my_string = filter(None, my_string)
my_string = '-'.join(my_string)

I have
my_str = 'a, b,,,,, c, , , d'
I want
'a,b,c,d'
Remove all the blanks (the replace call), then split on the comma, then join the non-empty pieces with a comma in between:
my_str_2 = ','.join([i for i in my_str.replace(" ", "").split(',') if i])
