Check and remove a particular char from a string in Python

I'm in a situation where I have a string and a special symbol that is consecutively repeating, such as:
s = 'a.b.c...d..e.g'
How can I check whether it is repeating or not and remove consecutive symbols, resulting in this:
s = 'a.b.c.d.e.g'

import re
result = re.sub(r'\.{2,}', '.', 'a.b.c...d..e.g')
A bit more generalized version:
import re
symbol = '.'
regex_pattern_to_replace = re.escape(symbol)+'{2,}'
# Note that escape sequences are processed in replace_to
# but this time we have no backslash characters in it.
# In case of more complex replacement we could use
# replace_to = replace_to.replace('\\', '\\\\')
# to defend against occasional escape sequences.
replace_to = symbol
result = re.sub(regex_pattern_to_replace, replace_to, 'a.b.c...d..e.g')
The same with compiled regex (added after Cristian Ciupitu's comment):
compiled_regex = re.compile(regex_pattern_to_replace)
# You can store the compiled_regex and reuse it multiple times.
result = compiled_regex.sub(replace_to, 'a.b.c...d..e.g')
Check out the docs for re.sub
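The question also asks how to check whether the symbol repeats. Here is a minimal sketch of both steps. The helper names has_repeats and collapse_repeats are made up for illustration, and the non-capturing group is an addition so that a multi-character symbol would be handled as a unit:

```python
import re

def _run_pattern(symbol):
    # (?:...) makes {2,} apply to the whole symbol, not just its last character
    return re.compile("(?:" + re.escape(symbol) + "){2,}")

def has_repeats(text, symbol="."):
    """Return True if `symbol` occurs at least twice in a row in `text`."""
    return _run_pattern(symbol).search(text) is not None

def collapse_repeats(text, symbol="."):
    """Replace every run of `symbol` with a single occurrence."""
    # A function replacement avoids escape processing of the replacement string
    return _run_pattern(symbol).sub(lambda match: symbol, text)

print(has_repeats("a.b.c...d..e.g"))
print(collapse_repeats("a.b.c...d..e.g"))
```

Passing a function as the replacement means a backslash in the symbol needs no extra escaping.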

Simple and clear:
>>> a = 'a.b.c...d..e.g'
>>> while '..' in a:
...     a = a.replace('..', '.')
...
>>> a
'a.b.c.d.e.g'

Lots of answers, so why not throw another one into the mix?
You can zip the string with itself off by one and eliminate all matching '.'s:
''.join(x[0] for x in zip(s, s[1:]+' ') if x != ('.', '.'))
Certainly not the fastest, just interesting. It's trivial to turn this into eliminating all repeating elements:
''.join(a for a,b in zip(s, s[1:]+' ') if a != b)
Note: you can use izip_longest (py2) or zip_longest (py3) if ' ' as a filler causes an issue.
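A sketch of the zip_longest variant on Python 3, assuming you want to drop every adjacent duplicate. The function name dedupe_adjacent is made up for illustration:

```python
from itertools import zip_longest

def dedupe_adjacent(s):
    """Keep a character only when it differs from the one that follows it."""
    # zip_longest pads the shorter sequence with None, so the last character
    # is compared against None and is always kept.
    return "".join(a for a, b in zip_longest(s, s[1:]) if a != b)

print(dedupe_adjacent("a.b.c...d..e.g"))
print(repr(dedupe_adjacent("ab  ")))
```

With None as the filler, a string that ends in a space keeps that space, which the ' ' filler would drop.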

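My previous answer was a dud, so here's another attempt using reduce(). It makes a single pass over the string, although the repeated string concatenation may cost more than O(n) overall on long inputs:
from functools import reduce  # reduce is a builtin only on Python 2

def remove_consecutive(s, symbol='.'):
    def _remover(x, y):
        # Skip y when it repeats the symbol already at the end of x
        if y == symbol and x[-1:] == y:
            return x
        else:
            return x + y
    return reduce(_remover, s, '')

for s in 'abcdefg', '.a.', '..aa..', '..aa...b...c.d.e.f.g.....', '.', '..', '...', '':
    print(remove_consecutive(s))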
Output
abcdefg
.a.
.aa.
.aa.b.c.d.e.f.g.
.
.
.

Kind of complicated, but it works and it's being done in a single loop:
import itertools
def remove_consecutive(s, c='.'):
    return ''.join(
        itertools.chain.from_iterable(
            c if k else g
            for k, g in itertools.groupby(s, c.__eq__)
        )
    )

Related

How can we remove word with repeated single character?

I am trying to collapse words that consist of a single repeated character using regex in Python, for example:
good => good
gggggggg => g
What I have tried so far is following
re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
The problem with the above solution is that it changes good to god, and I only want to collapse words made up of a single repeated character.
A better approach here is to use a set
def modify(s):
    # Create a set from the string
    c = set(s)
    # If you have only one character in the set, convert set to string
    if len(c) == 1:
        return ''.join(c)
    # Else return original string
    else:
        return s
print(modify('good'))
print(modify('gggggggg'))
If you want to use regex, mark the start and end of the string in the regex with ^ and $ (inspired by #bobblebubble's comment):
import re
def modify(s):
    # Substitute with a regex which only matches if a single character is repeated,
    # marking the start and end of the string as well
    out = re.sub(r'^([a-z])\1+$', r'\1', s)
    return out
print(modify('good'))
print(modify('gggggggg'))
The output will be
good
g
If you do not want to use a set in your method, this should do the trick:
def simplify(s):
    l = len(s)
    if l > 1 and s.count(s[0]) == l:
        return s[0]
    return s
print(simplify('good'))
print(simplify('abba'))
print(simplify('ggggg'))
print(simplify('g'))
print(simplify(''))
output:
good
abba
g
g
Explanations:
You compute the length of the string
you count the number of characters that are equal to the first one and you compare the count with the initial string length
depending on the result you return the first character or the whole string
You can use the Trim method; take a look at this example:
"ggggggg".Trim('g');
Update:
and for characters which are in the middle of the string use this function, thanks to this answer
in C#:
public static string RemoveDuplicates(string input)
{
return new string(input.ToCharArray().Distinct().ToArray());
}
in python:
used = set()
unique = [x for x in mylist if x not in used and (used.add(x) or True)]
But I think none of these answers handles a string like aaaaabbbbbcda, where the a at the end of the string would be missing from the result (abcd). For this kind of situation, use this function, which I wrote:
In:
def unique(s):
    used = set()
    ret = list()
    s = list(s)
    for x in s:
        if x not in used:
            ret.append(x)
            used = set()
        used.add(x)
    return ret
print(unique('aaaaabbbbbcda'))
out:
['a', 'b', 'c', 'd', 'a']
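The same run-collapsing behaviour can be sketched with itertools.groupby, which groups consecutive equal items. The function name collapse_runs is made up for illustration:

```python
from itertools import groupby

def collapse_runs(s):
    """Collapse each run of identical consecutive characters to one character."""
    # groupby yields one (key, group) pair per run; only the key is needed
    return "".join(key for key, _group in groupby(s))

print(collapse_runs("aaaaabbbbbcda"))
print(collapse_runs("good"))
```

Like the regex in the question, this turns good into god, so it only suits the case where every run should be collapsed.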

Regex for Transformations (without using multiple statements)

What is the best way to use Regex to extract and transform one statement to another?
Specifically, I have implemented the below to find and extract a student number from a block of text and transform it as follows: AB123CD to AB-123-CD
Right now, this is implemented as 3 statements as follows:
gg['student_num'] = gg['student_test'].str.extract('(\d{2})\w{3}\d{2}') + \
'-' + gg['student_num'].str.extract('\d{2}(\w{3})\d{2}') + \
'-' + gg['student_test'].str.extract('\d{2}\w{3}(\d{2})')
It doesn't feel right to me that I would need three expressions, one for each group, concatenated together as above (or even more if this were more complicated). Is there a better way to find and transform some text?
You could get list of segments using regexp and then join them this way:
'-'.join(re.search(r'(\d{2})(\w{3})(\d{2})', string).groups())
You could get AttributeError if string doesn't contain needed pattern (re.search() returns None), so you might want to wrap this expression in try...except block.
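If the goal is a single find-and-transform step, re.sub with backreferences may be enough. This is a sketch that assumes the layout is two letters, three digits, two letters, as in the AB123CD example. The names STUDENT_RE and hyphenate are made up for illustration:

```python
import re

# Assumed layout: two letters, three digits, two letters (e.g. AB123CD)
STUDENT_RE = re.compile(r"([A-Za-z]{2})(\d{3})([A-Za-z]{2})")

def hyphenate(text):
    """Insert hyphens between the three captured groups wherever the pattern occurs."""
    # \1, \2 and \3 refer back to the captured groups
    return STUDENT_RE.sub(r"\1-\2-\3", text)

print(hyphenate("AB123CD"))
print(hyphenate("no student number here"))
```

When nothing matches, re.sub returns the text unchanged, so no try...except is needed. In pandas, Series.str.replace with regex=True should accept the same pattern and replacement, though I have not tested it against the data in the question.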
This is not regex, but it is quick and concise:
s = "AB123CD"
first = [i for i, a in enumerate(s) if a.isdigit()][0]
second = [i for i, a in enumerate(s) if a.isdigit()][-1]
new_form = s[:first]+"-"+s[first:second+1]+"-"+s[second+1:]
Output:
AB-123-CD
Alternative regex solution:
letters = re.findall("[a-zA-Z]+", s)
numbers = re.findall("[0-9]+", s)
letters.insert(1, numbers[0])
final = '-'.join(letters)
print(final)
Output:
AB-123-CD
Try this. Hope that helps
>>> import re
>>> s = r'ABC123DEF'
>>> n = re.search(r'\d+',s).group()
>>> f = re.findall(r'[A-Za-z]+',s)
>>> new_s = f[0]+"-"+n+"-"+f[1]
>>> new_s
Output:
'ABC-123-DEF'

Retrieving a full number

Assume I have a string as follows: expression = '123 + 321'.
I am walking over the string character-by-character as follows: for p in expression. I am checking if p is a digit using p.isdigit(). If p is a digit, I'd like to grab the whole number (so grab 123 and 321, not just p which initially would be 1).
How can I do that in Python?
In C (coming from a C background), the equivalent would be:
int x = 0;
sscanf(p, "%d", &x);
// the full number is now in x
EDIT:
Basically, I am accepting a mathematical expression from a user that accepts positive integers, +,-,*,/ as well as brackets: '(' and ')'. I am walking the string character by character and I need to be able to determine whether the character is a digit or not. Using isdigit(), I can do that. If it is a digit, however, I need to grab the whole number. How can that be done?
>>> from itertools import groupby
>>> expression = '123 + 321'
>>> expression = ''.join(expression.split()) # strip whitespace
>>> for k, g in groupby(expression, str.isdigit):
...     if k: # it's a digit
...         print('digit')
...         print(list(g))
...     else:
...         print('non-digit')
...         print(list(g))
...
digit
['1', '2', '3']
non-digit
['+']
digit
['3', '2', '1']
This is one of those problems that can be approached from many different directions. Here's what I think is an elegant solution based on itertools.takewhile:
>>> from itertools import chain, takewhile
>>> def get_numbers(s):
...     s = iter(s)
...     for c in s:
...         if c.isdigit():
...             yield ''.join(chain(c, takewhile(str.isdigit, s)))
...
>>> list(get_numbers('123 + 456'))
['123', '456']
This even works inside a list comprehension:
>>> def get_numbers(s):
...     s = iter(s)
...     return [''.join(chain(c, takewhile(str.isdigit, s)))
...             for c in s if c.isdigit()]
...
>>> get_numbers('123 + 456')
['123', '456']
Looking over other answers, I see that this is not dissimilar to jamylak's groupby solution. I would recommend that if you don't want to discard the extra symbols. But if you do want to discard them, I think this is a bit simpler.
The Python documentation includes a section on simulating scanf, which gives you some idea of how you can use regular expressions to simulate the behavior of scanf (or sscanf, it's all the same in Python). In particular, r'\-?\d+' is the Python string that corresponds to the regular expression for an integer. (r'\d+' for a nonnegative integer.) So you could embed this in your loop as
integer = re.compile(r'\-?\d+')
for p in expression:
    if p.isdigit():
        # somehow find the current position in the string
        integer.match(expression, curpos)
But that still reflects a very C-like way of thinking. In Python, your iterator variable p is really just an individual character that has actually been pulled out of the original string and is standing on its own. So in the loop, you don't naturally have access to the current position within the string, and trying to calculate it is going to be less than optimal.
What I'd suggest instead is using Python's built in regexp matching iteration method:
integer = re.compile(r'\-?\d+') # only do this once in your program
all_the_numbers = integer.findall(expression)
and now all_the_numbers is a list of string representations of all the integers in the expression. If you wanted to actually convert them to integers, then you could do this instead of the last line:
all_the_numbers = [int(m.group()) for m in integer.finditer(expression)]
Here I've used finditer instead of findall because you don't have to make a list of all the strings before iterating over them again to convert them to integers.
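Since the edited question mentions operators and brackets too, the same finditer idea can be stretched into a small tokenizer. This is only a sketch under the question's stated assumptions (non-negative integers, + - * / and parentheses). The names TOKEN_RE and tokenize are made up for illustration:

```python
import re

# Either a run of digits or a single operator/bracket character
TOKEN_RE = re.compile(r"\d+|[-+*/()]")

def tokenize(expression):
    """Return whole numbers as ints and operators/brackets as one-character strings."""
    tokens = []
    for match in TOKEN_RE.finditer(expression):
        text = match.group()
        tokens.append(int(text) if text.isdigit() else text)
    return tokens

print(tokenize("123 + 321"))
print(tokenize("(92831 * 948) / 32"))
```

Whitespace and any other unrecognized characters are silently skipped here. Real input validation would need to reject them instead.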
Though I'm not familiar with sscanf, I'm no C developer, it looks like it's using format strings in a way not dissimilar to what I'd use python's re module for. Something like this:
import re
nums = re.compile(r'\d+')
found = nums.findall('123 + 321')
# if you know you're only looking for two values.
left, right = found
You can use shlex http://docs.python.org/library/shlex.html
>>> from shlex import shlex
>>> expression = '123 + 321'
>>> for e in shlex(expression):
...     print(e)
...
123
+
321
>>> expression = '(92831 * 948) / 32'
>>> for e in shlex(expression):
...     print(e)
...
(
92831
*
948
)
/
32
I'd split the string up on the ' + ' string, giving you what's outside of them:
>>> expression = '123 + 321'
>>> ex = expression.split(' + ')
>>> ex
['123', '321']
>>> int_ex = list(map(int, ex))
>>> int_ex
[123, 321]
>>> sum(int_ex)
444
It's dangerous, but you could use eval:
>>> eval('123 + 321')
444
I'm just taking a stab at you parsing the string, and doing raw calculations on it.
e_array = expression.split('+')
i_array = list(map(int, e_array))
And i_array holds all integers in the expression.
UPDATE
If you already know all the special characters in your expression and you want to eliminate them all
import re
e_array = re.split(r'[*/+\-() ]', expression) # split on mult, div, plus, minus, left and right parentheses, and space
i_array = list(map(int, filter(None, e_array))) # drop empty strings before converting

Analyzing string input until it reaches a certain letter on Python

I need help in trying to write a certain part of a program.
The idea is that a person would input a bunch of gibberish and the program will read it till it reaches an "!" (exclamation mark) so for example:
input("Type something: ")
Person types: wolfdo65gtornado!salmontiger223
If I ask the program to print the input, it should only print wolfdo65gtornado and cut everything once it reaches the "!". The rest of the program is analyzing and counting the letters, but those parts I already know how to do. I just need help with the first part. I have been looking through the book, but it seems I'm missing something.
I'm thinking of maybe using a for loop and then placing a restriction on it, but I can't figure out how to have the inputted string analyzed for a certain character and then get rid of the rest.
If you could help, I'll truly appreciate it. Thanks!
The built-in str.partition() method will do this for you. Unlike str.split() it won't bother to cut the rest of the str into different strs.
text = input("Type something:")  # raw_input() on Python 2
left_text = text.partition("!")[0]
Explanation
str.partition() returns a 3-tuple containing the beginning, separator, and end of the string. The [0] gets the first item which is all you want in this case. Eg.:
"wolfdo65gtornado!salmontiger223".partition("!")
returns
('wolfdo65gtornado', '!', 'salmontiger223')
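The middle element of the 3-tuple also tells you whether the separator was present at all: it comes back as an empty string when no "!" is found. A small sketch, with the helper name split_at_marker made up for illustration:

```python
def split_at_marker(text, marker="!"):
    """Return (text before the marker, whether the marker was found)."""
    head, sep, _tail = text.partition(marker)
    # sep is "" when the marker does not occur; head is then the whole text
    return head, bool(sep)

print(split_at_marker("wolfdo65gtornado!salmontiger223"))
print(split_at_marker("wolfdo65gtornadosalmontiger223"))
```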
>>> s = "wolfdo65gtornado!salmontiger223"
>>> s.split('!')[0]
'wolfdo65gtornado'
>>> s = "wolfdo65gtornadosalmontiger223"
>>> s.split('!')[0]
'wolfdo65gtornadosalmontiger223'
If it doesn't encounter a "!" character, it will just grab the entire text. If you would like to report an error when there is no "!", you can do it like this:
s = "something!something"
if "!" in s:
    print("there is a '!' character in the context")
else:
    print("blah, you aren't using it right :(")
You want itertools.takewhile().
>>> s = "wolfdo65gtornado!salmontiger223"
>>> '-'.join(itertools.takewhile(lambda x: x != '!', s))
'w-o-l-f-d-o-6-5-g-t-o-r-n-a-d-o'
>>> s = "wolfdo65gtornado!salmontiger223!cvhegjkh54bgve8r7tg"
>>> i = iter(s)
>>> '-'.join(itertools.takewhile(lambda x: x != '!', i))
'w-o-l-f-d-o-6-5-g-t-o-r-n-a-d-o'
>>> '-'.join(itertools.takewhile(lambda x: x != '!', i))
's-a-l-m-o-n-t-i-g-e-r-2-2-3'
>>> '-'.join(itertools.takewhile(lambda x: x != '!', i))
'c-v-h-e-g-j-k-h-5-4-b-g-v-e-8-r-7-t-g'
Try this:
s = "wolfdo65gtornado!salmontiger223"
m = s.index('!')
l = s[:m]
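One caveat with s.index: it raises ValueError when the string contains no "!". A minimal sketch using str.find instead, which returns -1 in that case. The function name prefix_before is made up for illustration:

```python
def prefix_before(text, marker="!"):
    """Return the part of `text` before the first `marker`, or all of it if absent."""
    index = text.find(marker)  # -1 when the marker does not occur
    return text if index == -1 else text[:index]

print(prefix_before("wolfdo65gtornado!salmontiger223"))
print(prefix_before("no marker here"))
```

text.split(marker, 1)[0] gives the same result, since the maxsplit argument of 1 stops the splitting after the first marker.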
To explain accepted answer.
Splitting
The partition() method splits the string into a tuple with 3 elements:
mystring = "123splitABC"
x = mystring.partition("split")
print(x)
will give:
('123', 'split', 'ABC')
Access them like list elements:
print (x[0]) ==> 123
print (x[1]) ==> split
print (x[2]) ==> ABC
Suppose we have:
s = "wolfdo65gtornado!salmontiger223" + some_other_string
s.split("!")[0] is a problem if some_other_string contains a million strings, each a million characters long, separated by exclamation marks, because it splits the whole input; s.partition("!")[0] stops at the first "!" but still copies the remainder. I recommend the following instead, which stops reading at the first "!".
import itertools as itts
get_start_of_string = lambda stryng, last, *, itts=itts:\
    "".join(itts.takewhile(lambda ch: ch != last, stryng))
###########################################################
s = "wolfdo65gtornado!salmontiger223"
start_of_string = get_start_of_string(s, "!")
Why the itts=itts
Inside of the body of a function, such as get_start_of_string, itts is global.
itts is evaluated when the function is called, not when the function is defined.
Consider the following example:
color = "white"
get_fleece_color = lambda shoop: shoop + ", whose fleece was as " + color + " as snow."
print(get_fleece_color("Igor"))
# [... many lines of code later...]
color = "pink polka-dotted"
print(get_fleece_color("Igor's cousin, 3 times removed"))
The output is:
Igor, whose fleece was as white as snow.
Igor's cousin, 3 times removed, whose fleece was as pink polka-dotted as snow.
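To make the contrast concrete, here is a small sketch comparing a global looked up at call time with a keyword-only default captured at definition time. The function names late and early are made up for illustration:

```python
color = "white"

def late(name):
    # `color` is looked up in the global scope each time the function is called
    return name + " is " + color

def early(name, *, color=color):
    # The default value was evaluated once, when the def statement ran
    return name + " is " + color

color = "pink polka-dotted"

print(late("Igor"))
print(early("Igor"))
```

late follows the later reassignment, while early keeps the value it captured. The itts=itts default uses the same trick.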
You can extract the beginning of a string, up until the first delimiter is encountered, by using regular expressions.
import re
slash_if_special = lambda ch:\
    "\\" if ch in "\\^$.|?*+()[{" else ""
prefix_slash_if_special = lambda ch, *, _slash=slash_if_special: \
    _slash(ch) + ch
make_pattern_from_char = lambda ch, *, c=prefix_slash_if_special:\
    "^([^" + c(ch) + "]*)"
def get_string_up_untill(x_stryng, x_ch):
    i_stryng = str(x_stryng)
    i_ch = str(x_ch)
    assert(len(i_ch) == 1)
    # Build the pattern from the single delimiter character
    pattern = make_pattern_from_char(i_ch)
    m = re.match(pattern, i_stryng)
    return m.groups()[0]
An example of the code above being used:
s = "wolfdo65gtornado!salmontiger223"
result = get_string_up_untill(s, "!")
print(result)
# wolfdo65gtornado
We can use itertools:
import itertools
s = "wolfdo65gtornado!salmontiger223"
result = "".join(itertools.takewhile(lambda x: x != '!', s))
# result is "wolfdo65gtornado"

python string manipulation [duplicate]

I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"
I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E
i tried doing this:
p = re.compile("\(.*?\)", re.DOTALL)
new_string = p.sub("", s)
It gives output: AX&EUr)
Is there any way to correct this, rather than iterating each element in the string?
Another simple option is removing the innermost parentheses at every stage, until there are no more parentheses:
p = re.compile(r"\([^()]*\)")
count = 1
while count:
    s, count = p.subn("", s)
Working example: http://ideone.com/WicDK
You can just use string manipulation without regular expression
>>> s = "AX(p>q)&E(qUr)"
>>> [ i.split("(")[0] for i in s.split(")") ]
['AX', '&E', '']
I leave it to you to join the strings up.
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> re.compile("""\([^\)]*\)""").sub('', s)
'AX&E'
Yeah, it should be:
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> p = re.compile("\(.*?\)", re.DOTALL)
>>> new_string = p.sub("", s)
>>> new_string
'AX&E'
Nested brackets (or tags, ...) cannot be handled in a general way using regex. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details on why. You would need a real parser.
It's possible to construct a regex which can handle two levels of nesting, but they are already ugly, three levels will already be quite long. And you don't want to think about four levels. ;-)
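For this particular task a full parser may be more than is needed: a depth counter handles any nesting level in one pass. This is a sketch rather than a general parser, and the function name strip_parenthesized is made up for illustration:

```python
def strip_parenthesized(text):
    """Drop every character that sits inside (possibly nested) parentheses."""
    depth = 0
    kept = []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            # An unmatched ")" is dropped without letting depth go negative
            if depth > 0:
                depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)

print(strip_parenthesized("AX(p>q)&E((-p)Ur)"))
```

Characters are kept only while the depth is zero, so how deeply the brackets nest does not matter.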
You can use PyParsing to parse the string:
from pyparsing import nestedExpr
import sys
s = "AX(p>q)&E((-p)Ur)"
expr = nestedExpr('(', ')')
result = expr.parseString('(' + s + ')').asList()[0]
s = ''.join(filter(lambda x: isinstance(x, str), result))
print(s)
Most code is from: How can a recursive regexp be implemented in python?
You could use re.subn():
import re
s = 'AX(p>q)&E((-p)Ur)'
while True:
    s, n = re.subn(r'\([^)(]*\)', '', s)
    if n == 0:
        break
print(s)
Output
AX&E
this is just how you do it:
# strings
# double and single quotes use in Python
"hey there! welcome to CIP"
'hey there! welcome to CIP'
"you'll understand python"
'i said, "python is awesome!"'
'i can\'t live without python'
# use of 'r' before string
print(r"\new code", "\n")
first = "code in"
last = "python"
first + last #concatenation
# slicing of strings
user = "code in python!"
print(user)
print(user[5]) # print an element
print(user[-3]) # print an element from rear end
print(user[2:6]) # slicing the string
print(user[:6])
print(user[2:])
print(len(user)) # length of the string
print(user.upper()) # convert to uppercase
print(user.lstrip())
print(user.rstrip())
print(max(user)) # max alphabet from user string
print(min(user)) # min alphabet from user string
print(user.join(["1", "2", "3", "4"]))  # join needs an iterable of strings
input()
