I have a text file with multiple lines that all follow this pattern:
Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx
I wrote this code to get the value after Pre:
x2 = (re.findall(r'Pre:(\d+)',s))
I'm not very familiar with regex patterns, but this code doesn't return the value if it is + or empty (a None value).
Any suggestions on how to generalize the code to get whatever value comes after Pre: up to the next #, without the space?
How about this as the pattern? It will get everything until the next " #" but without being greedy (that's what the ? is for).
r"Pre:(.*?) #"
The example you've provided works just fine:
>>> import re
>>> s = 'Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx'
>>> re.findall(r'Pre:(\d+)', s)
['00']
You may need to add handling of +/- and ., for signed numbers and decimals: ([-+]?[\d.,]+).
If you need to match any string (not just numbers) you may want to use Pre:(.*?)\s*#.
Or you may avoid using regexps at all and split row by # separator:
>>> s.split('#')
['Server:x.x.x ', ' U:100 ', ' P:100 ', ' Pre:00 ', ' Tel:xxxxxx']
And then split each row at the first ':':
>>> for row in s.split('#'):
...     k, v = row.split(':', 1)
...     print(k.strip(), '=', v.strip())
...
Server = x.x.x
U = 100
P = 100
Pre = 00
Tel = xxxxxx
A non-regex approach would involve splitting by # and then by : forming a dictionary which would make accessing the parts of the string easy and readable:
>>> s = "Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx"
>>> d = dict([key.split(":") for key in s.split(" # ")])
>>> d["Pre"]
'00'
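One caveat with the dictionary approach, shown as a hedged sketch: key.split(":") returns more than two parts when a value itself contains a colon, and dict() then raises ValueError. Assuming values may contain colons or be empty, str.partition splits only at the first colon. The helper name parse_row and the sample row are hypothetical, not from the question.

```python
def parse_row(row):
    """Turn 'key:value # key:value ...' into a dict (illustrative helper)."""
    fields = {}
    for part in row.split("#"):
        # partition splits at the first ':' only, so later colons stay in the value
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Hypothetical row: the Server value contains a colon and Pre is empty
row = "Server:10.0.0.1:8080 # U:100 # P:100 # Pre: # Tel:xxxxxx"
fields = parse_row(row)
print(fields["Server"])     # 10.0.0.1:8080
print(repr(fields["Pre"]))  # ''
```

An empty Pre comes back as an empty string rather than None, so the caller can decide how to treat it.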
x2 = (re.findall(r'Pre:(.*?) #',s))
Pre:(.*?) #
Match the character string “Pre:” literally «Pre:»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character «.»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string “ #” literally « #»
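To check the lazy pattern against the cases from the question, here is a minimal sketch. The three sample lines are made up for illustration (the second and third use + and an empty value after Pre:); only the pattern itself comes from the answer.

```python
import re

# Hypothetical sample lines following the pattern from the question
lines = [
    "Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx",
    "Server:x.x.x # U:100 # P:100 # Pre:+ # Tel:xxxxxx",
    "Server:x.x.x # U:100 # P:100 # Pre: # Tel:xxxxxx",
]

pattern = re.compile(r"Pre:(.*?) #")

# One findall result per line; an empty value yields an empty string
values = [pattern.findall(line) for line in lines]
print(values)  # [['00'], ['+'], ['']]
```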
I am trying to split the first alphanumeric occurrence into its letter and digit parts, join them with a space, and keep the other occurrences as they are, but I can't find a pattern that does that.
For example:
string: Johnson12 is at club39
converted_string: Johnson 12 is at club39
Desired Output:
input = "Brijesh Tiwari810663 A14082014RGUBWA"
output = Brijesh Tiwari 810663 A14082014RGUBWA
Code:
import re
regex = re.compile("[0-9]{1,}")
testString = "Brijesh Tiwari810663 A14082014RGUBWA" # fill this in
a = re.split('([^a-zA-Z0-9])', testString)
print(a)
>> ['Brijesh', ' ', 'Tiwari810663', ' ', 'A14082014RGUBWA']
Here is one way. We can use re.findall on the pattern [A-Za-z]+|[0-9]+, which matches either a run of letters or a run of digits. Then, join the resulting list with spaces to get your output.
inp = "Brijesh Tiwari810663 A14082014RGUBWA"
output = ' '.join(re.findall(r'[A-Za-z]+|[0-9]+', inp))
print(output) # Brijesh Tiwari 810663 A 14082014 RGUBWA
Edit: For your updated requirement, use re.sub with just one replacement:
inp = "Johnson12 is at club39"
output = re.sub(r'\b([A-Za-z]+)([0-9]+)\b', r'\1 \2', inp, count=1)
print(output) # Johnson 12 is at club39
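As a quick check of the single-replacement idea, the sketch below wraps the same pattern in a function and runs it on both inputs from the question. The function name split_first_alnum is hypothetical; count=1 is what limits the substitution to the first word made of letters followed by digits.

```python
import re


def split_first_alnum(text):
    """Insert a space between letters and digits in the first matching word only."""
    return re.sub(r'\b([A-Za-z]+)([0-9]+)\b', r'\1 \2', text, count=1)


first = split_first_alnum("Johnson12 is at club39")
second = split_first_alnum("Brijesh Tiwari810663 A14082014RGUBWA")
print(first)   # Johnson 12 is at club39
print(second)  # Brijesh Tiwari 810663 A14082014RGUBWA
```

Words such as A14082014RGUBWA are left alone regardless of count, because the trailing \b cannot match between the digits and the letters that follow them.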
The POS tagger that I use processes the following string
3+2
as shown below.
3/num++/sign+2/num
I'd like to split this result as follows using python.
['3/num', '+/sign', '2/num']
How can I do that?
Use re.split -
>>> import re
>>> re.split(r'(?<!\+)\+', '3/num++/sign+2/num')
['3/num', '+/sign', '2/num']
The regex pattern will split on a + sign as long as no other + precedes it.
(?<! # negative lookbehind
\+ # plus sign
)
\+ # plus sign
Note that lookbehinds (in general) do not support varying length patterns.
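The commented breakdown above can be used directly as a pattern by compiling it with re.VERBOSE, which ignores whitespace and # comments inside the regex. This is only a sketch of the same lookbehind split; the constant name PLUS_SEPARATOR is made up for illustration.

```python
import re

PLUS_SEPARATOR = re.compile(r"""
    (?<!\+)   # negative lookbehind: the previous character is not a plus
    \+        # the plus sign to split on
""", re.VERBOSE)

tokens = PLUS_SEPARATOR.split('3/num++/sign+2/num')
print(tokens)  # ['3/num', '+/sign', '2/num']
```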
The tricky part I believe is the double + sign. You can replace the signs with special characters and get it done.
This should work,
st = '3/num++/sign+2/num'
st = st.replace('++', '#$')
st = st.replace('+', '#')
st = st.replace('$', '+')
print (st.split('#'))
One issue with this is that your original string cannot contain the special characters # and $, so you will need to choose them carefully for your use case.
Edit: This answer is naive. The one with regex is better
That is, as pointed out by COLDSPEED, you should use the following regex approach with lookbehind,
import re
print(re.split(r'(?<!\+)\+', '3/num++/sign+2/num'))
Although the ask was to use regex, here is an example on how to do this with standard .split():
my_string = '3/num++/sign+2/num'
my_list = []
result = []
# iterate over the split string
for e in my_string.split('/'):
    if '+' in e:
        if '++' in e:
            # split element on double + and add in + as well
            my_list.append(e.split('++')[0])
            my_list.append('+')
        else:
            # split element on single +
            my_list.extend(e.split('+'))
    else:
        # add element
        my_list.append(e)
# at this point my_list contains
# ['3', 'num', '+', 'sign', '2', 'num']
# iterate over the list in steps of 2
for i in range(0, len(my_list), 2):
    # add result
    result.append(my_list[i] + '/' + my_list[i+1])
print('result', result)
# result returns
# ['3/num', '+/sign', '2/num']
I'm creating a Python/Django app and I need to clean up a string. The main problem is that the string has too many line breaks in some parts. I don't want to delete all the line breaks, just the excess ones. How can I achieve this in Python? I'm using Python 2.7 and Django 1.6.
A regexp is one way. Using your updated sample input:
>>> a = "This is my sample text.\r\n\r\n\r\n\r\n\r\n Here start another sample text"
>>> import re
>>> re.sub(r'(\r\n){2,}','\r\n', a)
'This is my sample text.\r\n Here start another sample text'
r'(\r\n)+' would work too, I just like using the 2+ lower bound to avoid some replacements of singleton \r\n substrings with the same substring.
Or you can use the splitlines method on the string and rejoin after filtering:
>>> '\r\n'.join(line for line in a.splitlines() if line)
As an example, if you know what you want to replace:
>>> a = 'string with \n a \n\n few too many\n\n\n lines'
>>> a.replace('\n'*2, '\n') # Replaces \n\n with just \n
'string with \n a \n few too many\n\n lines'
>>> a.replace('\n'*3, '') # Replaces \n\n\n with nothing...
'string with \n a \n\n few too many lines'
Or, using regular expression to find what you want
>>> import re
>>> re.findall(r'.*([\n]+).*', a)
['\n', '\n\n', '\n\n\n']
import re
a = 'string with \n a \n\n few too many\n\n\n lines'
re.sub('\n+', '\n', a)
To use a regex to replace multiple occurrences of newline with a single one (or something else you prefer such as a period, tab or whatever), try:
import re
testme = 'Some text.\nSome more text.\n\nEven more text.\n\n\n\n\nThe End'
print(re.sub('\n+', '\n', testme))
Note that '\n' is a single-character (a newline), not two characters (literal backslash and 'n').
You can of course compile the regex in advance if you intend to re-use it:
pattern = re.compile('\n+')
print(pattern.sub('\n', testme))
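If the goal is to remove only the excess and keep, say, one blank line between paragraphs, the same idea works with a bounded quantifier. This is a sketch under that assumption; the function name limit_newlines and its max_run parameter are hypothetical.

```python
import re


def limit_newlines(text, max_run=2):
    """Shorten every run of more than max_run newlines to exactly max_run."""
    # Build e.g. r'\n{3,}' for max_run=2: three or more consecutive newlines
    pattern = re.compile(r'\n{%d,}' % (max_run + 1))
    return pattern.sub('\n' * max_run, text)


sample = 'a\n\n\n\n\nb\nc\n\nd'
print(repr(limit_newlines(sample)))             # 'a\n\nb\nc\n\nd'
print(repr(limit_newlines(sample, max_run=1)))  # 'a\nb\nc\nd'
```

With max_run=1 this behaves like the '\n+' substitution above; for Windows line endings the pattern would need to be adapted to \r\n.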
Best I could do, but Peter DeGlopper's was better.
import re
s = '\n' * 9 + 'abc' + '\n'*10
# s == '\n\n\n\n\n\n\n\n\nabc\n\n\n\n\n\n\n\n\n\n\n'
lines = re.compile('\n+')
excess_lines = lines.findall(s)
# excess_lines == ['\n' * 9, '\n' * 10]
# I feel as though there is a better way, but this works
def cmplen(first, second):
    '''
    Function to order strings in descending order by length
    Needed so that we replace longer strings of new lines first
    '''
    if len(first) < len(second):
        return 1
    elif len(first) > len(second):
        return -1
    else:
        return 0
# cmp= exists only in Python 2; in Python 3 use excess_lines.sort(key=len, reverse=True)
excess_lines.sort(cmp=cmplen)
# excess_lines == ['\n' * 10, '\n' * 9]
for lines in excess_lines:
    s = s.replace(lines, '\n')
# s = '\nabc\n'
This solution feels dirty and inelegant, but it works. You need to sort by string length because if you have a string '\n\n\n aaaaaaa \n\n\n\n' and do a replace(), the \n\n\n will replace \n\n\n\n with \n\n, and not be caught later on.
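The ordering problem described above can be reproduced in a few lines. This sketch uses the example string from the explanation and compares replacing the shorter run first with replacing the longer run first; the variable names are made up for illustration.

```python
s = '\n\n\n aaaaaaa \n\n\n\n'

# Shorter run first: the four trailing newlines are partly consumed,
# leaving two newlines that the later replace no longer matches
shortest_first = s.replace('\n\n\n', '\n').replace('\n\n\n\n', '\n')

# Longer run first: every run ends up as a single newline
longest_first = s.replace('\n\n\n\n', '\n').replace('\n\n\n', '\n')

print(repr(shortest_first))  # '\n aaaaaaa \n\n'
print(repr(longest_first))   # '\n aaaaaaa \n'
```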
I want to remove some symbols from a string using a regular expression, for example:
== (that occur both at the beginning and at the end of a line),
* (at the beginning of a line ONLY).
def some_func():
    clean = re.sub(r'= {2,}', '', clean) #Removes 2 or more occurrences of = at the beg and at the end of a line.
    clean = re.sub(r'^\* {1,}', '', clean) #Removes 1 or more occurrences of * at the beginning of a line.
What's wrong with my code? It seems like expressions are wrong. How do I remove a character/symbol if it's at the beginning or at the end of the line (with one or more occurrences)?
If you only want to remove characters from the beginning and the end, you could use the string.strip() method. This would give some code like this:
>>> s1 = '== foo bar =='
>>> s1.strip('=')
' foo bar '
>>> s2 = '* foo bar'
>>> s2.lstrip('*')
' foo bar'
The strip method removes the characters given in the argument from the beginning and the end of the string, lstrip removes them only from the beginning, and rstrip removes them only from the end.
If you really want to use a regular expression, they would look something like this:
clean = re.sub(r'(^={2,})|(={2,}$)', '', clean)
clean = re.sub(r'^\*+', '', clean)
But IMHO, using strip/lstrip/rstrip would be the most appropriate for what you want to do.
Edit: On Nick's suggestion, here is a solution that would do all this in one line:
clean = clean.lstrip('*').strip('= ')
(A common mistake is to think that these methods remove characters in the order they're given in the argument. In fact, the argument is just a set of characters to remove, in any order, which is why .strip('= ') removes every '=' and ' ' from the beginning and the end, and not just the string '= '.)
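Since the question talks about the beginning and end of a line, here is a small sketch that applies the one-liner to each line of a multi-line string. The helper name clean_line and the sample text are assumptions for illustration.

```python
def clean_line(line):
    """Drop leading '*' characters, then '=' and spaces from both ends."""
    return line.lstrip('*').strip('= ')


# Hypothetical input: a heading, a bullet, and a line with '=' in the middle
text = "== Title ==\n* item one\nplain = text"
cleaned = "\n".join(clean_line(line) for line in text.splitlines())
print(cleaned)
# Title
# item one
# plain = text
```

The '=' in the middle of the last line survives because strip only touches the ends of the string.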
You have extra spaces in your regexes. Even a space counts as a character.
r'^(?:\*|==)|==$'
First of all, you should pay attention to the spaces before "{": those are meaningful, so the quantifier in your example applies to the space.
To remove "=" (two or more) only at the beginning or end, you also need a different regexp, for example:
clean = re.sub(r'^(==+)?(.*?)(==+)?$', r'\2', s)
If you don't put either "^" or "$" the expression can match anywhere (i.e. even in the middle of the string).
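To see the effect of the anchors, this sketch compares an unanchored pattern with anchored ones on a multi-line string. It assumes the input has several lines, which is why re.MULTILINE is passed so that ^ and $ match at every line boundary; the sample text is made up.

```python
import re

text = "== Title ==\n* item\na == b"

# Unanchored: also removes the '==' in the middle of the last line
unanchored = re.sub(r'={2,}', '', text)

# Anchored: only runs of '=' at the start or end of a line are removed
anchored = re.sub(r'^={2,}|={2,}$', '', text, flags=re.MULTILINE)
# Leading '*' characters, again once per line
anchored = re.sub(r'^\*+', '', anchored, flags=re.MULTILINE)

print(repr(unanchored))  # ' Title \n* item\na  b'
print(repr(anchored))    # ' Title \n item\na == b'
```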
And what about capturing the part to keep instead of substituting?
tu = ('======constellation==' , '==constant=====' ,
'=flower===' , '===bingo=' ,
'***seashore***' , '*winter*' ,
'====***conditions=**' , '=***trees====***' ,
'***=information***=' , '*=informative***==' )
import re
RE = r'((===*)|\**)?(([^=]|=(?!=+\Z))+)'
pat = re.compile(RE)
for ch in tu:
    print(ch, ' ', pat.match(ch).group(3))
Result:
======constellation== constellation
==constant===== constant
=flower=== =flower
===bingo= bingo=
***seashore*** seashore***
*winter* winter*
====***conditions=** ***conditions=**
=***trees====*** =***trees====***
***=information***= =information***=
*=informative***== =informative***
Do you in fact want the following at the beginning of the string?
====***conditions=** to give conditions=**
***====hundred====*** to give hundred====***
I think that the following code will do the job:
tu = ('======constellation==' , '==constant=====' ,
'=flower===' , '===bingo=' ,
'***seashore***' , '*winter*' ,
'====***conditions=**' , '=***trees====***' ,
'***=information***=' , '*=informative***==' )
import re, codecs
with codecs.open('testu.txt', encoding='utf-8', mode='w') as f:
    pat = re.compile(r'(?:==+|\*+)?(.*?)(?:==+)?\Z')
    xam = max(map(len, tu)) + 3
    res = '\n'.join(ch.ljust(xam) + pat.match(ch).group(1)
                    for ch in tu)
    f.write(res)
    print(res)
Where was my brain when I wrote the RE in my earlier post ??! O!O
Non greedy quantifier .*? before ==+\Z is the real solution.
I need to convert an arbitrary string to a string that is a valid variable name in Python.
Here's a very basic example:
s1 = 'name/with/slashes'
s2 = 'name '
def clean(s):
    s = s.replace('/', '')
    s = s.strip()
    return s
# the _ is there so I can see the end of the string
print(clean(s1) + '_')
That is a very naive approach. I need to check if the string contains invalid variable name characters and replace them with ''
What would be a pythonic way to do this?
Well, I'd like to best Triptych's solution with ... a one-liner!
>>> import re
>>> def clean(varStr): return re.sub(r'\W|^(?=\d)','_', varStr)
...
>>> clean('32v2 g #Gmw845h$W b53wi ')
'_32v2_g__Gmw845h_W_b53wi_'
This substitution replaces every character that is not valid in a variable name with an underscore, and inserts an underscore in front if the string starts with a digit. IMO, 'name/with/slashes' looks better as the variable name name_with_slashes than as namewithslashes.
According to Python, an identifier is a letter or underscore, followed by an unlimited string of letters, numbers, and underscores:
import re
def clean(s):
    # Remove invalid characters
    s = re.sub('[^0-9a-zA-Z_]', '', s)
    # Remove leading characters until we find a letter or underscore
    s = re.sub('^[^a-zA-Z_]+', '', s)
    return s
Use like this:
>>> clean(' 32v2 g #Gmw845h$W b53wi ')
'v2gGmw845hWb53wi'
You can use the built-in str.isidentifier() in combination with filter().
This requires no imports such as re and works by iterating over each character and keeping it if it is a valid identifier on its own. Then you just do a ''.join to convert the filtered characters back to a string.
s1 = 'name/with/slashes'
s2 = 'name '
def clean(s):
    s = ''.join(filter(str.isidentifier, s))
    return s
print(f'{clean(s1)}_') #the _ is there so I can see the end of the string
EDIT:
If, like Hans Bouwmeester in the replies, you want numeric values to be included as well, you can create a lambda that uses both the isidentifier and the isdecimal methods to check the characters. Obviously this can be expanded as far as you want to take it. Code:
s1 = 'name/with/slashes'
s2 = 'name i2, i3 '
s3 = 'epng2 0-2g [ q4o 2-=2 t1 l32!##$%*(vqv[r 0-34 2]] '
def clean(s):
    s = ''.join(filter(
        lambda c: str.isidentifier(c) or str.isdecimal(c), s))
    return s
#the _ is there so I can see the end of the string
print(f'{ clean(s1) }_')
print(f'{ clean(s2) }_')
print(f'{ clean(s3) }_')
Gives :
namewithslashes_
namei2i3_
epng202gq4o22t1l32vqvr0342_
You should build a regex that's a whitelist of permissible characters and replace everything that is not in that character class.
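A minimal sketch of that whitelist idea follows. It assumes ASCII-only names are wanted (hence re.ASCII), replaces rejected characters with an underscore rather than dropping them, and also guards against leading digits and reserved words. The function name to_identifier is hypothetical.

```python
import keyword
import re


def to_identifier(text):
    """Map an arbitrary string to a valid ASCII Python identifier."""
    # Whitelist: keep [A-Za-z0-9_], replace everything else with '_'
    name = re.sub(r'\W', '_', text.strip(), flags=re.ASCII)
    # Identifiers cannot be empty or start with a digit
    if not name or name[0].isdigit():
        name = '_' + name
    # Reserved words such as 'class' are not usable as variable names
    if keyword.iskeyword(name):
        name += '_'
    return name


print(to_identifier('name/with/slashes'))  # name_with_slashes
print(to_identifier('32v2 g'))             # _32v2_g
print(to_identifier('class'))              # class_
```

Different inputs can map to the same name (for example 'a/b' and 'a b'), so this is not suitable where the result has to be unique.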