python regular expression split function issue

I'm using python2 and I want to get rid of these empty strings in the output of the following python regular expression:
import re
x = "010101000110100001100001"
print re.split("([0-1]{8})", x)
and the output is this :
['', '01010100', '', '01101000', '', '01100001', '']
I just want to get this output:
['01010100', '01101000', '01100001']

Regex probably isn't what you want to use in this case. It seems that you want to just split the string into groups of n (8) characters.
I poached an answer from this question.
def split_every(n, s):
    return [s[i:i+n] for i in xrange(0, len(s), n)]
split_every(8, "010101000110100001100001")
Out[2]: ['01010100', '01101000', '01100001']
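Note that xrange is Python 2 only; in Python 3, use range instead.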

One possible way:
print filter(None, re.split("([0-1]{8})", x))
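In Python 2, filter returns a list, so this prints directly. If you move to Python 3, filter returns an iterator, so wrap it in list(); a minimal sketch:
import re

x = "010101000110100001100001"
# Python 3: filter() returns an iterator, so materialize it with list()
print(list(filter(None, re.split("([0-1]{8})", x))))
# ['01010100', '01101000', '01100001']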

import re
x = "010101000110100001100001"
l = re.split("([0-1]{8})", x)
l2 = [i for i in l if i]
out:
['01010100', '01101000', '01100001']

This is exactly what split is for: it splits the string, using the regular expression as the separator. Since your 8-character groups sit back to back, the text between separators is empty, which is where the empty strings come from.
If you need to find all matches, use findall instead:
import re
x = "010101000110100001100001"
print(re.findall("([0-1]{8})", x))
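This prints ['01010100', '01101000', '01100001'] directly, with no empty strings to filter out.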

print([a for a in re.split("([0-1]{8})", x) if a != ''])

Following your regex approach, you can simply use a filter to get your desired output.
import re
x = "010101000110100001100001"
unfiltered_list = re.split("([0-1]{8})", x)
print filter(None, unfiltered_list)
If you run this, you should get:
['01010100', '01101000', '01100001']


Need to remove quotes inside the array of a string using python

input = "'Siva', ['Aswin','latha'], 'Senthil',['Aswin','latha']"
expected output:
"'Siva', [Aswin,latha], 'Senthil',[Aswin,latha]"
I have used a positive lookbehind and lookahead but it's not working.
My pattern:
(?<=\[)\'+(?=\])
We can use re.sub here with a callback lambda function:
import re
inp = "'Siva', ['Aswin','latha'], 'Senthil',['Aswin','latha']"
output = re.sub(r'\[.*?\]', lambda x: x.group().replace("'", ""), inp)
print(output)
This prints:
'Siva', [Aswin,latha], 'Senthil',[Aswin,latha]
import re
input = "'Siva', ['Aswin','latha'], 'Senthil',['Aswin','latha']"
print(re.sub(r"(?<=\[).*?(?=\])", lambda val: re.sub(r"'(\w+?)'", r"\1", val.group()), input))
# 'Siva', [Aswin,latha], 'Senthil',[Aswin,latha]
You can try something like this if you don't want to import re:
X = eval("'Siva', ['Aswin','latha'], 'Senthil',['Aswin','latha']")
Y = []
for x in X:
    Y.append(f"[{', '.join(x)}]" if isinstance(x, list) else f"'{x}'")
print(", ".join(Y))
You can use re.findall with an alternation pattern to find either fragments between ] and [, or otherwise non-single-quote characters, and then join the fragments together with ''.join:
''.join(re.findall(r"[^\]]*\[|\][^\[]*|[^']+", input))
Demo: https://replit.com/#blhsing/ClassicFrostyBookmark
This is generally more efficient than using re.sub with a callback since there is overhead involved in making a callback for each match.
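For reference, here is the same one-liner as a small runnable snippet (renaming input to inp so the builtin isn't shadowed):
import re

inp = "'Siva', ['Aswin','latha'], 'Senthil',['Aswin','latha']"
# keep everything up to a '[', everything from a ']' onward, and any run without quotes
print(''.join(re.findall(r"[^\]]*\[|\][^\[]*|[^']+", inp)))
# 'Siva', [Aswin,latha], 'Senthil',[Aswin,latha]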

Not finding a good regex pattern to substitute the strings in a correct order(python)

I have a list of column names that are in string format like below:
lst = ["plug", "[plug+wallet]", "(wallet-phone)"]
Now I want to wrap each column name in df[''] using regex. My attempt mostly works, but when the list contains a string like (wallet-phone) it gives output like df[('wallet']-df['phone')]. How do I get (df['wallet']-df['phone']) instead? Is my pattern wrong? Please refer to it below:
import re
lst = ["plug", "[plug+wallet]", "(wallet-phone)"]
x=[]
y=[]
for l in lst:
    x.append(re.sub(r"([^+\-*\/'\d]+)", r"'\1'", l))
for f in x:
    y.append(re.sub(r"('[^+\-*\/'\d]+')", r'df[\1]', f))
print(x)
print(y)
gives:
x:["'plug'", "'[plug'+'wallet]'", "'(wallet'-'phone)'"]
y:["df['plug']", "df['[plug']+df['wallet]']", "df['(wallet']-df['phone)']"]
Is the pattern wrong?
Expected output:
x:["'plug'", "['plug'+'wallet']", "('wallet'-'phone')"]
y:["df['plug']", "[df['plug']+df['wallet']]", "(df['wallet']-df['phone'])"]
I also tried the pattern ([^+\-*\/()[]'\d]+), but it doesn't avoid () or [].
It might be easier to locate the words and wrap each one in the df[...] reference:
import re
lst = ["plug", "[plug+wallet]", "(wallet-phone)"]
z = [re.sub(r"(\w+)",r"df['\1']",w) for w in lst]
print(z)
["df['plug']", "[df['plug']+df['wallet']]", "(df['wallet']-df['phone'])"]

How to replace a string with another string in a list (python)

What is the best way to replace every string in the list?
For example if I have a list:
a = ['123.txt', '1234.txt', '654.txt']
and I would like to have:
a = ['123', '1234', '654']
Assuming that sample input is similar to what you actually have, use os.path.splitext() to remove file extensions:
>>> import os
>>> a = ['123.txt', '1234.txt', '654.txt']
>>> [os.path.splitext(item)[0] for item in a]
['123', '1234', '654']
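This also works for other extensions, not just .txt, and leaves names without an extension unchanged.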
Use a list comprehension as follows:
a = ['123.txt', '1234.txt', '654.txt']
answer = [item.replace('.txt', '') for item in a]
print(answer)
Output
['123', '1234', '654']
Assuming that all your strings end with '.txt', just slice the last four characters off.
>>> a = ['123.txt', '1234.txt', '654.txt']
>>> a = [x[:-4] for x in a]
>>> a
['123', '1234', '654']
This will also work if you have some weird names like 'some.txtfile.txt'
You could split on the . separator and take the first item:
In [486]: [x.split('.')[0] for x in a]
Out[486]: ['123', '1234', '654']
Another way to do this:
a = [x[: -len("txt")-1] for x in a]
What is the best way to replace every string in the list?
That completely depends on how you define 'best'. I, for example, like regular expressions:
import re
a = ['123.txt', '1234.txt', '654.txt']
answer = [re.sub(r'^(\w+)\..*', r'\g<1>', item) for item in a]
#print(answer)
#['123', '1234', '654']
Depending on the content of the strings, you could adjust it:
\w+ vs [0-9]+ for only digits
\..* vs \.txt if all strings end with .txt
data.colname = [item.replace('anythingtoreplace', 'desiredoutput') for item in data.colname]
Please note that here 'data' is the DataFrame and 'colname' is the name of a column in that DataFrame. Spaces are handled as well, if you want to remove them from a string or number; this was quite useful for me. It does not change the datatype of the column, so you may have to do that separately if required.
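As a minimal sketch of that idea, assuming a hypothetical DataFrame with a column named filename:
import pandas as pd

# hypothetical example data; 'filename' is an assumed column name
data = pd.DataFrame({'filename': ['123.txt', '1234.txt', '654.txt']})
data.filename = [item.replace('.txt', '') for item in data.filename]
print(data.filename.tolist())  # ['123', '1234', '654']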

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length
You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)
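In Python 2 this leaves you with the string 'AF'. In Python 3, filter returns an iterator, so join it back into a string; a small sketch:
a = "A13:F20"
# keep only the letters, then turn the filter object back into a string
letters = ''.join(filter(str.isalpha, a))
print(letters)  # AF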
I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
Use a simple list comprehension as a filter and keep only the alphabetic characters from the string.
input_string = "A13:F20"
print [char for char in input_string if char.isalpha()]
# ['A', 'F']
You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?

How do I coalesce a sequence of identical characters into just one?

Suppose I have this:
My---sun--is------very-big---.
I want to replace all multiple hyphens with just one hyphen.
import re
astr='My---sun--is------very-big---.'
print(re.sub('-+','-',astr))
# My-sun-is-very-big-.
If you want to replace any run of consecutive characters, you can use
>>> import re
>>> a = "AA---BC++++DDDD-EE$$$$FF"
>>> print(re.sub(r"(.)\1+",r"\1",a))
A-BC+D-E$F
If you only want to coalesce non-word-characters, use
>>> print(re.sub(r"(\W)\1+",r"\1",a))
AA-BC+DDDD-EE$FF
If it's really just hyphens, I recommend unutbu's solution.
If you really only want to coalesce hyphens, use the other suggestions. Otherwise you can write your own function, something like this:
>>> def coalesce(x):
...     n = []
...     for c in x:
...         if not n or c != n[-1]:
...             n.append(c)
...     return ''.join(n)
...
>>> coalesce('My---sun--is------very-big---.')
'My-sun-is-very-big-.'
>>> coalesce('aaabbbccc')
'abc'
As usual, there's a nice itertools solution, using groupby:
>>> from itertools import groupby
>>> s = 'aaaaa----bbb-----cccc----d-d-d'
>>> ''.join(key for key, group in groupby(s))
'a-b-c-d-d-d'
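groupby yields one key per run of consecutive equal characters, so joining just the keys collapses every run to a single character.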
How about:
>>> import re
>>> re.sub("-+", "-", "My---sun--is------very-big---.")
'My-sun-is-very-big-.'
the regular expression "-+" will look for 1 or more "-".
re.sub('-+', '-', "My---sun--is------very-big---")
How about an alternate without the re module:
'-'.join(filter(lambda w: len(w) > 0, 'My---sun--is------very-big---.'.split("-")))
Or going with Tim and FogleBird's previous suggestion, here's a more general method:
def coalesce_factory(x):
    return lambda sent: x.join(filter(lambda w: len(w) > 0, sent.split(x)))
hyphen_coalesce = coalesce_factory("-")
hyphen_coalesce('My---sun--is------very-big---.')
Though personally, I would use the re module first :)
Another simple solution is the String object's replace function.
while '--' in astr:
    astr = astr.replace('--', '-')
if you don't want to use regular expressions:
my_string = my_string.split('-')
my_string = filter(None, my_string)
my_string = '-'.join(my_string)
I have
my_str = 'a, b,,,,, c, , , d'
I want
'a,b,c,d'
Compress all the blanks (the replace part), then split on the comma, then join the non-empty pieces with a comma:
my_str_2 = ','.join([i for i in my_str.replace(" ", "").split(',') if i])
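For the sample string above, this produces 'a,b,c,d'.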
