Transforming a pairwise string into tuples - Python

I have the following string:
my_str = "StemCells(16.53),Bcells(13.59),Monocytes(11.58),abTcells(10.05),Macrophages(9.69), gdTCells(9.49),StromalCells(9.36),DendriticCells(9.20),NKCells(7.81),Neutrophils(2.71)"
What I want to do is transform it into this list of tuples:
[ ('StemCells', 16.530000000000001),
('Bcells', 13.59),
('Monocytes', 11.58),
('abTcells', 10.050000000000001),
('Macrophages', 9.6899999999999995),
('gdTCells', 9.4900000000000002),
('StromalCells', 9.3599999999999994),
('DendriticCells', 9.1999999999999993),
('NKCells', 7.8099999999999996),
('Neutrophils', 2.71)]
How can I do that conveniently in Python?

my_str = "StemCells(16.53),Bcells(13.59),Monocytes(11.58),abTcells(10.05),Macrophages(9.69), gdTCells(9.49),StromalCells(9.36),DendriticCells(9.20),NKCells(7.81),Neutrophils(2.71)"
import re
stuff = re.findall(r'(\w+)\(([0-9.]+)\)',my_str)
stuff
Out[4]:
[('StemCells', '16.53'),
('Bcells', '13.59'),
('Monocytes', '11.58'),
('abTcells', '10.05'),
('Macrophages', '9.69'),
('gdTCells', '9.49'),
('StromalCells', '9.36'),
('DendriticCells', '9.20'),
('NKCells', '7.81'),
('Neutrophils', '2.71')]
That gets you most of the way there; then it's just a bit of type conversion:
[(s,float(f)) for s,f in stuff]
Out[7]:
[('StemCells', 16.53),
('Bcells', 13.59),
('Monocytes', 11.58),
('abTcells', 10.05),
('Macrophages', 9.69),
('gdTCells', 9.49),
('StromalCells', 9.36),
('DendriticCells', 9.2),
('NKCells', 7.81),
('Neutrophils', 2.71)]
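If you prefer, the match and the float conversion can also be done in a single pass; a minimal sketch using the same pattern as above:
import re
pairs = [(name, float(value)) for name, value in re.findall(r'(\w+)\(([0-9.]+)\)', my_str)]
# [('StemCells', 16.53), ('Bcells', 13.59), ...]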

The simplest solution, without using regex:
my_str = "StemCells(16.53),Bcells(13.59),Monocytes(11.58),abTcells(10.05),Macrophages(9.69), gdTCells(9.49),StromalCells(9.36),DendriticCells(9.20),NKCells(7.81),Neutrophils(2.71)"
that_str = map(lambda s: s.rstrip(')').split('(') ,my_str.split(','))
that_str = map(lambda s: (s[0], float(s[1])), that_str)
>>> that_str
[('StemCells', 16.53), ('Bcells', 13.59), ('Monocytes', 11.58), ('abTcells', 10.05), ('Macrophages', 9.69), (' gdTCells', 9.49), ('StromalCells', 9.36), ('DendriticCells', 9.2), ('NKCells', 7.81), ('Neutrophils', 2.71)]
you could do the job in one pass, using an external function, instead of a lambda:
def build_tuple(s):
    t = s.rstrip(')').split('(')
    return (t[0], float(t[1]))
that_str = map(build_tuple, my_str.split(','))
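Two caveats worth noting: the source string has a space after one of the commas, which is why ' gdTCells' shows up with a leading space above, and on Python 3 map returns a lazy iterator rather than a list. A sketch that handles both:
that_list = [(name.strip(), float(value))
             for name, value in (s.rstrip(')').split('(') for s in my_str.split(','))]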

How about:
>>> zip(re.findall(r'([a-zA-Z]+)',my_str), map(float, re.findall(r'([0-9.]+)',my_str)))
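On Python 3 both zip and map are lazy, so wrap the result in list(); note also that the [a-zA-Z]+ pattern assumes the names contain no digits or underscores:
import re
list(zip(re.findall(r'[a-zA-Z]+', my_str), map(float, re.findall(r'[0-9.]+', my_str))))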


Cleanest way to concatenate a single string to many other elements contained in a list

What I have:
string = "string"
range_list = list(range(10))
What I want:
['string0',
'string1',
'string2',
'string3',
'string4',
'string5',
'string6',
'string7',
'string8',
'string9']
What I usually do:
import pandas as pd
(string+pd.Series(range_list).astype(str)).tolist()
What I would like to do:
obtain the same expected output from the same input, without importing libraries nor using loops
Since there is probably no way to do this that complies with all my requests, any other solution that is cleaner and/or better performing than mine is welcome. Thanks in advance.
You can do this using a list comprehension and an f-string.
[f"{string}{idx}" for idx in range_list]
You can use map with a function or a lambda to avoid using a loop.
def get_string(x):
    return f'string{x}'
list(map(get_string, range(10)))
or with a lambda:
list(map(lambda x: f'string{x}', range(10)))
For your case, you could write:
list(map(lambda x: f'{string}{x}', range_list))
Since you want solutions without loops (note this relies on numpy):
import numpy as np
string = "string"
range_list = list(range(10))
list(map(lambda x: string + x, np.array(range_list).astype(str)))
Or
list(map(lambda x: 'string{}'.format(x), range_list))
This works too:
string = "string"
str_list = [string + str(i) for i in range(10)]
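If "no loops" is read strictly as "no explicit for and no imports", one more stdlib-only sketch: str.__add__ is the + operator as a bound method, so it can be mapped directly:
list(map(string.__add__, map(str, range_list)))
# ['string0', 'string1', ..., 'string9']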

Python package for converting finite regex to a text array? [duplicate]

Suppose I have the following string:
trend = '(A|B|C)_STRING'
I want to expand this to:
A_STRING
B_STRING
C_STRING
The OR condition can be anywhere in the string, e.g. STRING_(A|B)_STRING_(C|D)
would expand to
STRING_A_STRING_C
STRING_B_STRING_C
STRING_A_STRING_D
STRING_B_STRING_D
I also want to cover the case of an empty conditional:
(|A_)STRING would expand to:
A_STRING
STRING
Here's what I've tried so far:
def expandOr(trend):
    parenBegin = trend.index('(') + 1
    parenEnd = trend.index(')')
    orExpression = trend[parenBegin:parenEnd]
    originalTrend = trend[0:parenBegin - 1]
    expandedOrList = []
    for oe in orExpression.split("|"):
        expandedOrList.append(originalTrend + oe)
But this is obviously not working.
Is there any easy way to do this using regex?
Here's a pretty clean way. You'll have fun figuring out how it works :-)
def expander(s):
    import re
    from itertools import product
    pat = r"\(([^)]*)\)"
    pieces = re.split(pat, s)
    pieces = [piece.split("|") for piece in pieces]
    for p in product(*pieces):
        yield "".join(p)
Then:
for s in ('(A|B|C)_STRING',
          '(|A_)STRING',
          'STRING_(A|B)_STRING_(C|D)'):
    print(s, "->")
    for t in expander(s):
        print(" ", t)
displays:
(A|B|C)_STRING ->
A_STRING
B_STRING
C_STRING
(|A_)STRING ->
STRING
A_STRING
STRING_(A|B)_STRING_(C|D) ->
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
The exrex package can generate all the strings a (finite) regex matches:
import exrex
trend = '(A|B|C)_STRING'
trend2 = 'STRING_(A|B)_STRING_(C|D)'
>>> list(exrex.generate(trend))
[u'A_STRING', u'B_STRING', u'C_STRING']
>>> list(exrex.generate(trend2))
[u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']
I would do this to extract the groups:
def extract_groups(trend):
    l_parens = [i for i, c in enumerate(trend) if c == '(']
    r_parens = [i for i, c in enumerate(trend) if c == ')']
    assert len(l_parens) == len(r_parens)
    return [trend[l+1:r].split('|') for l, r in zip(l_parens, r_parens)]
And then you can evaluate the product of those extracted groups using itertools.product:
expr = 'STRING_(A|B)_STRING_(C|D)'
from itertools import product
list(product(*extract_groups(expr)))
Out[92]: [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Now it's just a question of splicing those back onto your original expression. I'll use re for that :)
# python 3.3+
def _gen(it):
    yield from it

p = re.compile(r'\(.*?\)')
for tup in product(*extract_groups(trend)):
    gen = _gen(tup)
    print(p.sub(lambda x: next(gen), trend))
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
There's probably a more readable way to get re.sub to sequentially substitute things from an iterable, but this is what came off the top of my head.
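For instance, a somewhat more readable variant, sketched here assuming the extract_groups helper above: split the literal pieces out once, then interleave them with each product tuple:
import re
from itertools import product, zip_longest

def expand(trend):
    pieces = re.split(r'\([^)]*\)', trend)  # the literal text around the (…|…) groups
    for combo in product(*extract_groups(trend)):
        yield ''.join(a + b for a, b in zip_longest(pieces, combo, fillvalue=''))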
It is easy to achieve with sre_yield module:
>>> import sre_yield
>>> trend = '(A|B|C)_STRING'
>>> strings = list(sre_yield.AllStrings(trend))
>>> print(strings)
['A_STRING', 'B_STRING', 'C_STRING']
The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently... It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though -- these are cases that sre_parse did not optimize.

Split string into strings by length?

Is there a way to take a string that is 4*x characters long, and cut it into 4 strings, each x characters long, without knowing the length of the string?
For example:
>>>x = "qwertyui"
>>>split(x, one, two, three, four)
>>>two
'er'
>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']
I tried Alexander's answer but got this error in Python 3:
TypeError: 'float' object cannot be interpreted as an integer
This is because the division operator / in Python 3 returns a float. This works for me:
>>> x = "qwertyui"
>>> chunks, chunk_size = len(x), len(x)//4
>>> [ x[i:i+chunk_size] for i in range(0, chunks, chunk_size) ]
['qw', 'er', 'ty', 'ui']
Notice the // at the end of line 2, to ensure truncation to an integer.
:param s: str; source string
:param w: int; width to split on

Using the textwrap module (PyDocs-textwrap):
import textwrap
def wrap(s, w):
    return textwrap.fill(s, w)
:return str:

Inspired by Alexander's answer (PyDocs-data structures):
def wrap(s, w):
    return [s[i:i + w] for i in range(0, len(s), w)]
:return list:

Inspired by Eric's answer (PyDocs-regex):
import re
def wrap(s, w):
    sre = re.compile(rf'(.{{{w}}})')
    return [x for x in re.split(sre, s) if x]
:return list:
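Note the difference in return types; for example, the textwrap variant returns one newline-joined string rather than a list:
>>> import textwrap
>>> textwrap.fill("qwertyui", 2)
'qw\ner\nty\nui'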
some_string="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
x=3
res=[some_string[y-x:y] for y in range(x, len(some_string)+x,x)]
print(res)
will produce
['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU', 'VWX', 'YZ']
In Split string every nth character?, "the wolf" gives the most concise answer:
>>> import re
>>> re.findall('..','1234567890')
['12', '34', '56', '78', '90']
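One caveat with that pattern, if your input can have odd length: '..' silently drops the final character, while '.{1,2}' keeps it:
>>> re.findall('..', '123456789')
['12', '34', '56', '78']
>>> re.findall('.{1,2}', '123456789')
['12', '34', '56', '78', '9']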
Here is a one-liner that doesn't need to know the length of the string beforehand (data is the source string; on Python 2 the import was from StringIO import StringIO):
from functools import partial
from io import StringIO
[l for l in iter(partial(StringIO(data).read, 4), '')]
The two-argument form of iter calls read(4) repeatedly until it returns the sentinel '' at end of data. If you have a file or socket, then you don't need the StringIO wrapper:
[l for l in iter(partial(file_like_object.read, 4), '')]
def split2len(s, n):
    def _f(s, n):
        while s:
            yield s[:n]
            s = s[n:]
    return list(_f(s, n))
Here's a regex trick:
In [28]: import re
In [29]: x = "qwertyui"
In [30]: [x for x in re.split(r'(\w{2})', x) if x]
Out[30]: ['qw', 'er', 'ty', 'ui']
As a function, it might look like:
def split(string, split_len):
    # r'(.{2})', for example, works for all characters, not just \w
    regex = r'(.{%s})' % split_len
    return [x for x in re.split(regex, string) if x]
Here are two generic approaches, probably worth adding to your own library of reusables. The first requires the item to be sliceable; the second works with any iterable (but requires its constructor to accept an iterable).
def split_bylen(item, maxlen):
    '''
    Requires item to be sliceable (with __getitem__ defined).
    '''
    return [item[ind:ind + maxlen] for ind in range(0, len(item), maxlen)]
    # You could also replace the outer [ ] brackets with ( ) to get a generator.

def split_bylen_any(item, maxlen, constructor=None):
    '''
    Works with any iterable.
    Requires item's constructor to accept an iterable, or alternatively
    a constructor argument can be provided (otherwise item's class is used).
    '''
    if constructor is None:
        constructor = item.__class__
    return [constructor(part) for part in zip(*([iter(item)] * maxlen))]
    # OR: return map(constructor, zip(*([iter(item)] * maxlen)))
    # which would be faster if you need an iterable, not a list
So, in the OP's case, the usage is:
string = 'Baboons love bananas'
parts = 5
splitlen = -(-len(string) // parts)  # equivalent to math.ceil(len(string) / parts)
first_method = split_bylen(string, splitlen)
# Result: ['Babo', 'ons ', 'love', ' ban', 'anas']
second_method = split_bylen_any(string, splitlen, constructor=''.join)
# Result: ['Babo', 'ons ', 'love', ' ban', 'anas']
length = 4
string = "abcdefgh"
chars = list(string)
parts = [''.join(chars[j * length:(j + 1) * length]) for j in range(len(string) // length)]
# splitting a string by the length of another string
def len_split(string, sub_string):
    n, sub, str1 = list(string), len(sub_string), ')/^0*/-'
    for i in range(sub, len(n) + ((len(n) - 1) // sub), sub + 1):
        n.insert(i, str1)
    n = "".join(n)
    n = n.split(str1)
    return n

x = "divyansh_looking_for_intership_actively_contact_Me_here"
sub = "four"  # only its length (4) is used as the chunk size
print(len_split(x, sub))
# Result -> ['divy', 'ansh', '_loo', 'king', '_for', '_int', 'ersh', 'ip_a', 'ctiv', 'ely_', 'cont', 'act_', 'Me_h', 'ere']
There is a built-in function in Python for that:
import textwrap
text = "Your Text.... and so on"
width = 5
textwrap.wrap(text, width)
Voilà! Note that textwrap.wrap breaks at whitespace by default, so chunks can come out shorter than width; for a string without spaces it splits strictly by length.
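For example, on the earlier "qwertyui" input (no whitespace, so the chunks are exact):
>>> import textwrap
>>> textwrap.wrap("qwertyui", 2)
['qw', 'er', 'ty', 'ui']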
And for dudes who prefer it to be a bit more readable:
def itersplit_into_x_chunks(string, x=10):  # we assume here that x is an int and > 0
    size = len(string)
    chunksize = size // x
    for pos in range(0, size, chunksize):
        yield string[pos:pos + chunksize]
output:
>>> list(itersplit_into_x_chunks('qwertyui',x=4))
['qw', 'er', 'ty', 'ui']
My solution:
st = ' abs de fdgh 1234 556 shg shshh'
print(st)

def splitStringMax(si, limit):
    ls = si.split()
    lo = []
    st = ''
    ln = len(ls)
    if ln == 1:
        return [si]
    i = 0
    for l in ls:
        st += l
        i += 1
        if i < ln:
            lk = len(ls[i])
            if len(st) + 1 + lk < limit:
                st += ' '
                continue
        lo.append(st)
        st = ''
    return lo

############################
print(splitStringMax(st, 7))
# ['abs de', 'fdgh', '1234', '556', 'shg', 'shshh']
print(splitStringMax(st, 12))
# ['abs de fdgh', '1234 556', 'shg shshh']
l = 'abcdefghijklmn'

def group(l, n):
    tmp = len(l) % n
    zipped = list(zip(*[iter(l)] * n))  # list() is needed on Python 3, where zip is lazy
    return zipped if tmp == 0 else zipped + [tuple(l[-tmp:])]

print(group(l, 3))
String splitting is required in many cases, such as when you have to sort the characters of a given string or replace one character with another. All of these operations can be performed with the string-splitting methods below.
String splitting can be done in two ways:
Slicing the given string based on the length of the split.
Converting the given string to a list with list(str), where the characters of the string become the elements of the list; then doing the required operation and joining them with 'separator'.join(list) to get a new processed string. A sketch of this second approach follows.
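A minimal sketch of the list-then-join approach (the per-character operation here, uppercasing every other character, is just a placeholder):
chars = list("abcdef")
chars = [c.upper() if i % 2 else c for i, c in enumerate(chars)]
result = "".join(chars)  # 'aBcDeF'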

How do I coalesce a sequence of identical characters into just one?

Suppose I have this:
My---sun--is------very-big---.
I want to replace all multiple hyphens with just one hyphen.
import re
astr='My---sun--is------very-big---.'
print(re.sub('-+','-',astr))
# My-sun-is-very-big-.
If you want to replace any run of consecutive characters, you can use
>>> import re
>>> a = "AA---BC++++DDDD-EE$$$$FF"
>>> print(re.sub(r"(.)\1+",r"\1",a))
A-BC+D-E$F
If you only want to coalesce non-word-characters, use
>>> print(re.sub(r"(\W)\1+",r"\1",a))
AA-BC+DDDD-EE$FF
If it's really just hyphens, I recommend unutbu's solution.
If you really only want to coalesce hyphens, use the other suggestions. Otherwise you can write your own function, something like this:
>>> def coalesce(x):
... n = []
... for c in x:
... if not n or c != n[-1]:
... n.append(c)
... return ''.join(n)
...
>>> coalesce('My---sun--is------very-big---.')
'My-sun-is-very-big-.'
>>> coalesce('aaabbbccc')
'abc'
As usual, there's a nice itertools solution, using groupby:
>>> from itertools import groupby
>>> s = 'aaaaa----bbb-----cccc----d-d-d'
>>> ''.join(key for key, group in groupby(s))
'a-b-c-d-d-d'
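If you want groupby to coalesce only the hyphen runs and leave other repeated characters alone, a small variation on the same idea:
>>> ''.join(k if k == '-' else ''.join(g) for k, g in groupby('My---sun--is------very-big---.'))
'My-sun-is-very-big-.'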
How about:
>>> import re
>>> re.sub("-+", "-", "My---sun--is------very-big---.")
'My-sun-is-very-big-.'
The regular expression "-+" will look for one or more "-".
re.sub('-+', '-', "My---sun--is------very-big---")
How about an alternate without the re module:
'-'.join(filter(lambda w: len(w) > 0, 'My---sun--is------very-big---.'.split("-")))
Or going with Tim and FogleBird's previous suggestion, here's a more general method:
def coalesce_factory(x):
    return lambda sent: x.join(filter(lambda w: len(w) > 0, sent.split(x)))
hyphen_coalesce = coalesce_factory("-")
hyphen_coalesce('My---sun--is------very-big---.')
Though personally, I would use the re module first :)
Another simple solution is the String object's replace function.
while '--' in astr:
    astr = astr.replace('--', '-')
If you don't want to use regular expressions:
my_string = my_string.split('-')
my_string = filter(None, my_string)
my_string = '-'.join(my_string)
I have
my_str = 'a, b,,,,, c, , , d'
I want
'a,b,c,d'
Compress all the blanks (the "replace" bit), then split on the comma, then drop the empty pieces and join with a comma in between:
my_str_2 = ','.join([i for i in my_str.replace(" ", "").split(',') if i])

Iterative find/replace from a list of tuples in Python

I have a list of tuples, each containing a find/replace value that I would like to apply to a string. What would be the most efficient way to do so? I will be applying this iteratively, so performance is my biggest concern.
More concretely, what would the innards of processThis() look like?
x = 'find1, find2, find3'
y = [('find1', 'replace1'), ('find2', 'replace2'), ('find3', 'replace3')]
def processThis(str, lst):
    # Do something here
    return something
>>> processThis(x,y)
'replace1, replace2, replace3'
Thanks, all!
You could consider using re.sub:
import re
REPLACEMENTS = dict([('find1', 'replace1'),
                     ('find2', 'replace2'),
                     ('find3', 'replace3')])

def replacer(m):
    return REPLACEMENTS[m.group(0)]

x = 'find1, find2, find3'
r = re.compile('|'.join(REPLACEMENTS.keys()))
print(r.sub(replacer, x))
A couple notes:
The usual boilerplate about premature optimization applies: benchmark with your own data, find the real bottleneck, and remember that 100 items is small.
There are cases where the different solutions will return different results. if y = [('one', 'two'), ('two', 'three')] and x = 'one' then mhawke's solution gives you 'two' and Unknown's gives 'three'.
Testing this out in a silly contrived example mhawke's solution was a tiny bit faster. It should be easy to try it with your data though.
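To make the ordering pitfall in the second note concrete:
y = [('one', 'two'), ('two', 'three')]
x = 'one'
# sequential str.replace calls: 'one' -> 'two' -> 'three'
# single-pass regex sub:        'one' -> 'two'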
x = 'find1, find2, find3'
y = [('find1', 'replace1'), ('find2', 'replace2'), ('find3', 'replace3')]
def processThis(str, lst):
    for find, replace in lst:
        str = str.replace(find, replace)
    return str
>>> processThis(x,y)
'replace1, replace2, replace3'
Or as a one-liner with reduce:
s = reduce(lambda x, repl: str.replace(x, *repl), lst, s)
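Applied to the question's x and y (note that on Python 3, reduce lives in functools):
>>> from functools import reduce
>>> reduce(lambda acc, repl: acc.replace(*repl), y, x)
'replace1, replace2, replace3'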
Same answer as mhawke's, enclosed in a method str_replace:
def str_replace(data, search_n_replace_dict):
    import re
    REPLACEMENTS = search_n_replace_dict

    def replacer(m):
        return REPLACEMENTS[m.group(0)]

    r = re.compile('|'.join(REPLACEMENTS.keys()))
    return r.sub(replacer, data)
Then we can call this method as in the example below:
s = "abcd abcd efgh efgh;;;;;; lkmnkd kkkkk"
d = dict({ 'abcd' : 'aaaa', 'efgh' : 'eeee', 'mnkd' : 'mmmm' })
print (s)
print ("\n")
print(str_replace(s, d))
output :
abcd abcd efgh efgh;;;;;; lkmnkd kkkkk
aaaa aaaa eeee eeee;;;;;; lkmmmm kkkkk
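One caveat worth flagging for both regex-based answers: the search keys are joined into a pattern verbatim, so keys containing regex metacharacters (., *, (, and so on) would need escaping, e.g.:
r = re.compile('|'.join(map(re.escape, REPLACEMENTS.keys())))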
