Can I create list from regular expressions? - python

I'm making a crawler.
The user can specify a regular expression string to select the data to download.
When the user's input is:
http://xxx/abc[x-z]/image(9|10|11).png
I want to download these:
http://xxx/abcx/image9.png
http://xxx/abcy/image9.png
http://xxx/abcz/image9.png
http://xxx/abcx/image10.png
http://xxx/abcy/image10.png
http://xxx/abcz/image10.png
http://xxx/abcx/image11.png
http://xxx/abcy/image11.png
http://xxx/abcz/image11.png
Can I create the list above from that regular expression string? Or can I iterate over each matching string in a for-in loop?

If you want to take a user's regex as input and generate a list of the strings it matches, you can use the sre_yield library.
However, be very aware that trying to parse every possible string of a regex can get out of hand very quickly. You'll need to be sure that your users are aware of the implications that wildcard characters and open ended or repeating groups can have on the number of possible matching strings.
As an example, your regex string: http://xxx/abc[x-z]/image(9|10|11).png does not escape the ., which is a wildcard for any character, so it will generate a lot of unexpected strings. Instead we'll need to escape it as seen in the example below:
>>> import sre_yield
>>> links = []
>>> for each in sre_yield.AllStrings(r'http://xxx/abc[x-z]/image(9|10|11)\.png'):
...     links.append(each)
Or more simply links = list(sre_yield.AllStrings(r'http://xxx/abc[x-z]/image(9|10|11)\.png'))
The result is:
>>> links
['http://xxx/abcx/image9.png', 'http://xxx/abcy/image9.png',
'http://xxx/abcz/image9.png', 'http://xxx/abcx/image10.png',
'http://xxx/abcy/image10.png', 'http://xxx/abcz/image10.png',
'http://xxx/abcx/image11.png', 'http://xxx/abcy/image11.png',
'http://xxx/abcz/image11.png']
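If a third-party library is not an option, a restricted template syntax can be expanded with the standard library alone. The sketch below is not sre_yield and not a regex engine: it assumes the user only writes literal text, single-range classes such as [x-z], and flat alternations such as (9|10|11). The names expand_template and limit are hypothetical, and the size check is just one possible guard against the blow-up described above.

```python
import re
from itertools import product
from math import prod  # Python 3.8+

# Matches either a single range like [x-z] or a flat alternation like (9|10|11).
TOKEN = re.compile(r"\[(.)-(.)\]|\(([^()]*)\)")

def expand_template(template, limit=1000):
    """Expand a restricted URL template into the list of strings it describes.

    Hypothetical helper: anything that is not a [a-b] range or a (x|y) group
    is treated as literal text, so '.' here is a plain dot, not a wildcard.
    """
    choices = []
    pos = 0
    for m in TOKEN.finditer(template):
        if m.start() > pos:
            choices.append([template[pos:m.start()]])  # literal chunk
        if m.group(1) is not None:
            lo, hi = ord(m.group(1)), ord(m.group(2))
            choices.append([chr(c) for c in range(lo, hi + 1)])
        else:
            choices.append(m.group(3).split("|"))
        pos = m.end()
    if pos < len(template):
        choices.append([template[pos:]])  # trailing literal chunk

    # Count the combinations before building any of them.
    total = prod(len(c) for c in choices)
    if total > limit:
        raise ValueError("template expands to %d strings (limit %d)" % (total, limit))
    return ["".join(parts) for parts in product(*choices)]

for url in expand_template("http://xxx/abc[x-z]/image(9|10|11).png"):
    print(url)
```

The URLs come out grouped by directory (abcx first, then abcy, then abcz), which is a different order from the list in the question but the same set of nine strings.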

You can use product() from the standard-library itertools module:
from itertools import product
for x, y in product(['x', 'y', 'z'], range(9, 12)):
    print('http://xxx/abc{}/image{}.png'.format(x, y))
To build your list you can use a comprehension:
links = ['http://xxx/abc{}/image{}.png'.format(x, y) for x, y in product(['x', 'y', 'z'], range(9, 12))]

A simple alternative to the previous answers:
lst = ['http://xxx/abc%s/image%s.png'%(x,y) for x, y in [(j,i) for i in (9,10,11) for j in ('x', 'y', 'z')]]
I omitted range and the format function, expecting quicker performance.
Analysis: I compared my approach with the one posted by Jkdc.
I ran both about 100000 times, and the means show that the itertools approach is actually faster in terms of execution time:
from itertools import product
import time
import numpy as np

prodct = []
native = []

def test():
    # itertools.product version
    # (note: range(9, 11) yields only 9 and 10, so this builds 6 URLs, not 9)
    start = time.perf_counter()
    lst = ['http://xxx/abc{}/image{}'.format(x, y) for x, y in product(('x', 'y', 'z'), range(9, 11))]
    end = time.perf_counter()
    print('{0:.50f}'.format(end - start))
    prodct.append(end - start)
    # nested-comprehension version (9 URLs)
    start1 = time.perf_counter()
    lst = ['http://xxx/abc%s/image%s'%(x,y) for x, y in [(j,i) for i in (9,10,11) for j in ('x', 'y', 'z')]]
    end1 = time.perf_counter()
    print('{0:.50f}'.format(end1 - start1))
    native.append(end1 - start1)

for i in range(1, 100000):
    test()

# scale the per-call times by 100000 so the means are readable
y = np.array(native) * 100000
x = np.array(prodct) * 100000
print(np.mean(y))
print(np.mean(x))
The results for the native (no module) version and the itertools product version are below:
for native 2.1831179834
for itertools-product 1.60410432562
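Timing single calls with a wall clock is noisy, and the two snippets timed above do not build the same list (range(9, 11) stops at 10, so the product version creates 6 strings against 9). A like-for-like comparison could be sketched with the standard-library timeit module as follows. The function names are made up for illustration, and no timings are quoted here because they vary by machine and Python version.

```python
import timeit
from itertools import product

def with_product():
    # itertools.product over the same 3 x 3 inputs
    return ['http://xxx/abc{}/image{}.png'.format(x, y)
            for x, y in product(('x', 'y', 'z'), range(9, 12))]

def with_nested_loops():
    # plain nested comprehension with %-formatting
    return ['http://xxx/abc%s/image%s.png' % (x, y)
            for y in (9, 10, 11) for x in ('x', 'y', 'z')]

if __name__ == "__main__":
    for func in (with_product, with_nested_loops):
        # best of 5 runs, each run calling the function 100000 times
        best = min(timeit.repeat(func, number=100000, repeat=5))
        print('{}: {:.4f} s per 100000 calls'.format(func.__name__, best))
```

Taking the minimum of several repeats is the usual way to reduce interference from other processes; both functions return the same nine URLs, only in a different order.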

Related

How to extract each word consecutive to its own previous number in a string and sorting the result in Python

Input : x3b4U5i2
Output : bbbbiiUUUUUxxx
How can I solve this problem in Python? I have to print each letter as many times as the number that follows it, and then sort the result.
It wasn't clear if multiple digit counts or groups of letters should be handled. Here's a solution that does all of that:
import re
def main(inp):
    parts = re.split(r"(\d+)", inp)
    parts_map = {parts[i]:int(parts[i+1]) for i in range(0, len(parts)-1, 2)}
    print(''.join([c*parts_map[c] for c in sorted(parts_map.keys(),key=str.lower)]))
main("x3b4U5i2")
main("x3brx4U5i2")
main("x23b4U35i2")
Result:
bbbbiiUUUUUxxx
brxbrxbrxbrxiiUUUUUxxx
bbbbiiUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUxxxxxxxxxxxxxxxxxxxxxxx
I'm assuming the formatting will always be <char><int> with <int> being in between 1 and 9...
input_ = "x3b4U5i2"
result_list = [input_[i]*int(input_[i+1]) for i in range(0, len(input_), 2)]
result_list.sort(key=str.lower)
result = ''.join(result_list)
There's probably a much more performance-oriented approach to solving this, it's just the first solution that came into my limited mind.
Edit
After the feedback in the comments I tried to improve performance by sorting first, but the following implementation is actually slower:
input_ = "x3b4U5i2"
def sort_first(value):
    return value[0].lower()
tuple_construct = [(input_[i], int(input_[i+1])) for i in range(0, len(input_), 2)]
tuple_construct.sort(key=sort_first)
result = ''.join([tc[0] * tc[1] for tc in tuple_construct])
Execution time for 100,000 iterations on it:
1) The execution time is: 0.353036
2) The execution time is: 0.4361724
One option, extract the character/digit(s) pairs with a regex, sort them by letter (ignoring case), multiply the letter by the number of repeats, join:
s = 'x3b4U5i2'
import re
out = ''.join([c*int(i) for c,i in
               sorted(re.findall(r'(\D)(\d+)', s),
                      key=lambda x: x[0].casefold())
              ])
print(out)
Output: bbbbiiUUUUUxxx
If you want to handle groups of several characters you can use r'(\D+)(\d+)'
No list comprehensions or generator expressions in sight. Just using re.sub with a lambda to expand the run-length encoding, then sorting the result, and then joining it back into a string.
import re
s = "x3b4U5i2"
''.join(sorted(re.sub(r"(\D+)(\d+)",
                      lambda m: m.group(1)*int(m.group(2)),
                      s),
               key=lambda x: x[0].casefold()))
# 'bbbbiiUUUUUxxx'
If we use re.findall to extract a list of pairs of strings and multipliers:
import re
s = 'x3b4U5i2'
pairs = re.findall(r"(\D+)(\d+)", s)
Then we can use some functional style to sort that list before expanding it.
from operator import itemgetter
def compose(f, g):
    return lambda x: f(g(x))
sorted(pairs, key=compose(str.lower, itemgetter(0)))
# [('b', '4'), ('i', '2'), ('U', '5'), ('x', '3')]
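The answer above stops at the sorted pairs. One way to finish the job, sketched here with a hypothetical function name expand_sorted, is to multiply each text chunk by its count after sorting and join the pieces:

```python
import re

def expand_sorted(s):
    """Sort the (text, count) pairs case-insensitively, then expand them."""
    pairs = re.findall(r"(\D+)(\d+)", s)
    ordered = sorted(pairs, key=lambda pair: pair[0].lower())
    return "".join(text * int(count) for text, count in ordered)

print(expand_sorted("x3b4U5i2"))  # bbbbiiUUUUUxxx
```

Sorting the short list of pairs before expanding means the sort works on four items rather than on every character of the expanded string.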

Python package for converting finite regex to a text array? [duplicate]

Suppose I have the following string:
trend = '(A|B|C)_STRING'
I want to expand this to:
A_STRING
B_STRING
C_STRING
The OR condition can be anywhere in the string. For example, STRING_(A|B)_STRING_(C|D)
would expand to
STRING_A_STRING_C
STRING_B_STRING_C
STRING_A_STRING_D
STRING_B_STRING_D
I also want to cover the case of an empty conditional:
(|A_)STRING would expand to:
A_STRING
STRING
Here's what I've tried so far:
def expandOr(trend):
    parenBegin = trend.index('(') + 1
    parenEnd = trend.index(')')
    orExpression = trend[parenBegin:parenEnd]
    originalTrend = trend[0:parenBegin - 1]
    expandedOrList = []
    for oe in orExpression.split("|"):
        expandedOrList.append(originalTrend + oe)
But this is obviously not working.
Is there any easy way to do this using regex?
Here's a pretty clean way. You'll have fun figuring out how it works :-)
def expander(s):
    import re
    from itertools import product
    pat = r"\(([^)]*)\)"
    pieces = re.split(pat, s)
    pieces = [piece.split("|") for piece in pieces]
    for p in product(*pieces):
        yield "".join(p)
Then:
for s in ('(A|B|C)_STRING',
          '(|A_)STRING',
          'STRING_(A|B)_STRING_(C|D)'):
    print(s, "->")
    for t in expander(s):
        print(" ", t)
displays:
(A|B|C)_STRING ->
A_STRING
B_STRING
C_STRING
(|A_)STRING ->
STRING
A_STRING
STRING_(A|B)_STRING_(C|D) ->
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
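For readers who would rather not puzzle it out, here is a step-by-step sketch of what the expander does, printing each intermediate value. It relies on the documented behaviour of re.split: when the pattern contains a capturing group, the captured text is kept in the result list.

```python
import re
from itertools import product

s = "STRING_(A|B)_STRING_(C|D)"

# Literal chunks and the contents of each (...) group alternate in the result.
pieces = re.split(r"\(([^)]*)\)", s)
print(pieces)    # ['STRING_', 'A|B', '_STRING_', 'C|D', '']

# Literal chunks become one-element lists; group contents become the options.
choices = [piece.split("|") for piece in pieces]
print(choices)   # [['STRING_'], ['A', 'B'], ['_STRING_'], ['C', 'D'], ['']]

# The Cartesian product picks one option per position; joining gives a string.
expanded = ["".join(combo) for combo in product(*choices)]
print(expanded)
```

Two limitations follow from this: a literal | outside parentheses would also be split, and nested parentheses are not handled.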
import exrex
trend = '(A|B|C)_STRING'
trend2 = 'STRING_(A|B)_STRING_(C|D)'
>>> list(exrex.generate(trend))
[u'A_STRING', u'B_STRING', u'C_STRING']
>>> list(exrex.generate(trend2))
[u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']
I would do this to extract the groups:
def extract_groups(trend):
    l_parens = [i for i,c in enumerate(trend) if c == '(']
    r_parens = [i for i,c in enumerate(trend) if c == ')']
    assert len(l_parens) == len(r_parens)
    return [trend[l+1:r].split('|') for l,r in zip(l_parens,r_parens)]
And then you can evaluate the product of those extracted groups using itertools.product:
expr = 'STRING_(A|B)_STRING_(C|D)'
from itertools import product
list(product(*extract_groups(expr)))
Out[92]: [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Now it's just a question of splicing those back onto your original expression. I'll use re for that :)
# Python 3.3+
import re
def _gen(it):
    yield from it
p = re.compile(r'\(.*?\)')
for tup in product(*extract_groups(expr)):
    gen = _gen(tup)
    print(p.sub(lambda x: next(gen), expr))
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
There's probably a more readable way to get re.sub to sequentially substitute things from an iterable, but this is what came off the top of my head.
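One possibly more readable variant, offered as a sketch rather than as the answer's own method: a plain iter() over each combination gives re.sub a replacement per match without the helper generator. The name expand is hypothetical.

```python
import re
from itertools import product

GROUP = re.compile(r"\(([^)]*)\)")

def expand(expr):
    """Expand every flat (a|b|...) group in expr into all combinations."""
    groups = [body.split("|") for body in GROUP.findall(expr)]
    results = []
    for combo in product(*groups):
        replacements = iter(combo)
        # re.sub calls the function once per match, left to right,
        # so each group receives the next item of the combination.
        results.append(GROUP.sub(lambda m: next(replacements), expr))
    return results

print(expand("STRING_(A|B)_STRING_(C|D)"))
print(expand("(|A_)STRING"))
```

Because the same compiled pattern is used to find the groups and to substitute them, the number of replacements always matches the number of groups.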
This is easy to achieve with the sre_yield module:
>>> import sre_yield
>>> trend = '(A|B|C)_STRING'
>>> strings = list(sre_yield.AllStrings(trend))
>>> print(strings)
['A_STRING', 'B_STRING', 'C_STRING']
The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently... It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though -- these are cases that sre_parse did not optimize.
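If duplicates do appear, they can be dropped while keeping the first-seen order. The sketch below uses only a plain dict (insertion-ordered since Python 3.7) and a made-up sample list in place of real sre_yield output; unique_in_order is a hypothetical helper name.

```python
def unique_in_order(strings):
    """Remove duplicates from any iterable of strings, keeping first-seen order."""
    return list(dict.fromkeys(strings))

# Made-up sample standing in for generated strings that contain a repeat.
candidates = ['A_STRING', 'B_STRING', 'A_STRING', 'C_STRING']
print(unique_in_order(candidates))  # ['A_STRING', 'B_STRING', 'C_STRING']
```

This materializes every string in memory, so it only suits patterns whose expansion is known to be small.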

Regex: Split characters with "/"

I have these strings, for example:
['2300LO/LCE','2302KO/KCE']
I want to have output like this:
['2300LO','2300LCE','2302KO','2302KCE']
How can I do it with Regex in Python?
Thanks!
You can make a simple generator that yields the pairs for each string. Then you can flatten them into a single list with itertools.chain()
import re
from itertools import product, chain
def getCombos(s):
    nums, code = re.match(r'(\d+)(.*)', s).groups()
    for pair in product([nums], code.split("/")):
        yield ''.join(pair)
a = ['2300LO/LCE','2302KO/KCE']
list(chain.from_iterable(map(getCombos, a)))
# ['2300LO', '2300LCE', '2302KO', '2302KCE']
This has the added side benefit of working with strings like '2300LO/LCE/XX/CC' which will give you ['2300LO', '2300LCE', '2300XX', '2300CC',...]
You can try something like this:
import re
list1 = ['2300LO/LCE','2302KO/KCE']
list2 = []
for x in list1:
    a = x.split('/')
    tmp = re.findall(r'\d+', a[0]) # extracting digits
    list2.append(a[0])
    list2.append(tmp[0] + a[1])
print(list2)
This can be implemented with simple string splits.
Since you asked the output with regex, here is your answer.
list1 = ['2300LO/LCE','2302KO/KCE']
import re
r = re.compile(r"([0-9]{1,4})([a-zA-Z].*)/([a-zA-Z].*)")
out = []
for s in list1:
    items = r.findall(s)[0]
    out.append(items[0]+items[1])
    out.append(items[0]+items[2])  # prefix the number to the second code as well
print(out)
Explanation of the regex: (a number of one to four digits), followed by (a letter and any further characters), followed by a / and (a letter and the rest of the characters).
The parts are grouped with () so that findall returns them as separate elements of a tuple.
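As the answer notes, plain string operations are enough here. A minimal sketch without re, assuming every item starts with its numeric prefix and separates the codes with "/"; the name split_codes is hypothetical:

```python
from itertools import takewhile

def split_codes(item):
    """'2300LO/LCE' -> ['2300LO', '2300LCE'] using only string methods."""
    # Collect the leading digits, then attach them to every code after them.
    prefix = "".join(takewhile(str.isdigit, item))
    return [prefix + code for code in item[len(prefix):].split("/")]

items = ['2300LO/LCE', '2302KO/KCE']
flat = [code for item in items for code in split_codes(item)]
print(flat)  # ['2300LO', '2300LCE', '2302KO', '2302KCE']
```

This also copes with more than two codes per item, such as '2300LO/LCE/XX/CC'.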

pattern match get list and dict from string

I have the string below, and I want to extract the lists, dicts, and plain variables from it.
How can I split this string into that format?
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
import re
m1 = re.findall (r'(?=.*,)(.*?=\[.+?\],?)',s)
for i in m1 :
    print('m1:',i)
Only the first result is correct.
Does anyone know how to do this?
m1: list_c=[1,2],
m1: a=3,b=1.3,c=abch,list_a=[1,2],
Split on '=' instead; then you can work out each variable name and its value.
You still need to handle type casting for the values (a regex, split, or try/except around a cast may help).
Also, as others have commented, a dict may be easier to work with.
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}'
al = s.split('=')
var_l = [al[0]]
value_l = []
for a in al[1:-1]:
    var_l.append(a.split(',')[-1])
    value_l.append(','.join(a.split(',')[:-1]))
value_l.append(al[-1])
output = dict(zip(var_l, value_l))
print(output)
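The type casting mentioned above could be sketched like this. It is an illustration under assumptions, not a full parser: values are taken to be ints, floats, flat [..] lists, flat {key:value} dicts, or bare words, with no nesting and no commas inside items. The name cast_value is hypothetical.

```python
def cast_value(text):
    """Convert one right-hand-side string to int, float, list, dict, or str."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1]
        return [cast_value(part) for part in inner.split(",")] if inner else []
    if text.startswith("{") and text.endswith("}"):
        inner = text[1:-1]
        if not inner:
            return {}
        # Keys stay strings; only the values are cast.
        return {key: cast_value(value)
                for key, value in (item.split(":", 1) for item in inner.split(","))}
    return text  # bare identifier such as 'abch'

raw = {'list_c': '[1,2]', 'a': '3', 'b': '1.3', 'c': 'abch', 'dict_a': '{a:2,b:3}'}
print({name: cast_value(value) for name, value in raw.items()})
```

Trying int before float keeps '3' as an integer while '1.3' still becomes a float.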
You may have better luck if you more or less explicitly describe the right-hand side expressions: numbers, lists, dictionaries, and identifiers:
re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
# [('list_c', '[1,2]'), ('a', '3'), ('b', '1.3'), ('c', 'abch'),
# ('list_a', '[1,2]'), ('dict_a', '{a:2,b:3}')]
The full answer looks like this:
import re
from pprint import pprint
s = 'list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1],Save,Record,dict_a={a:2,b:3}'
m1 = re.findall(r"([^=]+)=" # LHS and assignment operator
+r"([+-]?\d+(?:\.\d+)?|" # Numbers
+r"[+-]?\d+\.|" # More numbers
+r"\[[^]]+\]|" # Lists
+r"{[^}]+}|" # Dictionaries
+r"[a-zA-Z_][a-zA-Z_\d]*)", # Idents
s)
temp_d = {}
for i,j in m1:
    temp = i.strip(',').split(',')
    if len(temp)>1:
        for k in temp[:-1]:
            temp_d[k]=''
        temp_d[temp[-1]] = j
    else:
        temp_d[temp[0]] = j
pprint(temp_d)
The output is:
{'Record': '',
'Save': '',
'a': '3',
'b': '1.3',
'c': 'abch',
'dict_a': '{a:2,b:3}',
'list_a': '[1]',
'list_c': '[1,2]'}
Instead of picking out the types, you can start by capturing the identifiers. Here's a regex that captures all the identifiers in the string (for lowercase only, but see note):
regex = re.compile(r'([a-z]|_)+=')
#note if you want all valid variable name characters: r'([a-z]|[A-Z]|[0-9]|_)+='
cases = [x.group() for x in re.finditer(regex, s)]
This gives a list of all the identifiers in the string:
['list_c=', 'a=', 'b=', 'c=', 'list_a=', 'dict_a=']
We can now define a function that uses the above list to chop up s one identifier at a time:
def chop(mystr, mylist):
    temp = mystr.partition(mylist[0])[2]
    cut = temp.find(mylist[1]) #strip leading bits
    return mystr.partition(mylist[0])[2][cut:], mylist[1:]
mystr = s[:]
temp = [mystr]
mylist = cases[:]
while len(mylist) > 1:
    mystr, mylist = chop(mystr, mylist)
    temp.append(mystr)
This (convoluted) slicing operation gives this list of strings:
['list_c=[1,2],a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'a=3,b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'b=1.3,c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'c=abch,list_a=[1,2],dict_a={a:2,b:3}',
'list_a=[1,2],dict_a={a:2,b:3}',
'dict_a={a:2,b:3}']
Now cut off the ends using each successive entry:
result = []
for x in range(len(temp) - 1):
    cut = temp[x].find(temp[x+1]) - 1 #-1 to remove commas
    result.append(temp[x][:cut])
result.append(temp.pop()) #get the last item
Now we have the full list:
['list_c=[1,2]', 'a=3', 'b=1.3', 'c=abch', 'list_a=[1,2]', 'dict_a={a:2,b:3}']
Each element is easily parsable into key:value pairs (and is also executable via exec).
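A sketch of the key:value step, using the list shown above as a literal. Splitting on the first "=" only keeps any later "=" inside a value intact. Note that exec is only safe on trusted input, and an item like c=abch would raise NameError under exec unless abch is already defined, so building a dict of strings is the more cautious route.

```python
result = ['list_c=[1,2]', 'a=3', 'b=1.3', 'c=abch', 'list_a=[1,2]', 'dict_a={a:2,b:3}']

# Each item splits into exactly [name, value]; dict() accepts such pairs.
pairs = dict(item.split('=', 1) for item in result)
print(pairs['list_c'])  # [1,2]
print(pairs['dict_a'])  # {a:2,b:3}
```

The values are still strings at this point; converting them to numbers or containers is a separate step.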

python split a vector format string

I have a string input in the following format: (x,y) where x and y are doubles.
For example : (1,2.556) can be a vector.
I want the easiest way to split it into the x,y values, 1 and 2.556 in this case.
What would you suggest?
You could use code like this:
import ast
text = '(1,2.556)'
vector = ast.literal_eval(text)
print(vector)
The literal_eval function does not carry the security risks associated with eval, because it only accepts Python literals, and it works just as well in this particular case.
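A small sketch of how this might be wrapped, with a hypothetical helper name parse_vector. It checks that the parsed value really is a pair, and the last lines show that literal_eval rejects an expression that eval would have executed.

```python
import ast

def parse_vector(text):
    """Parse a string like '(1,2.556)' into a pair of floats."""
    value = ast.literal_eval(text)
    if not (isinstance(value, tuple) and len(value) == 2):
        raise ValueError("expected a vector of the form (x,y): %r" % (text,))
    x, y = value
    return float(x), float(y)

print(parse_vector('(1,2.556)'))  # (1.0, 2.556)

# A function call is not a literal, so it is refused instead of being run.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as exc:
    print("rejected:", exc)
```

literal_eval also tolerates spaces inside the parentheses, which the slicing-based approaches would need to handle separately.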
The eval answers are good. But if you are sure of the format of your strings -- always start and end with parentheses, no spaces in the string, etc., then you can do this fairly efficiently:
x, y = (float(num) for num in s[1:-1].split(','))
eval works:
>>> s = "(1.2,3.40)"
>>> eval(s)
(1.2, 3.4)
>>> x,y = eval(s)
>>> x
1.2
>>> y
3.4
eval has potential security risks, but if you trust that you are dealing with strings of that form then this is adequate.
Remove the leading ( and the trailing ), then split on the comma.
re.sub(r'^\(|\)$', '',string).split(',')
OR
>>> s = "(1,2.556)"
>>> x = [i for i in re.split(r'[,()]', s) if i]
>>> x[0]
'1'
>>> x[1]
'2.556'
If you're sure they'll be passed in exactly this way, try this:
>>> s = '(1,2.556)'
>>> [float(i) for i in s[1:-1].split(',')]
[1.0, 2.556]
