Regex: Split characters with "/" - python

I have these strings, for example:
['2300LO/LCE','2302KO/KCE']
I want to have output like this:
['2300LO','2300LCE','2302KO','2302KCE']
How can I do it with Regex in Python?
Thanks!

You can make a simple generator that yields the pairs for each string. Then you can flatten them into a single list with itertools.chain()
import re
from itertools import product, chain

def getCombos(s):
    # Split into the leading digits and the remaining "/"-separated codes
    nums, code = re.match(r'(\d+)(.*)', s).groups()
    for pair in product([nums], code.split("/")):
        yield ''.join(pair)

a = ['2300LO/LCE','2302KO/KCE']
list(chain.from_iterable(map(getCombos, a)))
# ['2300LO', '2300LCE', '2302KO', '2302KCE']
This has the added side benefit of working with strings like '2300LO/LCE/XX/CC', which will give you ['2300LO', '2300LCE', '2300XX', '2300CC',...]
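The same idea can also be sketched as a plain loop without itertools. This is only an illustrative variant: the helper name expand is hypothetical, and passing through strings that do not start with digits is an assumption, not something the question specifies.

```python
import re

def expand(items):
    """Attach the leading digits of each string to every '/'-separated code."""
    result = []
    for s in items:
        match = re.match(r'(\d+)(.*)', s)
        if match is None:
            # Assumption: strings without leading digits are kept unchanged.
            result.append(s)
            continue
        nums, codes = match.groups()
        result.extend(nums + code for code in codes.split('/'))
    return result

print(expand(['2300LO/LCE', '2302KO/KCE']))
# ['2300LO', '2300LCE', '2302KO', '2302KCE']
```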

You can try something like this:
import re

list1 = ['2300LO/LCE','2302KO/KCE']
list2 = []
for x in list1:
    a = x.split('/')
    tmp = re.findall(r'\d+', a[0])  # extracting digits
    list2.append(a[0])
    list2.append(tmp[0] + a[1])
print(list2)

This could be implemented with simple string splits, but since you asked for a regex solution, here is one.
import re

list1 = ['2300LO/LCE','2302KO/KCE']
r = re.compile(r"([0-9]{1,4})([a-zA-Z].*)/([a-zA-Z].*)")
out = []
for s in list1:
    items = r.findall(s)[0]
    out.append(items[0] + items[1])
    out.append(items[0] + items[2])  # prepend the number to the second code as well
print(out)
The regex matches (a number of 1 to 4 digits), followed by (characters starting with a letter), followed by a / and (the rest of the characters, again starting with a letter).
The parts are grouped with (), so that findall returns them as separate elements of a tuple.
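To make the grouping concrete, here is a small sketch of what findall returns when a pattern contains several groups. The pattern is a slightly tightened variant (letters only, checked with fullmatch), which is an assumption about the input format rather than something the question states; split_codes is a hypothetical helper name.

```python
import re

pattern = re.compile(r'([0-9]{1,4})([a-zA-Z]+)/([a-zA-Z]+)')

# With several groups, findall returns one tuple per match.
print(pattern.findall('2300LO/LCE'))
# [('2300', 'LO', 'LCE')]

def split_codes(s):
    """Return both codes with the numeric prefix attached."""
    match = pattern.fullmatch(s)
    if match is None:
        raise ValueError(f'unexpected format: {s!r}')
    number, first, second = match.groups()
    return [number + first, number + second]

print(split_codes('2302KO/KCE'))
# ['2302KO', '2302KCE']
```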

Related

How to extract strings between two markers for each object of a list in python

I have a list of strings, each of which contains the same two markers. I would like to extract the substring between those two markers for each string in the list.
Example:
The markers are 'XXX' and 'YYY', so I want to extract 78665786 and 6866 from:
['XXX78665786YYYjajk', 'XXX6866YYYz6767'....]
You can just loop over your list and grab the substring. You can do something like:
import re
my_list = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
output = []
for item in my_list:
    output.append(re.search('XXX(.*)YYY', item).group(1))
print(output)
Output:
['78665786', '6866']
import re
l = ['XXX78665786YYYjajk', 'XXX6866YYYz6767'....]
l = [re.search(r'XXX(.*)YYY', i).group(1) for i in l]
This should work
Another solution, if the value between the markers is always the first run of digits and you want it as an integer, would be:
import re
test_string=['XXX78665786YYYjajk','XXX78665783336YYYjajk']
int_val=[int(re.search(r'\d+', x).group()) for x in test_string]
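The split() method splits a string into parts. Splitting on the first marker and then on the second leaves the text in between:
list1 = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
list2 = []
for i in list1:
    after_start = i.split("XXX", 1)[1]  # everything after the first marker
    list2.append(after_start.split("YYY", 1)[0])  # everything before the second marker
print(list2)
The extracted values are saved in list2: ['78665786', '6866']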
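One caveat with a pattern like XXX(.*)YYY: .* is greedy, so if the end marker occurs more than once, the match runs to its last occurrence. A hedged sketch of a reusable helper follows; the name between, the third test string, and the choice to return None when the markers are missing are all assumptions made for illustration.

```python
import re

def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape keeps markers such as '.' or '*' from being read as regex syntax;
    # '.*?' is non-greedy, so it stops at the first end marker.
    match = re.search(re.escape(start) + r'(.*?)' + re.escape(end), text)
    return match.group(1) if match else None

my_list = ['XXX78665786YYYjajk', 'XXX6866YYYz6767', 'XXX1YYYmidYYY']
print([between(item, 'XXX', 'YYY') for item in my_list])
# ['78665786', '6866', '1']
```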

How to filter list based on multiple conditions?

I have the following lists:
target_list = ["FOLD/AAA.RST.TXT"]
and
mylist = [
    "FOLD/AAA.RST.12345.TXT",
    "FOLD/BBB.RST.12345.TXT",
    "RUNS/AAA.FGT.12345.TXT",
    "FOLD/AAA.RST.87589.TXT",
    "RUNS/AAA.RST.11111.TXT"
]
How can I filter only those records of mylist that correspond to target_list? The expected result is:
"FOLD/AAA.RST.12345.TXT"
"FOLD/AAA.RST.87589.TXT"
The following mask is considered for filtering mylist
xxx/yyy.zzz.nnn.txt
If xxx, yyy and zzz coincide with target_list, then the record should be selected. Otherwise it should be dropped from the result.
How can I solve this task without using a for loop?
selected_list = []
for t in target_list:
    r1 = t.split("/")[0]
    a1 = t.split("/")[1].split(".")[0]
    b1 = t.split("/")[1].split(".")[1]
    for l in mylist:
        r2 = l.split("/")[0]
        a2 = l.split("/")[1].split(".")[0]
        b2 = l.split("/")[1].split(".")[1]
        if (r1==r2) and (a1==a2) and (b1==b2):
            selected_list.append(l)
You can define a "filter-making function" that preprocesses the target list. The advantages of this are:
Does minimal work by caching information about target_list in a set: The total time is O(N_target_list) + O(N), since set lookups are O(1) on average.
Does not use global variables. Easily testable.
Does not use nested for loops
def prefixes(target):
    """
    >>> prefixes("FOLD/AAA.RST.TXT")
    ('FOLD', 'AAA', 'RST')
    >>> prefixes("FOLD/AAA.RST.12345.TXT")
    ('FOLD', 'AAA', 'RST')
    """
    x, rest = target.split('/')
    y, z, *_ = rest.split('.')
    return x, y, z

def matcher(target_list):
    targets = set(prefixes(target) for target in target_list)
    def is_target(t):
        return prefixes(t) in targets
    return is_target
Then, you could do:
>>> list(filter(matcher(target_list), mylist))
['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
Define a function to filter values:
target_list = ["FOLD/AAA.RST.TXT"]
def keep(path):
    template = get_template(path)
    return template in target_list

def get_template(path):
    front, numbers, ext = path.rsplit('.', 2)
    template = '.'.join([front, ext])
    return template
This uses str.rsplit which searches the string in reverse and splits it on the given character, . in this case. The parameter 2 means it only performs at most two splits. This gives us three parts, the front, the numbers, and the extension:
>>> 'FOLD/AAA.RST.12345.TXT'.rsplit('.', 2)
['FOLD/AAA.RST', '12345', 'TXT']
We assign these to front, numbers and ext.
We then build a string again using str.join:
>>> '.'.join(['FOLD/AAA.RST', 'TXT'])
'FOLD/AAA.RST.TXT'
So this is what get_template returns:
>>> get_template('FOLD/AAA.RST.12345.TXT')
'FOLD/AAA.RST.TXT'
We can use it like so:
mylist = [
    "FOLD/AAA.RST.12345.TXT",
    "FOLD/BBB.RST.12345.TXT",
    "RUNS/AAA.FGT.12345.TXT",
    "FOLD/AAA.RST.87589.TXT",
    "RUNS/AAA.RST.11111.TXT"
]
from pprint import pprint
# In Python 3, filter returns a lazy iterator, so convert it to a list before printing
pprint(list(filter(keep, mylist)))
Output:
['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
You can use regular expressions to define a pattern, and check if your strings match that pattern.
In this case, split the target and insert a \d+ in between the xxx/yyy.zzz. and the .txt part. Use this as the pattern.
The pattern \d+ means one or more digits. The rest of the pattern is built from the literal values of xxx/yyy.zzz and .txt. Since the period has a special meaning in regular expressions, we have to escape it with a \.
import re
selected_list = []
for target in target_list:
    base, ext = target.rsplit(".", 1)
    # Raw strings keep the backslashes intact for the regex engine
    pat = ".".join([base, r"\d+", ext]).replace(".", r"\.")
    selected_list.append([s for s in mylist if re.match(pat, s) is not None])
print(selected_list)
#[['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']]
If the pattern does not match, re.match returns None.
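A variant of the same pattern-building idea can be sketched with re.escape and fullmatch. It assumes, as in the question, that the numeric field sits just before the extension; make_pattern is a hypothetical helper name, not part of the answer above.

```python
import re

def make_pattern(target):
    """Build a compiled regex from a target such as 'FOLD/AAA.RST.TXT'."""
    base, ext = target.rsplit('.', 1)
    # re.escape handles the literal dots (and any other special characters).
    return re.compile(re.escape(base) + r'\.\d+\.' + re.escape(ext))

target_list = ["FOLD/AAA.RST.TXT"]
mylist = [
    "FOLD/AAA.RST.12345.TXT",
    "FOLD/BBB.RST.12345.TXT",
    "RUNS/AAA.FGT.12345.TXT",
    "FOLD/AAA.RST.87589.TXT",
    "RUNS/AAA.RST.11111.TXT",
]

patterns = [make_pattern(t) for t in target_list]
# fullmatch requires the whole string to match, not just a prefix.
selected = [s for s in mylist if any(p.fullmatch(s) for p in patterns)]
print(selected)
# ['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
```

Unlike re.match, fullmatch also rejects strings with trailing text after the extension, which may or may not be what you want.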
Why not use filter with a lambda function:
import re
result = list(filter(lambda item: re.sub(r'\.[0-9]+', '', item) == target_list[0], mylist))
Some comments:
The approach is to exclude the numeric part from the comparison. In the lambda function, each mylist item has its dot-plus-digits part replaced with '' and is then compared against the only item in target_list, target_list[0].
filter keeps all items for which the lambda function returns True.
Wrapping everything in list converts the filter object to a list.

Regular Expressions: Search in list in python3

I have a list of strings.
Consider the code below:
import re
mylist = ["http://abc/12345?abc", "https://abc/abc/2516423120?$abc$"]
r = re.compile(r"(\d{3,})")
result0 = list(filter(r.findall, mylist)) # Note 1
print(result0)
result1 = r.findall(mylist[0])
result2 = r.findall(mylist[1])
print(result1, result2)
The results are:
['http://abc/12345?abc', 'https://abc/abc/2516423120?$abc$']
['12345'] ['2516423120']
Why is there a difference in the results we get?
I'm not sure what you expected filter to do, but what it does here is return an iterator over all elements x of mylist for which bool(r.findall(x)) is True. That is the case whenever r.findall(x) returns a non-empty list, i.e. the regex matches somewhere in the string. Both strings contain a run of at least three digits, so result0 contains the same values as mylist.
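A small sketch can make the truthiness rule visible. It adds a third string with no run of three or more digits (that URL is made up for illustration) and shows one possible way to collect the matched numbers instead of the original strings.

```python
import re

mylist = [
    "http://abc/12345?abc",
    "https://abc/abc/2516423120?$abc$",
    "http://abc/12?abc",  # hypothetical entry with no 3+ digit run
]
r = re.compile(r"(\d{3,})")

# filter keeps x when r.findall(x) is a non-empty (truthy) list.
kept = list(filter(r.findall, mylist))
print(kept)
# ['http://abc/12345?abc', 'https://abc/abc/2516423120?$abc$']

# To get the numbers themselves, call findall on each string and flatten.
numbers = [m for s in mylist for m in r.findall(s)]
print(numbers)
# ['12345', '2516423120']
```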

Splitting string into two by comma using python

I have the following data in a list; it is a hex number:
['aaaaa955554e']
I would like to split this into ['aaaaa9,55554e'] with a comma.
I know how to split a string when there is a delimiter in between, but how should I do it in this case?
Thanks
This will do what I think you are looking for:
yourlist = ['aaaaa955554e']
new_list = [','.join([x[i:i+6] for i in range(0, len(x), 6)]) for x in yourlist]
It will put a comma at every sixth character in each item in your list. (I am assuming you will have more than just one item in the list, and that the items are of unknown length. Not that it matters.)
I assume you want to split the string after every sixth character.
Using regex:
import re
lst = ['aaaaa955554e']
newlst = re.findall(r'\w{6}', lst[0])
# ['aaaaa9', '55554e']
Using a list comprehension, which also works for multiple items in lst:
lst = ['aaaaa955554e']
newlst = [item[i:i+6] for item in lst for i in range(0, len(item), 6)]
# ['aaaaa9', '55554e']
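Slicing and \w{6} behave differently when the length is not a multiple of six: slicing keeps the shorter final chunk, while \w{6} silently drops it. A brief sketch with a made-up 14-character value follows; \w{1,6} is shown as one possible way to keep the remainder with a regex.

```python
import re

value = 'aaaaa955554eff'  # 14 characters, hypothetical input

# Slicing keeps the trailing partial chunk.
by_slicing = [value[i:i+6] for i in range(0, len(value), 6)]
print(by_slicing)
# ['aaaaa9', '55554e', 'ff']

# \w{6} only matches complete groups of six.
exact_six = re.findall(r'\w{6}', value)
print(exact_six)
# ['aaaaa9', '55554e']

# \w{1,6} is greedy, so it takes six where possible and the rest at the end.
up_to_six = re.findall(r'\w{1,6}', value)
print(up_to_six)
# ['aaaaa9', '55554e', 'ff']

print(','.join(by_slicing))
# aaaaa9,55554e,ff
```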
This could be done using a regular expression substitution as follows:
import re
print(re.sub(r'([a-zA-Z]+\d)(.*?)', r'\1,\2', 'aaaaa955554e', count=1))
Giving you:
aaaaa9,55554e
This splits after seeing the first digit.

Append two integers to list when separated by '..' Python

If I have a list of strings:
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
how would I append the numbers flanking the .. to their corresponding lists, to get:
first = [1,34,407]
last = [23,405,4032]
I wouldn't mind strings either, because I can convert to int later:
first = ['1','34','407']
last = ['23','405','4032']
Use re.search to match the numbers between .. and store them in two different groups:
import re
first = []
last = []
for s in my_list:
    match = re.search(r'(\d+)\.\.(\d+)', s)
    first.append(match.group(1))
    last.append(match.group(2))
I'd use a regular expression:
import re
num_range = re.compile(r'(\d+)\.\.(\d+)')
first = []
last = []
my_list = [' abc 1..23',' bcd 34..405','cda 407..4032']
for entry in my_list:
    match = num_range.search(entry)
    if match is not None:
        f, l = match.groups()
        first.append(int(f))
        last.append(int(l))
This outputs integers:
>>> first
[1, 34, 407]
>>> last
[23, 405, 4032]
One more solution.
for string in my_list:
    numbers = string.split(" ")[-1]
    first_num, last_num = numbers.split("..")
    first.append(first_num)
    last.append(last_num)
It will throw a ValueError if the part after the last space does not contain exactly one ".." (for example, when there is no ".." after the last space, or more than one).
In fact, this is a good thing if you want to be sure that values were really obtained from all the strings, and that all of them were placed after the last space. You can even add a try…except block to do something in case the string it tries to process is in an unexpected format.
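A minimal sketch of that error handling might look like the following. Skipping malformed entries and collecting them in a separate list is an assumption about the desired behaviour, and parse_ranges is a hypothetical helper name.

```python
def parse_ranges(strings):
    """Split 'label first..last' strings into two lists of ints, skipping bad entries."""
    first, last, skipped = [], [], []
    for s in strings:
        try:
            numbers = s.split(" ")[-1]
            # Unpacking raises ValueError unless there is exactly one '..'
            first_num, last_num = numbers.split("..")
            # int() raises ValueError if a part is not numeric
            first_value, last_value = int(first_num), int(last_num)
        except ValueError:
            skipped.append(s)
            continue
        first.append(first_value)
        last.append(last_value)
    return first, last, skipped

first, last, skipped = parse_ranges(
    [' abc 1..23', ' bcd 34..405', 'cda 407..4032', 'bad entry']
)
print(first, last, skipped)
# [1, 34, 407] [23, 405, 4032] ['bad entry']
```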
first=[(i.split()[1]).split("..")[0] for i in my_list]
second=[(i.split()[1]).split("..")[1] for i in my_list]
