Python string split by multiple delimiters

Given a string: s = FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE
The delimiting characters are P, Q, D and E.
I want to split the string on these characters.
Based on: Is it possible to split a string on multiple delimiters in order?
I have the following
def splits(s,seps):
    l,_,r = s.partition(seps[0])
    if len(seps) == 1:
        return [l,r]
    return [l] + splits(r,seps[1:])
seps = ['P', 'D', 'Q', 'E']
sequences = splits(s, seps)
This gives me:
['FFFFRRFFFFFFF',
'PRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLF',
'RRFRRFFFFFFFFR',
'',
'E']
As we can see, the second entry contains many P characters.
What I want is the run of characters after the last P, not after its first occurrence (i.e., RFFFFFFFLF).
Also, the order in which the delimiting characters occur is not fixed.
Any solutions or hints on how to achieve this?
Update: The desired output is every string between these delimiters (similar to the list shown above), but splitting at the last occurrence of each delimiter as described.
Update2: Expected output
['FFFFRRFFFFFFF',
'RFFFFFFFLF', # << this is where the output differs
'RRFRRFFFFFFFFR',
'',
''] # << the last E is 2 consecutive E with no other letters, hence should be empty

It sounds like you want to split on the whole span from the first appearance of a delimiter character to its last appearance.
([PDQE])(?:.*\1)?
([PDQE]) captures one of the characters in the class.
(?:.*\1)? optionally matches any number of characters up to the last occurrence of the captured character.
Try the split pattern at regex101 and in a PHP demo at 3v4l.org (it should behave similarly in Python).
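For reference, the same pattern can be tried directly in Python. This is only a minimal sketch and not part of the linked demos: the helper name split_first_to_last is made up for illustration. Because the pattern contains a capturing group, re.split also returns the captured delimiter between the pieces, so the sketch keeps every second item.

```python
import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

def split_first_to_last(text, delimiters="PDQE"):
    # Hypothetical helper: split on the span from a delimiter's first
    # appearance up to its last appearance (if it appears again).
    pattern = r'([' + re.escape(delimiters) + r'])(?:.*\1)?'
    parts = re.split(pattern, text)
    # re.split interleaves the captured delimiter with the pieces,
    # so the pieces sit at the even indices.
    return parts[::2]

sequences = split_first_to_last(s)
print(sequences)
# ['FFFFRRFFFFFFF', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '', '']
```

For the string in the question this gives the list shown under "Expected output", including the two empty strings around the trailing EE.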

import re
s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"
def get_sequences(s):
    # Maps each delimiter to (text between the previous delimiter and its first occurrence, order of first appearance)
    seen_delimiters = {c: ('', None) for c in 'PDQE'}
    order = 0
    for g in re.finditer(r'(.*?)([PDQE]|\Z)', s):
        if g[2]:
            if seen_delimiters[g[2][0]][1] is None:
                seen_delimiters[g[2][0]] = (g[1], order)
                order += 1
    return seen_delimiters
for k, (seq, order) in get_sequences(s).items():
    print('{}: order: {} seq: {}'.format(k, order, seq))
Prints:
P: order: 0 seq: FFFFRRFFFFFFF
D: order: 1 seq: RFFFFFFFLF
Q: order: 2 seq: RRFRRFFFFFFFFR
E: order: 3 seq:
Update (to print each sequence together with the delimiters that follow it):
import re
s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"
for g in re.finditer(r'(.*?)([PDQE]+|\Z)', s):
    print(g[1], g[2])
Prints:
FFFFRRFFFFFFF PP
RRRRRRLLRLLRLLL PP
F PP
L PP
L PP
LF PP
FF P
FLR P
FFRRLLR P
F P
RFFFFFFFLF D
RRFRRFFFFFFFFR QEE

Use re.split with a character class [PQDE]:
import re
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
sequences = re.split(r'[PQDE]', s)
print(sequences)
Output:
['FFFFRRFFFFFFF', '', 'RRRRRRLLRLLRLLL', '', 'F', '', 'L', '', 'L', '', 'LF', '', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '', '', '']
If you want to split on one or more consecutive delimiters:
import re
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
sequences = re.split(r'[PQDE]+', s)
print(sequences)
Output:
['FFFFRRFFFFFFF', 'RRRRRRLLRLLRLLL', 'F', 'L', 'L', 'LF', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '']
If you want to capture the delimiters:
import re
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
sequences = re.split(r'([PQDE])', s)
print(sequences)
Output:
['FFFFRRFFFFFFF', 'P', '', 'P', 'RRRRRRLLRLLRLLL', 'P', '', 'P', 'F', 'P', '', 'P', 'L', 'P', '', 'P', 'L', 'P', '', 'P', 'LF', 'P', '', 'P', 'FF', 'P', 'FLR', 'P', 'FFRRLLR', 'P', 'F', 'P', 'RFFFFFFFLF', 'D', 'RRFRRFFFFFFFFR', 'Q', '', 'E', '', 'E', '']

This solution iterates over the delimiters one by one, so you can control the order in which each of them is applied:
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
spliters='PDQE'
for sp in spliters:
    if type(s) is str:
        s = s.split(sp)
    else: #type is list
        s=[x.split(sp) for x in s]
        s = [item for sublist in s for item in sublist if item != ''] #flatten the list
print(s)
output:
['FFFFRRFFFFFFF',
'RRRRRRLLRLLRLLL',
'F',
'L',
'L',
'LF',
'FF',
'FLR',
'FFRRLLR',
'F',
'RFFFFFFFLF',
'RRFRRFFFFFFFFR']

Related

Python ignore punctuation and white space

string = "Python, program!"
result = []
for x in string:
    if x not in result:
        result.append(x)
print(result)
This program ensures that a character repeated in the string appears only once in the list. In this case, the string "Python, program!" produces
['P', 'y', 't', 'h', 'o', 'n', ',', ' ', 'p', 'r', 'g', 'a', 'm', '!']
My question is: how do I make the program ignore punctuation such as ". , ; ? ! -" as well as white space? The final output would then look like this instead:
['P', 'y', 't', 'h', 'o', 'n', 'p', 'r', 'g', 'a', 'm']

Just check if the string (letter) is alphanumeric using str.isalnum as an additional condition before appending the character to the list:
string = "Python, program!"
result = []
for x in string:
    if x.isalnum() and x not in result:
        result.append(x)
print(result)
Output:
['P', 'y', 't', 'h', 'o', 'n', 'p', 'r', 'g', 'a', 'm']
If you don't want numbers in your output, try str.isalpha() instead (returns True if the character is alphabetic).
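As a small illustration of the str.isalpha variant, here is a sketch; the sample string containing digits is made up for the example and is not from the question.

```python
text = "Python 3, program 42!"  # made-up sample that also contains digits

letters_only = []
for ch in text:
    # str.isalpha() is False for digits, punctuation and whitespace,
    # so only letters that were not seen before are appended.
    if ch.isalpha() and ch not in letters_only:
        letters_only.append(ch)

print(letters_only)
# ['P', 'y', 't', 'h', 'o', 'n', 'p', 'r', 'g', 'a', 'm']
```

With str.isalnum in place of str.isalpha, the digits '3', '4' and '2' would be kept as well.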

You can filter them out using the string module. This built-in library contains several constants that hold collections of characters in order, such as letters and whitespace.
import string
start = "Python, program!" #Can't name it string since that's the module's name
result = []
for x in start:
    if x not in result and (x in string.ascii_letters):
        result.append(x)
print(result)

Nesting a function inside itself (i'm desperate)

Mentally exhausted.
An explanation just for context; I don't actually need help with hashes:
I'm trying to make a Python script that can brute-force a hashed string or password (for learning only; I'm sure there are thousands out there).
The goal is to make a function that tries all possible combinations of letters, starting with one character (a, b... y, z) and then moving on to one more character (aa, ab... zy, zz, then aaa, aab... zzy, zzz) indefinitely until it finds a match.
First, it asks you for a string (aaaa, for example), then it hashes the string, then it tries to brute-force that hash with the function, and finally the function returns the string again when it finds a match.
PASSWORD_INPUT = input("Password string input: ")
PASSWORD_HASH = encrypt_password(PASSWORD_INPUT) # This returns the clean hash
found_password = old_decrypt() # This is the function below
print(found_password)
I managed to do this chunk of ugly ass code:
built_password = ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
def old_decrypt():
    global built_password
    # First letter
    for a in range(len(characters)): # Characters is a list with the abecedary
        built_password[0] = characters[a]
        if test_password(pretty(built_password)): # This returns True if it matches
            return pretty(built_password)
        # Second letter
        for b in range(len(characters)):
            built_password[1] = characters[b]
            if test_password(pretty(built_password)):
                return pretty(built_password)
            # Third letter
            for c in range(len(characters)):
                built_password[2] = characters[c]
                if test_password(pretty(built_password)):
                    return pretty(built_password)
                # Fourth letter
                for d in range(len(characters)):
                    built_password[3] = characters[d]
                    if test_password(pretty(built_password)):
                        return pretty(built_password)
The problem with this is that it only works with 4-letter strings.
As you can see, it's almost exactly the same loop for every letter, so I thought "Hey, I can make this a single function"... After obsessively trying everything that came to mind for 3 whole days, I came up with this:
# THIS WORKS
def decrypt(letters_amount_int):
    global built_password
    for function_num in range(letters_amount_int):
        for letter in range(len(characters)):
            built_password[function_num] = characters[letter]
            if letters_amount_int >= 1:
                decrypt(letters_amount_int - 1)
            if test_password(pretty(built_password)):
                return pretty(built_password)
# START
n = 1
while True:
    returned = decrypt(n)
    # If it returns a string it gets printed, else trying to print None raises TypeError
    try:
        print("Found hash for: " + returned)
        break
    except TypeError:
        n += 1
The function gets a "1" and tries with 1 letter; if it doesn't return anything, it gets a "2" and tries with 2.
It works, but for some reason it makes a ridiculous number of unnecessary loops that take exponentially more and more time. I've been smashing my head against this and came to the conclusion that I'm not understanding something about Python's internal workings.
Can someone please shed some light on this? Thanks
In case they are needed, these are the other functions:
import hashlib
def encrypt_password(password_str):
    return hashlib.sha256(password_str.encode()).hexdigest()
def test_password(password_to_test_str):
    global PASSWORD_HASH
    if PASSWORD_HASH == encrypt_password(password_to_test_str):
        return True
characters = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D',
'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
'Y', 'Z']

Recursion in this case gives a very elegant solution:
def add_char(s, limit):
    if len(s) == limit:
        yield s
    else:
        for c in characters:
            yield from add_char(s + c, limit)
def generate_all_passwords_up_to_length(maxlen):
    for i in range(1, maxlen + 1):
        yield from add_char("", i)
for password in generate_all_passwords_up_to_length(5):
    test_password(password)
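The final loop above tests every candidate but neither stops nor reports a match. One way to connect the generator to the question's hashing helper might look like the sketch below. It is self-contained, so it redefines a shortened characters alphabet and hashes a hard-coded target ("cab"); both are assumptions made so the demo finishes quickly, and find_password is a hypothetical name.

```python
import hashlib

characters = "abc"  # assumption: shortened alphabet for a fast demo

def encrypt_password(password_str):
    return hashlib.sha256(password_str.encode()).hexdigest()

def add_char(s, limit):
    # Yield every string of exactly `limit` characters that starts with `s`
    if len(s) == limit:
        yield s
    else:
        for c in characters:
            yield from add_char(s + c, limit)

def generate_all_passwords_up_to_length(maxlen):
    # Shorter candidates are produced before longer ones
    for i in range(1, maxlen + 1):
        yield from add_char("", i)

def find_password(target_hash, maxlen):
    # Hypothetical helper: return the first candidate whose hash matches, else None
    for candidate in generate_all_passwords_up_to_length(maxlen):
        if encrypt_password(candidate) == target_hash:
            return candidate
    return None

target_hash = encrypt_password("cab")
found = find_password(target_hash, 3)
print(found)  # cab
```

Because the generator is lazy, returning from find_password as soon as a match appears means no further candidates are built.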

Maybe you could try something like this. Inspired by Multiple permutations, including duplicates
Itertools has a cartesian product generator, which is related to permutation.
import itertools
def decrypt(characters, num_chars):
    for test_list in itertools.product(characters, repeat=num_chars):
        test_str = ''.join(test_list)
        if test_password(test_str):
            return test_str
    return None
for i in range(min_chars, max_chars+1):
    result = decrypt(characters, i)
    if result:
        print(f'Match found: {result}')
        break  # stop once a match is found
If you run this code with characters, min_chars, max_chars = ('012', 1, 2) and print test_str at each step, you'll get:
0
1
2
00
01
02
10
11
12
20
21
22
or it will stop earlier if a match is found. I recommend looking up a recursive, pure-Python implementation of the Cartesian product function if you want to learn more. However, I suspect the Cartesian product generator will be faster than a recursive solution.
Note that itertools.product() produces each value on demand, and writing it this way lets you find a match for shorter strings sooner than for longer ones. But the time this brute-force algorithm takes should indeed increase exponentially with the length of the true password.
Hope this helps.
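To put rough numbers on that exponential growth: an alphabet of n characters has n**k candidates of length k. The sketch below assumes the same 62 symbols as the question's characters list (digits, lowercase, uppercase); candidates_up_to is a made-up helper name.

```python
import itertools
import string

# Assumption: the same 62 symbols as the question's `characters` list
characters = string.digits + string.ascii_lowercase + string.ascii_uppercase

def candidates_up_to(max_len, alphabet_size=len(characters)):
    # Total number of strings of length 1..max_len over the alphabet
    return sum(alphabet_size ** k for k in range(1, max_len + 1))

for length in range(1, 5):
    print(length, candidates_up_to(length))
# 1 62
# 2 3906
# 3 242234
# 4 15018570

# Sanity check against itertools.product for the tiny alphabet '012'
tiny = sum(1 for k in (1, 2) for _ in itertools.product("012", repeat=k))
print(tiny)  # 12
```

Each extra character multiplies the work by about 62, which is why a search over longer passwords slows down so quickly.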

Returning a set for all individual letters, vs a set for each word

I don't understand why I am receiving a set for each individual letter when I run the code below; however, when I simply remove the line if word in 'abcdefghijklmnopqrstuvwxyz ': I receive a set for each phrase. However, I need something that will remove anything that isn't a letter or a space (i.e. / [ ] - etc.) from the larger passage, so the abcd check was the best I could think of for this.
Two follow-up questions:
It seems that if I use return vs print, I receive two different answers (return only returns the last set, whereas print shows all sets).
Rather than having 5 individual sets, how would I put them into a list of 5 sets?
def make_itemsets(words):
    words = str(words)
    words.lower().split()
    for word in words:
        newset = set()
        if word in 'abcdefghijklmnopqrstuvwxyz ':
            newset.update(word)
        print(newset)
words = ['sed', 'ut', 'perspiciatis', 'unde', 'omnis']
make_itemsets(words)
This returns the five sets (but doesn't remove all the excess and won't remove non-letter characters from the larger passage):
def make_itemsets(words):
    words = str(words)
    words.lower().split()
    for word in words:
        newset = set()
        newset.update(word)
        print(newset)
This would be expected output:
[{'d', 'e', 's'},
{'t', 'u'},
{'a', 'c', 'e', 'i', 'p', 'r', 's', 't'},
{'d', 'e', 'n', 'u'},
{'i', 'm', 'n', 'o', 's'}]

You can have your expected output like this:
print ( [set(w) for w in words] )
Output is:
[{'d', 's', 'e'}, {'u', 't'}, {'p', 'e', 'i', 'a', 'c', 'r', 's', 't'}, {'d', 'u', 'e', 'n'}, {'m', 'i', 'o', 's', 'n'}]
Note that sets have no order.
If you want words which are alphabetic characters only, you can do this:
print ( [set(w) for w in words if w.isalpha()] )

Match all clusters containing only letters:
for word in re.compile('[a-z]+').findall('sed ut perspfkdls'):
If you want to build a list of aggregated results:
result = []
...
result.append({c for c in word})
...
return result
Edit: I updated my answer after reading the clarification.
import re
def make_itemsets(words):
    matcher = re.compile('[a-z]+')
    words = str(words).lower()
    words = matcher.findall(words)
    return [{c for c in w} for w in words]
Edit 2: I already gave almost a complete implementation, so I connected the dots.
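A quick check of that function on the question's word list might look like the sketch below. It repeats the function so that it runs on its own; the "messy" input is a made-up example of text with the / [ ] characters the question mentions.

```python
import re

def make_itemsets(words):
    matcher = re.compile('[a-z]+')
    # str() turns a list into its printed form; findall then picks out
    # each run of lowercase letters and ignores quotes, commas and brackets.
    words = str(words).lower()
    words = matcher.findall(words)
    return [{c for c in w} for w in words]

words = ['sed', 'ut', 'perspiciatis', 'unde', 'omnis']
itemsets = make_itemsets(words)
print(itemsets)  # one set per word; the order inside each set may vary

# Made-up input with punctuation mixed in
messy = make_itemsets("sed/ut [unde]")
print(messy)
```

Since sets are unordered, comparing against expected sets (for example itemsets[1] == {'u', 't'}) is more reliable than comparing printed output.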

How to join subsequent digits in a Python list into a double (or more) digit number

I have the following string:
string = 'TAA15=ATT'
I make a list out of it:
string_list = list(string)
print(string_list)
and the result is:
['T', 'A', 'A', '1', '5','=', 'A', 'T', 'T']
I need to detect consecutive digits and join them into a single number, as shown below:
['T', 'A', 'A', '15','=', 'A', 'T', 'T']
I'm also quite concerned about performance. This string conversion is done thousands of times.
Thank you for any hints you can provide.

Here is a very short solution
import re
def digitsMerger(source):
    return re.findall(r'\d+|.', source)
digitsMerger('TAA15=ATT')
['T', 'A', 'A', '15', '=', 'A', 'T', 'T']
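Since the question mentions that the conversion runs thousands of times, the pattern can be compiled once and reused. The following is only a sketch: the names _TOKEN and merge_digits are illustrative, and it uses \D instead of . so that newline characters are also kept as tokens. The re module caches recently used patterns on its own, so any speed gain is likely modest and worth measuring.

```python
import re

# Compiled once at import time and reused on every call
_TOKEN = re.compile(r'\d+|\D')

def merge_digits(source):
    # Each match is either a run of digits or one non-digit character
    return _TOKEN.findall(source)

print(merge_digits('TAA15=ATT'))
# ['T', 'A', 'A', '15', '=', 'A', 'T', 'T']
```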

Using itertools.groupby
Ex:
from itertools import groupby
string = 'TAA15=ATT'
result = []
for k, v in groupby(string, str.isdigit):
    if k:
        result.append("".join(v))
    else:
        result.extend(v)
print(result)
Output:
['T', 'A', 'A', '15', '=', 'A', 'T', 'T']

Another regexp:
import re
s = 'TAA15=ATT'
pattern = r'\d+|\D'
m = re.findall(pattern, s)
print(m)

You can use regular expressions, in Python the library re:
import re
string = 'TAA15=ATT'
num = re.sub('[^0-9,]', "", string)
pos = string.find(num)
str2 = re.sub('\\d+',"", string)
str2 = re.sub('=',"", str2)
print(str2)
l = list()
for el in str2:
    l.append(el)
l.insert(pos, num)
print(l)
Basically, re.sub('[^0-9,]', "", string) says: take the string, match every character that is not (^ means negation) a digit (0-9) or a comma, and replace it with the second argument, i.e. an empty string. What's left is only the digits, which you can convert to an integer if you need a number.
If the = always comes right after the digits, then instead of
str2 = re.sub('\\d+',"", string)
str2 = re.sub('=',"", str2)
you can do
str2 = re.sub('\\d+=',"", string)

You can create a function that compares the last value seen with the next one and use functools.reduce:
from functools import reduce
string_list = ['T', 'A', 'A', '1', '5', 'A', 'T', 'T']
def combine_nums(lst, nxt):
    # If both the previous item and the next one are digits, merge them into one item
    if lst and all(map(str.isdigit, (lst[-1], nxt))):
        return lst[:-1] + [lst[-1] + nxt]
    return lst + [nxt]
print(reduce(combine_nums, string_list, []))
Results:
['T', 'A', 'A', '15', 'A', 'T', 'T']

add a single spaces in list containing characters

I am new to programming. I have a list that contains runs of multiple spaces. Each run of multiple spaces should be replaced with a single space.
lis = ['h','e','l','l','','','','','o','','','','w']
output = ['h','e','l','l','','o','','w']
Could anyone tell me how to do this?

Just simple list comprehension will suffice:
lis = ['h','e','l','l','','','','','o','','','','w']
output = [v for i, v in enumerate(lis) if v != '' or i == 0 or (v == '' and lis[i-1] != '')]
print(output)
This will print:
['h', 'e', 'l', 'l', '', 'o', '', 'w']

You can use a list comprehension with enumerate
and select only those '' which
follow non-empty characters themselves.
The i == 0 check stops the first item from being compared with lis[-1], the last item of the list.
[c for i, c in enumerate(lis) if c or i == 0 or lis[i - 1]]

You can use itertools. The idea here is to group according whether strings are empty, noting bool('') == False.
from itertools import chain, groupby
L = (list(j) if i else [''] for i, j in groupby(lis, key=bool))
res = list(chain.from_iterable(L))
print(res)
['h', 'e', 'l', 'l', '', 'o', '', 'w']

You can simply use zip() within a list comprehension as following:
In [21]: lis = ['', '', 'h','e','l','l','','','','','o','','','','w', '', '']
In [22]: lis[:1] + [j for i, j in zip(lis, lis[1:]) if i or j]
Out[22]: ['', 'h', 'e', 'l', 'l', '', 'o', '', 'w', '']
In this case we're looping over each pair and keeping the second item if at least one item in the pair is non-empty. You then just need to add the first item to the result, because it would otherwise be omitted.

Why not just a for loop?
new_list = []
for char in lis:
    # Skip an empty string when the previously kept item is also empty
    if char == '' and new_list and new_list[-1] == '':
        continue
    new_list.append(char)
print(new_list)
Outputs
['h', 'e', 'l', 'l', '', 'o', '', 'w']
