I'm working on a text pattern problem. I've the following input -
term = 'CG-14/0,2-L-0_2'
I need to remove all the possible punctuation (delimiters) from the input term. Basically I need the following output from the input term -
'CG1402L02'
I also need to store (in any format (object, dict, tuple etc.)) the delimiter and the position of the delimiter before removing the delimiters.
Example of the output (If returned as tuple) -
((-,2), (/,5), (,,7), (-,9), (-,11), (_,13))
I'm able to get the output using the following python code -
re.sub(r'[^\w]', '', term.replace('_', ''))
But how do I store the delimiter and delimiter position (in the most efficient way) before removing the delimiters?
You can simply walk through term once and collect all the necessary information along the way:
from string import ascii_letters, digits

term = 'CG-14/0,2-L-0_2'

# defined set of allowed characters a-zA-Z0-9
# set lookup is O(1) - fast
ok = set(digits + ascii_letters)
specials = {}
clean = []
for i, c in enumerate(term):
    if c in ok:
        clean.append(c)
    else:
        specials.setdefault(c, [])
        specials[c].append(i)
cleaned = ''.join(clean)
print(clean)
print(cleaned)
print(specials)
Output:
['C', 'G', '1', '4', '0', '2', 'L', '0', '2'] # list of characters in set ok
CG1402L02 # the ''.join()ed list
{'-': [2, 9, 11], '/': [5], ',': [7], '_': [13]} # dict of characters/positions not in ok
See:
string.ascii_letters
string.digits
You can use
specials = []
and inside the iteration:
    else:
        specials.append((c, i))
to get a list of tuples instead of the dictionary:
[('-', 2), ('/', 5), (',', 7), ('-', 9), ('-', 11), ('_', 13)]
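The same bookkeeping can also be done with a regular expression. The following is only a sketch, assuming that "delimiter" means any character outside a-zA-Z0-9; the restore() helper is a hypothetical addition that shows why storing the original positions is useful:

```python
import re

term = 'CG-14/0,2-L-0_2'

# Assumption: anything that is not an ASCII letter or digit is a delimiter.
delimiter = re.compile(r'[^A-Za-z0-9]')

# One pass to record (delimiter, original position), one pass to strip.
specials = [(m.group(), m.start()) for m in delimiter.finditer(term)]
cleaned = delimiter.sub('', term)


def restore(cleaned, specials):
    """Re-insert delimiters; specials must be in ascending position order."""
    chars = list(cleaned)
    for delim, pos in specials:
        chars.insert(pos, delim)
    return ''.join(chars)


print(specials)
print(cleaned)
print(restore(cleaned, specials))
```

Because the positions refer to the original string, re-inserting them in ascending order rebuilds the input exactly.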
You could do something like this, adding whatever other delimiters you need to the list delims:
term = 'CG-14/0,2-L-0_2'
delims = ['-', '/', ',', '_']
locations = []
pos = 0
for c in term:  # iterate through the characters in the string
    if c in delims:
        locations.append([c, pos])  # store the character and its original position
    pos += 1
And then run your re.sub command to remove them.
I have two lists:
list1=('a','b','c')
list2=('2','1','3')
and a text file.
The text file has 3 lines, and according to list1 and list2 I want to append:
'a' to the 2nd line and '-' to the others,
'b' to the 1st line and '-' to the others, and
'c' to the 3rd line and '-' to the others,
like this:
xxxx-b-
xxxxa--
xxxx--c
First task is to get the first list sorted correctly. This is easy if you zip the two lists together and then sort based on the (int-converted) line number:
>>> list1 = ['a', 'b', 'c']
>>> list2 = ['2', '1', '3']
>>> sorted(zip(list1, list2), key=lambda p: int(p[1]))
[('b', '1'), ('a', '2'), ('c', '3')]
Then you need to format the letter into the appropriate string. I'd do that with something like:
'xxxx' + ''.join(char if char == letter else '-' for char in 'abc')
so all together it's:
>>> for row in [
...     'xxxx' + ''.join(char if char == letter else '-' for char in 'abc')
...     for letter, _line in sorted(zip(list1, list2), key=lambda p: int(p[1]))
... ]:
...     print(row)
...
xxxx-b-
xxxxa--
xxxx--c
Now you just need to write it to the appropriate text file instead of printing it; since you don't specify how you want to do that (is it a specific text file? is it the parameter to a function? is it an existing file you're appending to?) I'll leave that part for you to fill in. :)
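One way to fill in the file step is sketched below. It assumes the file is rewritten in place and that list2 holds 1-based line numbers; annotate_lines and annotate_file are hypothetical helper names, and the path is whatever file you are working with:

```python
def annotate_lines(lines, letters, line_numbers):
    """Append one column per letter: the letter on its target line, '-' elsewhere."""
    targets = [int(n) for n in line_numbers]  # 1-based line numbers given as strings
    annotated = []
    for lineno, line in enumerate(lines, 1):
        suffix = ''.join(
            letter if lineno == target else '-'
            for letter, target in zip(letters, targets)
        )
        annotated.append(line.rstrip('\n') + suffix)
    return annotated


def annotate_file(path, letters, line_numbers):
    """Read the file at `path`, annotate every line, and write it back."""
    with open(path) as f:
        lines = f.readlines()
    with open(path, 'w') as f:
        f.write('\n'.join(annotate_lines(lines, letters, line_numbers)) + '\n')


print(annotate_lines(['xxxx', 'xxxx', 'xxxx'], ('a', 'b', 'c'), ('2', '1', '3')))
```

Keeping the line logic separate from the file I/O means it can be checked on a plain list before touching a real file.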
I did it, but I think there is a better method than mine:
list1 = ['1', '4', '3', '2']
list2 = ['a', 'b', 'c', 'd']
j = 0
while j < len(list1):
    with open("note2.txt", 'r+') as f:
        lines = f.readlines()
        note = ""
        f.seek(0)
        for index, line in enumerate(lines, 1):  # count lines from 1, like list1
            if index == int(list1[j]):  # list1 holds strings, so convert before comparing
                note += line.strip() + str(list2[j]) + '\n'
            else:
                note += line.strip() + '-\n'
        f.write(note)  # the with block closes the file, no f.close() needed
    j += 1
Sorry for the confusing title but describing the question within one line is a bit hard. So I have a list that looks like this:
['','','','A','','','B','','C','','D','','','']
And I want to get something like this:
['A','','','B','C','D']
Process:
1. Remove any starting and ending empty string (those before A and after D).
2. Remove the single empty strings that are sandwiched between non-empty strings (like the ones between B & C and C & D). However, if more than one empty string is sandwiched, keep them (like the 2 between A & B).
Could someone help me out on this issue? Thank you very much in advance!
Here's one possible solution. You could use itertools.groupby to identify runs of identical strings, and count how many appear in a row:
>>> import itertools
>>> seq = ['','','','A','','','B','','C','','D','','','']
>>> runs = [(c, len(list(g))) for c,g in itertools.groupby(seq)]
>>> runs
[('', 3), ('A', 1), ('', 2), ('B', 1), ('', 1), ('C', 1), ('', 1), ('D', 1), ('', 3)]
Then remove the first and last elements if they are empty strings:
>>> if runs[0][0] == '': runs = runs[1:]
...
>>> if runs[-1][0] == '': runs = runs[:-1]
...
>>> runs
[('A', 1), ('', 2), ('B', 1), ('', 1), ('C', 1), ('', 1), ('D', 1)]
Then remove any interior groups that are composed of one empty string:
>>> runs = [(char, count) for char, count in runs if not (char == '' and count == 1)]
>>> runs
[('A', 1), ('', 2), ('B', 1), ('C', 1), ('D', 1)]
Then reconstitute the runs into a flat list:
>>> result = [char for char, count in runs for _ in range(count)]
>>> result
['A', '', '', 'B', 'C', 'D']
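The steps above can be collected into a single function. This is a sketch of the same groupby approach; the name strip_and_collapse is made up here, and the extra emptiness checks only guard against an empty or all-empty input:

```python
from itertools import groupby


def strip_and_collapse(seq):
    """Drop leading/trailing empty runs and interior runs of exactly one ''."""
    runs = [(value, len(list(group))) for value, group in groupby(seq)]
    if runs and runs[0][0] == '':
        runs = runs[1:]
    if runs and runs[-1][0] == '':
        runs = runs[:-1]
    # Expand each surviving run back into `count` copies of its value.
    return [
        value
        for value, count in runs
        if not (value == '' and count == 1)
        for _ in range(count)
    ]


seq = ['', '', '', 'A', '', '', 'B', '', 'C', '', 'D', '', '', '']
print(strip_and_collapse(seq))
```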
This is an answer that won't work under all conditions, but will work if you can identify a character that is not present in the list. The general idea is to join the list, strip, replace single runs of the element, and then split on the element:
Setup
L = ['', '', '', 'A', '', '', 'B', '', 'C', '', 'D', '', '', '']
import re
re.sub(r'(?<!#)##(?!#)', r'#', '#'.join(L).strip('#')).split('#')
['A', '', '', 'B', 'C', 'D']
Wrap it in a function and assert that the el element is valid:
def custom_stripper(L, el):
    """
    Strips empty elements from start/end of a list,
    and removes single empty whitespace runs

    Parameters
    ----------
    L: iterable, required
        The list to modify
    el: str, required
        An element found nowhere in the joined list

    Returns
    -------
    A properly formatted list
    """
    assert(el not in ''.join(L))
    rgx = r'(?<!{el}){el}{el}(?!{el})'.format(el=el)
    return re.sub(rgx, el, el.join(L).strip(el)).split(el)
>>> custom_stripper(L, '#')
['A', '', '', 'B', 'C', 'D']
>>> custom_stripper(L, 'A')
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-161-7afa6741e503> in <module>()
----> 1 custom_stripper(L, 'A')
<ipython-input-158-606893c3fe1c> in custom_stripper(L, el)
11 """
12
---> 13 assert(el not in ''.join(L))
14 rgx = r'(?<!{el}){el}{el}(?!{el})'.format(el=el)
15
AssertionError:
To break this down:
>>> '#'.join(L).strip('#')
'A###B##C##D'
>>> re.sub(r'(?<!#)##(?!#)', r'#', 'A###B##C##D')
'A###B#C#D'
>>> 'A###B#C#D'.split('#')
['A', '', '', 'B', 'C', 'D']
Regex Explanation
The substitution is key, because it allows replacement of two # in a row (signifying a place in the list where only a single empty string existed). However, you have to be careful that you don't replace two # in a row, inside another run of # (for example, if there were two empty strings in a row). The key here is negative lookahead/lookbehind.
(?<! # Negative lookbehind
# # Asserts string *does not* match #
)
## # Matches ##
(?! # Negative lookahead
# # Asserts string *does not* match #
)
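One caveat with custom_stripper: the placeholder el is pasted into the pattern unescaped, so a character that is special in regular expressions (such as '.' or '|') changes what the pattern matches. A possible variant, assuming el is a single character, escapes it with re.escape first; the function name is hypothetical:

```python
import re


def custom_stripper_escaped(L, el):
    """Like custom_stripper, but safe for regex metacharacters in `el`.

    Assumption: `el` is a single character that occurs in no element of L.
    """
    if any(el in item for item in L):
        raise ValueError('placeholder %r occurs in the data' % el)
    esc = re.escape(el)
    rgx = r'(?<!{e}){e}{e}(?!{e})'.format(e=esc)
    joined = el.join(L).strip(el)
    # A function replacement avoids backslash handling in the replacement text.
    return re.sub(rgx, lambda m: el, joined).split(el)


L = ['', '', '', 'A', '', '', 'B', '', 'C', '', 'D', '', '', '']
print(custom_stripper_escaped(L, '.'))
print(custom_stripper_escaped(L, '|'))
```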
The title pretty much says it all. I have a small run-length decoding script:
def RLdecode(characterList):
    decodedString = ""
    for count, character in characterList:  # each item is (count, character)
        decodedString += character.upper() * count
    return decodedString
That script requires a list (or whatever this is) that looks like:
[(5,"A"),(2,"B"),(4,"C"),(11,"G")]
But in order to make it more user-friendly, I want the user to be able to input a string like this:
"5A2B4C11G"
How would I convert a string like the one above into a list readable by my script? Also, sorry that the title of the question is very specific, but I don't know what the process is called :\
using itertools.groupby:
There's a nice way to do the letter/digit grouping using itertools.groupby:
import itertools
a="5A2B4C11G"
result = [("".join(v)) for k,v in itertools.groupby(a,str.isdigit)]
that returns ['5', 'A', '2', 'B', '4', 'C', '11', 'G']
Unfortunately, it flattens the number/letter pairs, so more work is required. Note that applying Kaushik's solution to this list gives the expected result, now that the digits are grouped properly:
[(int(result[i]),result[i+1]) for i in range(0,len(result),2)]
result:
[(5, 'A'), (2, 'B'), (4, 'C'), (11, 'G')]
using regexes:
Anyway, in that case, regular expressions are well suited to extract the patterns with the required hierarchy.
Just match the string using 1 or more digits + a letter, and convert the obtained tuples to match the (integer, string) format, using a list comprehension to do so, in one line.
import re
a="5A2B4C11G"
result = [(int(i), v) for i, v in re.findall(r'(\d+)([A-Z])', a)]
print(result)
gives:
[(5, 'A'), (2, 'B'), (4, 'C'), (11, 'G')]
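Putting the parsing and the decoding together might look like the sketch below. The names parse_rle and rl_decode are made up for illustration, and the pattern assumes every run is written as digits followed by one ASCII letter:

```python
import re


def parse_rle(text):
    """Turn '5A2B' into [(5, 'A'), (2, 'B')]."""
    return [(int(count), char) for count, char in re.findall(r'(\d+)([A-Za-z])', text)]


def rl_decode(pairs):
    """Expand (count, character) pairs into the decoded string."""
    return ''.join(char.upper() * count for count, char in pairs)


pairs = parse_rle("5A2B4C11G")
print(pairs)
print(rl_decode(pairs))
```

Building the result with ''.join avoids repeated string concatenation inside a loop.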
Using list comprehension :
#s is the string
[(int(s[i]),s[i+1]) for i in range(0,len(s),2)]
#driver values
IN : s="5A2B4C"
OUT : [(5, 'A'), (2, 'B'), (4, 'C')]
Here range(0,len(s),2) gives values as : [0, 2, 4] which we use to go through the string.
NOTE : this of course only works with strings of even length and with numbers below 10.
EDIT : As for numbers with double digits, the answer by Jean-François Fabre works well.
You can do this with regex if you want:
In one line
sorted_list=[i for i in re.findall(pattern, a, re.M)]
Same approach :
import re
a="5A2B4C"
pattern=r'(\d)(\w)'
pairs = []  # renamed from list to avoid shadowing the built-in
art = re.findall(pattern, a, re.M)
for i in art:
    pairs.append(i)
print(pairs)
For your new edited problem here is my new solution :
import re
a = "5A2B4C11G"
pattern = r'([0-9]+)([a-zA-Z])'
pairs = []  # renamed from list to avoid shadowing the built-in
art = re.findall(pattern, a, re.M)
for i in art:
    pairs.append(i)
print(pairs)
Output:
[('5', 'A'), ('2', 'B'), ('4', 'C'), ('11', 'G')]
You have already got the answer from Jean-François Fabre.
The process is called run-length decoding.
The whole process can be done in one line with the following code.
from re import sub
text = "5A2B4C11G"
sub(r'(\d+)(\D)', lambda m: m.group(2) * int(m.group(1)),text)
OUTPUT : 'AAAAABBCCCCGGGGGGGGGGG'
NOTE: This is not a full answer, just an optimization idea for the OP, since the answer is already given by Jean-François Fabre.
import re
text = "5A2B4C11G"                           # renamed from str to avoid shadowing the built-in
pattern = r"(\d+)(\D)"                       # group1: digit(s), group2: non-digit
substitution = r"\1,\2 "                     # "digits,nondigit "
temp = re.sub(pattern, substitution, text)   # gives "5,A 2,B 4,C 11,G "
temp = temp.split()                          # gives ['5,A', '2,B', '4,C', '11,G']
result = [el.split(",") for el in temp]      # gives [['5', 'A'], ['2', 'B'],
                                             #        ['4', 'C'], ['11', 'G']] - see note
First we replace each sequence of digits followed by a symbol with something to which we can apply a 2-level split(), choosing 2 different delimiters in the replacement string r"\1,\2 ":
space for the 1st level (outer) split(), and
, for the 2nd level one (inner).
Then we apply those 2 splits.
Note: If you have a significant reason to obtain tuples (instead of good enough inner lists), simply apply the tuple() function in the last statement:
result = [tuple(el.split(",")) for el in temp]
I have no idea where to start with this. I need to write a function that will return a string of numbers in ordinal value. So like
stringConvert('DABC')
would give me '4123'
stringConvert('XPFT')
would give me '4213'
I thought maybe I could make a dictionary that associates each letter from the string with an integer, but that seems too inefficient and I still don't know how to put them in order.
You could sort the unique characters in the input string and apply indices to each letter by using the enumerate() function:
def stringConvert(s):
    ordinals = {c: str(ordinal) for ordinal, c in enumerate(sorted(set(s)), 1)}
    return ''.join([ordinals[c] for c in s])
The second argument to enumerate() is the integer at which to start counting; since your ordinals start at 1 you use that as the starting value rather than 0. set() gives us the unique values only.
ordinals is then a dictionary mapping each character to its ordinal (stored as a string), with ordinals assigned in alphabetical order.
Demo:
>>> def stringConvert(s):
...     ordinals = {c: str(ordinal) for ordinal, c in enumerate(sorted(set(s)), 1)}
...     return ''.join([ordinals[c] for c in s])
...
>>> stringConvert('DABC')
'4123'
>>> stringConvert('XPFT')
'4213'
Breaking that all down a little:
>>> s = 'XPFT'
>>> set(s) # unique characters
set(['X', 'F', 'T', 'P'])
>>> sorted(set(s)) # unique characters in sorted order
['F', 'P', 'T', 'X']
>>> list(enumerate(sorted(set(s)), 1)) # unique characters in sorted order with index
[(1, 'F'), (2, 'P'), (3, 'T'), (4, 'X')]
>>> {c: str(ordinal) for ordinal, c in enumerate(sorted(set(s)), 1)} # character to number
{'P': '2', 'T': '3', 'X': '4', 'F': '1'}
Take a look at the string module, especially maketrans and translate.
With those, your code may look like
from string import maketrans, translate  # Python 2 only

def stringConvert(letters):
    return translate(letters, maketrans(''.join(sorted(set(letters))).ljust(9), '123456789'))
and pass your strings as the letters argument.
You could make a character translation table and use the translate() string method:
from string import maketrans

TO = ''.join(str(i+1)[0] for i in xrange(256))

def stringConvert(s):
    frm = ''.join(sorted(set(s)))
    return s.translate(maketrans(frm, TO[:len(frm)]))

print stringConvert('DABC')  # --> 4123
print stringConvert('XPFT')  # --> 4213
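The two translate-based answers above are written for Python 2, where maketrans lives in the string module. In Python 3 that function is gone and str.maketrans takes its place. A rough equivalent, with the hypothetical name string_convert, could look like this:

```python
def string_convert(s):
    """Replace each character by its 1-based rank among the unique characters."""
    unique = sorted(set(s))
    # str.maketrans accepts a dict of single characters to replacement strings.
    table = str.maketrans({char: str(rank) for rank, char in enumerate(unique, 1)})
    return s.translate(table)


print(string_convert('DABC'))
print(string_convert('XPFT'))
```

With more than nine distinct characters the ranks become multi-digit, so the output can no longer be read back unambiguously.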
I can do basic regex alright, but this is slightly different, namely I don't know what the pattern is going to be.
For example, I have a list of similar strings:
lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
In this case the common pattern is two segments of common text: 'sometxt' and 'moretxt', starting and separated by something else that is variable in length.
The common strings and variable strings can of course occur in any order and any number of times.
What would be a good way to condense/compress the list of strings into their common parts and individual variations?
An example output might be:
c = ['sometxt', 'moretxt']
v = [('a','0'), ('b','1'), ('aa','10'), ('zz','999')]
This solution finds the two longest common substrings and uses them to delimit the input strings:
def an_answer_to_stackoverflow_question_1914394(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> an_answer_to_stackoverflow_question_1914394(lst)
    (['sometxt', 'moretxt'], [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')])
    """
    delimiters = find_delimiters(lst)
    return delimiters, list(split_strings(lst, delimiters))
find_delimiters and friends find the delimiters:
import itertools

def find_delimiters(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> find_delimiters(lst)
    ['sometxt', 'moretxt']
    """
    candidates = list(itertools.islice(find_longest_common_substrings(lst), 3))
    if len(candidates) == 3 and len(candidates[1]) == len(candidates[2]):
        raise ValueError("Unable to find useful delimiters")
    if candidates[1] in candidates[0]:
        raise ValueError("Unable to find useful delimiters")
    return candidates[0:2]

def find_longest_common_substrings(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(itertools.islice(find_longest_common_substrings(lst), 3))
    ['sometxt', 'moretxt', 'sometx']
    """
    # range instead of xrange so the code runs on Python 2 and 3
    for i in range(min_length(lst), 0, -1):
        for substring in common_substrings(lst, i):
            yield substring

def min_length(lst):
    return min(len(item) for item in lst)

def common_substrings(lst, length):
    """
    >>> list(common_substrings(["hello", "world"], 2))
    []
    >>> list(common_substrings(["aabbcc", "dbbrra"], 2))
    ['bb']
    """
    assert length <= min_length(lst)
    returned = set()
    for i, item in enumerate(lst):
        for substring in all_substrings(item, length):
            in_all_others = True
            for j, other_item in enumerate(lst):
                if j == i:
                    continue
                if substring not in other_item:
                    in_all_others = False
            if in_all_others:
                if substring not in returned:
                    returned.add(substring)
                    yield substring

def all_substrings(item, length):
    """
    >>> list(all_substrings("hello", 2))
    ['he', 'el', 'll', 'lo']
    """
    for i in range(len(item) - length + 1):
        yield item[i:i+length]
split_strings splits the strings using the delimiters:
import re

def split_strings(lst, delimiters):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(split_strings(lst, find_delimiters(lst)))
    [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]
    """
    for item in lst:
        parts = re.split("|".join(delimiters), item)
        yield tuple(part for part in parts if part != '')
Here is a scary one to get the ball rolling.
>>> import re
>>> makere = lambda n: ''.join(['(.*?)(.+)(.*?)(.+)(.*?)'] + ['(.*)(\\2)(.*)(\\4)(.*)'] * (n - 1))
>>> inp = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
>>> re.match(makere(len(inp)), ''.join(inp)).groups()
('a', 'sometxt', '0', 'moretxt', '', 'b', 'sometxt', '1', 'moretxt', 'aa', '', 'sometxt', '10', 'moretxt', 'zz', '', 'sometxt', '999', 'moretxt', '')
I hope its sheer ugliness will inspire better solutions :)
This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and as such is the simplest, especially since it apparently is non-heuristic.
The first link contains detailed discussion and (pseudo) code examples. Assuming, of course, I'm not completely off the track here.
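To experiment with the diff idea without implementing an algorithm by hand, Python's difflib can align two strings. Note that difflib uses its own matching strategy, not Hunt-McIlroy, and the sketch below compares only one pair of strings; extending it to the whole list is left open. The helper name common_blocks is made up:

```python
from difflib import SequenceMatcher


def common_blocks(a, b):
    """Return the substrings that difflib aligns between a and b, in order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() ends with a zero-size sentinel, which is skipped here.
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


print(common_blocks('asometxt0moretxt', 'bsometxt1moretxt'))
```

Pairs that share part of their variable text (for example 'asometxt...' and 'aasometxt...') will report longer blocks, so the per-pair results would still have to be reconciled across the list.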
This looks much like the LZW algorithm for data (text) compression. There should be Python implementations out there, which you may be able to adapt to your needs.
I assume you have no a priori knowledge of these sub strings that repeat often.
I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.
I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.
How about subbing out the known text, and then splitting?
import re
[re.sub('(sometxt|moretxt)', ',', x).split(',') for x in lst]
# results in
[['a', '0', ''], ['b', '1', ''], ['aa', '10', ''], ['zz', '999', '']]
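The trailing empty strings come from the delimiter at the end of each input. A small variation, assuming the common parts are already known, splits directly with re.split and drops the empty pieces; split_on_known is a hypothetical name:

```python
import re


def split_on_known(lst, known):
    """Split each string on any of the known substrings, dropping empty pieces."""
    # re.escape keeps regex metacharacters in the known text from being interpreted.
    pattern = '|'.join(re.escape(k) for k in known)
    return [tuple(part for part in re.split(pattern, item) if part) for item in lst]


lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
print(split_on_known(lst, ['sometxt', 'moretxt']))
```

Splitting directly also avoids picking ',' as an intermediate separator, which would break if the variable parts contained commas.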