write RE to get string in between first alphabet to last alphabet? - python

I need to print the two missing strings KMJD23KN0008393 and KMJD23KN0008394, but what I get is KMJD23KN8393 and KMJD23KN8394. I need the missing zeros in the list as well.
import re

ll = ['KMJD23KN0008391','KMJD23KN0008392','KMJD23KN0008395','KMJD23KN0008396']
missList = []
for i in ll:
    # Split each item into alternating runs of letters and digits.
    reList = re.findall(r"[^\W\d_]+|\d+", i)
    print(reList)

The issue can be decomposed into three parts:
Extracting the trailing number string
Interpreting that substring as the actual number, which should form a consecutive sequence
Re-formatting the missing items based on the surrounding items.
These steps rest on several implicit assumptions. You need to be aware of them, and ideally make them explicit. I've worked with the following assumptions:
Anything that forms a number at the end of the string is considered. Everything before that is a prefix and is assumed to be identical throughout the series.
The trailing numbers in the series always have the same width.
The items in the list are sorted in ascending order by their trailing number.
The following code implements these assumptions:
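import re

all_missing = []
# Seeding with the last item's number makes the gap check a no-op for the first item.
last_num = int(re.search(r'\d+$', ll[-1])[0])
# Everything up to and including the last non-digit character is the prefix.
prefix = re.match(r'.*\D', ll[0])[0]
for item in ll:
    num_str = re.search(r'\d+$', item)[0]
    num = int(num_str)
    num_width = len(num_str)
    # Every number strictly between the previous and the current one is missing.
    for missing in range(last_num + 1, num):
        all_missing.append(f'{prefix}{missing:0{num_width}}')
    last_num = num
print(all_missing)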
Some notes here:
To extract the trailing number, a very simple regex is sufficient: \d+$. That is: one or more digits, until the end of the string.
Conversely, to extract the prefix, we search for any sequence of arbitrary characters where the last character is a non-digit. That is: .*\D.
To re-format the missing items, we concatenate the prefix with the missing number, and we pad the missing number with zeros (from left) until it is of the expected width. This is achieved by using Python’s f-strings with the format specifier '0{num_width}'.
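For reuse, the same three steps can be wrapped in a function. This is only a sketch under the assumptions listed above (shared prefix, fixed width, ascending order); the name find_missing is hypothetical, and it takes the prefix and width from each item instead of computing them once.

```python
import re

def find_missing(items):
    """Return the zero-padded items missing from a sorted series (hypothetical helper)."""
    missing = []
    previous = None
    for item in items:
        # Lazy prefix, then the trailing run of digits.
        match = re.match(r'(.*?)(\d+)$', item)
        if match is None:
            raise ValueError(f'no trailing number in {item!r}')
        prefix, digits = match.groups()
        number = int(digits)
        if previous is not None:
            for n in range(previous + 1, number):
                # Pad with zeros to the width of the current item's number.
                missing.append(f'{prefix}{n:0{len(digits)}d}')
        previous = number
    return missing

ll = ['KMJD23KN0008391', 'KMJD23KN0008392', 'KMJD23KN0008395', 'KMJD23KN0008396']
print(find_missing(ll))  # ['KMJD23KN0008393', 'KMJD23KN0008394']
```

An item without a trailing number raises ValueError here rather than being skipped silently; whether that is the right behaviour depends on your data.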

Related

Remove leading zeros from python complex executable string

I am working with Grammatical Evolution (GE) on Python 3.7.
My grammar generates executable strings in the format:
np.where(<variable> <comparison_sign> <constant>, (<probability1>), (<probability2>))
Yet, the string can get quite complex, with several chained np.where .
<constant> in some cases contains leading zeros, which causes the executable string to raise errors. GE is supposed to generate expressions containing leading zeros; however, I have to detect and remove them.
An example of a possible solution containing leading zeros:
"np.where(x < 02, np.where(x > 01.5025, (0.9), (0.5)), (1))"
Problem:
There are two types of numbers containing leading zeros: int and float.
Supposing that I detect "02" in the string. If I replace all occurrences in the string from "02" to "2", the float "01.5025" will also be changed to "01.525", which cannot happen.
I've made several attempts with different re patterns, but couldn't solve it.
To detect that an executable string contains leading zeros, I use:
try:
    _ = eval(expression)
except SyntaxError:
    new_expression = fix_expressions(expression)
I need help building the fix_expressions Python function.
You could try to come up with a regular expression for numbers with leading zeros and then replace the leading zeros.
import re

def remove_leading_zeros(string):
    return re.sub(r'([^\.^\d])0+(\d)', r'\1\2', string)

print(remove_leading_zeros("np.where(x < 02, np.where(x > 01.5025, (0.9), (0.5)), (1))"))
# output: np.where(x < 2, np.where(x > 1.5025, (0.9), (0.5)), (1))
The remove_leading_zeros function finds all occurrences of [^\.^\d]0+\d and removes the zeros. [^\.^\d]0+\d translates to: a character that is neither a digit nor a dot, followed by at least one zero, followed by a digit. (Inside the character class the second ^ is a literal caret, so a caret is excluded as well.) The parentheses in the regex denote capture groups, which are used to preserve the character before the leading zeros and the digit after them.
Regarding Csaba Toth's comment:
The problem with 02+03*04 is that there is a zero at the beginning of the string.
One can modify the regex such that it matches also the beginning of the string in the first capture group:
r"(^|[^\.^\d])0+(\d)"
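As an alternative sketch (not the pattern from the answer above), the fix_expressions function from the question could use a lookbehind and a lookahead, so nothing needs to be captured and put back. The pattern name LEADING_ZEROS is made up for this example. It assumes that a number never directly follows a letter, digit, underscore, or dot unless it is part of that token, so digits inside a name such as x01 are left alone.

```python
import re

# One or more zeros that do not follow a word character or a dot,
# and that are followed by another digit.
LEADING_ZEROS = re.compile(r'(?<![\w.])0+(?=\d)')

def fix_expressions(expression):
    """Remove leading zeros from int and float literals in an expression string."""
    return LEADING_ZEROS.sub('', expression)

print(fix_expressions("np.where(x < 02, np.where(x > 01.5025, (0.9), (0.5)), (1))"))
# np.where(x < 2, np.where(x > 1.5025, (0.9), (0.5)), (1))
print(fix_expressions("02+03*04"))
# 2+3*4
```

Because the lookbehind also succeeds at the very start of the string, no separate ^ alternative is needed.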
You can remove leading 0's in a string using .lstrip()
str_num = "02.02025"
print("Initial string: %s \n" % str_num)
str_num = str_num.lstrip("0")
print("Removing leading 0's with lstrip(): %s" % str_num)
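Note that lstrip works on one isolated number, not on a whole expression, and that it strips every leading zero: "0.5" becomes ".5" and "0" becomes an empty string. A minimal sketch of a guard against that, with strip_leading_zeros as a hypothetical helper name:

```python
def strip_leading_zeros(number_text):
    """Drop leading zeros from one numeric literal while keeping it valid."""
    stripped = number_text.lstrip("0")
    # lstrip alone turns "0.5" into ".5" and "0" into "", so restore one zero.
    if not stripped or stripped[0] == ".":
        return "0" + stripped
    return stripped

for text in ["02.02025", "0.5", "0", "007"]:
    print(text, "->", strip_leading_zeros(text))
```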

The number of strings that contain any characters that are not alphabetic characters

I am a few weeks into learning python properly and couldn't find a way to proceed from what I currently have. The question is:
Use the accumulator pattern to write a function count_messy(strings) that takes a list of strings as a parameter and returns an int representing the number of strings that contain any characters that are not alphabetic characters. The string method isalpha will be useful here.
Here's my current code:
def count_messy(strings):
    for string in strings:
        ans = strings.isalpha(string)
    return(len(ans))

print(count_messy(["x", "y2y", "zz%z"]))
Should output:
2
Preferably the use of for loops for the accumulator pattern and no list comprehension will be appreciated.
In order to end this...
The correct solution is:
def count_messy(strings):
    count = 0
    for string in strings:
        if not string.isalpha():
            count += 1
    return count
Issues with the original code and with several other solution attempts included:
isalpha is a method of the string string, not of the list strings.
not string.isalpha() must be used to count the strings that do not consist only of alphabetic characters.
A count integer variable needs to be initialized with 0 and incremented for each string fulfilling this condition.
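A quick check against the sample input, plus one edge case worth knowing: "".isalpha() returns False, so an empty string is counted as messy too. The function is repeated here only so that the snippet runs on its own.

```python
def count_messy(strings):
    count = 0  # accumulator
    for string in strings:
        # isalpha() is False for any non-letter character and for "".
        if not string.isalpha():
            count += 1
    return count

print(count_messy(["x", "y2y", "zz%z"]))  # 2
print(count_messy(["", "abc"]))           # 1
```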

How to get all possible incremental values between two mixed strings in Python?

I am new to Python (and programming, in general) and hoping to see if someone can help me. I am trying to automate a task that I am currently doing manually but is no longer feasible. I want to find and write all strings between two given strings. For example, if starting and ending strings are XYZ-DF 000010 and XYZ-DF 000014, the desired output should be XYZ-DF 000010; XYZ-DF 000011; XYZ-DF 000012; XYZ-DF 000013; XYZ-DF 000014. The prefix and numbers (and their padding) are not always the same. For example, next starting and ending strings in the list could be ABC_XY00000001 and ABC_XY00000123. The prefix and padding for any pair of starting and ending strings, though, will always be the same.
I think I need to separate the prefix (includes any alphabets, spaces, underscore, hyphen etc.) and numbers, remove padding from the numbers, increment the numbers by 1 from starting number to ending number for every starting and ending strings in a second loop, and then finally get the output by concatenation.
So far this is what I have:
First, I read the 2 columns that contain a list of starting and ending strings in a csv into lists using pandas:
columns = ['Beg', 'End']
data = pd.read_csv('C:/Downloads/test.csv', names=columns, header = None)
begs = data.Beg.tolist()
ends= data.End.tolist()
Next, I loop over "begs" and "ends" using the zip function.
for beg, end in zip(begs,ends):
Inside the loop, I want to iterate over each string in begs and ends (one pair at a time) and perform the following operations on them:
1) Use regex to separate the characters (including alphabets, spaces, underscore, hyphen etc.) from the numbers (including padding) for each of the strings one at a time.
start = re.match(r"([a-z-_ ]+)([0-9]+)", beg, re.I) #Let's assume first starting string in the begs list is "XYZ-DF 000010" from my example above
prefix = start.group(1) #Should yield "XYZ-DF "
start_num = start.group(2) #Should yield "000010"
padding = (len(start_num)) #Yields 6
start_num_stripped = start_num.lstrip("0") #Yields 10
end = re.match(r"([a-z-_ ]+)([0-9]+)", end, re.I) #Let's assume first ending string in the ends list is "XYZ-DF 000014" from my example above
end_num = end.group(2) #Yields 000014
end_num_stripped = end_num.lstrip("0") #Yields 14
2) After these operations, run a nested while loop from start_num_stripped until end_num_stripped
output_string = ""
while start_num_stripped <= end_num_stripped:
    output_string = output_string+prefix+start_num_stripped.zfill(padding)+"; "
    start_num_stripped += 1
Finally, how do I write the output_string for each pair of starting and ending strings to a csv file that contains 3 columns containing the starting string, ending string, and their output string? An example of an output in csv format is given below (newline after each row is for clarity and not needed in the output).
"Starting String", "Ending String", "Output String"
"ABCD-00001","ABCD-00003","ABCD-00001; ABCD-00002; ABCD-00003"
"XYZ-DF 000010","XYZ-DF 000012","XYZ-DF 000010; XYZ-DF 000011; XYZ-DF 000012"
"BBB_CC0000008","BBB_CC0000014","BBB_CC0000008; BBB_CC0000009; BBB_CC0000010; BBB_CC0000011; BBB_CC0000012; BBB_CC0000013; BBB_CC0000014"
You could find the longest trailing numeric suffix using a regular expression. Then simply iterate numbers from start to end appending them (with leading zeros) to the common prefix:
import re
startString = "XYZ-DF 000010"
endString = "XYZ-DF 000012"
suffixLen = len(re.findall("[0-9]*$",startString)[0])
start = int("1"+startString[-suffixLen:])
end = int("1"+endString[-suffixLen:])
result = [ startString[:-suffixLen]+str(n)[1:] for n in range(start,end+1) ]
csvLine = '"' + '","'.join([ startString,endString,";".join(result) ]) + '"'
print(csvLine) # "XYZ-DF 000010","XYZ-DF 000012","XYZ-DF 000010;XYZ-DF 000011;XYZ-DF 000012"
Note: using int("1" + suffix) means every number in the range has one more digit than the suffix (1xxxxx). This makes it easy to restore the leading zeros: convert the number back to a string and drop the first character with str(n)[1:].
Note2: I'm not familiar with pandas but I'm pretty sure it has a way to write a csv directly from the result list rather than formatting it manually as I did here in csvLine.
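For the CSV part of the question, the standard library's csv module can do the quoting for you. The following is a sketch, not a drop-in solution: expand_range and write_ranges are hypothetical names, the padding is done with a format specifier instead of the int("1" + suffix) trick, and it assumes each pair shares its prefix and padding as stated in the question.

```python
import csv
import io
import re

def expand_range(beg, end):
    """Return every string from beg to end, keeping prefix and zero padding."""
    prefix, start_digits = re.match(r'(.*?)(\d+)$', beg).groups()
    end_digits = re.search(r'\d+$', end).group()
    width = len(start_digits)
    return [f'{prefix}{n:0{width}d}'
            for n in range(int(start_digits), int(end_digits) + 1)]

def write_ranges(pairs, handle):
    """Write one quoted CSV row per (beg, end) pair to an open text handle."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
    writer.writerow(['Starting String', 'Ending String', 'Output String'])
    for beg, end in pairs:
        writer.writerow([beg, end, '; '.join(expand_range(beg, end))])

# An in-memory buffer stands in for a real file in this example.
buffer = io.StringIO()
write_ranges([('ABCD-00001', 'ABCD-00003'), ('XYZ-DF 000010', 'XYZ-DF 000012')], buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open(path, 'w', newline='') instead of the buffer, and feed it zip(begs, ends) from the question.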

How to merge strings with overlapping characters in python?

I'm working on a Python project which reads in a list of overlapping URL-encoded strings. Each string is 15 characters long and overlaps with the next string in sequence by at least 3 characters and at most 15 characters (identical).
The goal of the program is to go from a list of overlapping strings - either ordered or unordered - to a compressed URL encoded string.
My current method fails at duplicate segments in the overlapping strings. For example, my program is incorrectly combining:
StrList1 = [ 'd+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
to output:
output = ['ublic+class+HelloWorld+%7B%0A++++public+', '%2F%2F+Sample+program%0Apublic+static+v']
when correct output is:
output = ['%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v']
I am using simple python, not biopython or sequence aligners, though perhaps I should be?
Would greatly appreciate any advice on the matter or suggestions of a nice way to do this in python!
Thanks!
You can start with one of the strings in the list (stored as string), and for each of the remaining strings in the list (stored as candidate) where:
candidate is part of string,
candidate contains string,
candidate's tail matches the head of string,
or, candidate's head matches the tail of string,
assemble the two strings according to how they overlap, and then recursively repeat the procedure with the overlapping string removed from the remaining strings and the assembled string appended, until there is only one string left in the list, at which point it is a valid fully assembled string that can be added to the final output.
Since there can potentially be multiple ways several strings can overlap with each other, some of which can result in the same assembled strings, you should make output a set of strings instead:
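def assemble(str_list, min=3, max=15):
    # Base case: with zero or one string left, assembly is complete.
    if len(str_list) < 2:
        return set(str_list)
    output = set()
    string = str_list.pop()  # note: removes the last item from the list passed in
    for i, candidate in enumerate(str_list):
        matches = set()
        # One string fully contained in the other.
        if candidate in string:
            matches.add(string)
        elif string in candidate:
            matches.add(candidate)
        # Head/tail overlaps of every allowed length.
        for n in range(min, max + 1):
            if candidate[:n] == string[-n:]:
                matches.add(string + candidate[n:])
            if candidate[-n:] == string[:n]:
                matches.add(candidate[:-n] + string)
        # Recurse with the candidate replaced by each assembled string.
        for match in matches:
            output.update(assemble(str_list[:i] + str_list[i + 1:] + [match], min, max))
    return output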
so that with your sample input:
StrList1 = ['d+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
assemble(StrList1) would return:
{'%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v'}
or as an example of an input with various overlapping possibilities (that the second string can match the first by being inside, having tail matching the head, and having head matching the tail):
assemble(['abcggggabcgggg', 'ggggabc'])
would return:
{'abcggggabcgggg', 'abcggggabcggggabc', 'abcggggabcgggggabc', 'ggggabcggggabcgggg'}
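To see the tail-to-head case in isolation, here is a small sketch that joins just two strings on their longest overlap. merge_pair is a hypothetical helper and is not part of assemble; unlike assemble, it returns only the longest overlap rather than every possible one.

```python
def merge_pair(left, right, min_overlap=3, max_overlap=15):
    """Join left and right on the longest tail/head overlap, or return None."""
    longest = min(max_overlap, len(left), len(right))
    # Try the longest overlap first and shrink down to the minimum.
    for n in range(longest, min_overlap - 1, -1):
        if left[-n:] == right[:n]:
            return left + right[n:]
    return None

print(merge_pair('ublic+class+Hel', 'lass+HelloWorld'))  # ublic+class+HelloWorld
print(merge_pair('abc', 'xyz'))                          # None
```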

Find Certain String Indices

I have this string and I need to get a specific number out of it.
E.G. encrypted = "10134585588147, 3847183463814, 18517461398"
How would I pull out only the second integer out of the string?
You are looking for the "split" method. Turn a string into a list by specifying a smaller part of the string on which to split.
>>> encrypted = '10134585588147, 3847183463814, 18517461398'
>>> encrypted_list = encrypted.split(', ')
>>> encrypted_list
['10134585588147', '3847183463814', '18517461398']
>>> encrypted_list[1]
'3847183463814'
>>> encrypted_list[-1]
'18517461398'
Then you can just access the indices as normal. Note that lists can be indexed forwards or backwards. By providing a negative index, we count from the right rather than the left, selecting the last index (without any idea how big the list is). Note this will produce IndexError if the list is empty, though. If you use Jon's method (below), there will always be at least one index in the list unless the string you start with is itself empty.
Edited to add:
What Jon is pointing out in the comment is that if you are not sure whether the string will be well-formatted (e.g., always separated by exactly one comma followed by exactly one space), you can replace all the commas with spaces (encrypted.replace(',', ' ')), then call split without arguments, which splits on any run of whitespace characters. As usual, you can chain these together:
encrypted.replace(',', ' ').split()
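If the separators are even less predictable, another option is to pull out the runs of digits directly with a regular expression. This is a sketch that assumes the values are plain non-negative integers; nth_number is a hypothetical helper name, and the messy input string is made up for the example.

```python
import re

def nth_number(text, index):
    """Return the index-th run of digits in text, counting from 0."""
    return re.findall(r"\d+", text)[index]

# Deliberately inconsistent separators.
encrypted = "10134585588147,3847183463814 ,  18517461398"
print(nth_number(encrypted, 1))                 # 3847183463814
print(encrypted.replace(",", " ").split()[1])   # 3847183463814
```

Like list indexing in general, nth_number raises IndexError when the string contains fewer numbers than requested.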
