Get all substrings between two different start and ending delimiters - python

In Python 3, I am trying to get a list of all substrings of a given string that start after a delimiter x and end right before a delimiter y.
I have found solutions that only get me the first occurrence, but the result needs to be a list of all occurrences.
start = '>'
end = '</'
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print((s.split(start))[1].split(end)[0])
The above example is what I've got so far, but I am looking for a more elegant and stable way to get all the occurrences.
So the expected list would contain the JavaScript code as the following entries:
a=eval;b=alert;a(b(/XSS/.source));
a=eval;b=alert;a(b(/XSS/.source));

Looking for patterns in strings seems like a decent job for regular expressions.
This should return a list of anything between a pair of <script> and </script>:
import re
pattern = re.compile(r'<script>(.*?)</script>')
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print(pattern.findall(s))
Result:
['a=eval;b=alert;a(b(/XSS/.source));', 'a=eval;b=alert;a(b(/XSS/.source));']
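If the delimiters are not fixed in advance, the same idea can be wrapped in a small helper. This is only a sketch: the function name find_between is made up for illustration, and it assumes the delimiters should be matched literally, which is why they are passed through re.escape.

```python
import re

def find_between(text, start, end):
    """Return every substring found between a start and an end delimiter.

    find_between is a hypothetical helper name, not part of the original answer.
    """
    # re.escape makes the delimiters literal; .*? keeps each match as short
    # as possible; re.DOTALL lets a match span several lines.
    pattern = re.compile(re.escape(start) + r'(.*?)' + re.escape(end), re.DOTALL)
    return pattern.findall(text)

s = '<script>a=eval;</script><script>b=alert;</script>'
print(find_between(s, '<script>', '</script>'))
# ['a=eval;', 'b=alert;']
```

Note that short delimiters such as '>' and '</' from the question also match inside the closing tags, so more specific delimiters like '<script>' and '</script>' give cleaner results.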

Related

Convert data into comma separated values

How do I convert data into comma-separated values? I want to convert it like this:
I have this data in a single Excel cell:
"ABCD x3 ABC, BAC x 3"
and want to convert it to
ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
I can't find an easy way to do that.
I am trying to solve it in Python so I can get structured data.
Hi Zeeshan, sorting the string into usable data while also multiplying certain parts of it is kind of tricky for me.
The best solution I can think of is kind of gross, but it seems to work. Hopefully my comments aren't too confusing <3
import re
data = "ABCD x3 AB BAC x2"
# This will split the string into a list that you can iterate through.
Datalist = re.findall(r'(\w+)', data)
# Create a new list for the final result.
newlist = []
for item in Datalist:
    # For each item in the Datalist list:
    # if the item is an 'x' followed only by digits, it is a multiplier.
    if re.fullmatch(r'x\d+', item):
        # Split the x from the multiplier number string.
        xvalue = item.split('x')
        # Grab and remove the last item added to newlist because it hasn't been multiplied yet.
        lastitem = newlist.pop()
        # Now we can add the last item back in as many times as the x value.
        newlist.extend([lastitem] * int(xvalue[1]))
    else:
        # If the item is not a multiplier then we can just add it to the list.
        newlist.append(item)
# Print the result.
print(newlist)
# prints - ['ABCD', 'ABCD', 'ABCD', 'AB', 'BAC', 'BAC']
# re.fullmatch() - checks whether the whole string matches a pattern
# .split() - splits a string into multiple substrings
# .pop() - removes the last item from a list and returns that item
# .append() - adds one item to the end of a list
# .extend() - adds every item of another list to the end of a list
Keep in mind that the multiplier has to be written as x immediately followed by a number (x1). If there is a space in between, for example (x 1), the multiplier will not be picked up correctly.
There might be multiple ways around this issue, and I think the best fix is to restructure how the data is formatted in the cell.
Here are a couple of ways you can work with the data. They won't directly solve your issue, but I hope they will help you think about how to approach it (not being rude, I don't actually have a good way to handle your example <3 )
split(' ') will split your string at every space and return a list of substrings you can iterate over.
data = 'ABCD ABCD ABCD ABC BAC BAC BAC'
splitdata = data.split(' ')
print(splitdata)
#prints - ['ABCD', 'ABCD', 'ABCD', 'ABC', 'BAC', 'BAC', 'BAC']
you could also try and match strings from the data
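import re
data2 = "ABCD x3 ABC BAC x3"
result = []
for match in re.finditer(r'(\w+) x(\d+)', data2):
    substring, count = match.groups()
    result.extend([substring] * int(count))
print(result)
#prints - ['ABCD', 'ABCD', 'ABCD', 'BAC', 'BAC', 'BAC'] (ABC has no multiplier, so it is not matched)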
Use re.finditer to go through the string and match data in the following format: '(\w+) x(\d+)'
Each match then gets added to the list.
'\w' matches a word character (a letter, a digit, or an underscore).
'\d' matches a digit.
'+' is the quantifier and means one or more.
So we are matching '(\w+) x(\d+)',
which, broken down, means (\w+) one or more word characters, followed by a space, then 'x', followed by (\d+) one or more digits.
Your cell data is essentially a string followed by a multiplier, then a string followed by another string and another multiplier, so the data feels too irregular for a general solution. I think this requires a direct solution that only works if you know exactly what data is in the cell, which is why I think the best fix is to rework the data in the cell first. I'm in no way an expert, and this answer is meant to help you think of ways around the problem and to add to the discussion :) If someone wants to correct me and offer a better solution, I would love to know myself.
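One possible way to cover the cell value from the question, including the entry without a multiplier and the space in "x 3", is to make the multiplier part of the pattern optional. This is a sketch under assumptions: the function name expand_cell is invented here, and it assumes that a multiplier is always written as a separate "x" token followed by digits.

```python
import re

def expand_cell(cell):
    """Expand 'ABCD x3 ABC, BAC x 3' style text into comma-separated values.

    expand_cell is a hypothetical helper name used only for this sketch.
    """
    result = []
    # (\w+) is the value; the optional group accepts 'x3' as well as 'x 3'.
    for match in re.finditer(r'(\w+)(?:\s+x\s*(\d+))?', cell):
        value, count = match.groups()
        # count is None when the value has no multiplier, so repeat it once.
        result.extend([value] * int(count or 1))
    return ','.join(result)

print(expand_cell("ABCD x3 ABC, BAC x 3"))
# ABCD,ABCD,ABCD,ABC,BAC,BAC,BAC
```

A value that is itself spelled like a multiplier (for example a literal "x3" entry) would still be ambiguous, so cleaning up the cell format remains the more robust fix.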

In python, find tokens in line

A long time ago I wrote a tool that parses text files line by line and does some stuff depending on commands and conditions in the file.
I used regex for this; however, I was never good at regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex gives me the keyword "type" and the value "STRING".
However, now I need to update my tool to have more conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Or you can omit the anchors and the [^\[\]]* part to get the group1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
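As a quick illustration of the second pattern, re.findall returns one (key, value) tuple per bracketed condition when the pattern has two groups. The variable names below are chosen for this sketch only.

```python
import re

# The un-anchored pattern from above: one [key==value] condition per match.
condition = re.compile(r'\[([^\]\[=]*)==([^\]\[=]*)\]')

line = '[type==STRING][amount==0]'
# With two capture groups, findall yields a list of (group1, group2) tuples.
pairs = condition.findall(line)
print(pairs)
# [('type', 'STRING'), ('amount', '0')]
```

The same call on a line with a single condition simply returns a list with one tuple, so the tool would not need separate code paths.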
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING][amount==0]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
    x, y = pair.split("==")
    print(x, y)
Rather depends on the precise "rules" that describe your data. However, for your given data why not:
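import re
text = '[type==STRING][amount==0]'
# \w+ picks up every run of word characters, so keys and values alternate.
words = re.findall(r'\w+', text)
lst = []
# Walk the list two items at a time to pair each key with its value.
for i in range(0, len(words), 2):
    lst.append((words[i], words[i+1]))
print(lst)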
Output:
[('type', 'STRING'), ('amount', '0')]

How can i find the Correct String or 3 Most Similar Strings from a List using a String?

I'm trying to find the 3 closest matches for a given string like "_e_ul" (the correct match from my list would be Mehul), but difflib.get_close_matches seems to return very weird matches that don't match my word at all and look random.
If it helps, I have a list with all the possible strings, so the string I am given will surely be in the list. I need some library that can match the positions of the letters shown in the string against those of the names in the list.
from difflib import get_close_matches
res = get_close_matches(hint, pokemons, n=3, cutoff=0.2)
if res == []:
    print("No Similar Names Found Unfortunately")
else:
    var = "\n".join(res)
    print(f"**Did you Mean...**\n`{var}`")
The hint was _ar__n_e and the answer is supposed to be mareanie.
pokemons is a list containing all Pokémon names.
The output came out to be
**Did you Mean...**
`Cacturne
Carnivine
Charmander`
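Since the hint fixes the position of every known letter, one option is to treat each underscore as a single-character wildcard instead of measuring similarity. The sketch below assumes that an underscore always stands for exactly one character and that case should be ignored; match_hint and the short sample list are made up for illustration.

```python
import re

def match_hint(hint, names):
    """Return the names whose letters line up with the hint.

    match_hint is a hypothetical helper; '_' is assumed to mean one unknown character.
    """
    # Turn every '_' into '.', and escape all other characters literally.
    pattern = ''.join('.' if ch == '_' else re.escape(ch) for ch in hint)
    regex = re.compile(pattern, re.IGNORECASE)
    # fullmatch also enforces that the name has the same length as the hint.
    return [name for name in names if regex.fullmatch(name)]

pokemons = ['Cacturne', 'Carnivine', 'Charmander', 'Mareanie']
print(match_hint('_ar__n_e', pokemons))
# ['Mareanie']
```

If this returns more than three names, the result could still be trimmed or ranked afterwards, for example by passing it to get_close_matches.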

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
import re
s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters except newlines.
group(0) is the complete match of the searched string, and group(1) is the text captured by the first pair of parentheses. You can read about group id here.
If you're only looking for how to grab the following 4 characters using regex, what you are probably looking for is the curly-brace quantifier: '{}'.
They go into more detail in the post here, but essentially you would do [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
A more Pythonic way to solve this problem would be to make use of string slicing, and the index function however.
Eg. given an input_string, when you find the substring at index i using index, then you could use input_string[i+len(sub_str):i+len(sub_str)+4] to grab those special characters.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
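symbol = input_string[start_index:start_index + 4]
print(symbol)
Outputs efg (to show it also works when fewer than 4 characters follow).
index also allows you to give start and end indexes for the search, so you could use it in a loop to find every occurrence of the substring, with the start of each search being the previously found index + 1. Note that index raises ValueError when the substring is not found, so the loop has to handle that (or use find, which returns -1 instead).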
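A loop along those lines could look like the following sketch. It uses str.find rather than index so that a missing substring ends the loop instead of raising an exception; the function name following_chars is invented for this example.

```python
def following_chars(text, sub, count=4):
    """Collect the `count` characters that follow every occurrence of `sub`.

    following_chars is a hypothetical helper name used for illustration.
    """
    results = []
    start = 0
    while True:
        # find returns -1 when there is no further occurrence.
        found = text.find(sub, start)
        if found == -1:
            break
        begin = found + len(sub)
        # Slicing never raises, even if fewer than `count` characters remain.
        results.append(text[begin:begin + count])
        # Continue searching one position after the previous hit.
        start = found + 1
    return results

s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
print(following_chars(s, "a!b2"))
# ['foob', 'bazq', 'spam']
```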

How to merge strings with overlapping characters in python?

I'm working on a Python project which reads in a URL-encoded list of overlapping strings. Each string is 15 characters long and overlaps with the next string in the sequence by at least 3 characters and at most 15 characters (identical).
The goal of the program is to go from a list of overlapping strings - either ordered or unordered - to a compressed URL encoded string.
My current method fails at duplicate segments in the overlapping strings. For example, my program is incorrectly combining:
StrList1 = [ 'd+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
to output:
output = ['ublic+class+HelloWorld+%7B%0A++++public+', '%2F%2F+Sample+program%0Apublic+static+v']
when correct output is:
output = ['%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v']
I am using simple python, not biopython or sequence aligners, though perhaps I should be?
Would greatly appreciate any advice on the matter or suggestions of a nice way to do this in python!
Thanks!
You can start with one of the strings in the list (stored as string), and for each of the remaining strings in the list (stored as candidate) where:
candidate is part of string,
candidate contains string,
candidate's tail matches the head of string,
or, candidate's head matches the tail of string,
assemble the two strings according to how they overlap, and then recursively repeat the procedure with the overlapping string removed from the remaining strings and the assembled string appended, until there is only one string left in the list, at which point it is a valid fully assembled string that can be added to the final output.
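To make the head/tail condition concrete, here is a minimal sketch of checking one pair of strings. It is a simplification: it only reports the longest overlap within the allowed range, whereas the full procedure considers every overlap length. The name tail_head_overlap and its parameters are assumptions made for this example.

```python
def tail_head_overlap(left, right, min_len=3, max_len=15):
    """Return the longest n (min_len..max_len) where left's tail equals right's head, else 0.

    tail_head_overlap is a hypothetical helper, not part of the answer's code.
    """
    # An overlap cannot be longer than either string.
    limit = min(max_len, len(left), len(right))
    # Try the longest overlap first and stop at the first hit.
    for n in range(limit, min_len - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0

left = '%2F%2F+Sample+progr'
right = 'program%0Apublic+'
n = tail_head_overlap(left, right)
# Join the two strings, keeping the shared part only once.
print(n, left + right[n:])
# 5 %2F%2F+Sample+program%0Apublic+
```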
Since there can potentially be multiple ways several strings can overlap with each other, some of which can result in the same assembled strings, you should make output a set of strings instead:
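def assemble(str_list, min=3, max=15):
    # Base case: with zero or one string left, assembly is finished.
    if len(str_list) < 2:
        return set(str_list)
    output = set()
    # Note: pop() removes the last string from the list that was passed in.
    string = str_list.pop()
    for i, candidate in enumerate(str_list):
        matches = set()
        # One string fully contains the other.
        if candidate in string:
            matches.add(string)
        elif string in candidate:
            matches.add(candidate)
        # Head/tail overlaps of every allowed length.
        for n in range(min, max + 1):
            if candidate[:n] == string[-n:]:
                matches.add(string + candidate[n:])
            if candidate[-n:] == string[:n]:
                matches.add(candidate[:-n] + string)
        # Recurse with the candidate replaced by each assembled string.
        for match in matches:
            output.update(assemble(str_list[:i] + str_list[i + 1:] + [match], min, max))
    return output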
so that with your sample input:
StrList1 = ['d+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
assemble(StrList1) would return:
{'%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v'}
or as an example of an input with various overlapping possibilities (that the second string can match the first by being inside, having tail matching the head, and having head matching the tail):
assemble(['abcggggabcgggg', 'ggggabc'])
would return:
{'abcggggabcgggg', 'abcggggabcggggabc', 'abcggggabcgggggabc', 'ggggabcggggabcgggg'}
