Python Regex match string between specific string and end character

I am building a file stripper to build a config report, and I have a very very long string as my base data. The following is a very small snippet of it, but it at least illustrates what I'm working with.
Snippet Example: DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&
How would I go about matching the following:
between "DEFAULT_GATEWAY=" and "&"
between "DELVRY_AGGREGATION_INTERVAL0=" and "&"
between "DELVRY_AGGREGATION_INTERVAL1=" and "&"
between "DELVRY_SCHEDULE0=" and "&"
between "DELVRY_SNI0=" and "&"
between "DELVRY_USE_SSL_TLS1=" and "&"
and building a dict with it like:
{"DEFAULT_GATEWAY":"192.168.88.1",
"DELVRY_AGGREGATION_INTERVAL0":"1",
"DELVRY_AGGREGATION_INTERVAL1":"1",
"DELVRY_SCHEDULE0":"1",
"DELVRY_SNI0":"192.168.88.158",
"DELVRY_USE_SSL_TLS1":"0"}
?

Here is a way to do it.
In [1]: input = 'DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
In [2]: input.split('&')
Out[2]:
['DEFAULT_GATEWAY=192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0=1',
'DELVRY_AGGREGATION_INTERVAL1=1',
'DELVRY_SCHEDULE0=1',
'DELVRY_SNI0=192.168.88.158',
'DELVRY_USE_SSL_TLS1=0',
'']
In [3]: [keyval.split('=') for keyval in input.split('&') if keyval]
Out[3]:
[['DEFAULT_GATEWAY', '192.168.88.1'],
['DELVRY_AGGREGATION_INTERVAL0', '1'],
['DELVRY_AGGREGATION_INTERVAL1', '1'],
['DELVRY_SCHEDULE0', '1'],
['DELVRY_SNI0', '192.168.88.158'],
['DELVRY_USE_SSL_TLS1', '0']]
In [4]: dict(keyval.split('=') for keyval in input.split('&') if keyval)
Out[4]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
Notes
This is the input line
Split by & to get pairs of key-values. Note the last entry is empty
Split each entry by the equal sign while throwing away empty entries
Build a dictionary
Another Solution
In [8]: from urllib.parse import parse_qsl  # Python 3; in Python 2 this function lived in the urlparse module
In [9]: parse_qsl(input)
Out[9]:
[('DEFAULT_GATEWAY', '192.168.88.1'),
('DELVRY_AGGREGATION_INTERVAL0', '1'),
('DELVRY_AGGREGATION_INTERVAL1', '1'),
('DELVRY_SCHEDULE0', '1'),
('DELVRY_SNI0', '192.168.88.158'),
('DELVRY_USE_SSL_TLS1', '0')]
In [10]: dict(parse_qsl(input))
Out[10]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}

Split first by '&' to get a list of strings, then by '=', skipping the empty piece left by the trailing '&', like this:
d = dict(kv.split('=') for kv in line.split('&') if kv)
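
A slightly more defensive take on the same split-based idea, as a minimal sketch: it skips empty chunks and uses str.partition so that a value which itself contains '=' is kept whole. The helper name parse_pairs is made up for illustration, and the sample line is shortened from the question.

```python
def parse_pairs(line):
    """Parse 'K1=V1&K2=V2&' into a dict (hypothetical helper)."""
    result = {}
    for chunk in line.split('&'):
        if not chunk:
            # skip the empty piece that a trailing '&' leaves behind
            continue
        # partition splits on the first '=' only
        key, _, value = chunk.partition('=')
        result[key] = value
    return result


line = 'DEFAULT_GATEWAY=192.168.88.1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
print(parse_pairs(line))
```

If duplicate keys can occur, the last one wins here; whether that is acceptable depends on your data.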

import re
keys = {"DEFAULT_GATEWAY",
        "DELVRY_AGGREGATION_INTERVAL0",
        "DELVRY_AGGREGATION_INTERVAL1",
        "DELVRY_SCHEDULE0",
        "DELVRY_SNI0",
        "DELVRY_USE_SSL_TLS1"}
resdict = {}
for k in keys:
    # match "KEY=" and capture everything up to the next "&"
    pat = '{}=([^&]*)&'.format(re.escape(k))
    mo = re.search(pat, bigstring)
    if mo is None:
        continue  # no match
    resdict[k] = mo.group(1)
This will leave your desired result in resdict, if bigstring is the string you're searching in.
This assumes you know in advance which keys you'll be looking for, and you keep them in a set keys. If you don't know in advance the keys of interest, that's a very different issue of course.
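
If the keys are not known in advance, one possible approach (a sketch, not the only way) is to let a single pattern capture every key and value in the string. It assumes keys never contain '&' or '=' and values never contain '&'; bigstring here is a shortened sample.

```python
import re

bigstring = ('DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&'
             'DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&')

# With two groups, findall returns (key, value) tuples:
# key is everything up to '=', value is everything up to the next '&'.
pairs = re.findall(r'([^&=]+)=([^&]*)', bigstring)
resdict = dict(pairs)
print(resdict)
```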

Related

Python split regex matches into two variables

I am trying to split a re.findall match into two variables, one for the date and one for the time, but I can't find a way to split the list up. Any help appreciated!
txt = "created_at': datetime.datetime(2023, 1, 17, 11, 38, 26, tzinfo"
x = re.findall(r"datetime.datetime\((.+?)\, (.+?)\, (.+?)\, (.+?)\, (.+?)\, (.+?)\, tzinfo", txt)
print(x)
print(x[0:4])
This is the results I get
[('2023', '1', '17', '11', '38', '26')]
[('2023', '1', '17', '11', '38', '26')]
It seems like the re.findall doesn't create a list with each find but just puts it all into one entry. The first 3 numbers are the date, the last 3 the time. I really don't know how to approach this without being able to grab each item individually.

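re.findall returns one tuple per match, so all six fields sit inside x[0]. You can use a list comprehension to split each match into its date fields (the first three) and its time fields (the last three).
res = [(o[:3], o[3:]) for o in x]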
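
To get the date and time into two separate variables, tuple slicing on each match is probably enough. This is a minimal sketch: the pattern is tightened to \d+ groups and written as a raw string, which is my assumption about the data rather than something from the question.

```python
import re

txt = "created_at': datetime.datetime(2023, 1, 17, 11, 38, 26, tzinfo"
x = re.findall(
    r"datetime\.datetime\((\d+), (\d+), (\d+), (\d+), (\d+), (\d+), tzinfo",
    txt,
)

for fields in x:
    # each match is one 6-tuple: the first 3 items are the date, the last 3 the time
    date, time = fields[:3], fields[3:]
    print(date, time)
```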

removing multiple pipes from a list

So I have some data I have been trying to clean up. It's a list, and it looks like this:
a = [\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain]
I have tried to clean it up by doing this:
a.replace("\n", "|")
The output turns out like this:
[london||18||20||30||||japan||6||80||2|||Spain]
If I do this:
a.replace("\n","")
I get this:
[london,"", "", 18,"","",20"","",30,"","","",""japan,"",""6,"","",80,"","",2"","","","",Spain]
Can anyone explain why I am getting multiple pipes and spaces, and what's the best way to clean the data?

Assuming that your input is:
s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
The issue is that there are multiple '\n' in-between data, therefore just replacing each '\n' with another character (say '|') will give you as many of the new characters as there were '\n'.
The simplest approach is to use str.split() to get the non-blank data:
l = s.split()
print(l)
# ['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
or, combine it with str.join(), if you want to have it separated by '|':
t = '|'.join(s.split())
print(t)
# london|18|20|30|japan|6|80|2|Spain
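
If you would rather keep the replace-style approach, a regex that treats each run of newlines as a single separator gives the same result. A small sketch using the same s as above:

```python
import re

s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'

# \n+ matches a whole run of newlines, so each run becomes exactly one '|';
# strip('|') drops the separator produced by the leading newline
t = re.sub(r'\n+', '|', s).strip('|')
print(t)
```

Unlike s.split(), this only collapses newlines, so spaces inside a field would be preserved.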

I tried it and got this:
a = ['\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain']
print(a[0].replace("\n", ""))
Output:
london182030japan6802Spain
Could you please clarify the exact input and the expected output? It does not seem correct as posted, so I have taken some liberties.
If your input was a string you can use split():
a = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
print(a.split())
Output:
['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']

python regex for incomplete decimals numbers

I have a string of numbers which may have incomplete decimal representation
for example
a = '1. 1,00,000.00 1 .99 1,000,000.999'
desired output
['1','1,00,000.00','1','.99','1,000,000.999']
so far i have tried the following 2
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
which makes .99 to 99 which is not desired
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which gives undesirable empty string results as well
This is for finding currency values in a string, so the comma separators don't have a set pattern or may not be present at all.

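My suggestion is to match the numbers directly with re.findall, using the regex below (Python patterns take no surrounding / delimiters).
I've implemented a snippet in Python.
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
# digits with optional commas and an optional decimal part, or a bare decimal such as .99
result = re.findall(r'\d[\d,]*(?:\.\d+)?|\.\d+', a)
print(result)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']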

You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split(r'(?<=\d)\.\s+|(?<=\d)\s+', a)
print(d)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']

This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6
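
Because that pattern contains four capture groups, re.findall would return 4-tuples rather than plain strings. One way to get the flat list (a sketch, using re.finditer and the full match) is:

```python
import re

a = '1. 1,00,000.00 1 .99 1,000,000.999'
pattern = re.compile(r'([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)')

# group(0) is the whole match, regardless of which alternative matched
numbers = [m.group(0) for m in pattern.finditer(a)]
print(numbers)
```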

extract substring between multiple words in a pandas dataframe

I have a pandas data frame where I need to extract sub-string from each row of a column based on the following conditions
We have start_list ('one','once I','he') and end_list ('fine','one','well').
The sub-string should be preceded by any of the elements of the start_list.
The sub-string may be succeeded by any of the elements of the end_list.
When any of the elements of the start_list is available then the succeeding sub string should be extracted with/without the presence of the elements of the end_list.
Example Problem:
import pandas as pd
df = pd.DataFrame({'a': ['one was fine today', 'we had to drive', ' ',
                         'I think once I was fine eating ham ',
                         'he studies really well and is polite ',
                         'one had to live well and prosper',
                         '43948785943one by onej89044809', '827364hjdfvbfv',
                         '&^%$&*+++===========one kfnv dkfjn uuoiu fine',
                         'they is one who makes me crazy'],
                   'b': ['11', '22', '33', '44', '55', '66', '77', '', '88', '99']})
Expected Result:
df = pd.DataFrame({'a' : ['was', '','','was ','studies really','had to live',
'by','','kfnv dkfjn uuoiu','who makes me crazy'],
'b' : ['11', '22', '33', '44', '55', '66', '77', '',
'88','99']})

I think this should work for you. This solution requires Pandas of course and also the built-in library functools.
Function: remove_preceders
This function takes as input a collection of words start_list and str string. It looks to see if any of the items in start_list are in string, and if so returns only the piece of string that occurs after said items. Otherwise, it returns the original string.
def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            # keep only what follows the first occurrence of word
            string = string[string.find(word) + len(word):]
    return string
Function: remove_succeeders
This function is very similar to the first, except it returns only the piece of string that occurs before the items in end_list.
def remove_succeeders(end_list, string):
    for word in end_list:
        if word in string:
            # keep only what precedes the first occurrence of word
            string = string[:string.find(word)]
    return string
Function: to_apply
How do you actually run the above functions? The apply method allows you to run complex functions on a DataFrame or Series, but it will then look for as input either a full row or single value, respectively (based on whether you're running on a DF or S).
This function takes as input a function to run & a collection of words to check, and we can use it to run the above two functions:
import functools

def to_apply(func, words_to_check):
    # bind words_to_check as the first argument of func
    return functools.partial(func, words_to_check)
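
To see what functools.partial is doing here, this is a small standalone sketch: it binds the word list as the first argument, leaving a one-argument callable of the kind apply expects. The name strip_start is mine, and remove_preceders is repeated so the snippet runs on its own.

```python
import functools


def remove_preceders(start_list, string):
    for word in start_list:
        if word in string:
            string = string[string.find(word) + len(word):]
    return string


# partial fixes start_list, so strip_start only needs the string
strip_start = functools.partial(remove_preceders, ('one', 'once I', 'he'))

print(repr(strip_start('one was fine today')))
print(repr(strip_start('we had to drive')))
```

The first call returns the text after 'one' (including the leading space); the second string contains none of the start words, so it comes back unchanged.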
How to Run
df['no_preceders'] = df.a.apply(
    to_apply(remove_preceders,
             ('one', 'once I', 'he'))
)
df['no_succeders'] = df.a.apply(
    to_apply(remove_succeeders,
             ('fine', 'one', 'well'))
)
df['substring'] = df.no_preceders.apply(
    to_apply(remove_succeeders,
             ('fine', 'one', 'well'))
)
And then there's one final step to remove the items from the substring column that were not affected by the filtering:
def final_cleanup(row):
    if len(row['a']) == len(row['substring']):
        return ''
    else:
        return row['substring']
df['substring'] = df.apply(final_cleanup, axis=1)
Results
Hope this works.

Why is requests adding a part of a list to repeated query URIs?

I am trying to place a variable into a POST request using the Requests library. Here is my code:
import requests
message = "if i can\'t let it go out of my mind"
split_message = message.split()
initial_request = requests.get('http://ws.spotify.com/search/1/track?q='.join(split_message[:3]))
print(initial_request.content)
The outcome I am getting is this (which is an error, as it is messing up the URI):
No connection adapters were found for 'ifhttp://ws.spotify.com/search/1/track?q=ihttp://ws.spotify.com/search/1/track?q=can't'
I would like the request URI to look like this:
"http://ws.spotify.com/search/1/track?q=if i can\'t"
What am I missing here? Is there a better way to pass a variable to the request URI? I tried using a dictionary as the payload, but I can't place a variable into a dictionary.

str.join is quite counterintuitive: the string you call it on is the separator you join with, not something you join to.
Examine the following code:
>>> x = list(range(5))
>>> x
[0, 1, 2, 3, 4]
>>> x = [str(c) for c in x]
>>> x
['0', '1', '2', '3', '4']
>>> "-".join(x)
'0-1-2-3-4'
Here the "joiner" (-) is inserted between every element of our list, which we've converted to strings using a list comprehension.
Here is your code:
'http://ws.spotify.com/search/1/track?q='.join(split_message[:3])
Your joining string is 'http://ws.spotify.com/search/1/track?q=', so it is inserted between every element of the list split_message[:3].
So let's examine what's going on:
>>> message = "if i can\'t let it go out of my mind"
>>> split_message = message.split()
>>> split_message
['if', 'i', "can't", 'let', 'it', 'go', 'out', 'of', 'my', 'mind']
>>> split_message[:3]
['if', 'i', "can't"]
Here the list is 3 items long, which explains why the output string is:
ifhttp://ws.spotify.com/search/1/track?q=ihttp://ws.spotify.com/search/1/track?q=can't
Adding some line breaks makes this clearer:
if
http://ws.spotify.com/search/1/track?q=
i
http://ws.spotify.com/search/1/track?q=
can't
Notice the joining string is inserted twice, which explains why at first glance it just looks like it's mucking up the URI.
What you want instead is:
request = 'http://ws.spotify.com/search/1/track?q='+"%20".join(split_message[:3])
Notice that in the above we join using %20, like so: "%20".join(split_message[:3]), and add the result to the request prefix. This gives the URL below, with the spaces correctly encoded:
"http://ws.spotify.com/search/1/track?q=if%20i%20can't"

initial_request = requests.get('http://ws.spotify.com/search/1/track?q=' +
                               ' '.join(split_message[:3]))
s.join(seq) sticks together the elements of seq with the string s stuck between them. For example, 'a'.join('bcd') == 'bacad'. Note that your desired string probably isn't what you really should be sending; you should probably use + instead of spaces. The requests module can handle url parameter encoding for you:
params = {'q': ' '.join(split_message[:3])}
initial_request = requests.get('http://ws.spotify.com/search/1/track',
                               params=params)
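
To see roughly what that encoding looks like without sending a request, here is a sketch using only the standard library's urllib.parse (Python 3). It illustrates the two common encodings of a space, %20 and +; it is not a claim about the exact URL requests would produce.

```python
from urllib.parse import quote, urlencode

message = "if i can't let it go out of my mind"
query = ' '.join(message.split()[:3])  # "if i can't"

base = 'http://ws.spotify.com/search/1/track'

# quote() percent-encodes the value: spaces become %20
url_percent = base + '?q=' + quote(query)
# urlencode() builds the whole query string: spaces become +
url_plus = base + '?' + urlencode({'q': query})

print(url_percent)
print(url_plus)
```

Both functions also escape the apostrophe, which hand-joining with "%20" would leave as-is.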
