Python split regex matches into two variables - python

I am trying to split a re.findall match into two variables, one for the date and one for the time, but I can't find a way to split the list up. Any help appreciated!
import re
txt = "created_at': datetime.datetime(2023, 1, 17, 11, 38, 26, tzinfo"
x = re.findall(r"datetime\.datetime\((.+?), (.+?), (.+?), (.+?), (.+?), (.+?), tzinfo", txt)
print(x)
print(x[0:4])
These are the results I get:
[('2023', '1', '17', '11', '38', '26')]
[('2023', '1', '17', '11', '38', '26')]
It seems like the re.findall doesn't create a list with each find but just puts it all into one entry. The first 3 numbers are the date, the last 3 the time. I really don't know how to approach this without being able to grab each item individually.

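re.findall returns one list entry per match, and when the pattern has several groups each entry is a tuple of those groups. You have a single match, so x is a list holding one 6-tuple; x[0:4] slices the list, not the tuple. Index into the match first, then slice it:
date, time = x[0][:3], x[0][3:]
If there can be several matches, use a list comprehension to split each one:
res = [(o[:3], o[3:]) for o in x]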
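As a minimal sketch of the same idea, the snippet below uses re.search for the single match and converts the groups to integers before splitting them. The names pattern, numbers, date_parts and time_parts are illustrative assumptions, not part of the original question.

```python
import re

txt = "created_at': datetime.datetime(2023, 1, 17, 11, 38, 26, tzinfo"

# One capturing group per number; \d+ keeps each group to digits only.
pattern = re.compile(
    r"datetime\.datetime\((\d+), (\d+), (\d+), (\d+), (\d+), (\d+), tzinfo"
)

match = pattern.search(txt)
if match:
    # match.groups() is a tuple of six strings; convert them to ints.
    numbers = [int(g) for g in match.groups()]
    # The first three values are the date, the last three the time.
    date_parts, time_parts = numbers[:3], numbers[3:]
    print(date_parts)  # [2023, 1, 17]
    print(time_parts)  # [11, 38, 26]
```

Checking that match is not None before slicing avoids an AttributeError when the text does not contain the expected pattern.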

Related

Regex "AND" in an expression extract this and that

I'm struggling to write a regex that extracts the following numbers in bold below.
I set up 3 different regex for each value, but since the last value might have a space in between I don't know how to accommodate an "AND" here.
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
I have tried this and it is working for the first 2 but not for the last one. I'd like to have the last one in a single regex.
p1 = re.compile(r'(\d+)/')
p2 = re.compile(r'/(\d+)')
p3 = re.compile(r'(?=.*[R](\d+))(?=.*[R]\s(\d+))')
I've tried different stuff and this is the last code I tried with unsuccessful results
if I do this
p1.findall(tire), p2.findall(tire), p3.findall(tire)
I would like to see this:
(['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17'])
You were almost there! You don't need three separate regular expressions.
Instead, use multiple capturing groups in a single regex.
(\d{3})\/(\d{2})R\s?(\d{2})
Try it: https://regex101.com/r/Xn6bry/1
Explanation:
(\d{3}): Capture three digits
\/: Match a forward-slash
(\d{2}): Capture two digits
R\s?: Match an R followed by an optional whitespace
(\d{2}): Capture two digits.
In Python, do:
p1 = re.compile(r'(\d{3})\/(\d{2})R\s?(\d{2})')
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = re.findall(p1, tire)
Now if you look at matches, you get
[('275', '65', '18'), ('275', '65', '18'), ('265', '70', '17')]
Rearranging this to the format you want should be pretty straightforward:
# Make an empty list of lists with three entries, one per group
groups = [[], [], []]
for match in matches:
    for groupnum, item in enumerate(match):
        groups[groupnum].append(item)
Now groups is [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
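The same rearrangement can also be written as a transposition with zip. This is a sketch of an alternative to the loop, not a change to the regex; the names pattern and column are assumptions made for illustration.

```python
import re

tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
pattern = re.compile(r'(\d{3})/(\d{2})R\s?(\d{2})')

# findall returns one 3-tuple per match.
matches = pattern.findall(tire)

# zip(*matches) turns rows (one per match) into columns (one per group).
groups = [list(column) for column in zip(*matches)]
print(groups)
# [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
```

One difference from the explicit loop: if there are no matches at all, zip(*matches) yields nothing and groups is an empty list rather than three empty lists.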

removing multiple pipes from a list

So I have some data I have been trying to clean up. It's a list and it looks like this:
a = [\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain]
I have tried to clean it up by doing this:
a.replace("\n", "|")
The output turns out like this:
[london||18||20||30||||japan||6||80||2|||Spain]
If I do this:
a.replace("\n","")
I get this:
[london,"", "", 18,"","",20"","",30,"","","",""japan,"",""6,"","",80,"","",2"","","","",Spain]
Can anyone explain why I am getting multiple pipes and spaces, and what's the best way to clean the data?
Assuming that your input is:
s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
The issue is that there are multiple '\n' in-between data, therefore just replacing each '\n' with another character (say '|') will give you as many of the new characters as there were '\n'.
The simplest approach is to use str.split() to get the non-blank data:
l = list(s.split())
print(l)
# ['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']
or, combine it with str.join(), if you want to have it separated by '|':
t = '|'.join(s.split())
print(t)
# london|18|20|30|japan|6|80|2|Spain
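If the values themselves could contain spaces, str.split() with no arguments would also split on those. A possible alternative, sketched here under that assumption, is to collapse each run of newlines into a single separator with re.sub.

```python
import re

s = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'

# Replace each run of one or more newlines with a single '|',
# then drop the '|' left over from the leading newline.
t = re.sub(r'\n+', '|', s).strip('|')
print(t)  # london|18|20|30|japan|6|80|2|Spain
```

This touches only newline characters, so a value such as 'New York' would stay in one piece.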
I tried it and got this:
a = ['\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain']
print(a[0].replace("\n", ""))
Output:
london182030japan6802Spain
Could you please clarify the exact input and the expected output? The input as posted is not valid Python, so I have taken some liberties.
If your input was a string you can use split():
a = '\nlondon\n\n18\n\n20\n\n30\n\n\n\n\njapan\n\n6\n\n80\n\n2\n\n\n\n\nSpain'
print(a.split())
Output:
['london', '18', '20', '30', 'japan', '6', '80', '2', 'Spain']

Arrange list of strings that are divided into 4 parts by the different parts?

I have a list comprised of strings that all follow the same format 'Name%Department%Age'
I would like to order the list by age, then name, then department.
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
after sorting would output:
['Sarah%English%50', 'John%English%31', 'George%Maths%30', 'John%English%30', 'John%Maths%30']
The closest I have found to what I want is the following (found here: How to sort a list by Number then Letter in python?)
import re
def sorter(s):
    match = re.search(r'([a-zA-Z]*)(\d+)', s)
    return int(match.group(2)), match.group(1)
sorted(alist, key=sorter)
Out[13]: ['1', 'A1', '2', '3', '12', 'A12', 'B12', '17', 'A17', '25', '29', '122']
This however only sorted my layout of input by straight alphabetical.
Any help appreciated,
Thanks.
You are on the right track.
Personally, I would:
first use str.split() to chop the string up into its constituent parts;
then make the sort key produce a tuple that reflects the desired sort order.
For example:
def key(name_dept_age):
    name, dept, age = name_dept_age.split('%')
    # Negate the age so that older entries sort first.
    return -int(age), name, dept
alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30', 'John%English%31', 'George%Maths%30']
print(sorted(alist, key=key))
Use name, department, age = item.split('%') on each item.
Make a dict out of them {'name': name, 'department': department, 'age': age}
Then sort them using this code
https://stackoverflow.com/a/1144405/277267
sorted_items = multikeysort(items, ['-age', 'name', 'department'])
Experiment once with that multikeysort function, you will see that it will come in handy in a couple of situations in your programming career.
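The dict-based approach can also be sketched without the linked helper. The snippet below is not the multikeysort function from that answer; it is an assumption-laden alternative that relies on Python's sort being stable and sorts by the least significant key first. The names records and result are illustrative.

```python
from operator import itemgetter

alist = ['John%Maths%30', 'Sarah%English%50', 'John%English%30',
         'John%English%31', 'George%Maths%30']

# Turn each string into a dict with named fields.
records = []
for item in alist:
    name, department, age = item.split('%')
    records.append({'name': name, 'department': department, 'age': int(age)})

# Stable sorts applied from the least to the most significant key.
records.sort(key=itemgetter('department'))
records.sort(key=itemgetter('name'))
records.sort(key=itemgetter('age'), reverse=True)

# Rebuild the original 'Name%Department%Age' strings.
result = ['%'.join([r['name'], r['department'], str(r['age'])]) for r in records]
print(result)
# ['Sarah%English%50', 'John%English%31', 'George%Maths%30', 'John%English%30', 'John%Maths%30']
```

Separate passes make it easy to mix ascending and descending keys, including for fields that cannot simply be negated the way a number can.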

Unexpected output using lxml `.xpath()` and `for`

I have the following text
testing = """
<div>
<a>11</a>
</div>
<div>
<a>21</a>
<a>23</a>
</div>
"""
And I want to extract the text inside <a></a>. Below is my try,
from lxml import html
testing = html.fromstring(testing)
testing = testing.xpath("//div")
[x.xpath("//a/text()") for x in testing]
The output is
[['11', '21', '23'], ['11', '21', '23'], ['11', '21', '23']]
But what I expect and want is
[['11'], ['21', '23']]
How can I do it?
Thank you.
testing.xpath("//div") returns you a list of matching div nodes. For every div node, you ask to find all a elements, but // at the beginning of the expression would start the search from the root of the document tree. You need to make the search specific to every div in the list by prepending a dot:
[x.xpath(".//a/text()") for x in testing]
# HERE^
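Or, if a flat list is enough, you can do it in one go with a single expression on the parsed document (that is, before testing is reassigned to the list of div nodes). Note that this returns ['11', '21', '23'] rather than one list per div:
testing.xpath("//div/a/text()")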
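To illustrate a search that is relative to each div, here is a small sketch that uses the standard library's xml.etree.ElementTree instead of lxml, so it runs without extra packages. The wrapping root element and the name per_div are assumptions added for the example; ElementTree supports only a subset of XPath, and the leading dot plays the same role as in the lxml answer.

```python
import xml.etree.ElementTree as ET

# ElementTree needs a single root element, so the divs are wrapped in <root>.
snippet = """
<root>
  <div>
    <a>11</a>
  </div>
  <div>
    <a>21</a>
    <a>23</a>
  </div>
</root>
"""

root = ET.fromstring(snippet)

# ".//a" searches below the current div only, not the whole document.
per_div = [[a.text for a in div.findall(".//a")] for div in root.findall("div")]
print(per_div)  # [['11'], ['21', '23']]
```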

Python Regex match string between specific string and end character

I am building a file stripper to build a config report, and I have a very very long string as my base data. The following is a very small snippet of it, but it at least illustrates what I'm working with.
Snippet Example: DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&
How would I go about matching the following:
between "DEFAULT_GATEWAY=" and "&"
between "DELVRY_AGGREGATION_INTERVAL0=" and "&"
between "DELVRY_AGGREGATION_INTERVAL1=" and "&"
between "DELVRY_SCHEDULE0=" and "&"
between "DELVRY_SNI0=" and "&"
between "DELVRY_USE_SSL_TLS1=" and "&"
and building a dict with it like:
{"DEFAULT_GATEWAY":"192.168.88.1",
"DELVRY_AGGREGATION_INTERVAL0":"1",
"DELVRY_AGGREGATION_INTERVAL1":"1",
"DELVRY_SCHEDULE0":"1",
"DELVRY_SNI0":"192.168.88.158",
"DELVRY_USE_SSL_TLS1":"0"}
?
Here is a way to do it.
In [1]: input = 'DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
In [2]: input.split('&')
Out[2]:
['DEFAULT_GATEWAY=192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0=1',
'DELVRY_AGGREGATION_INTERVAL1=1',
'DELVRY_SCHEDULE0=1',
'DELVRY_SNI0=192.168.88.158',
'DELVRY_USE_SSL_TLS1=0',
'']
In [3]: [keyval.split('=') for keyval in input.split('&') if keyval]
Out[3]:
[['DEFAULT_GATEWAY', '192.168.88.1'],
['DELVRY_AGGREGATION_INTERVAL0', '1'],
['DELVRY_AGGREGATION_INTERVAL1', '1'],
['DELVRY_SCHEDULE0', '1'],
['DELVRY_SNI0', '192.168.88.158'],
['DELVRY_USE_SSL_TLS1', '0']]
In [4]: dict(keyval.split('=') for keyval in input.split('&') if keyval)
Out[4]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
Notes
1. This is the input line.
2. Split by & to get key-value pairs. Note that the last entry is empty.
3. Split each entry by the equal sign while throwing away empty entries.
4. Build a dictionary.
Another Solution (Python 2; in Python 3 the urlparse module became urllib.parse)
In [8]: import urlparse
In [9]: urlparse.parse_qsl(input)
Out[9]:
[('DEFAULT_GATEWAY', '192.168.88.1'),
('DELVRY_AGGREGATION_INTERVAL0', '1'),
('DELVRY_AGGREGATION_INTERVAL1', '1'),
('DELVRY_SCHEDULE0', '1'),
('DELVRY_SNI0', '192.168.88.158'),
('DELVRY_USE_SSL_TLS1', '0')]
In [10]: dict(urlparse.parse_qsl(input))
Out[10]:
{'DEFAULT_GATEWAY': '192.168.88.1',
'DELVRY_AGGREGATION_INTERVAL0': '1',
'DELVRY_AGGREGATION_INTERVAL1': '1',
'DELVRY_SCHEDULE0': '1',
'DELVRY_SNI0': '192.168.88.158',
'DELVRY_USE_SSL_TLS1': '0'}
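For Python 3, the equivalent call lives in urllib.parse. The sketch below assumes the data really is query-string shaped; the names config, pairs and settings are illustrative.

```python
from urllib.parse import parse_qsl

config = (
    'DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&'
    'DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&'
    'DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
)

# parse_qsl returns a list of (key, value) tuples; the empty piece
# after the trailing '&' is skipped by default.
pairs = parse_qsl(config)
settings = dict(pairs)
print(settings['DEFAULT_GATEWAY'])  # 192.168.88.1
print(settings['DELVRY_SNI0'])      # 192.168.88.158
```

Keep in mind that parse_qsl also decodes percent-escapes and turns '+' into a space, and by default drops pairs whose value is empty, which may or may not be what you want for raw config values.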
Split first by '&' to get a list of strings, then by '=', skipping the empty string left by the trailing '&':
d = dict(kv.split('=') for kv in line.split('&') if kv)
import re
keys = {"DEFAULT_GATEWAY",
        "DELVRY_AGGREGATION_INTERVAL0",
        "DELVRY_AGGREGATION_INTERVAL1",
        "DELVRY_SCHEDULE0",
        "DELVRY_SNI0",
        "DELVRY_USE_SSL_TLS1"}
resdict = {}
for k in keys:
    # Capture everything after "KEY=" up to the next "&".
    pat = '{}=([^&]*)&'.format(re.escape(k))
    mo = re.search(pat, bigstring)
    if mo is None:
        continue  # no match
    resdict[k] = mo.group(1)
will leave your desired result in resdict, if bigstring is the string you're searching in.
This assumes you know in advance which keys you'll be looking for, and you keep them in a set keys. If you don't know in advance the keys of interest, that's a very different issue of course.
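If the keys are not known in advance, one possible regex-only approach is to capture every key and value in a single pass. This is a sketch under the assumption that keys never contain '=' or '&' and values never contain '&'; the names bigstring, pair_pattern and resdict follow or extend the naming above.

```python
import re

bigstring = (
    'DEFAULT_GATEWAY=192.168.88.1&DELVRY_AGGREGATION_INTERVAL0=1&'
    'DELVRY_AGGREGATION_INTERVAL1=1&DELVRY_SCHEDULE0=1&'
    'DELVRY_SNI0=192.168.88.158&DELVRY_USE_SSL_TLS1=0&'
)

# Group 1: the key (no '&' or '='). Group 2: the value (anything up to '&').
pair_pattern = re.compile(r'([^&=]+)=([^&]*)')

# findall yields (key, value) tuples, which dict() accepts directly.
resdict = dict(pair_pattern.findall(bigstring))
print(resdict['DELVRY_SNI0'])  # 192.168.88.158
```

To restrict the result to a known set of keys afterwards, a dict comprehension such as {k: v for k, v in resdict.items() if k in keys} would do.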
