Regex find greedy and lazy matches and all in-between

I have a sequence such as '01 02 09 02 09 02 03 05 09 08 09 ', and I want to find every subsequence that starts with 01 and ends with 09, with one to nine two-digit numbers (02, 03, 04, etc.) in between. This is what I have tried so far.
I'm using \w{2}\s (\w{2} matches the two digits and \s matches the whitespace). This can occur one to nine times, which leads to (\w{2}\s){1,9}. The whole regex becomes
(01\s(\w{2}\s){1,9}09\s). This returns the following result:
<regex.Match object; span=(0, 33), match='01 02 09 02 09 02 03 05 09 08 09 '>
If I use the lazy quantifier ?, it returns the following result:
<regex.Match object; span=(0, 9), match='01 02 09 '>
How can I obtain the results in between as well? The desired result would include all of the following:
<regex.Match object; span=(0, 9), match='01 02 09 '>
<regex.Match object; span=(0, 15), match='01 02 09 02 09 '>
<regex.Match object; span=(0, 27), match='01 02 09 02 09 02 03 05 09 '>
<regex.Match object; span=(0, 33), match='01 02 09 02 09 02 03 05 09 08 09 '>

You can extract these strings using
import re
s = "01 02 09 02 09 02 03 05 09 08 09 "
m = re.search(r'01(?:\s\w{2})+\s09', s)
if m:
    print( [x[::-1] for x in re.findall(r'(?=\b(90.*?10$))', m.group()[::-1])] )
# => ['01 02 09 02 09 02 03 05 09 08 09', '01 02 09 02 09 02 03 05 09', '01 02 09 02 09', '01 02 09']
With the 01(?:\s\w{2})+\s09 pattern and re.search, you can extract the substring from 01 to the last 09 (with any number of space-separated two-character chunks in between).
The second step - [x[::-1] for x in re.findall(r'(?=\b(90.*?10$))', m.group()[::-1])] - is to reverse the string and the pattern to get all overlapping matches from 09 to 01 and then reverse them to get final strings.
You may also reverse the final list if you add [::-1] at the end of the list comprehension: print( [x[::-1] for x in re.findall(r'(?=\b(90.*?10$))', m.group()[::-1])][::-1] ).
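If reversing the string feels opaque, the same substrings can probably be collected by testing every possible end position with a compiled pattern's fullmatch(string, pos, endpos). This is only a sketch under assumptions: the helper name all_prefix_matches is hypothetical, it uses the question's own pattern (including the {1,9} limit and the trailing whitespace), and it only considers the first position in the string where a match can start.

```python
import re

# The question's pattern: 01, then one to nine two-character chunks, then 09,
# each followed by a whitespace character.
PATTERN = re.compile(r'01\s(?:\w{2}\s){1,9}09\s')

def all_prefix_matches(s):
    """Return every substring that begins at the first match start and fits PATTERN."""
    first = PATTERN.search(s)
    if first is None:
        return []
    start = first.start()
    # Try each end position; fullmatch succeeds only if s[start:end] fits entirely.
    return [s[start:end]
            for end in range(start + 1, len(s) + 1)
            if PATTERN.fullmatch(s, start, end)]

s = '01 02 09 02 09 02 03 05 09 08 09 '
for match in all_prefix_matches(s):
    print(repr(match))
```

The results come out shortest first, and each one keeps the trailing space because the pattern ends in \s.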

Here would be a non-regex answer that post-processes the matching elements:
s = '01 02 09 02 09 02 03 05 09 08 09 '.strip().split()
assert s[0] == '01' \
    and s[-1] == '09' \
    and (3 <= len(s) <= 11) \
    and len(s) == len([elem for elem in s if len(elem) == 2 and elem.isdigit() and elem[0] == '0'])
[s[:i+1] for i in sorted({s.index('09', i) for i in range(2,len(s))})]
# [
# ['01', '02', '09'],
# ['01', '02', '09', '02', '09'],
# ['01', '02', '09', '02', '09', '02', '03', '05', '09'],
# ['01', '02', '09', '02', '09', '02', '03', '05', '09', '08', '09']
# ]

Related

Python: grouping items already in a list and reverse them

I have a binary file like this:
00 01 02 04 03 03 03 03 00 05 06 03 03 03 03 03 00 07 03 03 03 03 03 03 ...
and I would like to make groups of 8 items each
[00 01 02 04 03 03 03 03] [00 05 06 03 03 03 03 03] [00 07 03 03 03 03 03 03]...
and then reverse the items inside each group like this:
[03 03 03 03 04 02 01 00] [03 03 03 03 03 06 05 00] [03 03 03 03 03 03 07 00]
I tried reverse(), but it reverses the whole list.
I imagine something like this: in a loop, count up to 8 (or 7), make a group, reverse it, then move on to the next row, count 8 again, reverse, and so on. But I am not able to code that.
I have tried
i=0
for item in (list_reverse):
    i+=1
    if i>8:
        list_reverse.reverse()
        i=0
but it doesn't work.
Maybe I should try a nested loop?

Use .split() on each line, then loop through the resulting list.
t = """00 01 02 04 03 03 03 03
00 05 06 03 03 03 03 03
00 07 03 03 03 03 03 03"""
out = ""
lines = t.split("\n")
for n, line in enumerate(lines):
    lst = line.split(" ")
    c = 0
    reversed_lst = ""
    # walk the items from last to first
    while c < len(lst):
        reversed_lst += (lst[len(lst)- c -1]) + " "
        c += 1
    if n != len(lines) - 1:
        out += reversed_lst + "\n"
    else:
        out += reversed_lst
print(out)
Output:
03 03 03 03 04 02 01 00
03 03 03 03 03 06 05 00
03 03 03 03 03 03 07 00

This is a good use case for Python's built-in bytearray. First, open the binary input file and use a bytearray to store its contents:
with open("binary.file", "rb") as f:
    data = bytearray(f.read())  # named data so the built-in bytes type is not shadowed
Then, we'll use your algorithm to loop over the bytes, collecting them in a variable called group (also a bytearray) and storing that group every 8 iterations:
i = 0
groups = []
group = bytearray()
for byte in data:
    i += 1
    group.append(byte)
    if i == 8:
        groups.append(group)
        i = 0
        group = bytearray()
As a quick sanity check, test the value of i: if it is not zero by now, the final group would be less than 8 bytes long:
if i != 0:
    raise EOFError("Input file does not align to 8 byte boundary!")
Finally, we'll reverse each group and print the output:
for group in groups:
    group.reverse()
print(groups)
Depending on your use case, you could also concatenate the reversed bytes and store them in another file or even overwrite the same file. My guess, though, is that doing this to an ordinary binary file such as a JPEG or an EXE would break it completely. Luckily, you could run the program again to restore it!
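The counting loop can also be expressed with slicing. The following is a minimal sketch rather than the answer's own method: the function name reverse_groups and the sample bytes are assumptions made for illustration, with the sample taken from the first two groups in the question.

```python
def reverse_groups(data, size=8):
    """Split data into consecutive groups of `size` bytes and reverse each group."""
    if len(data) % size != 0:
        raise ValueError("input length is not a multiple of the group size")
    # data[i:i + size] is one group; [::-1] reverses it.
    return [data[i:i + size][::-1] for i in range(0, len(data), size)]

# Sample input built from the first two groups shown in the question.
sample = bytes.fromhex("00 01 02 04 03 03 03 03 00 05 06 03 03 03 03 03")
for group in reverse_groups(sample):
    print(" ".join(f"{b:02x}" for b in group))
```

Slicing works the same way on bytes, bytearray, and lists, so the same helper could be applied to a list of strings read from a text file.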

Python sorted to arrange according to decimal format

I would like to arrange the list according to the numbering as follows:
01
02
03
001
002
However, the default sorted call gives me:
001
002
01
02
03

To preserve length ordering over numerical ordering, I believe you need to sort on 2 criteria:
nums = '03 01 002 02 001'
num_array = nums.split()
sorted_nums = sorted(num_array, key=lambda x: [len(x), x])
print(sorted_nums)
Output:
['01', '02', '03', '001', '002']

Or, double-sort the list:
>>> nums = '03 01 002 02 001'
>>> sorted(sorted(nums.split()),key=len)
['01', '02', '03', '001', '002']

s = '001 01 02 03 002'
l = s.split()
print(sorted(l, key=lambda e: (len(e), int(e) )))
Output:
C:\Users\Desktop>py x.py
['01', '02', '03', '001', '002']

sorted_list = sorted(my_list, key=lambda x: (len(x), x))
The key compares the strings by length first and then, for strings of equal length, character by character.

xs = "01 02 03 001 002".split()
print(sorted(xs, key="{:<018s}".format))
# ['001', '002', '01', '02', '03']
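This key right-pads each string with zeros to 18 characters, so the result is the same as the default sorted order shown in the comment, not the length-first order the question asks for. Unless you are golfing or the decimals have more than 18 decimal places, using two criteria in key is probably the way to go.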

How to count certain elements of a string in a large array?

I'm not sure if this is possible but I have a very large array containing dates
a = ['Fri, 19 Aug 2011 19:28:17 -0000',....., 'Wed, 05 Feb 2012 11:00:00 -0000']
I'm trying to find if there is a way to count the frequency of the days and months in the array. In this case I'm trying to count strings for abbreviations of months or days (such as Fri,Mon, Apr, Jul)

You can use Counter() from the collections module.
from collections import Counter
a = ['Fri, 19 Aug 2011 19:28:17 -0000',
'Fri, 09 June 2017 11:11:11 -0000',
'Wed, 05 Feb 2012 11:00:00 -0000']
# this generator splits the dates into words and strips ".,;-:" characters from both ends of each word:
# ['Fri', '19', 'Aug', '2011', '19:28:17', '0000', 'Fri', '09', 'June',
# '2017', '11:11:11', '0000', 'Wed', '05', 'Feb', '2012', '11:00:00', '0000']
# and feeds the result to Counter:
c = Counter( (x.strip().strip(".,;-:") for word in a for x in word.split() ))
for key in c:
    if key.isalpha():
        print(key, c[key])
The if prints only those keys from the counter that are pure "letters" - not digits:
Fri 2
Aug 1
June 1
Wed 1
Feb 1
Day-names and Month-names are the only pure isalpha() parts of your dates.
Full c output:
Counter({'0000': 3, 'Fri': 2, '19': 1, 'Aug': 1, '2011': 1,
'19:28:17': 1, '09': 1, 'June': 1, '2017': 1, '11:11:11': 1,
'Wed': 1, '05': 1, 'Feb': 1, '2012': 1, '11:00:00': 1})
Improvement from @AzatIbrakov's comment:
c = Counter( (x.strip().strip(".,;-:") for word in a for x in word.split()
if x.strip().strip(".,;-:").isalpha()))
would weed out non-alpha words in the generation step already.
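Since every entry appears to follow the same "Day, DD Mon YYYY ..." layout, the day and month could also be picked by position and counted separately. This is a sketch under that assumption; the function name count_days_and_months is hypothetical, and entries in a different layout would need real date parsing instead.

```python
from collections import Counter

def count_days_and_months(dates):
    """Count day and month abbreviations, assuming 'Day, DD Mon YYYY ...' entries."""
    days = Counter()
    months = Counter()
    for stamp in dates:
        parts = stamp.split()
        days[parts[0].rstrip(",")] += 1  # first field, e.g. 'Fri,' becomes 'Fri'
        months[parts[2]] += 1            # third field is the month name
    return days, months

a = ['Fri, 19 Aug 2011 19:28:17 -0000',
     'Fri, 09 June 2017 11:11:11 -0000',
     'Wed, 05 Feb 2012 11:00:00 -0000']
days, months = count_days_and_months(a)
print(days)
print(months)
```

Keeping two counters avoids the isalpha() filter and keeps day names and month names from mixing in one result.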

Python lists have a built-in .count method, which is very useful here:
lista = [
'Fri, 19 Aug 2011 19:28:17 -0000',
'Fri, 19 Aug 2011 19:28:17 -0000',
'Sun, 19 Jan 2011 19:28:17 -0000',
'Sun, 19 Aug 2011 19:28:17 -0000',
'Fri, 19 Jan 2011 19:28:17 -0000',
'Mon, 05 Feb 2012 11:00:00 -0000',
'Mon, 05 Nov 2012 11:00:00 -0000',
'Wed, 05 Feb 2012 11:00:00 -0000',
'Tue, 05 Nov 2012 11:00:00 -0000',
'Tue, 05 Dec 2012 11:00:00 -0000',
'Wed, 05 Jan 2012 11:00:00 -0000',
]
# join with a space so the end of one date does not fuse with the start of the next
listb = ' '.join(lista).split()
count = {}
for item in listb:
    count[item] = listb.count(item)
months = ['Jan', 'Feb', 'Aug', 'Nov', 'Dec']
for k in count:
if k in months:
print(f"{k}: {count[k]}")
Output:
(xenial)vash@localhost:~/python/stack_overflow$ python3.7 count_months.py
Aug: 3
Jan: 3
Feb: 2
Nov: 2
Dec: 1
What happens is that we take all the items of lista and join them into one string. Then we split that string to get all the individual words.
Now we can use the count method and create a dictionary to hold the counts. We can create a list of the items we want to retrieve from the dictionary and retrieve only those keys.

What is the simplest way to get the sub dataframe?

x is a dataframe whose index is code and contains a column pe.
>>> x
      pe
code
01    15
02    30
03    70
04     6
05    40
06    34
07    25
08    65
10    45
12    55
13    32
Get the index of x.
x.index
Index(['01', '02', '03', '04', '05', '06', '07',
'08', '10', '12','13'],
dtype='object', name='code', length=11)
I want to get a sub dataframe whose index is ['01','04','08','10','12'].
>>> x_sub
      pe
code
01    15
04     6
08    65
10    45
12    55
What is the simplest way to get the sub dataframe?

Use loc
x_sub = x.loc[['01','04','08','10','12']]
Or, if your index values are integers (integer literals cannot have leading zeros in Python 3):
x_sub = x.loc[[1, 4, 8, 10, 12]]

Format These Dates And Get Time Passed

I have a Python list of dates and I'm using min and max to find the most recent and the oldest (first, is that the best method?). I also need to parse the dates into something I can subtract from the current time, so I can say something like "In the last 27 minutes..." and state how many days, hours, or minutes have passed since the oldest. Here is my list (the dates obviously change depending on what I'm pulling) so you can see the current format. How do I get the info I need?
[u'Sun Oct 06 18:00:55 +0000 2013', u'Sun Oct 06 17:57:41 +0000 2013', u'Sun Oct 06 17:55:44 +0000 2013', u'Sun Oct 06 17:54:10 +0000 2013', u'Sun Oct 06 17:35:58 +0000 2013', u'Sun Oct 06 17:35:58 +0000 2013', u'Sun Oct 06 17:35:25 +0000 2013', u'Sun Oct 06 17:34:39 +0000 2013', u'Sun Oct 06 17:34:39 +0000 2013', u'Sun Oct 06 17:34:39 +0000 2013', u'Sun Oct 06 17:30:35 +0000 2013', u'Sun Oct 06 17:25:28 +0000 2013', u'Sun Oct 06 17:24:04 +0000 2013', u'Sun Oct 06 17:24:04 +0000 2013', u'Sun Oct 06 17:22:08 +0000 2013', u'Sun Oct 06 17:22:08 +0000 2013', u'Sun Oct 06 17:21:00 +0000 2013', u'Sun Oct 06 17:18:49 +0000 2013', u'Sun Oct 06 17:18:49 +0000 2013', u'Sun Oct 06 17:15:29 +0000 2013', u'Sun Oct 06 17:15:29 +0000 2013', u'Sun Oct 06 17:13:35 +0000 2013', u'Sun Oct 06 17:12:18 +0000 2013', u'Sun Oct 06 17:12:00 +0000 2013', u'Sun Oct 06 17:07:34 +0000 2013', u'Sun Oct 06 17:03:59 +0000 2013']

You won't get the oldest and newest date/time entries from your list with the entries by using min and max - "Fri" will come before "Mon", for example. So you'll want to put things into a data structure that represents date/time stamps properly.
Fortunately, Python's datetime module comes with a method to convert lots of date/time stamp strings into a proper representation - datetime.datetime.strptime. See the guide for how to use it.
Once that's done you can use min and max and then timedelta to compute the difference.
from datetime import datetime
# Start with the initial list
A = [u'Sun Oct 06 18:00:55 +0000 2013', u'Sun Oct 06 17:57:41 +0000 2013', u'Sun Oct 06 17:55:44 +0000 2013', u'Sun Oct 06 17:54:10 +0000 2013', u'Sun Oct 06 17:35:58 +0000 2013', u'Sun Oct 06 17:35:58 +0000 2013', u'Sun Oct 06 17:35:25 +0000 2013', u'Sun Oct 06 17:34:39 +0000 2013', u'Sun Oct 06 17:34:39 +0000 2013', u'Sun Oct 06 17:34:39 +0000 2013', u'Sun Oct 06 17:30:35 +0000 2013', u'Sun Oct 06 17:25:28 +0000 2013', u'Sun Oct 06 17:24:04 +0000 2013', u'Sun Oct 06 17:24:04 +0000 2013', u'Sun Oct 06 17:22:08 +0000 2013', u'Sun Oct 06 17:22:08 +0000 2013', u'Sun Oct 06 17:21:00 +0000 2013', u'Sun Oct 06 17:18:49 +0000 2013', u'Sun Oct 06 17:18:49 +0000 2013', u'Sun Oct 06 17:15:29 +0000 2013', u'Sun Oct 06 17:15:29 +0000 2013', u'Sun Oct 06 17:13:35 +0000 2013', u'Sun Oct 06 17:12:18 +0000 2013', u'Sun Oct 06 17:12:00 +0000 2013', u'Sun Oct 06 17:07:34 +0000 2013', u'Sun Oct 06 17:03:59 +0000 2013']
# This is the format string the date/time stamps are using
# On Python 3.3 on Windows you can use this format
# s_format = "%a %b %d %H:%M:%S %z %Y"
# However, Python 2.7 on Windows doesn't work with that. If all of your date/time stamps use the same timezone you can do:
s_format = "%a %b %d %H:%M:%S +0000 %Y"
# Convert the text list into datetime objects
A = [datetime.strptime(d, s_format) for d in A]
# Get the extremes
oldest = min(A)
newest = max(A)
# If you subtract oldest from newest you get a timedelta object, which can give you the total number of seconds between them. You can use this to calculate days, hours, and minutes.
delta = int((newest - oldest).total_seconds())
delta_days, rem = divmod(delta, 86400)
delta_hours, rem = divmod(rem, 3600)
delta_minutes, delta_seconds = divmod(rem, 60)
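The question also asks for a phrase measured from the current time rather than from the newest entry. The sketch below is one possible way to do that, assuming Python 3 (where %z parses the "+0000" offset) and an English locale for the day and month names. The function name describe_elapsed and the now parameter are hypothetical and exist only to make the example testable.

```python
from datetime import datetime, timezone

STAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # %z parses the "+0000" offset

def describe_elapsed(stamps, now=None):
    """Return a phrase such as 'In the last 27 minutes...' measured from the oldest stamp."""
    oldest = min(datetime.strptime(s, STAMP_FORMAT) for s in stamps)
    if now is None:
        now = datetime.now(timezone.utc)
    seconds = int((now - oldest).total_seconds())
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes = rem // 60
    # Report only the largest non-zero unit.
    for amount, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if amount:
            plural = "" if amount == 1 else "s"
            return f"In the last {amount} {unit}{plural}..."
    return "In the last minute..."

stamps = ['Sun Oct 06 18:00:55 +0000 2013', 'Sun Oct 06 17:03:59 +0000 2013']
print(describe_elapsed(stamps))
```

Because the parsed values are timezone-aware, they can be subtracted from datetime.now(timezone.utc) directly. The sketch does not handle a now value that is earlier than the oldest stamp.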

Your question can be divided into three pieces:
A) how to read string-formatted dates
B) how to sort a list of dates in Python
C) how to calculate the difference between two dates
