I'm working on Vertica.
I have a problem that looks really easy, but I can't find a way to figure it out.
From a query, I can get two fields, Month and Year. What I want is to automatically select another field, Date, built as '01/Month/Year' (in the SQL Date format). The goal is:
What I have
SELECT MONTH, YEAR FROM MyTable
Output:
01 2020
11 2019
09 2021
What I want
SELECT MONTH, YEAR, *answer* FROM MyTable
Output:
01 2020 01-01-2020
11 2019 01-11-2019
09 2021 01-09-2021
Sorry, it looks really dumb and easy, but I didn't find any good way to do it. Thanks in advance.
Don't use string operations to build dates; you can mess things up considerably:
today could be 16.07.2021 or 07/16/2021, or also 2021-07-16, and, in France, for example, 16/07/2021. Then you could also left-trim the zeroes, or get 'July' instead of 07 ...
Try:
WITH
my_table (mth,yr) AS (
SELECT 1, 2020
UNION ALL SELECT 11, 2019
UNION ALL SELECT 9, 2021
)
SELECT
yr
, mth
, ADD_MONTHS(DATE '0001-01-01',(yr-1)*12+(mth-1)) AS firstofmonth -- months elapsed since 0001-01-01
FROM my_table
ORDER BY 1,2;
yr | mth | firstofmonth
------+-----+--------------
2019 | 11 | 2019-11-01
2020 | 1 | 2020-01-01
2021 | 9 | 2021-09-01
I finally found a way to do that:
SELECT MONTH, YEAR, CONCAT(CONCAT(YEAR, '-'), CONCAT(MONTH, '-01')) FROM MyTable
Try this:
SELECT [MONTH], [YEAR], CONCAT(CONCAT(CONCAT('01-',[MONTH]),'-'),[YEAR]) AS [DATE]
FROM MyTable
Output will be:
| MONTH | YEAR | DATE |
|-------|------|------------|
| 01 | 2020 | 01-01-2020 |
| 11 | 2019 | 01-11-2019 |
| 09 | 2021 | 01-09-2021 |
Mon Apr 27 18:12:54 '<rdpDirect> ITC\109975#win_itsskdcbkp2_10.18.2.104
Mon Apr 27 18:14:31 '<rdpDirect> JDAWMSAPPDEV\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172
Mon Apr 27 18:14:32 '<rdpDirect> JDAWMSAPPDEV\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172
Mon Apr 27 18:24:21 '<rdpDirect> SAAMARTHA\UITSMUSR#win_saamarthaad_10.10.45.147
Mon Apr 27 18:24:21 '<rdpDirect> SAAMARTHA\UITSMUSR#win_saamarthaad_10.10.45.147
From the above text, I need to extract the user name that appears between the \ and # signs. It can be numeric, alphabetic, or alphanumeric.
E.g., in the above example: 109975, VINOTHKUMAR, UITSMUSR, etc.
Search for a group of zero or more characters between the \ and # characters.
import re
data = "Mon Apr 27 18:12:54 ' ITC\\109975#win_itsskdcbkp2_10.18.2.104 Mon Apr 27 18:14:31 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:14:32 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147"
usernames = re.findall(r"\\(.*?)#", data)
print(usernames)
Output:
['109975', 'VINOTHKUMAR', 'VINOTHKUMAR', 'UITSMUSR', 'UITSMUSR']
The re.findall() method returns a list of all non-overlapping occurrences of the searched pattern. More can be read in the official documentation of findall.
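One detail worth knowing here: when the pattern contains a capturing group, re.findall() returns the group's text rather than the whole match, which is why only the usernames appear in the output. A small demo (the sample string is made up):
import re

# With a capturing group, findall returns just the group's contents:
print(re.findall(r"\\(.*?)#", r"ITC\109975#host"))  # ['109975']

# Without a group, it returns the whole match, delimiters included:
print(re.findall(r"\\.*?#", r"ITC\109975#host"))    # ['\\109975#']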
A simple regex should do the job. You can try \\(.*?)\#.
This "captures" text between \ and #
buffer = '''
Mon Apr 27 18:12:54 ' ITC\\109975#win_itsskdcbkp2_10.18.2.104 Mon Apr 27 18:14:31 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:14:32 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147'''
import re
print(re.findall(r'\\([\d\w\W]+?)#', buffer))
['109975', 'VINOTHKUMAR', 'VINOTHKUMAR', 'UITSMUSR', 'UITSMUSR']
I cannot print the components of a matched regex.
I am learning Python 3 and I need to verify that the output of my command matches my needs. I have the following short code:
#!/usr/bin/python3
import re
text_to_search = '''
1 | 27 23 8 |
2 | 21 23 8 |
3 | 21 23 8 |
4 | 21 21 21 |
5 | 21 21 21 |
6 | 27 27 27 |
7 | 27 27 27 |
'''
pattern = re.compile('(.*\n)*( \d \| 2[17] 2[137] [ 2][178] \|)')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
    print()
    print('matched to group 0:' + match.group(0))
    print()
    print('matched to group 1:' + match.group(1))
    print()
    print('matched to group 2:' + match.group(2))
and the following output:
<_sre.SRE_Match object; span=(0, 140), match='\n 1 | 27 23 8 |\n 2 | 21 23 8 |\n 3 >
matched to group 0:
1 | 27 23 8 |
2 | 21 23 8 |
3 | 21 23 8 |
4 | 21 21 21 |
5 | 21 21 21 |
6 | 27 27 27 |
7 | 27 27 27 |
matched to group 1: 6 | 27 27 27 |
matched to group 2: 7 | 27 27 27 |
Please explain to me:
1) Why does print(match) print only the beginning of the match? Does it have some kind of limit that trims the output if it is bigger than some threshold?
2) Why is group(1) printed as ' 6 | 27 27 27 |'? I hoped (.*\n)* would be as greedy as possible, so it would consume lines 1-6, leaving the last line of text_to_search to be matched by group(2), but it seems (.*\n)* kept only the 6th line. Why is that? Why are lines 1-5 not printed when printing group(1)?
3) I was trying to go through a regex tutorial but failed to understand the tricks with (?...). How do I verify that the numbers in the last row are equal (so 27 27 27 is OK, but 21 27 27 is not)?
1) print(match) only shows a summary repr of the object, with the matched text truncated for display. match is an SRE_Match object, so in order to get information from it you need to do something like match.group(0), which accesses a value stored in the object.
2) To capture lines 1-6 you need to change (.*\n)* to ((?:.*\n)*). As this regex tester explains:
"A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations, or use a non-capturing group instead if you're not interested in the data."
3) To match specific numbers you need to make the pattern more specific and put these numbers into a separate group at the end; a backreference can then require the same number three times. See the sketch after this list.
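Here is a minimal sketch of points 2 and 3, with made-up sample strings (the row format is simplified from the question's table):
import re

# Point 2: an outer capturing group around the repeated (non-capturing)
# group keeps every repeated line, not just the last iteration.
text = 'line1\nline2\nline3\nlast'
m = re.match(r'((?:.*\n)*)(last)', text)
print(repr(m.group(1)))  # 'line1\nline2\nline3\n'
print(m.group(2))        # 'last'

# Point 3: a backreference (\1) must repeat the exact text the group
# captured, so this only accepts rows whose three numbers are identical.
check = re.compile(r'\|\s+(\d+)\s+\1\s+\1\s+\|')
print(bool(check.search(' 7 |  27  27  27  |')))  # True
print(bool(check.search(' 7 |  21  27  27  |')))  # False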
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
Above is an example of the content in the text file. I want to extract a string with re.
How should I construct the findall condition to achieve the expected result below? I have tried the following:
match=re.findall(r'[Tue\w]+2018$',data2)
but it is not working. I understand that $ is the symbol for the end of the string. How can I do it?
Expected Result is:
Tue Aug 21 17:02:26 2018
Tue Aug 21 17:31:06 2018
Tue Aug 21 18:10:42 2018
.
.
.
Use the pattern:
^Tue.*?2018
^ Assert position beginning of line.
Tue Literal substring.
.*? Match anything lazily.
2018 Match literal substring.
Since you are working with a multiline string and you want to match the pattern at the beginning of each line, you have to use the re.MULTILINE flag.
import re
mystr="""
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
"""
print(re.findall(r'^Tue.*?2018',mystr,re.MULTILINE))
Prints:
['Tue Aug 21 17:02:26 2018', 'Tue Aug 21 17:31:06 2018', 'Tue Aug 21 18:10:42 2018']
For example, suppose I have the following event data and want to find clusters of at least 2 events that are within 1 minute of each other, in which id_1, id_2, and id_3 are all the same. For reference, I have the epoch timestamp (in microseconds) in addition to the date-time timestamp.
event_id timestamp id_1 id_2 id_3
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
For each cluster found, I want to return a set of (id_1, id_2, id_3, [list of event_ids], min_timestamp_of_cluster, max_timestamp_of_cluster). Additionally, if there's a cluster with (e.g.) 6 events, I'd only want to return a single result with all events, not one for each grouping of 3 events.
If I understood your problem correctly, you can make use of scikit-learn's DBSCAN clustering algorithm with a custom distance (or metric) function. Your distance function should return a very large number if any of id_1, id_2, or id_3 differ; otherwise it should return the time difference.
But with this method, the number of clusters is determined by the algorithm, not given as an input. If you are determined to pass the number of clusters as an input, k-means is the clustering algorithm you may need to look into.
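A minimal sketch of that idea, under the assumption that each event is encoded as a numeric row [id_1, id_2, id_3, epoch_ts] (letter ids mapped to numbers, e.g. 'A' -> 0), since the callable-metric interface works on float arrays:
import numpy as np
from sklearn.cluster import DBSCAN

def event_distance(a, b):
    # Different ids: effectively infinite distance, never clustered.
    if not np.array_equal(a[:3], b[:3]):
        return 1e12
    return abs(a[3] - b[3])  # time difference in seconds

# Three toy events: two share ids and are 30 s apart, one differs.
X = np.array([
    [0.0, 1.0, 34567.0, 0.0],
    [0.0, 1.0, 34567.0, 30.0],
    [1.0, 2.0, 12345.0, 45.0],
])

# eps=60 seconds, min_samples=2 events per cluster; label -1 means noise.
labels = DBSCAN(eps=60, min_samples=2, metric=event_distance).fit_predict(X)
print(labels)  # [0, 0, -1]
One caveat: with a callable metric, DBSCAN computes pairwise distances, so this can get expensive for large event sets.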
In pure Python, use a "sliding window" that encompasses all the events in a 1-minute range.
The premise is simple: maintain a queue of events that is a subsequence of the total list, in order. The "window" (queue) should hold all the events you care about; in this case, that is determined by the 60-second time-gap requirement.
As you process events, add each one to the end of the queue. If the first event in the queue is more than 60 seconds older than the newly added last event, slide the window forward by dropping the first event from the front of the queue.
This is Python 3:
import collections
import operator
import itertools
from datetime import datetime

#### FROM HERE: vvv is just faking events. Delete or replace.

class Event(collections.namedtuple('Event', 'event_id timestamp id_1 id_2 id_3 epoch_ts')):
    def __str__(self):
        return ('{e.event_id} {e.timestamp} {e.id_1} {e.id_2} {e.id_3}'
                .format(e=self))

def get_events():
    event_list = map(operator.methodcaller('strip'), '''
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
'''.strip().splitlines())
    for line in event_list:
        idev, *rest = line.split()
        ts = rest[:6]
        id1, id2, id3 = rest[6:]
        id2 = int(id2)  # faster when sorting (see find_clustered_events)
        id3 = int(id3)  # faster when sorting (see find_clustered_events)
        ts_str = ' '.join(ts)
        dt = datetime.strptime(ts_str.replace('PDT', '-0700'), '%b %d, %Y %I:%M %p %z')
        epoch = dt.timestamp()
        ev = Event(idev, ts_str, id1, id2, id3, epoch)
        yield ev
#### TO HERE: ^^^ was just faking up your events. Delete or replace.
def add_cluster(key, group):
    '''Do whatever you want with the clusters. I'll print them.'''
    print('Cluster:', key)
    print(' ', '\n '.join(map(str, group)), sep='')
def find_clustered_events(events, cluster=3, gap_secs=60):
    '''Call add_cluster on clusters of events within a maximum time gap.

    Args:
        events (iterable): series of events, in chronological order
        cluster (int): minimum number of events in a cluster
        gap_secs (float): maximum time-gap from start to end of cluster

    Returns:
        None.
    '''
    window = collections.deque()
    evkey = lambda e: (e.id_1, e.id_2, e.id_3)
    for ev in events:
        window.append(ev)
        t0 = window[0].epoch_ts
        tn = window[-1].epoch_ts
        if tn - t0 < gap_secs:
            continue
        window.pop()
        for k, g in itertools.groupby(sorted(window, key=evkey), key=evkey):
            group = tuple(g)
            if len(group) >= cluster:
                add_cluster(k, group)
        window.append(ev)
        window.popleft()
# Call find_clustered_events with the event generator and cluster args.
# Note that your data doesn't have any 3-clusters within the time gap. :-(
find_clustered_events(get_events(), cluster=2)
The output looks like this:
$ python test.py
Cluster: ('A', 1, 34567)
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
Cluster: ('A', 2, 12345)
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
Cluster: ('A', 3, 34567)
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
Cluster: ('A', 1, 12345)
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
Cluster: ('C', 2, 34567)
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
Please note: the code above doesn't try to keep track of events already in a cluster. So if you have, for example, an event type that occurs every fifteen seconds, you will have a sequence like this:
event1 t=0:00
event2 t=0:15
event3 t=0:30
event4 t=0:45
event5 t=1:00
And you will get overlapping clusters:
event1, event2, event3 (t=0:00 .. 0:30)
event2, event3, event4 (t=0:15 .. 0:45)
event3, event4, event5 (t=0:30 .. 1:00)
Technically, those are valid clusters, each slightly different. But you may wish to expunge previously-clustered events from the window if you want each event to appear in only a single cluster.
Alternatively, if the chance of clustering and repetition is low, it might improve performance to implement repeat-checking in the add_cluster() function, to reduce the work done by the main loop, as in the sketch below.
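For instance, a minimal dedup variant of add_cluster() could look like this (assuming event_id values are unique; the _seen set is a name introduced here, not part of the code above):
_seen = set()  # event_ids already reported in some earlier cluster

def add_cluster(key, group):
    '''Print a cluster, skipping it if every event was already reported.'''
    new = [e for e in group if e.event_id not in _seen]
    if not new:
        return
    _seen.update(e.event_id for e in new)
    print('Cluster:', key)
    print(' ', '\n '.join(map(str, group)), sep='')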
A final note: this does a LOT of sorting, and the sorting is not efficient, since it gets repeated every time a new event appears. If you have a large data set, the performance will probably be bad. If your event keys are relatively few (that is, if the id1,2,3 values tend to repeat over and over again), you would be better off dynamically creating a separate deque for each distinct key (id1+id2+id3), dispatching each event to the appropriate deque, applying the same window logic, and then checking the length of the deque.
On the other hand, if you are processing something like web-server logs, where the requester is always changing, that might create a memory problem from all the useless deques. So this is a memory vs. time trade-off you'll have to be aware of.
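A rough sketch of that per-key variant, reusing the Event namedtuple and add_cluster() from above (no sorting at all, one window per distinct key):
import collections

def find_clustered_events_by_key(events, cluster=3, gap_secs=60):
    '''Per-key sliding windows: one deque per (id_1, id_2, id_3).'''
    windows = collections.defaultdict(collections.deque)
    for ev in events:
        w = windows[(ev.id_1, ev.id_2, ev.id_3)]
        w.append(ev)
        # Slide this key's window: drop events older than gap_secs.
        while ev.epoch_ts - w[0].epoch_ts > gap_secs:
            w.popleft()
        if len(w) >= cluster:
            add_cluster((ev.id_1, ev.id_2, ev.id_3), tuple(w))
Like the deque version above, this reports overlapping clusters as the window grows, so it pairs naturally with the dedup add_cluster() sketch.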