Count unique hosts per day from timestamps - python

I have two columns in a temporary pandas DataFrame and I want to count, per calendar day, the unique hosts appearing on that day. The log file looks like this:
10.216.113.172 - - [04/Sep/2009:02:57:16 -0700] "GET /images/filmpics/0000/0053/quietman2.jpeg HTTP/1.1" 200 1077924
10.211.47.159 - - [03/Sep/2009:22:19:49 -0700] "GET /quietman4.jpeg HTTP/1.1" 404 212
10.211.47.159 - - [22/Aug/2009:12:58:27 -0700] "GET /assets/img/closelabel.gif HTTP/1.1" 304 -
10.216.113.172 - - [14/Jan/2010:03:09:17 -0800] "GET /images/filmmediablock/229/Shinjuku5.jpg HTTP/1.1" 200 443005
10.211.47.159 - - [15/Oct/2009:21:21:58 -0700] "GET /assets/img/banner/ten-years-banner-grey.jpg HTTP/1.1" 304 -
10.216.113.172 - - [12/Aug/2009:05:57:55 -0700] "GET /about-us/people/ HTTP/1.1" 200 10773
10.211.47.159 - - [24/Aug/2009:13:16:26 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -
10.211.47.159 - - [03/Sep/2009:21:30:27 -0700] "GET /images/newspics/0000/0017/Mike5_thumb.JPG HTTP/1.1" 304 -
10.211.47.159 - - [15/Oct/2009:20:30:43 -0700] "GET /images/filmpics/0000/0057/quietman4.jpeg HTTP/1.1" 304 -
10.211.47.159 - - [11/Aug/2009:20:34:44 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -
I have those timestamps (03/Sep/2009:22:19:49 -0700) and hosts (10.211.47.159) in my DataFrame temp. The desired output is [2, 1, 1, 2, 1, 1]: the -0700 is the UTC offset that has to be subtracted from the time, and the date has to be pushed back one day if the subtraction crosses midnight.
Here is my code, but my output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. Can someone help?
temp = pandas.DataFrame()
temp['timestamp'] = Mainpanda['timestamp']
temp['host'] = Mainpanda['host']
temp['timestamp'] = pandas.to_datetime(temp['timestamp'],
                                       format='%d/%b/%Y:%H:%M:%S %z')
temp['timestamp'] = temp['timestamp'] - pandas.Timedelta(hours=7)
# groups by the full timestamp (to the second), so every group holds one host
counts = temp.groupby('timestamp')['host'].nunique().reset_index()
counts = counts.sort_values(by='timestamp')
counts = counts['host'].tolist()
print(counts)

First convert to datetime, then align everything to UTC; it's important to have a single timezone to process your data correctly:
import pandas as pd

temp['timestamp'] = pd.to_datetime(temp['timestamp'], format='%d/%b/%Y:%H:%M:%S %z')
# per-element conversion handles the mixed UTC offsets (-0700 and -0800)
temp['ts_utc'] = pd.DatetimeIndex([dt.tz_convert('UTC') for dt in temp['timestamp']])

visits = (temp.groupby(['host', pd.Grouper(key='ts_utc', freq='D')]).size()
              .to_frame('visit').reset_index())
Output:
>>> visits
             host                    ts_utc  visit
0   10.211.47.159 2009-08-12 00:00:00+00:00      1
1   10.211.47.159 2009-08-22 00:00:00+00:00      1
2   10.211.47.159 2009-08-24 00:00:00+00:00      1
3   10.211.47.159 2009-09-04 00:00:00+00:00      2
4   10.211.47.159 2009-10-16 00:00:00+00:00      2
5  10.216.113.172 2009-08-12 00:00:00+00:00      1
6  10.216.113.172 2009-09-04 00:00:00+00:00      1
7  10.216.113.172 2010-01-14 00:00:00+00:00      1
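If what you ultimately want is the number of unique hosts per UTC day (your [2, 1, 1, 2, 1, 1]), a minimal sketch building on the ts_utc column created above; grouping by the calendar date keeps only days that actually appear in the log:
daily_hosts = temp.groupby(temp['ts_utc'].dt.date)['host'].nunique().sort_index()
print(daily_hosts.tolist())  # [2, 1, 1, 2, 1, 1] for the sample log above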

Related

Split raw data in python

I want to split the raw data from a txt file with Python. The raw data looks like this:
-0.156 200
-0.157 300
-0.158 400
-0.156 201
-0.157 305
-0.158 403
-0.156 199
-0.157 308
-0.158 401
I expect to split the file into many txt files like this:
-0.156 200
-0.157 300
-0.158 400
-0.156 201
-0.157 305
-0.158 403
-0.156 199
-0.157 308
-0.158 401
Would you please help me?
This splits the data into files with three entries in each file:
READ_FROM = 'my_file.txt'
ENTRIES_PER_FILE = 3

with open(READ_FROM) as f:
    data = f.read().splitlines()

i = 1
n = 1
for line in data:
    with open(f'new_file_{n}.txt', 'a') as g:
        g.write(line + '\n')
    i += 1
    if i > ENTRIES_PER_FILE:
        i = 1
        n += 1
new_file_1.txt
-0.156 200
-0.157 300
-0.158 400
new_file_2.txt
-0.156 201
-0.157 305
-0.158 403
new_file_3.txt
-0.156 199
-0.157 308
-0.158 401
If you have blank lines between your lines and you want blank lines between the lines in your new files, this will work:
with open('path/to/file.txt', 'r') as file:
    lines = file.readlines()

cleaned_lines = [line.strip() for line in lines if len(line.strip()) > 0]

num_lines_per_file = 3
for num in range(0, len(cleaned_lines), num_lines_per_file):
    with open(f'{num}.txt', 'w') as out_file:
        for line in cleaned_lines[num:num + num_lines_per_file]:
            out_file.write(f'{line}\n\n')
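For reference, the same chunking can also be done lazily with itertools.islice; this is just a sketch (write_chunks and the chunk_{n}.txt names are illustrative, not from the answers above):
from itertools import islice

def write_chunks(path, entries_per_file=3):
    # keep only non-empty lines, then write them out in fixed-size chunks
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    it = iter(lines)
    n = 1
    while True:
        chunk = list(islice(it, entries_per_file))
        if not chunk:
            break
        with open(f'chunk_{n}.txt', 'w') as out:
            out.write('\n'.join(chunk) + '\n')
        n += 1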

Inconsistent index value in re module

I have two lists which hold different values. I tried to put the a list into an organized format with split(). It works fine on the a list, but it can't parse the b list properly.
a = ['Sehingga 8 Ogos 2021: Jumlah kes COVID-19 yang dilaporkan adalah 18,688 kes (1,262,540 kes)\n\nPecahan setiap negeri (Kumulatif):\n\nSelangor - 6,565 (465,015)\nWPKL - 1,883 (140,404)\nJohor - 1,308 (100,452)\nSabah -Lagi 1,379 (93,835)\nSarawak - 581 (81,328)\nNegeri Sembilan - 1,140 (78,777)\nKedah - 1,610 (56,598)\nPulau Pinang - 694 (52,368)\nKelantan - 870 (49,433)\nPerak - 861 (43,924)\nMelaka - 526 (35,584)\nPahang - 602 (29,125)\nTerengganu - 598 (20,696)\nWP Labuan - 2 (9,711)\nWP Putrajaya - 63 (4,478)\nPerlis - 6 (812)\n\n- KPK KKM']
b = ['Sehingga 9 Ogos 2021. Jumlah kes COVID-19 yang dilaporkan adalah 17,236 kes (1,279,776 kes).\n\nPecahan setiap negeri (Kumulatif):\n\nSelangor - 5,740 (470,755)\nWPKL - 1,567 (141,971)\nJohor - 1,232 (101,684)\nSabah -Lagi 1,247 (95,082)\nSarawak - 589 (81,917)\nNegeri Sembilan - 1,215 (79,992)\nKedah - 1,328 (57,926)\nPulau Pinang - 908 (53,276)\nKelantan - 914 (50,347)\nPerak - 935 (44,859)\nMelaka - 360 (35,944)\nPahang - 604 (29,729)\nTerengganu - 501 (21,197)\nWP Labuan - 8 (9,719)\nWP Putrajaya - 66 (4,544)\nPerlis - 22 (834)\n\n- KPK KKM']
My code
out = []
for v in b:
for g in re.findall(r"^(.*?\(.*?\))\n", v, flags=re.M):
out.append(g.split(":")[0])
print(*out[0])
Whenever I print out[0] for the b list it only shows me Selangor - 5 , 7 4 0 (470,755), which is wrong; it should be Sehingga 9 Ogos 2021.
I tried the same code on the a list and it works properly without any issues. However, I noticed a minor difference between the two lists: one has ':' and the other '.' after Sehingga 8 Ogos 2021. How can I make the function work on both lists? I'm still new to re and split; does anyone have any idea? Thanks.
There are issues with your data format and regex. I'm not that good at regex, but this works for me:
import re
a = ['Sehingga 8 Ogos 2021: Jumlah kes COVID-19 yang dilaporkan adalah 18,688 kes (1,262,540 kes)\n\nPecahan setiap negeri (Kumulatif):\n\nSelangor - 6,565 (465,015)\nWPKL - 1,883 (140,404)\nJohor - 1,308 (100,452)\nSabah -Lagi 1,379 (93,835)\nSarawak - 581 (81,328)\nNegeri Sembilan - 1,140 (78,777)\nKedah - 1,610 (56,598)\nPulau Pinang - 694 (52,368)\nKelantan - 870 (49,433)\nPerak - 861 (43,924)\nMelaka - 526 (35,584)\nPahang - 602 (29,125)\nTerengganu - 598 (20,696)\nWP Labuan - 2 (9,711)\nWP Putrajaya - 63 (4,478)\nPerlis - 6 (812)\n\n- KPK KKM']
b = ['Sehingga 9 Ogos 2021. Jumlah kes COVID-19 yang dilaporkan adalah 17,236 kes (1,279,776 kes).\n\nPecahan setiap negeri (Kumulatif):\n\nSelangor - 5,740 (470,755)\nWPKL - 1,567 (141,971)\nJohor - 1,232 (101,684)\nSabah -Lagi 1,247 (95,082)\nSarawak - 589 (81,917)\nNegeri Sembilan - 1,215 (79,992)\nKedah - 1,328 (57,926)\nPulau Pinang - 908 (53,276)\nKelantan - 914 (50,347)\nPerak - 935 (44,859)\nMelaka - 360 (35,944)\nPahang - 604 (29,729)\nTerengganu - 501 (21,197)\nWP Labuan - 8 (9,719)\nWP Putrajaya - 66 (4,544)\nPerlis - 22 (834)\n\n- KPK KKM']
out = []
for v in b:
    # normalise b's format: drop the trailing '.' before the newline, then turn
    # the remaining '.' after the date into ':' so split(':') works on both lists
    regex_list = re.findall(r"^(.*?\(.*?\))\n", v.replace('.\n', '\n').replace('.', ':'), flags=re.M)
    for g in regex_list:
        print(g)
        out.append(g.split(":")[0])
print(out[0])  # no '*': unpacking a string prints its characters one by one
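Alternatively, a sketch of my own (extract_header is a hypothetical helper): let the pattern itself accept either delimiter after the date, so no string replacement is needed:
import re

def extract_header(text):
    # accept ':' or '.' immediately after the 4-digit year
    m = re.match(r"^(.+?\d{4})[:.]", text)
    return m.group(1) if m else None

print(extract_header(a[0]))  # Sehingga 8 Ogos 2021
print(extract_header(b[0]))  # Sehingga 9 Ogos 2021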

Pandas load text file error: CParserError: Error tokenizing data

I'm a new pandas learner. I'm trying to use pandas to open a text file. I wrote the code in Python, went to the right path, and ran the Python file, but it failed.
Here is the original data. There are no field names, and the fields in each row are separated by spaces:
2017-07-02 23:59:51127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=202104&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 986 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=hydrogen-motor 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 2539 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=100005713&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 1172 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=stainless-stand 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 3152 31.7.188.55
Here is my simple python code:
import pandas as pd
DATA_FILE='data.log'
df = pd.read_table(DATA_FILE, sep=" ")
print(df)
But I got an error as follows:
Traceback (most recent call last):
  File "open.py", line 7, in <module>
    df = pd.read_table(DATA_FILE, sep=" ")
  File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 401, in _read
    data = parser.read()
  File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas\parser.pyx", line 848, in pandas.parser.TextReader.read (pandas\parser.c:10415)
  File "pandas\parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)
  File "pandas\parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas\parser.c:11437)
  File "pandas\parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:11308)
  File "pandas\parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas\parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 6 fields in line 4, saw 17
There must be something wrong with my Python code. What is the correct syntax?
You missed a space on the first line:
2017-07-02 23:59:51127.0.0.1
Replace it with:
2017-07-02 23:59:51 127.0.0.1
Just tested:
In [12]: cat data.log
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=202104&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 986 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=hydrogen-motor 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 2539 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=100005713&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 1172 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=stainless-stand 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 3152 31.7.188.55
In [13]: dx = pd.read_table('data.log', sep=" ", header=None)
In [14]: dx
Out[14]:
            0         1          2    3  \
0  2017-07-02  23:59:51  127.0.0.1  GET
1  2017-07-02  23:59:51  127.0.0.1  GET
2  2017-07-02  23:59:51  127.0.0.1  GET
3  2017-07-02  23:59:51  127.0.0.1  GET
                                         4  \
0     /ecvv_product/EcvvSearchProduct.aspx
1  /ecvv_product/EcvvHotSearchProduct.aspx
2     /ecvv_product/EcvvSearchProduct.aspx
3  /ecvv_product/EcvvHotSearchProduct.aspx
                                                    5     6  7          8  \
0     cid=202104&p=&pageindex=&kw=electric-skateboard  8082  -  127.0.0.1
1                                   kw=hydrogen-motor  8082  -  127.0.0.1
2  cid=100005713&p=&pageindex=&kw=electric-skateb...  8082  -  127.0.0.1
3                                  kw=stainless-stand  8082  -  127.0.0.1
                                                   9 10   11 12 13    14  \
0  Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;...  -  200  0  0   986
1  Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;...  -  200  0  0  2539
2  Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;...  -  200  0  0  1172
3  Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;...  -  200  0  0  3152
            15
0  31.7.188.55
1  31.7.188.55
2  31.7.188.55
3  31.7.188.55
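If fixing the log by hand is not an option, a hedged alternative (my sketch, not part of the answer above; the f0..f15 names are illustrative) is to skip malformed rows and name the 16 columns explicitly:
import pandas as pd

cols = [f'f{i}' for i in range(16)]
df = pd.read_csv('data.log', sep=' ', header=None, names=cols,
                 on_bad_lines='skip')  # pandas >= 1.3; drops rows whose field count differs
print(df.shape)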

Pandas group dataframe into user-specified time periods

Probably related: pandas dataframe group year index by decade
For example if I have data as follows
                     status  bytes_sent upstream_cache_status
timestamp
2014-05-26 23:56:30     200         356                  MISS
2014-05-26 23:56:30     200       10517                     -
2014-05-26 23:57:05     200        6923                  MISS
2014-05-26 23:57:14     200         323                     -
2014-05-26 23:57:30     200         356                  MISS
2014-05-26 23:57:38     200        8107                   HIT
2014-05-26 23:57:43     200         369                  MISS
2014-05-26 23:57:56     304         401                   HIT
2014-05-26 23:57:56     304         401                   HIT
2014-05-26 23:57:56     304         387                  MISS
2014-05-26 23:57:57     304         401                   HIT
2014-05-26 23:57:58     304         401                   HIT
2014-05-26 23:58:08     200         507               EXPIRED
2014-05-26 23:58:29     304         338                   HIT
2014-05-26 23:58:31     400         409                     -
2014-05-26 23:58:45     200         425                  MISS
If, say, I want to group them such that each group contains logs within a 30-second window (the window length is user-specified), how do I do that? I have seen this
df.groupby(lambda x: x.hour)
but I highly doubt it is relevant in my case.
df.groupby(pd.Grouper(freq='30S', level=0)) should do; for example
>>> aggr = lambda df: df.apply(tuple)
>>> df.groupby(pd.Grouper(freq='30S', level=0)).aggregate(aggr)
                                                       status                                  bytes_sent  \
timestamp
2014-05-26 23:56:30                                (200, 200)                                (356, 10517)
2014-05-26 23:57:00                                (200, 200)                                 (6923, 323)
2014-05-26 23:57:30  (200, 200, 200, 304, 304, 304, 304, 304)  (356, 8107, 369, 401, 401, 387, 401, 401)
2014-05-26 23:58:00                                (200, 304)                                  (507, 338)
2014-05-26 23:58:30                                (400, 200)                                  (409, 425)
                                           upstream_cache_status
timestamp
2014-05-26 23:56:30                                    (MISS, -)
2014-05-26 23:57:00                                    (MISS, -)
2014-05-26 23:57:30  (MISS, HIT, MISS, HIT, HIT, MISS, HIT, HIT)
2014-05-26 23:58:00                               (EXPIRED, HIT)
2014-05-26 23:58:30                                    (-, MISS)
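Because the window length is user-specified, the frequency string can be built at runtime; a small sketch (group_by_window is a hypothetical wrapper around the same Grouper call):
import pandas as pd

def group_by_window(df, seconds=30):
    # bucket rows by fixed windows of the requested number of seconds
    return df.groupby(pd.Grouper(freq=f'{seconds}S', level=0))

print(group_by_window(df, 30).size())  # row count per 30-second bucket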

Python Function returns wrong value

periodsList = []
su = '0:'
Su = []
sun = []
SUN = ''
I'm formatting timetables by converting
extendedPeriods = ['0: 1200 - 1500',
                   '0: 1800 - 2330',
                   '2: 1200 - 1500',
                   '2: 1800 - 2330',
                   '3: 1200 - 1500',
                   '3: 1800 - 2330',
                   '4: 1200 - 1500',
                   '4: 1800 - 2330',
                   '5: 1200 - 1500',
                   '5: 1800 - 2330',
                   '6: 1200 - 1500',
                   '6: 1800 - 2330']
into '1200 - 1500/1800 - 2330'.
su is the day identifier; Su and sun store intermediate values; SUN stores the converted timetable.
for line in extendedPeriods:
    if su in line:
        Su.append(line)
for item in Su:
    sun.append(item.replace(su, '', 1).strip())
SUN = '/'.join([str(x) for x in sun])
Then I tried to write a function to apply my "converter" to the other days as well:
def formatPeriods(id, store1, store2, periodsDay):
    for line in extendedPeriods:
        if id in line:
            store1.append(line)
    for item in store1:
        store2.append(item.replace(id, '', 1).strip())
    periodsDay = '/'.join([str(x) for x in store2])
    return periodsDay
But the function returns 12 misformatted strings:
'1200 - 1500', '1200 - 1500/1200 - 1500/1800 - 2330',
You can use collections.OrderedDict here; if order doesn't matter, use collections.defaultdict instead:
>>> from collections import OrderedDict
>>> dic = OrderedDict()
>>> for item in extendedPeriods:
...     k, v = item.split(': ')
...     dic.setdefault(k, []).append(v)
...
>>> for k, v in dic.items():
...     print("/".join(v))
...
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
1200 - 1500/1800 - 2330
To access a particular day you can use:
>>> print("/".join(dic['0']))  # Sunday
1200 - 1500/1800 - 2330
>>> print("/".join(dic['2']))  # Tuesday
1200 - 1500/1800 - 2330
This is your general logic:
from collections import defaultdict

d = defaultdict(list)
for i in extendedPeriods:
    bits = i.split(':')
    # use the split pieces, not character indexing into the string
    d[bits[0].strip()].append(bits[1].strip())

for day, v in d.items():
    print(day, '/'.join(v))
The output is:
0 1200 - 1500/1800 - 2330
3 1200 - 1500/1800 - 2330
2 1200 - 1500/1800 - 2330
5 1200 - 1500/1800 - 2330
4 1200 - 1500/1800 - 2330
6 1200 - 1500/1800 - 2330
To make it a function for a given day, simply select d[day] (d['0'] for Sunday, for example):
def schedule_per_day(day):
    d = defaultdict(list)
    for i in extendedPeriods:
        bits = i.split(':')
        d[bits[0].strip()].append(bits[1].strip())
    return '/'.join(d[day]) if d.get(day) else None
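A quick usage check, assuming the extendedPeriods list from the question:
print(schedule_per_day('0'))  # 1200 - 1500/1800 - 2330
print(schedule_per_day('1'))  # None: there are no entries for day 1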
