Pandas load text file error: CParserError: Error tokenizing data.
I'm new to pandas and I'm trying to use it to open a text file. I wrote the code in Python, navigated to the right path, and ran the script, but it failed.
Here is the original data. There are no field names, and the values in each row are separated by spaces:
2017-07-02 23:59:51127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=202104&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 986 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=hydrogen-motor 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 2539 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=100005713&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 1172 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=stainless-stand 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 3152 31.7.188.55
Here is my simple Python code:
import pandas as pd
DATA_FILE='data.log'
df = pd.read_table(DATA_FILE, sep=" ")
print(df)
But I got the following error:
Traceback (most recent call last):
File "open.py", line 7, in <module>
df = pd.read_table(DATA_FILE, sep=" ")
File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 401, in _read
data = parser.read()
File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "C:\Users\hh\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 848, in pandas.parser.TextReader.read (pandas\parser.c:10415)
File "pandas\parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)
File "pandas\parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas\parser.c:11437)
File "pandas\parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:11308)
File "pandas\parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas\parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 6 fields in line 4, saw 17
There must be something wrong with my Python code. What is the correct syntax?
You're missing a space on the first line:
2017-07-02 23:59:51127.0.0.1
Replace it with:
2017-07-02 23:59:51 127.0.0.1
Just tested:
In [12]: cat data.log
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=202104&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 986 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=hydrogen-motor 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 2539 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvSearchProduct.aspx cid=100005713&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 1172 31.7.188.55
2017-07-02 23:59:51 127.0.0.1 GET /ecvv_product/EcvvHotSearchProduct.aspx kw=stainless-stand 8082 - 127.0.0.1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;+DigExt;+DTS+Agent - 200 0 0 3152 31.7.188.55
In [13]: dx = pd.read_table('data.log', sep=" ", header=None)
In [14]: dx
Out[14]:
0 1 2 3 \
0 2017-07-02 23:59:51 127.0.0.1 GET
1 2017-07-02 23:59:51 127.0.0.1 GET
2 2017-07-02 23:59:51 127.0.0.1 GET
3 2017-07-02 23:59:51 127.0.0.1 GET
4 \
0 /ecvv_product/EcvvSearchProduct.aspx
1 /ecvv_product/EcvvHotSearchProduct.aspx
2 /ecvv_product/EcvvSearchProduct.aspx
3 /ecvv_product/EcvvHotSearchProduct.aspx
5 6 7 8 \
0 cid=202104&p=&pageindex=&kw=electric-skateboard 8082 - 127.0.0.1
1 kw=hydrogen-motor 8082 - 127.0.0.1
2 cid=100005713&p=&pageindex=&kw=electric-skateb... 8082 - 127.0.0.1
3 kw=stainless-stand 8082 - 127.0.0.1
9 10 11 12 13 14 \
0 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;... - 200 0 0 986
1 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;... - 200 0 0 2539
2 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;... - 200 0 0 1172
3 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+NT;... - 200 0 0 3152
15
0 31.7.188.55
1 31.7.188.55
2 31.7.188.55
3 31.7.188.55
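If you can't edit the file by hand, you can repair it programmatically before parsing. Here is a minimal sketch, assuming the only corruption is a HH:MM:SS timestamp fused directly to the following field (as in the first line above):
import io
import re
import pandas as pd

# Insert a space wherever a HH:MM:SS timestamp is glued to the next field.
with open('data.log') as f:
    text = re.sub(r'(\d{2}:\d{2}:\d{2})(?=\d)', r'\1 ', f.read())

df = pd.read_table(io.StringIO(text), sep=' ', header=None)
print(df.shape)  # (4, 16) once every row tokenizes to 16 fields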
I have two columns in my temp pandas DataFrame, and I want to count the visits for each unique host per day. The log file looks like this:
10.216.113.172 - - [04/Sep/2009:02:57:16 -0700] "GET /images/filmpics/0000/0053/quietman2.jpeg HTTP/1.1" 200 1077924
10.211.47.159 - - [03/Sep/2009:22:19:49 -0700] "GET /quietman4.jpeg HTTP/1.1" 404 212
10.211.47.159 - - [22/Aug/2009:12:58:27 -0700] "GET /assets/img/closelabel.gif HTTP/1.1" 304 -
10.216.113.172 - - [14/Jan/2010:03:09:17 -0800] "GET /images/filmmediablock/229/Shinjuku5.jpg HTTP/1.1" 200 443005
10.211.47.159 - - [15/Oct/2009:21:21:58 -0700] "GET /assets/img/banner/ten-years-banner-grey.jpg HTTP/1.1" 304 -
10.216.113.172 - - [12/Aug/2009:05:57:55 -0700] "GET /about-us/people/ HTTP/1.1" 200 10773
10.211.47.159 - - [24/Aug/2009:13:16:26 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -
10.211.47.159 - - [03/Sep/2009:21:30:27 -0700] "GET /images/newspics/0000/0017/Mike5_thumb.JPG HTTP/1.1" 304 -
10.211.47.159 - - [15/Oct/2009:20:30:43 -0700] "GET /images/filmpics/0000/0057/quietman4.jpeg HTTP/1.1" 304 -
10.211.47.159 - - [11/Aug/2009:20:34:44 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -
I have those timestamps (03/Sep/2009:22:19:49 -0700) and hosts (10.211.47.159) in my DataFrame temp. The desired output is [2,1,1,2,1,1]. The -0700 is the offset we have to subtract from the time, and the date has to be pushed back by one if the subtraction crosses into the previous day.
Here's my code, but my output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. Can someone help?
temp = pandas.DataFrame()
temp['timestamp'] = Mainpanda['timestamp']
temp['host'] = Mainpanda['host']
temp['timestamp'] = pandas.to_datetime(temp['timestamp'],
                                       format='%d/%b/%Y:%H:%M:%S %z')
temp['timestamp'] = temp['timestamp'] - pandas.Timedelta(hours=7)
counts = temp.groupby('timestamp')['host'].nunique().reset_index()
counts = counts.sort_values(by='timestamp')
counts = counts['host'].tolist()
print(counts)
First convert to datetime, then align everything to UTC: it's important to have all timestamps in the same timezone to process your data correctly:
import pandas as pd
temp['timestamp'] = pd.to_datetime(temp['timestamp'], format='%d/%b/%Y:%H:%M:%S %z')
temp['ts_utc'] = pd.DatetimeIndex([dt.tz_convert('UTC') for dt in temp['timestamp']])
visits = (temp.groupby(['host', pd.Grouper(key='ts_utc', freq='D')]).size()
.to_frame('visit').reset_index())
Output:
>>> visits
host ts_utc visit
0 10.211.47.159 2009-08-12 00:00:00+00:00 1
1 10.211.47.159 2009-08-22 00:00:00+00:00 1
2 10.211.47.159 2009-08-24 00:00:00+00:00 1
3 10.211.47.159 2009-09-04 00:00:00+00:00 2
4 10.211.47.159 2009-10-16 00:00:00+00:00 2
5 10.216.113.172 2009-08-12 00:00:00+00:00 1
6 10.216.113.172 2009-09-04 00:00:00+00:00 1
7 10.216.113.172 2010-01-14 00:00:00+00:00 1
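If you need these counts as a flat list, as in the desired output of the question, take the visit column from the result above:
>>> visits['visit'].tolist()
[1, 1, 1, 2, 2, 1, 1, 1]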
I want to split the raw data from a txt file with Python. The raw data looks like this:
-0.156 200
-0.157 300
-0.158 400
-0.156 201
-0.157 305
-0.158 403
-0.156 199
-0.157 308
-0.158 401
I expect to split the file into many txt files like this.
-0.156 200
-0.157 300
-0.158 400
-0.156 201
-0.157 305
-0.158 403
-0.156 199
-0.157 308
-0.158 401
Would you please help me?
This splits the data into files with three entries in each file:
READ_FROM = 'my_file.txt'
ENTRIES_PER_FILE = 3

with open(READ_FROM) as f:
    data = f.read().splitlines()

i = 1
n = 1
for line in data:
    with open(f'new_file_{n}.txt', 'a') as g:
        g.write(line + '\n')
    i += 1
    if i > ENTRIES_PER_FILE:
        i = 1
        n += 1
new_file_1.txt
-0.156 200
-0.157 300
-0.158 400
new_file_2.txt
-0.156 201
-0.157 305
-0.158 403
new_file_3.txt
-0.156 199
-0.157 308
-0.158 401
If you have blank lines between your lines in the input and want blank lines between the lines in your new files, this will work:
with open('path/to/file.txt', 'r') as file:
    lines = file.readlines()

cleaned_lines = [line.strip() for line in lines if len(line.strip()) > 0]

num_lines_per_file = 3
for num in range(0, len(cleaned_lines), num_lines_per_file):
    with open(f'{num}.txt', 'w') as out_file:
        for line in cleaned_lines[num:num + num_lines_per_file]:
            out_file.write(f'{line}\n\n')
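For completeness, since the rest of this thread is pandas-centric, here is a sketch of the same three-line split done with pandas. It assumes whitespace-separated data and hypothetical output names chunk_1.txt, chunk_2.txt, and so on:
import pandas as pd

df = pd.read_table('my_file.txt', sep=r'\s+', header=None)
for n, start in enumerate(range(0, len(df), 3), start=1):
    # Write each block of three rows to its own space-separated file.
    df.iloc[start:start + 3].to_csv(f'chunk_{n}.txt', sep=' ',
                                    header=False, index=False)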
I have the below sequence of data as a pandas DataFrame:
id,start,end,duration
303,2012-06-25 17:59:43,2012-06-25 18:01:29,105
404,2012-06-25 18:01:29,2012-06-25 18:01:55,25
303,2012-06-25 18:01:56,2012-06-25 18:02:06,10
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
404,2012-06-25 18:02:45,2012-06-25 18:02:51,6
303,2012-06-25 18:02:54,2012-06-25 18:03:17,23
404,2012-06-25 18:03:24,2012-06-25 18:03:41,17
303,2012-06-25 18:03:43,2012-06-25 18:05:51,128
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104
404,2012-06-25 18:24:24,2012-06-25 18:25:25,61
101,2012-06-25 18:25:25,2012-06-25 18:25:46,21
404,2012-06-25 18:25:49,2012-06-25 18:26:00,11
101,2012-06-25 18:26:01,2012-06-25 18:26:04,3
404,2012-06-25 18:26:05,2012-06-25 18:28:49,164
202,2012-06-25 18:28:52,2012-06-25 18:28:57,5
404,2012-06-25 18:29:00,2012-06-25 18:29:24,24
It should always be the case that id 404 appears again after each different id.
For example, if the above are motion sensors in a house, e.g. 404: hallway, 202: bedroom, 303: kitchen, 201: study room, where the hallway is in the middle, then moving from the bedroom to the kitchen to the study room and back to the bedroom should trigger 202, 404, 303, 404, 201, 404, 202 in that order, because one always passes through the hallway (404) to reach any room. My output has cases that violate this sequence, and I want to drop such rows.
For example from the snippet dataframe above the below rows violate this:
303,2012-06-25 18:01:56,2012-06-25 18:02:06,10
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
303,2012-06-25 18:03:43,2012-06-25 18:05:51,128
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104
and therefore the rows below should be dropped (but of course I have a much larger dataset).
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104
I have tried shift and drop but the result still has some inconsistencies.
df['id_ns'] = df['id'].shift(-1)
df['id_ps'] = df['id'].shift(1)
if (df['id'] != 404):
df.drop(df[(df.id_ns != 404) & (df.id_ps != 404)].index, axis=0, inplace=True)
How best can I approach this?
Use Series.ne + Series.shift with the optional fill_value parameter to create a boolean mask, then use this mask to filter out the offending rows:
mask = df['id'].ne(404) & df['id'].shift(fill_value=404).ne(404)
df = df[~mask]
Result:
print(df)
id start end duration
0 303 2012-06-25 17:59:43 2012-06-25 18:01:29 105
1 404 2012-06-25 18:01:29 2012-06-25 18:01:55 25
2 303 2012-06-25 18:01:56 2012-06-25 18:02:06 10
4 404 2012-06-25 18:02:45 2012-06-25 18:02:51 6
5 303 2012-06-25 18:02:54 2012-06-25 18:03:17 23
6 404 2012-06-25 18:03:24 2012-06-25 18:03:41 17
7 303 2012-06-25 18:03:43 2012-06-25 18:05:51 128
9 404 2012-06-25 18:24:24 2012-06-25 18:25:25 61
10 101 2012-06-25 18:25:25 2012-06-25 18:25:46 21
11 404 2012-06-25 18:25:49 2012-06-25 18:26:00 11
12 101 2012-06-25 18:26:01 2012-06-25 18:26:04 3
13 404 2012-06-25 18:26:05 2012-06-25 18:28:49 164
14 202 2012-06-25 18:28:52 2012-06-25 18:28:57 5
15 404 2012-06-25 18:29:00 2012-06-25 18:29:24 24
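The mask is True exactly where a row and its predecessor are both non-404, which is where the alternating pattern breaks; fill_value=404 pretends the (nonexistent) row before the first one was the hallway, so a legitimate first entry survives. A quick toy check of that logic:
import pandas as pd

s = pd.Series([303, 404, 303, 303, 404, 101])
mask = s.ne(404) & s.shift(fill_value=404).ne(404)
print(mask.tolist())  # [False, False, False, True, False, False]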
Consider this simple example:
>>> import pandas as pd
>>> dfA = pd.DataFrame({
...     "key": [1, 3, 6, 10, 15, 21],
...     "columnA": [10, 20, 30, 40, 50, 60],
...     "columnB": [100, 200, 300, 400, 500, 600],
...     "columnC": [110, 202, 330, 404, 550, 606],
... })
>>> dfA
key columnA columnB columnC
0 1 10 100 110
1 3 20 200 202
2 6 30 300 330
3 10 40 400 404
4 15 50 500 550
5 21 60 600 606
If I want to use .loc here, it works fine:
>>> dfA.set_index('key').loc[2:16]
columnA columnB columnC
key
3 20 200 202
6 30 300 330
10 40 400 404
15 50 500 550
... but if I do a "cast" (.astype) to Int64, it fails:
>>> dfA.astype('Int64').set_index('key').loc[2:16]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1768, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1912, in _getitem_axis
return self._get_slice_axis(key, axis=axis)
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1796, in _get_slice_axis
indexer = labels.slice_indexer(
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4712, in slice_indexer
start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4925, in slice_locs
start_slice = self.get_slice_bound(start, "left", kind)
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4837, in get_slice_bound
label = self._maybe_cast_slice_bound(label, side, kind)
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4789, in _maybe_cast_slice_bound
self._invalid_indexer("slice", label)
File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3075, in _invalid_indexer
raise TypeError(
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>
>>>
Why does this happen - and can I have this kind of .loc indexing with Int64 too? (I have to use Int64 because I read in .csv data which has missing values, and I don't want the values cast to floats - but I'd still like to use .loc as in the above case.)
EDIT: a bit more info:
>>> dfA.astype('Int64').loc(0)[0]['key']
1
>>> type(dfA.astype('Int64').loc(0)[0]['key'])
<class 'numpy.int64'>
OK, so the actual numbers in the case of dtype 'Int64' are of class 'numpy.int64' - but they still cannot be used for .loc here:
>>> import numpy as np
>>> dfA.astype('Int64').set_index('key').loc[np.int64(2):np.int64(2)]
...
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'numpy.int64'>
This happens because, in this pandas version, setting a nullable Int64 column as the index falls back to a generic object-dtype Index (note the <class 'pandas.core.indexes.base.Index'> in the traceback), and a generic Index does not support integer slice bounds. You can circumvent this by making key the index first, while it is still plain int64, and only then converting to Int64:
dfA.set_index('key').astype('Int64').loc[2:16]
columnA columnB columnC
key
3 20 200 202
6 30 300 330
10 40 400 404
15 50 500 550
Or convert only your key column to old-fashioned int64:
df.index = df['key'].astype('int64')
That is, presuming it does not have <NA> values like your other columns apparently do.
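Putting it together for the CSV use case from the question, a minimal sketch assuming a hypothetical data.csv whose key column has no gaps:
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file with missing values
df = df.set_index('key')      # index set while key is still plain int64
df = df.astype('Int64')       # only the value columns become nullable
print(df.loc[2:16])           # integer label slicing works as before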
I've had a good look and I can't seem to find the answer to this question. I want to replace all NaN values in the Department Code column of my DataFrame with values from a dictionary, using the Job Number column as the key to match against the dictionary. The data can be seen below. (Please note there are many extra columns; these are just the two relevant ones.)
df =
Job Number Department Code
0 3525 403
1 4555 NaN
2 5575 407
3 6515 407
4 7525 NaN
5 8535 102
6 3545 403
7 7455 102
8 3365 NaN
9 8275 403
10 3185 408
dict = {'4555': '012', '7525': '077', '3365': '034'}
What I am hoping the output to look like is:
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408
The two columns are object dtype, and I have tried the replace function, which I have used before, but that only replaces the value if the key is in the same column.
df['Department Code'].replace(dict, inplace=True)
This does not replace the NaN values.
I'm sure the answer is very simple and I apologize in advance, but I'm just stuck.
(Excuse my poor code display; it's handwritten, as I'm not sure how to export code from Python to here.)
Better to avoid the variable name dict, because it shadows a Python builtin. Then use Series.fillna to fill the missing values from Series.map; where there is no match, map returns NaN, so nothing is replaced:
d = {'4555': '012', '7525': '077', '3365': '034'}
df['Department Code'] = df['Department Code'].fillna(df['Job Number'].astype(str).map(d))
print(df)
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408
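The .astype(str) is what makes the lookup line up if Job Number was parsed as integers: the dictionary keys are strings, so the types must match for map to find anything. A minimal check of that assumption:
import pandas as pd

d = {'4555': '012'}
s = pd.Series([3525, 4555])           # Job Number parsed as integers
print(s.map(d).tolist())              # [nan, nan] -- no match on ints
print(s.astype(str).map(d).tolist())  # [nan, '012'] -- keys now align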
Another way is to use set_index with fillna:
df['Department Code'] = (df.set_index('Job Number')['Department Code']
                           .fillna(d).values)
print(df)
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408