I am importing a CSV into a pandas dataframe. One column contains an 18-digit LDAP timestamp. I am trying to convert this timestamp, but it appears to be rounded, causing an incorrect calculation.
data.csv:
Event ID,Clock-Time,ProcessID,Size
10,133081599160584000,2824,44
10,133081599160584000,2824,84
10,133081599160667000,2824,44
10,133081599160667000,2824,92
10,133081599160667000,2824,116
10,133081599160667000,2824,132
script.py:
#!/usr/bin/python
from datetime import datetime, timedelta
import pandas
pandas.set_option('display.max_colwidth', None)
pandas.set_option('display.float_format','{:.0f}'.format)
pandas.set_option('display.precision', 20)
in_csv = "data.csv"
df = pandas.read_csv(in_csv, sep=',', header=0, float_precision='high')
print(df.dtypes)
print(df)
# convert win32 timestamp to unix timestamp
df['Clock-Time'] = df['Clock-Time'].apply(lambda timestamp: datetime(1601, 1, 1) + timedelta(seconds=(timestamp/10000000)))
print(df)
output:
Event ID int64
Clock-Time float64
ProcessID int64
Size int64
dtype: object
Event ID Clock-Time ProcessID Size
0 10 133082000000000000 2824 44
1 10 133082000000000000 2824 84
2 10 133082000000000000 2824 44
3 10 133082000000000000 2824 92
4 10 133082000000000000 2824 116
5 10 133082000000000000 2824 132
Event ID Clock-Time ProcessID Size
0 10 2022-09-21 02:13:20 2824 44
1 10 2022-09-21 02:13:20 2824 84
2 10 2022-09-21 02:13:20 2824 44
3 10 2022-09-21 02:13:20 2824 92
4 10 2022-09-21 02:13:20 2824 116
5 10 2022-09-21 02:13:20 2824 132
How can I get pandas to respect the full value so I can get the accurate timestamp?
I can't reproduce the issue by creating your .csv from Windows Notepad.
Event ID Clock-Time ProcessID Size
0 10 2022-09-20 15:05:16.058399 2824 44
1 10 2022-09-20 15:05:16.058399 2824 84
2 10 2022-09-20 15:05:16.066700 2824 44
3 10 2022-09-20 15:05:16.066700 2824 92
4 10 2022-09-20 15:05:16.066700 2824 116
5 10 2022-09-20 15:05:16.066700 2824 132
I bet you created your .csv file from an Excel spreadsheet. By default, when you enter a number over 12 digits (e.g., an 18-digit LDAP timestamp) in a spreadsheet, the number is auto-corrected to scientific notation. For example, "133081599160584000" is converted to "1.33082E+17" (hence the single value "133082000000000000" in your initial dataframe after calling pandas.read_csv). So when you export the spreadsheet to a text file (e.g., .csv), it is the scientific notation (what Excel sees) that is exported, and not the actual (x > 12)-digit value.
You can fix that upstream by changing the type of the affected column in Excel before creating the .csv.
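If the .csv really does contain all 18 digits, you can also keep pandas from ever converting the column to float64 by forcing its dtype at read time. A minimal sketch of that idea (the timestamp arithmetic is the same as in your script):
import pandas as pd
from datetime import datetime, timedelta

# read Clock-Time as int64 so the 18-digit value is never rounded through float64
df = pd.read_csv("data.csv", dtype={"Clock-Time": "int64"})

# LDAP/Win32 timestamps count 100 ns intervals since 1601-01-01
df["Clock-Time"] = df["Clock-Time"].apply(
    lambda t: datetime(1601, 1, 1) + timedelta(seconds=t / 10_000_000)
)
print(df)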
Related
I have the following dataframe:
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function to get the pr_ss for a given pair, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it is slow:
def get_safety_stock(df, bk1, bk2):
    # a function that returns the safety stock for any given (bk1, bk2)
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, you can write the function as follows:
def get_safety_stock(df, bk1, bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this accesses the first value in the Series, which shouldn't be an issue if there are no duplicates in the data. If you want all of them, just remove the .iloc[0] from the end and it will give you the whole Series. It can be called as follows:
get_safety_stock(df, 1000, 3)
>>> 16
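If you will call this many times on the same dataframe, a further option (my own suggestion, not part of the answer above) is to build a MultiIndex once, so that each lookup is an index access instead of a scan over both columns. A sketch:
# build the index once; later lookups avoid scanning both columns
indexed = df.set_index(['bk1_lvl0_id', 'bk2_lvl0_id']).sort_index()

def get_safety_stock_fast(indexed_df, bk1, bk2):
    return int(indexed_df.loc[(bk1, bk2), 'pr_ss'])

get_safety_stock_fast(indexed, 1000, 3)
>>> 16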
I want to get the rows of the most recent day, in ascending order of time.
My dataframe is as follows:
label uId adId operTime siteId slotId contentId netType
0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
Because there are about 100 million rows in this csv file, it is impossible to load it all into my PC's memory.
So I want to get the rows of the most recent day, in ascending order of time, while reading this csv file.
For example, if the most recent day is 2019-04-04, the output would be as follows:
# not real data, just an example
label uId adId operTime siteId slotId contentId netType
0 0 u147336431 3887 2019-04-04 00:08:42.315 1 54 2427 2
1 0 u146933269 1462 2019-04-04 01:06:16.417 30 36 1343 6
2 0 u139536523 2084 2019-04-04 02:08:58.079 15 23 1536 7
3 0 u106663472 1460 2019-04-04 03:21:13.050 32 45 1352 2
4 0 u121642861 2295 2019-04-04 04:36:08.653 3 33 3267 4
Could anyone help me?
Thanks in advance.
I'm assuming you can't read the entire file into memory, and the file is in a random order. You can read the file in chunks and iterate through the chunks.
import pandas as pd

# read 500,000 rows of the file at a time
reader = pd.read_csv(
    'csv_file.csv',
    parse_dates=['operTime'],
    chunksize=500_000,
    header=0
)

recent_day = pd.Timestamp(2019, 4, 4)
next_day = recent_day + pd.Timedelta(days=1)

df_list = []
for chunk in reader:
    # keep only the rows that fall inside the date range
    date_rows = chunk.loc[
        (chunk['operTime'] >= recent_day) &
        (chunk['operTime'] < next_day)
    ]
    # append the dataframe of matching rows to the list
    if not date_rows.empty:
        df_list.append(date_rows)

final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
Seconding what anky_91 said, sort_values() will be helpful here.
import pandas as pd
df = pd.read_csv('file.csv')
# >>> df
# label uId adId operTime siteId slotId contentId netType
# 0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
sub_df = df[(df['operTime']>'2019-03-31') & (df['operTime']<'2019-04-01')]
# >>> sub_df
# label uId adId operTime siteId slotId contentId netType
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
final_df = sub_df.sort_values(by=['operTime'])
# >>> final_df
# label uId adId operTime siteId slotId contentId netType
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
I think you could also use a DatetimeIndex here; that might be necessary if the file is sufficiently large.
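A sketch of that DatetimeIndex idea, assuming operTime parses cleanly as datetimes: once the index is a DatetimeIndex, partial-string indexing selects a whole day at once.
df['operTime'] = pd.to_datetime(df['operTime'])
df = df.set_index('operTime').sort_index()

# slicing a DatetimeIndex with a day string selects that entire day, already time-sorted
sub_df = df.loc['2019-03-31']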
Like anky_91 mentioned, you can use the sort_values function. Here is a short example of how it works:
df = pd.DataFrame({'Symbol': ['A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '08/21/2015']})
df.sort_values(by='Date')
Out:
         Date Symbol
1  01/15/2016      A
0  02/20/2015      A
2  08/21/2015      A
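Note that 'Date' holds strings here, so the sort is lexicographic: '01/15/2016' sorts before '02/20/2015' even though it is a year later. To sort chronologically, parse the column first (a small sketch):
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df = df.sort_values(by='Date')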
Suppose you have a column in Excel with values like this... there are only 5500 numbers present, but it shows length 5602, which means that 102 strings are present:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? I have attached my code in Python using pandas:
import re

def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0

select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
Specific to the column SELECTIO:
df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, use the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem with how you are extracting the column. You are using ['Selection No.'], but if the name actually contains a trailing space it needs to be ['Selection No. ']; that is the reason you would get a KeyError while executing it. Try and see!
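A quick way to guard against stray spaces like that (a small sketch, assuming the dataframe is named select as in the question) is to normalize the headers right after loading:
# strip leading/trailing whitespace from every column name
select.columns = select.columns.str.strip()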
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): tries to find the column value sle IN the match object, which "always has a boolean value of True" (and re.match returns None when there is no match, hence your "argument of type 'NoneType' is not iterable" error).
I would suggest proceeding with the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required, use the pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
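Another option (my own suggestion, not from the answers above) is pd.to_numeric with errors='coerce', which turns non-numeric strings into NaN so they can be dropped:
import pandas as pd

s = pd.Series(['37001', '37002', 'asnsh', '37004', 'singh', '37005'], name='Selection No.')
numeric = pd.to_numeric(s, errors='coerce').dropna().astype(int)
# 0    37001
# 1    37002
# 3    37004
# 5    37005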
The Scenario
My dataset was in the following format, which I refer to as the ACTUAL FORMAT:
uid iid rat tmp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
and when passing it to another function (KMeans clustering) it needs to be in the following format, which I created using a pivot mapping and refer to as the MATRIX FORMAT:
uid 1 2 3 4
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
5 4 3 2.1952622349 3.1913491995
6 4 3.4233243638 3.8255108621 3.948791424
7 4.4983411706 4.0477240538 4.0241460801 5
8 4.1773004578 4.0191412859 4.0442369862 4.1754642909
9 4.2733984521 4.2797130861 4.2682723131 4.2816986988
15 1 3.0554789259 3.2279546684 3.1282278957
16 5 4.3473697565 4.0675394438 5
The Problem:
Since I need the resulting MATRIX FORMAT data to be passed again to the first algorithm, I need to convert it back to the ACTUAL FORMAT.
Conversion:
For the conversion of ACTUAL to MATRIX format I did:
Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')
I tried reversing and interchanging values to get back the ACTUAL FORMAT, which failed. Is there any way to convert the MATRIX back to the ACTUAL FORMAT?
You need stack with rename_axis for the column names, and finally reset_index:
df = df.stack().rename_axis(('uid','iid')).reset_index(name='rat')
print (df.head())
uid iid rat
0 4 1 4.332076
1 4 2 4.340775
2 4 3 4.311200
3 4 4 4.341143
4 5 1 4.000000
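As a quick sanity check (a sketch using the same names as above), the round trip reproduces the long format, since stack by default drops the NaN cells that pivot introduces for missing (uid, iid) pairs:
# long -> matrix
Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')
# matrix -> long again
restored = Pivot_Matrix.stack().rename_axis(('uid', 'iid')).reset_index(name='rat')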
total_val_count = dataset[attr].value_counts()
for i in range(len(total_val_count.index)):
    print(total_val_count[i])
I have written this piece of code which counts occurrences of all distinct values of an attribute in a dataframe. The problem I am facing is that I am unable to access the first value by using index 0. I get a KeyError: 0 error in the first loop run itself.
The total_val_count contains proper values as shown below:
34 2887
4 2708
13 2523
35 2507
33 2407
3 2404
36 2382
26 2378
16 2282
22 2187
21 2141
12 2104
25 2073
5 2052
15 2044
17 2040
14 2027
28 1984
27 1980
23 1979
24 1960
30 1953
29 1936
31 1884
18 1877
7 1858
37 1767
20 1762
11 1740
8 1722
6 1693
32 1692
10 1662
9 1576
19 1308
2 1266
1 175
38 63
dtype: int64
total_val_count is a Series. The index of the Series consists of the values in dataset[attr], and the values of the Series are the number of times each associated value in dataset[attr] appears.
When you index a Series with total_val_count[i], Pandas looks for i in the index and returns the associated value. In other words, total_val_count[i] indexes by index value, not by ordinal.
Think of a Series as a mapping from the index to the values. When using plain indexing, e.g. total_val_count[i], it behaves more like a dict than a list.
You are getting a KeyError because 0 is not a value in the index.
To index by ordinal, use total_val_count.iloc[i].
Having said that, using for i in range(len(total_val_count.index)) -- or, what amounts to the same thing, for i in range(len(total_val_count)) -- is not recommended. Instead of
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
you could use
for value in total_val_count.values:
    print(value)
This is more readable, and allows you to access the desired value as a variable, value, instead of the more cumbersome total_val_count.iloc[i].
Here is an example which shows how to iterate over the values, the keys, both the keys and values:
import pandas as pd
s = pd.Series([1, 2, 3, 2, 2])
total_val_count = s.value_counts()
print(total_val_count)
# 2 3
# 3 1
# 1 1
# dtype: int64
for value in total_val_count.values:
    print(value)
# 3
# 1
# 1
for key in total_val_count.keys():
    print(key)
# 2
# 3
# 1
for key, value in total_val_count.items():
    print(key, value)
# 2 3
# 3 1
# 1 1
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
# 3
# 1
# 1