Changing a coded min to a datetime in python pandas

I have a data set which looks like this. Note that 263 means (0-15 min), 264 means (16-30 min), 265 means (31-45 min), and 266 means (46-60 min). I need to combine these columns into a single datetime column in the format YYYY-MM-DD HH:MM:SS:
LOCAL_YEAR LOCAL_MONTH LOCAL_DAY LOCAL_HOUR VALUE FLAG STATUS MEAS_TYPE_ELEMENT_ALIAS
2006 4 11 0 0 R 263
2006 4 11 0 0 R 264
2006 4 11 0 0 R 265
2006 4 11 0 0 R 266
2006 4 11 1 0 R 263
2006 4 11 1 0 R 264
2006 4 11 1 0 R 265
2006 4 11 1 0 R 266
I was wondering if anyone could help me with this?
This is the code:
import datetime
import pandas as pd
import numpy as np

raw_data = pd.read_csv('Squamish_263_264_265_266.csv')

# Reading rainfall and years
df = raw_data.iloc[:, [2, 3, 4, 5, 6, 9]]
#print(df)

dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['MEAS_TYPE_ELEMENT_ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)
for row, v in df.iterrows():
    df.loc[row, 'date'] = datetime.datetime(v['LOCAL_YEAR'], v['LOCAL_MONTH'], v['LOCAL_DAY'], v['LOCAL_HOUR'], v['MEAS_TYPE_ELEMENT_ALIAS_map'])
but it gives this error:
TypeError: integer argument expected, got float
and
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Use a map to translate the alias into minutes, then iterate to build your dates:
dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)
for row in df.itertuples():
    # cast to int: columns read from CSV may arrive as floats, and
    # datetime.datetime() only accepts integer arguments
    df.loc[row.Index, 'date'] = datetime.datetime(
        int(row.LOCAL_YEAR), int(row.LOCAL_MONTH), int(row.LOCAL_DAY),
        int(row.LOCAL_HOUR), int(row.ALIAS_map))
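As a loop-free alternative, the whole column can be built in one vectorized call: pd.to_datetime accepts a DataFrame whose columns are named year/month/day/hour/minute. A minimal sketch, assuming df holds the columns from the sample above:
import pandas as pd

dmap = {263: 0, 264: 16, 265: 31, 266: 46}
# Assemble named components; pd.to_datetime combines them row-wise
parts = pd.DataFrame({
    'year': df['LOCAL_YEAR'],
    'month': df['LOCAL_MONTH'],
    'day': df['LOCAL_DAY'],
    'hour': df['LOCAL_HOUR'],
    'minute': df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap),
})
df['date'] = pd.to_datetime(parts)
This also sidesteps the float/int issue, since to_datetime handles numeric columns itself.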

Related

How could I get the rows full of values in a txt file?

I've been struggling with a problem for a few hours.
I have a data set which looks like this:
References 425 1451 259
/1/J.H.Gibbons,R.L.Macklin:Phys.Rev.B137,1508(1965) 425 1451 260
/2/P.R.Wrean,C.R.Brune,R.W.Kavanagh:Phys.Rev.C49,1205(1994) 425 1451 261
/3/L.Van der Zwan,K.W.Geiger:Nucl.Phys.A152,481(1970) 425 1451 262
/4/T.Murata:JAERI-Conf 98-003,p.215 425 1451 263
3.500000+6 3.844649-2 3.508375+6 3.783472-2 4.000000+6-2.064883-2 425 3 2 7
4.014688+6-2.289045-2 4.403160+6-8.623264-2 4.500000+6-1.022847-1 425 3 2 8
1.450000+7-2.039133-1 1.500000+7-1.930321-1 425 3 2 17
and I wonder how I could read, into an array or a text object, only the last rows, where there are values in every column. To be clear, I would like an object like this:
3.500000+6 3.844649-2 3.508375+6 3.783472-2 4.000000+6-2.064883-2 425 3 2 7
4.014688+6-2.289045-2 4.403160+6-8.623264-2 4.500000+6-1.022847-1 425 3 2 8
1.450000+7-2.039133-1 1.500000+7-1.930321-1 425 3 2 17
Honestly, I've found no such thing on Stack Overflow, so I'm asking the question directly.
Any answers would be appreciated.
Thank you!
You'll need two components:
1.) A file reader (see: How to read files in python)
2.) A filter; I'd recommend regex for this (see: regex in python)
Loop through each line using your file reader, and use your filter to look for and extract the data on each line.
This example may work, provided 425 marks the start of the data you want on every line:
import re

data = []
# Open the file for reading
with open('your.txt') as f:
    # Loop through each line
    for line in f.readlines():
        # Look for 425 and pull everything from there to the end of the line
        match = re.search('425.*', line)
        if match:
            # Skip lines without a match instead of raising an AttributeError
            data.append(match.group())
The table columns appear to have fixed widths.
Without seeing more of the table, it is hard to reliably determine whether a row is meaningful. I've dropped rows if the value in the first column doesn't look like a number such as 3.500000+6, by checking for a . at the exact position.
The numbers in the first 6 columns seem to be in scientific notation, but without the e. Code fixing that is included at the bottom.
import pandas as pd

filename = r"C:\Users\Bobson Dugnutt\Desktop\table.txt"
column_widths = (11, 11, 11, 11, 11, 11, 4, 2, 3, 5)
df = pd.read_fwf(filename, widths=column_widths, header=None)
print(df, end="\n\n")

# Returns True if the value looks like a number such as "3.500000+6"
def is_meaningful_row(value):
    # guard the length so short strings can't raise an IndexError
    if isinstance(value, str) and len(value) >= 9:
        if value[-9] == ".":
            return True
    return False

# Remove rows whose first column doesn't look like a number such as "3.500000+6"
df = df[df[0].map(is_meaningful_row)]
# Reset the index so the first row is no longer index 5
df = df.reset_index(drop=True)
print(df, end="\n\n")

# Replace NA/NaN values with empty strings
df = df.fillna("")
print(df, end="\n\n")

# Transforms a string like "-2.039133-1" into "-2.039133e-1"
def fix_scientific_notation(value):
    if isinstance(value, str):
        return value[0:1] + value[1:].replace("+", "e+").replace("-", "e-")
    return value

# For the first 6 columns, fix the unusual scientific notation and convert
# values from string to float64 or int64
for column in df.columns[:6]:
    df[column] = df[column].map(fix_scientific_notation)
    df[column] = pd.to_numeric(df[column])
print(df)
Output:
0 1 2 3 4 5 6 7 8 9
0 References NaN NaN NaN NaN NaN 425 1 451 259
1 /1/J.H.Gibb ons,R.L.Mac klin:Phys.R ev.B137,150 8(1965) NaN 425 1 451 260
2 /2/P.R.Wrea n,C.R.Brune ,R.W.Kavana gh:Phys.Rev .C49,1205(1 994) 425 1 451 261
3 /3/L.Van de r Zwan,K.W. Geiger:Nucl .Phys.A152, 481(1970) NaN 425 1 451 262
4 /4/T.Murata :JAERI-Conf 98-003,p.2 15 NaN NaN 425 1 451 263
5 3.500000+6 3.844649-2 3.508375+6 3.783472-2 4.000000+6 -2.064883-2 425 3 2 7
6 4.014688+6 -2.289045-2 4.403160+6 -8.623264-2 4.500000+6 -1.022847-1 425 3 2 8
7 1.450000+7 -2.039133-1 1.500000+7 -1.930321-1 NaN NaN 425 3 2 17
0 1 2 3 4 5 6 7 8 9
0 3.500000+6 3.844649-2 3.508375+6 3.783472-2 4.000000+6 -2.064883-2 425 3 2 7
1 4.014688+6 -2.289045-2 4.403160+6 -8.623264-2 4.500000+6 -1.022847-1 425 3 2 8
2 1.450000+7 -2.039133-1 1.500000+7 -1.930321-1 NaN NaN 425 3 2 17
0 1 2 3 4 5 6 7 8 9
0 3.500000+6 3.844649-2 3.508375+6 3.783472-2 4.000000+6 -2.064883-2 425 3 2 7
1 4.014688+6 -2.289045-2 4.403160+6 -8.623264-2 4.500000+6 -1.022847-1 425 3 2 8
2 1.450000+7 -2.039133-1 1.500000+7 -1.930321-1 425 3 2 17
0 1 2 3 4 5 6 7 8 9
0 3500000.0 0.038446 3508375.0 0.037835 4000000.0 -0.020649 425 3 2 7
1 4014688.0 -0.022890 4403160.0 -0.086233 4500000.0 -0.102285 425 3 2 8
2 14500000.0 -0.203913 15000000.0 -0.193032 NaN NaN 425 3 2 17

fastest way to access dataframe cell by column values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function to get the pr_ss, for example for (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it takes too long:
def get_safety_stock(df, bk1, bk2):
    # a function that returns the safety stock for any given (bk1, bk2)
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, you can write the function as follows:
def get_safety_stock(df, bk1, bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this accesses the first value in the filtered Series; .iloc[0] is positional, whereas plain [0] would look up the label 0 and fail for matches whose index label isn't 0. This shouldn't be an issue if there are no duplicates in the data. If you want all matches, just remove the .iloc[0] from the end and you get the whole Series. It can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
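If this lookup runs many times, a possibly faster variant is to build a MultiIndex once, so each call is an index lookup rather than a boolean scan over every row. A sketch, assuming the column names above and unique (bk1, bk2) pairs:
# Build the index once; repeated lookups then avoid scanning every row
indexed = df.set_index(['bk1_lvl0_id', 'bk2_lvl0_id']).sort_index()

def get_safety_stock_fast(indexed_df, bk1, bk2):
    # assumes the (bk1, bk2) pair is unique, as in the answer above
    return int(indexed_df.loc[(bk1, bk2), 'pr_ss'])

get_safety_stock_fast(indexed, 1000, 3)
>>>16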

How can I loop through pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all 3 columns with DataFrame.sort_values, get the per-group differences with DataFrameGroupBy.diff, and replace missing values with a 0 timedelta via Series.fillna:
# if the column already holds strings, astype(str) can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID', 'Location', 'Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds by adding Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
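A minimal, self-contained sketch of the whole approach, using a few made-up rows shaped like the question's excerpt:
import pandas as pd

df = pd.DataFrame({
    'ID': [995812, 995812, 995812, 995820, 995820],
    'Location': [696, 730, 761, 381, 761],
    'Time': ['07:10:36', '07:11:41', '07:12:30', '06:55:07', '07:12:44'],
})
df['Time'] = pd.to_timedelta(df['Time'])
df = df.sort_values(['ID', 'Location', 'Time'])
# the first row of each ID has no predecessor, so diff() yields NaT -> fill with 0
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
print(df)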
If you just wanted to iterate over the groupby object (per your original question title), you can do it like this:
for (x, y) in df.groupby(['ID', 'Location', 'Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this works for 10,000 or 100,000 rows, it does not scale well to 10^6 rows or more.

Get index of where group starts and ends pandas

I grouped my data by month. Now I need to know at which observation/index my group starts and ends.
What I have is the following output, where the second column represents the number of observations in each month:
date
01 145
02 2232
03 12785
04 16720
Name: date, dtype: int64
with this code:
leave.groupby([leave['date'].dt.strftime('%m')])['date'].count()
What I want, though, is an index range I can access later, something like this (the format doesn't really matter, and I don't mind whether it returns a list or a data frame):
date
01 0 - 145
02 146 - 2378
03 2378 - 15163
04 15164 - 31884
Try the following, using shift:
df['data'] = df['data'].shift(1).add(1).fillna(0).apply(int).apply(str) + ' - ' + df['data'].apply(str)
OUTPUT:
data
date
1 0 - 145
2 146 - 2232
3 2233 - 12785
4 12786 - 16720
5 16721 - 30386
6 30387 - 120157
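Note that the desired output in the question shows cumulative ranges across months, which the shifted per-month counts above don't quite produce. A sketch of a cumulative variant using cumsum, assuming counts is the per-month count Series from the question (0-based indices):
# counts: output of leave.groupby(leave['date'].dt.strftime('%m'))['date'].count()
ends = counts.cumsum() - 1                          # last index of each month's block
starts = ends.shift(1).fillna(-1).astype(int) + 1   # first index of each block
ranges = starts.astype(str) + ' - ' + ends.astype(str)
print(ranges)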
I think you are asking for a data frame containing the indices of first and last occurrences of each value.
How about something like this?
Example data (note -- it's better to include reproducible data in your question so I don't have to guess):
import pandas as pd
import numpy as np
np.random.seed(123)
n = 500
df = pd.DataFrame(
    {'date': pd.to_datetime(
        pd.DataFrame({'year': np.random.choice(range(2017, 2019), size=n),
                      'month': np.random.choice(range(1, 13), size=n),
                      'day': np.random.choice(range(1, 28), size=n)})
    )}
)
Approach:
pd.DataFrame(({'_month_': x, 'firstIndex': y[0], 'lastIndex': y[-1]}
              for x, y in df.index.groupby(df['date'].dt.month).items()))
Result:
_month_ firstIndex lastIndex
0 1 0 495
1 2 21 499
2 3 1 488
3 4 5 498
4 5 14 492
5 6 12 470
6 7 15 489
7 8 2 494
8 9 18 475
9 10 3 491
10 11 10 473
11 12 7 497
If you are only going to use it for indexing in a loop, you wouldn't have to wrap it in pd.DataFrame() -- you could just leave it as a generator.
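An alternative to the hand-built generator, under the same assumed sample frame, is to group the index labels themselves and take the smallest and largest per group:
# smallest and largest index label for each month of 'date'
month = df['date'].dt.month
idx = df.index.to_series()
bounds = pd.DataFrame({'firstIndex': idx.groupby(month).min(),
                       'lastIndex': idx.groupby(month).max()})
print(bounds)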

Changing the data format of pivoted data in DataFrames using pandas

The Scenario
My dataset was in the following format, which I refer to as the ACTUAL FORMAT:
uid iid rat tmp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
and when passing it to another function (KMeans clustering), it needs to be in a format like the following, which I created using a pivot mapping and refer to as the MATRIX FORMAT:
uid 1 2 3 4
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
5 4 3 2.1952622349 3.1913491995
6 4 3.4233243638 3.8255108621 3.948791424
7 4.4983411706 4.0477240538 4.0241460801 5
8 4.1773004578 4.0191412859 4.0442369862 4.1754642909
9 4.2733984521 4.2797130861 4.2682723131 4.2816986988
15 1 3.0554789259 3.2279546684 3.1282278957
16 5 4.3473697565 4.0675394438 5
The Problem:
Now, since I need the resulting MATRIX FORMAT data to be passed back to the first algorithm, I need to convert it to the ACTUAL FORMAT.
Conversion:
For the conversion from ACTUAL to MATRIX format I did:
Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')
I tried reversing and interchanging the values to get back the ACTUAL FORMAT, but that failed. Is there any way to convert the MATRIX back to the ACTUAL FORMAT?
You need stack, with rename_axis for the column names, and finally reset_index:
df = df.stack().rename_axis(('uid','iid')).reset_index(name='rat')
print (df.head())
uid iid rat
0 4 1 4.332076
1 4 2 4.340775
2 4 3 4.311200
3 4 4 4.341143
4 5 1 4.000000
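To sanity-check the round trip, a small sketch with made-up ratings, assuming the ACTUAL FORMAT column names from the question:
import pandas as pd

source_data = pd.DataFrame({'uid': [196, 196, 186],
                            'iid': [242, 377, 302],
                            'rat': [3, 1, 3]})

# ACTUAL -> MATRIX
matrix = source_data.pivot(values='rat', index='uid', columns='iid')

# MATRIX -> ACTUAL (stack drops the NaN holes created by the pivot)
restored = matrix.stack().rename_axis(('uid', 'iid')).reset_index(name='rat')
print(restored)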
