Get index of where a group starts and ends in pandas - Python

I grouped my data by month. Now I need to know at which observation/index each group starts and ends.
What I have is the following output, where the second column represents the number of observations in each month:
date
01 145
02 2232
03 12785
04 16720
Name: date, dtype: int64
with this code:
leave.groupby([leave['date'].dt.strftime('%m')])['date'].count()
What I want though is an index range I could access later, something like this (the format doesn't really matter, and I don't mind if it returns a list or a data frame):
date
01 0 - 145
02 146 - 2378
03 2378 - 15163
04 15164 - 31884

Try the following, using shift:
df['data'] = (df['data'].shift(1).add(1).fillna(0).astype(int).astype(str)
              + ' - ' + df['data'].astype(str))
OUTPUT:
data
date
1 0 - 145
2 146 - 2232
3 2233 - 12785
4 12786 - 16720
5 16721 - 30386
6 30387 - 120157
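Note this assumes df['data'] already holds the running (cumulative) end position of each group. Starting from the raw per-month counts in the question, a sketch like the following (index labels assumed) would build the ranges by taking a cumulative sum first:
import pandas as pd

# per-month counts, as in the question's groupby output
counts = pd.Series([145, 2232, 12785, 16720],
                   index=['01', '02', '03', '04'], name='date')

ends = counts.cumsum() - 1                           # last row position of each month
starts = ends.shift(1).add(1).fillna(0).astype(int)  # first row position of each month
print(starts.astype(str) + ' - ' + ends.astype(str))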

I think you are asking for a data frame containing the indices of the first and last occurrences of each value.
How about something like this?
Example data (note -- it's better to include reproducible data in your question so I don't have to guess):
import pandas as pd
import numpy as np

np.random.seed(123)
n = 500
df = pd.DataFrame(
    {'date': pd.to_datetime(
        pd.DataFrame({'year': np.random.choice(range(2017, 2019), size=n),
                      'month': np.random.choice(range(1, 13), size=n),
                      'day': np.random.choice(range(1, 28), size=n)})
    )}
)
Approach:
pd.DataFrame(
    ({'_month_': x, 'firstIndex': y[0], 'lastIndex': y[-1]}
     for x, y in df.index.groupby(df['date'].dt.month).items())
)
Result:
_month_ firstIndex lastIndex
0 1 0 495
1 2 21 499
2 3 1 488
3 4 5 498
4 5 14 492
5 6 12 470
6 7 15 489
7 8 2 494
8 9 18 475
9 10 3 491
10 11 10 473
11 12 7 497
If you are only going to use it for indexing in a loop, you wouldn't have to wrap it in pd.DataFrame() -- you could just leave it as a generator.
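For example, a minimal sketch iterating it directly:
for month, idx in df.index.groupby(df['date'].dt.month).items():
    # idx is a pandas Index of the row labels belonging to that month
    print("month {}: first index {}, last index {}".format(month, idx[0], idx[-1]))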

Related

data cleansing - change the data in one column if criteria in 2 columns are met

I have columns with vehicle data. For vehicles more than 1 year old with mileage less than 100, I want to replace the mileage with 1000.
My attempts:
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)], 1000
Error - AttributeError: 'tuple' object has no attribute
and
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)]
mileage_corr['mileage'].where(mileage_corr['mileage'] <= 100, 1000, inplace=True)
error -
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return self._where(
The first attempt fails because the trailing ", 1000" makes mileage_corr a tuple of (DataFrame, 1000), and the second mutates a copy returned by .loc, hence the warning. Without complete information, and assuming your vehicle_data_all DataFrame looks something like this,
year mileage
0 2019 192
1 2014 78
2 2010 38
3 2018 119
4 2019 4
5 2012 122
6 2005 50
7 2015 69
8 2004 56
9 2003 194
Pandas has a way of assigning based on a filter result; this is referred to as setting values:
df.loc[condition, "field_to_change"] = desired_change
Applied to your DataFrame, it would look something like this:
vehicle_data_all.loc[((vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)), "mileage"] = 1000
This was my result:
year mileage
0 2019 192
1 2014 1000
2 2010 1000
3 2018 119
4 2019 1000
5 2012 122
6 2005 1000
7 2015 1000
8 2004 1000
9 2003 194
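For completeness, an equivalent formulation uses Series.mask, which replaces values where the condition holds (an alternative, not what the answer above uses):
mask = (vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)
vehicle_data_all["mileage"] = vehicle_data_all["mileage"].mask(mask, 1000)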

How can I loop through a pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per ID and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all 3 columns with DataFrame.sort_values, get the difference per group with DataFrameGroupBy.diff, and replace missing values with a 0 timedelta using Series.fillna:
# if the values are already strings, astype(str) can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds by adding Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
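A minimal end-to-end sketch on toy data (values taken from the desired output above):
import pandas as pd

df = pd.DataFrame({
    'ID': [995812, 995812, 995812, 995812, 995820, 995820],
    'Location': [696, 730, 761, 771, 381, 761],
    'Time': ['07:10:36', '07:11:41', '07:12:30', '07:20:49',
             '06:55:07', '07:12:44'],
})

df['Time'] = pd.to_timedelta(df['Time'])
df = df.sort_values(['ID', 'Location', 'Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
print(df)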
If you just wanted to iterate over the groupby object, as your original question title suggests, you can do it like this:
for (x, y) in df.groupby(['ID', 'Location', 'Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this works for 10,000 or 100,000 rows, it does not perform well for 10^6 rows or more.

How can I extract only numbers from this column?

Suppose you have a column in Excel with values like this. There are only 5500 numbers present, but it shows length 5602, which means that 102 strings are present:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only numeric values like this in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? I have attached my code in Python using pandas:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0

select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or just another approach, importing numbers and using a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem when you are extracting the column. You are using ['Selection No.'], but the name actually has a trailing space, i.e. ['Selection No. ']; that's the reason you are getting a KeyError when executing it. Try and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}', sle): tries to test the column value sle IN a match object (which "always has a boolean value of True"), and re.match returns None when there's no match, which is what produces the "argument of type 'NoneType' is not iterable" error.
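For reference, a corrected sketch of the original function tests the match result directly instead of using in:
import re

def selection(sle):
    # re.match returns a Match object on success and None otherwise
    return 1 if re.match(r'[3-4][0-9]{4}', str(sle)) else 0

select['status'] = select['Selection No.'].apply(selection)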
Instead, I would suggest proceeding with the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required - use pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
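Note that the str accessor yields NaN for non-string values, so if the column mixes real numbers and strings (as the sample suggests), a sketch that normalizes everything to strings first may be safer:
df['Status'] = (df['Selection No.'].astype(str)
                  .str.contains('^[3-4][0-9]{4}$', regex=True)
                  .astype(int))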

Changing a coded min to a datetime in python pandas

I have a data set which looks like this. I must mention that 263 means (0-15 min), 264 means (16-30 min), 265 means (31-45 min), and 266 means (46-60 min). I need to convert these columns to a single column in the format YYYY-MM-DD HH:MM:SS.
LOCAL_YEAR LOCAL_MONTH LOCAL_DAY LOCAL_HOUR VALUE FLAG STATUS MEAS_TYPE_ELEMENT_ALIAS
2006 4 11 0 0 R 263
2006 4 11 0 0 R 264
2006 4 11 0 0 R 265
2006 4 11 0 0 R 266
2006 4 11 1 0 R 263
2006 4 11 1 0 R 264
2006 4 11 1 0 R 265
2006 4 11 1 0 R 266
I was wondering if anyone could help me with this?
This is the code:
import pandas as pd
import numpy as np
import datetime

raw_data = pd.read_csv('Squamish_263_264_265_266.csv')

# Reading rainfall and years
df = raw_data.iloc[:, [2, 3, 4, 5, 6, 9]]
#print(df)

dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['MEAS_TYPE_ELEMENT_ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)

for row, v in df.iterrows():
    df.loc[row, 'date'] = datetime.datetime(v['LOCAL_YEAR'], v['LOCAL_MONTH'],
                                            v['LOCAL_DAY'], v['LOCAL_HOUR'],
                                            v['MEAS_TYPE_ELEMENT_ALIAS_map'])
but it gives this error:
TypeError: integer argument expected, got float
and
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Use a map to translate the alias into a minute and then iterate to build your dates:
dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)
df.reset_index(inplace=True)

for row in df.head(50).itertuples():
    df.loc[row[0], 'date'] = datetime.datetime(int(row[1]), row[2], row[3],
                                               row[4], row[-1])
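A vectorized alternative (a sketch assuming the same column names) avoids the row loop entirely, since pd.to_datetime can assemble datetimes from a DataFrame of named date parts:
import pandas as pd

dmap = {263: 0, 264: 16, 265: 31, 266: 46}
parts = pd.DataFrame({
    'year':   df['LOCAL_YEAR'],
    'month':  df['LOCAL_MONTH'],
    'day':    df['LOCAL_DAY'],
    'hour':   df['LOCAL_HOUR'],
    'minute': df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap),
})
df['date'] = pd.to_datetime(parts)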

pandas: how to run a pivot with a multi-index?

I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2', and a 'value' field with numerical values. I want the index to be year + month.
The only way I managed to get this to work was to combine the two fields into one, then separate them again. Is there a better way?
Minimal code copied below. Thanks a lot!
PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.
import pandas as pd
import numpy as np
df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)
df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))
# This doesn't work:
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')
# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')
# This below works but is not ideal:
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']
mypiv = df.pivot('new field', 'item', 'value').reset_index()
mypiv['year'] = mypiv['new field'].apply(lambda x: int(x) // 100)  # integer division to recover the year
mypiv['month'] = mypiv['new field'] % 100
You can group and then unstack.
>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
Or use pivot_table:
>>> df.pivot_table(
...     values='value',
...     index=['year', 'month'],
...     columns='item',
...     aggfunc=np.sum)
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
I believe if you include item in your MultiIndex, then you can just unstack:
df.set_index(['year', 'month', 'item']).unstack(level=-1)
This yields:
value
item item 1 item 2
year month
2004 1 21 277
2 43 244
3 12 262
4 80 201
5 22 287
6 52 284
7 90 249
8 14 229
9 52 205
10 76 207
11 88 259
12 90 200
It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
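Note the result keeps value as an outer column level; a small sketch to flatten it:
piv = df.set_index(['year', 'month', 'item']).unstack(level=-1)
piv.columns = piv.columns.droplevel(0)  # drop the outer 'value' level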
The following worked for me (pivot accepts list-like index and columns arguments in pandas >= 1.1):
mypiv = df.pivot(index=['year', 'month'], columns='item')[['values1', 'values2']]
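Adapted to the question's columns, that would be something like:
mypiv = df.pivot(index=['year', 'month'], columns='item', values='value')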
Thanks to gmoutso's comment, you can use this:
def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df
usage:
df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
If you want a simple flat column structure, with columns of their intended type, simply add this:
(df
.infer_objects() # coerce to the intended column type
.rename_axis(None, axis=1)) # flatten column headers
