I have a dataframe where the index is a timestamp.
DATE VALOR
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
... ...
2021-04-30 19:00:00 0.77059
2021-04-30 20:00:00 0.49285
2021-04-30 21:00:00 0.49057
2021-04-30 22:00:00 0.50339
2021-04-30 23:00:00 0.48792
I'm searching for a specific date:
drop.loc['2020-12-01 04:00:00']
VALOR 0.0108
Name: 2020-12-01 04:00:00, dtype: float64
I want to get the positional index of the search above.
In this case it is row 5. Then I want to use this value to slice the dataframe:
drop[:5]
Thanks!
It looks like you want to subset drop up to index '2020-12-01 04:00:00'.
Then simply do this: drop.loc[:'2020-12-01 04:00:00']
No need to manually get the line number.
output:
VALOR
DATE
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
If you really want to get the position:
pos = drop.index.get_loc('2020-12-01 04:00:00')  # returns: 4
drop[:pos+1]
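A self-contained sketch of both approaches, assuming the frame is named drop with a DatetimeIndex as in the question (the toy values mirror the sample):
import pandas as pd

idx = pd.date_range('2020-12-01', periods=5, freq='H')
drop = pd.DataFrame({'VALOR': [0.00635, 0.00941, 0.01151, 0.00281, 0.01080]},
                    index=idx)

# Label-based slicing: .loc includes both endpoints.
print(drop.loc[:'2020-12-01 04:00:00'])

# Position-based: get_loc returns the integer position (here 4),
# so slice with pos + 1 to include that row.
pos = drop.index.get_loc('2020-12-01 04:00:00')
print(drop.iloc[:pos + 1])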
I am new to data analysis with Python/pandas, coming from a Matlab background. I am trying to group data and then process the individual groups. However, I cannot figure out how to actually access the grouping result.
Here is my setup: I have a pandas dataframe df with a regularly spaced DateTime index timestamp at 10-minute frequency. My data spans several weeks in total. I now want to group the data by days, like so:
grouping = df.groupby(pd.Grouper(level="timestamp", freq="D"))
Note that I do not want to aggregate the groups (contrary to most examples and tutorials, it seems). I simply want to take each group in turn and process it individually, like so (does not work):
for g in grouping:
    g_df = g.toDataFrame()
    some_processing(g_df)
How do I do that? I haven't found any way to extract daily dataframe objects from the DataFrameGroupBy object.
Expand your groups into a dictionary of dataframes:
data = dict(list(df.groupby(df.index.date.astype(str))))
>>> data.keys()
dict_keys(['2021-01-01', '2021-01-02'])
>>> data['2021-01-01']
value
timestamp
2021-01-01 00:00:00 0.405630
2021-01-01 01:00:00 0.262235
2021-01-01 02:00:00 0.913946
2021-01-01 03:00:00 0.467516
2021-01-01 04:00:00 0.367712
2021-01-01 05:00:00 0.849070
2021-01-01 06:00:00 0.572143
2021-01-01 07:00:00 0.423401
2021-01-01 08:00:00 0.931463
2021-01-01 09:00:00 0.554809
2021-01-01 10:00:00 0.561663
2021-01-01 11:00:00 0.537471
2021-01-01 12:00:00 0.461099
2021-01-01 13:00:00 0.751878
2021-01-01 14:00:00 0.266371
2021-01-01 15:00:00 0.954553
2021-01-01 16:00:00 0.895575
2021-01-01 17:00:00 0.752671
2021-01-01 18:00:00 0.230219
2021-01-01 19:00:00 0.750243
2021-01-01 20:00:00 0.812728
2021-01-01 21:00:00 0.195416
2021-01-01 22:00:00 0.178367
2021-01-01 23:00:00 0.607105
Note: I changed the group keys to make indexing easier: '2021-01-01' instead of Timestamp('2021-01-01 00:00:00', freq='D').
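Since iterating a DataFrameGroupBy yields (key, DataFrame) pairs, each day can then be processed straight from the dictionary above; a short sketch reusing the question's hypothetical some_processing:
for day, g_df in data.items():
    some_processing(g_df)  # each g_df is an ordinary one-day DataFrame
Equivalently, for key, g_df in df.groupby(...): works without building the dictionary at all.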
Why does this not work out?
I get the right results if I just print it out, but if I use the same expression to assign to a df column, I get NaN values...
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_First, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but want to get the first elements:
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
What happens here? If you use:
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output is the values aggregated by the column cumsum with the aggregation function first.
So the index consists of the unique cumsum values; if you assign that to a new column, there is a mismatch with the original index, and the output is all NaN.
The solution is to use GroupBy.transform, which repeats the aggregated values into a Series (column) of the same size as the original DataFrame, so the index matches the original and the assignment works perfectly:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")
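A minimal sketch of the mismatch, using a toy frame with a precomputed cumsum column (names and values are illustrative):
import pandas as pd

df = pd.DataFrame({
    'cumsum': [1, 1, 2, 2],
    'Date': pd.to_datetime(['2021-01-05 11:00', '2021-01-05 12:00',
                            '2021-01-06 08:00', '2021-01-06 09:00']),
})

agg = df.groupby('cumsum')['Date'].first()
print(agg.index.tolist())  # [1, 2] -> does not align with df's index 0..3

df['Date_First'] = df.groupby('cumsum')['Date'].transform('first')
print(df)  # every row now carries its group's first Date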
I have a dataframe which contains data measured at two-hour intervals each day; some time intervals, however, are missing. My dataset looks like below:
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
I'm trying to insert the missing time intervals and fill their values with NaN:
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
I would appreciate any help on how to achieve this in Python, as I'm a newbie starting out with Python.
Create DatetimeIndex and use DataFrame.asfreq:
print (df)
date val
0 2020-12-01 08:00:00 145.9
1 2020-12-01 10:00:00 100.0
2 2020-12-01 16:00:00 99.3
3 2020-12-01 18:00:00 91.0
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('2H')
print (df)
val
date
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
assuming your df looks like
datetime value
0 2020-12-01T08:00:00 145.9
1 2020-12-01T10:00:00 100.0
2 2020-12-01T16:00:00 99.3
3 2020-12-01T18:00:00 91.0
make sure the datetime column is of datetime dtype:
df['datetime'] = pd.to_datetime(df['datetime'])
so that you can now resample to 2-hourly frequency:
df.resample('2H', on='datetime').mean()
value
datetime
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
Note that you don't need to set the on= keyword if your df already has a datetime index. The df resulting from resampling will have a datetime index.
Also note that I'm using .mean() as the aggregation function, meaning that if you have multiple values within a two-hour interval, you'll get the mean of those values.
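If averaging is not what you want, other resample aggregations work the same way; for example, to keep the first observation in each two-hour bin:
df.resample('2H', on='datetime').first()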
You can try the following. I have used datetime and timedelta for this:
from datetime import datetime, timedelta
# Assuming that the data is given like below.
data = ['2020-12-01 08:00:00 145.9',
'2020-12-01 10:00:00 100.0',
'2020-12-01 16:00:00 99.3',
'2020-12-01 18:00:00 91.0']
# initialize the start time using data[0]
date = data[0].split()[0].split('-')
time = data[0].split()[1].split(':')
start = datetime(int(date[0]), int(date[1]), int(date[2]), int(time[0]), int(time[1]), int(time[2]))
newdata = []
newdata.append(data[0])
i = 1
while i < len(data):
    nxt = start + timedelta(hours=2)
    if str(nxt) != (data[i].split()[0] + ' ' + data[i].split()[1]):
        newdata.append(str(nxt) + ' NaN')
    else:
        newdata.append(data[i])
        i += 1
    start = nxt
newdata
NOTE: timedelta(hours=2) will add 2 hours to the existing time.
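Tracing the loop on the sample data above, newdata ends up as:
['2020-12-01 08:00:00 145.9',
 '2020-12-01 10:00:00 100.0',
 '2020-12-01 12:00:00 NaN',
 '2020-12-01 14:00:00 NaN',
 '2020-12-01 16:00:00 99.3',
 '2020-12-01 18:00:00 91.0']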
I have a DataFrame with users indicated by the column 'user_id'. Each of these users has several entries in the dataframe based on the date on which they did something, which is also a column. The dataframe looks something like:
df:
user_id date
0 2019-04-13 02:00:00
0 2019-04-13 03:00:00
3 2019-02-18 22:00:00
3 2019-02-18 23:00:00
3 2019-02-19 00:00:00
3 2019-02-19 02:00:00
3 2019-02-19 03:00:00
3 2019-02-19 04:00:00
8 2019-04-05 04:00:00
8 2019-04-05 05:00:00
8 2019-04-05 06:00:00
8 2019-04-05 15:00:00
15 2019-04-28 19:00:00
15 2019-04-28 20:00:00
15 2019-04-29 01:00:00
23 2019-06-24 02:00:00
23 2019-06-24 05:00:00
23 2019-06-24 06:00:00
24 2019-03-27 12:00:00
24 2019-03-27 13:00:00
What I want to do is, for example, select the first 3 users. I wanted to do this with code like this:
df.groupby('user_id').iloc[:3]
I know that groupby doesn't have an iloc, so how can I achieve the same kind of slicing on the groups?
I found a way based on crayxt's answer:
df[df['user_id'].isin(df['user_id'].unique()[:3])]
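As a quick check on the sample data: unique() preserves order of appearance, so df['user_id'].unique()[:3] gives array([0, 3, 8]) and the filter keeps exactly the rows of those three users (a sketch, assuming the frame is named df as above):
first_three = df['user_id'].unique()[:3]  # array([0, 3, 8])
print(df[df['user_id'].isin(first_three)])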
I have the dataframe below, and I want to make an hourly mean dataframe,
with the condition that for every hour the mean is calculated only from the 00:15:00~00:45:00 values.
date/time are a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The results should be as below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select rows whose times end with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       lvl1new.where(m)], names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
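An alternative sketch for the second part, assuming you start again from the original df1 and that both index levels stringify cleanly: combine the levels into a DatetimeIndex, keep only the 15/30/45-minute rows, and resample hourly.
ts = pd.to_datetime(df1.index.get_level_values(0).astype(str) + ' '
                    + df1.index.get_level_values(1).astype(str))
s = pd.Series(df1['aaa'].to_numpy(), index=ts)
out = s[s.index.minute.isin([15, 30, 45])].resample('H').mean().dropna()
print(out)  # 2017-01-01 00:00:00 -> 142.56, matching the result above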