I need to find the most popular start hour.
The following code gives me the correct answer:
import time
import pandas as pd
import numpy as np
# bunch of code comes
# here
# that help in reaching the following steps
df = pd.read_csv(CITY_DATA[selected_city])
# convert the Start Time column to datetime
df['Start Time'] = pd.to_datetime(df['Start Time'])
# extract hour from the Start Time column to create an hour column
df['hour'] = df['Start Time'].dt.hour
# extract month and day of week from Start Time to create new columns
df['month'] = df['Start Time'].dt.month
df['day_of_week'] = df['Start Time'].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
# find the most popular hour
popular_hour = df['hour'].mode()[0]
Here is a sample of the output I get when I run
print(df['hour'])
0 15
1 17
2 8
3 13
4 14
5 9
6 9
7 17
8 16
9 17
10 7
11 17
Name: hour, Length: 300000, dtype: int64
The output I get when I use
print(type(df['hour']))
<class 'pandas.core.series.Series'>
The value of the most popular start hour is stored in popular_hour, which equals 17 (the correct value).
However, I don't understand the .mode()[0] part.
What does .mode() do, and why [0]?
And will the same approach work for calculating the most popular month and day of the week, irrespective of their datatype?
Calling mode on a column returns a Series:
df['hour'].mode()
0    17
dtype: int64
From this, you take the first item by calling
df['hour'].mode()[0]
17
Note that a Series is always returned, and sometimes if there are multiple values for mode, they are all returned:
pd.Series([1, 1, 2, 2, 3, 3]).mode()
0 1
1 2
2 3
dtype: int64
You would still take the first value each time and discard the rest. Note that when multiple modes are returned, they are always sorted.
Read the documentation on mode for more info.
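To address the datatype part of the question, here is a minimal sketch showing that mode() behaves the same way on integer and string columns. The sample values below are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 'hour' and 'day_of_week' columns.
df = pd.DataFrame({
    "hour": [15, 17, 8, 17, 9, 17],
    "day_of_week": ["Monday", "Friday", "Friday", "Sunday", "Friday", "Monday"],
})

hour_modes = df["hour"].mode()             # a Series, even when there is only one mode
popular_hour = hour_modes[0]               # first entry of that Series
popular_day = df["day_of_week"].mode()[0]  # the same call works on strings

# With a tie, every mode is returned, in sorted order.
tied = pd.Series(["b", "a", "b", "a"]).mode()

print(popular_hour, popular_day, list(tied))
```

Taking [0] on a tie silently picks the smallest mode, so it may be worth checking len() of the result if ties matter for your report.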
Related
I have the following issue in Python. I have constructed a dataframe with a column named Elapsed Days and 15 columns that represent the coming months, starting from the current month, and filled it with values. Now I want to change those values, but only in the months where the elapsed days will exceed 365 (there the value should become 1000 instead of 1).
For example we have the following dataframe
Elapsed Days   month 1   month 2   ...   month 15
1000           1         1         ...   1
600            1         1         ...   1
In order to see if based on the elapsed days we will have months that will need a change I did the following:
df['Dif'] = df['Elapsed Days'] - 450
df['Month Change'] = round(df['Dif']/30)
The Month Change column now holds the month in which the values start to change.
In our case it shows:
Elapsed Days   Month Change   month 1   month 2   ...   month 15
1000           18             1         1         ...   1
600            3              1         1         ...   1
This means that in the second row, the values from the 3rd month onward should be changed to 1000.
I tried the following to make the changes:
less_than_15_months = df['Month Change'] <=15
positive_months = df['Month Change'] >0
months.loc[less_than_15_months & positive_months , months.columns[21+df['Month Change']]: ] = 1000
I add 21 because my dataframe has 21 columns before the month columns start. However, I get the following error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
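A single .loc column slice cannot start at a different column for each row, which is what the attempt above needs. One possible workaround, sketched here under assumptions, is to build a boolean mask with NumPy broadcasting and apply it with np.where. The frame below is hypothetical: it uses 4 month columns instead of 15 and omits the 21 leading columns, so the offset is not needed.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened frame: 4 month columns instead of 15.
month_cols = ["month 1", "month 2", "month 3", "month 4"]
df = pd.DataFrame({"Month Change": [18, 3, 0, 1]})
for col in month_cols:
    df[col] = 1

month_numbers = np.arange(1, len(month_cols) + 1)  # 1, 2, 3, 4
change = df["Month Change"].to_numpy()[:, None]    # column vector, one entry per row

# True where a column's month number is at or past that row's change month,
# limited to rows whose change month is positive and within the available months.
mask = (month_numbers >= change) & (change > 0) & (change <= len(month_cols))

# Write 1000 where the mask is True and keep the old value elsewhere.
df[month_cols] = np.where(mask, 1000, df[month_cols].to_numpy())
print(df)
```

In this sketch the first and third rows are left alone (change month out of range or not positive), the second row changes from month 3 onward, and the last row changes in every month.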
I have a dataframe with more than 4 million rows and 30 columns. I am just providing a sample of my patient dataframe
df = pd.DataFrame({
    'subject_ID':[1,1,1,1,1,2,2,2,2,2,3,3,3],
    'date_visit':['1/1/2020 12:35:21','1/1/2020 14:35:32','1/1/2020 16:21:20','01/02/2020 15:12:37','01/03/2020 16:32:12',
                  '1/1/2020 12:35:21','1/3/2020 14:35:32','1/8/2020 16:21:20','01/09/2020 15:12:37','01/10/2020 16:32:12',
                  '11/01/2022 13:02:31','13/01/2023 17:12:31','16/01/2023 19:22:31'],
    'item_name':['PEEP','Fio2','PEEP','Fio2','PEEP','PEEP','PEEP','PEEP','PEEP','PEEP','Fio2','Fio2','Fio2']})
I would like to do two things
1) Find the subjects and their records which are missing in the sequence
2) Get the count of item_name for each subjects
For q2, this is what I tried
df.groupby(['subject_ID','item_name']).count() # this produces output, but the column name is wrong: why does the count show up under the `date_visit` column?
For q1, this is what I am trying
df['day'].le(df['shift_date'].add(1))
I expect my output to look like the one shown below
You can get the first part with:
In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name   Fio2  PEEP
subject_ID
1              2     3
2              0     5
3              3     0
EDIT:
I think you've still got your date formats a bit messed up in your sample output, and strongly recommend switching everything to the ISO 8601 standard since that prevents problems like that down the road. pandas won't correctly parse that 11/01/2022 entry on its own, so I've manually fixed it in the sample.
Using what I assume these dates are supposed to be, you can find the gaps by grouping and using .resample():
In [73]: df['dates'] = pd.to_datetime(df['date_visit'])
In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")
In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())
In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
   subject_ID      dates
0           2 2020-01-02
1           2 2020-01-04
2           2 2020-01-05
3           2 2020-01-06
4           2 2020-01-07
5           3 2022-01-12
6           3 2022-01-14
7           3 2022-01-15
You can then add seq status to that first frame by checking whether the ID shows up in this new frame.
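A minimal sketch of that last step, assuming the counts frame and the missing-dates frame look like the outputs above. The names counts, missing and seq_status are placeholders chosen for this example, and the frames are rebuilt by hand (with a shortened list of missing dates) so the snippet runs on its own.

```python
import pandas as pd

# Hand-built stand-ins for the two frames produced above (hypothetical names).
counts = pd.DataFrame(
    {"Fio2": [2, 0, 3], "PEEP": [3, 5, 0]},
    index=pd.Index([1, 2, 3], name="subject_ID"),
)
missing = pd.DataFrame({
    "subject_ID": [2, 2, 3],
    "dates": pd.to_datetime(["2020-01-02", "2020-01-04", "2022-01-12"]),
})

# A subject has a broken sequence if it appears at least once in the missing-dates frame.
counts["seq_status"] = counts.index.isin(missing["subject_ID"])
print(counts)
```

Here True means the subject has at least one missing day; map the booleans to whatever labels your expected output uses.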
My pandas dataframe consists of a Column "timeStamp", the elements of which are of type datetime.datetime. I'm trying to obtain the difference between two consecutive rows of this column to obtain the time spent in seconds. I use the following piece of code for it.
df["Time"] = df["timeStamp"].diff(0).dt.total_seconds()
Generally it's working fine, however, I keep getting 0.0 as a result of this operation in quite a few instances even when it's not the case.
Example values that result in 0.0:
import pandas as pd
import datetime
import numpy as np
df = pd.DataFrame({
    'S.No.': [1, 2, 3, 4],
    'ABC': [datetime.datetime(2019,2,25,11,49,50),
            datetime.datetime(2019,2,25,11,50,0),
            datetime.datetime(2019,2,25,11,50,7),
            datetime.datetime(2019,2,25,11,50,12)]})
df["Time"] = df["ABC"].diff(0).dt.seconds
print df
Note: using python2.7
Try this:
print(df["timestamp"].diff().fillna(pd.Timedelta(0)).dt.seconds)
0 0
1 10
2 7
3 5
df['difference']=df["timestamp"].diff().fillna(pd.Timedelta(0)).dt.seconds
print(df)
timestamp difference
0 2019-02-25 11:49:50 0
1 2019-02-25 11:50:00 10
2 2019-02-25 11:50:07 7
3 2019-02-25 11:50:12 5
Use
df["Time"] = df["timeStamp"].diff().dt.total_seconds()
instead.
The argument to diff specifies how many rows above the current row to subtract. You are passing 0, so you're subtracting each value from itself, which always gives 0. If you leave it out, the default of 1 is used, which gives the difference from the row directly above.
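To make the effect of the argument concrete, here is a small sketch that reuses the four timestamps from the question and compares diff(0), diff() and diff(2). The variable names are chosen for this example only.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2019-02-25 11:49:50",
    "2019-02-25 11:50:00",
    "2019-02-25 11:50:07",
    "2019-02-25 11:50:12",
]))

same_row = ts.diff(0).dt.total_seconds()  # each row minus itself
prev_row = ts.diff().dt.total_seconds()   # each row minus the row above (periods=1)
two_back = ts.diff(2).dt.total_seconds()  # each row minus the row two above

print(pd.DataFrame({"diff(0)": same_row, "diff()": prev_row, "diff(2)": two_back}))
```

The first row of diff() and the first two rows of diff(2) are NaN because there is no earlier row to subtract.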
I have a list (actually a column in a pandas DataFrame, if this matters) of Timestamps and I'm trying to convert every element of the list to ordinal format. So I run a for loop through the list (is there a faster way?) and use:
import datetime as dt
a = a.toordinal()
or
import datetime as dt
a = dt.datetime.toordinal(a)
However, the following happened (simplified):
In[1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In[2]: b = dt.datetime.toordinal(a)
In[3]:b
Out[3]: 737418
In[4]:a = b
In[5]:a
Out[5]: Timestamp('1970-01-01 00:00:00.000737418')
The result makes absolutely no sense to me. Obviously, what I was trying to get is:
In[1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In[2]: b = dt.datetime.toordinal(a)
In[3]:b
Out[3]: 737418
In[4]:a = b
In[5]:a
Out[5]: 737418
What went wrong?
console output screenshot
What went wrong?
Your question is a bit misleading, and the screenshot shows what is going on.
Normally, when you write
a = b
in Python, it will bind the name a to the object bound to b. In this case, you will have
id(a) == id(b)
In your case, however, contrary to your question, you're actually doing the assignment
a[0] = b
This will call a method of a, assigning b to its 0 index. The object's class determines what happens in this case. Here, specifically, a is a pandas.Series, and it converts the object in order to conform to its dtype.
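The difference between rebinding a name and assigning to an index can be sketched with a float Series, which is a neutral stand-in here: how a datetime Series coerces an integer can depend on the pandas version, so this example does not try to reproduce the Timestamp result from the question.

```python
import pandas as pd

a = pd.Series([1.5, 2.5])  # float64 Series
b = 3                      # plain Python int

a[0] = b                   # item assignment: the Series decides how to store b
stored = a[0]              # the int was stored as a float to match the dtype
dtype_after = a.dtype      # still float64

a = b                      # name rebinding: a now refers to the very same object as b
print(type(stored), dtype_after, a is b)
```

Only the last assignment makes a and b the same object; the item assignment hands b to the Series, which converts it to fit its dtype.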
Please don't loop. It's not necessary.
#!/usr/bin/env python
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'dates': [datetime(1990, 4, 28),
datetime(2018, 4, 13),
datetime(2017, 11, 4)]})
print(df)
print(df['dates'].dt.day_name())  # dt.weekday_name was removed in pandas 1.0
print(df['dates'].dt.weekday)
print(df['dates'].dt.month)
print(df['dates'].dt.year)
gives the dataframe:
dates
0 1990-04-28
1 2018-04-13
2 2017-11-04
And the printed values
0 Saturday
1 Friday
2 Saturday
Name: dates, dtype: object
0 5
1 4
2 5
Name: dates, dtype: int64
0 4
1 4
2 11
Name: dates, dtype: int64
0 1990
1 2018
2 2017
Name: dates, dtype: int64
For the toordinal, you need to "loop" with apply:
print(df['dates'].apply(lambda x: x.toordinal()))
gives the following pandas series
0 726585
1 736797
2 736637
Name: dates, dtype: int64
I have two problems; here is the first:
I have a dataframe df in which the same userid appears many times, along with a date and some other unimportant columns:
userid date
0 243 2014-04-01
1 234 2014-12-01
2 234 2015-11-01
3 589 2016-07-01
4 589 2016-03-01
I am currently trying to group the rows by userid, sort the dates in descending order, and take the top twelve per group. My code looks like this:
df = df.groupby(['userid'], group_keys=False).agg(lambda x: x.sort_values(['date'], ascending=False, inplace=False).head(12))
And I get this error:
ValueError: cannot copy sequence with size 6 to array axis with dimension 12
For now, I want to avoid splitting the dataframe into individual ones.
My second problem is more complex:
I want to find out whether the sorted dates (per userid group) are consecutive months. That is, if a group has an entry such as userid: 234 and date: 2014-04-01, the next entry below must be userid: 234 and date: 2014-03-01. The day does not matter; only the year and month are important.
Only these 12 consecutive dates should be copied into another dataframe.
A second dataframe df2 contains the same userids, but each appears only once, along with another column, 'code'. Here is an example:
userid code
0 433805 1
24 5448 0
48 3434 1
72 34434 1
96 3202 1
120 23766 1
153 39457 0
168 4113 1
172 3435 5
374 34093 1
To summarize: I want to check whether there are 12 consecutive months per userid and copy every valid sequence into another dataframe. For this I also have to compare against the 'code' from df2.
This is a version of my code:
df['YearMonthDiff'] = df['date'].map(lambda x: 1000*x.year + x.month).diff()
df['id_before'] = df['userid'].shift()
final_df = pd.DataFrame()
for group in df.groupby(['userid'], group_keys=False):
    fi = group[1]
    if (fi['userid'] <> fi['id_before']) & group['YearMonthDiff'].all(-1.0) & df.loc[fi.userid]['code'] != 5:
        final_df.append(group['userid','date', 'consum'])
First I computed an integer from the date and called diff(). In other posts I saw people shift the column to compare the current row with the row before. Then I used groupby('userid') to iterate over the individual groups. Now it gets extra ugly: I try to find the beginning of each userid group, check that it contains only consecutive months and has the correct 'code', and finally append it to the final dataframe.
One of the biggest problems is comparing a row with the following row. I can iterate over them with iterrows(), but I cannot compare them without shift(). There is a calendar function, but I will take a look at it on the weekend. Sorry for the mess, I am new to pandas.
Does anyone have an idea how to solve my problem?
For your first problem, try this:
df.groupby(by='userid').apply(lambda x: x.sort_values(by='date',ascending=False).iloc[[e for e in range(12) if e <len(x)]])
Using groupby and nlargest, we get the index values of those largest dates. Then we use .loc to get just those rows
df.loc[df.groupby('userid').date.nlargest(12).index.get_level_values(1)]
Consider the dataframe df
dates = pd.date_range('2015-08-08', periods=10)
df = pd.DataFrame(dict(
userid=np.arange(2).repeat(4),
date=np.random.choice(dates, 8, False)
))
print(df)
date userid
0 2015-08-12 0 # <-- keep
1 2015-08-09 0
2 2015-08-11 0
3 2015-08-15 0 # <-- keep
4 2015-08-13 1
5 2015-08-10 1
6 2015-08-17 1 # <-- keep
7 2015-08-16 1 # <-- keep
We'll keep the latest 2 dates per user id
df.loc[df.groupby('userid').date.nlargest(2).index.get_level_values(1)]
date userid
0 2015-08-12 0
3 2015-08-15 0
6 2015-08-17 1
7 2015-08-16 1
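An alternative that is sometimes easier to read is to sort first and then take the head of each group. This is only a sketch: the sample rows loosely follow the question's example, with one extra row added for userid 589, and n is set to 2 instead of 12 to keep the data small.

```python
import pandas as pd

# Sample rows loosely based on the question; the last row is added for illustration.
df = pd.DataFrame({
    "userid": [243, 234, 234, 589, 589, 589],
    "date": pd.to_datetime([
        "2014-04-01", "2014-12-01", "2015-11-01",
        "2016-07-01", "2016-03-01", "2016-05-01",
    ]),
})

n = 2  # the question uses 12
latest = (
    df.sort_values("date", ascending=False)  # newest rows first
      .groupby("userid")
      .head(n)                               # keep at most n rows per userid
      .sort_values(["userid", "date"], ascending=[True, False])
)
print(latest)
```

Groups with fewer than n rows are kept whole, so no length check is needed.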
In case someone is interested, I solved my second problem as follows:
I cast the date to an int, calculated the difference, and shifted the userid by one row, as in my example. The rest is based on a solution I found on Stack Overflow:
gr_ob = df.groupby('userid')
gr_dict = gr_ob.groups
frames = []  # collect matching groups; DataFrame.append was removed in pandas 2.0
for group_name in gr_dict.keys():
    new_df = gr_ob.get_group(group_name)
    # != replaces the Python 2-only <> operator
    if (new_df['userid'].iloc[0] != new_df['id_before'].iloc[0]) & (new_df['YearMonthDiff'].iloc[1:] == -1.0).all() & (len(new_df) == 12):
        frames.append(new_df[['userid', 'date', 'consum']])
final_df = pd.concat(frames) if frames else pd.DataFrame(columns=['userid', 'date', 'consum'])
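One caveat with the 1000*x.year + x.month encoding from the question: it jumps by 989 between December and January, so a run that crosses a year boundary never shows a difference of exactly 1. A possible variant, sketched here under assumptions, encodes each date as 12*year + month instead. The helper name has_consecutive_months, the sample data and n = 3 (instead of 12) are all made up for this example, and the 'code' check against df2 is left out.

```python
import pandas as pd

def has_consecutive_months(dates: pd.Series, n: int) -> bool:
    """Return True if `dates` has exactly n entries covering n consecutive months."""
    # 12 * year + month increases by exactly 1 per month, also across year boundaries.
    month_index = (dates.dt.year * 12 + dates.dt.month).sort_values()
    return bool(len(month_index) == n and (month_index.diff().iloc[1:] == 1).all())

# Hypothetical sample: user 1 has three consecutive months, user 2 skips February.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2015-01-01", "2014-12-01", "2014-11-01",
        "2015-03-01", "2015-01-01", "2014-12-01",
    ]),
})

n = 3  # the question uses 12
ok_ids = [uid for uid, g in df.groupby("userid") if has_consecutive_months(g["date"], n)]
final_df = df[df["userid"].isin(ok_ids)]
print(final_df)
```

Filtering with isin keeps the original rows and columns, so any extra columns such as consum would come along unchanged.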