How to convert a 2-column pandas dataframe to datetime? - python

I have a pandas dataframe that has 2 columns: The first column is the minutes, and the second column is the seconds. It looks like this:
    min   s
0     0   0
1     0   1
2     0   2
3     0   3
4     0   4
5     0   5
6     0   6
7     0   7
8     0   8
9     0   9
10    0  10
11    0  11
12    0  12
How would I convert this to datetime in the format %M:%S?

Because you only have time columns, I believe you should avoid datetime and use timedelta instead.
I would try something like this:
import pandas as pd

df = pd.DataFrame({"minute": [0, 1, 2, 3, 4, 5], "second": [30, 40, 50, 60, 10, 20]})
# build "minute:second" strings from the two columns, then parse as timedeltas
df['time'] = df.agg('{0[minute]}:{0[second]}'.format, axis=1)
df['time'] = pd.to_timedelta('00:' + df['time'])
print(df)
This outputs the table below; you could delete the minute and second columns afterwards if they're unnecessary:
   minute  second      time
0       0      30  00:00:30
1       1      40  00:01:40
2       2      50  00:02:50
3       3      60  00:04:00
4       4      10  00:04:10
5       5      20  00:05:20
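Assuming the column names above, dropping the helper columns afterwards is a one-liner (a sketch using the standard drop API):
df = df.drop(columns=['minute', 'second'])  # keep only the timedelta column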

l = []
for MinSec in zip(df['min'], df['s']):
    l.append(':'.join(map(str, MinSec)))
pd.to_datetime(pd.Series(l), format='%M:%S')
Just zip the min and s columns, join each pair with ':' as the separator, and convert the resulting series with to_datetime() into the desired datetime format.
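If a timedelta result is acceptable, here is a fully vectorized sketch that skips the string building entirely (assuming the question's min and s column names):
import pandas as pd

df = pd.DataFrame({'min': [0, 0, 0], 's': [0, 1, 2]})
# combine the two columns numerically instead of via strings
df['time'] = pd.to_timedelta(df['min'], unit='m') + pd.to_timedelta(df['s'], unit='s')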

Related

How to group dataframe by column and receive new column for every group

I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following statistic:
np.sum(v1*v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column contains all NaN values - what is wrong in my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but this also sets the column name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally as there are only 2 rows when the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this using groupby transform which is designed to produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
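As a third option, here is a sketch that keeps the custom function but aligns the per-group summary back to the rows with map instead of join (assuming calc_some_stat as defined above):
# the apply result is a Series indexed by timestamp, so mapping the
# timestamp column through it broadcasts the group sums back per row
df['stat'] = df['timestamp'].map(df.groupby('timestamp').apply(calc_some_stat))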

Count number of columns above a date

I have a pandas dataframe with several columns, and I would like to know the number of date columns whose value is after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are many such columns and we don't want to list them one by one. Then we compare them against the cutoff date and sum the True values.
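Note that comparing against the string '2016-12-31' only works lexically when the columns hold ISO-formatted strings; if the dtypes are uncertain, converting first is safer (a sketch, assuming the Date* column names above):
date_cols = df.filter(like='Date').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)  # ensure real datetimes
df['Count'] = (df[date_cols] > pd.Timestamp('2016-12-31')).sum(axis=1)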

Sort part of DataFrame in Python Pandas, return new column with order depending on row values

My first question here, I hope this is understandable.
I have a Pandas DataFrame:
   order_numbers  x_closest_autobahn
0             34                   3
1             11                   3
2              5                   3
3              8                  12
4              2                  12
I would like to get a new column with the order_number per closest_autobahn in ascending order:
   order_numbers  x_closest_autobahn  order_number_autobahn_x
2              5                   3                        1
1             11                   3                        2
0             34                   3                        3
4              2                  12                        1
3              8                  12                        2
I have tried:
df['order_number_autobahn_x'] = ([df.loc[(df['x_closest_autobahn'] == 3)]].sort_values(by=['order_numbers'], ascending=True, inplace=True))
I have looked at slicing, sort_values and reset_index
df.sort_values(by=['order_numbers'], ascending=True, inplace=True)
df = df.reset_index() # reset index to the order after sort
df['order_numbers_index'] = df.index
but I can't seem to get the DataFrame I am looking for.
Use DataFrame.sort_values on both columns, and for the counter use GroupBy.cumcount:
df = df.sort_values(['x_closest_autobahn','order_numbers'])
df['order_number_autobahn_x'] = df.groupby('x_closest_autobahn').cumcount().add(1)
print (df)
   order_numbers  x_closest_autobahn  order_number_autobahn_x
2              5                   3                        1
1             11                   3                        2
0             34                   3                        3
4              2                  12                        1
3              8                  12                        2
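If the original row order should be preserved instead of sorting the frame, a rank-based sketch gives the same counter:
# rank order_numbers within each x_closest_autobahn group without reordering rows
df['order_number_autobahn_x'] = (df.groupby('x_closest_autobahn')['order_numbers']
                                   .rank(method='first')
                                   .astype(int))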

Subtracting Rows based on ID Column - Pandas

I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find out the number of days the user left as a gap, so I want a column computed for each row of each user; my dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (First gap will be 0 for all users)
1 2010-10-01 8 15 (11th Sept+5=16th Sept; 1st Oct - 16th Sept=15days)
1 2010-10-28 1 9
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May+12 days=18th May)
3 2010-08-09 10 0
3 2010-09-25 4 36
3 2010-10-01 2 2
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
Here is one approach using groupby + shift:
import pandas as pd

# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])

# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + \
         pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')

# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))

# convert to days and then to integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)

print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
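The final two-step conversion could also be written with the .dt accessor (same starts series as above):
df['Gap'] = (df['Date_watched'] - starts).dt.days.fillna(0).astype(int)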

Create a new column based on the state of another column in pandas

Suppose I want to create a new column that counts the number of days since the state was 1. As an example, the current columns would be the first three below. The fourth column is what I'm trying to get.
Index State Days Since_Days
1 1 0 0
2 0 20 20
3 0 40 40
4 1 55 55
5 1 60 5
6 1 70 10
Without resorting to for-loop, what is a pandas way to approach this?
You can also try the following: first group by State, and for the rows with State == 1, fill Since_Days with the group difference. The rows with State == 0 will then be NaN and can be filled from the corresponding Days column value:
df.loc[df.State == 1, 'Since_Days'] = df.groupby('State')['Days'].diff().fillna(0)
df['Since_Days'].fillna(df['Days'],inplace=True)
print(df)
Result:
Index State Days Since_Days
0 1 1 0 0.0
1 2 0 20 20.0
2 3 0 40 40.0
3 4 1 55 55.0
4 5 1 60 5.0
5 6 1 70 10.0
The values to be subtracted can be formed with:
ser = df['Days'].where(df['State']==1, np.nan).ffill().shift()
If you subtract this from the original Days column, you'll have:
df['Days'].sub(ser, fill_value=0).astype('int')
Out:
0 0
1 20
2 40
3 55
4 5
5 10
Name: Days, dtype: int64
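For completeness, here is a minimal self-contained sketch of this where/ffill/shift approach, reconstructing the example frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'State': [1, 0, 0, 1, 1, 1],
                   'Days': [0, 20, 40, 55, 60, 70]})
# carry the last Days value seen at State == 1 forward, then shift one row down
ser = df['Days'].where(df['State'] == 1, np.nan).ffill().shift()
df['Since_Days'] = df['Days'].sub(ser, fill_value=0).astype(int)
print(df)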
