unstack multiindex dataframe to flat data frame in pandas - python

I have a multi index df called groupt3 in pandas which looks like this when I enter groupt3.head():
                 datetime  song  sum  rat
artist datetime
2562   8                2     2   26    0
       46              19    19   26    0
       47               3     3   26    0
4Hero  1                2     2   32    0
       26              20    20   32    0
       9               10    10   32    0
I would like to have a "flat" data frame which takes the artist index and the datetime index and "repeats" them to form this:
artist  datetime  song  sum  rat
2562    8         2     26   0
2562    46        19    26   0
2562    47        3     26   0
etc...
Thanks.

Using pandas.DataFrame.to_records().
Example:
import pandas as pd
import numpy as np
arrays = [['Monday', 'Monday', 'Thursday', 'Thursday'],
          ['Morning', 'Noon', 'Morning', 'Evening']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Weekday', 'Time'])
df = pd.DataFrame(np.random.randint(5, size=(4, 2)), index=index)
In [39]: df
Out[39]:
                  0  1
Weekday  Time
Monday   Morning  1  3
         Noon     2  1
Thursday Morning  3  3
         Evening  1  2
In [40]: pd.DataFrame(df.to_records())
Out[40]:
    Weekday     Time  0  1
0    Monday  Morning  1  3
1    Monday     Noon  2  1
2  Thursday  Morning  3  3
3  Thursday  Evening  1  2

I think you can use reset_index:
import pandas as pd
import numpy as np
np.random.seed(0)
arrays = [['Monday', 'Monday', 'Thursday', 'Thursday'],
          ['Morning', 'Noon', 'Morning', 'Evening']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Weekday', 'Time'])
df = pd.DataFrame(np.random.randint(5, size=(4, 2)), index=index)
print(df)
                  0  1
Weekday  Time
Monday   Morning  4  0
         Noon     3  3
Thursday Morning  3  1
         Evening  3  2
print(df.reset_index())
    Weekday     Time  0  1
0    Monday  Morning  4  0
1    Monday     Noon  3  3
2  Thursday  Morning  3  1
3  Thursday  Evening  3  2
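Applied to the frame in the question, either answer should give the flat layout you describe. A rough sketch, assuming the index levels are named artist and datetime as in your groupt3.head() output:
flat = groupt3.reset_index()
# or, equivalently:
flat = pd.DataFrame(groupt3.to_records())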

Related

Creating Quarters column with months columns

So, using Python in a Jupyter notebook, I have this data df with a column named "Month" (more than 100,000 rows) containing the individual month numbers up to 12. I want to create another column in that same data set named "Quarters" so that it displays the quarter for each of those months.
I extracted the month from the "review_time" column using ".dt.strftime('%m')".
I am sorry if the provided information is not enough. I'm new to Stack Overflow.
So I extracted the month from the date column "review_time", created a variable a, and then added that variable a to the main table:
a = df['review_time'].dt.strftime('%m')
df.insert(2, "month", a, True)
This is the output of df['month'].info():
<class 'pandas.core.series.Series'>
Int64Index: 965093 entries, 1 to 989508
Series name: month
Non-Null Count Dtype
-------------- -----
965093 non-null object
dtypes: object(1)
memory usage: 14.7+ MB
You could use pandas.cut
Example with a generic dataframe:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'Month': [1,2,3,4,5,6,7,8,9,10,11,12]})
df['Quarter'] = pd.cut(df['Month'], [0,3,6,9,12], labels = [1,2,3,4])
print(df)
This prints:
Month Quarter
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 2
6 7 3
7 8 3
8 9 3
9 10 4
10 11 4
11 12 4
An alternative is to calculate the quarter directly from the month number: qtr = (month - 1) // 3 + 1
import numpy as np
import pandas as pd
from datetime import datetime
# lo and hi used to generate random dates in 2022
lo = datetime( 2022, 1, 1 ).toordinal()
hi = datetime( 2022, 12, 31 ).toordinal()
np.random.seed( 1234 )
dates = [ datetime.fromordinal( np.random.randint( lo, hi )) for _ in range( 20 )]
df = pd.DataFrame( { 'Date': dates } )
df['Qtr'] = ( df['Date'].dt.month - 1 ) // 3 + 1
print( df )
Result
Date Qtr
0 2022-10-31 4
1 2022-07-31 3
2 2022-10-22 4
3 2022-02-23 1
4 2022-07-24 3
5 2022-06-02 2
6 2022-05-24 2
7 2022-06-27 2
8 2022-10-07 4
9 2022-08-22 3
10 2022-06-04 2
11 2022-01-31 1
12 2022-06-21 2
13 2022-06-08 2
14 2022-08-25 3
15 2022-10-10 4
16 2022-05-01 2
17 2022-11-22 4
18 2022-12-03 4
19 2022-09-04 3
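Since the month column in the question was created with .dt.strftime('%m'), it has object (string) dtype, so either approach needs an integer conversion first. A rough sketch, assuming the column names from the question:
df['Quarters'] = (df['month'].astype(int) - 1) // 3 + 1
# or derive the quarter straight from the original datetime column:
df['Quarters'] = df['review_time'].dt.quarter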

Python Monthly Change Calculation (Pandas)

Here is the data:
id  date    population
1   2021-5  21
2   2021-5  22
3   2021-5  23
4   2021-5  24
1   2021-4  17
2   2021-4  24
3   2021-4  18
4   2021-4  29
1   2021-3  20
2   2021-3  29
3   2021-3  17
4   2021-3  22
I want to calculate the monthly change in population for each id, so the result will be:
id  date  delta
1   5      .2353
1   4     -.15
2   5     -.1519
2   4     -.2083
3   5      .2174
3   4      .0556
4   5     -.2083
4   4      .3182
delta := (this month - last month) / last month
How do I approach this in pandas? I'm thinking of groupby but don't know what to do next.
Remember there might be more dates, but the result is always in this format.
Use GroupBy.pct_change after first sorting by the id and date columns, then remove the missing rows by the delta column:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','date'], ascending=[True, False])
df['delta'] = df.groupby('id')['population'].pct_change(-1)
df = df.dropna(subset=['delta'])
print (df)
id date population delta
0 1 2021-05-01 21 0.235294
4 1 2021-04-01 17 -0.150000
1 2 2021-05-01 22 -0.083333
5 2 2021-04-01 24 -0.172414
2 3 2021-05-01 23 0.277778
6 3 2021-04-01 18 0.058824
3 4 2021-05-01 24 -0.172414
7 4 2021-04-01 29 0.318182
Try this:
df.groupby('id')['population'].rolling(2).apply(lambda x: (x.iloc[0] - x.iloc[1]) / x.iloc[0]).dropna()
maybe you could try something like:
data['delta'] = data['population'].diff()
data['delta'] /= data['population']
With this approach the first row will be NaN, but for the rest this should work.
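Note that a plain diff/pct_change over the whole column ignores the id grouping and the descending date order in the question, so a grouped version may be closer to what you want. A rough sketch, assuming date has already been parsed with pd.to_datetime:
data = data.sort_values(['id', 'date'])
data['delta'] = data.groupby('id')['population'].pct_change()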

Python delta for 7 day difference

I have a dataframe, df, that I am wanting to calculate the delta over a 7 day time period:
Monday Tuesday Wednesday Thursday Friday Sat Sun
5 10 15 20 25 30 35
1 2 3 4 5 6 7
I would like to find the delta for the first row, starting with Monday (5) and ending on Sun (35)
The delta for the first 7 day time period would be: 35 - 5 = 30
The next 7 day window delta would be: 7 - 1 = 6 and so on
The date would begin on 1/1/2020 and continue by 7 day or weekly increments.
Desired output: (New dataframe with the newly created Date and Delta columns)
Date Delta
1/1/2020 30
1/8/2020 6
This is what I am doing:
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
df['Delta'] = df['Sun'] - df['Monday']
df['Date'] = pd.date_range(start='1/1/2020', periods=len(df), freq='Weeks')
df2.to_csv('df2.csv')
Any suggestion is appreciated
Let's try calculating the date_range by incorporating a multiple in the freq:
df['Delta'] = df.Sun.sub(df.Monday)
df['Date'] = pd.Series(pd.date_range(pd.Timestamp('2020-01-01'), periods=7, freq='7d'))
or simply
df = df.assign(Delta=df.Sun.sub(df.Monday),
               Date=pd.Series(pd.date_range(pd.Timestamp('2020-01-01'), periods=7, freq='7d')))
Monday Tuesday Wednesday Thursday Friday Sat Sun Delta Date
0 5 10 15 20 25 30 35 30 2020-01-01
1 1 2 3 4 5 6 7 6 2020-01-08
# necessary imports
import datetime
import pandas as pd
You can do:
numdays=5
base = datetime.datetime(2020,1,1)
date_list = [base + datetime.timedelta(days=7*x) for x in range(numdays)]
Then:
df=pd.DataFrame({'Date':date_list})
If you have another list of values, ie Deltas_list you want to include in this dataframe:
Deltas_list=[0,1,2,3,4]
Deltas=pd.Series(Deltas_list)
df['Delta']=Deltas
df will be:
Date Delta
0 2020-01-01 0
1 2020-01-08 1
2 2020-01-15 2
3 2020-01-22 3
4 2020-01-29 4
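Putting this together for the frame in the question, the desired two-column output could be built roughly like this (a sketch, assuming the wide frame from df.csv has been read into df with the Monday..Sun columns shown above):
df2 = pd.DataFrame({
    'Date': pd.date_range('2020-01-01', periods=len(df), freq='7D'),
    'Delta': df['Sun'] - df['Monday'],
})
df2.to_csv('df2.csv', index=False)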

How to convert a 2 column pandas dataframe to datetime?

I have a pandas dataframe that has 2 columns: The first column is the minutes, and the second column is the seconds. It looks like this:
min s
0 0 0
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
6 0 6
7 0 7
8 0 8
9 0 9
10 0 10
11 0 11
12 0 12
How would I convert this to datetime in the format %M:%S?
I believe that because you just have time-only columns, you should avoid datetime and use timedelta instead.
I would try something like this:
import pandas as pd
df=pd.DataFrame({"minute":[0,1,2,3,4,5],"second":[30,40,50,60,10,20]})
df['time'] = df.agg('{0[minute]}:{0[second]}'.format, axis=1)
df['time'] = pd.to_timedelta('00:'+df['time'])
print(df)
This outputs the following (you could delete the minute and second columns afterwards if they're unnecessary):
minute second time
0 0 30 00:00:30
1 1 40 00:01:40
2 2 50 00:02:50
3 3 60 00:04:00
4 4 10 00:04:10
5 5 20 00:05:20
l = []
for MinSec in list(zip(df['min'], df['s'])):
    l.append(':'.join(map(str, MinSec)))
pd.to_datetime(pd.Series(l), format='%M:%S')
Just zip the min and s columns, join them using ':' as a separator, and convert the resulting series with to_datetime() into your desired datetime format.
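A vectorised alternative that avoids string formatting entirely is to build the timedelta from a total number of seconds. A rough sketch, assuming the columns are named min and s as in the question:
df['time'] = pd.to_timedelta(df['min'] * 60 + df['s'], unit='s')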

Subtracting Rows based on ID Column - Pandas

I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find out the number of days the user left as a gap, so I want a column with that value for each row of each user, and my dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (First gap will be 0 for all users)
1 2010-10-01 8 15 (11th Sept+5=16th Sept; 1st Oct - 16th Sept=15days)
1 2010-10-28 1 9
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May+12 days=18th May)
3 2010-08-09 10 0
3 2010-09-25 4 36
3 2010-10-01 2 2
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
Here is one approach using groupby + shift:
# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])
# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + \
         pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')
# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))
# convert to days and then integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)
print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
