Create common columns and transform time series like data - python

I have an Excel file which contains more than 30 sheets for different parameters like BP, heart rate, etc.
One of the dataframes (df1, created from one sheet of the Excel file) can be generated with the code below:
import pandas as pd

df1 = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1', 'H2Date', 'H2',
                'H1Date', 'H1', 'H2Date', 'H2', 'H3Date', 'H3'],
    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
               '2106-10-30 00:00:00', '16.6']})
Another dataframe (df2), from another sheet of the Excel file, can be generated using the code below:
df2 = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'level_1': ['GluF1Date', 'GluF1', 'GluF2Date', 'GluF2', 'GluF1Date', 'GluF1',
                'GluF2Date', 'GluF2', 'GluF1Date', 'GluF1', 'GluF2Date', 'GluF2',
                'GluF3Date', 'GluF3'],
    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
               '2106-10-30 00:00:00', '16.6']})
Similarly, there are more than 30 dataframes like this with values in the same format (date and measurement value), but the column names (H1, GluF1, H1Date, H100, H100Date, GluF1Date, P1, PDate, UACRDate, UACR100, etc.) differ from sheet to sheet.
What I have tried so far, based on an SO search, is shown below:
g = df1.level_1.str[-2:] # Extracting column names
df1['lvl'] = df1.level_1.apply(lambda x: int(''.join(filter(str.isdigit, x)))) # Extracting level's number
df1= df1.pivot_table(index=['person_id', 'lvl'], columns=g, values='values', aggfunc='first')
final = df1.reset_index(level=1).drop(['lvl'], axis=1)
The above code gives an output (screenshot not reproduced here) which is not what I expect.
It doesn't work because g does not produce the same string (column name) for every record. My code would work if the substring extraction produced a uniform result, but since the columns form a sequence (H1, H2, ..., H100), I am not able to make it uniform.
I expect my output to look like the screenshot below for each dataframe. Please note that a person can have 3 records (H1..H3), 10 records (H1..H10), or even 100 records (H1..H100); all are possible.
(Updated screenshot of the expected output not reproduced here.)

Concat all even and all odd rows without using column names, then name the columns as needed:
res = pd.concat([df2.iloc[0::2,0:3:2].reset_index(drop=True), df2.iloc[1::2,2].reset_index(drop=True)], axis=1)
res.columns = ['Person_ID', 'Date', 'Value']
Output:
Person_ID Date Value
0 1 2006-10-30 00:00:00 6.6
1 1 2006-08-30 00:00:00 4.6
2 2 2005-10-30 00:00:00 6.9
3 2 2016-11-30 00:00:00 6.6
4 3 2006-10-30 00:00:00 6.6
5 3 2006-11-30 00:00:00 8.6
6 3 2106-10-30 00:00:00 16.6
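Since every sheet follows the same date/value alternation, the same even/odd slicing can be wrapped in a small helper and applied to each sheet of the workbook. A minimal sketch, assuming a placeholder file name measurements.xlsx and that every sheet loads with the same three columns (person_id, level_1, values) in that order:
import pandas as pd

# read every sheet into a dict of DataFrames keyed by sheet name
sheets = pd.read_excel('measurements.xlsx', sheet_name=None)

def reshape(df):
    dates = df.iloc[0::2, [0, 2]].reset_index(drop=True)   # even rows: person_id + date
    values = df.iloc[1::2, 2].reset_index(drop=True)       # odd rows: measurement value
    out = pd.concat([dates, values], axis=1)
    out.columns = ['Person_ID', 'Date', 'Value']
    return out

tidy = {name: reshape(frame) for name, frame in sheets.items()}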

Here is one way using unstack() with a little modification:
Assign a dummy counter column k with df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount().
Replace level_1 with its first two characters: level_1=df1.level_1.str[:2].
Set the index to ['person_id','level_1','k'] and unstack the third level (k).
m = (df1.assign(k=df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount(),
                level_1=df1.level_1.str[:2])
        .set_index(['person_id', 'level_1', 'k'])
        .unstack(2)
        .droplevel(1))
m.columns=['Date','Values']
print(m)
Date Values
person_id
1 2006-10-30 00:00:00 6.6
1 2006-08-30 00:00:00 4.6
2 2005-10-30 00:00:00 6.9
2 2016-11-30 00:00:00 6.6
3 2006-10-30 00:00:00 6.6
3 2006-11-30 00:00:00 8.6
3 2106-10-30 00:00:00 16.6
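Whichever answer you use, the Date and Values columns come out as strings because the original 'values' column mixes dates and numbers. A small follow-up sketch, assuming the result frame m from above, to convert them to proper dtypes:
m['Date'] = pd.to_datetime(m['Date'], errors='coerce')     # unparseable dates become NaT
m['Values'] = pd.to_numeric(m['Values'], errors='coerce')  # unparseable numbers become NaN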

Related

Subtract one datetime column after a groupby with a time reference for each group from a second Pandas dataframe

I have one dataframe df1 with one admissiontime for each id.
id admissiontime
1 2117-04-03 19:15:00
2 2117-10-18 22:35:00
3 2163-10-17 19:15:00
4 2149-01-08 15:30:00
5 2144-06-06 16:15:00
And another dataframe df2 with several datetimes for each id:
id datetime
1 2135-07-28 07:50:00.000
1 2135-07-28 07:50:00.000
2 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
For each id, I would like to subtract its admissiontime from the datetimes, storing the result in a new column of the second dataframe.
I think I have to use something like df2.groupby('id')['datetime'] - ..., but I am struggling to connect it with df1.
Use Series.sub with a per-id mapping built by Series.map from the other DataFrame:
df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
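A small end-to-end sketch with the sample rows transcribed from the tables in the question, showing what the map/sub combination produces (the diff column holds Timedeltas):
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'admissiontime': ['2117-04-03 19:15:00', '2117-10-18 22:35:00', '2163-10-17 19:15:00',
                      '2149-01-08 15:30:00', '2144-06-06 16:15:00'],
})
df2 = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'datetime': ['2135-07-28 07:50:00.000', '2135-07-28 07:50:00.000',
                 '2135-07-28 07:57:15.900', '2135-07-28 07:57:15.900',
                 '2135-07-28 07:57:15.900'],
})

df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])

# look up each row's admissiontime by id, then subtract it from datetime
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
print(df2)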

How to create a new column showing when a change to an observation occurred?

I have a data-frame formatted like so:
Contract  Agreement_Date  Date
A         2017-02-10      2020-02-03
A         2017-02-10      2020-02-04
A         2017-02-11      2020-02-09
A         2017-02-11      2020-02-10
A         2017-02-11      2020-04-21
B         2017-02-14      2020-08-01
B         2017-02-15      2020-08-11
B         2017-02-17      2020-10-14
C         2017-02-11      2020-12-12
C         2017-02-11      2020-12-16
In this dataframe I have multiple observations for each contract. For some contracts, the Agreement_Date changes as new amendments occur. For example, Contract A had its Agreement_Date change from 2017-02-10 to 2017-02-11, and Contract B had its Agreement_Date change twice (2017-02-14, then 2017-02-15, then 2017-02-17). Contract C had no change to Agreement_Date.
What I would like is an output that looks like so:
Contract  Date        Number_of_Changes
A         2020-02-09  1
B         2020-08-11  2
B         2020-10-14  2
Where the Date column shows when the change to Agreement_Date occurs (e.g. for contract A the Agreement_Date first went from 2017-02-10 to 2017-02-11 on 2020-02-09). This is shown in bold in my first table. I then want a Number_of_Changes column which simply shows how many times the Agreement_Date changed for that contract.
I have been working on this for a few hours to no avail, so any help would be appreciated.
Thanks :)
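For reference, the sample frame used by the answers below can be built like this (a sketch transcribed from the table above; both date columns are kept as strings unless an answer converts them):
import pandas as pd

df = pd.DataFrame({
    'Contract': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Agreement_Date': ['2017-02-10', '2017-02-10', '2017-02-11', '2017-02-11', '2017-02-11',
                       '2017-02-14', '2017-02-15', '2017-02-17', '2017-02-11', '2017-02-11'],
    'Date': ['2020-02-03', '2020-02-04', '2020-02-09', '2020-02-10', '2020-04-21',
             '2020-08-01', '2020-08-11', '2020-10-14', '2020-12-12', '2020-12-16'],
})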
I posted a previous answer, but realised it's not what you expected. Would this one work out, though?
#Create 'progressive' number of changes column per Contract
df['Changes']=df.groupby('Contract')['Agreement_Date'].transform(lambda x:(x!=x.shift()).cumsum())-1
#Assign to new df, filter for changes and drop duplicates assuming it's already sorted per 'Date'
newdf=df[df['Changes']>0].drop_duplicates(subset=['Contract','Changes'])[['Contract','Date','Changes']]
#Reassign values of 'changes' for max 'Change' per Contract
newdf['Changes']=newdf.groupby('Contract')['Changes'].transform('max')
newdf
This problem revolves around setting up a few pieces for later use. You'll need multiple passes to:
shift the dates & retrieve the records where the changes occur
calculate the number of changes that occurred
We can do this by working with the groupby object in 2 steps.
contract_grouped = df.groupby('Contract')['Agreement_Date']
# subset data based on whether or not a date change occurred
shifted_dates = contract_grouped.shift()
changed_df = df.loc[
    shifted_dates.ne(df['Agreement_Date']) & shifted_dates.notnull()
].copy()
# calculate counts and assign back to df
changed_df['count'] = changed_df['Contract'].map(contract_grouped.nunique() - 1)
del changed_df['Date'] # unneeded column
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2
Here is the same approach written out with method chaining & assignment expression syntax. If the above is more readable to you, please use that. I put this here mainly because I enjoy writing my pandas answers both ways.
changed_df = (
    df.groupby('Contract')['Agreement_Date']
    .pipe(lambda grouper:
        df.loc[
            (shifted := grouper.shift()).ne(df['Agreement_Date'])
            & shifted.notnull()
        ]
        .assign(count=lambda d: d['Contract'].map(grouper.nunique().sub(1)))
        .drop(columns='Date')
    )
)
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2
This gives the desired output: first, generate the difference between each row and the one before and locate the rows where the value is neither 0 days nor NaT; then create a 'Change' column based on the count.
import numpy as np

df.Agreement_Date = pd.to_datetime(df.Agreement_Date)
out = df.loc[np.where(df.groupby('Contract')['Agreement_Date'].diff().notna()
                      & (df['Agreement_Date'].diff() != '0 days'))][['Contract', 'Date']]
out['Change'] = out.groupby('Contract')['Date'].transform('count').values
out.set_index('Contract', drop=True, inplace=True)
Output:
Date Change
Contract
A 2020-02-09 1
B 2020-08-11 2
B 2020-10-14 2

Mapping two rows to one row in pandas

I have a dataframe a with 14 rows and another dataframe comp1sum with 7 rows. a has a date column covering 7 days at 12-hour intervals, which makes 14 rows, while comp1sum has one row per day for those 7 days.
This is the comp1sum dataframe (screenshot not reproduced here).
And this is the a dataframe (screenshot not reproduced here).
I want to map two rows of dataframe a to a single row of the comp1sum dataframe, so that one day of dataframe a is mapped to one day of comp1sum.
I have the following code for that
j = 0
for i in range(0, 7):
    a.loc[i, 'comp1_sum'] = comp_sum.iloc[j]['comp1sum']
    a.loc[i, 'comp2_sum'] = comp_sum.iloc[j]['comp2sum']
    j = j + 1
And its output is
dt_truncated comp1_sum
3 2015-02-01 00:00:00 142.0
10 2015-02-01 12:00:00 144.0
12 2015-02-03 00:00:00 145.0
2 2015-02-05 00:00:00 141.0
14 2015-02-05 12:00:00 NaN
The code maps the days from comp1sum based on the index of a, not on the dates of a. I want 2015-02-01 00:00:00 to get the value 139.0, 2015-02-02 00:00:00 to get the value 140.0, and so on, so that increasing dates get increasing values.
I am not able to map it in such a way. Please help.
Edit 1: As per @Ssayan's answer, I am getting this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-255-77e55efca5f9> in <module>
3 # use the sorted index to iterate through the sorted dataframe
4 for i, idx in enumerate(a.index):
----> 5 a.loc[idx, 'comp1_sum'] = b.iloc[i//2]['comp1sum']
6 a.loc[idx,'comp2_sum'] = b.iloc[i//2]['comp2sum']
IndexError: single positional indexer is out-of-bounds
Your issue is that your DataFrame a is not sorted by date, so index 0 does not correspond to the earliest date. When you use loc it selects by the value of the index, not by the position of the row, so even after sorting the DataFrame the issue remains.
One way out is to sort DataFrame a by date and then use the sorted index to apply the values in the order you need.
# sort the dataframe by date
a = a.sort_values("dt_truncated")
# use the sorted index to iterate through the sorted dataframe
for i, idx in enumerate(a.index):
    a.loc[idx, 'val_1'] = b.iloc[i//2]['val1']
    a.loc[idx, 'val_2'] = b.iloc[i//2]['val2']
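If the loop becomes a bottleneck, a vectorized sketch of the same idea is possible, assuming a has exactly two rows per day, b is already ordered by day, and the val1/val2 names are the same placeholders used above:
import numpy as np

# sort a by date so its row order matches b's day order, then repeat each
# daily value twice to cover the two 12-hour rows of that day
a = a.sort_values('dt_truncated').reset_index(drop=True)
a['val_1'] = np.repeat(b['val1'].to_numpy(), 2)
a['val_2'] = np.repeat(b['val2'].to_numpy(), 2)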

sort_values function in pandas dataframe not working properly

I have a dataset of 1,281,695 rows and 4 columns containing 6 years of monthly data from 2013 to 2019, so dates are necessarily repeated. I want to arrange the data by date in ascending order: Jan 2013, Feb 2013, ..., Dec 2013, Jan 2014, ..., Dec 2019. I want ascending order across the whole dataset, but it shows ascending order for some of the data and a seemingly random order for the rest.
I tried the sort_values function of the pandas library, something like this:
data = df.sort_values(['SKU', 'Region', 'FMonth'], axis=0, ascending=[False, True, True]).reset_index()
where SKU, Region, FMonth are my independent variables. FMonth is the date variable.
The code arranges the start of the data but not the end. For example, when I try:
data.head()
result:
index SKU Region FMonth sh
0 8264 855019.133127 3975.495636 2013-01-01 67640.0
1 20022 855019.133127 3975.495636 2013-02-01 73320.0
2 31972 855019.133127 3975.495636 2013-03-01 86320.0
3 43897 855019.133127 3975.495636 2013-04-01 98040.0
4 55642 855019.133127 3975.495636 2013-05-01 73240.0
And,
data.tail()
result:
index SKU Region FMonth sh
1281690 766746 0.000087 7187.170501 2017-03-01 0.0
1281691 881816 0.000087 7187.170501 2017-09-01 0.0
1281692 980113 0.000087 7187.170501 2018-02-01 0.0
1281693 1020502 0.000087 7187.170501 2018-04-01 0.0
1281694 1249130 0.000087 7187.170501 2019-03-01 0.0
where 'sh' is my dependent variable.
The data is not really pretty, but please focus on the FMonth (date) column only.
As you can see, the last rows are not in ascending order even though the starting rows follow the specified order. If I flip the ascending flag for the FMonth column in the code above (i.e. sort it in descending order), the starting rows show the requested order but the last rows again do not.
What am I doing wrong? What should I do to achieve ascending order across the whole dataset? What is happening, and why?
Do you just need to prioritize Month?
z = pd.read_clipboard()
z.columns = [i.strip() for i in z.columns]
z.sort_values(['FMonth', 'Region', 'SKU'], axis=0, ascending=[True, True, True])
Out[23]:
index SKU Region FMonth sh
1 20022 8 52 1/1/2013 73320
0 8264 1 67 1/1/2013 67640
3 43897 5 34 3/1/2013 98040
2 31972 3 99 3/1/2013 86320
4 55642 4 98 5/1/2013 73240
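One thing worth checking before sorting: if FMonth is stored as text in a non-ISO format (as in the read_clipboard example above, e.g. 1/1/2013), the sort is lexicographic and will look wrong. A hedged sketch using the column names from the question, converting first and keeping the date as the leading key:
# make sure FMonth is a real datetime, then sort with the date first
# (SKU kept descending as in the question's original call)
df['FMonth'] = pd.to_datetime(df['FMonth'])
data = df.sort_values(['FMonth', 'SKU', 'Region'],
                      ascending=[True, False, True]).reset_index(drop=True)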

Get values from .csv closest in time to values in another dataframe

I have 2 dataframes that I created using pandas and stored as .csv files. Each row of both dataframes has date and time columns, but the timestamps aren't necessarily the same. I want to create a combined pandas dataframe in which the two are joined on the basis of the CLOSEST times.
My first and second dataframes are shown in screenshots (not reproduced here). I want to get the kp and f107 values that are closest in date and time to the Avg_time column for each row (Filename) in the first dataframe. How do I do this? Is there a merge with a method='nearest' kind of option in pandas?
You can use pd.merge_asof (here with pandas 0.20.2) with direction='nearest':
pd.merge_asof(df1.sort_values(by='file_date'),df2.sort_values(by='AST'), left_on='file_date', right_on='AST', direction='nearest')
Output:
Filename file_date Avg_time AST f107 kp
0 Na1998319 1998-11-16 2:14 1998-11-15 23:00:00 121.8 2.3
1 Na1998320 1998-11-17 2:01 1998-11-16 23:00:00 118.0 2.3
2 Na1998321 1998-11-18 0:38 1998-11-17 23:00:00 112.2 2.3
3 Na1998322 1998-11-18 20:51 1998-11-17 23:00:00 112.2 2.3
4 Na1999020 1999-01-20 22:53 1999-01-19 23:00:00 231.3 0.7
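merge_asof requires both key columns to have the same datetime dtype and each frame to be sorted on its key, so a conversion step may be needed first. A sketch using the column names shown above:
# keys must be real datetimes and each frame sorted on its key
df1['file_date'] = pd.to_datetime(df1['file_date'])
df2['AST'] = pd.to_datetime(df2['AST'])

merged = pd.merge_asof(
    df1.sort_values('file_date'),
    df2.sort_values('AST'),
    left_on='file_date',
    right_on='AST',
    direction='nearest',
)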
