How to reshape dataframe by extracting partial name of header in python? - python

I am not sure how to split a data frame into new data frames that keep only certain columns, based on part of the column header's name.
Here is the data frame I have:
Date 990986_125001_AA1234 990986_125002_AB2586 990986_125003_AA1234
2020-01-01 439.9 398.9 435.8
2020-05-25 443.8 390.9 438.8
2020-09-11 438.9 387.9 436.8
2020-03-27 435.2 399.2 431.5
2020-07-30 434.6 387.2 422.5
2020-08-05 432.7 377.1 432.7
I want to form three separate data frames based on the headers.
For example, df1 should only contain columns starting with 990986_125001_******,
df2 should only contain columns starting with 990986_125002_******,
and df3 should only contain columns starting with 990986_125003_******.
The separator is the middle number (12500*), so df1's number ends with 1, df2's with 2, and df3's with 3.
I have around 100 columns.
The desired output will be
df1
Date 990986_125001_AA1234
2020-01-01 439.9
2020-05-25 443.8
2020-09-11 438.9
2020-03-27 435.2
2020-07-30 434.6
2020-08-05 432.7
second dataframe
df2
Date 990986_125002_AB2586
2020-01-01 398.9
2020-05-25 390.9
2020-09-11 387.9
2020-03-27 399.2
2020-07-30 387.2
2020-08-05 377.1
third data frame
df3
Date 990986_125003_AA1234
2020-01-01 435.8
2020-05-25 438.8
2020-09-11 436.8
2020-03-27 431.5
2020-07-30 422.5
2020-08-05 432.7
I have searched Google and Stack Overflow, but the results only show how to reshape columns by referring to the header's name, index, or iloc.
Can someone please help me reshape the data frame according to this condition?
Thanks

import pandas as pd

dict1 = {}
df = pd.read_csv("data.csv")
# keep the Date column plus each column whose middle token contains "1250"
for index, item in enumerate(df.columns.tolist()[1:]):
    if "1250" in item.split("_")[1]:
        dict1["df" + str(index)] = df[['Date', item]]

for keys in dict1.keys():
    print(keys)
    print(dict1[keys])
This will produce your desired output for the given test data.
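If each middle number can match several columns (as it will with roughly 100 columns), a more general sketch is to group the column names by that middle token. This assumes the middle token is always the second underscore-separated field and that the first column is Date:

import pandas as pd

df = pd.read_csv("data.csv")

# collect the measurement columns under their middle number, e.g. '125001'
groups = {}
for col in df.columns[1:]:          # skip the 'Date' column
    key = col.split("_")[1]
    groups.setdefault(key, []).append(col)

# build one dataframe per middle number, each keeping the Date column
frames = {key: df[['Date'] + cols] for key, cols in groups.items()}
df1 = frames['125001']
df2 = frames['125002']
df3 = frames['125003']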

Related

Subtract one datetime column after a groupby with a time reference for each group from a second Pandas dataframe

I have one dataframe df1 with one admissiontime for each id.
id admissiontime
1 2117-04-03 19:15:00
2 2117-10-18 22:35:00
3 2163-10-17 19:15:00
4 2149-01-08 15:30:00
5 2144-06-06 16:15:00
And another dataframe df2 with several datetimes for each id:
id datetime
1 2135-07-28 07:50:00.000
1 2135-07-28 07:50:00.000
2 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
3 2135-07-28 07:57:15.900
I would like to subtract, for each id, that id's admissiontime from the datetimes, and store the result in a column of the second dataframe.
I think I have to use df2.groupby('id')['datetime'] - something, but I struggle to connect it with df1.
Use Series.sub together with Series.map to look up each id's admission time from the other DataFrame:
df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
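A minimal, self-contained sketch with a subset of the sample ids above (same column names as in the question):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'admissiontime': ['2117-04-03 19:15:00',
                                      '2117-10-18 22:35:00',
                                      '2163-10-17 19:15:00']})
df2 = pd.DataFrame({'id': [1, 1, 2, 3, 3],
                    'datetime': ['2135-07-28 07:50:00.000', '2135-07-28 07:50:00.000',
                                 '2135-07-28 07:57:15.900', '2135-07-28 07:57:15.900',
                                 '2135-07-28 07:57:15.900']})

df1['admissiontime'] = pd.to_datetime(df1['admissiontime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])

# look up each row's admission time by id, then subtract it from the datetime
df2['diff'] = df2['datetime'].sub(df2['id'].map(df1.set_index('id')['admissiontime']))
print(df2)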

How do I delete specific dataframe rows based on a columns value?

I have a pandas dataframe with 2 columns ("Date" and "Gross Margin"). I want to delete rows based on the value in the "Date" column. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this and the pandas.drop() function seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.
You can try the following code, which matches the month and day:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
Assuming the dates are formatted as year-month-day:
df = df[df['Date'].str.endswith('12-31')]
If the dates are using a consistent format, you can do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]
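If you would rather not depend on the string format at all, here is a sketch that filters on the parsed month and day (assuming the column parses cleanly with pd.to_datetime):

df['Date'] = pd.to_datetime(df['Date'])
df = df[(df['Date'].dt.month == 12) & (df['Date'].dt.day == 31)]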

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates, with a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However the bfill replaces all my NaN's and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
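One possible workaround (a sketch, not tested against your CSV) is to keep only the last value per date before calling asfreq, so the index it builds has no duplicates:

df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.drop_duplicates(subset='Date', keep='last')
df2 = df2.set_index('Date').asfreq('D').reset_index()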
Pandas has the asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has the reindex method: given a list of index labels, it keeps only those labels.
In your case, you can create all the dates you want, with date_range for example, and then pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we use reindex with the full list of dates (given by date_range from the minimal to the maximal date in the 'Date' column, at daily frequency) as the new index. This results in NaNs wherever there was no former value.
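Note that reset_index() here will name the new column 'index', because the date_range index has no name; if you want it called 'Date' again, something like this should work:

df = (df.set_index('Date')
        .reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D'))
        .reset_index()
        .rename(columns={'index': 'Date'}))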

Calculating moving median within group

I want to perform a rolling median on the price column over the 4 previous days, with the data grouped by date. So basically I want to take the prices for a given day, plus all prices from the 4 days back, and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to keep one row per index value, and because of how the median works I am not able to merge these rows afterwards to produce one result per row.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then simply printed the price values.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today plus the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date, and make sure the date column is a datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
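If you also want that per-day value back on every row of the original dataframe, a sketch that maps it by date (df_f is the result above; the column name median_5d is just illustrative):

# map the per-day 5-day median back onto every original row
df['median_5d'] = pd.to_datetime(df['date']).map(df_f.set_index('date')['price'])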
This is a step by step process. There are probably more efficient methods of getting what you want. Note, if you have time information for your dates, you would need to drop that information before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np

# Replace with your data import
df = pd.read_csv('random_dates_prices.csv')

# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])

# Sort your data by date
df = df.sort_values(by=['date'])

# Create a group-by object
dates = df.groupby('date')

# Reformat the dataframe to one row per day, with that day's prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))

# Extract the price lists to a separate list
prices = df['price'].tolist()

# Initialize a list to store the past four days of prices for the current day
four_days = []

# Loop over the prices list to combine the last four days into a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])

# Initialize a list to store the median values
medians = []

# Loop through the four_days list and calculate the median of the last four days for each date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))

# Create dummy zero values so the new lists line up with the dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)

# Add both new lists to the data frame
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians

# Replace the dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)

# Clean the data frame so you only have a single date and a median value for the past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)

Create common columns and transform time series like data

I have an Excel file which contains more than 30 sheets for different parameters like BP, heart rate, etc.
One of the dataframes (df1, created from one sheet of the Excel file) looks as shown below:
df1= pd.DataFrame({'person_id':[1,1,1,1,2,2,2,2,3,3,3,3,3,3],'level_1': ['H1Date','H1','H2Date','H2','H1Date','H1','H2Date','H2','H1Date','H1','H2Date','H2','H3Date','H3'],
'values': ['2006-10-30 00:00:00','6.6','2006-08-30 00:00:00','4.6','2005-10-30 00:00:00','6.9','2016-11-30 00:00:00','6.6','2006-10-30 00:00:00','6.6','2006-11-30 00:00:00','8.6',
'2106-10-30 00:00:00','16.6']})
Another dataframe (df2), from another sheet of the Excel file, can be generated using the code below:
df2= pd.DataFrame({'person_id':[1,1,1,1,2,2,2,2,3,3,3,3,3,3],'level_1': ['GluF1Date','GluF1','GluF2Date','GluF2','GluF1Date','GluF1','GluF2Date','GluF2','GluF1Date','GluF1','GluF2Date','GluF2','GluF3Date','GluF3'],
'values': ['2006-10-30 00:00:00','6.6','2006-08-30 00:00:00','4.6','2005-10-30 00:00:00','6.9','2016-11-30 00:00:00','6.6','2006-10-30 00:00:00','6.6','2006-11-30 00:00:00','8.6',
'2106-10-30 00:00:00','16.6']})
Similarly, there are more than 30 dataframes like this with values in the same format (date & measurement value), but the column names (H1, GluF1, H1Date, H100, H100Date, GluF1Date, P1, PDate, UACRDate, UACR100, etc.) are different.
What I am trying to do, based on an SO search, is shown below:
g = df1.level_1.str[-2:] # Extracting column names
df1['lvl'] = df1.level_1.apply(lambda x: int(''.join(filter(str.isdigit, x)))) # Extracting level's number
df1= df1.pivot_table(index=['person_id', 'lvl'], columns=g, values='values', aggfunc='first')
final = df1.reset_index(level=1).drop(['lvl'], axis=1)
The above code gives an output like this, which is not what I expect.
This doesn't work because g doesn't produce the same string output (column names) for all records. My code would work if the substring extraction had produced the same output, but since the data is a sequence, I am not able to make it uniform.
I expect my output to be as shown below for each dataframe. Please note that a person can have 3 records (H1..H3), 10 records (H1..H10), or 100 records (e.g. H1...H100). All are possible.
(screenshot of the expected output)
Concat all even and all odd rows without using column names, then name the columns as needed:
res = pd.concat([df2.iloc[0::2,0:3:2].reset_index(drop=True), df2.iloc[1::2,2].reset_index(drop=True)], axis=1)
res.columns = ['Person_ID', 'Date', 'Value']
Output:
Person_ID Date Value
0 1 2006-10-30 00:00:00 6.6
1 1 2006-08-30 00:00:00 4.6
2 2 2005-10-30 00:00:00 6.9
3 2 2016-11-30 00:00:00 6.6
4 3 2006-10-30 00:00:00 6.6
5 3 2006-11-30 00:00:00 8.6
6 3 2106-10-30 00:00:00 16.6
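Since there are 30+ sheets with the same layout, the same even/odd-row reshape can be wrapped in a small helper and applied to every sheet. A sketch, assuming each sheet produces the same long person_id / level_1 / values layout as df1 and df2 and that the workbook is read with sheet_name=None (the file name parameters.xlsx is a placeholder):

import pandas as pd

def reshape_measurements(frame):
    # even rows hold (person_id, date); odd rows hold the measurement value
    res = pd.concat([frame.iloc[0::2, 0:3:2].reset_index(drop=True),
                     frame.iloc[1::2, 2].reset_index(drop=True)], axis=1)
    res.columns = ['Person_ID', 'Date', 'Value']
    return res

sheets = pd.read_excel('parameters.xlsx', sheet_name=None)   # dict of sheet name -> DataFrame
reshaped = {name: reshape_measurements(sheet) for name, sheet in sheets.items()}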
Here is one way using unstack() with a little modification:
Assign a dummy counter column k using df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount().
Change level_1 to df1.level_1.str[:2].
Set the index to ['person_id', 'level_1', 'k'] and unstack on the third index level.
m = (df1.assign(k=df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount(),
                level_1=df1.level_1.str[:2])
        .set_index(['person_id', 'level_1', 'k'])
        .unstack(2)
        .droplevel(1))
m.columns=['Date','Values']
print(m)
Date Values
person_id
1 2006-10-30 00:00:00 6.6
1 2006-08-30 00:00:00 4.6
2 2005-10-30 00:00:00 6.9
2 2016-11-30 00:00:00 6.6
3 2006-10-30 00:00:00 6.6
3 2006-11-30 00:00:00 8.6
3 2106-10-30 00:00:00 16.6
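If you prefer person_id as a regular column (matching the first answer's layout), a final reset_index() should do it:

m = m.reset_index()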
