I have a dataframe:
date C P
0 15.4.21 0.06 0.94
1 16.4.21 0.15 1.32
2 2.5.21 0.06 1.17
3 8.5.21 0.20 0.82
4 9.6.21 0.04 -5.09
5 1.2.22 0.05 7.09
I need to sum both C and P for each month.
So the new df will have a column for each month; for example, for month 4 (April): (0.06 + 0.94 + 0.15 + 1.32) = 2.47, so the new df:
4/21 5/21 6/21 2/22
0 2.47 2.25 .. ..
Column names and order don't matter; actually, a string month name would be even better (e.g. April 22).
I was playing with something like this, which is not what I need:
df[['C','P']].groupby(df['date'].dt.to_period('M')).sum()
You almost had it; you just need to convert to datetime first with pd.to_datetime (using dayfirst=True, since your dates are day-first):
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True)
                  .dt.to_period('M'))
       .sum()
      )
Output:
C P
date
2021-04  0.21  2.26
2021-05  0.26  1.99
2021-06  0.04 -5.09
2022-02  0.05  7.09
If you want the grand total, sum again:
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True).dt.to_period('M'))
       .sum().sum(axis=1)
      )
Output:
date
2021-04    2.47
2021-05    2.25
2021-06   -5.05
2022-02    7.14
Freq: M, dtype: float64
as "Month year"
If you want string labels, it is better to convert at the end so the chronological order is preserved:
out.index = out.index.strftime('%B %y')
Output:
date
April 21       2.47
May 21         2.25
June 21       -5.05
February 22    7.14
dtype: float64
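Since the desired output in the question has the months as columns of a single row, one optional last step (a small sketch reusing the out Series from above) is to transpose it:
wide = out.to_frame().T   # one row, one column per month
print(wide)
#    April 21  May 21  June 21  February 22
# 0      2.47    2.25    -5.05         7.14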
I am working with the Johns Hopkins COVID data for personal use to create charts. The data shows cumulative deaths by country, and I want deaths per day. It seems to me the easiest way is to create two dataframes and subtract one from the other. But the file has dates as column names, and code like df3 = df2 - df1 subtracts the columns with matching dates. So I want to rename all the columns with some simple index, for example 1, 2, 3, ....
I cannot figure out how to do this.
new_names=list(range(data.shape[1]))
data.columns=new_names
This renames the columns of data from 0 upwards.
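For context, a minimal sketch of the subtraction step described in the question, assuming df1 and df2 are the two cumulative-deaths frames (names taken from the question) and have the same shape:
# give both frames positional integer column labels so the subtraction
# no longer aligns on the original date labels
df1.columns = range(df1.shape[1])
df2.columns = range(df2.shape[1])
df3 = df2 - df1   # deaths per day, assuming df2 is the data shifted by one day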
You could re-shape the data: use dates as row labels, and use (country, province) as column labels.
import pandas as pd
covid_csv = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df_raw = (pd.read_csv(covid_csv)
.set_index(['Country/Region', 'Province/State'])
.drop(columns=['Lat', 'Long'])
.transpose())
df_raw.index = pd.to_datetime(df_raw.index)
print( df_raw.iloc[-5:, 0:5] )
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 1269 144 1163 52 41
2020-07-28 1270 148 1174 52 47
2020-07-29 1271 150 1186 52 48
2020-07-30 1271 154 1200 52 51
2020-07-31 1272 157 1210 52 52
Now, you can use the rich set of pandas tools for time-series analysis. For example, use diff() to go from cumulative deaths to per-day rates. Or, you could compute N-day moving averages, create time-series plots, ...
print(df_raw.diff().iloc[-5:, 0:5])
Country/Region Afghanistan Albania Algeria Andorra Angola
Province/State NaN NaN NaN NaN NaN
2020-07-27 10.0 6.0 8.0 0.0 1.0
2020-07-28 1.0 4.0 11.0 0.0 6.0
2020-07-29 1.0 2.0 12.0 0.0 1.0
2020-07-30 0.0 4.0 14.0 0.0 3.0
2020-07-31 1.0 3.0 10.0 0.0 1.0
Finally, df_raw.sum(level='Country/Region', axis=1) will aggregate all Provinces within a Country.
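The N-day moving average mentioned above could look like this (a sketch reusing df_raw from the code block):
daily = df_raw.diff()                       # cumulative -> deaths per day
smoothed = daily.rolling(window=7).mean()   # 7-day moving average
print(smoothed.iloc[-5:, 0:5])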
Thanks for the time and effort but I figured out a simple way.
for i, row in enumerate(df):
    df.rename(columns = { row : str(i)}, inplace = True)
to change the columns names and then
for i, row in enumerate(df):
    df.rename(columns = { row : str( i + 43853)}, inplace = True)
to change them back to the dates I want.
I am trying to index match 2 dataframes and write the data back to excel. The Excel file that has to be filled looks like this:
Name Location Date Open High TimeH Low TimeL Close
1 Orange New York 20200501.0 5.5 5.58 18:00 5.45 16:00 5.7
0 Apple Minsk 20200505.0 3.5 3.85 NaN 3.45 NaN 3.65
2 Steak Dallas 20200506.0 8.5 8.85 NaN 8.45 NaN 8.65
The 'TimeH' and 'TimeL' values should be looked up from a dataframe that looks like this:
Name Date Time Open High Low Close Volume VWAP Trades
4 Apple 20200505 15:30:00 3.50 3.85 3.45 3.70 1500 3.73 95
5 Apple 20200505 17:00:00 3.65 3.70 3.50 3.60 1600 3.65 54
6 Apple 20200505 20:00:00 3.80 3.85 3.35 3.81 1700 3.73 41
7 Apple 20200505 22:00:00 3.60 3.84 3.45 3.65 1800 3.75 62
4 Steak 20200506 10:00:00 8.50 8.85 8.45 8.70 1500 8.73 95
5 Steak 20200506 12:00:00 8.65 8.70 8.50 8.60 1600 8.65 54
6 Steak 20200506 14:00:00 8.80 8.85 8.45 8.81 1700 8.73 41
7 Steak 20200506 16:00:00 8.60 8.84 8.45 8.65 1800 8.75 62
And then be written back to the Excel file, which should look like this after everything has worked:
Name Location Date Open High TimeH Low TimeL Close
1 Orange New York 20200501.0 5.5 5.58 18:00:00 5.45 16:00:00 5.7
0 Apple Minsk 20200505.0 3.5 3.85 15:30:00 3.45 20:00:00 3.65
2 Steak Dallas 20200506.0 8.5 8.85 10:00:00 8.45 14:00:00 8.65
I was using the following code to index the values 'Open', 'High', 'Low', 'Close', which works great:
rdf13 = rdf12.groupby(['Name','Date']).agg(Open=('Open','first'),High=('High','max'),Low=('Low','min'), Close=('Close','last'),Volume=('Volume','sum'),VWAP=('VWAP','mean'),Trades=('Trades','sum')).reset_index()
result11 = pd.merge(rdf13, rdf11, how='inner', on=['Name', 'Date']).iloc[:,:-4].dropna(1).rename(columns = {"Open_x": "Open", "High_x": "High", "Low_x": "Low", "Close_x": "Close", "Volume_x": "Volume", "VWAP_x": "VWAP", "Trades_x": "Trades"})
result12 = result11.reindex(index=result11.index[::-1])
result13 = result12[['Name', 'Location', 'Date', 'Check_2','Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'Trades']].reset_index()
readfile11 = pd.read_excel("Trackers\TEST Tracker.xlsx")
readfile11['Count'] = np.arange(len(readfile11))
df11 = readfile11.set_index(['Name', 'Location', 'Date'])
df12 = result13.set_index(['Name', 'Location', 'Date'])
fdf11 = df12.combine_first(df11).reset_index().reindex(readfile11.columns, axis=1).sort_values('Count')
print("Updated Day1 Data Frame")
print(fdf11)
writefdf10 = fdf11.to_excel("Trackers\TEST Tracker.xlsx", "Entries", index=False)
But when I append the following code to look up the TimeH value:
colnames40 = rdf12.rename(columns = {"Time": "TimeH"})
result41 = pd.merge(colnames40, rdf11, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(1).rename(columns = {"TimeH_x": "TimeH"})
result42 = result41.reindex(index=result41.index[::-1])
result43 = result42[['Name', 'Location', 'Date', 'Check_2', 'High', 'TimeH']].reset_index()
readfile41 = pd.read_excel("Trackers\TEST Tracker.xlsx")
readfile41['Count'] = np.arange(len(readfile41))
df41 = readfile41.set_index(['Name', 'Location', 'Date', 'High'])
df42 = result43.set_index(['Name', 'Location', 'Date', 'High'])
fdf41 = df42.combine_first(df41).reset_index().reindex(readfile41.columns, axis=1).sort_values('Count')
print("Updated Day3 Data Frame")
print(fdf41)
writefdf40 = fdf41.to_excel("Trackers\TEST Tracker.xlsx", "Entries", index=False)
it does not seem to work for some reason and returns nothing, so the 'NaN' values in the 'TimeH' column stay 'NaN'. I messed around with the variables, but I either got errors because I did something wrong or it still returned 'NaN' values to me.
Can someone here help me to make python index the time values?
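In case it helps, here is one possible way (a sketch, not taken from the code above) to pull the time of each day's high and low out of the intraday frame, assuming it is the rdf12 frame with the columns shown in the question:
# make sure the index is unique so idxmax/idxmin return single row labels
intraday = rdf12.reset_index(drop=True)

# row label of the (first) daily high / low per (Name, Date)
high_idx = intraday.groupby(['Name', 'Date'])['High'].idxmax()
low_idx = intraday.groupby(['Name', 'Date'])['Low'].idxmin()

time_high = intraday.loc[high_idx, ['Name', 'Date', 'Time']].rename(columns={'Time': 'TimeH'})
time_low = intraday.loc[low_idx, ['Name', 'Date', 'Time']].rename(columns={'Time': 'TimeL'})

# one row per (Name, Date) with the times of the high and the low,
# ready to be merged into the Excel frame on ['Name', 'Date']
times = time_high.merge(time_low, on=['Name', 'Date'])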
Apparently I just had a little typo in my code.
result41 = pd.merge(colnames40, rdf11, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(1).rename(columns = {"TimeH_x": "TimeH"})
should have been
result41 = pd.merge(colnames40, rdf31, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(1).rename(columns = {"TimeH_x": "TimeH"})
The problem now is that the data contains duplicate values, which makes sense because of the reference to rdf31, but the issue is that df.drop_duplicates(keep='first', inplace=False) returns 'None' values for some reason; that's outside the scope of this question though.
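On the drop_duplicates side note: getting None back usually means the call was made with inplace=True (which modifies the frame in place and returns None by design), or the inplace=False result was never assigned. A minimal sketch, reusing the result41 name from above:
# returns a new, deduplicated frame; assign it to keep the result
result41 = result41.drop_duplicates(keep='first')

# or modify in place; this call itself returns None by design
# result41.drop_duplicates(keep='first', inplace=True)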
I have a timeseries df comprised of daily Rates in column A and the relative change from one day to the next in column B.
DF looks something like the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67%
May/27/2019 20.2% 292%
May/28/2019 20.5% 1.4%
May/29/2019 20% -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
I would like to delete all values in column A which occur between large relative shifts, > +/- 50%.
So the above DF should look as the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67%
May/27/2019 np.nan 292%
May/28/2019 np.nan 1.4%
May/29/2019 np.nan -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
This is where I've got to so far; I would appreciate some help:
for i, j in df1.iterrows():
    if df1['Shift'][i] > .50 :
        x = df1['IR'][i]
    if df1['Shift'][j] < -.50 :
        y = df1['IR'][j]
    df1['IR'] = np.where(df1['Shift'].between(x,y), df1['Shift'],
                         np.nan)
Error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
We can locate rows between pairs ([1st-2nd), [3rd-4th), ...) of outlier values to then mask the entire DataFrame at once.
Setup
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))
IR Shift
May/24/2019 5.9 NaN
May/25/2019 6.0 1.67
May/26/2019 5.9 -1.67
May/27/2019 20.2 292.00
May/28/2019 20.5 1.40
May/29/2019 20.0 -1.60
May/30/2019 5.1 -292.00
May/31/2019 5.1 0.00
Code
# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)
# Get the indices between consecutive pairs.
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum()%2==1
df.loc[m, 'IR'] = np.NaN
# IR Shift
#May/24/2019 5.9 NaN
#May/25/2019 6.0 1.67
#May/26/2019 5.9 -1.67
#May/27/2019 NaN 292.00
#May/28/2019 NaN 1.40
#May/29/2019 NaN -1.60
#May/30/2019 5.1 -292.00
#May/31/2019 5.1 0.00
Here I've added a few more rows to show how this will behave in the case of multiple spikes. IR_modified is how IR will be masked with the above logic.
IR Shift IR_modified
May/24/2019 5.9 NaN 5.9
May/25/2019 6.0 1.67 6.0
May/26/2019 5.9 -1.67 5.9
May/27/2019 20.2 292.00 NaN
May/28/2019 20.5 1.40 NaN
May/29/2019 20.0 -1.60 NaN
May/30/2019 5.1 -292.00 5.1
May/31/2019 5.1 0.00 5.1
June/1/2019 7.0 415.00 NaN
June/2/2019 17.0 15.00 NaN
June/3/2019 27.0 12.00 NaN
June/4/2019 17.0 315.00 17.0
June/5/2019 7.0 -12.00 7.0
You can also use the np.where function from numpy as follows:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)
In [8]: df
Out[8]:
Date IR Shift
0 2019-05-24 NaN NaN
1 2019-05-25 0.0167 0.0167
2 2019-05-26 NaN -0.0167
3 2019-05-27 2.9200 2.9200
4 2019-05-28 0.0140 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 NaN -2.9200
Here's an attempt. There could be more "proper" ways to do it but I'm not familiar with all the pandas built-in functions.
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[pd.np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 0.202 2.9200
4 2019-05-28 0.205 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 0.051 -2.9200
df['IR'] = [pd.np.nan if abs(y-z) > 0.5 else x for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 NaN -2.9200
Using df.at to access a single value for a row/column label pair.
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30),datetime(2019,5,31)], 'IR':[5.9,6,5.9,20.2, 20.5, 20, 5.1, 5.1], 'Shift':[np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})
print("DataFrame Before :")
print(df)
count = 1
while count < len(df.index):
    if abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50:
        df.at[count, 'IR'] = np.nan
    count = count + 1
print("DataFrame After :")
print(df)
Output of program:
DataFrame Before :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 20.2 292.00
4 2019-05-28 20.5 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 5.1 -292.00
7 2019-05-31 5.1 0.00
DataFrame After :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 NaN 292.00
4 2019-05-28 NaN 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 NaN -292.00
7 2019-05-31 NaN 0.00
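The same row-by-row condition can also be written without the explicit loop (a sketch, assuming the original df from before the while loop):
mask = df['Shift'].diff().abs().ge(50)   # difference between consecutive Shift values >= 50
df.loc[mask, 'IR'] = np.nan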
As per your description of triggering this on any large shift, positive or negative, you could do this:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 0.051 -2.9200
Steps:
abs(df.Shift) > .5: Finds shifts above +/- 50%.
.cumsum(): Gives a unique number to each period between outliers, where the odd-numbered periods are the ones we want to omit.
% 2 == 1: Checks which rows have an odd cumsum() value.
Note: This does not work if what you want is to constrain this so that every positive spike needs to be followed by a negative spike, or vice versa.
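If you did want a masked stretch to start at a positive spike and end at the next negative spike (or the reverse, with the signs flipped), one possible sketch, reusing the same df and np from above, is to forward-fill a +1/-1 state from the spike rows:
pos = df.Shift.gt(0.5)    # large upward shifts
neg = df.Shift.lt(-0.5)   # large downward shifts
# carry +1 forward from a positive spike and -1 from a negative spike
state = (pos.astype(int) - neg.astype(int)).replace(0, np.nan).ffill()
# mask from a positive spike up to, but not including, the next negative spike
df.loc[state.eq(1), 'IR'] = np.nan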
I was not sure about your Shift column, so I calculated it again. Does this work for you?
import pandas as pd
import numpy as np
df.drop(columns=['Shift'], inplace=True)  ## calculated via method below
df['nextval'] = df['IR'].shift(periods=1)

def shift(current, previous):
    return (current - previous) / previous * 100

indexlist = []  ## to save index that will be set to null
prior = 0       ## temporary flag to store value prior to a peak
flag = False

for index, row in df.iterrows():
    if index == 0:  ## to skip first row of data
        continue
    if flag == False and shift(row[1], row[2]) > 50:  ## to check for start of peak
        prior = row[2]
        indexlist.append(index)
        flag = True
        continue
    if flag == True:  ## checking until when the peak lasts
        if shift(row[1], prior) > 50:
            indexlist.append(index)

df.loc[df.index.isin(indexlist), 'IR'] = np.nan  ## replacing with nan
Output of print(df):
date IR nextval
0 May/24/2019 5.9 NaN
1 May/25/2019 6.0 5.9
2 May/26/2019 5.9 6.0
3 May/27/2019 NaN 5.9
4 May/28/2019 NaN 20.2
5 May/29/2019 NaN 20.5
6 May/30/2019 5.1 20.0
7 May/31/2019 5.1 5.1
df.loc[df['Shift']>0.5,'IR'] = np.nan