Multiply DataFrame columns by values from an array and sum them into a new column - python

How can I multiply an array by the columns of a dataframe and then sum those products into a new column of the dataframe?
I tried it with the code below but somehow get wrong numbers:
AAPL Portfolio ACN
Date
2017-01-03 116.150002 1860.880008 116.459999
2017-01-04 116.019997 1862.079960 116.739998
2017-01-05 116.610001 1852.799992 114.989998
2017-01-06 117.910004 1873.680056 116.300003
...
It should look like the following:
AAPL Portfolio ACN
Date
2017-01-03 116.150002 1046.900003 116.459999
2017-01-04 116.019997 1047.779978 116.739998
2017-01-05 116.610001 1041.389994 114.989998
2017-01-06 117.910004 1053.140031 116.300003
...
The code looks like the following. It might be that I am overcomplicating this, and therefore the code makes no sense:
import pandas_datareader.data as pdr
import pandas as pd
import datetime
start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2017, 3, 17)
ticker_list = ["AAPL","ACN"]
position_size = [4,5]
for i in range(0, len(ticker_list)):
    # print(i)
    DataInitial = pdr.DataReader(ticker_list[i], 'yahoo', start, end)
    ClosingPrices[ticker_list[i]] = DataInitial[['Close']]
    ClosingPrices['Portfolio'] = ClosingPrices['Portfolio'] + ClosingPrices[ticker_list[i]] * position_size[i]
print(ClosingPrices)
What I want is actually:
2017-01-03: 116.150002*4 + 116.459999*5
2017-01-04: 116.019997*4 + 116.739998*5
etc...

If you need:
2017-01-03: 116.150002*4 + 116.459999*5
2017-01-04: 116.019997*4 + 116.739998*5
then multiply each column by its value from a dict, concat the results, and finally sum all the columns together:
ticker_list = ["AAPL","ACN"]
position_size = [4,5]
d = dict(zip(ticker_list,position_size))
print (pd.concat([ClosingPrices[col] * d[col] for col in ticker_list], axis=1))
AAPL ACN
Date
2017-01-03 400.000000 500.000000
2017-01-04 464.079988 583.699990
2017-01-05 466.440004 574.949990
2017-01-06 471.640016 581.500015
ClosingPrices['Portfolio'] = pd.concat([ClosingPrices[col] * d[col] for col in ticker_list],
                                       axis=1).sum(axis=1)
print (ClosingPrices)
AAPL Portfolio ACN
Date
2017-01-03 100.000000 900.000000 100.000000 <- for testing, values were changed to 100
2017-01-04 116.019997 1047.779978 116.739998
2017-01-05 116.610001 1041.389994 114.989998
2017-01-06 117.910004 1053.140031 116.300003
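As an alternative to the concat approach, the weighted sum can also be written as a single vectorized expression with DataFrame.mul. A minimal sketch, using made-up prices copied from the question's sample:

```python
import pandas as pd

# hypothetical closing prices mirroring the question's frame
ClosingPrices = pd.DataFrame(
    {'AAPL': [116.150002, 116.019997],
     'ACN': [116.459999, 116.739998]},
    index=pd.to_datetime(['2017-01-03', '2017-01-04']))

ticker_list = ["AAPL", "ACN"]
position_size = [4, 5]

# multiply each ticker column by its position size (aligned by position),
# then sum across the columns row by row
ClosingPrices['Portfolio'] = ClosingPrices[ticker_list].mul(position_size).sum(axis=1)
```

This avoids building the intermediate dict and concat entirely; the list of position sizes is broadcast across the selected columns.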

Related

Create a dataframe from a date range in python

Given an interval defined by two dates, each a Python Timestamp:
create_interval('2022-01-12', '2022-01-17', 'Holidays')
Create the following dataframe:
date                  interval_name
2022-01-12 00:00:00   Holidays
2022-01-13 00:00:00   Holidays
2022-01-14 00:00:00   Holidays
2022-01-15 00:00:00   Holidays
2022-01-16 00:00:00   Holidays
2022-01-17 00:00:00   Holidays
If it can be in a few lines of code I would appreciate it. Thank you very much for your help.
If you're open to using Pandas, this should accomplish what you've requested
import pandas as pd
def create_interval(start, end, field_val):
    # set up the index date range
    idx = pd.date_range(start, end)
    # create the dataframe using the index above, with an empty interval_name column
    df = pd.DataFrame(index=idx, columns=['interval_name'])
    # set the index name
    df.index.names = ['date']
    # fill all rows of the 'interval_name' column with the field_val parameter
    df.interval_name = field_val
    return df
create_interval('2022-01-12', '2022-01-17', 'holiday')
I hope I coded exactly what you need.
import pandas as pd
def create_interval(ts1, ts2, interval_name):
    ts_list_dt = pd.date_range(start=ts1, end=ts2).to_pydatetime().tolist()
    ts_list = list(map(str, ts_list_dt))
    d = {'date': ts_list, 'interval_name': [interval_name] * len(ts_list)}
    df = pd.DataFrame(data=d)
    return df
df = create_interval('2022-01-12', '2022-01-17', 'Holidays')
print(df)
output:
date interval_name
0 2022-01-12 00:00:00 Holidays
1 2022-01-13 00:00:00 Holidays
2 2022-01-14 00:00:00 Holidays
3 2022-01-15 00:00:00 Holidays
4 2022-01-16 00:00:00 Holidays
5 2022-01-17 00:00:00 Holidays
If you want the DataFrame without the integer index column, use df = df.set_index('date') after creating the DataFrame with df = pd.DataFrame(data=d). Then you will get:
date interval_name
2022-01-12 00:00:00 Holidays
2022-01-13 00:00:00 Holidays
2022-01-14 00:00:00 Holidays
2022-01-15 00:00:00 Holidays
2022-01-16 00:00:00 Holidays
2022-01-17 00:00:00 Holidays
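Since the question asked for as few lines as possible, the whole function can also be collapsed into a single return statement; a minimal sketch along the same lines as the answers above:

```python
import pandas as pd

def create_interval(start, end, name):
    # pd.DataFrame broadcasts the scalar name across the whole date index,
    # and naming the date_range sets the index name in one step
    return pd.DataFrame({'interval_name': name},
                        index=pd.date_range(start, end, name='date'))

df = create_interval('2022-01-12', '2022-01-17', 'Holidays')
```

Scalar broadcasting only works because an explicit index is supplied; without the index argument, a scalar column value would raise an error.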

Filling in missing hourly data in Pandas

I have a dataframe containing time series with hourly measurements with the following structure: name, time, output. For each name the measurements come from more or less the same time period. I am trying to fill in the missing values, such that for each day all 24h appear in the time column.
So I'm expecting a table like this:
name time output
x 2018-02-22 00:00:00 100
...
x 2018-02-22 23:00:00 200
x 2018-02-24 00:00:00 300
...
x 2018-02-24 23:00:00 300
y 2018-02-22 00:00:00 100
...
y 2018-02-22 23:00:00 200
y 2018-02-25 00:00:00 300
...
y 2018-02-25 23:00:00 300
For this I groupby name and then try to apply a custom function that adds the missing timestamps in the corresponding dataframe.
def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq='1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = df.append(pd.DataFrame({'GSRN': [name] * len(new_dates), 'time': new_dates}))
    return df
For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
Edit 1:
This is different than the [question here][1] because they didn't need all 24 values/day -- resampling between 2pm and 10pm will only give the values in between.
Edit 2:
I found a (not great) solution by creating a multi index with all name-timestamps pairs and combining with the table. Code below for anyone interested, but still interested in a better solution:
start_date = datetime.datetime.combine(df.time.min().date(),datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(),datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq = '1H')
mux = pd.MultiIndex.from_product([df['name'].unique(),new_idx], names=('name','time'))
df_complete = pd.DataFrame(index=mux).reset_index().combine_first(df)
df_complete = df_complete.groupby(["name",df_complete.time.dt.date]).filter(lambda g: (g["output"].count() == 0))
The last line removes any days that were completely missing for the specific name in the initial dataframe.
Try this:
First create a dataframe covering min date to max date with an hourly interval, then join the original frame onto it.
df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
dates_range = pd.date_range(min_date, max_date, freq = '1H')
df.set_index('time', inplace=True)
df3=pd.DataFrame(dates_range).set_index(0)
df4 = df3.join(df)
df4:
name output
2018-02-22 00:00:00 x 100.0
2018-02-22 00:00:00 y 100.0
2018-02-22 01:00:00 NaN NaN
2018-02-22 02:00:00 NaN NaN
2018-02-22 03:00:00 NaN NaN
... ... ...
2018-02-25 19:00:00 NaN NaN
2018-02-25 20:00:00 NaN NaN
2018-02-25 21:00:00 NaN NaN
2018-02-25 22:00:00 NaN NaN
2018-02-25 23:00:00 y 300.0
98 rows × 2 columns
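The join above mixes all names into one hourly index; to fill each name's days separately, as the question asks, a groupby with a per-group reindex is one option. A sketch under assumed sample data (the name/time/output values are made up to mirror the question's layout):

```python
import pandas as pd

# hypothetical frame with hourly gaps, mirroring the question's structure
df = pd.DataFrame({
    'name': ['x', 'x', 'y'],
    'time': pd.to_datetime(['2018-02-22 00:00:00',
                            '2018-02-22 02:00:00',
                            '2018-02-22 01:00:00']),
    'output': [100, 150, 120],
})

# for each name, reindex its series onto a full hourly range spanning
# midnight of its first day through 23:00 of its last day
filled = (df.set_index('time')
            .groupby('name')['output']
            .apply(lambda s: s.reindex(pd.date_range(
                s.index.min().normalize(),
                s.index.max().normalize() + pd.Timedelta(hours=23),
                freq='h')))
            .rename_axis(['name', 'time'])
            .reset_index())
```

Missing hours come back as NaN in output, ready for interpolation or filling; each name keeps its own date span, unlike a single shared index.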

pandas period_range - gaining access to the values

pd.period_range(start='2017-01-01', end='2017-01-01', freq='Q') gives me the following:
PeriodIndex(['2017Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
I would like to gain access to '2017Q1' to put it into a different column in the dataframe.
I have a date column with dates, e.g., 1/1/2017. I have other column where I'd like to put the string calculated by the period range. It seems like that would be an efficient way to update the column with the fiscal quarter. I can't seem to access it. It isn't subscriptable, and I can't get at it even when I assign it to a variable. Just wondering what I'm missing.
pandas.PeriodIndex:
You were almost there. It just needed to be assigned to a column
import numpy as np
from datetime import timedelta, datetime
import pandas as pd
# list of dates - test data
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
# create the dataframe
df = pd.DataFrame({'dates': list_of_dates})
dates
2017-01-01
2017-01-05
2017-01-09
2017-01-13
2017-01-17
df['Quarters'] = pd.PeriodIndex(df.dates, freq='Q-DEC')
Output:
print(df.head())
dates Quarters
2017-01-01 2017Q1
2017-01-05 2017Q1
2017-01-09 2017Q1
2017-01-13 2017Q1
2017-01-17 2017Q1
print(df.tail())
dates Quarters
2019-08-31 2019Q3
2019-09-04 2019Q3
2019-09-08 2019Q3
2019-09-12 2019Q3
2019-09-16 2019Q3
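It is also worth noting that, contrary to the question's assumption, a PeriodIndex is positionally indexable, so a single quarter label can be pulled out directly; a minimal sketch:

```python
import pandas as pd

pr = pd.period_range(start='2017-01-01', end='2017-01-01', freq='Q')
# a PeriodIndex supports positional indexing; pr[0] is a Period,
# and str() gives the quarter label
label = str(pr[0])
```

For filling a whole column, the PeriodIndex assignment shown in the answer above is still the efficient route; element access is only needed for one-off lookups.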

Pandas normalize column indexed by datetimeindex by sum of groupby date

If given a dataframe that's indexed with a datetimeindex, is there an efficient way to normalize the values within a given day? For example I'd like to sum all values for each day, and then divide each columns values by the resulting sum for the day.
I can easily group by date and calculate the divisor (sum of values of each column for each date) but I'm not entirely sure the best way to divide the original dataframe by the resulting sum df.
Example dataframe with datetimeindex and resulting df from sum
I attempted to do something like
df / df.groupby(df.index.to_period('D')).sum()
however it isn't behaving as I would have hoped for.
Instead I'm getting a df with NaN everywhere and Date appended as new indexes.
i.e
Results from above division
Toy recreation:
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=['a', 'b'],
                  index=pd.to_datetime(['2017-01-01 14:30:00', '2017-01-01 14:31:00',
                                        '2017-01-02 14:30:00', '2017-01-02 14:31:00']))
df / df.groupby(df.index.to_period('D')).sum()
results in
a b
2017-01-01 14:30:00 NaN NaN
2017-01-01 14:31:00 NaN NaN
2017-01-02 14:30:00 NaN NaN
2017-01-02 14:31:00 NaN NaN
2017-01-01 NaN NaN
2017-01-02 NaN NaN
You will need to copy and paste your dataframe as text, not an image, so I can help further, but here is an example:
sample df
df1 = pd.DataFrame(np.random.randn(5, 5), columns=list('ABCDE'),
                   index=pd.date_range('2017-01-03', '2017-01-07'))
df2 = pd.DataFrame(np.random.randn(5, 5), columns=list('ABCDE'),
                   index=pd.date_range('2017-01-03', '2017-01-07'))
df = pd.concat([df1,df2])
A B C D E
2017-01-03 1.393874 1.933301 0.215026 -0.412957 -0.293925
2017-01-04 0.825777 0.315449 2.317292 -0.347617 -2.427019
2017-01-05 -0.372916 -0.931185 0.049707 0.635828 -0.774566
2017-01-06 1.564714 -1.582461 1.455403 0.521305 -2.175344
2017-01-07 1.255747 1.967338 -0.766391 -0.021921 0.672704
2017-01-03 0.620301 -1.521681 -0.352800 -1.394239 -1.206983
2017-01-04 -0.041829 -0.870871 -0.402440 0.268725 1.499321
2017-01-05 -1.098647 1.690136 1.004087 0.304037 1.235310
2017-01-06 0.305645 -0.327096 0.280591 -0.476904 1.652096
2017-01-07 1.251927 0.469697 0.047694 1.838995 -0.258889
then what you are currently doing:
df / df.groupby(df.index).sum()
A B C D E
2017-01-03 0.692032 4.696817 -1.560723 0.228507 0.195831
2017-01-03 0.307968 -3.696817 2.560723 0.771493 0.804169
2017-01-04 1.053357 -0.567944 1.210167 4.406211 2.616174
2017-01-04 -0.053357 1.567944 -0.210167 -3.406211 -1.616174
2017-01-05 0.253415 -1.226937 0.047170 0.676510 -1.681122
2017-01-05 0.746585 2.226937 0.952830 0.323490 2.681122
2017-01-06 0.836585 0.828706 0.838369 11.740853 4.157386
2017-01-06 0.163415 0.171294 0.161631 -10.740853 -3.157386
2017-01-07 0.500762 0.807267 1.066362 -0.012064 1.625615
2017-01-07 0.499238 0.192733 -0.066362 1.012064 -0.625615
Take a look at the first row col A
1.393874 / (1.393874 + 0.620301) = 0.6920322216292031 so your example of df / df.groupby(df.index).sum() is working as expected.
Also be careful if your data contains NaNs because np.nan / a number = nan
update per comment:
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=['a', 'b'],
                  index=pd.to_datetime(['2017-01-01 14:30:00', '2017-01-01 14:31:00',
                                        '2017-01-02 14:30:00', '2017-01-02 14:31:00']))
# create a multiindex with level 1 being just dates
df.set_index(df.index.floor('D'), inplace=True, append=True)
# divide df by the group sum matching the index values of level 1
df.div(df.groupby(level=1).sum(), level=1).reset_index(level=1, drop=True)
a b
2017-01-01 14:30:00 0.250000 0.333333
2017-01-01 14:31:00 0.750000 0.666667
2017-01-02 14:30:00 0.416667 0.428571
2017-01-02 14:31:00 0.583333 0.571429
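A shorter route to the same result, avoiding the multiindex entirely, is groupby with transform; a sketch using the same toy frame from the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=['a', 'b'],
                  index=pd.to_datetime(['2017-01-01 14:30:00', '2017-01-01 14:31:00',
                                        '2017-01-02 14:30:00', '2017-01-02 14:31:00']))

# transform('sum') broadcasts each day's column sums back onto the original
# index, so the division aligns row-for-row with no reindexing or NaNs
normalized = df / df.groupby(df.index.normalize()).transform('sum')
```

This sidesteps the alignment problem in the question: the original division failed because the grouped sum's daily index did not match the original minute-level index.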

Time series prediction, make new X, new row from some past rows

For time series prediction I am using pandas.
This is a sample of my data frame:
Close Price
DateTime
2017-01-02 23:00:00 1.04630
2017-01-02 23:30:00 1.04575
2017-01-03 00:00:00 1.04672
2017-01-03 00:30:00 1.04662
2017-01-03 01:00:00 1.04766
......
in my X matrix for sklearn prediction I want to have something like this:
use the 3 past rows as input for making a new row
X:
ClosePrice ClosePrice-1 ClosePrice-2 ClosePrice-3
2017-01-03 00:30:00 1.04662 1.04672 1.04575 1.04630
2017-01-03 01:00:00 1.04766 1.04662 1.04672 1.04575
...
What is the best method?
Is there a way to use a pandas function to do this?
Thanks a lot.
What is the best method if I want to use n instead of 3?
This worked:
for i in range(1, NumberOfLastData):
    ColunmNameHighLowBin = 'HighLowBin-' + str(i)
    ColunmNameOpenCloseBin = 'OpenCloseBin-' + str(i)
    X[ColunmNameHighLowBin] = X['HighLowBin'].shift(i)
    X[ColunmNameOpenCloseBin] = X['OpenCloseBin'].shift(i)
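Applied to the question's Close Price column, the same shift idea builds the lag matrix directly; a sketch using prices copied from the question's sample (n and the column names are illustrative):

```python
import pandas as pd

# hypothetical half-hourly closes mirroring the question's frame
df = pd.DataFrame({'ClosePrice': [1.04630, 1.04575, 1.04672, 1.04662, 1.04766]},
                  index=pd.to_datetime(['2017-01-02 23:00', '2017-01-02 23:30',
                                        '2017-01-03 00:00', '2017-01-03 00:30',
                                        '2017-01-03 01:00']))

n = 3  # number of lagged feature columns to build
for i in range(1, n + 1):
    # shift(i) moves values down i rows, giving the price i steps in the past
    df[f'ClosePrice-{i}'] = df['ClosePrice'].shift(i)

X = df.dropna()  # the first n rows lack a full set of lags
```

The dropna trims the leading rows where the lags are undefined, leaving exactly the lag matrix shown in the question.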
