I need to normalize the values of one dataframe column, Allocation, on a monthly basis.
data=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.001905 9.55 0.0 0.0
2018-11-01 00:15:00 0.001794 9.55 0.0 0.0
2018-11-01 00:30:00 0.001700 9.55 0.0 0.0
2018-11-01 00:45:00 0.001607 9.55 0.0 0.0
This means: for 2018-11, divide Allocation by 11.116; for 2018-12, divide Allocation by 2473.65; and so on. (These values come from a list Volume, where Volume[0] corresponds to 2018-11 and Volume[7] corresponds to 2019-06.)
Date_From is the index and a timestamp.
data_normalized=
Allocation Temperature Precipitation Radiation
Date_From
2018-11-01 00:00:00 0.000171 9.55 0.0 0.0
2018-11-01 00:15:00 0.000161 9.55 0.0 0.0
...
My approach was the use of itertuples:
for row in data.itertuples(index=True, name='index'):
    if row.index == '2018-11':
        data['Allocation'] / Volume[0]
Here, the if statement is never true...
Another approach was
if ((row.index >='2018-11-01 00:00:00') & (row.index<='2018-11-31 23:45:00')):
However, here I get the error TypeError: '>=' not supported between instances of 'builtin_function_or_method' and 'str'
Can I solve my problem with this approach or should I use a different approach? I am happy about any help
Cheers!
Maybe you can put your list Volume in a dataframe where the date (or index) is the first day of every month.
import pandas as pd
import numpy as np
N = 16
date = pd.date_range(start='2018-01-01', periods=N, freq="15d")
df = pd.DataFrame({"date":date, "Allocation":np.random.randn(N)})
# A dataframe that associates a volume with every month
df_vol = pd.DataFrame({"month": pd.date_range(start="2018-01-01", periods=8, freq="MS"),
                       "Volume": np.arange(8) + 1})
# map every date to the first day of its month
# (.astype("datetime64[M]") used to work here, but raises in recent pandas)
df["month"] = df["date"].dt.to_period("M").dt.to_timestamp()
# merge
df1 = pd.merge(df,df_vol, on="month", how="left")
# Divide Allocation by Volume.
# The operation is now vectorized, since every date has been merged with the right volume.
df1["norm"] = df1["Allocation"]/df1["Volume"]
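Alternatively, since Date_From is already a DatetimeIndex, you could map each month directly to its divisor without a merge. A minimal sketch, assuming a Volume list of 8 monthly factors starting at 2018-11 (the data values here are just illustrative):

```python
import pandas as pd

# Toy data mimicking the question's layout
idx = pd.date_range("2018-11-01", periods=4, freq="15min")
data = pd.DataFrame({"Allocation": [0.001905, 0.001794, 0.001700, 0.001607]}, index=idx)
data.index.name = "Date_From"

Volume = [11.116, 2473.65, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # Volume[0] -> 2018-11 ... Volume[7] -> 2019-06
months = pd.period_range("2018-11", periods=len(Volume), freq="M")
vol_map = dict(zip(months, Volume))

# Look up each row's month in the mapping, then divide (vectorized)
divisor = data.index.to_period("M").map(vol_map)
data["Allocation_norm"] = data["Allocation"] / divisor
```

This avoids the merge and keeps the original index intact.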
Related
I have two separate DataFrames, which both contain rainfall amounts and dates corresponding to them.
df1:
time tp
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 0.0
3 2013-01-01 03:00:00 0.0
4 2013-01-01 04:00:00 0.0
... ...
8755 2013-12-31 19:00:00 0.0
8756 2013-12-31 20:00:00 0.0
8757 2013-12-31 21:00:00 0.0
8758 2013-12-31 22:00:00 0.0
8759 2013-12-31 23:00:00 0.0
[8760 rows x 2 columns]
df2:
time tp
0 2013-07-18T18:00:01 0.002794
1 2013-07-18T20:00:00 0.002794
2 2013-07-18T21:00:00 0.002794
3 2013-07-18T22:00:00 0.002794
4 2013-07-19T00:00:00 0.000000
... ...
9656 2013-12-30T13:30:00 0.000000
9657 2013-12-30T23:30:00 0.000000
9658 2013-12-31T00:00:00 0.000000
9659 2013-12-31T00:00:00 0.000000
9660 2014-01-01T00:00:00 0.000000
[9661 rows x 2 columns]
I'm trying to plot a scatter graph comparing the two data frames. The way I'm doing it is by choosing a specific date and time and plotting the df1 tp on one axis and df2 tp on the other axis.
For example,
If the date/time on both dataframes = 2013-12-31 19:00:00, then plot tp for df1 onto x-axis, and tp for df2 on the y-axis.
To solve this, I tried using the following:
df1['dates_match'] = np.where(df1['time'] == df2['time'], 'True', 'False')
which will tell me if the dates match, and if they do I can plot. The problem arises as I have a different number of rows on each dataframe, and most methods only allow comparison of dataframes with exactly the same amount of rows.
Does anyone know of an alternative method I could use to plot the graph?
Thanks in advance!
The main goal is to plot two time series that apparently don't have the same frequency, in order to compare them.
Since the main issue here is the different timestamps, let's tackle that with pandas resample so we have more uniform timestamps for each observation. To take the sum over 30-minute intervals you can do (feel free to change the time interval and the agg function):
df1.set_index("time", inplace=True)
df2.set_index("time", inplace=True)
df1_resampled = df1.resample("30min").sum() # taking the sum of 30-minute intervals
df2_resampled = df2.resample("30min").sum() # taking the sum of 30-minute intervals
Now that the timestamps are more organized, you can merge the resampled dataframes and then plot them:
df_joined = df1_resampled.join(df2_resampled, lsuffix="_1", rsuffix="_2")
df_joined.plot(marker="o", figsize=(12,6))
# df_joined.plot(subplots=True) if you want to plot them separately
Since df1 starts on 2013-01-01 and df2 on 2013-07-18, there will be an initial period where only df1 exists. If you want to keep only the overlapping period, pass how="inner" when joining the two dataframes.
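For the scatter comparison the question actually asks about, a sketch along these lines could work (the frames here are hypothetical stand-ins for the resampled data; the tp_1/tp_2 names come from the join suffixes):

```python
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless

# Hypothetical stand-ins for the 30-minute resampled frames
idx = pd.date_range("2013-07-18 18:00", periods=6, freq="30min")
df1_resampled = pd.DataFrame({"tp": np.linspace(0.0, 0.005, 6)}, index=idx)
df2_resampled = pd.DataFrame({"tp": np.linspace(0.0, 0.004, 6)}, index=idx)

df_joined = df1_resampled.join(df2_resampled, lsuffix="_1", rsuffix="_2")

# Keep only timestamps where both values exist, then scatter one against the other
paired = df_joined.dropna()
ax = paired.plot.scatter(x="tp_1", y="tp_2")
```

The dropna step plays the role of "plot only when the dates match": after resampling and joining, every remaining row is a timestamp present in both series.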
I am having some trouble managing and combining columns in order to get one datetime column out of three columns containing the date, the hours and the minutes.
Assume the following df (copy it and type df = pd.read_clipboard() to reproduce), with the dtypes as noted below:
>>>df
date hour minute
0 2021-01-01 7.0 15.0
1 2021-01-02 3.0 30.0
2 2021-01-02 NaN NaN
3 2021-01-03 9.0 0.0
4 2021-01-04 4.0 45.0
>>>df.dtypes
date object
hour float64
minute float64
dtype: object
I want to replace the three columns with one called 'datetime' and I have tried a few things but I face the following problems:
I first create a 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time and then try to concatenate it with 'date': df['datetime'] = df['date'] + ' ' + df['time'] (with the purpose of then converting the 'datetime' column via pd.to_datetime(df['datetime'])). However, I get
TypeError: can only concatenate str (not "datetime.time") to str
If I convert 'hour' and 'minute' to str in order to concatenate the three columns into 'datetime', I then face the problem of the NaN values, which prevent me from converting 'datetime' to the corresponding type.
I have also tried to first convert the 'date' column with df['date'] = df['date'].astype('datetime64[ns]'), again create the 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time, and combine the two: df['datetime'] = pd.datetime.combine(df['date'], df['time']). This returns
TypeError: combine() argument 1 must be datetime.date, not Series
along with the warning
FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
Is there a generic solution to combine the three columns and ignore the NaN values (assume it could return 00:00:00)?
What if I have a row with all NaN values? Would it be possible to ignore all NaNs so that 'datetime' is NaN for this row?
Thank you in advance, ^_^
First convert date to datetimes, then add the hour and minute timedeltas, replacing missing values with a 0 timedelta:
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date']) +
                  pd.to_timedelta(df['hour'], unit='h').fillna(td) +
                  pd.to_timedelta(df['minute'], unit='m').fillna(td))
print (df)
date hour minute datetime
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
Or you can use Series.add with fill_value=0:
df['datetime'] = (pd.to_datetime(df['date'])
                  .add(pd.to_timedelta(df['hour'], unit='h'), fill_value=0)
                  .add(pd.to_timedelta(df['minute'], unit='m'), fill_value=0))
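Regarding the follow-up about rows where everything is missing: with the timedelta approach a missing date already propagates to NaT, so no extra handling is needed. A small sketch on toy data (not the question's exact frame):

```python
import pandas as pd

# Toy frame with one fully-missing row (row 2)
df = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", None],
                   "hour": [7.0, None, None],
                   "minute": [15.0, None, None]})

td = pd.Timedelta(0)
df["datetime"] = (pd.to_datetime(df["date"]) +
                  pd.to_timedelta(df["hour"], unit="h").fillna(td) +
                  pd.to_timedelta(df["minute"], unit="m").fillna(td))
# Row 1 (date present, hour/minute missing) becomes midnight;
# row 2 (everything missing) stays NaT, because NaT + timedelta is NaT.
```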
I would recommend converting hour and minute columns to string and constructing the datetime string from the provided components.
Logically, you need to perform the following steps:
Step 1. Fill missing values for hour and minute with zeros.
df['hour'] = df['hour'].fillna(0)
df['minute'] = df['minute'].fillna(0)
Step 2. Convert float values for hour and minute into integer ones, because your final output should look like 2021-01-01 7:15, not 2021-01-01 7.0:15.0.
df['hour'] = df['hour'].astype(int)
df['minute'] = df['minute'].astype(int)
Step 3. Convert integer values for hour and minute to the string representation.
df['hour'] = df['hour'].astype(str)
df['minute'] = df['minute'].astype(str)
Step 4. Concatenate date, hour and minute into one column of the correct format.
df['result'] = df['date'].str.cat(df['hour'].str.cat(df['minute'], sep=':'), sep=' ')
Step 5. Convert your result column to datetime object.
pd.to_datetime(df['result'])
It is also possible to perform all of these steps in one command, though it reads a bit messy:
df['result'] = pd.to_datetime(df['date'].str.cat(df['hour'].fillna(0).astype(int).astype(str).str.cat(df['minute'].fillna(0).astype(int).astype(str), sep=':'), sep=' '))
Result:
date hour minute result
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, discarding the extra data, in order to run some regression analysis?
You can use the function combine_first of pandas Series. This function selects the element of the calling object if both series contain the same index.
Following code shows a minimum example:
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
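Since the question asks to discard the non-overlapping data before running a regression, a sketch using index intersection (keeping only timestamps present in both series) may be closer to that goal:

```python
import pandas as pd

idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)

# Keep only the timestamps both series share
common = ts1.index.intersection(ts2.index)
ts1_aligned = ts1.loc[common]
ts2_aligned = ts2.loc[common]
```

The two aligned series now have identical indexes, so they can be fed directly into a regression.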
Here is Python code which tries to read a CSV file from an alphavantage URL and convert it to a pandas dataframe. There are multiple issues with it.
Before raising the issues, here is the code:
dailyurl = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=NSE:{}&apikey=key&outputsize=full&datatype=csv'.format(Ticker)
cols = ['timestamp', 'open', 'high', 'low', 'close','adjusted_close','volume','dividend_amount','split_coefficient']
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None,names=cols)
dfmonthly = pd.read_csv(monthlyurl, skiprows=0, header=None,names=cols)
dfdaily.rename(columns = {'timestamp':'date'}, inplace = True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.drop(dfdaily.index[:1], inplace=True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.reset_index(inplace=True, drop=False)
print(dfdaily.head(6))
Issues:
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None, names=cols) seems to return something that does not quite match a pandas dataframe (it looks like it contains strings), so when I use this dataframe I get the error "high is not double".
This URL's return value contains a multi-index, as below:
0 1 2 3 4
0 Timestamp open High Low close
1 09-02-2017 100 110 99 96
The 0,1,2,3,4 column index above is not wanted, hence I added
dfdaily.drop(dfdaily.index[:1], inplace=True). Is there a better way to convert this CSV into a pandas dataframe?
As I see the read values are strings, I tried making the dataframe numeric with this line:
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
but this converts the date values to 0.0, which defeats the purpose; the dates should be retained as they are. Also, with this many lines of code just for the conversion it takes a lot of time, so a better way of getting the desired output is needed.
The output I am getting is :
index date open high low close adjusted_close volume
0 1 0.0 1629.05 1655.00 1617.30 1639.40 1639.40 703720.0
1 2 0.0 1654.00 1679.00 1638.05 1662.15 1662.15 750746.0
2 3 0.0 1680.00 1687.00 1620.60 1641.65 1641.65 1466983.0
3 4 0.0 1530.00 1683.75 1511.20 1662.15 1662.15 2109416.0
4 5 0.0 1600.00 1627.95 1546.50 1604.95 1604.95 1472164.0
5 6 0.0 1708.05 1713.00 1620.20 1628.90 1628.90 1645045.0
The multi-index is not required, the date should remain a date (not "0"),
and the other columns (open, high, low, close) should be in numerical format.
Please shed some light on this: a clean piece of code that yields a numerical pandas dataframe indexed by "date", usable for further arithmetic and logical operations, would be appreciated.
I think you need to omit the parameter names, because the csv already has a header. Also, for a DatetimeIndex, add the parameter index_col to set the first column as the index and parse_dates to convert it to datetimes. Last, rename_axis renames timestamp to date:
dfdaily = pd.read_csv(dailyurl, index_col=[0], parse_dates=[0])
dfdaily = dfdaily.rename_axis('date')
print (dfdaily.head())
open high low close adjusted_close volume \
date
2018-02-09 20.25 21.0 20.25 20.25 20.25 21700
2018-02-08 20.50 20.5 20.25 20.50 20.50 1688900
2018-02-07 20.50 20.5 20.25 20.50 20.50 301800
2018-02-06 20.25 21.0 20.25 20.25 20.25 39400
2018-02-05 20.50 21.0 20.25 20.50 20.50 5400
dividend_amount split_coefficient
date
2018-02-09 0.0 1.0
2018-02-08 0.0 1.0
2018-02-07 0.0 1.0
2018-02-06 0.0 1.0
2018-02-05 0.0 1.0
print (dfdaily.dtypes)
open float64
high float64
low float64
close float64
adjusted_close float64
volume int64
dividend_amount float64
split_coefficient float64
dtype: object
print (dfdaily.index)
DatetimeIndex(['2018-02-09', '2018-02-08', '2018-02-07', '2018-02-06',
'2018-02-05', '2018-02-02', '2018-02-01', '2018-01-31',
'2018-01-30', '2018-01-29',
...
'2000-01-14', '2000-01-13', '2000-01-12', '2000-01-11',
'2000-01-10', '2000-01-07', '2000-01-06', '2000-01-05',
'2000-01-04', '2000-01-03'],
dtype='datetime64[ns]', name='date', length=4556, freq=None)
Sorry if this seems like a stupid question.
I have a dataset which looks like this
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed
T 2017-10-07 10:44:48 28.750766667 77.088805000 783.5 0.0 2017-10-07_10-44-48 0.0 00:00:00
T 2017-10-07 10:44:58 28.752345000 77.087840000 853.5 7.8 198.70532 00:00:10
T 2017-10-07 10:45:00 28.752501667 77.087705000 854.5 7.7 220.53915 00:00:12
I'm not exactly sure how to approach this. Calculating acceleration requires taking the difference of speed and time; any suggestions on what I may try?
Thanks in advance
Assuming your data was loaded from a CSV as follows:
type,time,latitude,longitude,altitude (m),speed (km/h),name,desc,currentdistance,timeelapsed
T,2017-10-07 10:44:48,28.750766667,77.088805000,783.5,0.0,2017-10-07_10-44-48,,0.0,00:00:00
T,2017-10-07 10:44:58,28.752345000,77.087840000,853.5,7.8,,,198.70532,00:00:10
T,2017-10-07 10:45:00,28.752501667,77.087705000,854.5,7.7,,,220.53915,00:00:12
The time column is converted to a datetime object, and the timeelapsed column is converted into seconds. From this you could add an acceleration column by
calculating the difference in speed (km/h) between each row and dividing by the difference in time between each row as follows:
from datetime import datetime
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv', parse_dates=['time'], dtype={'name':str, 'desc':str})
df['timeelapsed'] = (pd.to_datetime(df['timeelapsed'], format='%H:%M:%S') - datetime(1900, 1, 1)).dt.total_seconds()
df['acceleration'] = (df['speed (km/h)'] - df['speed (km/h)'].shift(1)) / (df['timeelapsed'] - df['timeelapsed'].shift(1))
print(df)
Giving you:
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed acceleration
0 T 2017-10-07 10:44:48 28.750767 77.088805 783.5 0.0 2017-10-07_10-44-48 NaN 0.00000 0.0 NaN
1 T 2017-10-07 10:44:58 28.752345 77.087840 853.5 7.8 NaN NaN 198.70532 10.0 0.78
2 T 2017-10-07 10:45:00 28.752502 77.087705 854.5 7.7 NaN NaN 220.53915 12.0 -0.05
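The shift-and-subtract above can also be written with Series.diff, which is equivalent; a quick sketch on the same three rows:

```python
import pandas as pd

df = pd.DataFrame({"speed (km/h)": [0.0, 7.8, 7.7],
                   "timeelapsed": [0.0, 10.0, 12.0]})

# diff() is shorthand for x - x.shift(1)
df["acceleration"] = df["speed (km/h)"].diff() / df["timeelapsed"].diff()
```

The first row has no predecessor, so its acceleration is NaN, matching the output above.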