I'm trying to change a data frame with the following contents:
      Date        Change
1802  2017-09-14  -1.14%
462   2021-05-16     NaN
935   2020-01-29   0.04%
713   2020-09-07   2.39%
1471  2018-08-11     NaN
[1460 rows × 2 columns]
Into this:
<TimeSeries (DataArray) (Month: 144, component: 1, sample: 1)>
array([[[112.]],
       [[118.]],
       [[132.]],
       [[129.]],
       [[121.]],
       [[135.]],
       [[148.]],
       [[148.]],
       [[136.]],
       ...
Coordinates:
  * Month      (Month) datetime64[ns] 2019-01-01 ... 2021-12-01
  * component  (component) object 'Change'
Attributes:
    static_covariates: None
    hierarchy: None
I need it in this format in order to run a neural network model on multiple time series.
Any help or advice is greatly appreciated!
The solution required removing the '%' sign from the column values and then converting the column to a float:
ftse_change['Change'] = ftse_change['Change'].str.rstrip('%').astype('float') / 100.0
did the trick
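For the remaining step of going from the cleaned dataframe to the target format, here is a minimal sketch assuming the darts library is being used and that the data is daily; the freq value is an assumption, not something stated in the post:
import pandas as pd
from darts import TimeSeries

# make sure Date is a real datetime column
ftse_change['Date'] = pd.to_datetime(ftse_change['Date'])

series = TimeSeries.from_dataframe(
    ftse_change,
    time_col='Date',
    value_cols='Change',
    fill_missing_dates=True,  # insert NaN rows so the time index is regular
    freq='D',                 # assumed daily frequency
)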
I have a pandas df with 5181 rows and with a column of customer names and I have a separate list of 383 customer names from within that column whose corresponding rows I want to drop from the df. I tried to write a piece of code that would iterate through all the names in the customer column and drop each of the rows with customer names matching those on the list. My result is TypeError: 'NoneType' object is not subscriptable.
The list is called Retail_Customer_Tracking and the df is called df_final and looks like:
index Customer First_Order_Date Last_Order_Date
0 0 0 2022-09-15 2022-09-15
1 1 287 2018-02-19 2020-11-30
2 2 606 2017-10-31 2017-12-07
3 3 724 2021-12-28 2022-09-15
4 4 1025 2015-08-13 2015-08-13
... ... ... ... ...
5176 5176 tulips little pop up shop 2021-10-25 2022-10-08
5177 5177 unboxed 2021-06-24 2022-10-10
5178 5178 upMADE 2021-09-10 2022-03-31
5179 5179 victorias floral design 2021-07-12 2021-07-12
5180 5180 vintique marketplace 2021-03-16 2022-10-15
5181 rows × 4 columns
The code I wrote looks like:
i = 0
for x in Retail_Customer_Tracking:
    while i < 5182:
        if df_final["Customer"].iloc[i] == x:
            df_final = df_final.drop(df_final[i], axis=0, inplace=True)
        else:
            i = i + 1
I was hoping that the revised df_final would not have the rows I wanted to drop...
I'm very new at coding and any help would be greatly appreciated. Thanks!
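For reference, a minimal sketch of the usual vectorized way to drop those rows (using the df_final and Retail_Customer_Tracking names from the post), rather than looping:
# keep only the rows whose Customer is NOT in the tracking list
df_final = df_final[~df_final['Customer'].isin(Retail_Customer_Tracking)].reset_index(drop=True)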
I'm pretty new to time series.
This is the dataset I'm working on:
Date Price Location
0 2012-01-01 1771.0 Marche
1 2012-01-01 1039.0 Calabria
2 2012-01-01 2193.0 Campania
3 2012-01-01 2015.0 Emilia-Romagna
4 2012-01-01 1483.0 Friuli-Venezia Giulia
... ... ... ...
2475 2022-04-01 1963.0 Lazio
2476 2022-04-01 1362.0 Friuli-Venezia Giulia
2477 2022-04-01 1674.0 Emilia-Romagna
2478 2022-04-01 1388.0 Marche
2479 2022-04-01 1103.0 Abruzzo
I'm trying to build an LSTM for price prediction, but I don't know how to manage the Location categorical feature: do I have to use one-hot encoding or a groupby?
What I want to predict is the price based on the location.
How can I achieve that? A Python solution is particularly appreciated.
Thanks in advance.
Suppose my dataset (df) is analogous to yours:
Date Price Location
0 2021-01-01 791.076890 Campania
1 2021-01-01 705.702464 Lombardia
2 2021-01-01 719.991382 Sicilia
3 2021-02-01 825.760917 Lombardia
4 2021-02-01 747.734309 Sicilia
... ... ... ...
31 2021-11-01 886.874348 Lombardia
32 2021-11-01 935.040583 Campania
33 2021-12-01 771.165378 Sicilia
34 2021-12-01 952.255227 Campania
35 2021-12-01 939.754515 Lombardia
In my case I have a Price record for 3 regions (Campania, Lombardia, Sicilia) every month. My idea is to treat the different regions as different features, so I would transform df as:
df = df.set_index(["Date", "Location"]).Price.unstack()
Now my dataset is like:
Location Campania Lombardia Sicilia
Date
2021-01-01 791.076890 705.702464 719.991382
2021-02-01 758.872755 825.760917 747.734309
2021-03-01 880.038005 803.165998 837.738419
... ... ... ...
2021-10-01 908.402345 805.081193 792.369610
2021-11-01 935.040583 886.874348 736.862025
2021-12-01 952.255227 939.754515 771.165378
After this, please make sure there are no NaN values (check with df.isna().sum()).
Now you can pass this data to a multi-feature RNN (or LSTM), as in this example, or to a multi-channel 1D-CNN (choosing an appropriate kernel size). The only problem in both cases could be the small size of the dataset, so try not to over-parameterize the model (for example by reducing the number of neurons and layers), otherwise over-fitting will be unavoidable. To check for this, you can test the model on the last 20% of your time series:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, shuffle=False, test_size=.2)
The last part is to build matching (X, Y) pairs for supervised learning, but this depends on what model you are using and what your prediction task is. Another example here.
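For instance, here is a minimal sketch of building (X, Y) with a sliding window over the unstacked dataframe; the make_supervised helper and the 12-month lookback are illustrative assumptions, not part of the original answer:
import numpy as np

def make_supervised(values, lookback=12):
    # values: array of shape (n_timesteps, n_regions)
    X, Y = [], []
    for i in range(len(values) - lookback):
        X.append(values[i:i + lookback])   # past `lookback` months, all regions
        Y.append(values[i + lookback])     # the following month, all regions
    return np.array(X), np.array(Y)

# df_train comes from the split above; lookback=12 is an arbitrary choice
X_train, Y_train = make_supervised(df_train.to_numpy(), lookback=12)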
I am attempting to create a downward-velocity model for offshore drilling. It uses two variables: Depth, which increases by 1 foot per row, and DateTime, which is more intermittent and is only updated at each foot of depth:
Dept DateTime
1141 5/24/2017 04:31
1142 5/24/2017 04:32
1143 5/24/2017 04:40
1144 5/24/2017 04:42
1145 5/25/2017 04:58
I am trying to get something like this, where Velocity is computed row by row as the change in Dept divided by the corresponding DateTime gap.
If you are happy to use a 3rd party library, this is straightforward with Pandas:
import pandas as pd
# read file into dataframe
df = pd.read_csv('file.csv')
# convert series to datetime
df['DateTime'] = pd.to_datetime(df['DateTime'])
# compute velocity in feet per minute (depth difference / minutes elapsed)
df['Velocity'] = df['Dept'].diff() / (df['DateTime'].diff().dt.total_seconds() / 60)
# export to csv
df.to_csv('file_out.csv', index=False)
print(df)
# Dept DateTime Velocity
# 0 1141 2017-05-24 04:31:00 NaN
# 1 1142 2017-05-24 04:32:00 1.000000
# 2 1143 2017-05-24 04:40:00 0.125000
# 3 1144 2017-05-24 04:42:00 0.500000
# 4 1145 2017-05-25 04:58:00 0.000687
I have a large dataset that includes categorical data which are my labels (non-uniform timestamps). I have another dataset which is an aggregate of the measurements.
When I want to assemble these two datasets, they have two different timestamps (aggregated vs non-aggregated).
Categorical dataframe (df_Label)
count 1185
unique 10
top ABCD
freq 1165
Aggregated dataset (MeasureAgg)
In order to assemble the label dataframe with the measurement dataframe,
I use df_Label=df_Label.reindex(MeasureAgg.index, method='nearest')
The issue is that the result of this reindexing will eliminate many of my labels, so the df.describe() will be:
count 4
unique 2
top ABCD
freq 3
I looked into several of the lines where the labels get replaced by NaN but couldn't find any indication of where this comes from.
I suspected that this issue might be due to clustering of the labels between two timestamps, which would eliminate many of them, but this is not the case.
I tried this for a fabricated dataset and it worked as expected, but I am not sure why it is not working in my case: df_Label=df_Label.reindex(MeasureAgg.index, method='nearest')
My apologies for the vague nature of my question; I couldn't replicate the issue with a fabricated dataset (for the fabricated dataset it worked fine). I would greatly appreciate it if anyone can guide me to an alternative or modified way to assemble these two dataframes.
Thanks in advance
Update:
There is only the timestamp index, since the data is mostly missing:
df_Label.head(5)
Time
2000-01-01 00:00:10.870 NaN
2000-01-01 00:00:10.940 NaN
2000-01-01 00:00:11.160 NaN
2000-01-01 00:00:11.640 NaN
2000-01-01 00:00:12.460 NaN
Name: SUM, dtype: object
df_Label.describe()
count 1185
unique 10
top 9_33_2_0_0_0
freq 1165
Name: SUM, dtype: object
MeasureAgg.head(5)
Time mean std skew kurt
2000-01-01 00:00:00 0.0 0.0
2010-01-01 00:00:00 0.0
2015-01-01 00:00:00
2015-12-01 00:00:00
2015-12-01 12:40:00 0.0
MeasureAgg.describe()
mean std skew kurt
count 407.0 383.0 382.0 382.0
mean 487.3552791234544 35.67631749396375 -0.7545081710390299 2.52171909979003
std 158.53524231679074 43.66050329988979 1.3831195437535115 6.72280956322486
min 0.0 0.0 -7.526780108501018 -1.3377292623812096
25% 474.33696969696973 11.5126181533734 -1.1790982769904146 -0.4005545816076801
50% 489.03428571428566 13.49696931937243 -0.2372819584684056 -0.017202890096714274
75% 532.3371929824561 51.40084557371704 0.12755009341999793 1.421205718986767
max 699.295652173913 307.8822231525122 1.2280152015331378 66.9243304128838
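For reference, here is a minimal sketch of one alternative way to attach each label to the nearest aggregated timestamp, assuming both objects are indexed by the Time values shown above and that a 30-second tolerance is acceptable (both are assumptions, not from the original post):
import pandas as pd

# df_Label is shown as a Series above, so convert it to a one-column frame first
combined = pd.merge_asof(
    MeasureAgg.sort_index(),
    df_Label.to_frame().sort_index(),
    left_index=True,
    right_index=True,
    direction='nearest',
    tolerance=pd.Timedelta('30s'),  # labels farther away than this stay NaN
)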
Here is what I'm trying to do in Pandas:
load CSV file containing information about stocks for certain days
find the earliest and latest dates in the column date
create a new dataframe where all the days between the earliest and latest are filled (NaN or something like "missing" for all columns would be fine)
Currently it looks like this:
import pandas as pd
import dateutil
df = pd.read_csv("https://dl.dropboxusercontent.com/u/84641/temp/berkshire_new.csv")
df['date'] = df['date'].apply(dateutil.parser.parse)
new_date_range = pd.date_range(df['date'].min(), df['date'].max())
df = df.set_index('date')
df.reindex(new_date_range)
Unfortunately this throws the following error which I don't quite understand:
ValueError: Shape of passed values is (3, 4825), indices imply (3, 4384)
I've tried a dozen variations of this - without any luck. Any help would be much appreciated.
Edit:
After investigating this further, it looks like the problem is caused by duplicate indexes. The CSV does contain several entries for each date, which is probably causing the errors.
The question is still relevant though: How can I fill the gaps in between, although there are duplicate entries for each date?
So you have duplicates when considering symbol,date,action.
In [99]: df.head(10)
Out[99]:
symbol date change action
0 FDC 2001-08-15 00:00:00 15.069360 new
1 GPS 2001-08-15 00:00:00 19.653780 new
2 HON 2001-08-15 00:00:00 8.604316 new
3 LIZ 2001-08-15 00:00:00 6.711568 new
4 NKE 2001-08-15 00:00:00 22.686257 new
5 ODP 2001-08-15 00:00:00 5.686902 new
6 OSI 2001-08-15 00:00:00 5.893340 new
7 USB 2001-08-15 00:00:00 15.694478 new
8 NEE 2001-11-15 00:00:00 100.000000 new
9 GPS 2001-11-15 00:00:00 142.522231 increase
Create the new date index
In [102]: idx = pd.date_range(df.date.min(),df.date.max())
In [103]: idx
Out[103]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2001-08-15 00:00:00, ..., 2013-08-15 00:00:00]
Length: 4384, Freq: D, Timezone: None
This will group by symbol and action, then reindex each group to the full date range (idx), and select out the only remaining column (change), since the index is now symbol/date.
In [100]: df.groupby(['symbol','action']).apply(
lambda x: x.set_index('date').reindex(idx)
)['change'].reset_index(level=1).head()
Out[100]:
action change
symbol
ADM 2001-08-15 decrease NaN
2001-08-16 decrease NaN
2001-08-17 decrease NaN
2001-08-18 decrease NaN
2001-08-19 decrease NaN
In [101]: df.groupby(['symbol','action']).apply(lambda x: x.set_index('date').reindex(idx))['change'].reset_index(level=1)
Out[101]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 977632 entries, (ADM, 2001-08-15 00:00:00) to (svm, 2013-08-15 00:00:00)
Data columns (total 2 columns):
action 977632 non-null values
change 490 non-null values
dtypes: float64(1), object(1)
You can then fill forward or whatever you need. FYI, not sure what you are going to do with this, but this is not a very common type of operation as you have mostly empty data.
I'm having a similar problem at the moment. I think you shouldn't use reindex but something like asfreq or resample; with them you don't need to create an index, they will create it for you.
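For example, a minimal sketch of the resample idea using the column names from the question; using .first() to collapse duplicate dates within a symbol/action group is an arbitrary choice:
filled = (
    df.set_index('date')
      .groupby(['symbol', 'action'])['change']
      .resample('D')
      .first()   # one value per day; days with no data become NaN
)
# note: each group is filled from its own first to last date, not the global range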