I have the following code:
import datetime
import fxcmpy
import pandas as pd

end = datetime.datetime.today()
today = datetime.date.today()
# con, ticker, start and full_dates are defined earlier in my script
data = con.get_candles(ticker, period='D1', start=start, end=end)
data.index = pd.to_datetime(data.index, format='%Y-%B-%d')
data = data.set_index(data.index.normalize())
data = data.reindex(full_dates)
When I print data, I get this:
bidopen bidclose bidhigh bidlow askopen askclose askhigh asklow tickqty
2008-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008-01-03 13261.82 13043.96 13279.54 12991.37 13261.82 13043.96 13279.54 12991.37 0.0
2008-01-04 13044.12 13056.72 13137.93 13023.56 13044.12 13056.72 13137.93 13023.56 0.0
2008-01-05 13046.56 12800.18 13046.72 12789.04 13046.56 12800.18 13046.72 12789.04 0.0
... ... ... ... ... ... ... ... ... ...
2019-12-19 28272.45 28401.75 28414.05 28245.65 28277.00 28405.45 28418.65 28248.35 378239.0
2019-12-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-21 28401.60 28472.20 28518.80 28369.90 28405.30 28474.30 28520.30 28371.30 513987.0
2019-12-22 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4375 rows × 9 columns
My question: since the format I used for the date was format='%Y-%B-%d', why is it not displayed in that format?
The format in data.index = pd.to_datetime(data.index, format='%Y-%B-%d') is used to parse the index into datetimes, not to display it. To display the dates in that format you need something like data.index = data.index.strftime('%Y-%B-%d'), which converts the index back to strings.
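A small sketch of the difference between the parse format and the display format, using hypothetical date strings:

```python
import pandas as pd

# format= tells to_datetime how to *parse* the strings into datetimes
idx = pd.to_datetime(['2008-January-03', '2008-January-04'], format='%Y-%B-%d')
print(idx[0])  # 2008-01-03 00:00:00 -- stored as a datetime, shown in the default style

# strftime converts back to strings in the desired *display* format
labels = idx.strftime('%Y-%B-%d')
print(labels[0])  # 2008-January-03
```

Note that a DatetimeIndex has strftime directly; the .dt accessor is only for Series.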
I am trying to reindex the dates in pandas, because there are dates which are missing, such as weekends or national holidays.
To do this I am using the following code:
import pandas as pd
import yfinance as yf
import datetime
start = datetime.date(2015,1,1)
end = datetime.date.today()
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index = df.index.strftime('%Y-%m-%d')
full_dates = pd.date_range(start, end)
df.reindex(full_dates)
This code is producing this dataframe:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 NaN NaN NaN NaN NaN NaN
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2023-01-13 NaN NaN NaN NaN NaN NaN
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
Could you please advise why it is not reindexing the data and is showing NaN values instead?
=== Edit ===
Could it be a Python version issue? I ran the same code in Python 3.7 and in Python 3.10. In Python 3.10 the index is a datetime: I get a datetime index after yf.download('F', start, end, interval ='1d', progress = False) without applying strftime.
Remove the conversion of the DatetimeIndex to strings (df.index = df.index.strftime('%Y-%m-%d')) so that you can reindex by datetimes:
df = yf.download('F', start, end, interval ='1d', progress = False)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407450 44079700.0
... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]
print (df.index)
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
'2015-01-09', '2015-01-10',
...
'2023-01-08', '2023-01-09', '2023-01-10', '2023-01-11',
'2023-01-12', '2023-01-13', '2023-01-14', '2023-01-15',
'2023-01-16', '2023-01-17'],
dtype='datetime64[ns]', length=2939, freq='D')
EDIT: There is a timezone difference; to remove it, use DatetimeIndex.tz_convert:
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index= df.index.tz_convert(None)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
You need to use strings in reindex to keep a homogeneous type; otherwise pandas doesn't match the string (e.g., '2015-01-02') with the Timestamp (e.g., pd.Timestamp('2015-01-02')):
df.reindex(full_dates.astype(str))
#or
df.reindex(full_dates.strftime('%Y-%m-%d'))
Output:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407451 44079700.0
... ... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]
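To see the type mismatch in isolation, here is a minimal sketch with a toy frame: reindexing a string-labelled index with Timestamps matches nothing, while matching the types aligns correctly.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2]}, index=['2015-01-01', '2015-01-02'])  # string labels
dates = pd.date_range('2015-01-01', '2015-01-03')                     # DatetimeIndex

mismatched = df.reindex(dates)                    # all NaN: strings never equal Timestamps
matched = df.reindex(dates.strftime('%Y-%m-%d'))  # same types, so labels align
```

Here `mismatched` is entirely NaN, while `matched` keeps the two existing rows and adds NaN only for the genuinely missing day.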
All,
I have a dataframe (df_live) with the following structure:
live live.updated live.latitude live.longitude live.altitude live.direction live.speed_horizontal live.speed_vertical live.is_ground
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
95 NaN NaN NaN NaN NaN NaN NaN NaN NaN
96 NaN 2022-10-11T17:46:19+00:00 -45.35 169.88 5791.2 44.0 518.560 0.0 False
97 NaN 2022-10-11T17:45:54+00:00 -27.55 143.20 11277.6 139.0 853.772 0.0 False
98 NaN NaN NaN NaN NaN NaN NaN NaN NaN
99 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I would like to iterate through this dataframe such that I only obtain rows for which numerical values are available (e.g. rows 96 and 97).
The code I am using is as follows:
import boto3
import json
from datetime import datetime
import calendar
import random
import time
import requests
import pandas as pd
aircraftdata = ''
params = {
'access_key': 'KEY',
'limit': '100',
'flight_status':'active'
}
url = "http://api.aviationstack.com/v1/flights"
api_result = requests.get('http://api.aviationstack.com/v1/flights', params)
api_statuscode = api_result.status_code
api_response = api_result.json()
df = pd.json_normalize(api_response["data"])
df_live = df[df.loc[:, df.columns.str.contains("live", case=False)].columns]
df_dep = df[df.loc[:, df.columns.str.contains("dep", case=False)].columns]
print(df_live)
for index, row in df_live.iterrows():
    if df_live["live_updated"] != "NaN":
        print(row)
    else:
        print("Not live")
This yields the following error
KeyError: 'live_updated'
Instead of iterating with the for loop, how about removing the rows that are all NaN in one go?
df_live = df_live[df_live.notnull().any(axis=1)]
print(df_live)
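Equivalently, DataFrame.dropna(how='all') drops the rows in which every value is NaN; a small sketch with toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [np.nan, 1.0, np.nan],
                    'b': [np.nan, np.nan, 2.0]})

kept = toy.dropna(how='all')  # drops row 0 only; rows with at least one value survive
print(kept)
```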
Be careful with the column names. The key error
KeyError: 'live_updated'
means that there is no column in the dataframe with the name 'live_updated'.
If you check your dataframe columns, the actual name you probably want is 'live.updated', so just change the column name you refer to in the code:
for index, row in df_live.iterrows():
    if pd.notna(row["live.updated"]):
        print(row)
    else:
        print("Not live")
Another solution could be to rename the dataframe columns before you refer to them:
df_live = df_live.rename(columns={'live.updated': 'live_updated'})
I have several csv files in a directory of folders and subfolders. All the csv files have headers and a time stamp as the 1st column, whether time-series data is present or not. I want to read all the csv files and return a status of empty if no data is present.
When I used the df.empty function to check, it returns False even when there is no data (the file has only a header row and the 1st column with time stamps).
The code I used is:
import pandas as pd
df1 = pd.read_csv("D://sirifort_with_data.csv", index_col=0)
df2 = pd.read_csv("D://sirifort_without_data.csv", index_col=0)
print(df1.empty)
print(df2.empty)
print(df2)
The result is:
False
False
PM2.5(ug/m3) PM10(ug/m3) ... NOx(ppb) NH3(ug/m3)
Time_Stamp ...
26/02/2022 0:00 NaN NaN ... NaN NaN
26/02/2022 0:15 NaN NaN ... NaN NaN
26/02/2022 0:30 NaN NaN ... NaN NaN
26/02/2022 0:45 NaN NaN ... NaN NaN
26/02/2022 1:00 NaN NaN ... NaN NaN
26/02/2022 1:15 NaN NaN ... NaN NaN
26/02/2022 1:30 NaN NaN ... NaN NaN
26/02/2022 1:45 NaN NaN ... NaN NaN
26/02/2022 2:00 NaN NaN ... NaN NaN
26/02/2022 2:15 NaN NaN ... NaN NaN
26/02/2022 2:30 NaN NaN ... NaN NaN
26/02/2022 2:45 NaN NaN ... NaN NaN
[12 rows x 6 columns]
Use the sum of one of the columns; for an all-NaN column it is zero.
def col_check(col):
    if df[col].sum() != 0:
        return 1

for col in df.columns:
    if col_check(col):
        print('not empty')
        break
The documentation clearly indicates:
If we only have NaNs in our DataFrame, it is not considered empty! We
will need to drop the NaNs to make the DataFrame empty:
df = pd.DataFrame({'A' : [np.nan]})
df.empty
False
and then suggests:
df.dropna().empty
True
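Applied to the question's case, a hedged sketch (assuming the file yields an all-NaN frame like df2 above):

```python
import numpy as np
import pandas as pd

# A stand-in for df2: header and timestamp index present, every value NaN
df2 = pd.DataFrame({'PM2.5(ug/m3)': [np.nan, np.nan]},
                   index=['26/02/2022 0:00', '26/02/2022 0:15'])

print(df2.empty)                    # False: all-NaN rows still count as rows
print(df2.dropna(how='all').empty)  # True: no rows remain once all-NaN rows are dropped
```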
I forward fill values in the following df using:
df = (df.resample('d') # ensure data is daily time series
.ffill()
.sort_index(ascending=True))
df before forward fill
id a b c d
datadate
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1 NaN 3 4
1980-05-31 NaN NaN NaN NaN
... ... ... ...
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20 33
However, I wish to forward fill only one year after the last observation (the index is datetime) and leave the remaining rows as NaN. I am not sure of the best way to introduce this criterion. Any help would be super!
Thanks
If I understand you correctly, you want to forward-fill the values on Dec 31, 2019 to the next year. Try this:
end_date = df.index.max()
new_end_date = end_date + pd.offsets.DateOffset(years=1)
new_index = df.index.append(pd.date_range(end_date, new_end_date, inclusive='right'))
df = df.reindex(new_index)
df.loc[end_date:, :] = df.loc[end_date:, :].ffill()
Result:
a b c d
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2.0 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1.0 NaN 3.0 4.0
1980-05-31 NaN NaN NaN NaN
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20.0 33.0
2020-01-01 NaN NaN 20.0 33.0
2020-01-02 NaN NaN 20.0 33.0
...
2020-12-31 NaN NaN 20.0 33.0
One solution is to forward fill using a limit parameter, but this won't handle leap years:
df.ffill(limit=365)
The second solution is to define a more robust function to do the forward fill in the 1-year window:
from pandas.tseries.offsets import DateOffset

def fun(serie_df):
    serie = serie_df.copy()
    indexes = serie[~serie.isnull()].index
    for idx in indexes:
        mask = (serie.index >= idx) & (serie.index < idx + DateOffset(years=1))
        serie.loc[mask] = serie[mask].ffill()
    return serie

df_filled = df.apply(fun, axis=0)
If a column has multiple non-NaN values in the same 1-year window, the fill from the first value stops once the more recent value is encountered. The second solution treats the consecutive values as if they were independent.
I have a datetime issue where I am trying to match up a dataframe
with dates as index values.
For example, I have dr, which is a list of numpy.datetime64 values.
dr = [numpy.datetime64('2014-10-31T00:00:00.000000000'),
numpy.datetime64('2014-11-30T00:00:00.000000000'),
numpy.datetime64('2014-12-31T00:00:00.000000000'),
numpy.datetime64('2015-01-31T00:00:00.000000000'),
numpy.datetime64('2015-02-28T00:00:00.000000000'),
numpy.datetime64('2015-03-31T00:00:00.000000000')]
Then I have dataframe with returndf with dates as index values
print(returndf)
1 2 3 4 5 6 7 8 9 10
10/31/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11/30/2014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Please ignore the missing values
Whenever I try to match a date in dr with the dataframe returndf, using the following code for just one month, returndf.loc[str(dr[1])],
I get an error
KeyError: 'the label [2014-11-30T00:00:00.000000000] is not in the [index]'
I would appreciate it if someone could help me convert numpy.datetime64('2014-10-31T00:00:00.000000000') into 10/31/2014 so that I can match it to the dataframe index values.
Thank you,
Your index for returndf is not a DatetimeIndex. Make it so:
returndf = returndf.set_index(pd.to_datetime(returndf.index))
Your dr is a list of Numpy datetime64 objects. That bothers me:
dr = pd.to_datetime(dr)
Your sample data clearly shows that the index of returndf does not include all the items in dr. In that case, use reindex:
returndf.reindex(dr)
1 2 3 4 5 6 7 8 9 10
2014-10-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-11-30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-12-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-02-28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-03-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
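A quick check with toy data that the .loc lookup works once both sides are datetimes (the column names here are invented):

```python
import numpy as np
import pandas as pd

# toy frame mimicking returndf: string date labels in the index
returndf = pd.DataFrame({'1': [np.nan], '2': [np.nan]}, index=['10/31/2014'])
returndf = returndf.set_index(pd.to_datetime(returndf.index))

key = np.datetime64('2014-10-31T00:00:00.000000000')
row = returndf.loc[pd.Timestamp(key)]  # lookup succeeds once the types agree
```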