I'm trying to loop over multiple JSON responses and, for each value in the list, add it to a DataFrame. For each JSON response I create a column header. I always seem to get data only for the last column, so there is clearly something wrong with the way I append the data.
from pycoingecko import CoinGeckoAPI
import pandas as pd

cg = CoinGeckoAPI()
df = pd.DataFrame()
timePeriod = 120

for x in range(10):
    try:
        data = cg.get_coin_market_chart_by_id(id=geckoList[x],
                                              vs_currency='btc', days='timePeriod')
        for y in range(timePeriod):
            df = df.append({geckoList[x]: data['prices'][y][1]},
                           ignore_index=True)
        print(geckoList[x])
    except:
        pass
geckoList example:
['bitcoin',
'ethereum',
'xrp',
'bitcoin-cash',
'litecoin',
'binance-coin']
Example JSON for one of the coins:
'prices': [[1565176840078, 0.029035263522626625],
[1565177102060, 0.029079747150763842],
[1565177434439, 0.029128983083947863],
[1565177700686, 0.029136960678700433],
[1565178005716, 0.0290826667213779],
[1565178303855, 0.029173025688296675],
[1565178602640, 0.029204331218623796],
[1565178911561, 0.029211943928343167],
The expected result would be a DataFrame with columns and rows of data for each crypto coin; right now only the last column shows data.
Currently, it looks like this:
bitcoin ethereum bitcoin-cash
0 NaN NaN 0.33
1 NaN NaN 0.32
2 NaN NaN 0.21
3 NaN NaN 0.22
4 NaN NaN 0.25
5 NaN NaN 0.26
6 NaN NaN 0.22
7 NaN NaN 0.22
OK, I think I found the issue.
The problem is that you append rows which contain only a single column to the frame, so all the other columns get filled with NaN. What I think you want is to join the columns by their timestamp, which is what I did in my example below. Let me know if this is what you need:
from pycoingecko import CoinGeckoAPI
import pandas as pd

cg = CoinGeckoAPI()
timePeriod = 120
gecko_list = ['bitcoin',
              'ethereum',
              'xrp',
              'bitcoin-cash',
              'litecoin',
              'binance-coin']

data = {}
for coin in gecko_list:
    try:
        nested_lists = cg.get_coin_market_chart_by_id(
            id=coin, vs_currency='btc', days=timePeriod)['prices']
        data[coin] = {}
        # unzip the [timestamp, price] pairs into two tuples
        data[coin]['timestamps'], data[coin]['values'] = zip(*nested_lists)
    except Exception as e:
        print(e)
        print('coin: ' + coin)

# one single-column frame per coin, indexed by timestamp
frame_list = [pd.DataFrame(
                  data[coin]['values'],
                  index=data[coin]['timestamps'],
                  columns=[coin])
              for coin in gecko_list
              if coin in data]

# join the columns on the shared timestamp index
df = pd.concat(frame_list, axis=1).sort_index()
df.index = pd.to_datetime(df.index, unit='ms')
print(df)
This gives me the following output:
bitcoin ethereum bitcoin-cash litecoin
2019-08-07 12:20:14.490 NaN NaN 0.029068 NaN
2019-08-07 12:20:17.420 NaN NaN NaN 0.007890
2019-08-07 12:20:21.532 1.0 NaN NaN NaN
2019-08-07 12:20:27.730 NaN 0.019424 NaN NaN
2019-08-07 12:24:45.309 NaN NaN 0.029021 NaN
... ... ... ... ...
2019-08-08 12:15:47.548 NaN NaN NaN 0.007578
2019-08-08 12:18:41.000 NaN 0.018965 NaN NaN
2019-08-08 12:18:44.000 1.0 NaN NaN NaN
2019-08-08 12:18:54.000 NaN NaN NaN 0.007577
2019-08-08 12:18:59.000 NaN NaN 0.028144 NaN
[1153 rows x 4 columns]
This is the data I get if I switch days to 180.
To get daily data, use groupby with a daily Grouper:
df = df.groupby(pd.Grouper(freq='D')).mean()
On a few days of data, this gives me:
bitcoin ethereum bitcoin-cash litecoin
2019-08-03 1.0 0.020525 0.031274 0.008765
2019-08-04 1.0 0.020395 0.031029 0.008583
2019-08-05 1.0 0.019792 0.029805 0.008360
2019-08-06 1.0 0.019511 0.029196 0.008082
2019-08-07 1.0 0.019319 0.028837 0.007854
2019-08-08 1.0 0.018949 0.028227 0.007593
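For what it's worth, the same daily aggregation can also be written with resample; a minimal sketch, assuming df has the DatetimeIndex built above:

# Equivalent to the daily Grouper: resample the DatetimeIndex to daily means
df_daily = df.resample('D').mean()
print(df_daily)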
All,
I have a dataframe (df_live) with the following structure:
live live.updated live.latitude live.longitude live.altitude live.direction live.speed_horizontal live.speed_vertical live.is_ground
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
95 NaN NaN NaN NaN NaN NaN NaN NaN NaN
96 NaN 2022-10-11T17:46:19+00:00 -45.35 169.88 5791.2 44.0 518.560 0.0 False
97 NaN 2022-10-11T17:45:54+00:00 -27.55 143.20 11277.6 139.0 853.772 0.0 False
98 NaN NaN NaN NaN NaN NaN NaN NaN NaN
99 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I would like to iterate through this dataframe such that I only obtain rows for which numerical values are available (e.g. rows 96 and 97).
The code I am using is as follows:
import boto3
import json
from datetime import datetime
import calendar
import random
import time
import requests
import pandas as pd

aircraftdata = ''
params = {
    'access_key': 'KEY',
    'limit': '100',
    'flight_status': 'active'
}

url = "http://api.aviationstack.com/v1/flights"
api_result = requests.get('http://api.aviationstack.com/v1/flights', params)
api_statuscode = api_result.status_code
api_response = api_result.json()

df = pd.json_normalize(api_response["data"])
df_live = df[df.loc[:, df.columns.str.contains("live", case=False)].columns]
df_dep = df[df.loc[:, df.columns.str.contains("dep", case=False)].columns]
print(df_live)

for index, row in df_live.iterrows():
    if df_live["live_updated"] != "NaN":
        print(row)
    else:
        print("Not live")
This yields the following error
KeyError: 'live_updated'
Instead of iterating with the for loop, how about removing the rows that are all NaN in one go?
df_live = df_live[df_live.notnull().any(axis=1)]
print(df_live)
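Equivalently, dropna can remove the rows where every column is NaN; a minimal sketch of the same idea:

# keep only rows that have at least one non-NaN value
df_live = df_live.dropna(how='all')
print(df_live)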
Be careful with the column names. The key error
KeyError: 'live_updated'
means that there is no column in the dataframe named 'live_updated'.
If you check the dataframe's columns, the name you probably want to refer to is 'live.updated', so just change the column name you refer to in the code:
for index, row in df_live.iterrows():
    if pd.notna(row["live.updated"]):
        print(row)
    else:
        print("Not live")
Another solution could be to rename the dataframe columns before you refer to them:
df_live = df_live.rename(columns={'live.updated': 'live_updated'})
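If you prefer to rename all of the dotted columns in one go, something like the following sketch should work (the 'live_' prefix is just an assumption about the naming you want):

# hypothetical bulk rename: 'live.updated', 'live.latitude', ... -> 'live_updated', 'live_latitude', ...
df_live.columns = df_live.columns.str.replace('live.', 'live_', regex=False)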
I forward fill values in the following df using:
df = (df.resample('d')   # ensure data is a daily time series
        .ffill()
        .sort_index(ascending=True))
df before forward fill
id a b c d
datadate
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1 NaN 3 4
1980-05-31 NaN NaN NaN NaN
... ... ... ...
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20 33
However, I wish to forward fill only one year after the last observation (the index is datetime) and leave the remaining rows as NaN. I am not sure of the best way to introduce this criterion. Any help would be super!
Thanks
If I understand you correctly, you want to forward-fill the values on Dec 31, 2019 to the next year. Try this:
end_date = df.index.max()
new_end_date = end_date + pd.offsets.DateOffset(years=1)
new_index = df.index.append(pd.date_range(end_date, new_end_date, closed='right'))
df = df.reindex(new_index)
df.loc[end_date:, :] = df.loc[end_date:, :].ffill()
Result:
a b c d
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2.0 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1.0 NaN 3.0 4.0
1980-05-31 NaN NaN NaN NaN
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20.0 33.0
2020-01-01 NaN NaN 20.0 33.0
2020-01-02 NaN NaN 20.0 33.0
...
2020-12-31 NaN NaN 20.0 33.0
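As a side note, newer pandas versions replaced the closed argument of date_range with inclusive; if closed raises an error for you, the equivalent call would be:

new_index = df.index.append(pd.date_range(end_date, new_end_date, inclusive='right'))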
One solution is to forward fill using a limit parameter, but this won't handle leap years:
df.fillna(method='ffill', limit=365)
The second solution is to define a more robust function to do the forward fill in the 1-year window:
from pandas.tseries.offsets import DateOffset

def fun(serie_df):
    serie = serie_df.copy()
    indexes = serie[~serie.isnull()].index
    for idx in indexes:
        # fill forward only inside the 1-year window that starts at this observation
        mask = (serie.index >= idx) & (serie.index < idx + DateOffset(years=1))
        serie.loc[mask] = serie[mask].fillna(method='ffill')
    return serie

df_filled = df.apply(fun, axis=0)
If a column has multiple non-NaN values in the same 1-year window, the first solution's fill will stop as soon as the next observed value is encountered, whereas the second solution treats consecutive values as if they were independent windows.
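As a quick sanity check, here is a small sketch that runs fun on a hypothetical daily series with a single observation; the value propagates for at most one year:

import numpy as np
import pandas as pd

# hypothetical toy series: daily index, one observation on 1980-04-30
idx = pd.date_range("1980-01-01", periods=800, freq="D")
s = pd.Series(np.nan, index=idx)
s.loc["1980-04-30"] = 1.0

filled = fun(s)
print(filled.loc["1981-04-29"])  # 1.0 -> still inside the 1-year window
print(filled.loc["1981-04-30"])  # nan -> outside the window, left untouched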
I am an early beginner.
I have the following dataframe (df1) with transaction dates as the index and columns for account #, transaction quantity, and ticker.
Account Quantity Symbol/CUSIP
Trade Date
2020-03-31 1 NaN 990156937
2020-03-31 2 0.020 IIAXX
2020-03-24 1 NaN 990156937
2020-03-20 1 650.000 DOC
2020-03-23 1 NaN 990156937
... ... ... ...
2017-11-24 2 55.000 QQQ
2018-01-01 1 10.000 AMZN
2018-01-01 1 250.000 HOS
2017-09-13 1 229.051 VFINX
2017-09-21 1 1.118 VFINX
[266 rows x 3 columns]
I would like to populate a 2nd dataframe (df2) which shows the total quantity on every day between the min & max of the index of df1, grouped by account and by ticker. Below is an empty dataframe of what I am looking to do:
df2 = Total Quantity by ticker and account #, on every single day between min and max of df1
990156937 IIAXX DOC AER NaN ATVI H VCSH GOOGL VOO VG \
2020-03-31 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-03-30 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-03-29 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Thus, for each day between the min and max of the transaction dates in df1, I need to calculate the cumulative sum of all transactions on that date or earlier, grouped by account and ticker.
How could I accomplish this? Thanks in advance.
I suggest the following:
import pandas as pd
import numpy as np

# first I reproduce a similar dataframe
df = pd.DataFrame({"date": pd.date_range("2017-1-1", periods=3).repeat(6),
                   "account": [1, 1, 3, 1, 2, 3, 2, 2, 1, 1, 2, 3, 1, 2, 3, 2, 2, 1],
                   "quantity": [123, 0.020, np.nan, 650, 345, np.nan, 345, 456, 121,
                                243, 445, 453, 987, np.nan, 76, 143, 87, 19],
                   "symbol": ['990156937', '990156937', '990156937', 'DOC', 'AER', 'ATVI',
                              'AER', 'ATVI', 'IIAXX',
                              '990156937', '990156937', '990156937', 'DOC', 'AER', 'ATVI',
                              'AER', 'ATVI', 'IIAXX']})
This is what it looks like:
date account quantity symbol
0 2017-01-01 1 123.00 990156937
1 2017-01-01 1 0.02 990156937
2 2017-01-01 3 NaN 990156937
3 2017-01-01 1 650.00 DOC
4 2017-01-01 2 345.00 AER
You want to go to a wide format using unstack:
# You groupby date, account and symbol and sum the quantities
df = df.groupby(["date", "account", "symbol"]).agg({"quantity":"sum"})
df_wide = df.unstack()
# Finally groupby account to get the cumulative sum per account across dates
# Fill na with 0 to get cumulative sum right
df_wide = df_wide.fillna(0)
df_wide = df_wide.groupby(df_wide.index.get_level_values("account")).cumsum()
You get the result:
quantity
990156937 AER ATVI DOC IIAXX
date account
2017-01-01 1 123.02 0.0 0.0 650.0 0.0
2 0.00 345.0 0.0 0.0 0.0
3 0.00 0.0 0.0 0.0 0.0
2017-01-02 1 366.02 0.0 0.0 650.0 121.0
2 445.00 690.0 456.0 0.0 0.0
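If you also need a row for every calendar day between the first and last transaction (as in the question), one way I can think of is to move account into the columns, reindex to a full daily range, and carry the totals forward; a sketch under that assumption:

# expand to every calendar day and carry the running totals forward per account
dates = df_wide.index.get_level_values("date")
full_days = pd.date_range(dates.min(), dates.max(), freq="D")

df_daily = (df_wide.unstack("account")   # account moves into the columns
                   .reindex(full_days)   # one row per calendar day
                   .ffill()              # carry totals forward to non-trading days
                   .fillna(0)            # accounts with no transactions yet hold 0
                   .stack("account"))    # restore the (date, account) index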
I have the following code:
import fxcmpy
import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
end = datetime.datetime.today()
today = date.today()
data = con.get_candles(ticker, period='D1', start = start, end = end)
data.index = pd.to_datetime(data.index, format ='%Y-%B-%d')
data = data.set_index(data.index.normalize())
data = data.reindex(full_dates)
When I print data, I get this:
bidopen bidclose bidhigh bidlow askopen askclose askhigh asklow tickqty
2008-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008-01-03 13261.82 13043.96 13279.54 12991.37 13261.82 13043.96 13279.54 12991.37 0.0
2008-01-04 13044.12 13056.72 13137.93 13023.56 13044.12 13056.72 13137.93 13023.56 0.0
2008-01-05 13046.56 12800.18 13046.72 12789.04 13046.56 12800.18 13046.72 12789.04 0.0
... ... ... ... ... ... ... ... ... ...
2019-12-19 28272.45 28401.75 28414.05 28245.65 28277.00 28405.45 28418.65 28248.35 378239.0
2019-12-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-21 28401.60 28472.20 28518.80 28369.90 28405.30 28474.30 28520.30 28371.30 513987.0
2019-12-22 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-12-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4375 rows × 9 columns
My question is: since the format I used for the date was format='%Y-%B-%d', why is it not being displayed in that format?
The format you were using in data.index = pd.to_datetime(data.index, format='%Y-%B-%d') only tells pandas how to parse the index strings into datetimes; it does not control how they are displayed. To display the index in that format you need something like data.index = data.index.strftime('%Y-%B-%d').
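To illustrate the difference between parsing and display, a minimal sketch (the sample date string is made up):

import pandas as pd

# parsing: the format tells to_datetime how to read the string into a Timestamp
idx = pd.to_datetime(["2019-December-23"], format="%Y-%B-%d")
print(idx)                        # DatetimeIndex(['2019-12-23'], ...) - default display

# display: strftime turns the timestamps back into strings in the desired format
print(idx.strftime("%Y-%B-%d"))   # Index(['2019-December-23'], dtype='object')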
Previously I have been using a pandas.Panel to store multiple dataframes, one per date in a list of dates.
Since the deprecation of panels, I am trying to convert to using a multiindex dataframe.
As an example, I have the following data:
dates = pandas.date_range('20180101', periods=3)
stocks = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'FB']
Before the deprecation, I could create a panel as follows:
pnl = pandas.Panel(items=dates, major_axis=stocks, minor_axis=stocks, dtype=float)
I now have 1 dataframe per date, for example, selecting the first:
pnl['2018-01-01']
returns a 5x5 dataframe with the stocks as both the index and the columns (all values NaN at this point).
Now, however, as per the advice in the deprecation warning, I am creating a multiindex dataframe:
tuples = list(itertools.product(dates, stocks))
index = pandas.MultiIndex.from_tuples(tuples, names=['date', 'stock'])
df = pandas.DataFrame(index=index, columns=stocks, dtype=float)
The resulting dataframe has a (date, stock) MultiIndex with one column per stock, all values NaN.
So far so good...
Populating the dataframe:
I have a pandas.Series of data for a given stock pair, with one entry per date.
For example:
data = pandas.Series([1.3, 7.4, 8.2], index=dates)
The series looks like this:
2018-01-01 1.3
2018-01-02 7.4
2018-01-03 8.2
Freq: D, dtype: float64
Say, for example, this data is for stock pair ['GOOG','MSFT'].
I would like to set all ['GOOG','MSFT'] entries.
With my panel, I could very easily do this using the following terse syntax:
pnl.loc[:,'GOOG','MSFT'] = data
What is the easiest way to select all ['GOOG','MSFT'] elements from my multiindex dataframe, and set them to my pandas.Series object (ie: date for date)?
Using pd.DataFrame.loc & pd.IndexSlice:
df.loc[pd.IndexSlice[data.index, 'GOOG'], 'MSFT'] = data.values
If you have many pairs of data, put them in a dictionary like this:
pairs = {('GOOG', 'MSFT'): data}
Then iterate through the pairs, setting the value using loc & pd.IndexSlice.
for k, v in pairs.items():
    df.loc[pd.IndexSlice[v.index, k[0]], k[1]] = v.values
As an alternative to IndexSlice, you can build a boolean index on the multiindex using the index method get_level_values:
df.loc[(df.index.get_level_values(1) == 'GOOG') &
       (df.index.get_level_values(0).isin(data.index)),
       'MSFT'] = data.values
All of the above would produce the following output:
AAPL GOOG MSFT AMZN FB
date stock
2018-01-01 AAPL NaN NaN NaN NaN NaN
GOOG NaN NaN 1.3 NaN NaN
MSFT NaN NaN NaN NaN NaN
AMZN NaN NaN NaN NaN NaN
FB NaN NaN NaN NaN NaN
2018-01-02 AAPL NaN NaN NaN NaN NaN
GOOG NaN NaN 7.4 NaN NaN
MSFT NaN NaN NaN NaN NaN
AMZN NaN NaN NaN NaN NaN
FB NaN NaN NaN NaN NaN
2018-01-03 AAPL NaN NaN NaN NaN NaN
GOOG NaN NaN 8.2 NaN NaN
MSFT NaN NaN NaN NaN NaN
AMZN NaN NaN NaN NaN NaN
FB NaN NaN NaN NaN NaN
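For what it's worth, reading a pair back out of the MultiIndex frame is just a cross-section; a minimal sketch:

# pull the ('GOOG', 'MSFT') series back out: rows where stock == 'GOOG', column 'MSFT'
goog_msft = df.xs('GOOG', level='stock')['MSFT']
print(goog_msft)
# 2018-01-01    1.3
# 2018-01-02    7.4
# 2018-01-03    8.2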