I have HDF5 data that I moved into a DataFrame, but the problem is that when I try to plot it, nothing shows on the graph. I checked the new DataFrame and saw that it was empty.
This is my DF (I'm not allowed to post pictures, so please click the link).
df1 = pd.DataFrame(df.Price, index = df.Timestamp)
plt.figure()
df1.plot()
plt.show()
The second DF shows NaN in the Price column. What's wrong?
I think you need set_index with the column Timestamp, then select the column Price and plot:
#convert column to floats
df['Price'] = df['Price'].astype(float)
df.set_index('Timestamp')['Price'].plot()
# if there is some non-numeric data, convert it to NaN
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df.set_index('Timestamp')['Price'].plot()
And you get NaNs if you use the DataFrame constructor, because the data are not aligned: the values of the index of df are not the same as the values in the Timestamp column.
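A minimal sketch of that alignment issue, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Timestamp': pd.to_datetime(['2020-01-01', '2020-01-02']),
                   'Price': [1.5, 2.5]})

# The constructor aligns on index labels: df.Price carries index 0, 1,
# which never matches the Timestamp values used as the new index -> all NaN
bad = pd.DataFrame(df.Price, index=df.Timestamp)
print(bad['Price'].isna().all())    # True

# .values strips the index, so values are taken by position and survive
good = pd.DataFrame(df.Price.values, index=df.Timestamp, columns=['Price'])
print(good['Price'].isna().all())   # False
```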
You can fix this by adding .values. Or how about creating a Series instead?
#df1 = pd.DataFrame(df.Price.values, df.Timestamp)
serie = pd.Series(df.Price.values, df.Timestamp)
I saw it was answered here: pandas.Series() Creation using DataFrame Columns returns NaN Data entries
Full example:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=["Price","Timestamp","Random"])
df.Price = np.random.randint(100, size = 10)
df.Timestamp = [datetime.datetime(2000,1,1) + \
datetime.timedelta(days=int(i)) for i in np.random.randint(100, size = 10)]
df.Random = np.random.randint(10, size= 10)
serie = pd.Series(df.Price.values, df.Timestamp)
serie.plot()
plt.show()
Difference
print("{}\n{}".format(type(df.Price), type(df.Price.values)))
<class 'pandas.core.series.Series'> # does not work
<class 'numpy.ndarray'> # works
Related
The code below returns a blank plot in Python:
# import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
os.chdir('file path')
# import data files
activity = pd.read_csv('file path\dailyActivity_merged.csv')
intensity = pd.read_csv('file path\hourlyIntensities_merged.csv')
steps = pd.read_csv('file path\hourlySteps_merged.csv')
sleep = pd.read_csv('file path\sleepDay_merged.csv')
# ActivityDate in activity df only includes dates (no time). Rename it Dates
activity = activity.rename(columns={'ActivityDate': 'Dates'})
# ActivityHour in intensity df and steps df includes date-time. Split date-time column into dates and times in intensity. Drop the date-time column
intensity['Dates'] = pd.to_datetime(intensity['ActivityHour']).dt.date
intensity['Times'] = pd.to_datetime(intensity['ActivityHour']).dt.time
intensity = intensity.drop(columns=['ActivityHour'])
# split date-time column into dates and times in steps. Drop the date-time column
steps['Dates'] = pd.to_datetime(steps['ActivityHour']).dt.date
steps['Times'] = pd.to_datetime(steps['ActivityHour']).dt.time
steps = steps.drop(columns=['ActivityHour'])
# split date-time column into dates and times in sleep. Drop the date-time column
sleep['Dates'] = pd.to_datetime(sleep['SleepDate']).dt.date
sleep['Times'] = pd.to_datetime(sleep['SleepDate']).dt.time
sleep = sleep.drop(columns=['SleepDate', 'TotalSleepRecords'])
# add a column & calculate time_awake_in_bed before falling asleep
sleep['time_awake_in_bed'] = sleep['TotalTimeInBed'] - sleep['TotalMinutesAsleep']
# merge activity and sleep
list = ['Id', 'Dates']
activity_sleep = sleep.merge(activity,
                             on=list,
                             how='outer')
# plot relation between calories used daily vs how long it takes users to fall asleep
plt.scatter(activity_sleep['time_awake_in_bed'], activity_sleep['Calories'], s=20, c='b', marker='o')
plt.axis([0, 200, 0, 5000])
plt.show()
NOTE: max(Calories) = 4900 and min(Calories) = 0. max(time_awake_in_bed) = 150 and min(time_awake_in_bed) = 0
Please let me know how I can get a scatter plot out of this. Thank you in advance for any help.
The same variables from the same data-frame work perfectly with geom_point() in R.
I found where the problem was. As @Redox and @cheersmate mentioned in the comments, the data frame that I created by merging included NaN values. I fixed this by merging only on 'Id'. Then I could create a scatter plot:
list = ['Id']
activity_sleep = sleep.merge(activity,
                             on=list,
                             how='outer')
The column "Dates" is not a good one to merge on, as in each data frame the same dates are repeated across multiple rows. I also noticed that I get the same plot whether I use an outer or an inner merge. Thank you.
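An alternative sketch (with made-up miniature frames) is to keep the original merge and simply drop the rows that are NaN in either plotted column before calling scatter:

```python
import pandas as pd

# Hypothetical miniature versions of the sleep and activity frames
sleep = pd.DataFrame({'Id': [1, 2, 3], 'time_awake_in_bed': [30, 45, 60]})
activity = pd.DataFrame({'Id': [2, 3, 4], 'Calories': [2000, 2500, 1800]})

# An outer merge leaves NaN wherever an Id appears in only one frame
merged = sleep.merge(activity, on='Id', how='outer')

# Drop incomplete rows so matplotlib has real numbers to plot
plot_data = merged.dropna(subset=['time_awake_in_bed', 'Calories'])
print(plot_data)
```

This keeps only the Ids present in both frames, which is also why the outer and inner merges produced the same plot.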
The purpose of this code is to:
create a dummy data set,
turn it into a data frame,
calculate the peaks and make them a column in the data frame,
calculate the troughs and make them a column in the data frame,
fill the NaN values with "hold",
replace all the float values with the word "buy".
The problem is that the last step never works, yet there is no error: the code just prints the dataframe exactly as it was before these last couple of lines.
Here is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema
list1 = np.random.randint(0,30,(25,2))
df = pd.DataFrame(list1, columns=['a', 'b'])
df['minimum']= df.b[(df.b.shift(1) > df.b) & (df.b.shift(-1) > df.b)]
df['maximum'] = df.b[(df.b.shift(1) < df.b) & (df.b.shift(-1) < df.b)]
plt.scatter(df.index, df['minimum'], c='g')
plt.scatter(df.index, df['maximum'], c='r')
df.b.plot(figsize=(15,5))
df['minimum'].fillna('hold', inplace = True)
for x in df['minimum']:
    if type(x) == 'float':
        df['minimum'].replace(x, 'buy', inplace = True)
print('df')
Use np.where to classify it
df['minimum'] = (np.where(df['minimum'].isnull(), 'hold', 'buy'))
np.where is a good idea
You can also do
df.loc[~df['minimum'].isna(),'minimum']='Buy'
df.loc[df['minimum'].isna(),'minimum']='Hold'
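For what it's worth, the original loop never fires because `type(x) == 'float'` compares a type object with a string, which is always False (it would need `isinstance(x, float)`); also, `print('df')` prints the literal string 'df' rather than the dataframe. A small sketch of both vectorised answers on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'minimum': [3.0, np.nan, 7.0, np.nan]})

# np.where: label non-null entries 'buy' and null entries 'hold'
labels = np.where(df['minimum'].isnull(), 'hold', 'buy')
print(list(labels))          # ['buy', 'hold', 'buy', 'hold']

# .loc with the same mask, overwriting the column in place;
# set 'buy' first so the remaining NaNs are still detectable
df.loc[~df['minimum'].isna(), 'minimum'] = 'buy'
df.loc[df['minimum'].isna(), 'minimum'] = 'hold'
print(list(df['minimum']))   # ['buy', 'hold', 'buy', 'hold']
```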
My data consist of OHLCV objects that are a bit weird in that their fields can only be accessed by name, like this:
# rA = [<MtApi.MqlRates object at 0x000000A37A32B308>,...]
type(rA)
# <class 'list'>
ccnt = len(rA) # 100
for i in range(ccnt):
print('{} {} {} {} {} {} {}'.format(i, rA[i].MtTime, rA[i].Open, rA[i].High, rA[i].Low, rA[i].Close, rA[i].TickVolume))
#0 1607507400 0.90654 0.90656 0.90654 0.90656 7
#1 1607507340 0.90654 0.9066 0.90653 0.90653 20
#2 1607507280 0.90665 0.90665 0.90643 0.90653 37
#3 1607507220 0.90679 0.90679 0.90666 0.90666 22
#4 1607507160 0.90699 0.90699 0.90678 0.90678 29
with some additional formatting I have:
Time Open High Low Close Volume
-----------------------------------------------------------------
1607507400 0.90654 0.90656 0.90654 0.90656 7
1607507340 0.90654 0.90660 0.90653 0.90653 20
1607507280 0.90665 0.90665 0.90643 0.90653 37
1607507220 0.90679 0.90679 0.90666 0.90666 22
I have tried things like this:
df = pd.DataFrame(data = rA, index = range(100), columns = ['MtTime', 'Open', 'High','Low', 'Close', 'TickVolume'])
# Resulting in:
# TypeError: iteration over non-sequence
How can I convert this thing into a pandas DataFrame,
so that I can plot it using the original field names?
Plotting using matplotlib should then be possible with something like this:
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
...
df = pd.DataFrame(rA) # not working
df['time'] = pd.to_datetime(df['MtTime'], unit='s')
plt.plot(df['MtTime'], df['Open'], 'r-', label='Open')
plt.plot(df['MtTime'], df['Close'], 'b-', label='Close')
plt.legend(loc='upper left')
plt.title('EURAUD candles')
plt.show()
Possibly related questions (but were not helpful to me):
Numpy / Matplotlib - Transform tick data into OHLCV
OHLC aggregator doesn't work with dataframe on pandas?
How to convert a pandas dataframe into a numpy array with the column names
Converting Numpy Structured Array to Pandas Dataframes
Pandas OHLC aggregation on OHLC data
Getting Open, High, Low, Close for 5 min stock data python
Converting OHLC stock data into a different timeframe with python and pandas
One idea is to use a list comprehension to extract the values into a list of tuples:
L = [(rA[i].MtTime, rA[i].Open, rA[i].High, rA[i].Low, rA[i].Close, rA[i].TickVolume)
for i in range(len(rA))]
df = pd.DataFrame(L, columns = ['MtTime', 'Open', 'High','Low', 'Close', 'TickVolume'])
Or if possible:
df = pd.DataFrame({'MtTime':list(rA.MtTime), 'Open':list(rA.Open),
'High':list(rA.High),'Low':list(rA.Low),
'Close':list(rA.Close), 'TickVolume':list(rA.TickVolume)})
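The same list-comprehension idea can be written generically with getattr; in the self-contained sketch below, Rate is a hypothetical stand-in for MtApi.MqlRates, whose fields I only know by name from the question:

```python
import pandas as pd

# Stand-in for MtApi.MqlRates: a plain object with named attributes only
class Rate:
    def __init__(self, t, o, h, l, c, v):
        self.MtTime, self.Open, self.High = t, o, h
        self.Low, self.Close, self.TickVolume = l, c, v

rA = [Rate(1607507400, 0.90654, 0.90656, 0.90654, 0.90656, 7),
      Rate(1607507340, 0.90654, 0.90660, 0.90653, 0.90653, 20)]

cols = ['MtTime', 'Open', 'High', 'Low', 'Close', 'TickVolume']
# getattr pulls each named field, so the column list drives the extraction
L = [tuple(getattr(r, c) for c in cols) for r in rA]
df = pd.DataFrame(L, columns=cols)
print(df)
```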
Why would numpy.corrcoef() return NaN values?
I am working with high-dimensional data, and it is infeasible to go through every datum to check its value.
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Standardise
X_std = StandardScaler().fit_transform(df.values)
print(X_std.dtype) # Returns "float64"
# Correlation
cor_mat1 = np.corrcoef(X_std.T)
cor_mat1.max() # Returns nan
When computing cor_mat1 = np.corrcoef(X_std.T) I get this warning:
/Users/kimrants/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3183:
RuntimeWarning:
invalid value encountered in true_divide
This is a snippet of the input dataframe:
To try to fix it myself, I started by removing all zero columns and all columns that contained any NaN values. I thought this would solve the problem, but it didn't. Am I missing something? I don't see why else it would return NaN values.
My end goal is to compute eigen-values and -vectors.
If you have a column where all rows have the same value, that column's variance is 0. np.corrcoef() therefore divides that column's correlation coefficients by 0, which does not throw an error; with default numpy settings it only emits the warning invalid value encountered in true_divide, and the affected coefficients are replaced by nan:
import numpy as np
print(np.divide(0,0))
C:\Users\anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in true_divide
"""Entry point for launching an IPython kernel.
nan
Removing all columns with Series.nunique() == 1 should solve your problem.
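A minimal sketch of that fix on made-up data, dropping the constant columns before calling np.corrcoef:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [5.0, 5.0, 5.0],   # constant column -> zero variance
                   'c': [2.0, 4.0, 8.0]})

# Keep only columns with more than one distinct value
df = df.loc[:, df.nunique() > 1]

cor = np.corrcoef(df.values.T)
print(np.isnan(cor).any())   # False
```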
This solved the problem for reasons I cannot explain:
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Keep track of index / columns to reproduce dataframe
cols = df.columns
index = df.index
# Standardise
X_std = StandardScaler().fit_transform(df.values)
X_std = StandardScaler().fit_transform(X_std)
print(X_std.dtype) # Return "float64"
# Turn to dataFrame again to drop values easier
df = pd.DataFrame(data=X_std, columns= cols, index=index)
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
Standardising twice in a row works, but it is weird. (A likely explanation: the first standardisation turns any constant column into an all-zero column, because StandardScaler centres it and leaves zero-variance columns unscaled, so the zero-column filter that follows removes exactly the zero-variance columns that were causing the division by zero.)
I have a dataframe that consists of columns of numbers, and I am trying to calculate the decile rank values for each column. The following code gives me the values for the dataframe as a whole. How can I do it by column?
pd.qcut(df, 10, labels=False)
Thanks.
If you apply qcut across the columns, you will get a dataframe where each entry is the decile rank value.
import numpy as np
import pandas as pd
data_a = np.random.random(100)
data_b = 100*np.random.random(100)
df = pd.DataFrame(columns=['A','B'], data=list(zip(data_a, data_b)))
rank = df.apply(pd.qcut, axis=0, q=10, labels=False)
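A quick sanity check (on seeded random data, assuming 100 distinct values per column) that the result really is a per-column decile rank:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.random(100), 'B': 100 * rng.random(100)})

rank = df.apply(pd.qcut, axis=0, q=10, labels=False)

# With 100 distinct values per column, each decile holds exactly 10 rows
print(rank['A'].value_counts().sort_index())
```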