The purpose of this code is to:
create a dummy data set,
turn it into a data frame,
calculate the peaks and make them a column in the data frame,
calculate the troughs and make them a column in the data frame,
fill the NaN values with "hold",
replace all the float values with the word "buy".
The problem is with the last step: it never works, yet there is no error. It just prints the dataframe exactly as it looked before those last few lines.
Here is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema
list1 = np.random.randint(0,30,(25,2))
df = pd.DataFrame(list1, columns=['a', 'b'])
df['minimum']= df.b[(df.b.shift(1) > df.b) & (df.b.shift(-1) > df.b)]
df['maximum'] = df.b[(df.b.shift(1) < df.b) & (df.b.shift(-1) < df.b)]
plt.scatter(df.index, df['minimum'], c='g')
plt.scatter(df.index, df['maximum'], c='r')
df.b.plot(figsize=(15,5))
df['minimum'].fillna('hold', inplace = True)
for x in df['minimum']:
    if type(x) == 'float':  # bug: type(x) is the type object float, which never equals the string 'float'
        df['minimum'].replace(x, 'buy', inplace=True)  # never reached
print('df')  # bug: prints the literal string 'df', not the dataframe
Use np.where to classify it. Your loop never fires because type(x) returns the type object float, not the string 'float':
df['minimum'] = (np.where(df['minimum'].isnull(), 'hold', 'buy'))
np.where is a good idea. You can also do:
df.loc[~df['minimum'].isna(),'minimum']='Buy'
df.loc[df['minimum'].isna(),'minimum']='Hold'
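For reference, a quick check (my own illustration, not part of the answers above) of why the original loop's condition is never true:

x = 1.5
print(type(x) == 'float')    # False: type(x) is the type object float, not the string 'float'
print(type(x) == float)      # True
print(isinstance(x, float))  # True, and the idiomatic way to test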
Let's consider the dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([np.random.randn(1000)]).transpose()
I want to apply percentage-change transformations for lags 1 through 10 and add them to df. My primitive solution is:
df_copy = df.copy()
for i in range(1, 11):
    to_add = df_copy.pct_change(i)
    df = pd.concat([df, to_add], axis=1)
However, I'm not sure this is the most efficient way to do it. Do you know a better option?
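One possible improvement, sketched under the assumption that the loop's main cost is the repeated pd.concat (which copies the growing frame on every iteration): compute all the lagged columns first and concatenate once. The `_pct_{i}` suffix is only an illustrative naming choice:

import pandas as pd
import numpy as np

df = pd.DataFrame([np.random.randn(1000)]).transpose()
# Build every lagged percentage change first, then do a single concat.
cols = [df.pct_change(i).add_suffix(f'_pct_{i}') for i in range(1, 11)]
df = pd.concat([df] + cols, axis=1)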
I am using the Housing train.csv data from Kaggle to run a prediction.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
I am trying to generate a correlation and only keep the features whose correlation with SalePrice is between 0.5 and 0.9. I tried to use this function to filter some of them, but it only removes the correlation values that are above 0.9.
How would I update this function to only keep those specific features that I need to generate a correlation heat map?
data = train
corr = data.corr()
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if corr.iloc[i, j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
data = data[selected_columns]
import pandas as pd

data = pd.read_csv('train.csv')
col = data.columns
c = [i for i in col if data[i].dtypes == 'int64' or data[i].dtypes == 'float64']  # numeric columns only; dtype == object is dropped
main_col = ['SalePrice']  # the column to compare correlations against
corr_saleprice = data.corr().filter(main_col).drop(main_col)
c1 = (corr_saleprice['SalePrice'] >= 0.5) & (corr_saleprice['SalePrice'] <= 0.9)
c2 = (corr_saleprice['SalePrice'] >= -0.9) & (corr_saleprice['SalePrice'] <= -0.5)
req_index = list(corr_saleprice[c1 | c2].index)  # select the columns meeting the criteria
# req_index.append('SalePrice')  # uncomment if you want the SalePrice column in your final dataframe too
data = data[req_index]
data
Also, using for loops is not very efficient; a direct implementation is preferable (see the compact variant below). I hope this is what you want!
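For instance, a compact variant of the same filter (my own sketch, expressing the ±0.5 to ±0.9 criteria with abs and Series.between):

c = corr_saleprice['SalePrice']
req_index = list(c[c.abs().between(0.5, 0.9)].index)  # same selection as c1 | c2 above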
For generating the heatmap, you can use the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
a = data.corr()
mask = np.triu(np.ones_like(a, dtype=bool))  # np.bool is removed in recent NumPy; use the builtin bool
plt.figure(figsize=(10, 10))
_ = sns.heatmap(a, cmap=sns.diverging_palette(250, 20, n=250), square=True, mask=mask, annot=True, center=0.5)
I have a dataframe with 3 columns.
UserId | ItemId | Rating
(where Rating is the rating a User gave to an Item; it's an np.float16, and the two Ids are np.int32)
How do you best compute correlations between items using python pandas?
My take is to first pivot the table to wide format and then apply DataFrame.corr:
df = df.pivot(index='UserId', columns='ItemId', values='Rating')
df.corr()
It works on small datasets, but not on big ones.
That first step creates a big matrix dataset that is mostly missing values. It's quite RAM-intensive, and I can't run it with bigger dataframes.
Isn't there a simpler way to compute the correlations directly on the long dataset, without pivoting?
(I looked into df.groupby, but that only seems to split the dataframe, which is not what I'm looking for.)
EDIT: oversimplified data and working pivot code
import pandas as pd
import numpy as np
d = {'UserId': [1, 2, 3, 1, 2, 3, 1, 2, 3],
     'ItemId': [1, 1, 1, 2, 2, 2, 3, 3, 3],
     'Rating': [1.1, 4.5, 7.1, 5.5, 3.1, 5.5, 1.1, np.nan, 2.2]}
df = pd.DataFrame(data=d)
df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
print(df.info())
pivot = df.pivot(index='UserId', columns='ItemId', values='Rating')
print('')
print(pivot)
corr = pivot.corr()
print('')
print(corr)
EDIT2: Large random data generator
def randDf(size=100):
    ## MAKE RANDOM DATAFRAME, df =======================
    import numpy as np
    import pandas as pd
    import random
    import math
    dict_for_df = {}
    for i in ('UserId', 'ItemId', 'Rating'):
        dict_for_df[i] = {}
        for j in range(size):
            if i == 'Rating':
                val = round(random.random() * 5, 1)
            else:
                val = round(random.random() * math.sqrt(size / 2))
            dict_for_df[i][j] = val  # store in a dict
    # print(dict_for_df)
    df = pd.DataFrame(dict_for_df)  # after the loop, convert the dict to a dataframe
    # print(df.head())
    df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
    # df = df.astype(dtype={'UserId': np.int64, 'ItemId': np.int64, 'Rating': np.float64})
    ## remove duplicates -----
    df.drop_duplicates(subset=['UserId', 'ItemId'], keep='first', inplace=True)
    ## show -----
    print(df.info())
    print(df.head())
    return df
# =======================

df = randDf()
I had another go, and have something that gets exactly the same correlation numbers as your method without using pivot, but is much slower. I can't say whether it uses less or more memory:
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
import itertools
import pandas as pd
import numpy as np

d = []
itemids = list(set(df['ItemId']))
pairsofitems = list(itertools.combinations(itemids, 2))

for itempair in pairsofitems:
    a = df[df['ItemId'] == itempair[0]][['Rating', 'UserId']]
    b = df[df['ItemId'] == itempair[1]][['Rating', 'UserId']]

    # spread each item's ratings into a NaN-padded vector indexed by UserId
    z = np.full(len(set(df.UserId)), np.nan)
    z[a.UserId.values] = a.Rating.values
    w = np.full(len(set(df.UserId)), np.nan)
    w[b.UserId.values] = b.Rating.values

    # keep only the positions where both items were rated
    good = ~np.logical_or(np.isnan(w), np.isnan(z))
    z = np.compress(good, z)
    w = np.compress(good, w)

    d.append({'firstitem': itempair[0],
              'seconditem': itempair[1],
              'correlation': pearsonr(z, w)[0]})

df_out = pd.DataFrame(d, columns=['firstitem', 'seconditem', 'correlation'])
This was helpful for working out how to handle the NaNs before taking the correlation.
The slicing in the two lines at the top of the for loop takes time, but I think the approach has potential if the bottlenecks can be fixed.
Yes, there is some repetition with the z and w variables; that could be factored into a function (see the sketch after the list below).
Some explanation of what it does:
find all combinations of pairs within your items
organise an "x" and "y" set of points for UserId / Rating, dropping any point pair where one of the two is missing (NaN). I think of a scatter plot, with the correlation being how well a straight line fits through it.
run the Pearson correlation on this x-y pair
put the ItemIds of each pair, and their correlation, into a dataframe
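A minimal sketch of that refactor (my own illustration; like the code above, it assumes UserId values can be used directly as 0-based positions):

def ratings_vector(sub, n_users):
    # place one item's ratings into a NaN-padded vector indexed by UserId
    v = np.full(n_users, np.nan)
    v[sub.UserId.values] = sub.Rating.values
    return v

# inside the loop, a and b would then reduce to:
n_users = len(set(df.UserId))
z = ratings_vector(a, n_users)
w = ratings_vector(b, n_users)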
I have 2 timeseries dataframes, where xp holds, say, the x coordinates of the data in fp.
I want to interpolate the values from the xp/fp combinations per date, for a fixed set of x values. The resulting output should be a timeseries dataframe with the same datetime index as xp and fp, and as many columns as there are elements in x.
I have tried to use numpy.interp(), but I end up with ValueError: object too deep for desired array.
import pandas as pd
import numpy as np
fp = pd.DataFrame(
    data=np.random.randint(0, 100, size=(10, 4)),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
xp = pd.DataFrame(
    data=np.column_stack([
        list(range(1, 11)),
        list(range(70, 80)),
        list(range(150, 160)),
        list(range(220, 230)),
    ]),
    index=pd.date_range("20180101", periods=10),
    columns=list('ABCD'),
)
x = [60, 120, 180]
x_interp = np.interp(x, xp, fp)
It seems that np.interp can't take dataframes as input, but it sounds like it is the fastest way to do this for a large dataset (>3000 xp and fp rows).
I would appreciate any pointers.
UPDATE
I found a way of doing what I wanted, shown below:
x_interp = pd.DataFrame.from_records(
    fp.index.to_series().apply(lambda z: np.interp(x, xp.loc[z], fp.loc[z])).values,
    index=fp.index,
)
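An equivalent, perhaps more readable sketch of the same row-by-row idea (my own rewrite, assuming the fp, xp and x defined above; labelling the output columns by the x values is just an illustrative choice):

rows = [np.interp(x, xp.loc[t], fp.loc[t]) for t in fp.index]
x_interp = pd.DataFrame(rows, index=fp.index, columns=x)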
I have an HDF5 file that I have moved into a DataFrame, but the problem is that when I plot it, nothing shows on the graph. I checked the new dataframe and saw that it contained nothing.
This is my DF (I'm not allowed to post pictures, so please click the link).
df1 = pd.DataFrame(df.Price, index = df.Timestamp)
plt.figure()
df1.plot()
plt.show()
The second DF shows NaN in the Price column. What's wrong?
I think you need set_index with the Timestamp column, then select the Price column and plot:
#convert column to floats
df['Price'] = df['Price'].astype(float)
df.set_index('Timestamp')['Price'].plot()
#if some non numeric data, convert them to NaNs
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df.set_index('Timestamp')['Price'].plot()
And you get NaNs if you use the DataFrame constructor, because the data is not aligned: the values of the index of df are not the same as the values in the Timestamp column.
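A minimal illustration of that alignment behaviour (hypothetical values; the point is that the constructor aligns on index labels rather than positions):

import pandas as pd
s = pd.Series([10.0, 20.0])            # default index: 0, 1
print(pd.DataFrame(s, index=[5, 6]))   # labels 5 and 6 don't exist in s, so both rows are NaN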
You can fix this by adding .values. And how about creating a Series instead?
#df1 = pd.DataFrame(df.Price.values, df.Timestamp)
serie = pd.Series(df.Price.values, df.Timestamp)
I saw it was answered here: pandas.Series() Creation using DataFrame Columns returns NaN Data entries
Full example:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=["Price","Timestamp","Random"])
df.Price = np.random.randint(100, size = 10)
df.Timestamp = [datetime.datetime(2000, 1, 1) + datetime.timedelta(days=int(i))
                for i in np.random.randint(100, size=10)]
df.Random = np.random.randint(10, size= 10)
serie = pd.Series(df.Price.values, df.Timestamp)
serie.plot()
plt.show()
Difference
print("{}\n{}".format(type(df.Price), type(df.Price.values)))
<class 'pandas.core.series.Series'> # does not work
<class 'numpy.ndarray'> # works
A plain ndarray carries no index, so the Series constructor assigns its values positionally; a Series argument is aligned on its index labels, which don't match the Timestamp values, and that alignment produces the NaNs.