I am given a dataset called stocks_df. Each column holds the daily prices of a different stock. I am trying to normalize it and return the result as a matrix, so that each column holds the normalized prices of one stock for each day.
Wrote up this function-
def normalized_prices(stocks_df):
    normalized=np.zeros((stocks_df.shape[0],len(stocks_df.columns[1:])))
    for i in range(1,len(stocks_df.columns[1:])+1):
        for j in range(0,stocks_df.shape[0]+1):
            normalized[i,j]=((stocks_df[i][j]/stocks_df[0][i]))
    return normalized
And then tried to call the function-
normalized_prices(stocks_df)
But I'm getting this error-
What can be done to fix this?
From your code, it looks like you want to divide every column by the first column, so you can simply do:
import numpy as np
import pandas as pd
np.random.seed(123)
stocks_df = pd.DataFrame(np.random.uniform(0,1,(20,10)))
stocks_df.div(stocks_df[0],axis=0)
0 1 2 3 4 5 6 7 8 9
0 1.0 0.410843 0.325716 0.791585 1.033023 0.607502 1.408195 0.983288 0.690529 0.563008
1 1.0 2.124407 1.277973 0.173898 1.159877 2.150474 0.531770 0.511256 1.548909 1.549713
2 1.0 1.338951 1.141952 0.963150 1.138780 0.509077 0.570284 0.359809 0.462979 0.994601
3 1.0 4.708772 4.677955 5.360028 4.623317 3.390277 4.628973 9.699688 10.250916 5.448532
4 1.0 0.185300 0.508509 0.664836 1.388421 0.401401 0.774152 1.579542 0.832571 0.982277
This gives you every column divided by the first. Now you just need to subset this output:
stocks_df.div(stocks_df[0],axis=0).iloc[:,1:]
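If the result is needed as a matrix, as the question asks, the same idea can be wrapped in a function. This is only a sketch: it assumes that "normalize" means dividing every other column by the first column (as in the answer above), and the random test data is made up for illustration.

```python
import numpy as np
import pandas as pd

def normalized_prices(stocks_df):
    """Divide every column except the first by the first column, row by row."""
    base = stocks_df.iloc[:, 0]                        # first column, used as the divisor
    ratios = stocks_df.iloc[:, 1:].div(base, axis=0)   # align on the row index
    return ratios.to_numpy()                           # plain NumPy matrix

# Made-up test data: 20 days, 10 columns, values kept away from zero
rng = np.random.default_rng(123)
stocks_df = pd.DataFrame(rng.uniform(0.1, 1.0, (20, 10)))

normalized = normalized_prices(stocks_df)
print(normalized.shape)
```

The result has one row per day and one column per stock other than the first, so no explicit loops or index arithmetic are needed.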
How can I calculate the average true range in a data frame?
I have tried using np.where() and it is not working.
I have all of the values below:
Current High - Current Low
abs(Current High - Previous Close)
abs(Current Low - Previous Close)
but I don't know how to set the highest of the three values in the pandas data frame.
It looks like you might be trying to do the following :
import pandas as pd
from numpy.random import rand
# a list (not a set) keeps the column order fixed
df = pd.DataFrame(rand(10,5),columns=['High-Low','Low-close','B','A','High-close'])
cols = ['High-Low','High-close','Low-close']
df['true_range'] = df[cols].max(axis=1)
print(df)
The output will look like
High-Low Low-close B A High-close true_range
0 0.916121 0.026572 0.082619 0.672000 0.605287 0.916121
1 0.622589 0.944646 0.638486 0.905139 0.262275 0.944646
2 0.611374 0.756191 0.829803 0.828205 0.614956 0.756191
3 0.810638 0.501693 0.504800 0.069532 0.283825 0.810638
4 0.984463 0.900823 0.434061 0.905273 0.518056 0.984463
5 0.377742 0.480266 0.018676 0.383831 0.819448 0.819448
6 0.473753 0.652077 0.730400 0.305507 0.396969 0.652077
7 0.427047 0.733135 0.526076 0.542852 0.719194 0.733135
8 0.911629 0.633997 0.101848 0.020811 0.327233 0.911629
9 0.244624 0.893365 0.278941 0.354696 0.678280 0.893365
If this isn't what you had in mind, it would be helpful to clarify your question by providing a small example where you clearly identify the columns and the index in your DataFrame and what you mean by "true range".
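In case the starting point is raw price data, here is a rough sketch of the whole calculation. The column names High, Low and Close, the sample prices, and the 3-row window are all assumptions for illustration; the average here is a plain rolling mean, not Wilder's smoothing, which some definitions of ATR use.

```python
import pandas as pd

# Made-up price data; the column names are assumptions
df = pd.DataFrame({
    "High":  [10.0, 11.0, 12.0, 11.5, 13.0],
    "Low":   [9.0, 10.0, 10.5, 10.0, 11.0],
    "Close": [9.5, 10.5, 11.0, 11.0, 12.5],
})

prev_close = df["Close"].shift(1)  # previous row's close (NaN on the first row)

# The three candidate ranges from the question, side by side
ranges = pd.concat(
    [
        df["High"] - df["Low"],
        (df["High"] - prev_close).abs(),
        (df["Low"] - prev_close).abs(),
    ],
    axis=1,
)

# Row-wise maximum; NaN entries on the first row are skipped
df["true_range"] = ranges.max(axis=1)

# Simple moving average of the true range (window length is an assumption)
df["atr"] = df["true_range"].rolling(window=3).mean()
print(df)
```

No np.where() is needed, because max(axis=1) already picks the largest of the three values in each row.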
I have a data frame with the column "Key" as index like below:
Key Prediction
C11D0 0
C11D1 8
C12D0 1
C12D1 5
C13D0 3
C13D1 9
C14D0 4
C14D1 9
C15D0 5
C15D1 3
C1D0 5
C2D0 7
C3D0 4
C4D0 1
C4D1 9
I want to add the values of two cells in the Prediction column when their index values match. The logic is that I want to add the values whose indexes match in their first 4 characters, for example the indexes "C11D0" & "C11D1" or "C14D0" & "C14D1". The output would then be:
Operation Addition Result
C11D0+C11D1 8
C12D0+C12D1 6
C13D0+C13D1 12
You can use the isin function.
Example:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':[1,2,1,3,7,1]})
df[df.id.isin([1,5,6])].value.sum()
output:
9
for your case
idx = ['C11D0', 'C11D1']
print(df[df.Key.isin(idx)].Prediction.sum()) #outputs 8
First make Key a column if it is currently the index:
df.reset_index(inplace=True)
Then you can use DataFrame.loc with boolean indexing:
df.loc[df['Key'].isin(["C11D0","C11D1"]),'Prediction'].sum()
You can also create a function for it:
def sum_select_df(key_list,df):
    return pd.concat([df[df['Key'].isin(['C'+str(key)+'D1','C'+str(key)+'D0'])] for key in key_list])['Prediction'].sum()
sum_select_df([11,14],df)
Output:
21
Here is a complete solution, slightly different from the other answers so far. I tried to make it pretty self-explanatory, but let me know if you have any questions!
import numpy as np # only used to generate test data
import pandas as pd
import itertools as itt
start_inds = ["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C14D0", "C14D1",
"C15D0", "C15D1", "C1D0", "C2D0", "C3D0", "C4D0", "C4D1"]
test_vals = np.random.randint(low=0, high=10, size=len(start_inds))
df = pd.DataFrame(data=test_vals, index=start_inds, columns=["prediction"])
ind_combs = itt.combinations(df.index.array, 2)
sum_records = ((f"{ind1}+{ind2}", df.loc[[ind1, ind2], "prediction"].sum())
for (ind1, ind2) in ind_combs if ind1[:4] == ind2[:4])
res_ind, res_vals = zip(*sum_records)
res_df = pd.DataFrame(data=res_vals, index=res_ind, columns=["sum_result"])
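Another possible route, sketched here under the assumption that "matching" means sharing the first 4 characters of the index, is to group on that prefix and keep only the groups with more than one row. The data is the table from the question; the variable names are just for illustration.

```python
import pandas as pd

# Data copied from the question, with Key as the index
keys = ["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C14D0", "C14D1",
        "C15D0", "C15D1", "C1D0", "C2D0", "C3D0", "C4D0", "C4D1"]
values = [0, 8, 1, 5, 3, 9, 4, 9, 5, 3, 5, 7, 4, 1, 9]
df = pd.DataFrame({"Prediction": values}, index=pd.Index(keys, name="Key"))

prefix = df.index.str[:4]                 # first 4 characters of each key
grouped = df.groupby(prefix)["Prediction"]

# Keep only the prefixes shared by at least two keys
sums = grouped.sum()[grouped.size() > 1]
print(sums)
```

Note that with a strict 4-character prefix, "C4D0" and "C4D1" do not fall into the same group, which matches the behaviour of the ind1[:4] == ind2[:4] test above.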
I'm trying to compare the differences and similarities between 10 dataframes. I have decided to df.describe() each dataframe in turn and accumulate the results into a new dataframe.
count mean std min 25% 50% 75% max
run
0 38 11.9394 3.99795 2.66622 9.00963 13.6531 14.6516 18.2803
1 75 13.7902 2.69114 8.06895 13.5017 14.3492 15.4146 17.4614
2 17 13.9666 1.12535 11.1525 13.7025 14.1217 14.6637 15.6118
3 21 13.2841 2.81016 6.25177 13.198 14.0382 15.1457 16.2141
4 29 11.5376 3.35056 6.70377 8.43451 12.8287 14.7004 16.155
5 11 12.5245 3.0237 6.01391 11.0818 13.6772 14.6237 15.527
6 32 13.7039 2.36393 6.95464 13.6765 14.1967 14.8114 17.3966
7 11 13.9055 2.03886 10.5235 12.6321 13.9394 14.5784 18.0726
8 19 13.2579 1.80329 9.00478 13.0772 13.8909 14.1755 15.0772
9 28 13.2817 3.61778 5.64462 9.90116 14.6581 15.6785 18.7766
I thought from this point it would be trivial to do a barplot where each bar was a different variable (the columns) and they were hued according to which dataframe the variable was from (the rows).
However I can't work out how to split up the columns.
sns.barplot(data = describedWidth)
outputs the following graph
https://i.stack.imgur.com/9XITl.png
Thanks in advance
What about:
df[df.columns].plot(kind = 'bar')
This, by default, should print all the columns with different legends. You can later customize according to your requirements I suppose.
You can also do this for your dataframe summary:
descDf = df.describe()
descDf[descDf.columns].plot(kind = 'bar')
Sample output:
PS: Apologies for the clumsy output image but you get the point I hope.
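If the seaborn hue behaviour is what is wanted, one possible approach is to reshape the accumulated table from wide to long form first. This is a sketch only: it uses the first three runs and three of the statistics from the table in the question, and the names described and long_df are made up for illustration.

```python
import pandas as pd

# A small slice of the accumulated describe() table from the question
described = pd.DataFrame(
    {
        "count": [38, 75, 17],
        "mean": [11.9394, 13.7902, 13.9666],
        "std": [3.99795, 2.69114, 1.12535],
    },
    index=pd.Index([0, 1, 2], name="run"),
)

# Wide -> long: one row per (run, statistic) pair
long_df = described.reset_index().melt(
    id_vars="run", var_name="statistic", value_name="value"
)
print(long_df)

# With seaborn imported as sns, the long table can then be plotted as:
# sns.barplot(data=long_df, x="statistic", y="value", hue="run")
```

With plain pandas, described.T.plot(kind='bar') should give a similar picture without reshaping, since it puts the statistics on the x axis and draws one bar per run.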
I have a dataframe with one column:revenue_sum
revenue_sum
10000.0
12324.0
15534.0
26435.0
45623.0
56736.0
56353.0
And I want to write a function that creates all of the new columns at once, each showing a forward sum of revenues.
For example, the first row of 'revenue_1' should show the sum of the first two floats in revenue_sum;
the second row of 'revenue_1' should show the sum of the 2nd and 3rd floats in revenue_sum.
The first row of 'revenue_2' should show the sum of the first 3 floats in revenue_sum.
revenue_sum revenue_1 revenue_2
10000.0 22324.0 37858.0
12324.0 27858.0 54293.0
15534.0 41969.0 87592.0
26435.0 72058.0 128794.0
45623.0 102359.0 158712.0
56736.0 113089.0 NaN
56353.0 NaN NaN
Here is my code:
df_revenue_sum1 = df_revenue_sum1.iloc[::-1]
len_sum1 = len(df_revenue_sum1)+1
def func(df_revenue_sum1):
    for i in range(1,len_sum1):
        df_revenue_sum1['revenue_'+'i'] = df_revenue_sum1['revenue_sum'].rolling(i+1).sum()
    return df_revenue_sum1
df_revenue_sum1 = df_revenue_sum1.applymap(func)
And it shows the error:
"'float' object is not subscriptable", 'occurred at index revenue_sum'
I think there might be an easier way to do this without a for loop. The pandas function rolling (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html) might do what you need. It aggregates along a sliding window controlled by the window and min_periods parameters. window is the number of rows in each window. min_periods is the minimum number of non-NaN values a window must contain to produce a result; windows with fewer values yield NaN. Applying this works as follows:
import pandas as pd
# The dataframe provided
d = {
'revenue_sum': [
10000.0,
12324.0,
15534.0,
26435.0,
45623.0,
56736.0,
56353.0
]
}
# Reverse the dataframe because rolling only looks backwards and
# we want to make a rolling window forward
d1 = pd.DataFrame(data=d)
# .copy() avoids a SettingWithCopyWarning when adding columns below
df = d1[::-1].copy()
# apply rolling summing 2 at a time
df['revenue_1'] = df['revenue_sum'].rolling(min_periods=2, window=2).sum()
# apply rolling window 3 at a time
df['revenue_2'] = df['revenue_sum'].rolling(min_periods=3, window=3).sum()
print(df[::-1])
This gave me the following dataframe:
revenue_sum revenue_1 revenue_2
0 10000.0 22324.0 37858.0
1 12324.0 27858.0 54293.0
2 15534.0 41969.0 87592.0
3 26435.0 72058.0 128794.0
4 45623.0 102359.0 158712.0
5 56736.0 113089.0 NaN
6 56353.0 NaN NaN
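To create all of the columns at once, as the question asks, the same rolling idea can be put into a loop inside a function. This is only a sketch: the function name add_forward_sums is made up, and instead of reversing the frame it shifts the backward-looking rolling sum up by k rows, which should give the same forward sums.

```python
import pandas as pd

def add_forward_sums(df, column="revenue_sum"):
    """Add revenue_1, revenue_2, ... where revenue_k sums a row and the k rows after it."""
    out = df.copy()
    for k in range(1, len(out)):
        # rolling(k + 1) sums the current row and the k rows before it;
        # shift(-k) moves each result up so it lines up with the first row of its window
        out[f"revenue_{k}"] = out[column].rolling(k + 1).sum().shift(-k)
    return out

df = pd.DataFrame({
    "revenue_sum": [10000.0, 12324.0, 15534.0, 26435.0, 45623.0, 56736.0, 56353.0]
})
result = add_forward_sums(df)
print(result[["revenue_sum", "revenue_1", "revenue_2"]])
```

The function is called once on the whole dataframe; applymap is not needed, because it passes individual cell values (floats) to the function rather than the dataframe.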
I'm trying to create a for-loop that automatically runs through my parsed list of NASDAQ stocks and inserts their Quandl codes so they can be retrieved from Quandl's database, essentially creating a large data set of stocks to perform data analysis on. My code "appears" right, but when I print the query it only prints 'GOOG/NASDAQ_Ticker' and nothing else. Any help and/or suggestions will be most appreciated.
import quandl
import pandas as pd
import matplotlib.pyplot as plt
import numpy
def nasdaq():
    nasdaq_list = pd.read_csv('C:\Users\NAME\Documents\DATASETS\NASDAQ.csv')
    nasdaq_list = nasdaq_list[[0]]
    print nasdaq_list
    for abbv in nasdaq_list:
        query = 'GOOG/NASDAQ_' + str(abbv)
        print query
        df = quandl.get(query, authtoken="authoken")
        print df.tail()[['Close', 'Volume']]
Iterating over a pd.DataFrame as you have done iterates over the column labels, not the rows. For example,
>>> df = pd.DataFrame(np.arange(9).reshape((3,3)))
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
>>> for i in df[[0]]: print(i)
0
I would just get the first column as a Series with .iloc (the older .ix indexer has been removed from pandas),
>>> for i in df.iloc[:,0]: print(i)
0
3
6
Note that in general if you want to iterate by row over a DataFrame you're looking for iterrows().
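Applied to the ticker list, the loop might look like the sketch below. The small dataframe stands in for the CSV file, and its column names and symbols are made up for illustration; the Quandl call itself is left out so the snippet runs without a network connection or an API key.

```python
import pandas as pd

# Stand-in for the CSV of NASDAQ listings (hypothetical contents)
nasdaq_list = pd.DataFrame({
    "Ticker": ["AAPL", "MSFT", "GOOG"],
    "Name": ["Apple", "Microsoft", "Alphabet"],
})

queries = []
# iloc[:, 0] selects the first column as a Series, so the loop sees its values
for abbv in nasdaq_list.iloc[:, 0]:
    query = "GOOG/NASDAQ_" + str(abbv)
    queries.append(query)
    print(query)
```

Each query string could then be passed to quandl.get inside the same loop, as in the original code.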