I would like to change my index from how it is seen in image 1 to image 2.
Here is the code to get Image 1:
import pandas as pd
import pandas_datareader.data as web
from datetime import datetime

stocks = pd.DataFrame()
tickers = ['AAPL', 'TSLA', 'IBM', 'MSFT']
for tick in tickers:
    df = web.DataReader(tick, "av-daily", start=datetime(2015, 1, 1),
                        end=datetime.today(), api_key='')
    df['Stock'] = tick
    stocks = pd.concat([stocks, df])
stocks.index = pd.to_datetime(stocks.index)
stocks = stocks.set_index('Stock', append=True)
vol = stocks[['volume']]
weekly = vol.groupby([pd.Grouper(level=0, freq='W', label='left'),
                      pd.Grouper(level='Stock')]).sum()
weekly.index.rename(['Date', 'Stock'], inplace=True)
weekly.unstack()
Image 1
Image 2
After you get the stocks DataFrame, do this:
weekly = stocks[["volume"]].unstack().resample("W").sum()
weekly.index = pd.MultiIndex.from_tuples([(dt.year, dt.isocalendar()[1]) for dt in weekly.index])
>>> weekly
volume
Stock AAPL IBM MSFT TSLA
2015 1 53204626 5525341 27913852 4764443
2 282868187 24440360 158596624 22622034
3 304226647 23272056 157088136 30799137
4 198737041 31230797 137352632 16215501
5 465842684 32927307 437786778 15720217
... ... ... ...
2021 23 327048055 22042806 107035149 105306562
24 456667151 23177438 128993727 107296122
25 354155878 17129373 117966870 153549954
26 321360130 29077036 104384023 103666230
27 213093382 12153414 54825591 42076410
To drop the leftover "volume" column level:
weekly = weekly.droplevel(level=0, axis=1)
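The whole pipeline can be checked end to end on synthetic data, since the av-daily reader needs an API key. This is a minimal sketch with made-up volumes for two tickers; the ticker names and numbers are placeholders, not real data:

```python
import pandas as pd

# Synthetic daily volumes for two hypothetical tickers -- a stand-in for
# the (Date, Stock)-indexed frame built in the question.
dates = pd.date_range("2021-01-04", periods=14, freq="D")
idx = pd.MultiIndex.from_product([dates, ["AAPL", "MSFT"]],
                                 names=["Date", "Stock"])
stocks = pd.DataFrame({"volume": range(len(idx))}, index=idx)

# Pivot tickers into columns, then sum volume per calendar week.
weekly = stocks[["volume"]].unstack().resample("W").sum()

# Replace the weekly timestamps with a (year, ISO week) MultiIndex.
weekly.index = pd.MultiIndex.from_tuples(
    [(dt.year, dt.isocalendar()[1]) for dt in weekly.index]
)

# Drop the leftover "volume" column level.
weekly = weekly.droplevel(level=0, axis=1)
print(weekly)
```

With 14 days of data this yields two rows, indexed (2021, 1) and (2021, 2), with one column per ticker.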
I have a dataframe with some NaN and outlier values which I want to fill with the mean value of the specific month.
df[["arrival_day", "ib_units", "month","year"]]
arrival_day ib_units month year
37 2020-01-01 262 1 2020
235 2020-01-02 2301 1 2020
290 2020-01-02 145 1 2020
476 2020-01-02 6584 1 2020
551 2020-01-02 30458 1 2020
... ... ... ... ...
1479464 2022-07-19 56424 7 2022
1479490 2022-07-19 130090 7 2022
1479510 2022-07-19 3552 7 2022
1479556 2022-07-19 23779 7 2022
1479756 2022-07-20 2882 7 2022
I know there is the pandas.DataFrame.fillna function df.fillna(df.mean()), but in this case it would use the overall mean of the whole dataset. I want to fill the NaNs with the mean value of the specific month in that specific year.
This is what I have tried, but this solution is not straightforward and only calculates the mean by year, not by month:
mask_2020 = (df['arrival_day'] >= '2020-01-01') & (df['arrival_day'] <= '2020-12-31')
df_2020 = df.loc[mask_2020]
mask_2021 = (df['arrival_day'] >= '2021-01-01') & (df['arrival_day'] <= '2021-12-31')
df_2021 = df.loc[mask_2021]
mask_2022 = (df['arrival_day'] >= '2022-01-01') & (df['arrival_day'] <= '2022-12-31')
df_2022 = df.loc[mask_2022]
mean_2020 = df_2020.ib_units.mean()
mean_2021 = df_2021.ib_units.mean()
mean_2022 = df_2022.ib_units.mean()
# this finds quartile outliers and replaces them with the mean value of the specific year
for x in ['ib_units']:
    q75, q25 = np.percentile(df_2020.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)
    df_2020.loc[df_2020[x] < lower, x] = mean_2020
    df_2020.loc[df_2020[x] > upper, x] = mean_2020
for x in ['ib_units']:
    q75, q25 = np.percentile(df_2021.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)
    df_2021.loc[df_2021[x] < lower, x] = mean_2021
    df_2021.loc[df_2021[x] > upper, x] = mean_2021
for x in ['ib_units']:
    q75, q25 = np.percentile(df_2022.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)
    df_2022.loc[df_2022[x] < lower, x] = mean_2022
    df_2022.loc[df_2022[x] > upper, x] = mean_2022
So how can I do this in a more code effective way and also by month and not by year?
Thanks!
I think you are overthinking this. See whether the code below works for you. For outliers the approach is the same as for filling the NaNs.
import pandas as pd

# Sample data:
df = pd.DataFrame({'date': ['2000-01-02', '2000-01-02', '2000-01-15', '2000-01-27',
                            '2000-06-03', '2000-06-29', '2000-06-15', '2000-06-29',
                            '2001-01-02', '2001-01-02', '2001-01-15', '2001-01-27'],
                   'val': [5, 7, None, 4,
                           8, 1, None, 9,
                           2, 3, None, 7]})

# Convert to datetime and extract year and month:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Compute the mean per (year, month) group:
tem = df.groupby(['year', 'month'])[['val']].mean().reset_index()
tem.rename(columns={'val': 'val_mean'}, inplace=True)
tem

# Merge and fill the NaNs:
df = pd.merge(df, tem, how='left', on=['year', 'month'])
df.loc[df['val'].isna(), 'val'] = df['val_mean']
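The merge works, but the same per-(year, month) fill can also be done without a merge via groupby().transform("mean"), which broadcasts each group's mean back to that group's rows so fillna stays aligned with the original index. A minimal sketch on toy data along the lines of the sample above:

```python
import pandas as pd

# Toy data: one NaN per month group.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-02", "2000-01-15",
                            "2000-06-03", "2000-06-15"]),
    "val": [5.0, None, 8.0, None],
})
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# transform("mean") returns a Series the same length as df, holding each
# row's (year, month) group mean, so it can feed fillna directly.
group_mean = df.groupby(["year", "month"])["val"].transform("mean")
df["val"] = df["val"].fillna(group_mean)
print(df["val"].tolist())
```

Here the January NaN is filled with 5.0 and the June NaN with 8.0, each month's own mean.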
I'd like to slightly modify #PTQouc's code, relying on his dataframe.
Grouping:
tem = df.groupby(['year', 'month'])['val'].mean().reset_index()
Merging:
merged = df.merge(tem, how='left', on=['year', 'month'])
Using where:
merged['col_z'] = merged['val_x'].where(merged['val_x'].notnull(), merged['val_y'])
Dropping:
merged = merged.drop(['val_x', 'val_y'], axis=1)
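The where() step can be demonstrated in isolation. This sketch fakes the merge result directly; val_x stands for the raw value and val_y for the per-group mean the merge would bring in (the numbers are hypothetical):

```python
import pandas as pd

# Stand-in for the post-merge frame: val_x raw, val_y the group mean.
merged = pd.DataFrame({"val_x": [5.0, None, 8.0],
                       "val_y": [5.0, 6.0, 8.0]})

# where() keeps val_x wherever it is non-null and falls back to val_y.
merged["col_z"] = merged["val_x"].where(merged["val_x"].notnull(),
                                        merged["val_y"])
merged = merged.drop(["val_x", "val_y"], axis=1)
print(merged["col_z"].tolist())
```

Only the null row picks up the fallback value; the rest keep their originals.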
I'm trying to build a simple stock screener. The screener should download the volume and average volume and give me all stocks satisfying the condition volume > avg. volume.
The problem is that something goes wrong and I don't know how to fix the code. It's a problem with the code overall, and I hope I can get some help. I often get a message that the data is empty, but I think something is wrong with the tables and the conditions...
The first for loop gives me this, and this is good; I think there is no mistake in that part of the code:
Ticker volume avg. volume
aapl 31 20
ayx 20 32
nflx 25 28
The second for loop is meant to check my condition (volume > avg. volume), but it gives me the following output:
ticker1 ... Stock
0 NaN ... ticker1
1 NaN ... Volume_1
2 NaN ... Average_Volume
[3 rows x 4 columns]
Process finished with exit code 0
Normally only Apple fulfills the condition, and the result should look like:
Stock volume avg. volume
aapl 31 20
That's my code:
from yahoo_fin.stock_info import get_analysts_info, get_stats, get_live_price, get_quote_table
import pandas as pd
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

tickers = ['aapl', 'ayx', 'nflx']
exportList = pd.DataFrame(columns=['ticker1', 'Volume_1', 'Average_Volume'])

for ticker in tickers:
    df = get_stats(ticker)
    df['ticker'] = ticker
    df = df.pivot(index='ticker', columns='Attribute', values='Value')
    df['Volume_1'] = get_quote_table(ticker)['Volume']
    df['Average_Volume'] = get_quote_table(ticker)['Avg. Volume']
    df = df[['Volume_1', 'Average_Volume']]
    df = df.reset_index()
    df.columns = ('ticker', 'Volume_1', 'Average_Volume')

rs_df = pd.DataFrame(columns=['ticker1', 'Volume_1', 'Average_Volume'])

for stock in rs_df:
    Volume_1 = df["Volume_1"]
    Average_Volume = df["Average_Volume"]
    if float(Volume_1) < float(Average_Volume):
        exportList = exportList.append({'Stock': stock, "Volume_1": Volume_1, "Average_Volume": Average_Volume},
                                       ignore_index=True)

print('\n', exportList)
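For reference, the filtering step the question is aiming for doesn't need a second loop at all; a boolean mask does it in one line. A minimal, network-free sketch, using the toy numbers from the first loop's example output above:

```python
import pandas as pd

# Stand-in for the per-ticker table the first loop assembles
# (numbers copied from the example output in the question).
df = pd.DataFrame({
    "ticker": ["aapl", "ayx", "nflx"],
    "Volume_1": [31, 20, 25],
    "Average_Volume": [20, 32, 28],
})

# Keep only rows where current volume exceeds the average volume.
export_list = df[df["Volume_1"] > df["Average_Volume"]]
print(export_list)
```

With these numbers only aapl survives the filter, matching the expected output in the question.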
The data here is web-scraped from a website. The initial response stored in the variable r has three columns: 'Country', 'Date', and '% vs 2019 (Daily)'. From these I extracted only the rows with dates from "2021-01-01" to today. What I am trying to do (I have spent hours on it) is organize the data so that there is one column with the dates, and four other columns named after the countries Denmark, Finland, Norway, and Sweden, with the cells underneath each country populated with the percentage data. I have tried [], loc, iloc, and various other combinations to filter the pandas DataFrames that way, but to no avail.
Here is the code I have so far:
import requests
import pandas as pd
import json
import datetime
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=jspn')
data = r.content
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0])
d = [[i['c'][0]['v'], i['c'][2]['f'], (i['c'][5]['v'])*100 ] for i in data['table']['rows']]
df = pd.DataFrame(d, columns=['Country', 'Date', '% vs 2019 (Daily)'])
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# EXTRACTING BETWEEN TWO DATES
df['Date'] = pd.to_datetime(df['Date'])
startdate = datetime.datetime.strptime('2021-01-01', "%Y-%m-%d").date()
enddate = datetime.datetime.strptime('2021-02-02', "%Y-%m-%d").date()
pd.Timestamp('today').floor('D')
df = df[(df['Date'] > pd.Timestamp(startdate).floor('D')) & (df['Date'] <= pd.Timestamp(enddate).floor('D'))]
Den = df.loc[df['Country'] == 'Denmark']
Fin = df.loc[df['Country'] == 'Finland']
Swe = df.loc[df['Country'] == 'Sweden']
Nor = df.loc[df['Country'] == 'Norway']
Den_date = Den.loc[:, "Date"]
Den_data = Den.loc[:, "% vs 2019 (Daily)"]
Fin_date = Fin.loc[:, "Date"]
Fin_data = Fin.loc[:, "% vs 2019 (Daily)"]
Nor_data = Nor.loc[:, "% vs 2019 (Daily)"]
Swe_data = Swe.loc[:, "% vs 2019 (Daily)"]
df2 = pd.DataFrame()
df2['DEN_DATE'] = Den_date
df2['DENMARK'] = Den_data
df3 = pd.DataFrame()
df3['FIN_DATE'] = Fin_date
df3['FINLAND'] = Fin_data
Want it to be organized like this so I can eventually export it to excel:
Date | Denmark | Finland| Norway | Sweden
2020-01-01 | 1234 | 4321 | 5432 | 6574
...
Any help is greatly appreciated.
Thank you!
Use isin to filter only the countries you are interested in. Then use pivot to return a reshaped dataframe organized by the given index and column values; here the index is the Date column and the column values are the countries from the previous selection.
...
...
pd.Timestamp('today').floor('D')
df = df[(df['Date'] > pd.Timestamp(startdate).floor('D')) & (df['Date'] <= pd.Timestamp(enddate).floor('D'))]
countries_list=['Denmark', 'Finland', 'Norway', 'Sweden']
countries_selected = df[df.Country.isin(countries_list)]
result = countries_selected.pivot(index="Date", columns="Country")
print(result)
Output from result
% vs 2019 (Daily)
Country Denmark Finland Norway Sweden
Date
2021-01-02 -65.261383 -75.416667 -39.164087 -65.853659
2021-01-03 -60.405405 -77.408056 -31.763620 -66.385669
2021-01-04 -69.371429 -75.598086 -34.002770 -70.704467
2021-01-05 -73.690932 -79.251701 -33.815689 -73.450509
2021-01-06 -76.257310 -80.445151 -43.454791 -80.805484
...
...
2021-01-30 -83.931624 -75.545852 -63.751763 -76.260163
2021-01-31 -80.654339 -74.468085 -55.565777 -65.451895
2021-02-01 -81.494253 -72.419106 -49.610390 -75.473322
2021-02-02 -81.741233 -73.898305 -46.164021 -78.215223
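Note that pivot keeps the value-column name ('% vs 2019 (Daily)') as an extra top level in the columns, as the output above shows. Dropping that level leaves plain country-name columns, which matches the layout asked for and exports cleanly with to_excel(). A small self-contained sketch with made-up percentages standing in for the scraped data:

```python
import pandas as pd

# Tiny stand-in for the scraped frame: one row per (country, date).
df = pd.DataFrame({
    "Country": ["Denmark", "Finland", "Denmark", "Finland"],
    "Date": pd.to_datetime(["2021-01-02", "2021-01-02",
                            "2021-01-03", "2021-01-03"]),
    "% vs 2019 (Daily)": [-65.3, -75.4, -60.4, -77.4],
})

result = df.pivot(index="Date", columns="Country")

# Drop the '% vs 2019 (Daily)' level so columns are just country names.
result.columns = result.columns.droplevel(0)
print(result)
```

After droplevel, result has the Date index and one flat column per country.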
I'm using yfinance to download the price history for multiple symbols, which returns a dataframe with multiple indexes. For example:
import yfinance as yf
df = yf.download(tickers = ['AAPL', 'MSFT'], period = '2d')
A similar dataframe could be constructed without yfinance like:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
import numpy as np
attributes = ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume']
symbols = ['AAPL', 'MSFT']
dates = ['2020-07-23', '2020-07-24']
data = [[[371.38, 202.54], [371.38, 202.54], [388.31, 210.92], [368.04, 202.15], [387.99, 207.19], [49251100, 67457000]],
[[370.46, 201.30], [370.46, 201.30], [371.88, 202.86], [356.58, 197.51 ], [363.95, 200.42], [46323800, 39799500]]]
data = np.array(data).reshape(len(dates), len(symbols) * len(attributes))
cols = pd.MultiIndex.from_product([attributes, symbols])
df = pd.DataFrame(data, index=dates, columns=cols)
df
Output:
Adj Close Close High Low Open Volume
AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT
2020-07-23 371.38 202.54 371.38 202.54 388.31 210.92 368.04 202.15 387.99 207.19 49251100.0 67457000.0
2020-07-24 370.46 201.30 370.46 201.30 371.88 202.86 356.58 197.51 363.95 200.42 46323800.0 39799500.0
Once I have this dataframe, I want to restructure it so that I have a row for each symbol and date. I'm currently doing this by looping through a list of symbols and calling the API once each time, and appending the results. I'm sure there must be a more efficient way:
df = pd.DataFrame()
symbols = ['AAPL', 'MSFT']
for symbol in symbols:
    result = yf.download(tickers=symbol, start='2020-07-23', end='2020-07-25')
    result.insert(0, 'symbol', symbol)
    df = pd.concat([df, result])
Example of the desired output:
df
symbol Open High Low Close Adj Close Volume
Date
2020-07-23 AAPL 387.989990 388.309998 368.040009 371.380005 371.380005 49251100
2020-07-24 AAPL 363.950012 371.880005 356.579987 370.459991 370.459991 46323800
2020-07-23 MSFT 207.190002 210.919998 202.149994 202.539993 202.539993 67457000
2020-07-24 MSFT 200.419998 202.860001 197.509995 201.300003 201.300003 39799500
This looks like a simple stacking operation. Let's go with:
df = yf.download(tickers = ['AAPL', 'MSFT'], period = '2d') # Get your data
df.stack(level=1).rename_axis(['Date', 'symbol']).reset_index(level=1)
Output:
symbol Adj Close ... Open Volume
Date ...
2020-07-23 AAPL 371.380005 ... 387.989990 49251100
2020-07-23 MSFT 202.539993 ... 207.190002 67457000
2020-07-24 AAPL 370.459991 ... 363.950012 46323800
2020-07-24 MSFT 201.300003 ... 200.419998 39799500
[4 rows x 7 columns]
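The stack can be verified without hitting the yfinance API by rebuilding a frame shaped like the question's synthetic example (the numbers here are just a running counter, not real prices):

```python
import numpy as np
import pandas as pd

# Rebuild the question's multi-level-column layout with dummy values.
attributes = ["Adj Close", "Close", "High", "Low", "Open", "Volume"]
symbols = ["AAPL", "MSFT"]
dates = pd.to_datetime(["2020-07-23", "2020-07-24"])
data = np.arange(len(dates) * len(attributes) * len(symbols)).reshape(len(dates), -1)
df = pd.DataFrame(data, index=dates,
                  columns=pd.MultiIndex.from_product([attributes, symbols]))

# Move the symbol level from the columns into the rows.
tidy = df.stack(level=1).rename_axis(["Date", "symbol"]).reset_index(level=1)
print(tidy)
```

The result has one row per (date, symbol) pair and one column per attribute plus the symbol column, matching the desired output shape.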
Plot three lines: one line per symbol, per date.
import pandas as pd
import matplotlib.pyplot as plt
Given a dataframe like:
symbol price interest
Date
2016-04-22 AAPL 445.50 0.00
2016-04-22 GOOG 367.02 21.52
2016-04-22 MSFT 248.94 3.44
2016-04-15 AAPL 425.51 0.00
2016-04-15 GOOG 338.57 13.06
2016-04-15 MSFT 226.66 1.15
Currently I split the dataframe into three different frames:
df1 = df[df.symbol == 'AAPL']
df2 = df[df.symbol == 'GOOG']
df3 = df[df.symbol == 'MSFT']
Then I plot them:
plt.plot(df1.index, df1.price.values,
df2.index, df2.price.values,
df3.index, df3.price.values)
Is it possible to plot these three symbols prices straight from the dataframe?
try this:
ax = df[df.symbol=='AAPL'].plot()
df[df.symbol=='GOOG'].plot(ax=ax)
df[df.symbol=='MSFT'].plot(ax=ax)
plt.show()
import numpy as np
import pandas as pd

# Create sample data.
np.random.seed(0)
df = (pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'),
                   index=pd.date_range('2016-1-1', periods=100))
      .cumsum()
      .reset_index()
      .rename(columns={'index': 'date'}))
df = pd.melt(df, id_vars='date', value_vars=['A', 'B', 'C'],
             value_name='price', var_name='symbol')
df['interest'] = 100
>>> df.head()
date symbol price interest
0 2016-01-01 A 1.764052 100
1 2016-01-02 A 4.004946 100
2 2016-01-03 A 4.955034 100
3 2016-01-04 A 5.365632 100
4 2016-01-05 A 6.126670 100
# Generate plot.
plot_df = (df.loc[df.symbol.isin(['A', 'B', 'C']), ['date', 'symbol', 'price']]
.set_index(['symbol', 'date'])
.unstack('symbol'))
plot_df.columns = plot_df.columns.droplevel()
>>> plot_df.plot()
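The reshaping step above can be checked without rendering anything: once there is one column per symbol, DataFrame.plot draws one line per column. A compact variant of the same unstack, selecting the price Series first so no droplevel is needed (10 rows instead of 100, purely for brevity):

```python
import numpy as np
import pandas as pd

# Same sample-data recipe as above, shortened to 10 days.
np.random.seed(0)
df = (pd.DataFrame(np.random.randn(10, 3), columns=list("ABC"),
                   index=pd.date_range("2016-1-1", periods=10))
      .cumsum().reset_index().rename(columns={"index": "date"}))
df = pd.melt(df, id_vars="date", value_vars=["A", "B", "C"],
             value_name="price", var_name="symbol")

# Selecting the price Series before unstacking yields flat columns,
# one per symbol, ready for .plot().
plot_df = df.set_index(["symbol", "date"])["price"].unstack("symbol")
print(plot_df.shape)
```

plot_df comes out with the dates as its index and columns A, B, C, so plot_df.plot() produces exactly one line per symbol.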