Scraping a table with Python BeautifulSoup - python

I tried to scrape table from this website: https://stockrow.com/VRTX/financials/income/quarterly
I am using Python in Google Colab and I'd like to have the dates (e.g. 2020-06-30) as columns. I used code like this:
import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://stockrow.com/VRTX/financials/income/quarterly').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.find_all('table')
However, I cannot get the tables. I am a bit new to scraping, so I looked at other Stack Overflow pages but couldn't solve the problem. Can you please help me? That would be much appreciated.

You can use their API to load the data:
import requests
import pandas as pd
indicators_url = 'https://stockrow.com/api/indicators.json'
data_url = 'https://stockrow.com/api/companies/VRTX/financials.json?ticker=VRTX&dimension=Q&section=Income+Statement'
indicators = {i['id']: i for i in requests.get(indicators_url).json()}
all_data = []
for d in requests.get(data_url).json():
    d['id'] = indicators[d['id']]['name']
    all_data.append(d)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Prints:
id 2020-06-30 2020-03-31 2019-12-31 2019-09-30 2019-06-30 ... 2011-12-31 2011-09-30 2011-06-30 2011-03-31 2010-12-31 2010-09-30
0 Consolidated Net Income/Loss 837270000.0 602753000.0 583234100.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
1 EPS (Basic, from Continuous Ops) 3.2248 2.3199 2.2654 0.2239 1.044 ... 0.9374 1.109 -0.9751 -0.8703 -0.8966 -1.0402
2 Net Profit Margin 0.5492 0.3978 0.4127 0.0606 0.2841 ... 0.2816 0.3354 -1.5213 -2.3906 -2.7531 -8.7816
3 Gross Profit 1339965000.0 1352610000.0 1228253000.0 817914000.0 805553000.0 ... 533213000.0 620794000.0 105118000.0 70996000.0 62475000.0 20567000.0
4 Income Tax Provision -12500000.0 54781000.0 93716000.0 13148000.0 59711000.0 ... 22660000.0 -27842000.0 24448000.0 0.0 NaN 0.0
5 Operating Income 718033000.0 720224100.0 551464400.0 99333000.0 269960000.0 ... 223901900.0 215707000.0 -165890000.0 -159899000.0 -166634000.0 -199588000.0
6 EBIT 718033000.0 720224100.0 551464700.0 99333000.0 269960000.0 ... 223901900.0 215707000.0 -165890000.0 -159899000.0 -166634000.0 -199588000.0
7 EPS (Diluted, from Cont. Ops) 3.1787 2.2874 2.2319 0.2208 1.0293 ... 1.0011 1.0415 -0.9751 -0.8703 -0.8966 -1.0402
8 EBITDA 744730000.0 747045000.0 577720400.0 125180000.0 297658000.0 ... 233625900.0 223457000.0 -157181000.0 -151041000.0 -158429000.0 -192830000.0
9 EPS (Basic, Consolidated) 3.2248 2.3199 2.2654 0.2239 1.044 ... 0.9374 1.109 -0.9751 -0.8703 -0.8966 -1.0402
10 EBT 824770000.0 657534000.0 676950000.0 70666000.0 327138000.0 ... 210801000.0 200610000.0 -174870000.0 -176096000.0 -180392000.0 -208957000.0
11 Operating Cash Flow Margin 0.6812 0.5384 0.3156 0.3525 0.4927 ... 0.8941 0.0651 -1.8894 -2.5336 -2.535 -6.8918
12 EBT margin 0.541 0.434 0.479 0.0744 0.3475 ... 0.3742 0.3043 -1.5283 -2.3906 -2.7531 -8.7816
13 EBIT Margin 0.471 0.4754 0.3902 0.1046 0.2868 ... 0.3975 0.3272 -1.4498 -2.1707 -2.5431 -8.3878
14 Income from Continuous Operations 837270000.0 602753000.0 583234000.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
15 R&D Expenses 420928000.0 448528000.0 480011000.0 555948000.0 379091000.0 ... 186438000.0 189052000.0 173604000.0 158612000.0 168888000.0 170434000.0
16 Non-operating Interest Expenses 13871000.0 14136000.0 14249000.0 14548000.0 14837000.0 ... 11659000.0 7059000.0 6962000.0 12001000.0 7686000.0 3951000.0
17 EBITDA Margin 0.4885 0.4931 0.4088 0.1318 0.3162 ... 0.4147 0.339 -1.3737 -2.0505 -2.4179 -8.1038
18 Non-operating Income/Expense 106737000.0 -62690000.0 125485000.0 -28667000.0 57178000.0 ... -13101000.0 -15097000.0 -8980000.0 -16197000.0 -13758000.0 -9369000.0
19 EPS (Basic) 3.22 2.32 2.26 0.22 1.04 ... 0.76 1.06 -0.85 -0.87 -0.9 -1.04
20 Gross Margin 0.879 0.8927 0.8691 0.8611 0.8558 ... 0.9465 0.9417 0.9187 0.9638 0.9535 0.8643
21 Revenue 1524485000.0 1515107000.0 1413265000.0 949828000.0 941293000.0 ... 563340000.0 659200000.0 114424000.0 73662000.0 65524000.0 23795000.0
22 Shares (Diluted, Average) 263403000.0 263515000.0 262108000.0 260473000.0 259822000.0 ... 217602000.0 219349000.0 204413000.0 202329000.0 201355000.0 200887000.0
23 Cost of Revenue 184520000.0 162497000.0 185012000.0 131914000.0 135740000.0 ... 30127000.0 38406000.0 9306000.0 2666000.0 3049000.0 3228000.0
24 SG&A Expenses 191804000.0 182258000.0 195277000.0 159674000.0 156502000.0 ... 121881000.0 110654000.0 96663000.0 71523000.0 62478000.0 48855000.0
25 EPS (Diluted, Consolidated) 3.1787 2.2874 2.2319 0.2208 1.0293 ... 1.0011 1.0415 -0.9751 -0.8703 -0.8966 -1.0402
26 Revenue Growth 0.6196 0.765 0.6242 0.2107 0.2515 ... 7.5975 26.7033 2.6185 2.2842 0.9335 -0.0466
27 Shares (Basic, Weighted) 259637000.0 259815000.0 256728000.0 256946000.0 256154000.0 ... 204891000.0 206002000.0 204413000.0 202329000.0 200402000.0 200887000.0
28 Income after Tax 837270000.0 602753000.0 583234000.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
29 EPS (Diluted) 3.18 2.29 2.23 0.22 1.03 ... 0.74 1.02 -0.85 -0.87 -0.9 -1.04
30 Net Income Common 837270000.0 602753000.0 583234100.0 57518000.0 267427000.0 ... 158629000.0 221110000.0 -174069000.0 -176096000.0 -180392000.0 -208957000.0
31 Shares (Diluted, Weighted) 263403000.0 263515000.0 260673000.0 260473000.0 259822000.0 ... 208807000.0 219349000.0 204413000.0 202329000.0 200402000.0 200887000.0
32 Non-Controlling Interest NaN NaN NaN NaN NaN ... 29512000.0 7342000.0 -25249000.0 0.0 NaN 0.0
33 Dividends (Preferred) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
34 EPS (Basic, from Discontinued Ops) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
35 EPS (Diluted, from Disc. Ops) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
36 Income from Discontinued Operations NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
[37 rows x 41 columns]
And it saves data.csv.
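If you want to work with the saved file later, here is a minimal sketch for reading it back, assuming the CSV was written by the code above (the first, unnamed column is the integer index pandas saved and 'id' holds the indicator names):
import pandas as pd

# Load the CSV and use the indicator names as the row index,
# so each row is an indicator and each column a quarter-end date.
df_loaded = pd.read_csv('data.csv', index_col=0).set_index('id')
print(df_loaded.loc['Revenue'])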
Or download their XLSX from that page:
url = 'https://stockrow.com/api/companies/VRTX/financials.xlsx?dimension=Q&section=Income%20Statement&sort=desc'
df = pd.read_excel(url)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df)

The first problem is that the table is loaded via JavaScript, so BeautifulSoup does not find it: it isn't in the HTML yet at the moment of parsing. To solve this you'll need to use Selenium.
The second problem is that there is no table tag in the HTML; the page uses grid formatting instead.
Since you're using Google Colab, you'll need to install the Selenium web driver there (code taken from this answer):
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
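Note that webdriver.Chrome('chromedriver', chrome_options=...) is the older Selenium 3 call style; the positional driver path and the chrome_options keyword are deprecated in Selenium 4 and removed in later releases. A rough equivalent under Selenium 4, assuming the same chromedriver location as above, would be:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Point the Service at the chromedriver binary copied above.
wd = webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=options)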
After that you can load the page and parse it:
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
# load page via selenium
wd.get("https://stockrow.com/VRTX/financials/income/quarterly")
# wait 5 seconds until element with class mainGrid will be loaded
grid = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'mainGrid')))
# parse content of the grid
soup = BeautifulSoup(grid.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', {'class': 'financials-value'}):
    print(tag)
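If you just need the cell values rather than the tags, a minimal follow-up sketch (the financials-value class comes from the snippet above; the exact grid structure may differ, so treat this as a starting point):
# Collect the text of every value cell into a flat list.
values = [tag.get_text(strip=True)
          for tag in soup.find_all('div', {'class': 'financials-value'})]
print(values[:10])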

Related

How to find the attribute and element id by selenium.webdriver?

I am learning web scraping since I need it for my work. I wrote the following code:
from selenium import webdriver
import pandas as pd
chromedriver='/home/es/drivers/chromedriver'
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(30)
driver.get('http://crdd.osdd.net/raghava/hemolytik/submitkey_browse.php?ran=1955')
df = pd.read_html(driver.find_element_by_id("table.example.display.datatable").get_attribute('example'))[0]
However, it is showing the following error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="table.example.display.datatable"]"}
(Session info: chrome=103.0.5060.134)
Then I inspected the table that I want to scrape on this page.
What attribute needs to be passed to get_attribute() in the following line?
df = pd.read_html(driver.find_element_by_id("table.example.display.datatable").get_attribute('example'))[0]
And what should I write in driver.find_element_by_id?
EDITED:
Some tables have lots of records spread across multiple pages.
For example, this page has 2,246 entries and shows 100 entries per page. When I tried to scrape it, there were only 320 entries in df, with record IDs from 1232-1713, which means it picked up entries from a few middle pages rather than going from the first page through to the last.
What can we do in such cases?
You need to get the outerHTML property of the table first, then pass that HTML to pandas.read_html().
You also need to wait for the element to be visible. Use an explicit wait like WebDriverWait().
driver.get('http://crdd.osdd.net/raghava/hemolytik/submitkey_browse.php?ran=1955')
table=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table#example")))
tableRows=table.get_attribute("outerHTML")
df = pd.read_html(tableRows)[0]
print(df)
Import the libraries below.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import pandas as pd
Output:
ID PMID YEAR ... DSSP Natural Structure Final Structure
0 1643 16137634 2005 ... CCCCCCCCCCCSCCCC NaN NaN
1 1644 16137634 2005 ... CCTTSCCSSCCCC NaN NaN
2 1645 16137634 2005 ... CTTTCGGGHHHHHHHHCC NaN NaN
3 1646 16137634 2005 ... CGGGTTTHHHHHHHGGGC NaN NaN
4 1647 16137634 2005 ... CCSCCCSSCHHHHHHHHHTTC NaN NaN
5 1910 16730859 2006 ... CCCCCCCSSCCSHHHHHHHHTTHHHHHHHHSSCCC NaN NaN
6 1911 16730859 2006 ... CCSCC NaN NaN
7 1912 16730859 2006 ... CCSSSCSCC NaN NaN
8 1913 16730859 2006 ... CCCSSCCSSCCSHHHHHTTHHHHTTTCSCC NaN NaN
9 1914 16730859 2006 ... CCSHHHHHHHHHHHHHCCCC NaN NaN
10 2110 11226440 2001 ... CCCSSCCCBTTBTSSSSSSCSCC NaN NaN
11 3799 9204560 1997 ... CCSSCC NaN NaN
12 4149 16137634 2005 ... CCHHHHHHHHHHHC NaN NaN
[13 rows x 17 columns]
If you want to select the table by #id you need
driver.find_element_by_id("example")
By CSS selector:
driver.find_element_by_css_selector("table#example")
By XPath:
driver.find_element_by_xpath("//table[@id='example']")
If you want to extract the #id value you need
.get_attribute('id')
Since there is not much sense in searching by #id just to extract that same #id, you might use another attribute of the table node:
driver.find_element_by_xpath("//table[@aria-describedby='example_info']").get_attribute('id')
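For the multi-page case from your edit, one approach is to read the table page by page and concatenate the results. This is only a sketch: it reuses the imports listed above and assumes the standard DataTables pagination controls, i.e. a "Next" button with id example_next that gets the disabled class on the last page.
import pandas as pd
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

dfs = []
while True:
    # Re-read the table on the current page.
    table = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "table#example")))
    dfs.append(pd.read_html(table.get_attribute("outerHTML"))[0])

    # Assumed DataTables "Next" control; adjust the id/class to what you see in the page.
    next_btn = driver.find_element_by_id("example_next")
    if "disabled" in next_btn.get_attribute("class"):
        break
    next_btn.click()

all_pages = pd.concat(dfs, ignore_index=True)
print(len(all_pages))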

Pandas dataframe merge row by addition

I want to create a dataframe from census data. I want to calculate the number of people that filed a tax return in each specific earnings group.
For now, I wrote this
import pandas as pd

census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']
I have 6 groups attached to each zip code. I want to get one row per zip code, with the number of returns for each earnings group appearing just once as a column. I already tried changing NaNs to 0 and using groupby('zipcode').sum(), but I get a sum of around 50 million for zip code 0, where it seems only around 800k should exist.
Here is the dataframe that I currently get:
zipcode agi_stub N02650 A02650 ELDERLY A07180 Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more Amount_1_25000 Amount_25000_50000 Amount_50000_75000 Amount_75000_100000 Amount_100000_200000 Amount_200000_more
0 0 1 778140.0 10311099.0 144610.0 2076.0 778140.0 NaN NaN NaN NaN NaN 10311099.0 NaN NaN NaN NaN NaN
1 0 2 525940.0 19145621.0 113810.0 17784.0 NaN 525940.0 NaN NaN NaN NaN NaN 19145621.0 NaN NaN NaN NaN
2 0 3 285700.0 17690402.0 82410.0 9521.0 NaN NaN 285700.0 NaN NaN NaN NaN NaN 17690402.0 NaN NaN NaN
3 0 4 179070.0 15670456.0 57970.0 8072.0 NaN NaN NaN 179070.0 NaN NaN NaN NaN NaN 15670456.0 NaN NaN
4 0 5 257010.0 35286228.0 85030.0 14872.0 NaN NaN NaN NaN 257010.0 NaN NaN NaN NaN NaN 35286228.0 NaN
And here is what I want to get:
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 850.0
Here is one way to do it, using groupby and summing the desired columns:
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 0.0
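If you also want the Amount_* columns that appear in your current frame, the same pattern applies; this sketch assumes those columns were built from A02650 the same way the Number_of_returns_* columns were built from N02650:
amounts = ['Amount_1_25000', 'Amount_25000_50000', 'Amount_50000_75000',
           'Amount_75000_100000', 'Amount_100000_200000', 'Amount_200000_more']

# Sum both column groups per zipcode in one pass.
df.groupby('zipcode', as_index=False)[num_of_returns + amounts].sum()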
This question needs more information to give a proper answer. For example, you leave out what certain columns in your data frame mean:
- `N1: Number of returns`
- `agi_stub: Size of adjusted gross income`
According to IRS this has the following levels.
Size of adjusted gross income "0 = No AGI Stub
1 = ‘Under $1’
2 = '$1 under $10,000'
3 = '$10,000 under $25,000'
4 = '$25,000 under $50,000'
5 = '$50,000 under $75,000'
6 = '$75,000 under $100,000'
7 = '$100,000 under $200,000'
8 = ‘$200,000 under $500,000’
9 = ‘$500,000 under $1,000,000’
10 = ‘$1,000,000 or more’"
I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc
With this information, I think what you want to find is the number of
people who filed a tax return for each of the income levels of agi_stub.
If that is what you mean, then this can be achieved by:
import pandas as pd
data = pd.read_csv("./data/19zpallagi.csv")
## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
## solution to your problem?
df = data.pivot_table(
    index='zipcode',
    values='N1',
    columns='agi_stub',
    aggfunc=['sum']
)
## bit of cleaning up.
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]
Here's the output.
In [77]: df
Out[77]:
agi_stub_level_1 agi_stub_level_2 ... agi_stub_level_5 agi_stub_level_6
zipcode ...
0 50061850.0 37566510.0 ... 21938920.0 8859370.0
1001 2550.0 2230.0 ... 1420.0 230.0
1002 2850.0 1830.0 ... 1840.0 990.0
1005 650.0 570.0 ... 450.0 60.0
1007 1980.0 1530.0 ... 1830.0 460.0
... ... ... ... ... ...
99827 470.0 360.0 ... 170.0 40.0
99833 550.0 380.0 ... 290.0 80.0
99835 1250.0 1130.0 ... 730.0 190.0
99901 1960.0 1520.0 ... 1030.0 290.0
99999 868450.0 644160.0 ... 319880.0 142960.0
[27595 rows x 6 columns]

How to calculate slope of moving average accurately for a series of points?

What I am trying to do is calculate a simple moving average for a specified period of time for stock prices. I referred to a lot of online resources and all of them recommend using the rolling_mean function to calculate a moving average.
I did the above like this:
import math

import requests
import pandas as pd

def getEODData(symbol):
    api_result = requests.get('http://api.marketstack.com/v1/eod?access_key='+apikey+'&symbols='+symbol+'&limit=2500')
    api_response = api_result.json()
    df = pd.DataFrame.from_dict(api_response['data'])
    df = df.iloc[::-1]
    timeshort = 66
    if not df.empty:
        df['SMA'] = df.iloc[:, 3].rolling(window=timeshort).mean()
        slope_short = (df['SMA'][0] - df['SMA'][timeshort]) / timeshort
        slope_short_deg = math.atan(slope_short) * 180 / math.pi
        print(slope_short_deg)
I did df.iloc[::-1] because the first 66 periods are NaN in the rolling-mean calculation, so I flipped the data frame to get the moving average values for the latest dates.
This is how it looks after flipping:
open high low close volume adj_high adj_low ... adj_volume split_factor symbol exchange date SMA SMA_long
1791 568.0000 568.0000 552.9200 558.4600 13100.0 568.00 552.92 ... 13100.0 2.0 GOOG XNAS 2014-03-27T00:00:00+0000 NaN NaN
1790 561.2000 566.4300 558.6700 559.9900 41100.0 566.43 558.67 ... 41100.0 1.0 GOOG XNAS 2014-03-28T00:00:00+0000 NaN NaN
1789 566.8900 567.0000 556.9300 556.9700 10800.0 567.00 556.93 ... 10800.0 1.0 GOOG XNAS 2014-03-31T00:00:00+0000 NaN NaN
1788 558.7100 568.4500 558.7100 567.1600 7900.0 568.45 558.71 ... 7900.0 1.0 GOOG XNAS 2014-04-01T00:00:00+0000 NaN NaN
1787 565.1060 604.8300 562.1900 567.0000 146700.0 604.83 562.19 ... 146700.0 1.0 GOOG XNAS 2014-04-02T00:00:00+0000 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4 2402.7200 2419.7000 2384.5000 2395.1699 1648353.0 NaN NaN ... NaN 1.0 GOOG XNAS 2021-05-03T00:00:00+0000 2134.197117 1724.360315
3 2369.7400 2379.2600 2311.7000 2354.2500 1686545.0 NaN NaN ... NaN 1.0 GOOG XNAS 2021-05-04T00:00:00+0000 2141.638632 1728.445849
2 2368.4199 2382.2000 2351.8850 2356.7400 1090275.0 NaN NaN ... NaN 1.0 GOOG XNAS 2021-05-05T00:00:00+0000 2149.532571 1732.516758
1 2350.6399 2382.7100 2342.3381 2381.3501 978908.0 NaN NaN ... NaN 1.0 GOOG XNAS 2021-05-06T00:00:00+0000 2156.805300 1736.588853
0 2400.0000 2416.4099 2390.0000 2398.6899 1163600.0 NaN NaN ... NaN 1.0 GOOG XNAS 2021-05-07T00:00:00+0000 2163.944389 1740.744544
Now I tried to run it for the Google stock and it gave an output of 80.47 degrees. Then I went to a site called TradingView to verify my result (settings for that chart: time period of 1 day and a moving average period of 66).
I drew a red line for the slope over 66 bars there, and it is nowhere close to 80 degrees.
Then I thought of using np.polyfit() to find the slope like this:
import numpy as np

y = np.array(df['SMA'][-(timeshort):])
x = range(0, len(y))
sl, b = np.polyfit(x, y, 1)
sl = math.atan(sl) * 180 / math.pi
But this also gave an output of 79 deg.
What am I doing wrong? How can I get a slope like the ones on those websites?
Any help would be greatly appreciated.
Sorry for the delay; let's look at the code snippet below:
>>> import math
>>> slope_short_deg = math.atan(1) * 180 / math.pi
>>> slope_short_deg
45.0
>>> slope_short_deg = math.atan(2) * 180 / math.pi
>>> slope_short_deg
63.43494882292201
>>> slope_short_deg = math.atan(3) * 180 / math.pi
>>> slope_short_deg
71.56505117707799
As you can see, when the input argument of math.atan() is 1 the result is 45 degrees, and the angle grows as the input grows; in effect atan() treats the x component as 1.
You can use math.atan2(y, x) and pass the x component explicitly.
You may say the x component is the date axis of the candles; that is fine, but its scale is relative: you can change it by zooming, and the angle of the line on the chart changes with it.
So you can choose an x scale and form a condition relative to it according to your strategy.
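To make the point concrete, here is a sketch of the idea applied to the question's frame. It assumes the flipped frame from the question (latest dates in the last rows, 'SMA' already computed, timeshort = 66); the percent-change normalisation is a choice, not a standard, so adjust it to your own strategy.
import math

dy = df['SMA'].iloc[-1] - df['SMA'].iloc[-timeshort]   # price change over the window
dx = timeshort                                          # number of bars

# Raw slope in price units per bar -- this is what gives ~80 degrees for a ~2400$ stock.
slope_raw_deg = math.degrees(math.atan2(dy, dx))

# Normalised slope in percent change per bar, closer to what a chart at a given zoom shows.
slope_pct_deg = math.degrees(math.atan2(dy / df['SMA'].iloc[-1] * 100, dx))

print(slope_raw_deg, slope_pct_deg)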

Selenium code cannot catch the table from Chrome

I am using selenium to parse
https://www.worldometers.info/coronavirus/
and, doing the following, I get an attribute error and the table variable remains empty. What is the reason?
I use Chrome 80. Are the tags right?
AttributeError: 'NoneType' object has no attribute 'tbody'
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
browser.get("https://www.worldometers.info/coronavirus/")
html = bs4.BeautifulSoup(browser.page_source, "html.parser")
table = html.find("table", class_="table table-bordered table-hover main_table_countries dataTable no-footer")
Wherever a page has table tags, I find it easier to use pandas to capture the table.
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
table = pd.read_html(url)[0]
Output:
print(table)
Country,Other TotalCases ... Tot Cases/1M pop Tot Deaths/1M pop
0 China 81093 ... 56.00 2.0
1 Italy 63927 ... 1057.00 101.0
2 USA 43734 ... 132.00 2.0
3 Spain 35136 ... 751.00 49.0
4 Germany 29056 ... 347.00 1.0
.. ... ... ... ... ...
192 Somalia 1 ... 0.06 NaN
193 Syria 1 ... 0.06 NaN
194 Timor-Leste 1 ... 0.80 NaN
195 Turks and Caicos 1 ... 26.00 NaN
196 Total: 378782 ... 48.60 2.1
[197 rows x 10 columns]
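If the table you want is not the first one on the page (or more tables are added later), read_html can also be narrowed down with its match parameter. A small sketch, assuming the page still serves the table to plain requests; 'TotalCases' is one of the column headers visible in the output above, so it should match only the countries table:
import pandas as pd

url = 'https://www.worldometers.info/coronavirus/'
# Only return tables whose text matches this string, instead of every table on the page.
tables = pd.read_html(url, match='TotalCases')
print(tables[0].head())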

How to handle Cells containing only NaN values in pandas?

I am setting up a stock price prediction data set. While applying the following code for the Ichimoku Cloud indicator:
from datetime import timedelta
high_9 = df['High'].rolling(window= 9).max()
low_9 = df['Low'].rolling(window= 9).min()
df['tenkan_sen'] = (high_9 + low_9) /2
high_26 = df['High'].rolling(window= 26).max()
low_26 = df['Low'].rolling(window= 26).min()
df['kijun_sen'] = (high_26 + low_26) /2
# this is to extend the 'df' in future for 26 days
# the 'df' here is numerical indexed df
# the problem is here
last_index = df.iloc[-1:].index[0]
last_date = df['Date'].iloc[-1].date()
for i in range(26):
    df.loc[last_index+1 +i, 'Date'] = last_date + timedelta(days=i)
df['senkou_span_a'] = ((df['tenkan_sen'] + df['kijun_sen']) / 2).shift(26)
high_52 = df['High'].rolling(window= 52).max()
low_52 = df['Low'].rolling(window= 52).min()
df['senkou_span_b'] = ((high_52 + low_52) /2).shift(26)
# most charting softwares dont plot this line
df['chikou_span'] = df['Close'].shift(-26)
The above code works, but the problem is that when extending the data frame by the next 26 time steps (rows) for the 'senkou span a' and 'senkou span b' columns, it turns the other columns' values in those rows to NaN.
So I need help getting the predicted 'Senkou span a' and 'Senkou span b' rows into my data set without turning the other rows' values to NaN.
The current output is:
Date Open High Low Close Senoku span a Senoku span b
2019-03-16 50 51 52 53 56.0 55.82
2019-03-17 NaN NaN NaN NaN 55.0 56.42
2019-03-18 NaN NaN NaN NaN 54.0 57.72
2019-03-19 NaN NaN NaN NaN 53.0 58.12
2019-03-20 NaN NaN NaN NaN 52.0 59.52
The expected output is:
Date Open High Low Close Senoku span a Senoku span b
2019-03-16 50 51 52 53 56.0 55.82
2019-03-17 55.0 56.42
2019-03-18 54.0 57.72
2019-03-19 53.0 58.12
2019-03-20 52.0 59.52
