Text Extraction from Images - python

I extracted text from an image, but the result is unstructured. I need to convert it into a structured form and can't work out how to do so.
The unstructured text extracted from the image in Python:
EQUITY-LARGE CAP ©# SBIMUTUAL FUND
A’ A PARTNER FOR LIFE
LSS LAST DIVIDENDS Ct EV a A)
i Option NAV #) Record Date Dividend (in /Unit) NAV (#)
BLUE CH | Pp FU N D Reg-Plan-Growth 34.9294 23-Sep-16 (Reg Plan) 1.00 18.5964
—————— a 23-Sep-16 (Dir Plan) 1.20 21.8569
= Reg-Plan-Dividend 19.8776 9 =
An Open-ended Growth Scheme = -Reg-Plan-Dividend 188776 TT a5 Reg Plan) 2.50 17.6880
Dir-Plan-Dividend 23.5613 17-Jul-15 (Dir Plan) 2.90 20.5395
. . ir a 21- Mar-14 (Reg Plan) 1.80 12.7618
Investment Objective Dir-Plan-Growth 36.2961
a. . a. Pursuant to payment of dividend, the NAV of Dividend Option of
To provide investors with opportunities scheme/plans would fall to the extent of payout and statutory levy, if
for long-term growth in capital through applicable.
anactive management of investments ina
diversified basket of equity stocks of
companies whose market capitalization
is at least equal to or more than the least PORTFOLIO
market capitalized stock of S&P BSE 100
face Stock Name (%) Of Total AUM Stock Name (%) Of Total AUM
. HDFC Bank Ltd. 8.29 Apollo Hospitals Enterprises Ltd. 1.04
Fund Details Larsen & Toubro Ltd. 4.46 Tata Motors Ltd. (Dvr-A-Ordy) 0.85
ITC Ltd. 4.07 Eicher Motors Ltd. 0.84
+ Type of Scheme UPL Ltd. 2.95 Shriram City Union Finance Ltd. 0.79
An Open - Ended Growth Scheme Infosys Ltd. 2.93 Divi's Laboratories Ltd. 0.73
Mahindra & Mahindra Ltd. 2.92 Pidilite Industries Ltd. 0.62
+ Date of Allotment: 14/02/2006 Nestle India Ltd. 2.90 Fag Bearings India Ltd. 0.62
. . Reliance Industries Ltd. 2.86 Sadbhav Engineering Ltd. 0.61
Reno AS ono /OG/2007 Indusind Bank Ltd. 2.68 Grasim Industries Ltd. 0.60
+ AAUM for the Month of June 2017 State Bank Of India 2.63 Petronet LNG Ltd. 0.60
214,204.29¢ Kotak Mahindra Bank Ltd. 2.57 Hudco Ltd. 0.58
, rores HCL Technologies Ltd. 2.50 Torrent Pharmaceuticals Ltd. 0.55
+» AUMas on June 30, 2017 Bharat Electronics Ltd. 2.48 Thermax Ltd. 0.52
% 14,292.59 Crores Cholamandalam Investment And Dr. Lal Path Labs Ltd. 0.49
: — - Finance Company Ltd. 2.36 Coal India Ltd. 0.44
+ Fund Manager: Ms. Sohini Andani Hero Motocorp Ltd. 2.16 Narayana Hrudayalaya Ltd. 0.41
Managing Since: Sep-2010 Hindustan Petroleum Corporation Ltd. 2.11 Britannia Industries Ltd. 0.40
i . Motherson Sumi Systems Ltd. 1.98 Tata Steel Ltd. 0.38
Total Experience: Over 22 years Maruti Suzuki India Ltd. 1.90 Procter & Gamble Hygiene And
+ Benchmark: S&P BSE 100 Index ICICI Bank Ltd. 1.88 Health Care Ltd. 0.38
— Sun Pharmaceuticals Industries Ltd. 1.66 SKF India Ltd. 0.35
+ Exit Load: HDFC Ltd. 1.66 ff Tata Motors Ltd. 0.26
For exit within 1 year from the date of Strides Shasun Ltd. 1.59 Equity Shares Total 90.22
allotment - 1%; For exit after 1 year Titan Company Ltd. 1.58 Motilal Oswal Securities Ltd
fi he d f n il Hindalco Industries Ltd. 1.57 CP Mat 28.07.2017. 0.42
rom the date of allotment - Ni Ultratech Cement Ltd. 1.52 [| Commercial Paper Total 0.42
+ Entry Load: N.A. Voltas Ltd. 1.48 HDFC Bank Ltd. 0.14
- - Mahindra & Mahindra Financial Services Ltd. 1.42 Fixed Deposits Total 0.14
+ Plans Available: Regular, Direct The Ramco Cements Ltd. 1.41 CBLO 8.24
. a ao PI Industries Ltd. 1.40 Cash & Other Receivables (4.29)
Options: Growth, Dividend Aurobindo Pharma Ltd. 1.39 Futures 4.72
+ SIP Indian Oil Corporation Ltd. 1.36 HDFC Ltd. 0.56
Weekly - Minimum & 1000 & in multiples The Federal Bank Ltd. 1.22 Warrants Total 0.56
LIC Housing Finance Ltd. 1.18 Grand Total 100.00
of = 1 thereafter for a minimum of 6 Shriram Transport Finance Company Ltd. 1.10
instalments.
Monthly - Minimum = 1000 & in
Eee ee aC PORTFOLIO CLASSIFICATION BY PORTFOLIO CLASSIFICATION BY
See ee eae Oe INDUSTRY ALLOCATION (%) ASSET ALLOCATION (%)
multiples of = 1 thereafter for minimum
one year. Financial Services 29.34
Quarterly - Minimum % 1500 & in Automobile 10.90 s.o6 172
multiples of = 1 thereafter for minimum ronsumer Goods 03
nergy :
one WEEN Construction 6.54 18.66
+ Minimum Investment Pharma 5.93 *
= 5000 & in multiples of = 1 IT 5.43
resi Fertilisers & Pesticides 4.35
. Additional Investment Industrial Manufacturing 3.97
< HOO © tho coawlittas Gtr Cement & Cement Products 3.53
Metals 2.39 71.55
Quantitative Data Healthcare Services 1.93
Chemicals 0.62
Standard Deviation® 112.21% Cash & Other Recivables -4.29 L c = Mia
mLarge Cap jidcap
Beta* :0.86 Futures 4.72
ae cBLO 8.24
Sharpe Ratio’ 0.76 Fixed Deposits 0.14 m Cash & Other Current Assets Futures
Portfolio Turnover* 11.03
*Source: CRISIL Fund Analyser Riskometor SBI Blue Chip Fund
“Portfolio Turnover = lower of total sale or one] > This product is suitable for investors who are seeking:
total purchase for the last 12 months L\E * Long term capital appreciation,
Fe on C aL a GCM cL OT LT Ss BAA Z*3\ * Investment in equity shares of companies whose market capitalization is at least equal to or more
Risk Free rate: FBIL Overnight Mibor rate Inve EE sical than the least market capitalized stock of S&P BSE 100 index to provide long term capital growth
(6.25% as on 30th June 2017) Basis for will best Moderately Highrisk | OPPOrtunities.
Ratio Calculation: eavcarsiMonthiy{Data ‘Alnvestors should consult their financial advisers if in doubt about whether the product is suitable for them.
Please help me convert this unstructured data into structured data. Is there a library or function you would suggest?

You need to decide on a delimiter to split on, for example:
import re
text = inp_text.split(".\n")  # split wherever a full stop is followed by a newline
text = re.split(r'\s{4,}', inp_text)  # split on runs of at least 4 whitespace characters
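As a rough sketch of what could come after the split: if the goal is the portfolio table, a regular expression can pull (stock name, percentage) pairs out of each OCR line. The sample lines below are copied from the question; the pattern and the names HOLDING and parse_holdings are my own assumptions, and the pattern only covers names that end in "Ltd.", so rows such as "State Bank Of India" would need extra rules.

```python
import re

# Two lines copied from the OCR output in the question.
ocr_lines = [
    "ITC Ltd. 4.07 Eicher Motors Ltd. 0.84",
    "Infosys Ltd. 2.93 Divi's Laboratories Ltd. 0.73",
]

# Assumption: a holding is a name ending in "Ltd." followed by a
# percentage written with exactly two decimals.
HOLDING = re.compile(r"([A-Z][A-Za-z&' .-]*?Ltd\.)\s+(\d+\.\d{2})")


def parse_holdings(lines):
    """Return a list of (stock_name, percent_of_aum) tuples."""
    rows = []
    for line in lines:
        for name, pct in HOLDING.findall(line):
            rows.append((name.strip(), float(pct)))
    return rows


holdings = parse_holdings(ocr_lines)
for name, pct in holdings:
    print(f"{name:30s} {pct:5.2f}")
```

The resulting list of tuples can be passed straight to a DataFrame constructor or written out with the csv module.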

Related

Window function to find differences in String

I have a table like the one below in MS SQL Server, containing the first two columns. I would like to apply some kind of window function, partitioned by Fund_Name, that outputs the part of Security_Name that differs between rows, giving Desired_Output.
Is there a way to do this? I could try reading the data into Python instead, but I can't think of a solution in either.
The values in Desired_Output do not follow a consistent pattern.
Fund_Name | Security_Name | Desired_Output
Morgan Stanley Investment Sust Asn Eq Fd | Morgan Stanley Investment Funds - Sustainable Asian Equity Fund Z | Z
Morgan Stanley Investment Sust Asn Eq Fd | Morgan Stanley Investment Funds - Sustainable Asian Equity Fund B | B
Morgan Stanley Investment Sust Asn Eq Fd | Morgan Stanley Investment Funds - Sustainable Asian Equity Fund I | I
Morgan Stanley Investment Sust Asn Eq Fd | Morgan Stanley Investment Funds - Sustainable Asian Equity Fund A | A
MS INVF Emerging Markets Equity Fund | Morgan Stanley Investment Funds - Emerging Markets Equity Fund Z | Z
MS INVF Emerging Markets Equity Fund | Morgan Stanley Investment Funds - Emerging Markets Equity Fund I | I
MS INVF Latin American Equity Fund | Morgan Stanley Investment Funds - Latin American Equity Fund A | A
MS INVF Latin American Equity Fund | Morgan Stanley Investment Funds - Latin American Equity Fund I | I
MS INVF Latin American Equity Fund | Morgan Stanley Investment Funds - Latin American Equity Fund B | B
MS INVF Latin American Equity Fund | Morgan Stanley Investment Funds - Latin American Equity Fund C | C
MS INVF Latin American Equity Fund | Morgan Stanley Investment Funds - Latin American Equity Fund Z | Z
MS INVF US Growth Fund | Morgan Stanley Investment Funds - US Growth Fund A | A
MS INVF US Growth Fund | Morgan Stanley Investment Funds - US Growth Fund AH (EUR) | AH (EUR)
MS INVF US Growth Fund | Morgan Stanley Investment Funds - US Growth Fund NH (EUR) | NH (EUR)
MS INVF US Growth Fund | Morgan Stanley Investment Funds - US Growth Fund BH (EUR) | BH (EUR)
MS INVF US Growth Fund | Morgan Stanley Investment Funds - US Growth Fund Z | Z
Using the pattern in your sample, look at the last token of each security name: if it is a single character, take it; if the name ends with a closing bracket, take the last two tokens joined together; otherwise apply whatever additional logic you need, like so:
import pandas as pd

mydata = pd.read_csv("so.csv")
out = []
for security in mydata['Security_Name']:
    tokens = security.split()
    if len(tokens[-1]) == 1:
        # single-letter share class, e.g. "Z"
        out.append(tokens[-1])
    elif security.endswith(")"):
        # class followed by a currency in brackets, e.g. "AH (EUR)"
        out.append(" ".join(tokens[-2:]))
    else:
        out.append("add new logic here")
mydata["new_logic"] = out
print(mydata)
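A different take, closer to the "partitioned by fund" idea in the question, is to strip whatever prefix all security names within one fund have in common. This is only a sketch: the column name Share_Class and the helper strip_common_prefix are made up for illustration, and the sample rows are a subset of the table in the question.

```python
import os

import pandas as pd

df = pd.DataFrame({
    "Fund_Name": ["MS INVF US Growth Fund"] * 3,
    "Security_Name": [
        "Morgan Stanley Investment Funds - US Growth Fund A",
        "Morgan Stanley Investment Funds - US Growth Fund AH (EUR)",
        "Morgan Stanley Investment Funds - US Growth Fund Z",
    ],
})


def strip_common_prefix(names):
    """Remove the prefix shared by every name in one fund's group."""
    # commonprefix compares character by character, so it works on any strings.
    prefix = os.path.commonprefix(list(names))
    # Cut back to the last space so a word such as "AH" is never split.
    cut = prefix.rfind(" ") + 1
    return names.str[cut:]


df["Share_Class"] = (
    df.groupby("Fund_Name")["Security_Name"].transform(strip_common_prefix)
)
print(df[["Fund_Name", "Share_Class"]])
```

For a fund with a single row the whole name is the common prefix, so this falls back to the last word of the name, which may or may not be what you want.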

How to loop through CSV to read and extract ranking according to one column?

I feel that I am going to ask a very basic question, but please bear with me.
I have 3 CSV files. I want to find the best and worst of a specific column.
I managed to do it for one CSV:
import pandas as pd
import os
import numpy as np
path = r"MyFolder"
file1 = '59107_20210630m.csv'
file2 = '65758_20210630m.csv'
file3 = '26389_20210630m.csv'
d1 = os.path.join(path,file1)
d2 = os.path.join(path,file2)
d3 = os.path.join(path,file3)
df1 = pd.read_csv(d1,dtype={'Fund':str, 'Identifier':str, 'Product Description':str, 'L/S':str},thousands=',')
df2 = pd.read_csv(d2,dtype={'Fund':str, 'Identifier':str, 'Product Description':str, 'L/S':str},thousands=',')
df3= pd.read_csv(d3,dtype={'Fund':str, 'Identifier':str, 'Product Description':str, 'L/S':str},thousands=',')
Best_Contributors = df1.sort_values('Net MTD P&L (Base)', ascending=False).reset_index().head()[
    ['Fund', 'Identifier', 'Product Description', 'L/S',
     '% exp of NAV', 'Period % (Base)', 'Net MTD P&L (Base)', 'Contribution (bps)']]
Worst_Contributors = df1.sort_values('Net MTD P&L (Base)').reset_index().head()[
    ['Fund', 'Identifier', 'Product Description', 'L/S',
     '% exp of NAV', 'Period % (Base)', 'Net MTD P&L (Base)', 'Contribution (bps)']]
Fund = df1.iloc[0,0]
Fund
'59107'
Best_Contributors.style.set_caption(Fund+" Best_Contributors")
59107 Best_Contributors
Fund Identifier Product Description L/S % exp of NAV Period % (Base) Net MTD P&L (Base) Contribution (bps)
0 59107 4523 JP Equity EISAI CO LTD L 1.3 4.67 44 5.41060
1 59107 9517 JP Equity EREX CO LTD L 1.2 4.22 43 5.47042
2 59107 7203 JP Equity TOYOTA MOTOR CORP L 4.5 6.53 22 2.42082
3 59107 3382 JP Equity SEVEN & I HOLDINGS CO LTD L 2.3 1.68 18 2.45396
4 59107 6501 JP Equity HITACHI LTD L 1.9 1.01 17 2.51208
Worst_Contributors.style.set_caption(Fund+" Worst_Contributors")
59107 Worst_Contributors
Fund Identifier Product Description L/S % exp of NAV Period % (Base) Net MTD P&L (Base) Contribution (bps)
0 59107 6301 JP Equity KOMATSU LTD L 1.4 -1.40 -1680 -21.12414
1 59107 9984 JP Equity SOFTBANK GROUP CORP L 1.8 -5.94 -114 -14.41187
2 59107 3678 JP Equity MEDIA DO HOLDINGS CO LTD L 0.0 -1.90 -1133 -14.24195
3 59107 8630 JP Equity SOMPO HOLDINGS INC L 1.7 -6.36 -9766 -12.27612
4 59107 8750 JP Equity DAI-ICHI LIFE HOLDINGS INC L 1.2 -8.01 -931 -11.70994
How can I do this in a loop, so that I get all 6 tables (DataFrames) at once? Basically, one set of Best_Contributors and Worst_Contributors for each fund.
Thank you.
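One possible way to approach this, sketched under assumptions: put the sorting into a small function and call it once per file, collecting the results in a dict keyed by fund. The CSV text below is invented stand-in data with only three columns, read through io.StringIO so the example runs on its own; with real files you would pass the paths to pd.read_csv instead. The names csv_sources, top_and_bottom and results are hypothetical.

```python
import io

import pandas as pd

# Invented stand-ins for the CSV files, keyed by fund number.
csv_sources = {
    "59107": "Fund,Identifier,Net MTD P&L (Base)\n"
             "59107,AAA,44\n59107,BBB,-10\n59107,CCC,5\n",
    "65758": "Fund,Identifier,Net MTD P&L (Base)\n"
             "65758,DDD,7\n65758,EEE,-3\n65758,FFF,20\n",
}


def top_and_bottom(df, column, n=5):
    """Return the n best and n worst rows of df ranked by column."""
    best = df.sort_values(column, ascending=False).head(n).reset_index(drop=True)
    worst = df.sort_values(column).head(n).reset_index(drop=True)
    return best, worst


results = {}
for fund, text in csv_sources.items():
    df = pd.read_csv(io.StringIO(text), dtype={"Fund": str, "Identifier": str})
    results[fund] = top_and_bottom(df, "Net MTD P&L (Base)", n=2)

for fund, (best, worst) in results.items():
    print(f"{fund} Best_Contributors")
    print(best)
    print(f"{fund} Worst_Contributors")
    print(worst)
```

Each entry in results holds the pair of tables for one fund, so three files give the six tables asked for.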

How can I create a loop to scrape multiple pages from a source URL using BeautifulSoup?

The current script only scrapes a single page, but I would like to scrape all 5 pages from the source URL. How can I loop through the remaining 4 pages?
#Import Libraries
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get('https://www.sustainalytics.com/esg-ratings/?industry=Aerospace%20&%20Defense&currentpage=1').text
soup = BeautifulSoup(source, 'lxml')
#Start CSV
csv_file = open('aerospacedata_1.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['company_name', 'company_exchange', 'company_risk'])
#Scrape Data from Web and write to csv
for company_info in soup.find_all(class_='company-row d-flex'):
    company_name = company_info.a.text
    company_exchange = company_info.find("small").text
    company_risk = company_info.find("div", class_="col-2").text
    print(company_name, company_exchange, company_risk)
    csv_writer.writerow([company_name, company_exchange, company_risk])
csv_file.close()
Output:
company_name company_exchange company_risk
AECC Aviation Power Co Ltd SHG:600893 53.3
Airbus SE PAR:AIR 30.3
Aselsan Elektronik Sanayi ve Ticaret AS IST:ASELS 31.6
AVIC Aircraft Co., Ltd. SHE:000768 54.4
AVIC Shenyang Aircraft Co. Ltd. SHG:600760 51.3
AviChina Industry & Technology Company Limited HKG:2357 45.2
BAE Systems PLC LON:BA 34.3
Bombardier Inc. TSE:BBD.B 30
BWX Technologies, Inc. NYS:BWXT 42.3
CAE Inc. TSE:CAE 32.4
Wrap the code in a for loop and use the loop variable to build both the URL and the file name:
#Import Libraries
from bs4 import BeautifulSoup
import requests
import csv
pages = 5
for i in range(1, pages+1):
    print(f"Page - {i}")
    source = requests.get(f'https://www.sustainalytics.com/esg-ratings/?industry=Aerospace%20&%20Defense&currentpage={i}').text
    soup = BeautifulSoup(source, 'lxml')
    #Start CSV
    csv_file = open(f'aerospacedata_{i}.csv', 'w')
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['company_name', 'company_exchange', 'company_risk'])
    #Scrape Data from Web and write to csv
    for company_info in soup.find_all(class_='company-row d-flex'):
        company_name = company_info.a.text
        company_exchange = company_info.find("small").text
        company_risk = company_info.find("div", class_="col-2").text
        print(company_name, company_exchange, company_risk)
        csv_writer.writerow([company_name, company_exchange, company_risk])
    csv_file.close()
    print("---" * 30)
Output:
Page - 1
AECC Aviation Power Co Ltd SHG:600893 53.3
Airbus SE PAR:AIR 30.3
Aselsan Elektronik Sanayi ve Ticaret AS IST:ASELS 31.6
AVIC Aircraft Co., Ltd. SHE:000768 54.4
AVIC Shenyang Aircraft Co. Ltd. SHG:600760 51.3
AviChina Industry & Technology Company Limited HKG:2357 45.2
BAE Systems PLC LON:BA 34.3
Bombardier Inc. TSE:BBD.B 30
BWX Technologies, Inc. NYS:BWXT 42.3
CAE Inc. TSE:CAE 32.4
------------------------------------------------------------------------------------------
Page - 2
China Avionics Systems Co.,Ltd. SHG:600372 54.8
Cobham PLC LON:COB 34.7
Curtiss-Wright Corp NYS:CW 39
Dassault Aviation S.A. PAR:AM 31.8
Embraer S.A. BSP:EMBR3 36.3
FACC AG WBO:FACC 37.9
General Dynamics Corp NYS:GD 37.5
Heico Corp NYS:HEI 39.3
Hexcel Corp NYS:HXL 31.6
Huntington Ingalls Industries, Inc. NYS:HII 41.3
------------------------------------------------------------------------------------------
Page - 3
Kongsberg Gruppen ASA OSL:KOG 29
Korea Aerospace Industries, Ltd. KRX:047810 49.9
L3Harris Technologies, Inc. NYS:LHX 38.8
Leonardo S.p.a. MIL:LDO 28.7
Lockheed Martin Corp NYS:LMT 30.6
Macquarie Infrastructure Corp NYS:MIC 44.7
Meggitt PLC LON:MGGT 32.7
MTU Aero Engines AG ETR:MTX 23.8
Northrop Grumman Corp. NYS:NOC 31.1
QinetiQ Group PLC LON:QQ 23
------------------------------------------------------------------------------------------
Page - 4
Raytheon Co NYS:RTN 32.9
Rheinmetall AG ETR:RHM 35.4
Rolls-Royce Holdings PLC LON:RR 28.6
Saab AB OME:SAAB.B 31.5
Safran SA PAR:SAF 30.7
Senior PLC LON:SNR 31.9
Signature Aviation Plc LON:SIG 35.4
Singapore Technologies Engineering Ltd. SES:S63 29.2
Spirit AeroSystems Holdings Inc NYS:SPR 36.8
Teledyne Technologies, Inc. NYS:TDY 37.5
------------------------------------------------------------------------------------------
Page - 5
Textron Inc. NYS:TXT 37.8
Thales SA PAR:HO 28.6
The Boeing Company NYS:BA 39
TransDigm Group Inc NYS:TDG 40.9
Ultra Electronics Holdings PLC LON:ULE 37.4
United Technologies Corp NYS:UTX 29.3
------------------------------------------------------------------------------------------
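A side note on the URL itself: the "&" inside "Aerospace%20&%20Defense" is not percent-encoded, so strictly speaking it separates query parameters rather than being part of the industry name. The sketch below builds the page URLs with the standard library so that the value is encoded properly. Whether the site treats the encoded form the same way is an assumption I have not checked, and BASE_URL and page_url are names made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sustainalytics.com/esg-ratings/"


def page_url(industry, page):
    """Build the listing URL for one page, encoding the query values."""
    query = urlencode({"industry": industry, "currentpage": page})
    return f"{BASE_URL}?{query}"


# One URL per page, pages 1 to 5 as in the answer above.
urls = [page_url("Aerospace & Defense", i) for i in range(1, 6)]
for url in urls:
    print(url)
```

With requests you can get the same effect by passing the dictionary through the params argument of requests.get and letting the library do the encoding.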

Scraping an HTML tag on a web page using BS

I was looking to get the data in the stock dropdown. I went into the source and found the tag but I can't get the code to access the data. Can someone please help me fix the bug?
url ="http://www.moneycontrol.com/india/fnoquote/reliance-industries/RI/2020-07-30"
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get(url, headers = headers)
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.select("stock_id"):
    print(i.text)
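soup.select("stock_id") looks for tags that are literally named stock_id, so it finds nothing. Use the CSS selector #stock_code > option instead, which matches the option elements directly inside the element whose id is stock_code. You can try it: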
from bs4 import BeautifulSoup
import requests
url ="http://www.moneycontrol.com/india/fnoquote/reliance-industries/RI/2020-07-30"
headers = {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
r = requests.get(url, headers = headers)
soup = BeautifulSoup(r.content, "html.parser")
a = soup.select("#stock_code > option")
for i in a:
    print(i.text)
Output will be:
ACC
Adani Enterpris
Adani Ports
Adani Power
Ajanta Pharma
Allahabad Bank
Amara Raja Batt
Ambuja Cements
Apollo Hospital
Apollo Tyres
Arvind
Ashok Leyland
Asian Paints
Aurobindo Pharm
Axis Bank
Bajaj Auto
Bajaj Finance
Bajaj Finserv
Balkrishna Ind
Bank of Baroda
Bank of India
Bata India
BEML
Berger Paints
Bharat Elec
Bharat Fin
Bharat Forge
Bharti Airtel
Bharti Infratel
BHEL
Biocon
Bosch
BPCL
Britannia
Cadila Health
Can Fin Homes
Canara Bank
Capital First
Castrol
Ceat
Century
CESC
CG Power
Chennai Petro
Cholamandalam
Cipla
Coal India
Colgate
Container Corp
Cummins
Dabur India
Dalmia Bharat
DCB Bank
Dewan Housing
Dish TV
Divis Labs
DLF
Dr Reddys Labs
Eicher Motors
EngineersInd
Equitas Holding
Escorts
Exide Ind
Federal Bank
GAIL
Glenmark
GMR Infra
Godfrey Phillip
Godrej Consumer
Godrej Ind
Granules India
Grasim
GSFC
Havells India
HCL Tech
HDFC
HDFC Bank
Hero Motocorp
Hexaware Tech
Hind Constr
Hind Zinc
Hindalco
HPCL
HUL
ICICI Bank
ICICI Prudentia
IDBI Bank
IDFC
IDFC Bank
IFCI
IGL
India Cements
Indiabulls Hsg
Indian Bank
IndusInd Bank
Infibeam Avenue
Infosys
Interglobe Avi
IOC
IRB Infra
ITC
Jain Irrigation
Jaiprakash Asso
Jet Airways
Jindal Steel
JSW Steel
Jubilant Food
Just Dial
Kajaria Ceramic
Karnataka Bank
Kaveri Seed
Kotak Mahindra
KPIT Tech
L&T Finance
Larsen
LIC Housing Fin
Lupin
M&M
M&M Financial
Mahanagar Gas
Manappuram Fin
Marico
Maruti Suzuki
Max Financial
MCX India
Mindtree
Motherson Sumi
MRF
MRPL
Muthoot Finance
NALCO
NBCC (India)
NCC
Nestle
NHPC
NIIT Tech
NMDC
NTPC
Oil India
ONGC
Oracle Fin Serv
Oriental Bank
Page Industries
PC Jeweller
Petronet LNG
Pidilite Ind
Piramal Enter
PNB
Power Finance
Power Grid Corp
PTC India
PVR
Ramco Cements
Raymond
RBL Bank
REC
Rel Capital
Reliance
Reliance Comm
Reliance Infra
Reliance Power
Repco Home
SAIL
SBI
Shree Cements
Shriram Trans
Siemens
South Ind Bk
SREI Infra
SRF
Strides Pharma
Sun Pharma
Sun TV Network
Suzlon Energy
Syndicate Bank
Tata Chemicals
Tata Comm
Tata Elxsi
Tata Global Bev
Tata Motors
Tata Motors (D)
Tata Power
Tata Steel
TCS
Tech Mahindra
Titan Company
Torrent Pharma
Torrent Power
TV18 Broadcast
TVS Motor
Ujjivan Financi
UltraTechCement
Union Bank
United Brewerie
United Spirits
UPL
V-Guard Ind
Vedanta
Vodafone Idea
Voltas
Wipro
Wockhardt
Yes Bank
Zee Entertain
Select
ACC
Adani Enterpris
Adani Ports
Adani Power
Ajanta Pharma
Allahabad Bank
Amara Raja Batt
Ambuja Cements
Apollo Hospital
Apollo Tyres
Arvind
Ashok Leyland
Asian Paints
Aurobindo Pharm
Axis Bank
Bajaj Auto
Bajaj Finance
Bajaj Finserv
Balkrishna Ind
Bank of Baroda
Bank of India
Bata India
BEML
Berger Paints
Bharat Elec
Bharat Fin
Bharat Forge
Bharti Airtel
Bharti Infratel
BHEL
Biocon
Bosch
BPCL
Britannia
Cadila Health
Can Fin Homes
Canara Bank
Capital First
Castrol
Ceat
Century
CESC
CG Power
Chennai Petro
Cholamandalam
Cipla
Coal India
Colgate
Container Corp
Cummins
Dabur India
Dalmia Bharat
DCB Bank
Dewan Housing
Dish TV
Divis Labs
DLF
Dr Reddys Labs
Eicher Motors
EngineersInd
Equitas Holding
Escorts
Exide Ind
Federal Bank
GAIL
Glenmark
GMR Infra
Godfrey Phillip
Godrej Consumer
Godrej Ind
Granules India
Grasim
GSFC
Havells India
HCL Tech
HDFC
HDFC Bank
Hero Motocorp
Hexaware Tech
Hind Constr
Hind Zinc
Hindalco
HPCL
HUL
ICICI Bank
ICICI Prudentia
IDBI Bank
IDFC
IDFC Bank
IFCI
IGL
India Cements
Indiabulls Hsg
Indian Bank
IndusInd Bank
Infibeam Avenue
Infosys
Interglobe Avi
IOC
IRB Infra
ITC
Jain Irrigation
Jaiprakash Asso
Jet Airways
Jindal Steel
JSW Steel
Jubilant Food
Just Dial
Kajaria Ceramic
Karnataka Bank
Kaveri Seed
Kotak Mahindra
KPIT Tech
L&T Finance
Larsen
LIC Housing Fin
Lupin
M&M
M&M Financial
Mahanagar Gas
Manappuram Fin
Marico
Maruti Suzuki
Max Financial
MCX India
Mindtree
Motherson Sumi
MRF
MRPL
Muthoot Finance
NALCO
NBCC (India)
NCC
Nestle
NHPC
NIIT Tech
NMDC
NTPC
Oil India
ONGC
Oracle Fin Serv
Oriental Bank
Page Industries
PC Jeweller
Petronet LNG
Pidilite Ind
Piramal Enter
PNB
Power Finance
Power Grid Corp
PTC India
PVR
Ramco Cements
Raymond
RBL Bank
REC
Rel Capital
Reliance
Reliance Comm
Reliance Infra
Reliance Power
Repco Home
SAIL
SBI
Shree Cements
Shriram Trans
Siemens
South Ind Bk
SREI Infra
SRF
Strides Pharma
Sun Pharma
Sun TV Network
Suzlon Energy
Syndicate Bank
Tata Chemicals
Tata Comm
Tata Elxsi
Tata Global Bev
Tata Motors
Tata Motors (D)
Tata Power
Tata Steel
TCS
Tech Mahindra
Titan Company
Torrent Pharma
Torrent Power
TV18 Broadcast
TVS Motor
Ujjivan Financi
UltraTechCement
Union Bank
United Brewerie
United Spirits
UPL
V-Guard Ind
Vedanta
Vodafone Idea
Voltas
Wipro
Wockhardt
Yes Bank
Zee Entertain
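To see the difference between the two selectors without hitting the site, here is a small offline sketch. The markup is a made-up, heavily simplified stand-in for the real page, so treat it as an illustration of how select works rather than a copy of the page's structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real dropdown.
html_doc = """
<select id="stock_code">
  <option>Select</option>
  <option>ACC</option>
  <option>Adani Ports</option>
</select>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A bare word is a tag-name selector: this looks for <stock_code> elements.
by_tag = soup.select("stock_code")

# "#stock_code" selects by id; "> option" takes its direct <option> children.
by_id = soup.select("#stock_code > option")

names = [opt.text for opt in by_id]
print(len(by_tag), names)
```

The first selector returns an empty list because no element has that tag name, which is the same reason the loop in the question printed nothing.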

python: Arrange in pandas dataframe

I extract data from a webpage and would like to arrange it into a pandas DataFrame.
import requests
from lxml import html

finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')
finz = html.fromstring(finviz.content)
col = finz.xpath('//table/tr/td[@class="table-top"]/text()')
data = finz.xpath('//table/tr/td/a[@class="screener-link"]/text()')
col holds the column names for the DataFrame, and data is a flat list: the first 28 data points belong in the first row, data points 29 to 56 in the second row, and so forth. How can I write this elegantly?
datalist = []
for y in range(28):
    datalist.append(data[y])
>>> datalist
['1', 'Agilent Technologies, Inc.', 'Healthcare', 'Medical Laboratories & Research', 'USA', '23.00B', '29.27', '4.39', '4.53', '18.76', '1.02%', '5.00%', '5.70%',
'324.30M', '308.52M', '2.07', '8.30%', '15.70%', '14.60%', '1.09', '1,775,149', '2', 'Alcoa Corporation', 'Basic Materials', 'Aluminum', 'USA', '1.21B', '-']
But the result is a flat list, not a table like a DataFrame.
Pandas has a function to parse HTML: pd.read_html
You can try the following:
# Modules
import pandas as pd
import requests
# HTML content
finviz = requests.get('https://finviz.com/screener.ashx?v=152&o=ticker&c=0,1,2,3,4,5,6,7,10,11,12,14,16,17,19,21,22,23,24,25,31,32,33,38,41,48,65,66,67&r=1')
# Convert to dataframe
df = pd.read_html(finviz.content)[-2]
# Set 1st row to columns names
df.columns = df.iloc[0]
# Drop 1st row
df = df.drop(df.index[0])
# df = df.set_index('No.')
print(df)
# 0 No. Ticker Company Sector Industry Country ... Debt/Eq Profit M Beta Price Change Volume
# 1 1 A Agilent Technologies, Inc. Healthcare Medical Laboratories & Research USA ... 0.51 14.60 % 1.20 72.47 - 0.28 % 177333
# 2 2 AA Alcoa Corporation Basic Materials Aluminum USA ... 0.44 - 10.80 % 2.03 6.28 3.46 % 3021371
# 3 3 AAAU Perth Mint Physical Gold ETF Financial Exchange Traded Fund USA ... - - - 16.08 - 0.99 % 45991
# 4 4 AACG ATA Creativity Global Services Education & Training Services China ... 0.02 - 2.96 0.95 - 0.26 % 6177
# 5 5 AADR AdvisorShares Dorsey Wright ADR ETF Financial Exchange Traded Fund USA ... - - - 40.80 0.22 % 1605
# 6 6 AAL American Airlines Group Inc. Services Major Airlines USA ... - 3.70 % 1.83 12.81 4.57 % 16736506
# 7 7 AAMC Altisource Asset Management Corporation Financial Asset Management USA ... - -17.90 % 0.78 12.28 0.00 % 0
# 8 8 AAME Atlantic American Corporation Financial Life Insurance USA ... 0.28 - 0.40 % 0.29 2.20 3.29 % 26
# 9 9 AAN Aaron's, Inc. Services Rental & Leasing Services USA ... 0.20 0.80 % 1.23 22.47 - 0.35 % 166203
# 10 10 AAOI Applied Optoelectronics, Inc. Technology Semiconductor - Integrated Circuits USA ... 0.49 - 34.60 % 2.02 7.80 2.63 % 61303
# 11 11 AAON AAON, Inc. Industrial Goods General Building Materials USA ... 0.02 11.40 % 0.88 48.60 0.71 % 20533
# 12 12 AAP Advance Auto Parts, Inc. Services Auto Parts Stores USA ... 0.21 5.00 % 1.04 95.94 - 0.58 % 165445
# 13 13 AAPL Apple Inc. Consumer Goods Electronic Equipment USA ... 1.22 21.50 % 1.19 262.39 2.97 % 11236642
# 14 14 AAT American Assets Trust, Inc. Financial REIT - Retail USA ... 1.03 12.50 % 0.99 25.35 2.78 % 30158
# 15 15 AAU Almaden Minerals Ltd. Basic Materials Gold Canada ... 0.04 - 0.53 0.28 - 1.43 % 34671
# 16 16 AAWW Atlas Air Worldwide Holdings, Inc. Services Air Services, Other USA ... 1.33 - 10.70 % 1.65 22.79 2.70 % 56521
# 17 17 AAXJ iShares MSCI All Country Asia ex Japan ETF Financial Exchange Traded Fund USA ... - - - 60.13 1.18 % 161684
# 18 18 AAXN Axon Enterprise, Inc. Industrial Goods Aerospace/Defense Products & Services USA ... 0.00 0.20 % 0.77 71.11 2.37 % 187899
# 19 19 AB AllianceBernstein Holding L.P. Financial Asset Management USA ... 0.00 89.60 % 1.35 19.15 1.84 % 54588
# 20 20 ABB ABB Ltd Industrial Goods Diversified Machinery Switzerland ... 0.67 5.10 % 1.10 17.44 0.52 % 723739
# [20 rows x 29 columns]
I'll leave it to you to make the table selection more robust in case the HTML page structure changes. The id of the parent div might be useful for that.
Explanation of the [-2]: read_html returns a list of DataFrames:
list_df = pd.read_html(finviz.content)
print(type(list_df))
# <class 'list'>
# Elements types in the lists
print(type(list_df[0]))
# <class 'pandas.core.frame.DataFrame'>
So, to get the desired DataFrame, I select the second-to-last element of the list with [-2]. Negative indexes count from the end of the list, so [-1] is the last element and [-2] the one before it.
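If you would rather keep the xpath approach from the question, the flat list can be cut into rows of a fixed width and handed to the DataFrame constructor. This is a minimal sketch with a shortened, hypothetical header of three columns instead of 28, and it only works when every row contributes exactly the same number of values; a missing cell would shift everything after it.

```python
import pandas as pd

# Hypothetical, shortened header and data; the real lists come from xpath.
columns = ["No.", "Company", "Country"]
flat = [
    "1", "Agilent Technologies, Inc.", "USA",
    "2", "Alcoa Corporation", "USA",
]

# Slice the flat list into consecutive chunks of len(columns) items each.
n = len(columns)
rows = [flat[i:i + n] for i in range(0, len(flat), n)]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

In the question's terms, columns would be col and flat would be data, with n equal to the number of values per row.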
