Extracting chosen information from URL results into a dataframe - python

I would like to create a dataframe by pulling only certain information from this website.
https://www.stockrover.com/build/production/Research/tail.js?1644930560
I would like to pull all the entries like this one: ["0005.HK","HSBC HOLDINGS","",""].
Another problem: I only want the first 20,000 lines, which contain the stock information; there is other information after line 20,000 that I don't want included in the dataframe.
To summarize, could someone show me how to pull out just the information I'm trying to extract and create a dataframe with those results, if this is possible?
A sample of the website results:
function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],["0006.HK","Power Assets Holdings Ltd","",""],["000660.KS","SK hynix","",""],["004370.KS","Nongshim","",""],["005930.KS","Samsung Electroni","",""],["0123.HK","YUEXIU PROPERTY","",""],["0336.HK","HUABAO INTL","",""],["0408.HK","YIP'S CHEMICAL","",""],["0522.HK","ASM PACIFIC","",""],["0688.HK","CHINA OVERSEAS","",""],["0700.HK","TENCENT","",""],["0762.HK","CHINA UNICOM","",""],["0808.HK","PROSPERITY REIT","",""],["0813.HK","SHIMAO PROPERTY",
Code to pull all lines, including ones not wanted:
import requests
import pandas as pd

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
response = requests.get(url)
print(response.text)

Use a regex to extract the details, then literal_eval to convert the string to a Python object:
import re
from ast import literal_eval
import pandas as pd
import requests
url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
response = requests.get(url)
regex_ = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)}", re.DOTALL)
print(pd.DataFrame(literal_eval(regex_.search(response.text).group(1))))
0 1 2 3
0 0005.HK HSBC HOLDINGS
1 0006.HK Power Assets Holdings Ltd
2 000660.KS SK hynix
3 004370.KS Nongshim
4 005930.KS Samsung Electroni
... ... ... ... ..
21426 ZZHGF ZhongAn Online P&C _INSUP
21427 ZZHGY ZhongAn Online P&C _INSUP
21428 ZZLL ZZLL Information Tech _INTEC
21429 ZZZ.TO Sleep Country Canada _SPECR
21430 ZZZOF Zinc One Resources _OTHEI
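If you want readable column headers, you can name them yourself when building the dataframe. A minimal sketch using the same regex as above; the column names are my own guesses, since the source array doesn't label them:
import re
from ast import literal_eval
import pandas as pd
import requests

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"
text = requests.get(url).text
match = re.search(r"getStocksLibraryArray\(\)\{return(.+?)}", text, re.DOTALL)
records = literal_eval(match.group(1))
# "Symbol" and "Name" are assumed labels; the last two columns are empty in the sample
df = pd.DataFrame(records, columns=["Symbol", "Name", "Col3", "Col4"])
print(df[["Symbol", "Name"]].head())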

Related

Trouble Looping through JSON elements pulled using API

I am trying to pull search results data from an API on a website and put it into a pandas dataframe. I've been able to successfully pull the info from the API into a JSON format.
The next step I'm stuck on is how to loop through the search results on a particular page and then again for each page of results.
Here is what I've tried so far:
# Step 1: Connect to the API
import requests
import json
response_API = requests.get('https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page=1')
print(response_API.status_code)  # 200
# Step 2: Get the data from the API
data = response_API.text
# Step 3: Parse the data into JSON format
parse_json = json.loads(data)
# Step 4: Extract data
title = parse_json['results'][0]['title']
pub_date = parse_json['results'][0]['publication_date']
agency = parse_json['results'][0]['agencies'][0]['name']
Here is where I've tried to put this all into a loop:
import numpy as np
import pandas as pd

df = []
for page in np.arange(0, 7):
    url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
    response_API = requests.get(url)
    print(response_API.status_code)
    data = response_API.text
    parse_json = json.loads(data)
    for i in parse_json:
        title = parse_json['results'][i]['title']
        pub_date = parse_json['results'][i]['publication_date']
        agency = parse_json['results'][i]['agencies'][0]['name']
        df.append([title, pub_date, agency])

cols = ["Title", "Date", "Agency"]
df = pd.DataFrame(df, columns=cols)
I feel like I'm close to the correct answer, but I'm not sure how to move forward from here. I need to iterate through the results where I placed the i's when parsing the json data, but I get an error that reads "TypeError: list indices must be integers or slices, not str". I understand I can't put the i's in those spots, but how else am I supposed to iterate through the results?
Any help would be appreciated!
Thank you!
I think you are very close!
import pandas as pd
import requests

BASE_URL = "https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}"

results = []
for page in range(0, 7):
    response = requests.get(BASE_URL.format(page=page))
    if response.ok:
        resp_json = response.json()
        for res in resp_json["results"]:
            results.append(
                [
                    res["title"],
                    res["publication_date"],
                    [agency["name"] for agency in res["agencies"]],
                ]
            )

df = pd.DataFrame(results, columns=["Title", "Date", "Agencies"])
In this block of code, I used the requests library's built-in .json() method, which automatically converts a response's text to a JSON dict (if it's in the proper format).
The if response.ok is a less verbose way provided by requests to check that the status code is < 400, and it prevents errors that might occur when attempting to parse the response if there was a problem with the HTTP call.
Finally, I'm not sure exactly what data you need for your DataFrame, but each object in the "results" list has "agencies" as a list of agencies; I wasn't sure if you wanted to drop all that data, so I kept the names as a list.
Edit:
In case the response objects don't contain the proper keys, we can use the .get() method of Python dictionaries.
# ...snip
for res in resp_json["results"]:
    results.append(
        [
            res.get("title"),  # returns None by default instead of raising a KeyError
            res.get("publication_date"),
            [
                # get 'name', falling back to 'raw_name', and None if neither key exists
                agency.get("name", agency.get("raw_name"))
                for agency in res.get("agencies", [])
            ],
        ]
    )
Slightly different approach: rather than iterating through the response, read it into a dataframe, then save what you need. This saves the first agency name in the list.
import json

import numpy as np
import pandas as pd
import requests

df_list = []
for page in np.arange(0, 7):
    url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
    response_API = requests.get(url)
    # print(response_API.status_code)
    data = response_API.text
    parse_json = json.loads(data)
    df = pd.json_normalize(parse_json['results'])
    df['Agency'] = df['agencies'][0][0]['raw_name']  # first agency of the page's first result
    df_list.append(df[['title', 'publication_date', 'Agency']])

df_final = pd.concat(df_list)
df_final
title publication_date Agency
0 Determination of the Promotion of Economy and ... 2021-09-28 OFFICE OF MANAGEMENT AND BUDGET
1 Corporate Average Fuel Economy Standards for M... 2021-09-03 OFFICE OF MANAGEMENT AND BUDGET
2 Public Hearing for Corporate Average Fuel Econ... 2021-09-14 OFFICE OF MANAGEMENT AND BUDGET
3 Investigation of Urea Ammonium Nitrate Solutio... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
4 Call for Nominations To Serve on the National ... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
.. ... ... ...
15 Energy Conservation Program: Test Procedure fo... 2021-09-14 DEPARTMENT OF COMMERCE
16 Self-Regulatory Organizations; The Nasdaq Stoc... 2021-09-09 DEPARTMENT OF COMMERCE
17 Regulations To Improve Administration and Enfo... 2021-09-20 DEPARTMENT OF COMMERCE
18 Towing Vessel Firefighting Training 2021-09-01 DEPARTMENT OF COMMERCE
19 Patient Protection and Affordable Care Act; Up... 2021-09-27 DEPARTMENT OF COMMERCE
[140 rows x 3 columns]
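Note that df['agencies'][0][0]['raw_name'] takes the first agency of the page's first result and applies it to every row on that page, which is why each page shows a single agency in the output above. If you want each row's own agency, a per-row apply is a small change (a sketch, assuming every agencies list is non-empty):
df['Agency'] = df['agencies'].apply(lambda ags: ags[0]['raw_name'] if ags else None)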

Python scraping with beautifulsoup cannot scrape properly some lines of data

I am exploring web scraping in Python. I have the following snippet, but the problem is that some lines of the extracted data are not correct. What could be the problem with this snippet?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://bscscan.com/txsinternal?ps=100&zero=false&valid=all'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, timeout=10).read()
soup = BeautifulSoup(webpage, 'html.parser')

rows = soup.findAll('table')[0].findAll('tr')
for row in rows[1:]:
    ttype = row.find_all('td')[3].text
    amt = row.find_all('td')[7].text
    transamt = str(amt)
    print()
    print("this is bnbval: ", transamt)
    print("transactiontype: ", ttype)
Sample output:
trans amt: Binance: WBNB Token #- wrong data being extracted
transtype: 0x2de500a9a2d01c1d0a0b84341340f92ac0e2e33b9079ef04d2a5be88a4a633d4 #- wrong data being extracted
trans amt: 1 BNB
transtype: call
trans amt: 1 BNB
transtype: call
this is bnbval: Binance: WBNB Token #- wrong data being extracted
transactiontype: 0x1cc224ba17182f8a4a1309cb2aa8fe4d19de51c650c6718e4febe07a51387dce #- wrong data being extracted
trans amt: 1 BNB
transtype: call
There is nothing wrong with your code, but there is a problem with the data on the page.
Some rows are 7-column rows, the kind you're expecting, and some rows are 9-column rows. The 9-column rows are the ones giving you wrong data.
You can just go to the page and inspect the elements to see the issue.
I can suggest using the last element [-1] instead of [7], but you need some kind of if check for the 3rd column, as in the sketch below.
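A minimal sketch of that suggestion; the index shift for the longer rows is an assumption you'd want to verify by inspecting the table:
rows = soup.findAll('table')[0].findAll('tr')
for row in rows[1:]:
    cells = row.find_all('td')
    transamt = cells[-1].text  # last cell, so extra columns can't shift it
    # assumed: the extra cells come before the type column, shifting it by two
    ttype = cells[3].text if len(cells) <= 8 else cells[5].text
    print("trans amt:", transamt)
    print("transtype:", ttype)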

Scrape table from static web site

I need to scrape the table with top-level domains from iana.org.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.iana.org/domains/root/db'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='tld-table')
How can I get this into a pandas DataFrame with the structure as it is on the web site (DOMAIN, TYPE, TLD MANAGER)?
Pandas already comes with something to read tables from HTML, so there is no need to use BeautifulSoup:
import pandas as pd
url = "https://www.iana.org/domains/root/db"
# This returns a list of DataFrames with all tables in the page.
df = pd.read_html(url)[0]
You can use pandas pd.read_html
import pandas as pd
URL = "https://www.iana.org/domains/root/db"
df = pd.read_html(URL)[0]
print(df.head())
Domain Type TLD Manager
0 .aaa generic American Automobile Association, Inc.
1 .aarp generic AARP
2 .abarth generic Fiat Chrysler Automobiles N.V.
3 .abb generic ABB Ltd
4 .abbott generic Abbott Laboratories, Inc.
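If the default fetch is ever blocked or the page gains more tables, you can download the HTML yourself and point read_html at the table by its id (a sketch; the attrs filter simply reuses the id='tld-table' the question already found):
import pandas as pd
import requests

URL = "https://www.iana.org/domains/root/db"
html = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}).text
# select the table by its id instead of taking the first table on the page
df = pd.read_html(html, attrs={"id": "tld-table"})[0]
print(df.head())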

Need a 'for loop' to get dividend data for a stock portfolio, from their respective api urls

I am trying to automate the parsing of dividend data for a stock portfolio, getting the stock-wise dividend values into a single dataframe table.
The data for each stock in the portfolio is stored in a separate API URL.
The portfolio ids (for the stocks ITC, Britannia, Sanofi) are [500875, 500825, 500674].
I would first like to run a for loop to generate each specific url (which goes like this - https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500674), the last 6 digits of the url being the respective company id.
Then I would like to use each url to get the first line of the respective dividend table into a single dataframe. The code I used to get the individual dividend data, and the final dataframe that I need, are shown in the attached image.
Basically I would like to run a for loop that gets the first line of 'Table2' for each stock id and stores it in a single dataframe as the final result.
PS - The code I used to get the individual dividend data is below:
import requests
import pandas as pd

url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500674'
jsondata = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
df = pd.DataFrame(jsondata['Table2'])
If you need a for loop then you should use one, and show the code with the for loop and the problem it gives you.
You can do all the work with a single for loop: use string formatting to create the url with each code and read the data from the server, take the first row (even without creating a DataFrame) and append it to a list of all rows, and after the loop convert that list to a DataFrame.
import requests
import pandas as pd

# --- before loop ---

headers = {'User-Agent': 'Mozilla/5.0'}
all_rows = []

# --- loop ---

for code in [500875, 500825, 500674]:
    # use an f-string or str.format() to create the url
    #url = f'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode={code}'
    url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode={}'.format(code)

    r = requests.get(url, headers=headers)
    #print(r.text)        # to check the error message
    #print(r.status_code)

    data = r.json()

    first_row = data['Table2'][0]  # no need to use a DataFrame here
    #df = pd.DataFrame(data['Table2'])
    #first_row = df.iloc[0]
    #print(first_row)

    all_rows.append(first_row)

# --- after loop ---

df_result = pd.DataFrame(all_rows)
print(df_result)
Result:
scrip_code sLongName ... Details PAYMENT_DATE
0 500875 ITC LTD. ... 10.1500 2020-09-08T00:00:00
1 500825 BRITANNIA INDUSTRIES LTD. ... 83.0000 2020-09-16T00:00:00
2 500674 Sanofi India Ltd ... 106.0000 2020-08-06T00:00:00
[3 rows x 9 columns]
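If a scrip code ever returns an empty 'Table2', indexing [0] would raise an IndexError. A small guard (my addition, not part of the answer above) keeps the loop going:
# inside the loop, replacing first_row = data['Table2'][0]
table = data.get('Table2', [])
if table:
    all_rows.append(table[0])
else:
    print('no dividend data for', code)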

New York BBL to Latitude/Longitude API

I want to convert New York BBL numbers to Latitude/Longitude values. The BBL values are in a CSV file. Is there a free API to convert them using Python?
locatenyc.io has one month free. If you go and sign up for a free account, this code works (I tried it out).
import pandas as pd
import requests

TOKEN = 'YOUR TOKEN'

def get_coord(bbl):
    url = f'https://locatenyc.io/arcgis/rest/services/locateNYC/v1/GeocodeServer/findAddressCandidates?singleLine={bbl}&token={TOKEN}'
    resp = requests.get(url)
    data = resp.json()
    attrs = data['candidates'][0]['attributes']
    return attrs['longitudeInternalLabel'], attrs['latitudeInternalLabel']

# assumes df already holds your CSV data with a 'bbl' column
df['coords'] = df['bbl'].apply(get_coord)
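For completeness, a sketch of loading the CSV first; the filename and the 'bbl' column name are assumptions, so adjust them to your file:
df = pd.read_csv('bbls.csv')               # assumed filename
df['coords'] = df['bbl'].apply(get_coord)  # assumed column name
# split the (longitude, latitude) tuples into separate columns
df[['longitude', 'latitude']] = pd.DataFrame(df['coords'].tolist(), index=df.index)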
