Python scraping with BeautifulSoup cannot properly scrape some lines of data - python

I am exploring web scraping in Python. I have the following snippet, but the problem with this code is that some lines of the extracted data are not correct. What could be the problem with this snippet?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
url = 'https://bscscan.com/txsinternal?ps=100&zero=false&valid=all'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, timeout=10).read()
soup = BeautifulSoup(webpage, 'html.parser')
rows = soup.findAll('table')[0].findAll('tr')
for row in rows[1:]:
    ttype = (row.find_all('td')[3].text[0:])
    amt = (row.find_all('td')[7].text[0:])
    transamt = str(amt)
    print()
    print("this is bnbval: ", transamt)
    print("transactiontype: ", ttype)
Sample output:
trans amt: Binance: WBNB Token #- wrong data being extracted
transtype: 0x2de500a9a2d01c1d0a0b84341340f92ac0e2e33b9079ef04d2a5be88a4a633d4 #- wrong data being extracted
trans amt: 1 BNB
transtype: call
trans amt: 1 BNB
transtype: call
this is bnbval: Binance: WBNB Token #- wrong data being extracted
transactiontype: 0x1cc224ba17182f8a4a1309cb2aa8fe4d19de51c650c6718e4febe07a51387dce #- wrong data being extracted
trans amt: 1 BNB
transtype: call

There is nothing wrong with your code, but there is a problem with the data on the page.
Some rows have 7 columns (the layout you're expecting), while others have 9 columns, and those 9-column rows are what give you the wrong data.
You can go to the page and inspect the elements to see the issue yourself.
I suggest using the last element [-1] instead of [7], but you still need some kind of if check for the 3rd column.
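A minimal sketch of that suggestion, reusing rows from the snippet above (the expected column count here is an assumption - inspect the table on the page to confirm it):
EXPECTED_COLS = 8  # assumption: the "regular" rows have this many <td> cells; adjust after inspecting the page
for row in rows[1:]:
    cells = row.find_all('td')
    if len(cells) != EXPECTED_COLS:
        continue  # skip (or handle separately) the wider rows that shift the data
    ttype = cells[3].text
    transamt = cells[-1].text  # last cell instead of a hard-coded index
    print()
    print("this is bnbval: ", transamt)
    print("transactiontype: ", ttype)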

Related

Need a 'for loop' to get dividend data for a stock portfolio, from their respective api urls

I am trying to automate parsing of dividend data for a stock portfolio and to collect the stock-wise dividend values into a single dataframe table.
The data for each stock in the portfolio is stored at a separate API URL.
The portfolio ids (for the stocks ITC, Britannia, Sanofi) are [500875, 500825, 500674].
I would first like to run a 'for loop' to generate each specific URL (which looks like this - https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500674), the last 6-digit number of each URL being the respective company id.
Then I would like to use each URL to get the first line of the respective dividend table into a single dataframe. The code I used to get the individual dividend data, and the final dataframe that I need, are shown in the attached image.
Basically, I would like to run a 'for loop' to get the first line of 'Table2' for each stock id and store it in a single dataframe as the final result.
PS - The code I used to get the individual dividend data is below:
import requests
import pandas as pd

url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode=500674'
jsondata = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
df = pd.DataFrame(jsondata['Table2'])
If you need a for loop then you should use it, show the code with the for loop, and describe the problem it gives you.
You can do all of the work in a single for loop.
You can use string formatting to create the URL with each code and read the data from the server. Then you can take the first row (even without creating a DataFrame) and append it to a list of all rows. After the loop you can convert this list to a DataFrame.
import requests
import pandas as pd
# --- before loop ---
headers = {'User-Agent': 'Mozilla/5.0'}
all_rows = []
# --- loop ---
for code in [500875, 500825, 500674]:
    # use an f-string or str.format() to create the url
    #url = f'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode={code}'
    url = 'https://api.bseindia.com/BseIndiaAPI/api/CorporateAction/w?scripcode={}'.format(code)
    r = requests.get(url, headers=headers)
    #print(r.text)         # to check error message
    #print(r.status_code)
    data = r.json()
    first_row = data['Table2'][0]  # no need to use a DataFrame
    #df = pd.DataFrame(data['Table2'])
    #first_row = df.iloc[0]
    #print(first_row)
    all_rows.append(first_row)
# --- after loop ---
df_result = pd.DataFrame(all_rows)
print(df_result)
Result:
scrip_code sLongName ... Details PAYMENT_DATE
0 500875 ITC LTD. ... 10.1500 2020-09-08T00:00:00
1 500825 BRITANNIA INDUSTRIES LTD. ... 83.0000 2020-09-16T00:00:00
2 500674 Sanofi India Ltd ... 106.0000 2020-08-06T00:00:00
[3 rows x 9 columns]

How to revert a string into a command?

I actually need more than one item from the page, but they are all under the same headers, and I really don't want to repeat the same soup_wash.find("td", headers="tf89c8e5b-5207-48e7-a536-1f50ee7f5088c{}").text.strip() line every time, so I am trying to store the expression in a string called text and format it to save time.
import requests
from bs4 import BeautifulSoup
def html(url):
    return BeautifulSoup(requests.get(url).text, "lxml")
soup_wash = html("https://www.washtenaw.org/3108/Cases")
text = 'soup_wash.find("td", headers="tf89c8e5b-5207-48e7-a536-1f50ee7f5088c{}").text.strip()'
item1 = text.format("2")
item2 = text.format("6")
print(item1, item2) # Supposed to print -> 1561, 107 but it actually prints str(text) formatted.
I need bs4 to process the string of item1 and item2 but I'm not sure how to do so.
I personally wouldn't use the value tf89c8e5b-5207-48e7-a536-1f50ee7f5088c{} to get the Total Cases and Total Deaths values, because it looks like it could change at any time.
Instead, grab the first table and use standard Python indexing to get the columns. For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.washtenaw.org/3108/Cases'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print('{:<15}{}'.format('Total Cases', 'Total Deaths'))
for tr in soup.select('table')[0].select('tr:has(td)'):
    tds = [td.get_text() for td in tr.select('td')]
    print('{:<15}{}'.format(tds[1], tds[5]))
Prints:
Total Cases Total Deaths
1561 107
338 3
1899 110
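If you do want to keep the headers-based lookup from the question without repeating the long line, a plain helper function is enough - no eval or string-to-code conversion is needed. A sketch, reusing soup_wash and the headers value from the question:
def get_cell(soup, suffix):
    # build the headers attribute value and look up the cell directly
    return soup.find("td", headers="tf89c8e5b-5207-48e7-a536-1f50ee7f5088c{}".format(suffix)).text.strip()

item1 = get_cell(soup_wash, "2")
item2 = get_cell(soup_wash, "6")
print(item1, item2)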

How to Use Beautiful Soup to Scrape SEC's Edgar Database and Receive Desire Data

Apologies in advance for the long question - I am new to Python and I'm trying to be as explicit as I can about a fairly specific situation.
I am trying to identify specific data points from SEC filings on a routine basis, but I want to automate this instead of having to manually search for a company's CIK ID and form filing. So far, I have been able to get to the point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:
index cik conm type date path
0 0 1000045 NICHOLAS FINANCIAL INC 10-Q 2019-02-14 edgar/data/1000045/0001193125-19-039489.txt
1 1 1000045 NICHOLAS FINANCIAL INC 4 2019-01-15 edgar/data/1000045/0001357521-19-000001.txt
2 2 1000045 NICHOLAS FINANCIAL INC 4 2019-02-19 edgar/data/1000045/0001357521-19-000002.txt
3 3 1000045 NICHOLAS FINANCIAL INC 4 2019-03-15 edgar/data/1000045/0001357521-19-000003.txt
4 4 1000045 NICHOLAS FINANCIAL INC 8-K 2019-02-01 edgar/data/1000045/0001193125-19-024617.txt
Despite having all this information, as well as being able to download these text files and see the underlying data, I am unable to parse this data because it is in XBRL format, which is a bit out of my wheelhouse. Instead I came across this script (kindly provided on this site: https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):
from bs4 import BeautifulSoup
import requests
import sys
# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'
# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text
# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']
# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()
# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text
# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']
# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text
# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)
Just running this script works exactly how I'd like it to. It returns the stockholders' equity for a given company (IBM in this case), and I can then take that value and write it to an Excel file.
My two-part question is this:
I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples - I think that's what it's called - which looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the initial part of the script I found, and loop through it efficiently so I can end up with a list of the desired values for each company, filing, and date?
Is there generally a better way to do this? I would think there would be some sort of API or Python package to query the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs, however I am interested in Form Ds, which are somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.
Thank you for the help!
You need to define a function, which can be essentially most of the code you have posted, and that function should take 3 keyword arguments (your 3 values). Then, rather than hard-coding the three values in your code, you just pass them in and return a result.
Then you take the list which you created and write a simple for loop around it to call the function you defined with those three values, and then do something with the result.
def get_data(value1, value2, value3):
    # your main code here, but replace the hard-coded cik/type/dateb with these arguments
    return content

for value1, value2, value3 in companies:  # companies is your list of (cik, type, date) tuples
    content = get_data(value1, value2, value3)
    # do something with content
Assuming you have a dataframe sec with correctly named columns for your list of filings above, you first need to extract the relevant information from the dataframe into three lists:
cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)
Then you create your base_url with the items inserted and get your data:
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
And go from there.
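"Going from there" might look like this (a sketch; parse_filing is a hypothetical helper wrapping the document-link, XBRL-link, and tag-extraction logic from the posted script):
results = []
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
    value = parse_filing(edgar_resp.text)  # hypothetical helper containing the rest of the posted script
    results.append((c, t, d, value))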

How can I find next sibling of 'a' tag which is inside into a table th tag?

I am scraping company data from Wikipedia infobox tables, where I need to scrape some values that are inside td elements - like Type, Traded as, Services, etc.
My code is:
response = requests.get(url,headers=headers)
html_soup = BeautifulSoup(response.text, 'lxml')
table_container = html_soup.find('table', class_='infobox')
hq_name=table_container.find("th", text=['Headquarters']).find_next_sibling("td")
It gives the Headquarters value and works perfectly.
But when I try to fetch 'Traded as' or any th element that contains a hyperlink, the above code does not work; it returns None.
So how do I get the next sibling of 'Traded as' or 'Type'?
From your comment:
https://en.wikipedia.org/wiki/IBM This is the URL, and the expected output will be: Traded as - NYSE: IBM, DJIA Component, S&P 100 Component, S&P 500 Component
Use the a tags to separate the items and select the required row from the table by nth-of-type. You can then join the first two items in the output list if required.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/IBM')
soup = bs(r.content, 'lxml')
items = [item.text.replace('\xa0',' ') for item in soup.select('.vcard tr:nth-of-type(4) a')]
print(items)
To get the output as shown (if the first and second items are indeed joined):
final = items[2:]
final.insert(0, '-'.join([items[0] , items[1]]))
final
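As for the original find_next_sibling approach: the text= match typically fails for rows like 'Traded as' because that th contains nested tags (the header text sits inside a link), so it is not matched as the tag's own string. A sketch of one workaround, matching the th by its rendered text instead (reusing table_container from the question's code):
# match the <th> by its full text, which works even when the header cell contains links
traded_as_th = table_container.find(
    lambda tag: tag.name == 'th' and 'Traded as' in tag.get_text()
)
if traded_as_th is not None:
    traded_as_td = traded_as_th.find_next_sibling('td')
    print(traded_as_td.get_text(' ', strip=True))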

Python - Combine two, single column lists into one dual column list and print

I'm just beginning to dabble with Python, and as many have done I am starting with a web-scraping example to try the language.
I have seen many examples of using zip and map to combine lists, but I am having issues getting that combined list to print.
Again, I am new so please be gentle.
The code gathers everything from 2 certain tag types (the date and title of a post) and returns them as 2 lists. For this I am using BeautifulSoup and requests.
The site I am practicing on for this test is the blog for a small game called 'Staxel'.
I can get my code to print a full list of one tag using soup.find and print in a for loop, but when I attempt to add a 2nd list to print, the script simply terminates with no error.
Any tips on how to correctly print the 2 lists?
I am looking for output like
Entry 2019-01-06 New Years
Entry 2018-11-30 Staxel Changelog for 1.3.52
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# Find the title and date elements and get their values
title_box = soup.find_all('h1',attrs={'class':'entry-title'})
date_box = soup.find_all('span',attrs={'class':'entry-date published'})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip()for date in date_box]
date_list = zip(dates, titles)
for heading in date_list:
    print("Entry {}")
The problem is that your query for dates is returning an empty list, so the zipped result will also be empty. To extract the date from that page, you want to look for tags of type time, not span, with class entry-date published, like this:
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
So with the following code:
import requests
from bs4 import BeautifulSoup
quote_page = "https://blog.playstaxel.com"
page = requests.get(quote_page)
soup = BeautifulSoup(page.content, "lxml")
title_box = soup.find_all("h1", attrs={"class": "entry-title"})
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]
for date, title in zip(dates, titles):
    print(f"{date}: {title}")
The result becomes:
2019-01-10: Magic update – feature preview
2019-01-06: New Years
2018-11-30: Staxel Changelog for 1.3.52
2018-11-13: Staxel Changelog for 1.3.49
2018-10-21: Staxel Changelog for 1.3.48
2018-10-12: Halloween Update & GOG
