I am very new in python and currently learning pandas toolkit and I want to extract table data from the web page for my following project
can anyone help me by providing the best toolkit name for data extraction and less time consuming as I am not focusing on that part.Thank you.
http://contentlinks.dionglobal.in/ib/closeprices.asp?Exchange=NSE&Startname=A
import requests
from BeautifulSoup import BeautifulSoup
req = requests.get('http://contentlinks.dionglobal.in/ib/closeprices.asp?Exchange=NSE&Startname=A')
soup = BeautifulSoup(req.text)
table = soup.findAll('table')
tr = table[-1].findAll('tr')
for i in tr:
# here you can extract data from tr
# and write your pandas code
print i # example
print '\n\n' # example
I am not sure, in which way you want to save, but this will help
Related
I need to download the Net Income of the s&p 500 companies from this website https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement
I wrote this part of code following an online guide (this one https://towardsdatascience.com/web-scraping-for-accounting-analysis-using-python-part-1-b5fc016a1c9a), but i can't figure out how to conlude it and, more specifically, how to download the extracted Net Income into an excel file.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement'
response = requests.get(url)
response
soup = BeautifulSoup(response.text, 'html.parser')
income_statement = soup.findAll('a')[19]
link = income_statement['href']
download_url = 'https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement/'+ link
Any suggestion would be very appreciated, thanks!
I think the correct way to tackle this mission is to use some stock market API instead of web scraping with BS4.
I'll recommend you to have a look at the following article, it also includes some practical examples:
https://towardsdatascience.com/best-5-free-stock-market-apis-in-2019-ad91dddec984
Edit:
If you decide to stick to the plan of using this exact URL you mentioned, I think you should try to use pandas, try something like this:
import pandas as pd
data = pd.read_html('https://www.macrotrends.net/stocks/charts/MMM/3m/income-statement​',skiprows=1)
You'll have to play with encoding a little bit as the table contains some non-ascii chars
I'm new to Python and I need to get the data from a table on a
Webpage and send to a list.
I've tried everything, and the best I got is:
f = urllib.request.urlopen(url)
url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
soup = BeautifulSoup(urllib.request.urlopen(url).read(),'lxml')
rows=list()
for tr in soup.findAll('table'):
rows.append(tr)
Any suggestions?
You're not that far !
First make sure to import the proper version of BeautifulSoup which is BeautifulSoup4 by doing apt-get install python3-bs4 (assuming you're on Ubuntu or Debian and running Python 3).
Then isolate the td elements of html table and clean data a bit. For example remove the first 3 elements of the lists which are useless, and remove the ugly '\n':
import urllib
from bs4 import BeautifulSoup
url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-enUS.asp?Data=11/22/2017&Data1=20171122&slcTaxa=APR#"
soup = BeautifulSoup(urllib.request.urlopen(url).read(),'lxml')
rows=list()
for tr in soup.findAll('table'):
for td in tr:
rows.append(td.string)
temp_list=rows[3:]
final_list=[element for element in temp_list if element != '\n']
I don't know which data you want to extract precisely. Now you need to work on your Python list (called final_list here)!
Hope it's clear.
There is a Dowload option at the end of the webpage. If you can download the file manually you are good to go.
If you want to access different dates automatically, and since it is JavaScript, I suggest to use Selenium to download the xlsx files through Python.
With the xlsx file you can use Xlsxwriter to read the data and do what you want.
Let's say we have a website www.example.com
and I need 5 certain elements from the website, I have found every element and declared them using BeautifulSoup.
g_data1 = soup.find_all("td", {"class": "title"})
for item in g_data1:
try
print item.****[3].text
except:
pass
Now I have to save this information in a CSV file named ****.csv
This is my code for trying to save it in the CSV file:
def save_csv(f, tvseries):
'''
Output a CSV file containing highest ranking TV-series.
'''
import urllib2
url = *example url*
response = urllib2.urlopen(url)
with open('****.csv', 'w') as f:
f.write(response.read())
Im getting the entire html website.. because i've obviously declared it to grab the url but can someone explain me a different kind of approach, because I don't really understand how to :L
with kind regards,
1337
You should be using Python's csv module.
Specifically the CSVWriter.
Take the text items you grabbed using BeautifulSoup and write them into the CSV file.
I'm working on a school project currently which aim goal is to analyze scam mails with the Natural Language Toolkit package. Basically what I'm willing to do is to compare scams from different years and try to find a trend - how does their structure changed with time.
I found a scam-database: http://www.419scam.org/emails/
I would like to download the content of the links with python, but I am stuck.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
links2 = soup.findAll(href=re.compile("index"))
print links2
So I can fetch the links but I don't know yet how can I download the content. Any ideas? Thanks a lot!
You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have href's from the links, you essentially need to open all of those links, and load their contents into data structures that you can then use for your analysis.
This essentially amounts to a very simple web-crawler. If you can use other people's code, you may find something that fits by googling "python Web crawler." I've looked at a few of those, and they are straightforward enough, but may be overkill for this task. Most web-crawlers use recursion to traverse the full tree of a given site. It looks like something much simpler could suffice for your case.
Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or give you for a sense for how the web crawling is done:
from BeautifulSoup import BeautifulSoup
import urllib2, re
emailContents = []
def analyze_emails():
# this function and any sub-routines would analyze the emails after they are loaded into a data structure, e.g. emailContents
def parse_email_page(link):
print "opening " + link
# open, soup, and parse the page.
#Looks like the email itself is in a "blockquote" tag so that may be the starting place.
#From there you'll need to create arrays and/or dictionaries of the emails' contents to do your analysis on, e.g. emailContents
def parse_list_page(link):
print "opening " + link
html = urllib2.urlopen(link).read()
soup = BeatifulSoup(html)
email_page_links = # add your own code here to filter the list page soup to get all the relevant links to actual email pages
for link in email_page_links:
parseEmailPage(link['href'])
def main():
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll(href=re.compile("20")) # I use '20' to filter links since all the relevant links seem to have 20XX year in them. Seemed to work
for link in links:
parse_list_page(link['href'])
analyze_emails()
if __name__ == "__main__":
main()
Here,
http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500
There is a table. My goal is to extract the table and save it to a csv file. I wrote a code:
import urllib
import os
web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
web.close()
ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()
I lost from here. Anyone who can help on this? Thanks!
Pandas can do this right out of the box, saving you from having to parse the html yourself. read_html() extracts all tables from your html and puts them in a list of dataframes. to_csv() can be used to convert each dataframe to a csv file. For the web page in your example, the relevant table is the last one, which is why I used df_list[-1] in the code below.
import requests
import pandas as pd
url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
It's simple enough to do in one line, if you prefer:
pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)
P.S. Just make sure you have lxml, html5lib, and BeautifulSoup4 packages installed in advance.
So essentially you want to parse out html file to get elements out of it. You can use BeautifulSoup or lxml for this task.
You already have solutions using BeautifulSoup. I'll post a solution using lxml:
from lxml import etree
import urllib.request
web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[#id="Report1_dgReportDemographic"]/tr')
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
I would recommend BeautifulSoup as it has the most functionality. I modified a table parser that I found online that can extract all tables from a webpage, as long as there are no nested tables. Some of the code is specific to the problem I was trying to solve, but it should be pretty easy to modify for your usage. Here is the pastbin link.
http://pastebin.com/RPNbtX8Q
You could use it as follows:
from urllib2 import Request, urlopen, URLError
from TableParser import TableParser
url_addr ='http://foo/bar'
req = Request(url_addr)
url = urlopen(req)
tp = TableParser()
tp.feed(url.read())
# NOTE: Here you need to know exactly how many tables are on the page and which one
# you want. Let's say it's the first table
my_table = tp.get_tables()[0]
filename = 'table_as_csv.csv'
f = open(filename, 'wb')
with f:
writer = csv.writer(f)
for row in table:
writer.writerow(row)
The code above is an outline, but if you use the table parser from the pastbin link you should be able to get to where you want to go.
You need to parse the table into an internal data structure and then output it in CSV form.
Use BeautifulSoup to parse the table. This question is about how to do that (the accepted answer uses version 3.0.8 which is out of date by now, but you can still use it, or convert the instructions to work with BeautifulSoup version 4).
Once you have the table in a data structure (probably a list of lists in this case) you can write it out with csv.write.
Look at BeautifulSOup module. In documentation you will find many examples of parsing html.
Also for csv you have ready solution - csv module.
It should be quite easy.
Look at this answer parsing table with BeautifulSoup and write in text file.
Also use google with next words "python beautifulsoup"