Unable to write data across columns in a csv file - python

I've written a script in python to scrape different names and their values out of a table from a webpage and write them to a csv file. The script below can parse them flawlessly, but I can't write them to a csv file in the customized manner I want.
What I wish to do is write the names and values across columns, as shown in the second image.
This is my attempt:
import csv
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.bloomberg.com/markets/stocks", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

with open("outputfile.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    for table in soup.select(".data-table-body tr"):
        name = table.select_one("[data-type='full']").text
        value = table.select_one("[data-type='value']").text
        print(f'{name} {value}')
        writer.writerow([name, value])
The output I'm getting looks like the first image.
The output I wish to get looks like the second image.
Any help to solve this will be vastly appreciated.

Try defining an empty list, appending all the values in the loop, and then writing them all out at once:
with open("outputfile.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    names_and_values = []
    for table in soup.select(".data-table-body tr"):
        name = table.select_one("[data-type='full']").text
        value = table.select_one("[data-type='value']").text
        print(f'{name} {value}')
        names_and_values.extend([name, value])
    writer.writerow(names_and_values)

If I understand you correctly, try making just one call to writerow instead of one call per loop iteration:
import csv
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.bloomberg.com/markets/stocks", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

with open("outputfile.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    data = []
    for table in soup.select(".data-table-body tr"):
        name = table.select_one("[data-type='full']").text
        value = table.select_one("[data-type='value']").text
        print(f'{name} {value}')
        data.extend([name, value])
    writer.writerow(data)

That seems like an ugly thing to want to do, are you sure?
Use pandas for getting csvs and manipulating tables. You'll want to do something like:
import pandas as pd

df = pd.read_csv(path)  # read the csv into a DataFrame
df.values.ravel()       # flatten its rows into a single 1-D array
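For instance, here is a minimal sketch of the full round trip, assuming the outputfile.csv written by the script above holds one name/value pair per row (the flattened.csv output name is made up for illustration):

import csv
import pandas as pd

# read the two-column csv produced by the original script
df = pd.read_csv("outputfile.csv", header=None)

# ravel flattens row-wise: [name1, value1, name2, value2, ...]
flat = df.values.ravel().tolist()

# write everything as one wide row
with open("flattened.csv", "w", newline="") as out:
    csv.writer(out).writerow(flat)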


How to extract specific table data (div\tr\td) from multiple URLs on a website into CSV in an iterative way (with sample)

I am learning Python and practicing by extracting data from a public site, but I ran into a problem along the way and would appreciate your help. Thanks in advance! I will check this thread daily for your comments :)
Purpose:
Extract the rows, columns, and contents of all 65 pages into one csv with a single script.
The 65 page URLs follow this loop rule:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65
Question 1:
When running the one-page script below to extract one page of data into a csv, I had to run it twice with different filenames before the data appeared in the file from the first run.
For example, if I run it with test.csv, the file stays at 0 KB; after I change the filename to test2.csv and run the script again, the data shows up in test.csv, but test2.csv stays empty at 0 KB. Any idea?
Here is the one-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs

url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))

divs = soup.find_all("div", class_="iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
Question 2:
I have a problem iterating over the 65 page URLs to extract the data into a csv.
It doesn't work... any idea how to fix it?
Here is the 65-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"

def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1, 65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_="iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv", "w", newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
Just an alternative approach:
Try to keep it simple and use pandas, because it will do all these things for you under the hood.
define a list (data) to keep your results
iterate over the urls with pd.read_html
concat the data frames in data and write them with to_csv or to_excel
read_html
find the table that matches a string -> match='预售信息查询:' and select it with [0], because read_html() will always give you a list of tables
take a specific row as the header with header=2
get rid of the last row (navigation) and the last column (caused by the wrong colspan) with .iloc[:-1, :-1]
Example
import pandas as pd

data = []
for pageNo in range(1, 5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1, :-1])

pd.concat(data).to_csv('test.csv', index=False)
Example (based on your code with function)
import pandas as pd

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="

def get_data(url):
    data = []
    for pageNo in range(1, 2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1, :-1])
    pd.concat(data).to_csv('test.csv', index=False)

if __name__ == '__main__':
    get_data(url)

Getting headers from html (parsing)

The source is https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. I am looking to use the table called "COVID-19 pandemic in the United States by state and territory" which is the third diagram on the page.
Here is my code so far
from bs4 import BeautifulSoup
import pandas as pd

with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
    soup = BeautifulSoup(fd)
    print(soup.prettify())

all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))

sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))

data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))

header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
The last line with header1 is giving me the error "list index out of range". What it is supposed to print is "U.S State or territory.....".
I don't know anything about html, and everything gets me stuck and confused. The soup.find could also be referencing the wrong part of the webpage.
Can you just use
headers = [element.text.strip() for element in data_table.find_all("th")]
to get the text in the headers?
To get the entire table as a pandas dataframe, you can do:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_file)
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
rows = data_table.find_all("tr")
# Delete first row as it's not part of the table and confuses pandas
# this removes it from both soup and data_table
rows[0].decompose()
# Same for third row
rows[2].decompose()
# Same for last two rows
rows[-1].decompose()
rows[-2].decompose()
# Read html with pandas
df = pd.read_html(str(data_table))[0]
# Keep only the useful columns
df = df[['U.S. state or territory[i].1', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]']]
# Rename columns
df.columns = ["State", "Cases", "Deaths", "Recov.", "Hosp."]
It's probably easier in these cases to try to read tables with pandas, and go from there:
import pandas as pd
table = soup.select_one("div#covid19-container table")
df = pd.read_html(str(table))[0]
df
The output is the target table.
Looking at your code, I think you should fetch the title tag with find rather than find_all: find returns a single tag, while find_all returns a list.
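A minimal sketch of the difference, using a tiny hypothetical document:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><th>A</th><th>B</th></table>", "html.parser")

first_th = soup.find("th")         # a single Tag (or None if absent)
all_th = soup.find_all("th")       # always a list, possibly empty

print(first_th.text)               # -> A
print([th.text for th in all_th])  # -> ['A', 'B']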

how to put this distorted data into the csv file, in the table format

Please find below the code that I am using to put distorted data into CSV file in the table format:
import requests
from bs4 import BeautifulSoup
import csv

f = open('moneyControl-bonus', 'w', newline='')
writer = csv.writer(f)
f2 = open('moneyControl-dividend', 'w', newline='')
writer2 = csv.writer(f2)

url = 'https://www.moneycontrol.com/stocks/marketinfo/upcoming_actions/home.html'
headers = {'user-agent': 'Mozilla/5.0'}
response = requests.get(url, headers)
soup = BeautifulSoup(response.content, 'lxml')

div = soup.find_all('div', class_='tbldata36 PT10')[0]
for table in div.find_all('table'):
    for row in table.find_all('tr'):
        writer.writerow[row]

div2 = soup.find_all('div', class_='tbldata36 PT20')[0]
for table2 in soup.find_all('table'):
    for row2 in table2.find_all('tr'):
        writer2.writerow[row2]
Have you tried using pandas? It is the go-to library for writing the CSV format.
# pip install pandas
import pandas as pd
You can build the DataFrame either from a list of lists or from a dictionary, and then write it to a CSV file:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['a', 'b', 'c', 'd']})
df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']], columns=['col1', 'col2'])
df.to_csv(path_and_name_of_file)
You can use many other formats with that as well, such as Excel, text, and JSON.
Please take a look at the official DataFrame documentation.
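As a rough sketch of how this could apply to your page, reusing the URL and div class from the question (whether read_html can parse the table inside that div is an assumption):

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.moneycontrol.com/stocks/marketinfo/upcoming_actions/home.html'
response = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.content, 'lxml')

# grab the first table inside the bonus div and let pandas parse it
div = soup.find_all('div', class_='tbldata36 PT10')[0]
bonus_df = pd.read_html(str(div))[0]
bonus_df.to_csv('moneyControl-bonus.csv', index=False)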

Saving two columns on csv file after running two separate for loops

I have the code below, which scrapes web data using BeautifulSoup. I am using two different for loops to grab two different sets of data: name and value.
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://finance.yahoo.com/quote/' + ticker + '/key-statistics?p=' + ticker).text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('yahoo_key_stats_grab.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['name', 'value'])

def yahoo_key_stats_grab(ticker):
    for stat in soup.find_all('span')[12:21]:
        name = stat.text
        print(name)
        csv_writer.writerow([name])
    for stat in soup.find_all('td', class_='Fz(s) Fw(500) Ta(end)'):
        if len(str(stat.text)) > 6:
            break
        else:
            print(stat.text)
    csv_file.close()
If I run yahoo_key_stats_grab('MIC'), I get the following output, which is exactly what I want:
Market Cap (intraday)
Enterprise Value
Trailing P/E
Forward P/E
PEG Ratio (5 yr expected)
Price/Sales
Price/Book
Enterprise Value/Revenue
Enterprise Value/EBITDA
3.23B
6.8B
6.95
16.04
1.64
1.73
1.04
3.65
10.80
However, I would like to save the scraped data in a csv file with two columns, name and value. I can get the name column, but I can't figure out how to add the second column, value, to the csv file:
name value
Market Cap (intraday)
Enterprise Value
Trailing P/E
Forward P/E
PEG Ratio (5 yr expected)
Price/Sales
Price/Book
Enterprise Value/Revenue
Enterprise Value/EBITDA
Can anyone give me some suggestions?
Thanks in advance.
You can write multiple columns to a csv file by passing a list to the writer's writerow() method.
Example:
import csv

data = [["key1", "value1"], ["key2", "value2"]]
csv_file = open('testfile.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['name', 'value'])
for row in data:
    csv_writer.writerow(row)
csv_file.close()
Update: In your case, since you have two different for loops creating your data, you can store the first set of data in a list:
from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://finance.yahoo.com/quote/' + ticker + '/key-statistics?p=' + ticker).text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('yahoo_key_stats_grab.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['name', 'value'])

def yahoo_key_stats_grab(ticker):
    names = []
    for stat in soup.find_all('span')[12:21]:
        names.append(stat.text)
    for stat in soup.find_all('td', class_='Fz(s) Fw(500) Ta(end)'):
        if len(str(stat.text)) > 6:
            break
        else:
            csv_writer.writerow([names.pop(0), stat.text])
            # note that this will throw an exception if there
            # are a different number of names and stats!
    csv_file.close()
It might not be the best option, but what will work is to append the values to lists inside the for loops you have running, and then write out what you need from the values you collected. Something like:
field = []
value = []
for stat in soup.find_all('span')[12:21]:
    name = stat.text
    print(name)
    field.append(name)
for stat in soup.find_all('td', class_='Fz(s) Fw(500) Ta(end)'):
    if len(str(stat.text)) > 6:
        break
    else:
        value.append(stat.text)
Then write them out with a new for loop, using csv_writer to put each name and value on a single line, separated by whatever separator you want for the csv; a sketch follows below.
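For example, pairing the two lists with zip (a minimal sketch, assuming the field and value lists and the csv_writer from the question are in scope):

for name, val in zip(field, value):
    csv_writer.writerow([name, val])
csv_file.close()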

Parsing html data into python list for manipulation

I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  # build url
text_soup = BeautifulSoup(urlopen(full_url).read())  # read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not good practice to use regex for parsing html. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and EPS (Basic) text in it, then iterate over the next siblings with the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read())  # read in

titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
I would take a very different approach. We use lxml for scraping html pages.
One of the reasons we switched was that BS was not being maintained for a while, or I should say updated.
In my test I ran the following:
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
now I noticed that the first row has the column headings, so I want to separate all of the rows
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
now lets get the column headings:
column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally we can map the column headings to the row labels and cell values
my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
now to access the results
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do; for example, I did not test for squareness (that the number of cells in each row is equal). A quick check is sketched below.
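A minimal sketch of such a check, assuming the table_rows and column_headings built above are in scope (and that every data row should have one td per column heading):

for numb, row in enumerate(table_rows[1:]):
    cells = [e for e in row.iter() if e.tag == 'td']
    if len(cells) != len(column_headings):
        print('row {} has {} cells, expected {}'.format(numb + 1, len(cells), len(column_headings)))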
Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and gets you everything from the table in a nicely structured manner.
I should add that every row from the table is available, in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, and so on.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC-IRVINE.
I hope this helps.
from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd

url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content': 'true'})

data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])

df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code gives you a DataFrame from the webpage and then uses it to create a Python dictionary.
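As a side note on to_dict (a small sketch with made-up data; orient is part of pandas' documented API): the default is column-oriented, while orient='records' gives one dictionary per row, which is often closer to what you want for table data.

import pandas as pd

df = pd.DataFrame([['Cash', '100'], ['Debt', '50']], columns=['item', 'value'])

print(df.to_dict())                  # {'item': {0: 'Cash', 1: 'Debt'}, 'value': {0: '100', 1: '50'}}
print(df.to_dict(orient='records'))  # [{'item': 'Cash', 'value': '100'}, {'item': 'Debt', 'value': '50'}]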
