Looking to get a JSON and CSV file from this - Python

import urllib.request
import bs4 as bs
sauce = urllib.request.urlopen("http://www.nhl.com/scores/htmlreports/20172018/TH020070.HTM").read()
soup = bs.BeautifulSoup(sauce, "html.parser")
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
I am trying to output this to a CSV and a JSON file. How would I do each (not at the same time)? Eventually, once I have it properly formatted, I would like to load it straight into Postgres. I'm new to Python, so any help and suggestions would be appreciated. I previously got help outputting to CSV with pandas, but I can't get it to format the way I would like using pandas, although I've been told it's much easier.

Assuming you want to output the row variable from each iteration to JSON or CSV:
For JSON, you can simply dump the list of all rows to a JSON file. Something like:
import json
# Your scraping logic from above builds table_rows
rows = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    rows.append(row)

with open("out", "w") as fp:
    json.dump(rows, fp)
For CSV, you can use similar logic as well.
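For example, a minimal sketch that reuses the rows list built in the loop above (the filename out.csv is just a placeholder):
import csv

# rows is the list of lists built by the scraping loop above
with open("out.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerows(rows)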
Check out the documentation:
https://docs.python.org/3/library/csv.html
https://docs.python.org/3/library/json.html
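Since the end goal is to load the data into Postgres, one common route is to put the rows into a pandas DataFrame and use to_sql. This is only a sketch: it assumes SQLAlchemy and a Postgres driver such as psycopg2 are installed, and the connection string and table name are placeholders to replace with your own.
import pandas as pd
from sqlalchemy import create_engine

# rows is the list of lists built by the scraping loop above
df = pd.DataFrame(rows)

# Placeholder connection string and table name -- adjust to your database
engine = create_engine("postgresql://user:password@localhost:5432/mydb")
df.to_sql("toi_report", engine, if_exists="replace", index=False)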

Related

How to extract specific table data (div/tr/td) from multiple URLs on a website iteratively into a CSV (with sample)

I am learning Python and practicing by extracting data from a public site, but I ran into a problem and would like your kind help.
Thanks in advance! I will check this thread daily for your comments :)
Purpose:
extract the rows, columns, and contents of all 65 pages into one CSV with a single script
The 65 page URLs follow this loop pattern:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65
Question 1:
When running the one-page script below to extract a single page's data into a CSV, I had to run it twice with different filenames before any data appeared.
For example, if I run it with test.csv, the file stays at 0 KB; after I change the filename to test2.csv and run the script again, the data shows up in test.csv, but test2.csv stays empty at 0 KB. Any idea why?
Here is the one-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))
divs = soup.find_all("div", class_ = "iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
Question 2:
I ran into a problem iterating over the 65 page URLs to extract the data into a CSV.
It doesn't work... any idea how to fix it?
Here is the code for iterating over all 65 page URLs:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"
def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1,65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_ = "iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv","w",newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
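A note on both questions before the alternative below. For Question 1, the 0 KB file is most likely because the file object created inline with open() is never closed, so the buffered rows are never flushed to disk; opening the file in a with block (as the loop version already does) avoids that. For Question 2, url.format(pageNo) only produces a string, but the code then calls .content on that string; each page still has to be fetched with requests.get. A minimal sketch of the corrected loop, keeping the original structure and the writer defined in the __main__ block:
def get_data(url):
    for page_url in [url.format(pageNo) for pageNo in range(1, 66)]:  # pages 1..65
        response = requests.get(page_url)            # fetch each page first
        soup = bs(response.content, 'html.parser')
        for div in soup.find_all("div", class_="iiright"):
            for tr in div.find_all("tr")[1:]:        # skip the header row
                data = [td.text.strip() for td in tr.find_all("td")]
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)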
Just an alternative approach
Try to keep it simple and use pandas, because it will do all of these things for you under the hood.
define a list (data) to hold your results
iterate over the URLs with pd.read_html
concat the data frames in data and write them with to_csv or to_excel
read_html
finds the table that matches the string match='预售信息查询:'; select it with [0], because read_html() always returns a list of tables
takes a specific row as the header with header=2
drops the last row (the navigation row) and the last column (caused by the wrong colspan) with .iloc[:-1,:-1]
Example
import pandas as pd
data = []
for pageNo in range(1,5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])
pd.concat(data).to_csv('test.csv', index=False)
Example (based on your code with function)
import pandas as pd
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="
def get_data(url):
    data = []
    for pageNo in range(1,2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])
    pd.concat(data).to_csv('test.csv', index=False)

if __name__ == '__main__':
    get_data(url)

Combine multiple lists into one organized csv using bs4

I am new to this, I am using this as a learning opportunity, and have only gotten this far thanks to this community's help. I am trying to grab multiple sections from pages like this:
https://m.the-numbers.com/movie/Black-Panther
specifically the summary, starring cast, and supporting cast
I have been successful writing one list to a CSV, but cannot seem to find a way to write multiple lists. I am looking for a solution that is scalable, where I can keep adding more lists to the export.
Things I have tried:
putting them in separate lists such as details and actors, using the same list with details.extend, etc., and nothing seems to work.
Expected result is producing a table such as:
HEADERS:
title, amount,starName,StarCharacter
with the data listed underneath.
ERROR:
AttributeError: 'str' object has no attribute 'keys'
import requests
from bs4 import BeautifulSoup
import csv
import re
# Making get request
r = requests.get('https://m.the-numbers.com/movie/Black-Panther')
# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')
# Localizing table from the BS object
table_soup = soup.find('div', class_='row').find('div', class_='table-responsive').find('table', id='movie_finances')
website = 'https://m.the-numbers.com/'
details = []
# Iterating through all trs in the table except the first(header) and the last two(summary) rows
for tr in table_soup.find_all('tr')[2:4]:
    tds = tr.find_all('td')
    # Creating dict for each row and appending it to the details list
    details.extend({
        'title': tds[0].text.strip(),
        'amount': tds[1].text.strip(),
    })
cast_soup = soup.find('div', id='accordion').find('div', class_='cast_new').find('table', class_='table table-sm')
for tr in cast_soup.find_all('tr')[2:15]:
    tdc = tr.find_all('td')
    # Creating dict for each row and appending it to the details list
    details.append({
        'starName': tdc[0].text.strip(),
        'starCharacter': tdc[1].text.strip(),
    })
# Writing details list of dicts to file using csv.DictWriter
with open('moviesPage2018.csv', 'w', encoding='utf-8', newline='\n') as csv_file:
writer = csv.DictWriter(csv_file, fieldnames=details[0].keys())
writer.writeheader()
writer.writerows(details)```
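The AttributeError most likely comes from details.extend({...}): extend() iterates over the dict and adds only its keys (plain strings) to the list, so details[0] ends up being a string and details[0].keys() fails. Using details.append({...}) for the title/amount rows as well keeps every element a dict, which is what csv.DictWriter expects. A minimal sketch of the difference (the values here are just illustrative):
details = []
details.extend({'title': 'Production Budget', 'amount': '$200,000,000'})
print(details)               # ['title', 'amount'] -- only the keys were added

details = []
details.append({'title': 'Production Budget', 'amount': '$200,000,000'})
print(details[0].keys())     # dict_keys(['title', 'amount']) -- the dict is intact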

Unable to write data across columns in a csv file

I've written a script in Python to scrape different names and their values from a table on a webpage and write them to a CSV file. My script below parses them flawlessly, but I can't write them to the CSV file in the customized layout I want.
What I wish to do is write the names and values across columns, as shown in image 2.
This is my try:
import csv
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.bloomberg.com/markets/stocks",headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")
with open("outputfile.csv","w",newline="") as infile:
writer = csv.writer(infile)
for table in soup.select(".data-table-body tr"):
name = table.select_one("[data-type='full']").text
value = table.select_one("[data-type='value']").text
print(f'{name} {value}')
writer.writerow([name,value])
The output I'm getting has each name and value pair on its own row.
How I wish to get the output is with all the names and values written across a single row.
Any help to solve this will be vastly appreciated.
Try defining an empty list, appending all the values in a loop, and then writing them all at once:
with open("outputfile.csv","w",newline="") as infile:
writer = csv.writer(infile)
names_and_values = []
for table in soup.select(".data-table-body tr"):
name = table.select_one("[data-type='full']").text
value = table.select_one("[data-type='value']").text
print(f'{name} {value}')
names_and_values.extend([name,value])
writer.writerow(names_and_values)
If I understand you correctly, try making just one call to writerow instead of one per loop iteration:
import csv
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.bloomberg.com/markets/stocks",headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")
with open("outputfile.csv","w",newline="") as infile:
writer = csv.writer(infile)
data = []
for table in soup.select(".data-table-body tr"):
name = table.select_one("[data-type='full']").text
value = table.select_one("[data-type='value']").text
print(f'{name} {value}')
data.extend([name, value])
writer.writerow(data)
That seems like an ugly thing to want to do, are you sure?
Use pandas for reading CSVs and manipulating tables. You'll want to do something like:
import pandas as pd
df = pd.read_csv(path)
df.values.ravel()
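If the goal really is one wide row, a possible way to finish that pandas approach (a sketch; path and the output filename are placeholders) is to flatten the frame with ravel() and write it back out as a single row:
import pandas as pd

df = pd.read_csv(path)                 # path is a placeholder for your CSV
flat = df.values.ravel()               # flatten every cell into one 1-D array
pd.DataFrame([flat]).to_csv("flat.csv", index=False, header=False)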

Parse complex multi-header html table with pandas and bs4

Complex Table link
I have used the bs4, pandas, and lxml libraries to parse the HTML table above, but I am not having success. With pandas I tried skipping rows and setting the header to 0, but the result is a highly unstructured DataFrame, and it also seems that some data is missing.
With the other two libraries I tried using selectors and even the XPath of the tbody section, but I receive an empty list in both cases.
This is what I want to retrieve:
Can anyone give me a hand with how I can scrape that data?
Thank you!
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
page = urlopen('https://transparency.entsoe.eu/generation/r2/actualGenerationPerProductionType/show?name=&defaultValue=true&viewType=TABLE&areaType=BZN&atch=false&datepicker-day-offset-select-dv-date-from_input=D&dateTime.dateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&dateTime.endDateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&area.values=CTY%7C10YES-REE------0!BZN%7C10YES-REE------0&productionType.values=B01&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B20&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&dateTime.timezone=UTC&dateTime.timezone_input=UTC')
soup = BeautifulSoup(page.read())
table = soup.find('tbody')
res = []
row = []
for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        row.append(td.text)
    res.append(row)
    row = []

df = pd.DataFrame(data=res)
Then add column names with df.columns and drop empty columns.
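For example, a minimal sketch of that last step (the column names are placeholders; use the real header labels from the page, and the list length must match the number of columns):
# Placeholder column names -- replace with the real header labels
df.columns = ['col_{}'.format(i) for i in range(len(df.columns))]

# Drop columns that are entirely empty
df = df.dropna(axis=1, how='all')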
EDIT: Suggested modified for-loop (BillBell):
>>> for tr in table.find_all('tr'):
...     for td in tr.find_all('td'):
...         row.append(td.text.strip())
...     res.append(row)
...     row = []
The original form of the for statement failed to compile.
The original form of the append left newlines and blanks in the extracted strings.

Get column from a table with Python and Beautiful Soup

I am new to Python and I want to get the "price" column of data from a table, but I'm unable to retrieve that data.
Currently what I'm doing:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr"):
col = row.find_all("td")
print(col[2])
print("---")
I keep getting a "list index out of range" error. I've read the documentation and tried a few different approaches, but I can't seem to get it working.
Also, I am using Python 3.
The problem is that you are iterating over all tr elements inside the table, and there is one header tr at the beginning that you don't need, so just skip it:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr")[1:]:
col = row.find_all("td")
print(col[2])
print("---")
That error probably means that one of the rows has no td tags. You could wrap the print (or whatever uses col[2]) in a try/except block and ignore rows where col is empty or has fewer than three items:
for row in table.find_all("tr"):
col = row.find_all("td")
try:
print(col[2])
print("---")
except IndexError:
pass
