Combine multiple lists into one organized csv using bs4 - python

I am new to this, am using it as a learning opportunity, and have only gotten this far thanks to this community's help. I am trying to grab multiple sections from pages like this one:
https://m.the-numbers.com/movie/Black-Panther
specifically the summary, starring cast, and supporting cast.
I have been successful writing one list to CSV, but cannot seem to find a way to write multiple lists. I am looking for a scalable solution, where I can keep adding more lists to the export.
Things I have tried:
putting them in separate lists such as details and actors, using the same list with details.extend, etc., and nothing seems to work.
Expected result is producing a table such as:
HEADERS:
title, amount, starName, starCharacter
with the data listed underneath.
ERRORS:
Exception has occurred: AttributeError: 'str' object has no attribute 'keys'
import requests
from bs4 import BeautifulSoup
import csv
import re
# Making get request
r = requests.get('https://m.the-numbers.com/movie/Black-Panther')
# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')
# Localizing table from the BS object
table_soup = soup.find('div', class_='row').find('div', class_='table-responsive').find('table', id='movie_finances')
website = 'https://m.the-numbers.com/'
details = []
# Iterating through all trs in the table except the first(header) and the last two(summary) rows
for tr in table_soup.find_all('tr')[2:4]:
    tds = tr.find_all('td')
    # Creating dict for each row and appending it to the details list
    details.extend({
        'title': tds[0].text.strip(),
        'amount': tds[1].text.strip(),
    })
cast_soup = soup.find('div', id='accordion').find('div', class_='cast_new').find('table', class_='table table-sm')
for tr in cast_soup.find_all('tr')[2:15]:
    tdc = tr.find_all('td')
    # Creating dict for each row and appending it to the details list
    details.append({
        'starName': tdc[0].text.strip(),
        'starCharacter': tdc[1].text.strip(),
    })
# Writing details list of dicts to file using csv.DictWriter
with open('moviesPage2018.csv', 'w', encoding='utf-8', newline='\n') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=details[0].keys())
    writer.writeheader()
    writer.writerows(details)
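The AttributeError above comes from details.extend({...}): extending a list with a dict adds the dict's keys (plain strings) to the list, so details[0] ends up being a string and has no .keys(). Below is a minimal sketch of one scalable way to write several kinds of rows to one file, assuming you switch to append and give DictWriter the full set of column names up front; the placeholder dicts are illustrative, not real scraped data:

import csv

# illustrative rows; in practice these come from the scraping loops above
details = [
    {'title': 'example title', 'amount': 'example amount'},
    {'starName': 'example actor', 'starCharacter': 'example character'},
]

# list every column you ever expect to write; add more names here as you add lists
fieldnames = ['title', 'amount', 'starName', 'starCharacter']

with open('moviesPage2018.csv', 'w', encoding='utf-8', newline='') as csv_file:
    # restval fills in the columns a given row does not have
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(details)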

Related

How to extract specific table data (div\tr\td) from multiple URLs in a website in a literate way into CSV (with sample)

I am learning Python and practicing by extracting data from a public site, but I have run into a problem and would appreciate your help. Thanks in advance! I will check this thread daily for your comments :)
Purpose:
extract the rows, columns, and contents of all 65 pages into one CSV with a single script
The 65 page URLs follow this pattern:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65
Question 1:
When running the one-page script below to extract a single page into a CSV, I had to run it twice with different file names before the data appeared in the file from the first run. For example, if I run it with test.csv, the file stays at 0 KB; after I change the filename to test2.csv and run the script again, the data shows up in test.csv, but test2.csv stays empty at 0 KB. Any idea?
Here is the one-page extraction code:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))
divs = soup.find_all("div", class_="iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
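One likely cause of the 0 KB file in Question 1: the file object returned by open() is never closed, so the buffered rows may not be flushed to disk while the script runs. A minimal sketch of the same one-page script with the writer inside a with block (same URL and selectors as above):

import requests
import csv
from bs4 import BeautifulSoup as bs

resp = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(resp.content, 'html.parser')

# the with block closes the file and flushes the buffer when it exits
with open("test.csv", 'w', newline='') as f:
    csv_writer = csv.writer(f)
    for div in soup.find_all("div", class_="iiright"):
        for tr in div.find_all("tr")[1:]:
            data = [td.text.strip() for td in tr.find_all("td")]
            if data:
                print("Inserting data: {}".format(','.join(data)))
                csv_writer.writerow(data)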
Question 2:
I have a problem iterating over the 65 page URLs to extract the data into a CSV.
It doesn't work... any idea how to fix it?
Here is the extraction code for the 65 pages:
import requests
import csv
from bs4 import BeautifulSoup as bs
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"
def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1,65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_="iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                if data:
                    print("Inserting data: {}".format(','.join(data)))
                    writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv", "w", newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
Just an alternative approach
Try to keep it simple and consider using pandas, because it will do all of these things for you under the hood:
define a list (data) to keep your results
iterate over the URLs with pd.read_html
concat the data frames in data and write them with to_csv or to_excel
read_html
find the table that matches a string -> match='预售信息查询:' and select it with [0], because read_html() always gives you a list of tables
take a specific row as the header with header=2
get rid of the last row (navigation) and of the last column caused by the wrong colspan with .iloc[:-1,:-1]
Example
import pandas as pd

data = []
for pageNo in range(1,5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])

pd.concat(data).to_csv('test.csv', index=False)
Example (based on your code with a function)
import pandas as pd

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="

def get_data(url):
    data = []
    for pageNo in range(1,2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1,:-1])
    pd.concat(data).to_csv('test.csv', index=False)

if __name__ == '__main__':
    get_data(url)

Beautiful soup: processing cell data using Python

I am using Python 2.7 with BeautifulSoup to read in a simple HTML table.
After reading in the table, I then try to access the returned data.
As far as I can see, a Python list object is returned, but when I try to access the data using statements such as cell = row[0] I get an "IndexError: list index out of range" error.
from bs4 import BeautifulSoup

# read in HTML data
html = open("in.html").read()
soup = BeautifulSoup(html, "lxml")
table = soup.find("table")

output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

# process some cell data
for row in output_rows:
    name = row[0]
    print name
    # fails with list index out of range error
I have come up with this code to parse each cell of a row into a variable, which I can then process further, but it's not very elegant... ideas for a more elegant solution, suitable for newbies, are welcome.
for x in range(len(row)):
    if x == 0:
        name = row[x]
        print name
    if x == 1:
        address = row[x]
        print address
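The IndexError usually means one of the rows produced no td cells at all (for example, a header row made of th tags), so row[0] does not exist. A slightly tidier sketch, assuming each data row has at least a name and an address column:

# skip rows that yielded no <td> cells, then unpack the first two columns
for row in output_rows:
    if len(row) < 2:
        continue
    name, address = row[0], row[1]
    print name
    print address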

Python BeautifulSoup Extracting Data From Header

This is a follow-up from another question. Thanks for the help so far.
I've got some code to loop through a page and create a dataframe. I'm trying to add a third piece of information, but it is contained within the header, so it just returns blank. The level information is contained in the td and h3 parts of the code. It returns the error "AttributeError: 'NoneType' object has no attribute 'text'". If I change level.h3.text to level.h3 it will run, but then the dataframe contains the full tags instead of just the number.
import urllib.request
import bs4 as bs
import pandas as pd
#import csv as csv

sauce = urllib.request.urlopen('https://us.diablo3.com/en/item/helm/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

item_details = soup.find('tbody')
names = item_details.find_all('div', class_='item-details')
types = item_details.find_all('ul', class_='item-type')
#levels = item_details.find_all('h3', class_='subheader-3')
levels = item_details.find_all('td', class_='column-level align-center')
print(levels)

mytable = []
for name, type, level in zip(names, types, levels):
    mytable.append((name.h3.a.text, type.span.text, level.h3.text))

export = pd.DataFrame(mytable, columns=('Item', 'Type', 'Level'))
Try to modify your code as below:
for name, type, level in zip(names, types, levels):
    mytable.append((name.h3.a.text, type.span.text, level.h3.text if level.h3 else "No level"))
Now "No level" (you can use "N/A", None, or whatever you like most) will be added as the third value whenever there is no level (no header).

Unable to write data across columns in a csv file

I've written a script in Python to scrape names and their values from a table on a webpage and write them to a CSV file. The script below can parse them flawlessly, but I can't write them to the CSV file in the customized layout I'm after.
What I wish to do is write the names and values across columns, as shown in image 2.
This is my attempt:
import csv
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.bloomberg.com/markets/stocks", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

with open("outputfile.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    for table in soup.select(".data-table-body tr"):
        name = table.select_one("[data-type='full']").text
        value = table.select_one("[data-type='value']").text
        print(f'{name} {value}')
        writer.writerow([name, value])
The output I'm getting looks like this:
How I wish to get the output is like the following:
Any help solving this will be vastly appreciated.
Try defining an empty list, appending all the values in a loop, and then writing them all at once:
with open("outputfile.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    names_and_values = []
    for table in soup.select(".data-table-body tr"):
        name = table.select_one("[data-type='full']").text
        value = table.select_one("[data-type='value']").text
        print(f'{name} {value}')
        names_and_values.extend([name, value])
    writer.writerow(names_and_values)
If I understand you correctly, try making just one call to writerow instead of one per loop iteration:
import csv
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.bloomberg.com/markets/stocks", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

with open("outputfile.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    data = []
    for table in soup.select(".data-table-body tr"):
        name = table.select_one("[data-type='full']").text
        value = table.select_one("[data-type='value']").text
        print(f'{name} {value}')
        data.extend([name, value])
    writer.writerow(data)
That seems like an ugly thing to want to do, are you sure?
Use pandas for reading CSVs and manipulating tables. You'll want to do something like:
import pandas as pd
df = pd.read_csv(path)
df.values.ravel()
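If the two-column file from the question already exists, ravel() can flatten it into the single wide row the question asks for. A sketch, assuming the outputfile.csv written above (wide.csv is just a hypothetical output name):

import csv
import pandas as pd

# read the name/value pairs written earlier; header=None keeps the first pair as data
df = pd.read_csv('outputfile.csv', header=None)

# ravel() walks the values row by row, giving name, value, name, value, ...
flat = df.values.ravel()

with open('wide.csv', 'w', newline='') as f:
    csv.writer(f).writerow(flat)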

Looking to get a json and csv file from this

import urllib.request
import bs4 as bs
sauce = urllib.request.urlopen("http://www.nhl.com/scores/htmlreports/20172018/TH020070.HTM").read()
soup = bs.BeautifulSoup(sauce, "html.parser")
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
I am trying to output this to CSV and JSON. How would I do both (not at the same time)? Eventually, when I get it properly formatted, I would like to dump it straight into Postgres. I'm new to Python, so any help and suggestions would be appreciated. I previously got help outputting to CSV using pandas, but I can't get it to format the way I would like, although I've been told pandas makes it much easier.
Assuming you want to output the row variable from each iteration to JSON / CSV:
For JSON, you can simply dump the list of all rows. Something like:
import json

# Your logic here
rows = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    rows.append(row)

with open("out", "w") as fp:
    json.dump(rows, fp)
For CSV, you can use similar logic as well.
Check out the documentation:
https://docs.python.org/2/library/csv.html
https://docs.python.org/2/library/json.html
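For the CSV side, a short sketch reusing the rows list built in the JSON example above:

import csv

# write one CSV row per table row collected above
with open("out.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerows(rows)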
