Loop in Python script, only get last results - python
Why do I only get the stats from the last player in PLAYER_NAME?
I would like to get the stats from all the players in PLAYER_NAME.
import csv
import requests
from bs4 import BeautifulSoup
import urllib
PLAYER_NAME = ["andy-murray/mc10", "rafael-nadal/n409"]
URL_PATTERN = 'http://www.atpworldtour.com/en/players/{}/player-stats?year=0&surfaceType=clay'
for item in zip(PLAYER_NAME):
    url = URL_PATTERN.format(item)
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'class': 'mega-table-wrapper'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = (cell.text.encode("utf-8").strip())
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

outfile = open("./tennis.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Name", "Stat"])
writer.writerows(list_of_rows)
As mentioned in the comments, you're recreating list_of_rows on every iteration, so each player's rows throw away the previous player's. To fix that, move it outside the for loop and, instead of building a fresh list of lists per player, extend the one shared list with each player's rows.
On a side note, you have a few other issues with your code:
zip is redundant here, and it actually wraps each name in a one-element tuple, so the tuple's repr (not the name) ends up in the formatted URL. You just want to iterate over PLAYER_NAME directly, and while you're at it, maybe rename it to PLAYER_NAMES (since it's a list of names).
When formatting the URL you have empty braces. On Python 2.6/3.0, format requires an explicit argument position, so use {0}; on later versions empty braces auto-number, but {0} works everywhere.
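To make the zip point concrete, here is a minimal standalone sketch showing what the loop variable looks like in each case:

PLAYER_NAME = ["andy-murray/mc10", "rafael-nadal/n409"]

# zip with a single iterable wraps each element in a 1-tuple,
# so the tuple's repr ends up in the formatted URL:
for item in zip(PLAYER_NAME):
    print(item)                  # ('andy-murray/mc10',)
    print('{0}'.format(item))    # ('andy-murray/mc10',)

# Iterating over the list directly yields the plain strings:
for item in PLAYER_NAME:
    print(item)                  # andy-murray/mc10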
PLAYER_NAMES = ["andy-murray/mc10", "rafael-nadal/n409"]
URL_PATTERN = 'http://www.atpworldtour.com/en/players/{0}/player-stats?year=0&surfaceType=clay'

list_of_rows = []
for item in PLAYER_NAMES:
    url = URL_PATTERN.format(item)
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'class': 'mega-table-wrapper'})
    # for row in table.findAll('tr'):
    #     list_of_cells = []
    #     for cell in row.findAll('td'):
    #         text = cell.text.encode("utf-8").strip()
    #         list_of_cells.append(text)
    #     list_of_rows.append(list_of_cells)  # appends to the one list created above
    # Incidentally, the for loop above could also be written as:
    list_of_rows += [
        [cell.text.encode("utf-8").strip() for cell in row.findAll('td')]
        for row in table.findAll('tr')
    ]
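For completeness, here is a sketch of the whole corrected script assembled, assuming Python 2 (which the original's "wb" mode and encode() calls suggest); the CSV writing stays after the loop, so it runs once over all accumulated rows:

import csv
import requests
from bs4 import BeautifulSoup

PLAYER_NAMES = ["andy-murray/mc10", "rafael-nadal/n409"]
URL_PATTERN = 'http://www.atpworldtour.com/en/players/{0}/player-stats?year=0&surfaceType=clay'

list_of_rows = []
for item in PLAYER_NAMES:
    response = requests.get(URL_PATTERN.format(item))
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('div', attrs={'class': 'mega-table-wrapper'})
    # one list of cell texts per table row, added to the shared list
    list_of_rows += [
        [cell.text.encode("utf-8").strip() for cell in row.findAll('td')]
        for row in table.findAll('tr')
    ]

# written once, after every player has been scraped
with open("./tennis.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Stat"])
    writer.writerows(list_of_rows)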
Related
Only print on the first column of csv
So I have this code, but I am having issues when the data I am scraping has commas. I want it to show only in the first column, but when there's a comma the data spills into the 2nd column. Is it possible to scrape and print it to only the first column of the CSV, without using pandas? Thanks

i = 1
for url in urls:
    print(f'Scraping the URL no {i}')
    i += 1
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for text in soup.find('div', class_='entry-content').find_all('div', class_='streak'):
        link = text.a['href']
        text = text.a.text
        links.append(link)
        with open("/Users/Rex/Desktop/data.csv", "a") as file_object:
            file_object.write(text)
            file_object.write("\n")
CSV files have rules for escaping commas within a single column so that they are not mistakenly interpreted as a new column. This escaping is applied automatically if you use the csv module. You also really only need to open the file once, so with a few more tweaks to your code (note newline='', which the csv module expects):

import csv

with open("/Users/Rex/Desktop/data.csv", "a", newline='') as file_object:
    csv_object = csv.writer(file_object)
    i = 1
    for url in urls:
        print(f'Scraping the URL no {i}')
        i += 1
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = []
        for text in soup.find('div', class_='entry-content').find_all('div', class_='streak'):
            link = text.a['href']
            text = text.a.text.strip()
            # only record if we have text
            if text:
                links.append(link)
                csv_object.writerow([text])

NOTE: This code skips links that do not have text.
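As a standalone illustration of the escaping the csv module applies (a minimal sketch, unrelated to the scraping itself):

import csv

with open('demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Hello, world'])  # field contains a comma
    writer.writerow(['plain value'])

# demo.csv now contains:
#   "Hello, world"   <- quoted automatically, so it stays in one column
#   plain value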
Adding Data from Beautiful Soup table to a list
Hello, I'm a beginner to Python and programming in general, and I was wondering how I would make the outputted data a list. I used bs4 to extract data from a table and attempted to make a list with the data, but I end up only adding the first number to the list. Can someone provide me assistance and an explanation?

from bs4 import BeautifulSoup
from requests_html import HTMLSession

s = HTMLSession()
url = 'https://www.timeanddate.com/weather/usa/new-york/ext'

def get_data(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

with open('document.txt', 'a') as f:
    f.write(str(get_data(url)))

with open('document.txt', 'r') as html_file:
    contents = html_file.read()
    soup = BeautifulSoup(contents, 'lxml')

forecast_table = soup.find('table', class_='zebra tb-wt fw va-m tb-hover')
wtitle = soup.title.text
print(wtitle)
print("------")

def get_weather_high(forecast_table):
    print("Weather Highs:")
    for high in forecast_table.find_all('tbody'):
        rows1 = high.find_all('tr')
        for row1 in rows1:
            pl_high = row1.find_all('td')
            pl_high = [td.text.strip() for td in pl_high]
            pl_high = pl_high[1:2]
            for pl_high_final in pl_high:
                pl_high_final = pl_high_final[0:3]
                print(pl_high_final)

get_weather_high(forecast_table)

This is the output. Instead of each line being a number, I want to have it all in one list.
Create a list before your for loop and just append your data instead of printing it, then print the list after the for loop:

data = []
def get_weather_high(forecast_table):
    print("Weather Highs:")
    for high in forecast_table.find_all('tbody'):
        rows1 = high.find_all('tr')
        for row1 in rows1:
            pl_high = row1.find_all('td')
            pl_high = [td.text.strip() for td in pl_high]
            pl_high = pl_high[1:2]
            for pl_high_final in pl_high:
                pl_high_final = pl_high_final[0:3]
                data.append(pl_high_final)

print(data)  # or return data if you need it somewhere else
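If you would rather not rely on a module-level list, here is a sketch of the return-based variant that comment hints at (it reuses the question's forecast_table):

def get_weather_high(forecast_table):
    data = []  # local list instead of a global
    for high in forecast_table.find_all('tbody'):
        for row1 in high.find_all('tr'):
            # keep only the second cell of each row, trimmed to 3 chars
            for pl_high_final in [td.text.strip() for td in row1.find_all('td')][1:2]:
                data.append(pl_high_final[0:3])
    return data

highs = get_weather_high(forecast_table)
print(highs)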
Beautiful Soup - Results to CSV for all items in lists
The below snippet "works" but is only outputting the first record to the CSV. I'm trying to get the same output, but for each gun in the list of gun URLs in the all_links list. Any modification I've made to it with prints for the output (just to see it working) prints the same result, and if I make a gun_details list and try to print it, I get the same one-item output. How would I go about printing all the gun_details labels and spans into a CSV?

import csv
import urllib.request
import requests
from bs4 import BeautifulSoup

all_links = []
url = "https://www.guntrader.uk/dealers/minsterley/minsterley-ranges/guns?page={}"
for page in range(1, 3):
    res = requests.get(url).text
    soup = BeautifulSoup(res, "html.parser")
    for link in soup.select(
        'a[href*="dealers/minsterley/minsterley-ranges/guns/shotguns/"]'
    ):
        all_links.append("https://www.guntrader.uk" + link["href"])

for a_link in all_links:
    gun_label = []
    gun_span = []
    res = urllib.request.urlopen(a_link)
    # res = requests.get(a_link)
    soup = BeautifulSoup(res, "html.parser")
    for gun_details in soup.select("div.gunDetails"):
        for l in gun_details.select("label"):
            gun_label.append(l.text.replace(":", ""))
        for s in gun_details.select("span"):
            gun_span.append(s.text)

my_dict = dict(zip(gun_label, gun_span))

with open("mycsvfile.csv", "w") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=None)
    for key in my_dict.keys():
        csvfile.write(f"{key},{my_dict[key]}\n")
Try running the middle section this way:

for a_link in all_links:
    gun_label = []
    gun_span = []
    res = requests.get(a_link)
    soup = bs(res.content, 'html.parser')  # note it's 'res.content', not just 'res'
    for gun_details in soup.select('div.gunDetails'):
        for l in gun_details.select('label'):
            gun_label.append(l.text.replace(':', ''))
        for s in gun_details.select('span'):
            gun_span.append(s.text)

    # this block is now indented differently - it's INSIDE the 'for' loop
    my_dict = dict(zip(gun_label, gun_span))
    with open('mycsvfile.csv', 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=None)
        for key in my_dict.keys():
            csvfile.write(f"{key},{my_dict[key]}\n")
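An alternative sketch (not the answer's exact code): open the file once in 'w' mode before the loop, so nothing is reopened per link and no stale rows from a previous run are appended to. It assumes the all_links list built earlier in the question:

import csv
import requests
from bs4 import BeautifulSoup

with open('mycsvfile.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for a_link in all_links:
        res = requests.get(a_link)
        soup = BeautifulSoup(res.content, 'html.parser')
        for gun_details in soup.select('div.gunDetails'):
            labels = [l.text.replace(':', '') for l in gun_details.select('label')]
            spans = [s.text for s in gun_details.select('span')]
            for key, value in zip(labels, spans):
                writer.writerow([key, value])  # csv.writer escapes embedded commas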
How do I create a CSV file with webscraped content from several URLs?
I want to create a CSV file from webscraped content. The content is from FinViz.com. I want to scrape the table from this website 20 times for 20 different stocks and put all the content into a CSV file. Within my code, I generate a list of stocks from a scrape of Twitter content. The list of stocks that is generated is the same list that I want to get information on from the FinViz.com tables. Here is my code:

import csv
import urllib.request
from bs4 import BeautifulSoup

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage, "html.parser")
print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

for url in url_list:
    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')

    # scrape single page and add data to list
    # write datalist
    with open('today.csv', 'a') as file:
        writer = csv.writer(file)
        # write header row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))
        # write body row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))

The trouble I am running into is that my CSV file only has the webscraped data from the last item in the list. Instead I want the entire list in a sequence of rows. Here is what my CSV file looks like:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-1.75,7.94%,79.06M,-22.52%,296.48M,-,-1.74,-4.61%,72.41M,-23.16%,-85.70M,-,-0.36,62.00%,3.21%,1.63%,15.10M,19.63,-197.00%,18.05%,2.57,66.67%,-0.65,-,-8.10%,-127.70%,12.17,-6.25%,0.93,4.03,-,146.70%,2.05 - 5.86,3.59%,-,-,-,385.80%,-36.01%,-,-,1.30,-,76.50%,82.93%,0.41,100,1.30,-59.60%,-,36.98,16.13% 9.32%,Yes,-,90.00%,-,0.82,3.63,Yes,-,Nov 08,-,902.43K,3.75,2.30,-22.08%,-10.43%,11.96%,"742,414",3.31%
It would be better to open your output file first, rather than keep opening and closing it for each URL that you fetch. Exception handling is needed to catch cases where the URL does not exist. Also, you should open the output file with newline='' to avoid extra empty lines being written to the file:

import csv
import urllib.request
from bs4 import BeautifulSoup

write_header = True

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage, "html.parser")
print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

with open('today.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    for url in url_list:
        try:
            fpage = urllib.request.urlopen(url)
            fsoup = BeautifulSoup(fpage, 'html.parser')

            # write header row (once)
            if write_header:
                writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))
                write_header = False

            # write body row
            writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))
        except urllib.error.HTTPError:
            print("{} - not found".format(url))

So today.csv would start like:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-10.85,4.60%,2.36M,11.00%,8.09M,-,-,-62.38%,1.95M,-16.14%,-14.90M,-,-,2.30%,10.00%,-44.42%,0.00M,-,21.80%,-5.24%,3.10,-38.16%,1.46,2.35,-,-155.10%,65.00,-50.47%,-,-,-,-238.40%,2.91 - 11.20,-38.29%,-,-,54.50%,-,-69.37%,1.63,-,2.20,-,-,17.87%,0.36,15,2.20,-,-,39.83,11.38% 10.28%,No,0.00,68.70%,-,1.48,3.30,Yes,0.00,Feb 28 AMC,-,62.76K,3.43,1.00,-5.21%,-25.44%,-37.33%,"93,166",3.94%
-,-,-0.26,1.50%,268.98M,3.72%,2.25B,38.05,0.22,-0.64%,263.68M,-9.12%,-55.50M,-,0.05,-,9.96%,-12.26%,1.06B,2.12,-328.10%,25.95%,2.32,17.72%,12.61,0.66,650.00%,-0.90%,12.64,-38.73%,0.03,264.87,-,-1.90%,6.69 - 15.27,-0.48%,-,-,-28.70%,0.00%,-45.17%,2.20,-,0.70,16.40%,67.80%,25.11%,0.41,477,0.80,71.90%,5.30%,52.71,4.83% 5.00%,Yes,0.80,7.80%,-5.20%,0.96,7.78,Yes,0.80,Feb 27 AMC,-,11.31M,8.37,2.20,0.99%,-1.63%,-4.72%,"10,843,026",7.58%

If you only want your file to contain data from one run of the script, you do not need 'a' to append; just use 'w' instead.
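As a side note on newline='', here is a minimal sketch of why it matters (the blank-line symptom appears mainly on Windows):

import csv

# csv.writer terminates rows with '\r\n'. Without newline='', text mode
# translates that to '\r\r\n' on Windows, which shows up as a blank line
# between every row when the file is opened in Excel or similar.
with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['a', 'b'])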
BeautifulSoup and CSV files
I'm looking to pull the table from http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1# and put all the information in a CSV file. I've done this but am having a few issues. The first column of the table contains both the ranking of the player and their name. I want to split these up so that one column just contains the ranking and the other column contains the player name. Here's the code:

import urllib2
from bs4 import BeautifulSoup
import csv

URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]

with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 16:
            csvwriter.writerow(cells)

Here's the output for a few players:

"1 Novak Djokovic",SRB,5-0,0-0,9,1.8,7,1.4,62%,74%,58%,88%,42%,68%,39%-57%,46%
"2 Roger Federer",SUI,1-1,0-1,9,4.5,2,1.0,59%,68%,54%,84%,46%,67%,37%-49%,33%
"3 Andy Murray",GBR,0-0,0-0,0,0.0,0,0.0,0%,0%,0%,0%,0%,0%,0%-0%,0%
"4 Rafael Nadal",ESP,11-3,2-1,25,1.8,18,1.3,68%,69%,57%,82%,43%,57%,36%-58%,38%
"5 Kei Nishikori",JPN,5-0,0-0,14,2.8,9,1.8,57%,75%,62%,92%,49%,80%,39%-62%,42%

As you can see, the first column isn't displayed properly, with the number on a higher line than the rest of the data and an extremely large gap. The HTML for the problem column is slightly more complex than the rest of the columns:

<td class="col1" rel="1">1 Novak Djokovic</td>

I tried separating it at that point but couldn't get it to work, and thought it might be easier to fix the current CSV file.
Separating the field after pulling it out is pretty easy. You've got a number, a bunch of whitespace, and a name. So just use split, with the default delimiter and a max split of 1:

cells = [c.text.encode('utf-8') for c in row.findAll('td')]
if len(cells) == 16:
    cells[0:1] = cells[0].split(None, 1)
    csvwriter.writerow(cells)

But you can also separate it from within the soup, and that's probably more robust:

cells = row.find_all('td')
cell0 = cells.pop(0)
rank = next(cell0.children).strip().encode('utf-8')
name = cell0.find('a').text.encode('utf-8')
cells = [rank, name] + [c.text.encode('utf-8') for c in cells]
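A quick sketch of what that split does; the escaped whitespace here stands in for the tabs and newlines in the real cell:

cell = '1\n\t\t\tNovak Djokovic'

# default delimiter = any run of whitespace; max split of 1
print(cell.split(None, 1))   # ['1', 'Novak Djokovic']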
Since the value you're concerned with contains multiple tabs and the player's name comes directly after the final tab, I'd suggest splitting by tab and taking the last item of the resulting list. The line I added is cells[0] = cells[0].split('\t')[-1]:

import urllib2
from bs4 import BeautifulSoup
import csv

URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]

with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 16:
            cells[0] = cells[0].split('\t')[-1]
            csvwriter.writerow(cells)
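For contrast with the previous answer's split(None, 1), a tiny sketch of what the tab-split keeps; note it yields only the name, so the rank is discarded rather than placed in its own column:

cell = '1\t\t\tNovak Djokovic'
print(cell.split('\t')[-1])   # Novak Djokovic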