A web scraper written in Python extracts water level data, one reading per hour.
When the data is written to a .txt file using the code below, each line is prefixed with a datetime, so each line takes up roughly 20 characters.
Example: "01/01-2010 11:10,-32"
Using the code below results in a file containing data from 01/01-2010 00:10 to 28/02-2010 23:50, which is roughly 60 days. 60 days with one reading per hour should give about 1440 lines and roughly 30,000 characters. Microsoft Word, however, tells me the file contains 830,000 characters on 42,210 lines, which fits well with the observed file size of 893 kB.
Apparently some extra lines and characters are hiding somewhere, and I can't seem to find them anywhere.
import requests
import time
from datetime import timedelta, date
from bs4 import BeautifulSoup

totaldata = []
filnavn = 'Vandstandsdata_Esbjerg_KDI_TEST_recode.txt'

# truncate the output file before appending to it day by day
file = open(filnavn, 'w')
file.write("")
file.close()


def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)


start_date = date(2010, 1, 1)
end_date = date(2010, 3, 1)

values = []
datoer = []

for single_date in daterange(start_date, end_date):
    valuesTemp = []
    datoerTemp = []
    dato = single_date.strftime("%d-%m-%y")
    url = "http://kysterne.kyst.dk/pages/10852/waves/showData.asp?targetDay=" + dato + "&ident=6401&subGroupGuid=16410"
    page = requests.get(url)
    if page.status_code == 200:
        soup = BeautifulSoup(page.content, 'html.parser')
        dataliste = list(soup.find_all(class_="dataTable"))
        print(url)
        dataliste = str(dataliste)
        dataliste = dataliste.splitlines()
        dataliste = dataliste[6:]  # drop the leading lines before the data rows
        for e in range(0, len(dataliste), 4):   # every 4th line holds a date/time cell
            datoerTemp.append(dataliste[e])
        for e in range(1, len(dataliste), 4):   # the line after it holds the reading
            valuesTemp.append(dataliste[e])
        for e in valuesTemp:
            e = e[4:]     # strip the surrounding <td> ... </td> tags
            e = e[:-5]
            values.append(e)
        for e in datoerTemp:
            e = e[4:]
            e = e[:-5]
            datoer.append(e)
        file = open(filnavn, 'a')
        for i in range(0, len(datoer), 6):   # write every 6th entry
            file.write(datoer[i] + "," + values[i] + "\n")
        print("- skrevet til fil\n")  # "written to file"
        file.close()

print("done")
Ah, eureka.
Seconds before posting this question I realized I had forgotten to reset the list.
I added:
datoer=[]
inside the loop, and everything now works as intended.
The old code wrote the data for a given day plus all the data from every previous day, on each pass through the loop.
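That cumulative rewrite also explains the numbers: with roughly 24 lines written per day, rewriting every previous day on each of the ~59 iterations gives about 24 × (1 + 2 + ... + 59) ≈ 42,500 lines, close to the 42,210 that Word reports. A minimal sketch of the fix, using the same variable names as above (the post mentions resetting datoer; values needs the same reset so the two lists stay index-aligned):

for single_date in daterange(start_date, end_date):
    datoer = []   # reset, so previous days are not written again
    values = []   # reset as well, to keep dates and values aligned
    # ... fetch and parse the page for single_date exactly as above ...
    with open(filnavn, 'a') as f:
        for i in range(0, len(datoer), 6):
            f.write(datoer[i] + "," + values[i] + "\n")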
I hope someone can use this newbie experience.
Related
I am writing this Python script to scrape a website and collect information. After I enter the date and the number of games on that date, the script goes to the site, scrapes a specific table, and saves each game to a CSV file. There are 12 rows in the table, which is why it is hardcoded in the for loop.
The script works, but I would like suggestions from you experts on how to optimize and speed it up. I thought using concurrent.futures would speed it up, but it doesn't give an obvious improvement.
It would be great if anyone can help. At the moment, the script takes 20-30 seconds to scrape one date.
Thank you very much for your time!
import concurrent.futures
import pandas as pd
from requests_html import HTMLSession
import requests_cache

session = HTMLSession()
requests_cache.install_cache(expire_after=3600)

game_date = input("Please input the date of the game that you want to scrape (in YYYY/MM/DD): ")
game_no = int(input("Please input how many games are on that date: "))


def split_list(big_list, chunk_size):
    return [big_list[i:i + chunk_size] for i in range(0, len(big_list), chunk_size)]


def get_game_result(game):
    print(f"Processing game {game}")
    url = f"https://example.com{game_date}&{game}"  # example link
    response = session.get(url)
    response.html.render(sleep=5, keep_page=True, scrolldown=1)
    row_body = response.html.xpath("/html/body/div[1]/div[3]/div[2]/div[2]/div[2]/div[5]/table/tbody/tr[1]")
    final_list = []
    for i in range(2, 14):
        for item in row_body:
            item_table = item.text.split("\n")
            final_list.append(item_table)
        row_body = response.html.xpath(f"/html/body/div[1]/div[3]/div[2]/div[2]/div[2]/div[5]/table/tbody/tr[{i}]")
    df = pd.DataFrame(final_list)
    df.to_csv(f"game_{game}.csv", index=False, header=False)
    print(f"Finished processing game {game}")


with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = [executor.submit(get_game_result, game) for game in range(1, game_no + 1)]
    for f in concurrent.futures.as_completed(results):
        f.result()
I wrote a few lines to get data from a financial data website.
It simply uses Beautiful Soup to parse and requests to fetch.
Are there any simpler or sleeker ways of getting the same result?
I'm just after a discussion to see what others have come up with.
from pandas import DataFrame
import bs4
import requests


def get_webpage():
    symbols = ('ULVR', 'AZN', 'HSBC')
    for ii in symbols:
        url = 'https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L'
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        rows = soup.find_all('tr')
        data = [[td.getText() for td in rows[i].find_all('td')] for i in range(len(rows))]
        # each row: Date, Open, High, Low, Close, Adj Close, Volume
        data = DataFrame(data)
        print(ii, data)


if __name__ == "__main__":
    get_webpage()
Any thoughts?
You can try the read_html() method:

import pandas as pd

symbols = ('ULVR', 'AZN', 'HSBC')
df = [pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L') for ii in symbols]
df1 = df[0][0]
df2 = df[1][0]
df3 = df[2][0]
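For context, pd.read_html returns a list of DataFrames, one per <table> element it finds on the page, which is why each result above is indexed with [0]. A minimal usage sketch for a single symbol (assuming the page parses without extra request headers):

import pandas as pd

tables = pd.read_html('https://uk.finance.yahoo.com/quote/ULVR.L/history?p=ULVR.L')
history = tables[0]   # the price-history table is the first (and only) table on the page
print(history.head())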
As it's just the entire table that I want, it seems easier to use pandas.read_html, especially as I have no need to scrape anything in particular apart from the entire table.
There is some helpful information on this site for guidance: https://pbpython.com/pandas-html-table.html
By just using import pandas as pd I get the result I am after.
import pandas as pd


def get_table():
    symbols = ('ULVR', 'AZN', 'HSBC')
    position = 0
    for ii in symbols:
        table = [pd.read_html('https://uk.finance.yahoo.com/quote/' + ii + '.L/history?p=' + ii + '.L')]
        print(symbols[position])
        print(table, '\n')
        position += 1


if __name__ == "__main__":
    get_table()
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 14 2017
Modified on Wed Aug 16 2017
Author: Yanfei Wu
Get the past 500 S&P 500 stocks data
"""
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import pandas as pd
import pandas_datareader.data as web


def get_ticker_and_sector(url='https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'):
    """
    get the s&p 500 stocks from Wikipedia:
    https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
    ---
    return: a dictionary with ticker names as keys and sectors as values
    """
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'lxml')
    # we only want to parse the first table of this wikipedia page
    table = soup.find('table')
    sp500 = {}
    # loop over the rows and get ticker symbol and sector name
    for tr in table.find_all('tr')[1:]:
        tds = tr.find_all('td')
        ticker = tds[0].text
        sector = tds[3].text
        sp500[ticker] = sector
    return sp500


def get_stock_data(ticker, start_date, end_date):
    """ get stock data from google with stock ticker, start and end dates """
    data = web.DataReader(ticker, 'google', start_date, end_date)
    return data


if __name__ == '__main__':
    """ get the stock data from the past 5 years """
    # end_date = datetime.now()
    end_date = datetime(2017, 8, 14)
    start_date = datetime(end_date.year - 5, end_date.month, end_date.day)

    sp500 = get_ticker_and_sector()
    sp500['SPY'] = 'SPY'  # also include SPY as reference
    print('Total number of tickers (including SPY): {}'.format(len(sp500)))

    bad_tickers = []
    for i, (ticker, sector) in enumerate(sp500.items()):
        try:
            stock_df = get_stock_data(ticker, start_date, end_date)
            stock_df['Name'] = ticker
            stock_df['Sector'] = sector
            if stock_df.shape[0] == 0:
                bad_tickers.append(ticker)
            #output_name = ticker + '_data.csv'
            #stock_df.to_csv(output_name)
            if i == 0:
                all_df = stock_df
            else:
                all_df = all_df.append(stock_df)
        except:
            bad_tickers.append(ticker)
    print(bad_tickers)

    all_df.to_csv('./data/all_sp500_data_2.csv')

    """ Write failed queries to a text file """
    if len(bad_tickers) > 0:
        with open('./data/failed_queries_2.txt', 'w') as outfile:
            for ticker in bad_tickers:
                outfile.write(ticker + '\n')
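A side note on the accumulation step before turning to the error handling: newer pandas releases have removed DataFrame.append, so collecting the per-ticker frames in a list and concatenating once at the end is the usual replacement. A minimal sketch with the same variable names (error handling omitted):

import pandas as pd

frames = []
for ticker, sector in sp500.items():
    stock_df = get_stock_data(ticker, start_date, end_date)
    stock_df['Name'] = ticker
    stock_df['Sector'] = sector
    frames.append(stock_df)      # collect first ...
all_df = pd.concat(frames)       # ... then concatenate once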
Your problem is in your try/except block. It is good style to always catch a specific exception rather than blindly wrapping a long block of code in a bare except. The problem with that approach, as your case demonstrates, is that when an unrelated or unexpected error occurs, you won't know about it. In this case, this is the exception I get from running your code:
NotImplementedError: data_source='google' is not implemented
I'm not sure what that means, but it looks like the pandas_datareader.data.DataReader docs have good information about how to use that DataReader correctly.
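(The 'google' data source is no longer supported in recent pandas_datareader releases, which is what that NotImplementedError is reporting, so a different data_source would be needed.) To make the bare-except advice concrete, here is a minimal sketch of catching only the expected failure around the fetch; the import path of RemoteDataError, the exception pandas_datareader raises for failed remote queries, is assumed here. Anything else, such as the NotImplementedError above, would then surface instead of being silently swallowed:

from pandas_datareader._utils import RemoteDataError

try:
    stock_df = get_stock_data(ticker, start_date, end_date)
except RemoteDataError:
    # only the expected "remote query failed" case is treated as a bad ticker
    bad_tickers.append(ticker)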
This is a side project I am doing as I am attempting to learn Python.
I am trying to write a Python script that iterates through a date range and uses each returned date in a GET request URL.
The URL uses a LastModified parameter and limits GET requests to a 24-hour period, so I would like to run the GET request for each day from the start date.
Below is what I have currently. The major issue I am having is how to separate the returned dates so that I can use each date individually in the GET; the GET will also need to be looped to use each date, I suppose.
Any pointers in the right direction would be helpful as I am trying to learn as much as possible.
import datetime
import requests
from requests.auth import HTTPBasicAuth

start_date = datetime.date(2020, 1, 1)
end_date = datetime.date.today()
delta = datetime.timedelta(days=1)

while start_date <= end_date:
    last_mod = start_date + delta
    print(last_mod)
    start_date += delta

vend_key = 'REDACTED'
user_key = 'REDACTED'
metrc_license = 'A12-0000015-LIC'
base_url = 'https://sandbox-api-ca.metrc.com'
last_mod_date = ''

a = HTTPBasicAuth(vend_key, user_key)


def get(path):
    url = '{}/{}/?licenseNumber={}&lastModifiedStart={}'.format(base_url, path, metrc_license, last_mod_date)
    print('URL:', url)
    r = requests.get(url, auth=a)
    print("The server response is: ", r.status_code)
    if r.status_code == 200:
        return r.json()
    # Would like an elif that, if r.status_code is 500, waits _ seconds and tries again
    elif r.status_code == 500:
        print("500 error, try again.")
    else:
        print("Error")


print(get('/packages/v1/active'))
Here is an example of what the current script returns. I don't need it to print each date, so I can remove that print, but how can I make each date from the loop its own variable to use in a loop of GET requests?
2020-01-02
2020-01-03
2020-01-04
2020-01-05
2020-01-06
etc...
etc...
etc...
2020-05-24
2020-05-25
2020-05-26
2020-05-27
URL: https://sandbox-api-ca.metrc.com//packages/v1/active/?licenseNumber=A12-0000015-LIC&lastModifiedStart=2020-05-27
The server response is: 200
[]
It's super simple: you need to move the while loop that generates all these dates into your get() function. Here is what I mean:
import datetime
import requests
from requests.auth import HTTPBasicAuth

vend_key = 'REDACTED'
user_key = 'REDACTED'
metrc_license = 'A12-0000015-LIC'
base_url = 'https://sandbox-api-ca.metrc.com'

a = HTTPBasicAuth(vend_key, user_key)


def get(path):
    start_date = datetime.date(2020, 1, 1)
    end_date = datetime.date.today()
    delta = datetime.timedelta(days=1)
    results = []  # collect one response per day instead of returning after the first successful request
    while start_date <= end_date:
        last_mod_date = start_date + delta
        print(last_mod_date)
        start_date += delta
        url = '{}/{}/?licenseNumber={}&lastModifiedStart={}'.format(base_url, path, metrc_license, last_mod_date)
        print('URL:', url)
        r = requests.get(url, auth=a)
        print("The server response is: ", r.status_code)
        if r.status_code == 200:
            results.append(r.json())
        # Would like an elif that, if r.status_code is 500, waits _ seconds and tries again
        elif r.status_code == 500:
            print("500 error, try again.")
        else:
            print("Error")
    return results


print(get('/packages/v1/active'))
One thing you could do is call your get function inside the while loop. First modify the get function to take a new parameter date and then use this parameter when you build your url.
For instance:
def get(path, date):
    url = '{}/{}/?licenseNumber={}&lastModifiedStart={}'.format(base_url, path, metrc_license, date)
    ...
And then call get inside the while loop.
while start_date <= end_date:
    last_mod = start_date + delta
    get(some_path, last_mod)
    start_date += delta
This would make a lot of GET requests in a short period of time, so you might want to be careful not to overload the server with requests.
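On the retry wished for in the code comment ("wait _ seconds and try again" after a 500), a minimal sketch of a hypothetical helper, assuming a fixed delay and a capped number of attempts, could look like this:

import time
import requests


def get_with_retry(url, auth, attempts=3, delay=5):
    """Return the JSON body, retrying on HTTP 500 with a fixed delay between attempts."""
    for attempt in range(attempts):
        r = requests.get(url, auth=auth)
        if r.status_code == 200:
            return r.json()
        if r.status_code == 500:
            print("500 error, retrying in {} seconds ({}/{})".format(delay, attempt + 1, attempts))
            time.sleep(delay)
        else:
            print("Error:", r.status_code)
            return None
    return None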
I am trying to crawl all the news links that contain a certain keyword I am looking for.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re

key_word = urllib.parse.quote("금리")
url = "https://search.naver.com/search.naver?where=news&query=" + key_word + "%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=0&related=0"

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

anchor_set = soup.findAll('a')
news_link = []
for a in anchor_set:
    if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
        a = a.get('href')
        news_link.append(a)
Up to this section (the code above), I parse the url and retrieve all links that contain read.nhn (the Naver news platform) and append them to news_link.
This is working fine, but the problem is that the url used above only shows 10 articles per page.
count_tag = soup.find("div", {"class": "title_desc all_my"})
count_text = count_tag.find("span").get_text().split()
total_num = count_text[-1][0:-1].replace(",", "")
print(total_num)
Using the code above, I found out that there are a total of 1297 articles I need to collect, but the original link above only shows 10 articles per page.
for val in range(int(total_num)//10+1):
    start_val = str(val*10+1)
I was told I needed to insert this into the url to retrieve ALL the news links.
Thus, I used the while method:
while start_val <= total_num:
    url = "https://search.naver.com/search.naver?where=news&query=" + key_word + "%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=" + start_val + "&related=0"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    news_link = []
    anchor_set = soup.findAll('a')
    for a in anchor_set:
        if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
            a = a.get('href')
            news_link.append(a)
However, when I run the program, the loop never seems to stop; obviously there is no else or break. How can I break this loop and successfully collect all the links?
Your current while loop doesn't stop because you never increment the value of start_val. Also, later you have range(int(total_num)//10+1), so if you keep total_num as a string, the comparison in while start_val <= total_num is a string comparison and is wrong: for strings, "21" > "1297" because "2" > "1". Compare them as ints.
And since you're creating the sequence of vals to use, you don't need a separate upper bound check.
So far, this would give you the correct finite loop:
for val in range(int(total_num)//10+1):  # no upper bound check needed
    start_val = str(val*10+1)
    url = "https://search.naver.com/search.naver?where=news&query=" ...
    html = urllib.request.urlopen(url).read()
    ...
For the values needed for the pages/next starting item, instead of doing:

for val in range(int(total_num)//10+1):
    start_val = str(val*10+1)

you can get the actual vals from range() directly. To start at 1 and go in steps of 10, giving 1, 11, 21, ..., up to and including the total:

for val in range(1, int(total_num) + 1, 10):
    start_val = str(val)  # this assignment isn't actually needed
Next thing: the URL for page 2 onwards is wrong. Currently, your while loop will generate the following URL for page 2:
https://search.naver.com/search.naver?where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=11&related=0
But if you click on page "2" of the results, you get the URL:
https://search.naver.com/search.naver?&where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=35&start=11&refresh_start=0
The main difference is at the end: &refresh_start=11 in yours vs. &start=11&refresh_start=0 in the actual URL. Since that format also works for page 1 (just checked), use it instead.
You also have some extra characters in the section after the keyword: ...&query=" + key_word + "%EA%B8%88%EB%A6%AC&sm=tab_opt. That %EA%B8%88%EB%A6%AC is left over from your previous search keyword.
You can also skip several unneeded URL parameters by testing which ones are actually not needed.
Putting all that together:
for val in range(1, int(total_num) + 1, 10):
    start_val = str(val)
    url = ("https://search.naver.com/search.naver?&where=news&query=" +
           key_word +
           "&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14" +
           "&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=51" +
           "&refresh_start=0&start=" +
           start_val)
    html = urllib.request.urlopen(url).read()
    ... # etc.
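Putting the corrected paging loop together with the link extraction from the question, a minimal end-to-end sketch could look like this. It accumulates all pages into a single news_link list (rather than resetting it on every page, as the while version did), and assumes total_num has already been obtained from the count step above:

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

key_word = urllib.parse.quote("금리")
total_num = "1297"          # value reported by the count step in the question
news_link = []              # one list for all pages

for val in range(1, int(total_num) + 1, 10):
    url = ("https://search.naver.com/search.naver?&where=news&query=" + key_word +
           "&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14" +
           "&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=51" +
           "&refresh_start=0&start=" + str(val))
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.findAll('a'):
        if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
            news_link.append(a.get('href'))

print(len(news_link), "links collected")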