Getting more than 100 days of data web scraping Yahoo - python

Like many others, I have been looking for an alternative source of stock prices now that the Yahoo and Google APIs are defunct. I decided to try web scraping the Yahoo site, from which historical prices are still available. I managed to put together the following code, which almost does what I need:
import urllib.request as web
import bs4 as bs
import pandas as pd

def yahooPrice(tkr):
    tkr = tkr.upper()
    url = 'https://finance.yahoo.com/quote/' + tkr + '/history?p=' + tkr
    sauce = web.urlopen(url)
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    allrows = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        if len(row) == 7:
            allrows.append(row)
    vixdf = pd.DataFrame(allrows).iloc[0:-1]
    vixdf.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Aclose', 'Volume']
    vixdf.set_index('Date', inplace=True)
    return vixdf
which produces a dataframe with the information I want. Unfortunately, even though the actual web page shows a full year's worth of prices, my routine only returns 100 records (including dividend records). Any idea how I can get more?

The Yahoo Finance API was deprecated in May 2017, I believe. There are not many options for downloading time series data for free now, at least not that I know of. Nevertheless, there is always some kind of alternative. Check out the URL below to find a tool for downloading historical prices.
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
See this too.
https://blog.quandl.com/api-for-stock-data

I don't have an exact solution to your question, but I have a workaround (I had the same problem and used this approach). Basically, you can use the BDay() offset from pandas.tseries.offsets and request the data in chunks of a fixed number of business days. In my case, I ran the loop three times to get 300 business days of data, knowing that 100 was the maximum I was getting by default.
Basically, you run the loop three times and set BDay() so that the first iteration grabs the most recent 100 days, the next grabs the 100 days before that (100-200 days back), and the last grabs the 100 days before that (200-300 days back). The whole point of doing this is that at any given time you can only scrape 100 days of data; even if you request 300 days in one go, you may not get 300 days of data - your original problem (possibly Yahoo limits the amount of data extracted in one request). I have my code here: https://github.com/ee07kkr/stock_forex_analysis/tree/dataGathering
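Here is a minimal sketch of that windowing idea, assuming you wrap the scraping itself in your own routine (for example a date-range variant of yahooPrice above); the sketch only shows how BDay() can produce three consecutive 100-business-day windows:
import pandas as pd
from pandas.tseries.offsets import BDay

# Build three consecutive 100-business-day windows, newest first.
end = pd.Timestamp.today().normalize()
windows = []
for _ in range(3):
    start = end - BDay(100)
    windows.append((start, end))
    end = start  # step back to the previous window

for start, end in windows:
    print(start.date(), '->', end.date())
    # scrape this window here and append the rows to one DataFrame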
Note that for some reason the CSV files are not working with the \t delimiter in my case... but basically you can use the data frame. One more issue I currently have is that 'Volume' is a string instead of a float... the way to get around it is:
apple = pd.read_csv('AAPL.csv', sep='\t', index_col=0)
apple['Volume'] = apple['Volume'].str.replace(',', '').astype(float)

First - Run the code below to get your 100 days.
Then - Use SQL to insert the data into a small database (SQLite3 is pretty easy to use with Python).
Finally - Amend the code below to fetch daily prices, which you can then add to grow your database.
from pandas import DataFrame
import bs4
import requests

def function():
    url = 'https://uk.finance.yahoo.com/quote/VOD.L/history?p=VOD.L'
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # Column headers (assumes the page's <th> cells are the history table headers).
    headers = [th.getText() for th in soup.find_all('th')]
    rows = soup.find_all('tr')
    # Cell text for every table row on the page.
    ts = [[td.getText() for td in row.find_all('td')] for row in rows]
    # Keep only rows holding a full day of prices (7 columns),
    # which drops dividend rows and the footer.
    data = [row for row in ts if len(row) == 7]
    df = DataFrame(data, columns=headers[:7])
    return df
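For step 2 above, here is a minimal sketch of the SQLite side; the database file and table names are illustrative. pandas can write the scraped frame straight into the database:
import sqlite3

df = function()  # the DataFrame built by the routine above
conn = sqlite3.connect('prices.db')  # illustrative database file
df.to_sql('vod_prices', conn, if_exists='append', index=False)  # illustrative table name
conn.close()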

Related

How do I separate text after using BeautifulSoup in order to plot?

I am trying to make a program that scrapes the data from OpenInsider and then takes that data and plots it. OpenInsider shows which insiders of a company are buying or selling the stock. I want to be able to show, in an easy-to-read format, which company, which insider type, and how much of the stock was purchased.
Here is my code so far:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://openinsider.com/top-insider-purchases-of-the-month")
'''print(page.status_code)
checks to see if the page was downloaded successfully'''
soup = BeautifulSoup(page.content,'html.parser')
table = soup.find(class_="tinytable")
data = table.get_text()
#results = data.prettify
print(data, '\n')
Here is an example of some of the results:
X
Filing Date
Trade Date
Ticker
Company NameInsider NameTitle
Trade Type  
Price
Qty
Owned
ΔOwn
Value
1d
1w
1m
6m
2022-12-01 16:10:122022-11-30 AKUSAkouos, Inc.Kearny Acquisition Corp10%P - Purchase$12.50+29,992,668100-100%+$374,908,350
2022-11-30 20:57:192022-11-29 HHCHoward Hughes CorpPershing Square Capital Management, L.P.Dir, 10%P - Purchase$70.00+1,560,20515,180,369+11%+$109,214,243
2022-12-02 17:29:182022-12-02 IOVAIovance Biotherapeutics, Inc.Rothbaum Wayne P.DirP - Purchase$6.50+10,000,00018,067,333+124%+$65,000,000
However, for me each year starts a new line.
Is there a better way to use BeautifulSoup? Or is there an easy way to sort through this data and retrieve the specific information I am looking for? Thank you in advance; I have been stuck on this for a while.
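A hedged sketch of one way to keep the table structured - walking the rows and cells instead of calling get_text() on the whole table - looks roughly like this; the header handling is an assumption about the page's markup:
from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get("http://openinsider.com/top-insider-purchases-of-the-month")
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find(class_="tinytable")

# Header cells use <th>, data cells use <td>, so collect them separately.
headers = [th.get_text(strip=True) for th in table.find_all('th')]
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # skip the header row, which has no <td> cells
        rows.append(cells)

df = pd.DataFrame(rows)
# Only attach the header names if the counts actually line up.
if rows and len(headers) == len(rows[0]):
    df.columns = headers
print(df.head())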

How to fetch total market value of csgo market

I'm working on a site and I'm trying to find an API that returns the total value in $ of every skin in csgo.
What I want to achieve is something like this: https://pbs.twimg.com/media/E2-bYmJXEAQmO5u.jpg
How can I do that?
Thank you @NewbieCody for linking me to the answer.
Example:
import json
import requests

data = requests.get("https://steamcommunity.com/market/search/render/?search_descriptions=0&sort_column=name&sort_dir=desc&appid=730&norender=1&count=100&start=0")
json_data = json.loads(data.text)
print(json_data)
Every page returns 100 items, so I iterated [the total number of items]/100 times, adding 100 to the start parameter each time, and extracted the prices to make the graph.
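A rough sketch of that paging loop, using the same endpoint and parameters as above; the total_count and results field names are assumptions about the JSON this endpoint returns, and you will want your own rate limiting:
import requests

base = ("https://steamcommunity.com/market/search/render/"
        "?search_descriptions=0&sort_column=name&sort_dir=desc"
        "&appid=730&norender=1&count=100&start={}")

first = requests.get(base.format(0)).json()
total = first.get("total_count", 0)   # assumed field name
items = first.get("results", [])      # assumed field name

for start in range(100, total, 100):
    page = requests.get(base.format(start)).json()
    items.extend(page.get("results", []))
    # consider a time.sleep() here - Steam rate-limits aggressive polling

print(len(items), "items collected")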

Only scrape google news articles containing an exact phrase in python

I'm trying to build a media tracker in Python that each day returns all Google News articles containing a specific phrase, "Center for Community Alternatives". If, one day, there are no new news articles that contain this exact phrase, then no new links should be added to the data frame. The problem I am having is that even on days when there are no news articles containing my phrase, my code adds articles with similar phrases to the data frame. How can I only append links that contain my exact phrase?
Below I have attached an example code looking at 03/01/22:
from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd
googlenews=GoogleNews(start='03/01/2022',end='03/01/2022')
googlenews.search('"' + "Center for Community Alternatives" + '"')
googlenews.getpage(1)
result=googlenews.result()
df=pd.DataFrame(result)
df
Even though, when you search "Center for Community Alternatives" (with quotes around it) in Google News for this specific date, Google shows No results found for "center for community alternatives", the code scrapes the links that appear below that message, which are Results for center for community alternatives (without quotes).
The API you're using does not support exact match.
In https://github.com/Iceloof/GoogleNews/blob/master/GoogleNews/__init__.py:
def search(self, key):
    """
    Searches for a term in google.com in the news section and retrieves the first page into __results.
    Parameters:
    key = the search term
    """
    self.__key = "+".join(key.split(" "))
    if self.__encode != "":
        self.__key = urllib.request.quote(self.__key.encode(self.__encode))
    self.get_page()
As an alternative, you could probably just filter your data frame for an exact match:
df = df[df['title'].str.contains('Center for Community Alternatives', regex=False)]  # or whichever column you want to match on
Probably you're not getting any results because:
There are no search results that match your search term - "Center for Community Alternatives" - within the date range in your question - 03/01/2022.
If you consider changing the search term by removing the double quotes AND increasing the date range, you might get some results - that will depend entirely on how actively the source posts news and how Google handles such topics.
What I suggest is to change your code to:
Keep the search term - Center for Community Alternatives - without double quotes
Apply a longer date range to the search
Get only distinct values - while testing this code, I got duplicate entries.
Get more than one page to increase the chance of getting results.
Code:
#!pip install GoogleNews # https://pypi.org/project/GoogleNews/
#!pip install newspaper3k # https://pypi.org/project/newspaper3k/
from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd

search_term = "Center for Community Alternatives"
# I suppose the date is in "MM/dd/yyyy" format...
googlenews = GoogleNews(lang='en', region='US', start='03/01/2022', end='03/03/2022')
googlenews.search(search_term)

# Initial list of results - it will contain a list of dictionaries (dict).
results = []
# Contains the final results = news filtered by the criteria
# (news that contain the search term in their description).
final_results = []

# Get the first few pages of results and append them to the list - you can
# set any other range according to your needs:
for page in range(1, 4):
    googlenews.getpage(page)  # Consider adding a timer to avoid rapid repeated calls and "HTTP Error 429: Too Many Requests" errors.
    results.extend(googlenews.result())

# Remove duplicates and keep in "final_results" only the news items
# whose description includes the search term:
for item in results:
    if item not in final_results and (search_term in item["desc"]):
        final_results.append(item)

# Build and show the final dataframe:
df = pd.DataFrame(final_results)
df
Keep in mind that you still might not get results, due to factors outside of your control.

How to query with time filters in GoogleScraper?

Even though Google's official API does not offer time information in the query results - nor any time filtering for keywords - there is a time filtering option in the advanced search:
Google results for stackoverflow in the last one hour
The GoogleScraper library offers many flexible options, but nothing time-related. How can time features be added using the library?
After a bit of inspection, I've found that Google sends the time filtering information via a qdr value in the tbs parameter (which possibly means "time based search", although that is not officially stated):
https://www.google.com/search?tbs=qdr:h1&q=stackoverflow
This gets the results for the past hour. The letters m and y can be used for months and years respectively.
Also, to add sorting by date, add the sbd value (presumably "sort by date") as well:
https://www.google.com/search?tbs=qdr:h1,sbd:1&q=stackoverflow
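Outside of GoogleScraper, the same query can be illustrated with a plain requests call, just to show how the tbs value is composed (Google may of course block or captcha unauthenticated scraping):
import requests

params = {"q": "stackoverflow", "tbs": "qdr:h1,sbd:1"}  # past hour, sorted by date
response = requests.get("https://www.google.com/search", params=params,
                        headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)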
I was able to insert these keywords into the base Google URL of GoogleScraper. Insert the lines below at the end of the get_base_search_url_by_search_engine() method (just before return) in scraping.py:
if("google" in str(specific_base_url)):
specific_base_url = "https://www.google.com/search?tbs=qdr:{},sbd:1".format(config.get("time_filter", ""))
Now use the time_filter option in your config:
from GoogleScraper import scrape_with_config

config = {
    'use_own_ip': True,
    'keyword_file': "keywords.txt",
    'search_engines': ['google'],
    'num_pages_for_keyword': 2,
    'scrape_method': 'http',
    "time_filter": "d15"  # up to 15 days ago
}
search = scrape_with_config(config)
The results will now only include hits from that time range. Additionally, text snippets in the results will carry raw date information:
one_sample_result = search.serps[0].links[0]
print(one_sample_result.snippet)
4 mins ago It must be pretty easy - let propertytotalPriceOfOrder =
order.items.map(item => +item.unit * +item.quantity * +item.price);.
where order is your entire json object.

Script for a changing URL

I am having a bit of trouble coding a process or script that would do the following:
I need to get data from this URL:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
But the file URLs change (the days and model runs change), so the script has to assume this base structure, with variables:
Y - Year
M - Month
D - Day
C - Model Forecast/Initialization Hour
F - Model Frame Hour
Like so:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
This script would run and then fill in the current date (as YYYYMMDD, plus CC) for those variables.
So while the mission is to get
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
these variables correspond to the current date, in the format of:
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
Can you please advise how to go about getting the URLs for the latest date in this format? Whether it's a script or something with wget, I'm all ears. Thank you in advance.
In Python, the requests library can be used to fetch the URLs.
You can generate the URL by combining the base URL string with a timestamp, using the datetime class and its timedelta and strftime methods to produce the date in the required format.
i.e. start by getting the current time with datetime.datetime.now(), then in a loop subtract an hour (or whichever time gradient you think they're using) via timedelta and keep checking the URL with the requests library. The first one you find that exists is the latest one, and you can then do whatever further processing you need with it.
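A hedged sketch of that loop, using the URL pattern from the question; the 6-hour step between model runs and the use of the HTTP status code to detect an existing run are assumptions you may need to adjust:
import datetime
import requests

base = "http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd{date}/gfs_hd_{run}z"

# Start from the most recent 6-hourly run boundary (00, 06, 12, 18 UTC).
t = datetime.datetime.utcnow()
t = t.replace(hour=(t.hour // 6) * 6, minute=0, second=0, microsecond=0)

for _ in range(12):  # look back up to three days
    url = base.format(date=t.strftime("%Y%m%d"), run=t.strftime("%H"))
    if requests.get(url, timeout=10).status_code == 200:
        print("Latest available run:", url)
        break
    t -= datetime.timedelta(hours=6)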
If you need to scrape the contents of the page, scrapy works well for that.
I'd try scraping the index one level up at http://nomads.ncep.noaa.gov/dods/gfs_hd ; the last link-of-particular-form there should take you to the daily downloads pages, where you could do something similar.
Here's an outline of scraping the daily downloads page:
import BeautifulSoup
import urllib

grdd = urllib.urlopen('http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140522')
soup = BeautifulSoup.BeautifulSoup(grdd)
datalinks = 'http://nomads.ncep.noaa.gov:80/dods/gfs_hd/gfs_hd'
for link in soup.findAll('a'):
    if link.get('href').startswith(datalinks):
        print('Suitable link: ' + link.get('href')[len(datalinks):])
        # Figure out if you already have it, choose if you want info, das, dds, etc etc.
and scraping the page with the last thirty would, of course, be very similar.
The easiest solution would be just to mirror the parent directory:
wget -np -m -r http://nomads.ncep.noaa.gov:9090/dods/gfs_hd
However, if you just want the latest date, you can use Mojo::UserAgent as demonstrated in Mojocast Episode 5:
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'http://nomads.ncep.noaa.gov:9090/dods/gfs_hd';
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;
my @links = $dom->find('a')->attr('href')->each;
my @gfs_hd = reverse sort grep {m{gfs_hd/}} @links;
print $gfs_hd[0], "\n";
On May 23rd, 2014, this outputs:
http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/gfs_hd20140523
