Only scrape Google News articles containing an exact phrase in Python

I'm trying to build a media tracker in Python that each day returns all Google News articles containing a specific phrase, "Center for Community Alternatives". If, on a given day, there are no new news articles that contain this exact phrase, then no new links should be added to the data frame. The problem I am having is that even on days when there are no news articles containing my phrase, my code adds articles with similar phrases to the data frame. How can I only append links that contain my exact phrase?
Below is example code looking at 03/01/2022:
from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd
googlenews=GoogleNews(start='03/01/2022',end='03/01/2022')
googlenews.search('"' + "Center for Community Alternatives" + '"')
googlenews.getpage(1)
result=googlenews.result()
df=pd.DataFrame(result)
df
Even though searching "Center for Community Alternatives" (with quotes around it) in Google News for this specific date shows No results found for "center for community alternatives", the code scrapes the links that appear below that message, which are Results for center for community alternatives (without quotes).

The API you're using does not support exact match.
In https://github.com/Iceloof/GoogleNews/blob/master/GoogleNews/__init__.py:
def search(self, key):
    """
    Searches for a term in google.com in the news section and retrieves the first page into __results.

    Parameters:
    key = the search term
    """
    self.__key = "+".join(key.split(" "))
    if self.__encode != "":
        self.__key = urllib.request.quote(self.__key.encode(self.__encode))
    self.get_page()
As an alternative, you could probably just filter your data frame for an exact match after the fact:
df = df[df['title or some column'].str.contains('Center for Community Alternatives', regex=False)]
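A slightly fuller version of that post-filter, as a sketch: it assumes the result DataFrame has the title and desc columns that GoogleNews currently returns (adjust the names if your version of the library differs).
phrase = "Center for Community Alternatives"

# Keep only rows whose title or description contains the exact phrase.
# regex=False matches the phrase literally; na=False drops rows with missing text.
mask = (
    df["title"].str.contains(phrase, case=False, regex=False, na=False)
    | df["desc"].str.contains(phrase, case=False, regex=False, na=False)
)
df = df[mask]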

Probably you're not getting any results because:
There are no search results that match your search term - "Center for Community Alternatives" - within the date range in your question - 03/01/2022.
If you remove the double quotes from the search term AND increase the date range, you might get some results - that will depend entirely on how actively the source posts news and how Google handles such topics.
What I suggest is to change your code as follows:
Keep the search term - Center for Community Alternatives - without double quotes
Apply a longer date range to the search
Get only distinct values - while testing this code, I got duplicate entries
Get more than one page to increase the chances of getting results
Code:
#!pip install GoogleNews # https://pypi.org/project/GoogleNews/
#!pip install newspaper3k # https://pypi.org/project/newspaper3k/
from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd

search_term = "Center for Community Alternatives"

# I suppose the dates are in "MM/dd/yyyy" format...
googlenews = GoogleNews(lang='en', region='US', start='03/01/2022', end='03/03/2022')
googlenews.search(search_term)

# Initial list of results - it will contain a list of dictionaries (dict).
results = []
# Contains the final results = news filtered by the criteria
# (news whose description contains the search term).
final_results = []

# Get the first three pages of results and append them to the list - you can
# use any other range according to your needs.
# Consider adding a timer between calls to avoid "HTTP Error 429: Too Many Requests"
# (a minimal sleep example follows this answer).
for page in range(1, 4):
    googlenews.getpage(page)
    results.extend(googlenews.result())

# Remove duplicates and keep in "final_results" only the news
# whose description includes the search term:
for item in results:
    if item not in final_results and (search_term in item["desc"]):
        final_results.append(item)

# Build and show the final dataframe:
df = pd.DataFrame(final_results)
df
Keep in mind that you still might not get results, due to factors outside of your control.
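For the timer mentioned in the comment inside the loop above, a minimal sketch; the 5-second pause is an arbitrary choice of mine, not a documented limit:
import time

results = []
for page in range(1, 4):
    googlenews.getpage(page)
    results.extend(googlenews.result())
    time.sleep(5)  # pause between page requests to reduce the chance of HTTP 429 responses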

Related

Tweet scraping while using Twint

I am doing some research on sentiment analysis of tweets. I have been using Twint to scrape tweets from selected cities, and I was getting more tweets that way than when I scraped tweets for the whole world for the same hashtag over a duration of 5 years, from 2010 to 2015. I was not able to understand why Twint behaves like that. Here is my code:
import twint
import pandas as pd
import nest_asyncio

nest_asyncio.apply()

cities = ['Hyderabad', 'Mumbai', 'Kolkata', 'Vijayawada', 'Warangal', 'Visakhapatnam']
unique_cities = set(cities)     # To get unique cities of the country
cities = sorted(unique_cities)  # Sort & convert back to a list

for city in cities:
    print(city)
    config = twint.Config()
    config.Search = "#MarutiSuzuki"
    config.Lang = "en"
    config.Near = city
    config.Limit = 1000000
    config.Since = "2010-01-01"   # plain hyphens; en dashes here break date parsing
    config.Until = "2015-12-01"
    config.Store_csv = True
    config.Output = "my_finding.csv"
    twint.run.Search(config)
Maybe Twitter has a limit on the number of tweets it returns when you search globally; for example, it may only show X entries, but when you narrow the search down to a specific location it shows the maximum amount for that area. For instance, Amazon only shows 400 pages of a searched item even though there may be more; likewise, if you make the search more specific it may show more items than the broader search did.

How to query with time filters in GoogleScraper?

Even though Google's official API does not offer time information in the query results - nor any time filtering for keywords - there is a time-filtering option in the advanced search:
Google results for stackoverflow in the last one hour
The GoogleScraper library offers many flexible options, but no time-related ones. How can I add time features using the library?
After a bit of inspection, I've found that Google sends the time-filtering information as a qdr value in the tbs key (which possibly means "time-based search", although that is not officially stated):
https://www.google.com/search?tbs=qdr:h1&q=stackoverflow
This gets the results for the past hour. The m and y letters can be used for months and years respectively.
Also, to add sorting by date, add the sbd value (presumably "sort by date") as well:
https://www.google.com/search?tbs=qdr:h1,sbd:1&q=stackoverflow
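As a quick way to sanity-check these parameters outside of GoogleScraper, you can fetch the same URL with a plain HTTP request. A minimal sketch; the tbs value is taken from the URLs above rather than any documented API, and Google may still block or captcha automated requests:
import requests

# Same tbs value as the URL above: qdr:h1 = past hour, sbd:1 = sort by date.
url = "https://www.google.com/search?tbs=qdr:h1,sbd:1&q=stackoverflow"
headers = {"User-Agent": "Mozilla/5.0"}  # a browser-like User-Agent; the default one is often blocked

response = requests.get(url, headers=headers)
print(response.status_code)
print(len(response.text), "bytes of HTML returned")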
I was able to insert these keywords into the base Google URL used by GoogleScraper. Insert the lines below at the end of the get_base_search_url_by_search_engine() method (just before the return) in scraping.py:
if("google" in str(specific_base_url)):
specific_base_url = "https://www.google.com/search?tbs=qdr:{},sbd:1".format(config.get("time_filter", ""))
Now use the time_filter option in your config:
from GoogleScraper import scrape_with_config

config = {
    'use_own_ip': True,
    'keyword_file': "keywords.txt",
    'search_engines': ['google'],
    'num_pages_for_keyword': 2,
    'scrape_method': 'http',
    "time_filter": "d15"  # up to 15 days ago
}
search = scrape_with_config(config)
Results will now only include the given time range. Additionally, text snippets in the results will contain raw date information:
one_sample_result = search.serps[0].links[0]
print(one_sample_result.snippet)
4 mins ago It must be pretty easy - let propertytotalPriceOfOrder =
order.items.map(item => +item.unit * +item.quantity * +item.price);.
where order is your entire json object.

Getting more than 100 days of data web scraping Yahoo

Like many others I have been looking for an alternative source of stock prices now that the Yahoo and Google APIs are defunct. I decided to take a try at web scraping the Yahoo site from which historical prices are still available. I managed to put together the following code which almost does what I need:
import urllib.request as web
import bs4 as bs
import pandas as pd

def yahooPrice(tkr):
    tkr = tkr.upper()
    url = 'https://finance.yahoo.com/quote/' + tkr + '/history?p=' + tkr
    sauce = web.urlopen(url)
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    allrows = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        if len(row) == 7:   # keep only full price rows (dividend rows have fewer cells)
            allrows.append(row)
    vixdf = pd.DataFrame(allrows).iloc[0:-1]
    vixdf.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Aclose', 'Volume']
    vixdf.set_index('Date', inplace=True)
    return vixdf
which produces a dataframe with the information I want. Unfortunately, even though the actual web page shows a full year's worth of prices, my routine only returns 100 records (including dividend records). Any idea how I can get more?
The Yahoo Finance API was deprecated in May 2017, I believe. There are not too many options for downloading time-series data for free now, at least that I know of. Nevertheless, there is always some kind of alternative. Check out the URL below to find a tool to download historical prices.
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
See this too.
https://blog.quandl.com/api-for-stock-data
I don't have an exact solution to your question, but I have a workaround (I had the same problem and hence used this approach). Basically, you can use the BDay() offset from pandas.tseries.offsets and collect the data a fixed number of business days at a time. In my case, I ran the loop three times to get 300 business days of data, knowing that 100 was the maximum I was getting by default.
Basically, you run the loop three times and set the BDay() offsets so that the first iteration grabs the most recent 100 business days, the next grabs the 100 days before that (200 days from now), and the last grabs the 100 days before those (300 days from now). The whole point is that at any given time you can only scrape 100 days of data, so even if you request 300 days in one go you may not get 300 days of data - your original problem (possibly Yahoo limits the amount of data extracted in one go). I have my code here: https://github.com/ee07kkr/stock_forex_analysis/tree/dataGathering
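A minimal sketch of that windowing idea with BDay; it only shows how the three 100-business-day windows could be computed, and the scrape_yahoo helper mentioned in the comment is hypothetical:
import pandas as pd
from pandas.tseries.offsets import BDay

today = pd.Timestamp.today().normalize()

# Build three consecutive 100-business-day windows, most recent first.
windows = []
end = today
for _ in range(3):
    start = end - BDay(100)
    windows.append((start, end))
    end = start

for start, end in windows:
    print(f"scrape prices from {start.date()} to {end.date()}")
    # frames.append(scrape_yahoo('AAPL', start, end))  # hypothetical per-window scraper call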
Note: the CSV files for some reason are not working with the \t delimiter in my case, but you can still use the data frame. One more issue I currently have is that 'Volume' is a string instead of a float; the way to get around it is:
apple = pd.read_csv('AAPL.csv', sep='\t')  # DataFrame.from_csv is deprecated in recent pandas
apple['Volume'] = apple['Volume'].str.replace(',', '').astype(float)
First - Run the code below to get your 100 days.
Then - Use SQL to insert the data into a small database (sqlite3 is pretty easy to use with Python); a sketch of that step follows the code.
Finally - Amend the code below to fetch daily prices, which you can then add to grow your database.
from pandas import DataFrame
import bs4
import requests

def function():
    url = 'https://uk.finance.yahoo.com/quote/VOD.L/history?p=VOD.L'
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    headers = soup.find_all('th')
    rows = soup.find_all('tr')
    # Collect the cell text of every table row on the page.
    ts = [[td.getText() for td in rows[i].find_all('td')] for i in range(len(rows))]
    data = []
    days = 100
    num = 0
    # Walk through up to 100 rows (one per trading day) and print the date of each.
    while days > 0 and num < len(ts):
        row = ts[num]
        if row:                  # skip rows with no <td> cells (e.g. the header row)
            data.append(row)
            now = str(row[0])    # the first cell in each row is the date
            print(now, row)
            days -= 1
        num += 1
    return DataFrame(data)
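For the "insert the data into a small db" step, a minimal sqlite3 sketch; the table layout and names are illustrative and assume the seven-column rows scraped above:
import sqlite3

def save_rows(rows):
    """Store scraped rows of the form [date, open, high, low, close, adj_close, volume]."""
    conn = sqlite3.connect("prices.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prices
           (date TEXT PRIMARY KEY, open TEXT, high TEXT, low TEXT,
            close TEXT, adj_close TEXT, volume TEXT)"""
    )
    # INSERT OR IGNORE skips dates already stored, so a daily run can simply re-append.
    conn.executemany(
        "INSERT OR IGNORE INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)",
        [r for r in rows if len(r) == 7],  # skip dividend/short rows
    )
    conn.commit()
    conn.close()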

Discogs API => How to retrieve genre?

I've crawled a tracklist of 36,000 songs that have been played on the Danish national radio station P3. I want to do some statistics on how frequently each genre has been played within this period, so I figured the Discogs API might help with labeling each track with a genre. However, the documentation for the API doesn't seem to include an example of querying the genre of a particular song.
I have a CSV file with 3 columns: Artist, Title & Test (Test is where I want the API to label each song with the genre).
Here's a sample of the script I've built so far:
import json
import pandas as pd
import requests
import discogs_client

d = discogs_client.Client('ExampleApplication/0.1')
d.set_consumer_key('key-here', 'secret-here')

input = pd.read_csv('Desktop/TEST.csv', encoding='utf-8', error_bad_lines=False)
df = input[['Artist', 'Title', 'Test']]
df.columns = ['Artist', 'Title', 'Test']

for i in range(0, len(list(df.Artist))):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)

df.to_csv('Desktop/TEST2.csv', encoding='utf-8', index=False)
This script has been working with a dummy file with 3 records in it so far, mapping the artist of a given ID. But as soon as the file gets larger (e.g. 2000 records), it returns an HTTPError when it cannot find the artist.
I have some questions regarding this approach:
1) Would you recommend using the search query function in the API to retrieve a variable such as 'Genre', or do you think it is possible to retrieve the genre with a 'd.' function from the API?
2) Will I need to acquire an API key? I have successfully mapped the 3 records without an API key so far. It looks like the key is free, though.
Here's the guide I have been following:
https://github.com/discogs/discogs_client
And here's the documentation for the API:
https://www.discogs.com/developers/#page:home,header:home-quickstart
Maybe you need to re-read the discogs_client examples; I am not an expert myself, just a newbie trying to use this API.
AFAIK, g = d.artist(x) fails because x must be an integer, not a string.
So you must first do a search, then get the artist ID, then call d.artist(artist_id).
Sorry for not providing an example; I am a Python newbie right now ;)
Also, have you checked AcoustID?
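A minimal sketch of that search-then-fetch flow (not from the answer above; it assumes the discogs_client search API behaves as in the library's README, and the artist name is just an example):
# Search for the artist by name, take the best match's numeric ID,
# then fetch the full artist object with d.artist(), which expects an integer.
results = d.search("Aphex Twin", type="artist")
try:
    artist_id = results[0].id
except IndexError:
    artist_id = None

if artist_id is not None:
    artist = d.artist(artist_id)
    print(artist.name)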
It's probably a rate limit.
Read the status code of your response; you should find a 429 Too Many Requests.
Unfortunately, if that's the case, the only solution is to add a sleep to your code so that you make one request per second.
Check out the API doc:
http://www.discogs.com/developers/#page:home,header:home-rate-limiting
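A minimal sketch of such throttling; the helper name and the one-second pause are mine, based on the rate limit described above:
import time

def call_with_throttle(func, *args, pause=1.0, retries=3, **kwargs):
    """Call a discogs_client method, sleeping between attempts to stay under the rate limit."""
    for attempt in range(retries):
        try:
            result = func(*args, **kwargs)
            time.sleep(pause)                      # roughly one request per second
            return result
        except Exception as exc:                   # e.g. an HTTP 429 Too Many Requests error
            print(f"attempt {attempt + 1} failed: {exc}")
            time.sleep(pause * (attempt + 2))      # back off a little before retrying
    return None

# Usage: release = call_with_throttle(d.release, 774004)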
I found this guide:
https://github.com/neutralino1/discogs_client.
Access the api with your key and try something like:
d = discogs_client.Client('something.py', user_token=auth_token)
release = d.release(774004)
genre = release.genres
If you found a better solution please share.
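To address the original question (looking up the genre from an artist/title pair rather than a known release ID), a hedged sketch using the client's search method; it assumes the Artist and Title columns from the CSV described in the question, and the match quality depends on how clean those strings are:
import time

def genre_for_track(client, artist, title):
    """Search Discogs for a release matching artist/title and return its genres (or None)."""
    results = client.search(title, artist=artist, type="release")
    try:
        release = results[0]          # best-matching release, as in the library README examples
    except IndexError:
        return None
    time.sleep(1)                     # stay within the rate limit discussed above
    return release.genres             # e.g. ['Electronic', 'Pop']

# df['Test'] = [genre_for_track(d, a, t) for a, t in zip(df['Artist'], df['Title'])]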

How to access pubmed data for forms with multiple pages in python crawler

I am trying to crawl PubMed with Python and get the PubMed IDs of all papers that cite a given article.
For example this article (ID: 11825149)
http://www.ncbi.nlm.nih.gov/pubmed/11825149
Has a page linking to all articles that cite it:
http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=11825149
The problem is it has over 200 links but only shows 20 per page. The 'next page' link is not accessible by url.
Is there a way to open the 'send to' option or view the content on the next pages with python?
How I currently open pubmed pages:
from urllib.request import urlopen

def start(seed, pageid):
    webpage = urlopen(seed).read()
    print(webpage)
    citedByPage = urlopen('http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=' + pageid).read()
    print(citedByPage)
From this I can extract all the cited by links on the first page, but how can I extract them from all pages? Thanks.
I was able to get the cited by IDs using a method from this page
http://www.bio-cloud.info/Biopython/en/ch8.html
Back in Section 8.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let’s try this for the Biopython PDB parser paper, PubMed ID 14630660:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
... LinkName="pubmed_pmc_refs", from_uid=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']
Great - eleven articles. But why hasn't the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.
So, what if (like me) you’d rather get back a list of PubMed IDs? Well we can call ELink again to translate them. This becomes a two step process, so by now you should expect to use the history feature to accomplish it (Section 8.15).
But first, taking the more straightforward approach of making a second (separate) call to ELink:
>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
... from_uid=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']
This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
Now, let’s do that all again but with the history …TODO.
And finally, don’t forget to include your own email address in the Entrez calls.
