I am building a scraper that searches Reddit comments for keywords. I am having two problems.
Problem 1. In order to build the CommentForest I need to get the submission_id from the comment so I can pull all of the comments related to that submission. I am having trouble figuring out how to get the submission id.
Problem 2. Every time I run this code it gives me new comments that include the keywords (I assume this is just because new comments have been added), BUT some of the old comments aren't showing up when I run the code again. This is supposed to pull ALL the comments that match my keyword from the subreddit. What am I doing wrong?
from psaw import PushshiftAPI
from datetime import datetime, timezone, timedelta
from dateutil.relativedelta import relativedelta
api = PushshiftAPI()
comments = api.search_comments(q='OP', subreddit='askreddit')
max_response_cache = 1000
cache = []
commentcount = 0
for c in comments:
    cache.append(c)
    commentcount += 1
    print(f' comment {commentcount}: {c.body}')
I want the code to print out the submission id related to the comments and I want to be able to pull all the comments that match my keywords.
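One possible way to approach Problem 1, sketched under the assumption that each Pushshift comment carries a `link_id` field holding the parent submission's fullname (for example `t3_abc123`). The helper name `submission_id_from_link_id` is made up for illustration, and the sample value is invented:

```python
def submission_id_from_link_id(link_id):
    """Strip the 't3_' type prefix from a submission fullname, if present."""
    prefix = "t3_"
    return link_id[len(prefix):] if link_id.startswith(prefix) else link_id


# Made-up sample value; inside the loop above this would presumably be c.link_id
print(submission_id_from_link_id("t3_abc123"))
```

If the attribute is exposed as assumed, calling `submission_id_from_link_id(c.link_id)` inside the loop would give the id needed to fetch the whole comment tree for that submission.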
I am having issues with the Python API sportsreference. I am trying to pull information for every NBA matchup on a given date. I have been able to do this for ncaab, ncaaf, and nfl, but I am finding that nba returns a dictionary with no games in it.
Current code is as follows:
from sportsreference.nba import boxscore
import sportsreference
from datetime import datetime
now = datetime.now()
box_scores_nba = sportsreference.nba.boxscore.Boxscores(now)
print(box_scores_nba.games)
Output is:
{'12-26-2020': []}
Does anyone have any idea why I am not pulling any info when there are games scheduled this day? I have been trying to read the documentation for sportsreference and am getting nowhere.
Thanks -
Edited based on Comments
Boxscores() only gives a value for the previous day and does not give any detail for today or upcoming days. This is because the website itself does not provide these details! Check here.
On the other hand you can get the schedule of a particular team using this code:
from sportsreference.nba.schedule import Schedule
houston_schedule = Schedule('HOU')
for game in houston_schedule:
    print(game.date)  # Prints the date the game was played
    print(game.result)  # Prints whether the team won or lost
Even then, I receive incorrect output in the dataset!
For example, the results of upcoming games are shown as 'win'.
In my opinion, it is better to avoid this API and go for better websites and use webscraping (unless you don't want real-time data) as I find these results very raw, confusing and misleading here and there!
I'm a nub when it comes to python. I literally just started today and have little understanding of programming. I have managed to make the following code work:
from twitter import *
config = {}
execfile("config.py", config)
twitter = Twitter(
    auth = OAuth(config["access_key"], config["access_secret"],
                 config["consumer_key"], config["consumer_secret"]))
user = "skiftetse"
results = twitter.statuses.user_timeline(screen_name = user)
for status in results:
    print "(%s) %s" % (status["created_at"],
                       status["text"].encode("ascii", "ignore"))
The problem is that it's only printing 20 results. The Twitter page I'd like to get data from has 22k posts, so something is wrong with the last line of code.
I would really appreciate help with this! I'm doing this for a sentiment analysis research project, so I need several hundred tweets to analyze. Beyond that, it'd be great if retweets and information about how many people retweeted their posts were included. I need to get better at all this, but right now I just need to meet that deadline at the end of the month.
You need to understand how the Twitter API works. Specifically, the user_timeline documentation.
By default, a request will only return 20 Tweets. If you want more, you will need to set the count parameter to, say, 50.
e.g.
results = twitter.statuses.user_timeline(screen_name = user, count = 50)
Note, count:
Specifies the number of tweets to try and retrieve, up to a maximum of 200 per distinct request.
In addition, the API will only let you retrieve the most recent 3,200 Tweets.
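To go beyond a single request, the usual pattern is to page backwards through the timeline with the `max_id` parameter. The following is a minimal sketch of that idea, not the library's own method: `fetch_timeline` and `fetch_page` are hypothetical names, and `fetch_page` stands for any callable that accepts `count` and `max_id` keyword arguments and returns a list of tweet dicts, newest first.

```python
def fetch_timeline(fetch_page, limit=3200, page_size=200):
    """Collect up to `limit` tweets by walking backwards with max_id."""
    tweets = []
    max_id = None
    while len(tweets) < limit:
        # Never ask for more than the per-request maximum or what is still needed
        kwargs = {"count": min(page_size, limit - len(tweets))}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = fetch_page(**kwargs)
        if not page:
            break  # no older tweets left
        tweets.extend(page)
        # Next request: only tweets strictly older than the oldest one seen
        max_id = page[-1]["id"] - 1
    return tweets
```

With the client from the question, this could presumably be called as `fetch_timeline(lambda **kw: twitter.statuses.user_timeline(screen_name=user, **kw), limit=500)`. Each returned status should also carry a `retweet_count` field, which covers the retweet question.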
I am working on a project where I scrape a number of blogs and save a selection of the data to a SQLite database, such as the title of the post, the date it was posted, and the content of the post.
The goal in the end is to do some fancy textual analyses, but right now I have a problem with writing the data to the database.
I work with the library pattern for Python. (the module about databases can be found here)
I am busy with the third blog now. The data from the two other blogs is already saved in the database, and for the third blog, which is similarly structured, I adapted the code.
There are several functions that are well integrated with each other, and they work fine. I also access all the data the right way; when I try it out in IPython Notebook it works fine. When I ran the code as a trial in the console for only one blog page (there are 43 altogether), it also worked and saved everything nicely in the database. But when I ran it again for all 43 pages, it threw a date error.
There are some comments and print statements inside the functions now, which I used for debugging. The problem seems to happen in the function parse_post_info. It passes a dictionary on to the function that goes over all blog pages and opens every single post; that function then saves the dictionary parse_post_info returns IF it is not None. I think it IS empty, because something about the date format goes wrong.
Also, why does the code work once, while the same code throws a DateError the second time:
DateError: unknown date format for '2015-06-09T07:01:55+00:00'
Here is the function:
from pattern.db import Database, field, pk, date, STRING, INTEGER, BOOLEAN, DATE, NOW, TEXT, TableError, PRIMARY, eq, all
from pattern.web import URL, Element, DOM, plaintext
def parse_post_info(p):
    """ This function receives a post Element from the post list and
    returns a dictionary with post url, post title, labels, date.
    """
    try:
        post_header = p("header.entry-header")[0]
        title_tag = post_header("a < h1")[0]
        post_title = plaintext(title_tag.content)
        print post_title
        post_url = title_tag("a")[0].href
        date_tag = post_header("div.entry-meta")[0]
        post_date = plaintext(date_tag("time")[0].datetime).split("T")[0]
        #post_date = date(post_date_text)
        print post_date
        post_id = int(((p).id).split("-")[-1])
        post_content = get_post_content(post_url)
        labels = " "
        print labels
        return dict(blog_no=blog_no,
                    post_title=post_title,
                    post_url=post_url,
                    post_date=post_date,
                    post_id=post_id,
                    labels=labels,
                    post_content=post_content
                    )
    except:
        pass
The date() function returns a new Date, a convenient subclass of Python's datetime.datetime. It takes an integer (Unix timestamp), a string or NOW.
Note that the result may differ from your local time.
Also, the format it expects is "YYYY-MM-DD hh:mm:ss".
How to convert between time formats can be found here
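As an illustration of that conversion, here is a minimal sketch that turns the timestamp from the error message into the "YYYY-MM-DD hh:mm:ss" form using only the standard library. It assumes Python 3.7 or newer (for `datetime.fromisoformat`); `to_default_format` is a made-up helper name, and whether `date()` then accepts the result is taken from the answer above rather than verified here.

```python
from datetime import datetime, timezone


def to_default_format(iso_string):
    """Convert an ISO 8601 timestamp to 'YYYY-MM-DD hh:mm:ss' in UTC."""
    parsed = datetime.fromisoformat(iso_string)
    if parsed.tzinfo is not None:
        # Normalise to UTC so the dropped offset cannot shift the time
        parsed = parsed.astimezone(timezone.utc)
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(to_default_format("2015-06-09T07:01:55+00:00"))  # 2015-06-09 07:01:55
```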
I've crawled a tracklist of 36,000 songs which have been played on the Danish national radio station P3. I want to do some statistics on how frequently each of the genres has been played within this period, so I figured the discogs API might help with labeling each track with a genre. However, the documentation for the API doesn't seem to include an example for querying the genre of a particular song.
I have a CSV file with 3 columns: Artist, Title & Test (Test is where I want the API to label each song with the genre).
Here's a sample of the script I've built so far:
import json
import pandas as pd
import requests
import discogs_client
d = discogs_client.Client('ExampleApplication/0.1')
d.set_consumer_key('key-here', 'secret-here')
input = pd.read_csv('Desktop/TEST.csv', encoding='utf-8',error_bad_lines=False)
df = input[['Artist', 'Title', 'Test']]
df.columns = ['Artist', 'Title','Test']
for i in range(0, len(list(df.Artist))):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)
df.to_csv('Desktop/TEST2.csv', encoding='utf-8', index=False)
This script has been working so far with a dummy file with 3 records in it, mapping the artist for a given ID#. But as soon as the file gets larger (e.g. 2000 records), it returns an HTTPError when it cannot find the artist.
I have some questions regarding this approach:
1) Would you recommend using the search query function in the API for retrieving a variable such as 'Genre'? Or do you think it is possible to retrieve Genre with a 'd.' function from the API?
2) Will I need to acquire an API key? I have successfully mapped the 3 records without an API key so far. It looks like the key is free, though.
Here's the guide I have been following:
https://github.com/discogs/discogs_client
And here's the documentation for the API:
https://www.discogs.com/developers/#page:home,header:home-quickstart
Maybe you need to re-read the discogs_client examples. I am not an expert myself, just a newbie trying to use this API.
AFAIK, g = d.artist(x) fails because x must be an integer, not a string.
So you must first do a search, then get the artist id, then call d.artist(artist_id).
Sorry for not providing an example, I am a Python newbie right now ;)
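The search-then-lookup idea above might look roughly like this. It is a sketch under assumptions: it relies on `client.search(name, type='artist')` as shown in the discogs_client README, `find_artist` is a hypothetical helper name, and searching may require an authenticated client.

```python
def find_artist(client, name):
    """Return the first artist search hit for `name`, or None if nothing matches."""
    results = client.search(name, type="artist")
    for artist in results:
        # First hit only; it should expose .id for a later client.artist(id) call
        return artist
    return None
```

With the client from the question this would be `find_artist(d, df.Artist[i])`; checking the result for `None` before using it avoids failing on artists that cannot be found.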
Also have you checked acoustid for
It's probably a rate limit.
Read the status code of your response; you should find a 429 Too Many Requests.
Unfortunately, if that's the case, the only solution is to add a sleep to your code so that it makes one request per second.
Check out the API doc:
http://www.discogs.com/developers/#page:home,header:home-rate-limiting
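A minimal sketch of the sleep-and-retry idea, written against a generic callable so it does not depend on any particular client. `call_with_retry` is a made-up name, and the check on `status_code` is an assumption about how the raised error reports the HTTP status.

```python
import time


def call_with_retry(func, *args, retries=3, delay=1.0, sleep=time.sleep):
    """Call func(*args), waiting and retrying when a 429 status is reported."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except Exception as exc:
            # Assumption: the error exposes the HTTP status as .status_code
            if getattr(exc, "status_code", None) != 429 or attempt == retries:
                raise
            sleep(delay * (attempt + 1))  # wait a little longer each time
```

In the loop from the question this could replace the direct call, for example `g = call_with_retry(d.artist, artist_id)`. Any error other than a 429 is re-raised unchanged.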
I found this guide:
https://github.com/neutralino1/discogs_client.
Access the api with your key and try something like:
d = discogs_client.Client('something.py', user_token=auth_token)
release = d.release(774004)
genre = release.genres
If you found a better solution please share.
I am trying to export a category from Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version
link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"
def get(pages=[], category = False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent":"Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text
print get(category="Matematik")
Since I am trying to get data from Turkish Wikipedia, I have used its URL. Everything else should be self-explanatory. Instead of the actual XML, I am getting the form page that you can use to export data. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should be in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So, if you use catname correctly, it will return the form again, this time with the pages from the category filled in. Then you are supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier if you used the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
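For reference, listing a category through the MediaWiki API directly could be sketched as below. This is an illustration, not a tested client: it uses the `list=categorymembers` query module, `category_params` and `parse_members` are made-up helper names, and the canonical `Category:` prefix is assumed to be accepted on Turkish Wikipedia as well.

```python
API_URL = "https://tr.wikipedia.org/w/api.php"


def category_params(category, cmcontinue=None):
    """Build query parameters for one list=categorymembers request."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        # Continuation token returned by the previous response
        params["cmcontinue"] = cmcontinue
    return params


def parse_members(data):
    """Return (page titles, continuation token or None) from one JSON response."""
    members = data.get("query", {}).get("categorymembers", [])
    titles = [m["title"] for m in members]
    token = data.get("continue", {}).get("cmcontinue")
    return titles, token
```

A request would then be something like `requests.get(API_URL, params=category_params("Matematik")).json()`, passed to `parse_members` and repeated with the returned token until it is `None`. The collected titles can be sent to Special:Export in the `pages` parameter, as the code in the question already supports.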
Sorry, my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems that the code you have above is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the categories to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia actually requires that the pages you wish to download be specified explicitly. Granted, their documentation doesn't appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages of other pages.
As an aside, it could possibly be faster to use the actual corpus of data from Wikipedia which is available for download.
Good luck!