TL;DR: How do I access the FULL tweet body with JSON?
Hello
I have a problem finding the full text of a tweet in JSON.
I am making a Python app with tweepy. I would like to take a status and then access its text.
EDIT
I used user_timeline() to get a tweet_list, then took one tweet from it like this:
tweet=tweet_list[index]._json
Now when I do this:
tweet['text']
it returns a shortened tweet with a link to the original.
eg:
Unemployment for Black Americans is the lowest ever recorded. Trump
approval ratings with Black Americans has doubl…
(the shortened link, couldn't directly link due to stackoverflow rules)
I want to return this:
Unemployment for Black Americans is the lowest ever recorded. Trump
approval ratings with Black Americans has doubled. Thank you, and it
will get even (much) better! #FoxNews
I don't mind if the link is added as long as the full tweet is shown
Okay, after looking a bit more, I believe it is impossible to do this directly with JSON.
There is a solution for getting the full tweet; you can see it here.
The problem with the answer above is that full_text gives you only a string. If you need the object in its initial state so you can later use its JSON to get other info, do the following:
Use tweet_mode="extended" in user_timeline() and save the result in tweet_list, e.g.:
tweet_list = api.user_timeline("user", count=10, tweet_mode="extended")
Take a single tweet like this: tweet = tweet_list[0]
If you want the full text of the tweet, do this: tweet.full_text
If you need a JSON version of the object, do this: jtweet = tweet._json, or just access a key like this: tweet._json['id']
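If you would rather stay with the JSON dict, here is a minimal sketch of a helper that picks the fullest text it can find. It assumes the v1.1 payload shapes mentioned in this thread: a "full_text" key when the request used tweet_mode="extended", a nested "extended_tweet" object for streamed tweets, and plain "text" otherwise. The function name and the sample dicts are made up for illustration.

```python
def full_text_from_json(tweet_json):
    """Return the most complete text found in a tweet's JSON dict."""
    # Streamed tweets nest the untruncated body under "extended_tweet".
    extended = tweet_json.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    # tweet_mode="extended" payloads carry "full_text";
    # compatibility-mode payloads only have the (possibly truncated) "text".
    return tweet_json.get("full_text", tweet_json.get("text", ""))


# Hypothetical sample payloads, trimmed to the relevant keys.
compat = {"id": 1, "text": "a long tweet that was cut sho…"}
extended_mode = {"id": 1, "full_text": "a long tweet that was cut short here"}

print(full_text_from_json(compat))
print(full_text_from_json(extended_mode))
```

With the steps above you would call it as full_text_from_json(tweet._json).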
Hope that helps
You didn't provide any information about how you want to achieve your goal. Looking at the tweepy API, there is an optional flag argument, full_text, which you can pass to the function (see the get direct message function).
It defaults to False, which causes returned messages to be shortened to 140 characters. Just set it to True and see what happens.
Basically, I want to get the conversation_id if the Tweet is a reply to another Tweet, so I can collect the replies in a conversation and analyze them.
My code:
from tweepy import StreamingClient

class Listener(StreamingClient):
    def on_response(self, response):
        print(response)

listener = Listener(auth['bearer_token'])
listener.sample(expansions=['in_reply_to_user_id'], tweet_fields=['conversation_id'])
When using this, I only get the user_id to which it is replying, but I cannot get any type of conversation_id.
I have a slight feeling I am missing something essential.
From the relevant FAQ section about this in Tweepy's documentation:
If you are simply printing the objects and looking at that output, the string representations of API v2 models/objects only include the default fields that are guaranteed to exist.
The objects themselves still include the relevant data, which you can access as attributes or by subscription.
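To illustrate the difference between what gets printed and what is actually there, here is a small offline sketch. FakeTweet is a hypothetical stand-in for tweepy's Tweet model (it is not the library's class); it only mimics the behaviour described in the FAQ quote: a short string representation, with the requested fields still reachable as attributes or by subscription.

```python
class FakeTweet:
    """Hypothetical stand-in for tweepy.Tweet, so the sketch runs offline."""

    def __init__(self, data):
        self.data = data  # full payload, as received
        self.id = data["id"]
        self.text = data["text"]
        self.conversation_id = data.get("conversation_id")

    def __getitem__(self, key):
        # Subscription reads from the same payload as the attributes.
        return self.data[key]

    def __repr__(self):
        # Only the default fields appear when the object is printed.
        return f"<Tweet id={self.id} text={self.text!r}>"


def conversation_id_of(tweet):
    """Read conversation_id by attribute, falling back to subscription."""
    value = getattr(tweet, "conversation_id", None)
    return value if value is not None else tweet["conversation_id"]


tweet = FakeTweet({"id": "2", "text": "a reply", "conversation_id": "1"})
print(tweet)                      # the printed form omits conversation_id
print(conversation_id_of(tweet))  # the field is still available
```

In the streaming code above, the same idea would mean reading the field from the tweet object itself (for example inside an on_tweet(self, tweet) override) rather than judging by what print(response) shows.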
I'm looking to create a program (right now I'm just trying to see how far I can take it) that retrieves NYC 311 Department of Buildings complaints. I found the API documentation online and was able to search by complaint number, but as far as my program is concerned that defeats the purpose: I want to be able to search by address to see if there is an active complaint, since someone wouldn't know their complaint number if they haven't been notified of it. Here is an example of a search with the complaint number, which works:
import requests

comNum = "4830407"
response = requests.get('https://data.cityofnewyork.us/resource/eabe-havv.json?complaint_number=%s' % comNum)
OK, so in the API documentation there are options for zip_code=, house_number=, and house_street=.
When I attempt to add these to the URL like in this example:
responseAddress = requests.get('https://data.cityofnewyork.us/resource/eabe-havv.json?zip_code=11103&house_number=123&house_street=50street')
nothing is returned. Even if I eliminate, say, zip_code and house_number, it prints back an empty list like so: []
I want this program to be searchable by address, but I can't seem to get the URL to work the way I'm trying to. You can't possibly search for an address by only zip code or only house number.
If you look at the raw data (no parameters), looking specifically at the zip codes, the values have trailing spaces in them. You'll need to URL-encode those spaces.
This returns []: https://data.cityofnewyork.us/resource/eabe-havv.json?zip_code=11103
This does not. https://data.cityofnewyork.us/resource/eabe-havv.json?zip_code=11103%20%20%20%20
Looks like house numbers are always 12 characters long, so you could do something like this to get a left-aligned string padded with trailing spaces:
>>> "{:<12d}".format(113)
'113         '
Related: How to urlencode a querystring in Python?
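Putting the padding and the URL encoding together, a rough sketch could look like the following. The widths are assumptions taken from this answer (the working URL pads the zip code to 9 characters, and house numbers look 12 characters wide); build_complaint_url is a made-up helper name, and the padding of other fields such as house_street has not been checked.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/eabe-havv.json"

ZIP_WIDTH = 9     # assumption: "11103" plus four spaces, as in the working URL
HOUSE_WIDTH = 12  # assumption: width observed for house numbers


def build_complaint_url(zip_code=None, house_number=None):
    """Build a query URL with space-padded, percent-encoded field values."""
    params = {}
    if zip_code is not None:
        params["zip_code"] = str(zip_code).ljust(ZIP_WIDTH)
    if house_number is not None:
        params["house_number"] = str(house_number).ljust(HOUSE_WIDTH)
    # quote_via=quote encodes spaces as %20 instead of "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_complaint_url(zip_code=11103))
print(build_complaint_url(zip_code=11103, house_number=123))
```

The resulting string can be passed straight to requests.get().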
Hello I am trying to scrape the tweets of a certain user using tweepy.
Here is my code :
tweets = []
username = 'example'
count = 140  # nb of tweets
try:
    # Pulling individual tweets from query
    for tweet in api.user_timeline(id=username, count=count, include_rts = False):
        # Adding to list that contains all tweets
        tweets.append((tweet.text))
except BaseException as e:
    print('failed on_status,',str(e))
    time.sleep(3)
The problem I am having is the tweets are coming back unfinished with "..." at the end.
I think I've looked at all the other similar problems on Stack Overflow and elsewhere, but nothing works. Most do not apply to me because I am NOT dealing with retweets.
I have tried passing tweet_mode = 'extended' and/or using tweet.full_text or tweet._json['extended_tweet']['full_text'] in different combinations.
I don't get an error message, but nothing works; I just get an empty list in return.
And It looks like the documentation is out of date because it says nothing about the 'tweet_mode' nor the 'include_rts' parameter :
Has anyone managed to get the full text of each tweet?? I'm really stuck on this seemingly simple problem and am losing my hair so I would appreciate any advice :D
Thanks in advance!!!
TL;DR: You're most likely running into a Rate Limiting issue. And use the full_text attribute.
Long version:
First, you wrote:
The problem I am having is the tweets are coming back unfinished with "..." at the end.
From the Tweepy documentation on Extended Tweets, this is expected:
Compatibility mode
... It will also be discernible that the text attribute of the Status object is truncated as it will be suffixed with an ellipsis character, a space, and a shortened self-permalink URL to the Tweet.
With regard to:
And It looks like the documentation is out of date because it says nothing about the 'tweet_mode' nor the 'include_rts' parameter :
They haven't explicitly added it to the documentation of each method; however, they do specify that tweet_mode is accepted as a parameter:
Standard API methods
Any tweepy.API method that returns a Status object accepts a new tweet_mode parameter. Valid values for this parameter are compat and extended , which give compatibility mode and extended mode, respectively. The default mode (if no parameter is provided) is compatibility mode.
So without tweet_mode added to the call, you do get the tweets with partial text? And with it, all you get is an empty list? If you remove it and immediately retry, verify whether you still get an empty list. That is, once you get an empty list result, check whether you keep getting an empty list even after you change the params back to the ones which worked.
Based on bug #1329 - API.user_timeline sometimes returns an empty list - it appears to be a Rate Limiting issue:
Harmon758 commented on Feb 13
This API limitation would manifest itself as exactly the issue you're describing.
Even once it is working, the full text is in the full_text attribute, not the usual text. So the line
tweets.append((tweet.text))
should be
tweets.append(tweet.full_text)
(and you can skip the extra enclosing ())
Btw, if you're not interested in retweets, see this example for the correct way to handle them:
Given an existing tweepy.API object and id for a Tweet, the following can be used to print the full text of the Tweet, or if it’s a Retweet, the full text of the Retweeted Tweet:
status = api.get_status(id, tweet_mode="extended")
try:
    print(status.retweeted_status.full_text)
except AttributeError:  # Not a Retweet
    print(status.full_text)
If status is a Retweet, status.full_text could be truncated.
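Combining the two points (use full_text, and look at retweeted_status for Retweets), a loop body could be factored into a small helper like the sketch below. It is written against any object with the same attribute names as a tweepy Status, so it is tested here with SimpleNamespace stand-ins instead of a live API call; full_text_of and collect_full_texts are names invented for this example.

```python
from types import SimpleNamespace


def full_text_of(status):
    """Return the untruncated text of a Status-like object."""
    # For a Retweet, the complete body is on the retweeted status.
    original = getattr(status, "retweeted_status", None)
    if original is not None:
        status = original
    # full_text is only present when the request used tweet_mode="extended";
    # otherwise fall back to the (possibly truncated) text attribute.
    return getattr(status, "full_text", None) or status.text


def collect_full_texts(statuses):
    """Apply full_text_of to every status in an iterable."""
    return [full_text_of(status) for status in statuses]


# Stand-in objects with made-up text, mimicking extended-mode statuses.
plain = SimpleNamespace(full_text="an ordinary tweet, complete")
retweet = SimpleNamespace(
    full_text="RT @someone: the start of a long tw…",
    retweeted_status=SimpleNamespace(full_text="the start of a long tweet, complete"),
)
print(collect_full_texts([plain, retweet]))
```

With a real client this would be called on the result of api.user_timeline(..., tweet_mode="extended").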
As per the Twitter API v2:
tweet_mode does not work at all. You need to add expansions=referenced_tweets.id to the request. Then, in the response, look for includes: the tweets that appear truncated are listed there in full. You will still see the truncated tweets in the response data, but do not worry about that.
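As a rough sketch of that lookup, assuming you have the raw v2 JSON response as a dict with "data" and "includes" keys, and that each retweet in "data" carries a "referenced_tweets" entry pointing at the original: the helper name and the sample payload below are made up for illustration.

```python
def resolve_full_texts(response):
    """Map each tweet id in response["data"] to its most complete text."""
    # Index the expanded tweets by id for quick lookup.
    included = {
        tweet["id"]: tweet["text"]
        for tweet in response.get("includes", {}).get("tweets", [])
    }
    resolved = {}
    for tweet in response.get("data", []):
        text = tweet["text"]
        for ref in tweet.get("referenced_tweets", []):
            # A retweet's own text is truncated; the original is in includes.
            if ref["type"] == "retweeted" and ref["id"] in included:
                text = included[ref["id"]]
        resolved[tweet["id"]] = text
    return resolved


# Hypothetical response, trimmed to the relevant keys.
response = {
    "data": [
        {
            "id": "10",
            "text": "RT @someone: start of a long tweet…",
            "referenced_tweets": [{"type": "retweeted", "id": "7"}],
        },
        {"id": "11", "text": "a short original tweet"},
    ],
    "includes": {
        "tweets": [{"id": "7", "text": "start of a long tweet that continues to the very end"}]
    },
}
print(resolve_full_texts(response))
```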
I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimensionIndexMap we can see
"B01KWIUH5M":[0,0]
This means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the size_name section of variationValues) and the color 'Teal' (same idea as before).
I want to scrape both the variationValues and the asinToDimensionIndexMap, so I can map the IndexMap numbers to the variationValues entries.
Another person on the site (thanks for the help btw) suggested doing it this way.
import json
import re

script = response.xpath('//script/text()').extract_first()
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is there a way to get, from that data, 2 dictionaries or lists with the variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to get some data out of a big string)? Or could someone explain a little bit what happens in the json part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal text format for storing object information. In other words, json.loads parses string data into object data (in Python, dicts and lists), regardless of the platform that produced it.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming the challenge is the errors you get when accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g., if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another", which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
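If you'd rather inspect the structure from Python than from an online viewer, a small recursive walk can list every path down to a leaf value. This is only a sketch; iter_paths is a made-up helper, and the sample JSON is trimmed to the one path mentioned above.

```python
import json


def iter_paths(node, path=()):
    """Yield (path, value) for every leaf in a parsed JSON structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_paths(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_paths(child, path + (index,))
    else:
        yield path, node


# Trimmed sample shaped like the object discussed above.
raw = '{"updateDivLists": {"full": [{"divToUpdate": "companyCompliancePolicies_feature_div"}]}}'
d = json.loads(raw)

for path, value in iter_paths(d):
    # Each path element is a dict key or a list index, in access order.
    print(path, "->", value)
```

Each printed path tells you the chain of keys and indexes to use, e.g. ('updateDivLists', 'full', 0, 'divToUpdate') corresponds to d['updateDivLists']['full'][0]['divToUpdate'].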
import re

# script is expected to be a list of strings, e.g. response.xpath('//script/text()').extract()
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON and use or combine them as you wish.
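As an alternative sketch that avoids guessing where each object ends, json.JSONDecoder.raw_decode can parse one JSON value starting at a given position and stop at its closing brace. Below, the key names come from the question; the sample script text, the second ASIN, and the assumption that the index order is (size_name, color_name) are made up for illustration.

```python
import json
import re


def extract_json_value(text, key):
    """Return the JSON value that follows `"key" :` in text, or None."""
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), text)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


def variants_by_asin(text, dimensions=("size_name", "color_name")):
    """Translate each ASIN's index list into the named variation values."""
    values = extract_json_value(text, "variationValues")
    index_map = extract_json_value(text, "asinToDimensionIndexMap")
    return {
        asin: {dim: values[dim][i] for dim, i in zip(dimensions, indexes)}
        for asin, indexes in index_map.items()
    }


# Hypothetical script text, shaped like the page source described above.
sample = '''
var dataToReturn = {
  "variationValues" : {"size_name":["8M US","9M US"],"color_name":["Teal","Black"]},
  "asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],"B000000000":[1,1]}
};
'''
print(variants_by_asin(sample))
```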
I want to get all restaurants in London by using python 3.5 and the module googleplaces with the Google Places API. I read the googleplaces documentation and searched here, but I don't get it. Here is my code so far:
from googleplaces import GooglePlaces, types, lang
API_KEY = 'XXXCODEXXX'
google_places = GooglePlaces(API_KEY)
query_result = google_places.nearby_search(
    location='London', keyword='Restaurants',
    radius=1000, types=[types.TYPE_RESTAURANT])
if query_result.has_attributions:
    print query_result.html_attributions
for place in query_result.places:
    place.get_details()
    print place.rating
The code doesn't work. What can I do to get a list with all restaurants in this area?
It'll be better if you drop the keyword parameter; types already searches for restaurants.
Bear in mind the Places API (like other Google Maps APIs) is not a database; it will not return all results that match. It actually returns only 20 results, and you can get an extra 40 or so, but that's all.
If I'm reading the googleplaces module correctly, your code will send an API request like:
http://maps.googleapis.com/maps/api/place/nearbysearch/json?location=51.507351,-0.127758&radius=1000&types=restaurant&keyword=Restaurants&key=YOUR_API_KEY
If you just drop the keyword parameter, it'll be like:
http://maps.googleapis.com/maps/api/place/nearbysearch/json?location=51.507351,-0.127758&radius=1000&types=restaurant&key=YOUR_API_KEY
The difference is subtle: keyword=Restaurants will make the API match results that have the word "Restaurants" in their name, address, etc. Some of these may not be restaurants (and will be discarded), while some actual restaurants may not have the word "Restaurants" in them.
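To see the difference concretely, here is a small sketch that builds both request URLs with the standard library. It only reproduces the two URLs shown above; it does not claim to be how the googleplaces module builds its requests, and nearby_search_url is a made-up helper name.

```python
from urllib.parse import urlencode

NEARBY_URL = "http://maps.googleapis.com/maps/api/place/nearbysearch/json"


def nearby_search_url(lat, lng, radius, place_type, api_key, keyword=None):
    """Build a Nearby Search URL, adding keyword only when it is given."""
    params = [
        ("location", f"{lat},{lng}"),
        ("radius", radius),
        ("types", place_type),
    ]
    if keyword is not None:
        params.append(("keyword", keyword))
    params.append(("key", api_key))
    # safe="," keeps the comma between latitude and longitude unescaped.
    return NEARBY_URL + "?" + urlencode(params, safe=",")


print(nearby_search_url(51.507351, -0.127758, 1000, "restaurant", "YOUR_API_KEY",
                        keyword="Restaurants"))
print(nearby_search_url(51.507351, -0.127758, 1000, "restaurant", "YOUR_API_KEY"))
```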
Try replacing the city value with a latitude and longitude. It's not necessary to pass the keyword, because you already specify that with types. Try this code; it works for me:
query_result = google_places.nearby_search(
    lat_lng={'lat': 46.1667, 'lng': -1.15},
    radius=5000,
    # Chaining lists with `or` (e.g. [types.TYPE_RESTAURANT] or [types.TYPE_CAFE])
    # evaluates to the first list only, so only restaurants are requested.
    types=[types.TYPE_RESTAURANT])
The only thing missing is, as the error says, parentheses (). Your code should be
if query_result.has_attributions:
    print(query_result.html_attributions)