Hi all, I am working on a personal project in which I search for tweets containing specific keywords. I collected about 100 recent tweets for each keyword and saved them to the variables x1_tweets, x2_tweets and x3_tweets. The data is a list of dictionaries, and the fields look like this:
['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang']
I then wanted to save just the text of the tweets from each of the variables to a JSON file. For that I defined a function (it saves a list of dictionaries to a JSON file, where obj is the list of dictionaries and filename is the name I want to save it as):
def save_to_json(obj, filename):
    with open(filename, 'w') as fp:
        json.dump(obj, fp, indent=4, sort_keys=True)
In order to get only the tweet text I wrote the following code:
for i, tweet in enumerate(x1_tweets):
    save_to_json(tweet['text'], 'bat')
However, I have had no success so far. Can anyone please point me in the right direction? Thanks in advance!
Edit: I am using the Twitter API.
The first thing you need to do is change the code above to:
def save_to_json(obj, filename):
    with open(filename, 'a') as fp:
        json.dump(obj, fp, indent=4, sort_keys=True)
You need to change the mode in which the file is opened, for the following reason:
w: Opens in write-only mode. The pointer is placed at the beginning of the file and this will overwrite any existing file with the same name. It will create a new file if one with the same name doesn't exist.
a: Opens a file for appending new information to it. The pointer is placed at the end of the file. A new file is created if one with the same name doesn't exist.
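To see the difference concretely, here is a minimal sketch (the demo filenames are made up for illustration):

import json

# 'w' truncates on every open: only the last dump survives
for text in ['first tweet', 'second tweet']:
    with open('demo_w.json', 'w') as fp:
        json.dump(text, fp)

# 'a' appends: both dumps end up in the file, one after the other
for text in ['first tweet', 'second tweet']:
    with open('demo_a.json', 'a') as fp:
        json.dump(text, fp)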
Also, sort_keys has no effect here, since you are passing a string and not a dict. Similarly, indent=4 does nothing for a plain string.
If you want each tweet text keyed by an index, you can use the code below:
tweets = {}
for i, tweet in enumerate(x1_tweets):
    tweets[i] = tweet['text']
save_to_json(tweets, 'bat.json')
The above code builds a dict mapping each index to its tweet text and writes it to the file once all tweets are processed.
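For illustration, bat.json would then look roughly like this (note that json.dump converts the integer keys to strings):

{
    "0": "text of the first tweet",
    "1": "text of the second tweet"
}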
Related
So I am querying the Twitter API with a list of tweet IDs. I need to loop through the IDs to get the corresponding data from Twitter, then store the JSON data in a txt file where each tweet's JSON is on its own line. Later I will have to read the txt file line by line to create a pandas df from it.
Here is some fake data to show you the structure:
twt.tweet_id.head()
0 000000000000000001
1 000000000000000002
2 000000000000000003
3 000000000000000004
4 000000000000000005
Name: tweet_id, dtype: int64
I don't know how to share the JSON files, and I don't even know if I can. What I get back from calling tweet._json is a JSON object.
drop_lst = []  # collect the IDs which don't work
for i in twt.tweet_id:  # twt.tweet_id is the pd.Series with the IDs
    try:
        tweet = api.get_status(i)
        with open('tweet_json.txt', 'a') as f:
            f.write(str(tweet._json) + '\n')  # tweet._json is the JSON I need
    except tp.TweepError:
        drop_lst.append(i)
The above works, but I think I have lost the JSON structure, which I need later to create the dataframe.
drop_lst = []
for i in twt.tweet_id:
    try:
        tweet = api.get_status(i)
        with open('data.txt', 'a') as outfile:
            json.dump(tweet._json, outfile)
    except tp.TweepError:
        drop_lst.append(i)
The above doesn't put each object on its own line.
I hope I was able to provide you with enough information to help me.
Thank you in advance for all your help.
Appending JSON to a file with json.dump doesn't add newlines, so the objects all wind up on the same line. I'd recommend collecting all of your JSON records into a list, then joining them with newlines and writing that to the file:
tweets, drop_lst = [], []
for i in twt.tweet_id:
    try:
        tweet = api.get_status(i)
        tweets.append(tweet._json)
    except tp.TweepError:
        drop_lst.append(i)

with open('data.txt', 'a') as fh:
    fh.write('\n')  # ensure the json starts on its own line
    fh.write('\n'.join(json.dumps(tweet) for tweet in tweets))  # concatenate the tweets into a newline-delimited string
Then, to create your dataframe, you can read that file and stitch everything back together
with open("data.txt") as fh:
tweets = [json.loads(line) for line in fh if line]
df = pd.DataFrame(tweets)
This assumes that the JSON itself doesn't contain raw newlines, which tweet text might introduce, so be wary.
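In practice json.dumps escapes newline characters inside strings, so a record produced this way never spans two lines; and pandas can read newline-delimited JSON directly. A small sketch (assuming a clean one-object-per-line data.txt with no stray blank lines):

import json
import pandas as pd

# json.dumps escapes embedded newlines, so one record stays on one line
assert json.dumps({'text': 'line1\nline2'}) == '{"text": "line1\\nline2"}'

# pandas understands newline-delimited JSON out of the box
df = pd.read_json('data.txt', lines=True)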
I have created a JSON file after scraping data online with the following simplified code:
for item in range(items_to_scrape):
    az_text = []
    for n in range(first_web_page, last_web_page):
        page_link = base_url + str(n)  # build the URL before requesting it
        reviews_html = requests.get(page_link)
        tree = fromstring(reviews_html.text)
        review_text_tags = tree.xpath(xpath_1)
        for r_text in review_text_tags:
            review_text = r_text.text
            az_text.append(review_text)
    az_reviews = {}
    az_reviews[item] = az_text
    with open('data.json', 'w') as outfile:
        json.dump(az_reviews, outfile)
There might be a better way to create a JSON file whose first key is the item number and whose second key is the list of reviews for that item; however, I am currently stuck at opening the JSON file to see which items have already been scraped.
The structure of the JSON file looks like this:
{
    "asin": "0439785960",
    "reviews": [
        "Don’t miss this one!",
        "Came in great condition, one of my favorites in the HP series!",
        "Don’t know how these books are so good and I’ve never read them until now. Whether you’ve watched the movies or not, read these books"
    ]
}
The unsuccessful attempt that seems closest to a solution is the following:
import json
from pprint import pprint
json_data = open('data.json', 'r').read()
json1_file = json.loads(json_data)
print(type(json1_file))
print(json1_file["asin"])
It returns a string that exactly replicates the output of the print() call I used during scraping to check what the JSON file would look like, but I can't access the asins or reviews using json1_file["asin"] or json1_file["reviews"], since what was read is a string and not a dictionary:
TypeError: string indices must be integers
Using the json.load() function I still print the right content, but I cannot figure out how to access the dictionary-like object from the JSON file to iterate through its keys and values.
The following code prints the content of the file, but raises an error (AttributeError: '_io.TextIOWrapper' object has no attribute 'items') when I try to iterate through keys and values:
with open('data.json', 'r') as content:
    print(json.load(content))
    for key, value in content.items():
        print(key, value)
What is wrong with the code above and what should be adjusted to load the file into a dictionary?
"string indices must be integers"
You're writing out the data as a string, not a dictionary. Remove the dumps and only dump:
with open('data.json', 'w') as outfile:
    json.dump(az_reviews, outfile, indent=2, ensure_ascii=False)
"what should be adjusted to load the file into a dictionary?"
Once you're writing an actual JSON object and not a string, nothing, except perhaps not calling read() followed by json.loads(); use a single json.load() on the file object instead.
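A minimal sketch of the reading side (assuming the data.json written above):

import json

with open('data.json', 'r') as f:
    data = json.load(f)  # parses the file directly into a dict

for key, value in data.items():
    print(key, value)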
Another problem seems to be that you're overwriting the file on every loop iteration
Instead, you probably want to build up the data inside the loop and write it to one file afterwards:
data = {}
for item in range(items_to_scrape):
    pass  # add to data

# put all data in one file
with open('data.json', 'w') as f:
    json.dump(data, f)
In this scenario, I suggest that you store the asin as a key, with the reviews as values
asin = "123456" # some scraped value
data[asin] = reviews
Or write a unique file for each scrape, which you then must loop over to read them all.
for item in range(items_to_scrape):
    data = {}
    # add to data
    with open('data{}.json'.format(item), 'w') as f:
        json.dump(data, f)
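Reading those per-item files back could then look like this (a minimal sketch, assuming the data{}.json naming above):

import json

all_data = {}
for item in range(items_to_scrape):
    with open('data{}.json'.format(item)) as f:
        all_data.update(json.load(f))  # merge each file's dict into one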
The scenario is that I need to convert a dictionary object to JSON and write it to a file. A new dictionary object is sent on every write_to_file() method call, and I have to append the JSON to the file. Following is the code:
def write_to_file(self, dict=None):
    f = open("/Users/xyz/Desktop/file.json", "w+")
    if json.load(f) != None:
        data = json.load(f)
        data.update(dict)
        f = open("/Users/xyz/Desktop/file.json", "w+")
        f.write(json.dumps(data))
    else:
        f = open("/Users/xyz/Desktop/file.json", "w+")
        f.write(json.dumps(dict))
I am getting the error "No JSON object could be decoded" and the JSON is not written to the file. Can anyone help?
This looks overcomplex and highly buggy. Opening the file several times, in w+ mode, and reading it twice won't get you anywhere; it just creates an empty file that json won't be able to read.
I would test whether the file exists; if so, read it (else start from an empty dict).
The default None argument makes no sense: you have to pass a dictionary or the update method won't work. That said, we can skip the update if the object is falsy.
Don't use dict as a variable name; it shadows the built-in type.
In the end, overwrite the file with a new version of your data (w+ and r+ should be reserved for fixed-size/binary files, not text/json/xml files).
Like this:
import json
import os

def write_to_file(self, new_data=None):
    # define filename once to avoid copy/paste
    filename = "/Users/xyz/Desktop/file.json"
    data = {}  # in case the file doesn't exist yet
    if os.path.exists(filename):
        with open(filename) as f:
            data = json.load(f)
    # update data with new_data if non-None/empty
    if new_data:
        data.update(new_data)
    # write the updated dictionary, creating the file if it didn't exist
    with open(filename, "w") as f:
        json.dump(data, f)
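Each call then merges the new keys into the file. For example (obj being a hypothetical instance of whatever class this method lives on):

obj.write_to_file({"first": 1})   # file.json now holds {"first": 1}
obj.write_to_file({"second": 2})  # file.json now holds {"first": 1, "second": 2}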
I used this tweepy-based code to pull the tweets of a given user by user_id, then saved the list of all tweets of that user (alltweets) to a JSON file as follows. Note that without repr I wasn't able to dump the alltweets list into the JSON file. The code worked as expected:
with open(os.path.join(output_file_path, '%s_tweets.json' % user_id), 'a') as f:
    json.dump(repr(alltweets), f)
However, I have a side problem with retrieving the tweets after saving them to the JSON file. I need to access the text of each tweet, but I'm not sure how to deal with the "Status" wrapper that tweepy uses (see the attached sample of the JSON file content).
I tried to iterate over the lines in the file as follows, but the file is being seen as a single line.
with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)
I also tried to iterate over statuses after reading the JSON file as a string, as follows, but the iteration takes place over the individual characters of the file instead.
with open(fname, 'r') as f:
    x = f.read()
for status in x:
    """code"""
Maybe not the prettiest solution, but you could just declare Status as a dict and then eval the list (the whole content of the file):
Status = dict  # makes the Status(...) calls in the repr build plain dicts
f = open(fname, 'r')
data = eval(f.read())
f.close()
for status in data:
    """ do your stuff"""
I am giving backend work a try and I am failing at parsing Twitter's stream API. I want to create a JSON file with timestamp, name, tweet and screen name. That part seems to be working, but when I try to write an entry to the file, it overwrites the existing one instead of continuing. I read here that outfile.write('\n') worked for some; I tried it and a new entry still overwrites the previous one.
with open('text2', 'w') as outfile:
    json.dump({'time': time.time(), 'screenName': screenName, 'text': text, 'name': name}, outfile, indent=4, sort_keys=True)
    outfile.write('\n')
When you open the file, use 'a' (append) instead of 'w' (write).
https://docs.python.org/2/library/functions.html#open
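Applied to the code above, it is a one-character change:

with open('text2', 'a') as outfile:  # 'a' appends instead of truncating
    json.dump({'time': time.time(), 'screenName': screenName, 'text': text, 'name': name}, outfile, indent=4, sort_keys=True)
    outfile.write('\n')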