I used this tweepy-based code to pull the tweets of a given user by user_id. I then saved a list of all tweets of a given user (alltweets) to a json file as follows. Note that without "repr", i wasn't able to dump the alltweets list into json file. The code worked as expected
with open(os.path.join(output_file_path,'%s_tweets.json' % user_id), 'a') as f:
json.dump(repr(alltweets), f)
However, I have a side problem with retrieving the tweets after saving them to the json file. I need to access the text in each tweet, but I'm not sure how to deal with the "Status" wrapper that tweepy uses (See a sample of the json file attached).sample json file content
I tried to iterate over the lines in the file as follows, but the file is being seen as a single line.
with open(fname, 'r') as f:
for line in f:
tweet = json.loads(line)
I also tried to iterate over statuses after reading the json file as a string, as follows, but iteration rather takes place on the individual characters in the json file.
with open(fname, 'r') as f:
x = f.read()
for status in x:
"""code"""
Maybe not the prettiest solution but you could just declare Status as a dict and then eval the list (the whole content of the files).
Status = dict
f = open(fname, 'r')
data = eval(f.read())
f.close()
for status in data:
""" do your stuff"""
Related
So I am querying Twitter API with a list of tweet IDs. What I need to do is looping through the IDs in order to get the corresponding data from Twitter. Then I need to store those JSON files into a txt file where each tweet's JSON data is on its own line. Later I will have to read the txt file line by line to create a pandas df from it.
I try to give you some fake data to show you the structure.
twt.tweet_id.head()
0 000000000000000001
1 000000000000000002
2 000000000000000003
3 000000000000000004
4 000000000000000005
Name: tweet_id, dtype: int64
I don't know how to share the JSON files and I don't even know if I can. After calling tweet._json what I get is a JSON file.
drop_lst = [] # this is needed to collect the IDs which don't work
for i in twt.tweet_id: # twt.tweet_id is the pd.series with the IDs
try:
tweet = api.get_status(i)
with open('tweet_json.txt', 'a') as f:
f.write(str(tweet._json)+'\n') # tweet._json is the JSON file I need
except tp.TweepError:
drop_lst.append(i)
the above works but I think I have lost the JSON structure which I need later to create the dataframe
drop_lst = []
for i in twt.tweet_id:
try:
tweet = api.get_status(i)
with open('data.txt', 'a') as outfile:
json.dump(tweet._json, outfile)
except tp.TweepError:
drop_lst.append(i)
the above doesn't put each file on its own line.
I hope I was able to provide you with enough information to help me.
Thank you in advance for all your help.
Appending json to a file using json.dump doesn't include newlines, so they all wind up on the same line together. I'd recommend collecting all of your json records into a list, then use join and dump that to a file
tweets, drop_lst = [], []
for i in twt.tweet_id:
try:
tweet = api.get_status(i)
tweets.append(tweet._json)
except tp.TweepError:
drop_lst.append(i)
with open('data.txt', 'a') as fh:
fh.write('\n') # to ensure that the json is on its own line to start
fh.write('\n'.join(json.dumps(tweet) for tweet in tweets)) # this will concatenate the tweets into a newline delimited string
Then, to create your dataframe, you can read that file and stitch everything back together
with open("data.txt") as fh:
tweets = [json.loads(line) for line in fh if line]
df = pd.DataFrame(tweets)
This assumes that the json itself doesn't have newlines, which tweets might contain, so be wary
I have created a JSON file after scraping data online with the following simplified code:
for item in range(items_to_scrape)
az_text = []
for n in range(first_web_page, last_web_page):
reviews_html = requests.get(page_link)
tree = fromstring(reviews_html.text)
page_link = base_url + str(n)
review_text_tags = tree.xpath(xpath_1)
for r_text in review_text_tags:
review_text = r_text.text
az_text.append(review_text)
az_reviews = {}
az_reviews[item] = az_text
with open('data.json', 'w') as outfile:
json.dump(az_reviews , outfile)
There might be a better way to create a JSON file with the first key equal to the item number and the second key equal to the list of reviews for that item, however I am currently stuck at opening the JSON file to see the items have been already scraped.
The structure of the JSON file looks like this:
{
"asin": "0439785960",
"reviews": [
"Don’t miss this one!",
"Came in great condition, one of my favorites in the HP series!",
"Don’t know how these books are so good and I’ve never read them until now. Whether you’ve watched the movies or not, read these books"
]
}
The unsuccessful attempt that seems to be closer to the solution is the following:
import json
from pprint import pprint
json_data = open('data.json', 'r').read()
json1_file = json.loads(json_data)
print(type(json1_file))
print(json1_file["asin"])
It returns a string that replicates exactly the result of the print() function I used during the scraping process to check what the JSON file was going to be look like, but I can't access the asins or reviews using json1_file["asin"] or json1_file["reviews"] since the file read is a string and not a dictionary.
TypeError: string indices must be integers
Using the json.load() function I still print the right content, but I have cannot figure out how to access the dictionary-like object from the JSON file to iterate through keys and values.
The following code prints the content of the file, but raises an error (AttributeError: '_io.TextIOWrapper' object has no attribute 'items') when I try to iterate through keys and values:
with open('data.json', 'r') as content:
print(json.load(content))
for key, value in content.items():
print(key, value)
What is wrong with the code above and what should be adjusted to load the file into a dictionary?
string indices must be integers
You're writing out the data as a string, not a dictionary. Remove the dumps, and only dump
with open('data.json', 'w') as outfile:
json.dump(az_reviews, outfile, indent=2, ensure_ascii=False)
what should be adjusted to load the file into a dictionary?
Once you're parsing a JSON object, and not a string, then nothing except maybe not using reads, then loads and rather only json.load
Another problem seems to be that you're overwriting the file on every loop iteration
Instead, you probably want to open one file then loop and write to it afterwards
data = {}
for item in range(items_to_scrape):
pass # add to data
# put all data in one file
with open('data.json', 'w') as f:
json.dump(data, f)
In this scenario, I suggest that you store the asin as a key, with the reviews as values
asin = "123456" # some scraped value
data[asin] = reviews
Or write a unique file for each scrape, which you then must loop over to read them all.
for item in range(items_to_scrape):
data = {}
# add to data
with open('data{}.json'.format(item), 'w') as f:
json.dump(data, f)
Scenario is i need to convert dictionary object as json and write to a file . New Dictionary objects would be sent on every write_to_file() method call and i have to append Json to the file .Following is the code
def write_to_file(self, dict=None):
f = open("/Users/xyz/Desktop/file.json", "w+")
if json.load(f)!= None:
data = json.load(f)
data.update(dict)
f = open("/Users/xyz/Desktop/file.json", "w+")
f.write(json.dumps(data))
else:
f = open("/Users/xyz/Desktop/file.json", "w+")
f.write(json.dumps(dict)
Getting this error "No JSON object could be decoded" and Json is not written to the file. Can anyone help ?
this looks overcomplex and highly buggy. Opening the file several times, in w+ mode, and reading it twice won't get you nowhere but will create an empty file that json won't be able to read.
I would test if the file exists, if so I'm reading it (else create an empty dict).
this default None argument makes no sense. You have to pass a dictionary or the update method won't work. Well, we can skip the update if the object is "falsy".
don't use dict as a variable name
in the end, overwrite the file with a new version of your data (w+ and r+ should be reserved to fixed size/binary files, not text/json/xml files)
Like this:
def write_to_file(self, new_data=None):
# define filename to avoid copy/paste
filename = "/Users/xyz/Desktop/file.json"
data = {} # in case the file doesn't exist yet
if os.path.exists(filename):
with open(filename) as f:
data = json.load(f)
# update data with new_data if non-None/empty
if new_data:
data.update(new_data)
# write the updated dictionary, create file if
# didn't exist
with open(filename,"w") as f:
json.dump(data,f)
Here is my json file format,
[{
"name": "",
"official_name_en": "Channel Islands",
"official_name_fr": "Îles Anglo-Normandes",
}, and so on......
while loading the above json which is in a file I get this error,
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:
here is my python code,
import json
data = []
with open('file') as f:
for line in f:
data.append(json.loads(line))
,} is not allowed in JSON (I guess that's the problem according to the data given).
You appear to be processing the entire file one line at a time. Why not simply use .read() to get the entire contents at once, then feed that to json?
with open('file') as f:
contents = f.read()
data = json.loads(contents)
Better yet, why not use json.load() to pass the readable directly and let it handle the slurping?
with open('file') as f:
data = json.load(f)
The problem is in your reading and decoding the file line by line. Any single line in your file (e.g., "[{") is not a valid JSON expression.
Your individual lines are not valid JSON. For instance, the first line '[{' by itself is not a valid JSON. If your entire file is actually valid JSON and you want individual lines, first load the entire JSON and then browse through the python dictionary.
import json
data = json.loads(open('file').read()) # this should be a list
for list_item in data:
print(list_item['name'])
I have a json file that is a synonime dicitonnary in French (I say French because I had an error message with ascii encoding... due to the accents 'é',etc). I want to read this file with python to get a synonime when I input a word.
Well, I can't even read my file...
That's my code:
data=[]
with open('sortieDES.json', encoding='utf-8') as data_file:
data = json.loads(data_file.read())
print(data)
So I have a list quite ugly, but my question is: how can I use the file like a dictionary ? I want to input data['Académie']and have the list of the synonime... Here an example of the json file:
{"Académie française":{
"synonymes":["Institut","Quai Conti","les Quarante"]
}
You only need to call json.load on the File object (you gave it the name data_file):
data=[]
with open('sortieDES.json', encoding='utf-8') as data_file:
data = json.load(data_file)
print(data)
Instead of
json.load(line)
you have to use
json.loads(line)
Your s is missing in loads(...)