Convert a raw tweet string to a JSON object in Python

I'm using twitter's API to download raw tweets so I can play with them. The iterator loop they gave in the example looks something like this (I added an if condition to run the loop n times, not shown here):
iterator = twitter_stream.statuses.sample()
for tweet in iterator:
    print(json.dumps(tweet))
    break
These commands output the entire JSON object in the correct format.
To extract the "text" item from the raw tweet JSON object, I tried calling .get("text") on the result of json.dumps(tweet):
txts = []
for tweet in iterator:
    txts.append((json.dumps(tweet)).get("text"))
    break
print(txts)
But I get an error saying "AttributeError: 'str' object has no attribute 'get'"
So I searched around and found a solution that writes all the output of json.dumps(tweet) to a file, loads the file into a variable with json.loads(jsonfile), and then calls .get("text") on each item to extract the text:
fl = open("ipjson.json", "a")
for tweet in iterator:
    fl.write(json.dumps(tweet))
    break
fl.flush()
decode = json.loads(fl)
for item in decode:
    txt = item.get("text")
    txts.append(txt)
print(txts)
But this gives me another error saying "TypeError: the JSON object must be str, not 'TextIOWrapper'"
What am I doing wrong? Is there a better/easier way to extract text from a raw tweet JSON object?

For the first example you don't need JSON at all; you can just do:
txts = []
for status in statuses:
    txts.append(status.text)
For the second example you're handling the JSON incorrectly: json.loads expects a string, not a file object. You should instead do:
txts = []
for status in statuses:
    txts.append(json.dumps(status))
with open('ipjson.json', 'w') as fou:
    json.dump(txts, fou)
And to read it back in:
with open('ipjson.json', 'r') as fin:
    txts = json.load(fin)
for txt in txts:
    print(json.loads(txt)['text'])
Note that dump and load write to and read from a file object, while dumps and loads convert individual objects to and from strings.
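To make the distinction concrete, here is a minimal sketch that uses a hand-written dict in place of a real tweet (the contents of `tweet` are invented for illustration). It shows why `.get` fails on the output of `json.dumps` but works on a dict, and how `dump`/`load` differ from `dumps`/`loads`:

```python
import io
import json

# Hypothetical stand-in for one decoded tweet; real tweets have many more keys.
tweet = {"id": 1, "text": "hello world", "user": {"screen_name": "example"}}

# dumps: Python object -> str. A str has no .get method.
encoded = json.dumps(tweet)
assert isinstance(encoded, str)

# loads: str -> Python object. Now .get works again.
decoded = json.loads(encoded)
text_from_string = decoded.get("text")

# If the tweet is already dict-like, no JSON round trip is needed at all.
text_direct = tweet.get("text")

# dump/load do the same conversions but write to / read from a file-like object.
buffer = io.StringIO()  # stands in for an opened file
json.dump([tweet], buffer)
buffer.seek(0)
texts_from_file = [item.get("text") for item in json.load(buffer)]
```

Whether the objects yielded by the stream support `.get("text")` directly depends on the client library in use, so treat the direct lookup as an assumption to verify.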

JSON files require either recursive scanning (https://stackoverflow.com/a/42855667/3342050) or known locations within the structure.
After you get your dicts, lists, and entries, you parse through them for specific values: https://stackoverflow.com/a/42860573/3342050
This depends entirely on what data is returned, because the keys are specific to that structure.
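As a rough illustration of the recursive scanning idea, the sketch below walks a decoded JSON structure and collects every value stored under a given key. The function name `find_values` and the sample data are hypothetical, not taken from the linked answers:

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in a decoded JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the value itself may contain the key again.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)

# Hypothetical nested data for illustration.
data = {
    "text": "outer",
    "retweeted_status": {"text": "inner", "entities": [{"text": "tag"}]},
}
found = list(find_values(data, "text"))
```

When the location of the value is known in advance, a direct lookup such as `data["retweeted_status"]["text"]` is simpler than scanning.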

Related

Optimal way to parse Twitter JSON objects from one file/multiple files into python

I have a Twitter dataset (multiple JSON files), but let's start with one file. I have to parse the JSON objects into Python, but json.loads() only parses one object. A similar question is asked here, but the solutions either do not work or are not good enough.
1- I cannot convert the JSON objects into a list, as that is not efficient and I have too much data. Also, the proposed solutions rely on "\n", while my Twitter data objects end like }{ with no newline between them, and I cannot add one manually. (The Twitter objects are also not stored one per line.)
2- The second solution is JSONStream, but the official documentation says little about it.
3- Is there any other efficient way? One option I am considering is MongoDB, but I have never worked with MongoDB, so I don't know whether this is possible with it.
The picture below shows the length of a tweet object and the }{ boundary.
with open('sampledata.json', 'r', encoding='utf8') as json_file:
    # for i in json_file:
    while True:
        dataobj = json.load(json_file)
        print(dataobj)
        print("Printing each JSON Decoded Object")
Error (there are 287 lines for one object):
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 287 column 2 (char 10528)

The while loop used while reading the JSON file is not needed.
You can use this to read a JSON file:
def read_json(path):
    with open(path, 'r') as file:
        return json.load(file)

my_data = read_json('sampledata.json')
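If the file holds several objects written back to back (`}{` with no separator), as described in the question, a single `json.load` call still raises the "Extra data" error. One possible approach, sketched here under that assumption, is `json.JSONDecoder.raw_decode`, which decodes one value and reports where it stopped. The helper name `iter_json_objects` and the sample string are made up for illustration:

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of back-to-back documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

sample = '{"id": 1, "text": "a"}{"id": 2, "text": "b"}\n{"id": 3, "text": "c"}'
objects = list(iter_json_objects(sample))
```

This sketch still needs the file's text in memory (for example from `f.read()`), although the decoded objects are produced one at a time; very large files would need a chunked variant.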

How do I parse faulty json file from python using module json?

I have a large JSON file to parse with Python, but it is incomplete (i.e., the closing brackets at the end are missing). The file consists of one big JSON object that contains smaller JSON objects. Every inner JSON object is complete; only the final closing brackets are missing.
For example, its structure is like this:
{bigger_json_head:value, another_key:[{small_complete_json1},{small_complete_json2}, ...,{small_complete_json_n},
So the final "]}" is missing. However, each small JSON object occupies a single row, so when I print each line of the file, I get each JSON object as a single string.
So I tried:
line_arr = []
with open("file.json", "r", encoding="UTF-8") as f:
    for line in f.readlines():
        line_arr.append(line)
I expected to get a list whose elements are the lines, each holding one JSON object.
Then I tried the following:
for json_line in line_arr:
    try:
        json_str = json.loads(json_line)
        print(json_str)
    except json.decoder.JSONDecodeError:
        continue
I expected this code block to print every JSON object to the console, except for the first and last lines. However, it printed nothing and only hit the decode error.
Has anyone solved a similar problem? Please help. Thank you.

If the faulty JSON file is only missing the final "]}", you can fix it before parsing it.
Here is example code to illustrate:
with open("file.json", "r", encoding="UTF-8") as f:
    faulty_json_str = f.read()
fixed_json_str = faulty_json_str + ']}'
json_obj = json.loads(fixed_json_str)
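The structure shown in the question ends with a comma after the last inner object, and JSON does not allow a trailing comma before `]`. A small sketch that drops such a comma before appending the closing brackets (the helper name `close_truncated` and the sample text are hypothetical):

```python
import json

def close_truncated(text, closing="]}"):
    """Append the missing closing brackets, dropping a dangling comma first."""
    trimmed = text.rstrip()
    if trimmed.endswith(","):
        trimmed = trimmed[:-1]
    return json.loads(trimmed + closing)

# Hypothetical truncated file content: one inner object per line, then nothing.
faulty = '{"head": "value", "items": [\n{"a": 1},\n{"a": 2},\n'
fixed = close_truncated(faulty)
```

This assumes the file was cut off exactly between two inner objects; if it was cut in the middle of an object, the last partial line would have to be discarded first.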

Ijson parse from list

I have a list in which each item contains JSON data, so I am trying to parse the data using Ijson since the data load will be huge.
This is what I am trying to achieve:
article_data = ...  # variable which contains the list
parser = ijson.parse(article_data)
for id in ijson.items(parser, 'item'):
    if id['article_type'] != "Monthly Briefing" and id['article_type'] != "Conference":
        data_article_id.append(id['article_id'])
        data_article_short_desc.append(id['short_desc'])
        data_article_long_desc.append(id['long_desc'])
This is the error I get:
AttributeError: 'generator' object has no attribute 'read'
I thought of converting the list into a string and then parsing it with ijson, but that fails with the same error.
Any suggestions, please?
data_article_id = []
data_article_short_desc = []
data_article_long_desc = []
for index in article_data:
    parser = ijson.parse(index)
    for id in ijson.items(parser, 'item'):
        if id['article_type'] != "Monthly Briefing" and id['article_type'] != "Conference":
            data_article_id.append(id['article_id'])
            data_article_short_desc.append(id['short_desc'])
            data_article_long_desc.append(id['long_desc'])
Since the data is in a list, I also tried this, but it gives me the same error:
'generator' object has no attribute 'read'

I am assuming that you have a list of byte-string JSON objects that you want to parse.
ijson.items(JSON, prefix) takes a readable bytes object as input; that is, an opened file or file-like object. Specifically, the input should be a bytes file-like object.
If you are using Python 3, you can use io.BytesIO from the io module to create an in-memory binary stream.
Example
Suppose the input is [b'{"id": "ab"}', b'{"id": "cd"}']:
import io
import ijson

list_json = [b'{"id": "ab"}', b'{"id": "cd"}']
for raw in list_json:
    # an empty prefix selects the top-level value
    items = ijson.items(io.BytesIO(raw), "")
    for i in items:
        print(i['id'])
Output:
ab
cd

Writing Json in for loop in Python

I am downloading JSON files from an API, and I use the following code to write the JSON. Each iteration of the loop gives me one JSON file. I need to save it and later extract entities from the appended JSON file using a loop.
for item in style_ls:
    dat = get_json(api, item)
    specs_dict[item] = dat
    with open("specs_append.txt", "a") as myfile:
        json.dump(dat, myfile)
    print(item)
with open("specs_data.txt", "w") as myfile:
    json.dump(specs_dict, myfile)
I know that I cannot get valid JSON from specs_append.txt, but I can get it from specs_data.txt. I am writing the first file only because my program needs at least 3-4 days to complete and there is a high chance that my system will shut down. So is there any way I can do this efficiently?
If not, is there any way I can extract the objects from the <{JSON}{JSON}> format of specs_append.txt (which is not valid JSON)?
If not, should I write specs_dict to a text file on every iteration, so that even if the program gets terminated I can restart from that point in the loop and still get valid JSON?

I suggest several possible solutions.
One solution is to write custom code to slurp in the input file. I would suggest putting a special line before each JSON object in the file, such as: ###
Then you could write code like this:
import json

# assumed marker line; the trailing newline matters when comparing against lines read from the file
SPECIAL_LINE = "###\n"

def json_get_objects(f):
    temp = ''
    line = next(f)  # pull first line
    assert line == SPECIAL_LINE
    for line in f:
        if line != SPECIAL_LINE:
            temp += line
        else:
            # found special marker, temp now contains a complete JSON object
            j = json.loads(temp)
            yield j
            temp = ''
    # after loop done, yield up last JSON object
    if temp:
        j = json.loads(temp)
        yield j

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
Two notes on this. First, I am simply appending to a string over and over; this used to be a very slow way to do this in Python, so if you are using a very old version of Python, don't do it this way unless your JSON objects are very small. Second, I wrote code to split the input and yield up JSON objects one at a time, but you could also use a guaranteed-unique string, slurp in all the data with a single call to f.read() and then split on your guaranteed-unique string using the str.split() method function.
Another solution would be to write the whole file as a valid JSON list of valid JSON objects. Write the file like this:
{"mylist":[
# first JSON object, followed by a comma
# second JSON object, followed by a comma
# third JSON object
]}
This would require your file appending code to open the file with writing permission, and seek to the last ] in the file before writing a comma plus newline, then the new JSON object on the end, and then finally writing ]} to close out the file. If you do it this way, you can use json.loads() to slurp the whole thing in and have a list of JSON objects.
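A rough sketch of that seek-and-overwrite idea follows. It assumes the file always ends with exactly a newline followed by `]}`, which is the tail this helper itself writes; the name `append_object` and the `TAIL` constant are illustrative only:

```python
import json
import os

TAIL = b"\n]}"  # closing bytes that every write leaves at the end of the file

def append_object(path, obj):
    """Append obj to the "mylist" array in path, keeping the file valid JSON."""
    payload = json.dumps(obj).encode("utf-8")
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        # First write: create the wrapper object and the list.
        with open(path, "wb") as f:
            f.write(b'{"mylist":[\n' + payload + TAIL)
        return
    with open(path, "r+b") as f:
        # Check that the file really ends with the expected tail.
        f.seek(-len(TAIL), os.SEEK_END)
        if f.read() != TAIL:
            raise ValueError("file does not end with the expected tail")
        # Step back over the old tail and overwrite it with the new entry.
        f.seek(-len(TAIL), os.SEEK_END)
        f.write(b",\n" + payload + TAIL)
```

After every call the file is a complete JSON document, so it can be read back with `json.load` even if the program stops between iterations. A crash in the middle of a write could still leave a damaged tail.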
Finally, I suggest that maybe you should just use a database. Use SQLite or something and just throw the JSON strings in to a table. If you choose this, I suggest using an ORM to make your life simple, rather than writing SQL commands by hand.
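For comparison, here is a minimal sketch of the database route using the standard-library `sqlite3` module directly rather than an ORM. The table name `specs`, its columns, and the helper names are all hypothetical:

```python
import json
import sqlite3

# ":memory:" keeps the example self-contained; use a file path to persist between runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS specs (item TEXT PRIMARY KEY, body TEXT)")

def save_spec(conn, item, data):
    """Store one downloaded object as a JSON string, committing immediately."""
    conn.execute(
        "INSERT OR REPLACE INTO specs (item, body) VALUES (?, ?)",
        (item, json.dumps(data)),
    )
    conn.commit()

def load_specs(conn):
    """Return {item: decoded object} for everything stored so far."""
    rows = conn.execute("SELECT item, body FROM specs ORDER BY item")
    return {item: json.loads(body) for item, body in rows}

save_spec(conn, "style_a", {"price": 10})
save_spec(conn, "style_b", {"price": 12})
specs = load_specs(conn)
```

Because each row is committed as it arrives, a restarted program can query which items are already stored and skip them.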
Personally, I favor the first suggestion: write in a special line like ###, then have custom code to split the input on those marks and then get the JSON objects.
EDIT: Okay, the first suggestion was sort of assuming that the JSON was formatted for human readability, with a bunch of short lines:
{
    "foo": 0,
    "bar": 1,
    "baz": 2
}
But it's all run together as one big long line:
{"foo":0,"bar":1,"baz":2}
Here are three ways to fix this.
0) write a newline before the ### and after it, like so:
###
{"foo":0,"bar":1,"baz":2}
###
{"foo":0,"bar":1,"baz":2}
Then each input line will alternately be ### or a complete JSON object.
1) As long as SPECIAL_LINE is completely unique (never appears inside a string in the JSON) you can do this:
with open("specs_data.txt", "r") as f:
    temp = f.read()  # read entire file contents
lst = temp.split(SPECIAL_LINE)
# skip empty pieces, such as the one before a leading marker
json_objects = [json.loads(x) for x in lst if x.strip()]
for j in json_objects:
    pass  # do something with JSON object j
The .split() method function can split up the temp string into JSON objects for you.
2) If you are certain that each JSON object will never have a newline character inside it, you could simply write JSON objects to the file, one after another, putting a newline after each; then assume that each line is a JSON object:
import json

def json_get_objects(f):
    for line in f:
        if line.strip():
            yield json.loads(line)

with open("specs_data.txt", "r") as f:
    for j in json_get_objects(f):
        pass  # do something with JSON object j
I like the simplicity of option (2), but I like the reliability of option (0). If a newline ever got written in as part of a JSON object, option (0) would still work, but option (2) would error.
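A minimal sketch of the writing side of option (2), paired with the same kind of line reader. With its default settings, `json.dumps` emits no literal newline characters (newlines inside strings are escaped as `\n`), so each object stays on one line. The helper names are illustrative, and `io.StringIO` stands in for a file opened in append mode:

```python
import io
import json

def append_json_line(f, obj):
    """Write obj as a single line followed by a newline."""
    f.write(json.dumps(obj) + "\n")

def read_json_lines(f):
    """Yield one decoded object per non-empty line."""
    for line in f:
        if line.strip():
            yield json.loads(line)

# StringIO stands in for a file opened with open("specs_append.txt", "a").
buf = io.StringIO()
append_json_line(buf, {"foo": 0, "note": "line one\nline two"})
append_json_line(buf, {"foo": 1})
buf.seek(0)
records = list(read_json_lines(buf))
```

Passing `indent=` to `json.dumps` would reintroduce literal newlines and break this format.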
Again, you can also simply use an actual database (SQLite) with an ORM and let the database worry about the details.
Good luck.

Append the JSON data to a dict on every loop iteration.
At the end, dump this dict as JSON and write it to a file.
To give you an idea of how to append data to a dict:
>>> d1 = {'suku':12}
>>> t1 = {'suku1':212}
>>> d1.update(t1)
>>> d1
{'suku1': 212, 'suku': 12}

Adding brackets and commas to multiple JSON objects

I've created a very simple piece of code to read in tweets in JSON format in text files, determine if they contain an id and coordinates and if so, write these attributes to a csv file. This is the code:
f = csv.writer(open('GeotaggedTweets/ListOfTweets.csv', 'wb+'))
all_files = glob.glob('SampleTweets/*.txt')
for filename in all_files:
    with open(filename, 'r') as file:
        data = simplejson.load(file)
    if 'text' in data and 'coordinates' in data:
        f.writerow([data['id'], data['geo']['coordinates']])
I've been having some difficulties but with the help of the excellent JSON Lint website have realised my mistake. I have multiple JSON objects and from what I read these need to be separated by commas and have square brackets added to the start and end of the file.
How can I achieve this? I've seen some examples online where each individual line is read and it's added to the first and last line, but as I load the whole file I'm not entirely sure how to do this.

You have a file that either contains too many newlines (in the JSON values themselves) or too few (no newlines between the tweets at all).
You can still repair this by using some creative re-stitching. The following generator function should do it:
import json

def read_objects(filename):
    decoder = json.JSONDecoder()
    with open(filename, 'r') as inputfile:
        line = next(inputfile).strip()
        while line:
            try:
                obj, index = decoder.raw_decode(line)
                yield obj
                # raw_decode does not skip leading whitespace, so strip it here
                line = line[index:].lstrip()
            except ValueError:
                # Assume we didn't have a complete object yet
                line += next(inputfile).strip()
            if not line:
                # default of '' ends the loop cleanly at end of file
                line += next(inputfile, '').strip()
This should be able to read all your JSON objects in sequence:
for filename in all_files:
    for data in read_objects(filename):
        if 'text' in data and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])
It is otherwise fine to have multiple JSON strings written to one file, but you need to make sure that the entries are clearly separated somehow. Writing JSON entries that do not use newlines, then using newlines in between them, for example, makes sure you can later on read them one by one again and process them sequentially without this much hassle.
