I have a script that reads URLs from a text file, performs a request and then saves all the responses in one text file. How can I save each response in a different text file instead of all in the same file? For example, if my text file labeled input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt instead of just one .txt file, so that for each request the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue

        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})

        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue

        ...

        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can create a new file with open('something.txt', 'w'). If the file already exists, its contents are erased; otherwise a new file named 'something.txt' is created. You can then use file.write() to write your data.
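For example, a minimal sketch of that idea (the URL list and output filename pattern here are only placeholders):

import requests

urls = ['https://example.com']  # placeholder list of URLs

for i, url in enumerate(urls, start=1):
    response = requests.get(url)
    # 'w' creates the file if it doesn't exist, or empties it if it does
    with open('output{}.txt'.format(i), 'w') as f:
        f.write(response.text)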
I'm not sure if I understood your problem correctly.
I would create a list and add an object for each URL request and response. Then write each object to a different file, as sketched below.
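A rough sketch of that approach, reusing input.txt from the question; the record field names are just illustrative:

import requests

# Build one record per URL, then write each record to its own file
records = []
with open('input.txt') as f_in:
    for url in map(str.strip, f_in):
        if not url:
            continue
        records.append({'url': url, 'body': requests.get(url).text})

for i, record in enumerate(records, start=1):
    with open('output{}.txt'.format(i), 'w') as f_out:
        f_out.write(record['url'] + '\n' + record['body'])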
There are at least two ways you could generate a file for each url. One, shown below, is to create a hash of some unique data from the page. In this case I chose the category text, but you could also use the whole contents of the page. This creates a unique string to use as a file name, so that links with different category text don't overwrite each other when saved.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data on the Internet should not be trusted.
Here's your code with a SHA-256 hash used for the filename. The hash here isn't for security; it's just a convenient way to derive a unique, filesystem-safe filename.
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue

        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})

        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the page. That way, if anything within the page changes, the previous contents aren't lost. In this case, I hash both the URL and the page contents and concatenate the two hashes, with the URL hash followed by the hash of the contents. That way, all versions of a page sit next to each other when the directory is sorted.
for category in categories:
    data = line + "," + category.text

    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))

    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))

    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)
I have a JSON Lines format text file in which each line contains a valid JSON object. However, these JSON objects are not separated by a delimiter, so the file as a whole is not a valid JSON file.
I want to add a comma after each JSON object, so as to make the whole file a valid JSON file, which can be processed at once using json.load().
I have written the following code to add a comma at the end of each line:
import json
import csv

testdata = open('resutdata.csv', 'wb')
csvwriter = csv.writer(testdata)
with open('data.json') as f:
    for line in f:
        csvwriter.writerow([json.loads(line), ','])
testdata.close()
However, the resulting csv file wraps each line in quotes and adds a quoted comma at the end. How do I solve my problem?
Since you need to convert JSON Lines into a single JSON file, you can convert it directly as follows:
import json

# Contains the output json file
resultfile = open('resultdata.json', 'wt')

data = []
with open('data.json') as f:
    for line in f:
        data.append(json.loads(line))

resultfile.write(json.dumps(data))
resultfile.close()
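As a quick check (assuming the filenames above), the resulting file can now be loaded with a single json.load() call:

import json

with open('resultdata.json') as f:
    data = json.load(f)

print(len(data))  # number of objects from the original JSON Lines file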
I am trying to read a CSV file using the requests library but I am having issues.
import requests
import csv
url = 'https://storage.googleapis.com/sentiment-analysis-dataset/training_data.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
I then tried
for row in reader:
    print(row)
but it gave me this error:
Error: iterator should return strings, not bytes (did you open the file in text mode?)
How should I fix this?
What you probably want is:
text = r.iter_lines(decode_unicode=True)
This will return an iterator of strings instead of an iterator of bytes. (See the requests documentation for iter_lines.)
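Putting it together, a sketch using the same URL as in the question:

import csv
import requests

url = 'https://storage.googleapis.com/sentiment-analysis-dataset/training_data.csv'
r = requests.get(url)

# decode_unicode=True makes iter_lines yield str instead of bytes,
# which is what csv.reader expects
reader = csv.reader(r.iter_lines(decode_unicode=True), delimiter=',')
for row in reader:
    print(row)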
I used this tweepy-based code to pull the tweets of a given user by user_id. I then saved a list of all tweets of a given user (alltweets) to a json file as follows. Note that without "repr", I wasn't able to dump the alltweets list into the json file. The code worked as expected.
with open(os.path.join(output_file_path, '%s_tweets.json' % user_id), 'a') as f:
    json.dump(repr(alltweets), f)
However, I have a side problem with retrieving the tweets after saving them to the json file. I need to access the text in each tweet, but I'm not sure how to deal with the "Status" wrapper that tweepy uses (see the attached sample of the json file).
I tried to iterate over the lines in the file as follows, but the file is being seen as a single line.
with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)
I also tried to iterate over statuses after reading the json file as a string, as follows, but the iteration then takes place over the individual characters in the json file.
with open(fname, 'r') as f:
    x = f.read()
    for status in x:
        """code"""
Maybe not the prettiest solution, but you could just declare Status as a dict and then eval the list (the whole content of the file).
Status = dict

f = open(fname, 'r')
data = eval(f.read())
f.close()

for status in data:
    """ do your stuff"""
I am trying to write a service to read twitter feed stream data and then write it to a file. I am writing each JSON structure to a line in the file. With a different service I need to read each line of the file and load the json structure for further operations.
My problem is that I can read the first line, but then the JSON loader says the rest are not JSON structures. They look fine, and I'm not sure what is going on.
Writing file:
self.output = open(os.path.join(self.outputdir, self.filename), 'w')
self.output.write(status + "\n")
Reading File:
with open(file) as f:
    for line in f:
        line = line.replace("\n", "")
        tweet = json.loads(line)
        print tweet['text']
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Example json file: JSON File
Your file is composed of multiple json objects and empty lines.
You need to load each line as a new json object and ignore empty lines:
>>> with open('streamer.151205-071156.json') as f:
...     data = [json.loads(l) for l in f if len(l) > 1]
...
>>> len(data)
7
>>> print(data[0]['text'])
u'Mnjd \U0001f642\U0001f602 https://t.co/BL5Ezxtt0i'
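On the writing side, dumping each status as one JSON object per line (rather than its string representation) keeps the file parseable this way. A minimal sketch, assuming status is already a plain dict for one tweet and streamer.json is just an example filename:

import json

status = {'text': 'example tweet'}  # placeholder for one tweet's data

# one JSON object per line, so each line can be parsed independently later
with open('streamer.json', 'a') as out:
    out.write(json.dumps(status) + '\n')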