I have a script that reads URLs from a text file, performs a request and then saves all the responses in one text file. How can I save each response in a different text file instead of all in the same file? For example, if my text file labeled input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt instead of just one .txt file, so that for each request the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue

        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})

        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue

        ...

        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can create a new file with open('something.txt', 'w'). If the file already exists, its contents are erased; otherwise a new file named 'something.txt' is created. You can then use file.write() to write your data.
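For example, a minimal sketch of that idea (the URL list and output filename pattern here are only placeholders):

import requests

urls = ['https://example.com']  # placeholder list of URLs

for i, url in enumerate(urls, start=1):
    response = requests.get(url)
    # 'w' creates the file if it doesn't exist, or empties it if it does
    with open('output{}.txt'.format(i), 'w') as f:
        f.write(response.text)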
I'm not sure if I understood your problem correctly.
I would create a list and add an object for each URL request and response. Then write each object to a different file, as sketched below.
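A rough sketch of that approach, reusing input.txt from the question; the record field names are just illustrative:

import requests

# Build one record per URL, then write each record to its own file
records = []
with open('input.txt') as f_in:
    for url in map(str.strip, f_in):
        if not url:
            continue
        records.append({'url': url, 'body': requests.get(url).text})

for i, record in enumerate(records, start=1):
    with open('output{}.txt'.format(i), 'w') as f_out:
        f_out.write(record['url'] + '\n' + record['body'])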
There are at least two ways you could generate a file for each url. One, shown below, is to create a hash of some unique data from the page. In this case I chose the category text, but you could also use the whole contents of the page. This creates a unique string to use as a file name, so that links with different category text don't overwrite each other when saved.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data on the Internet should not be trusted.
Here's your code with a SHA-256 hash used for the filename. The hash here isn't for security; it's just a convenient way to derive a unique, filesystem-safe filename.
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue

        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})

        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the page. That way, if anything within the page changes, the previous contents aren't lost. In this case, I hash both the URL and the page contents and concatenate the two hashes, with the URL hash followed by the hash of the contents. That way, all versions of a page sit next to each other when the directory is sorted.
for category in categories:
    data = line + "," + category.text

    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))

    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))

    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)
I have a JSON Lines format text file in which each line contains a valid JSON object. However, these JSON objects are not separated by a delimiter, so the file as a whole is not a valid JSON file.
I want to add a comma after each JSON object, so as to make the whole file a valid JSON file, which can be processed at once using json.load().
I have written the following code to add a comma at the end of each line:
import json
import csv

testdata = open('resutdata.csv', 'wb')
csvwriter = csv.writer(testdata)
with open('data.json') as f:
    for line in f:
        csvwriter.writerow([json.loads(line), ','])
testdata.close()
However, the resulting csv file wraps each line in quotes and adds a quoted comma at the end. How do I solve my problem?
Since you need to convert JSON Lines into a single JSON file, you can convert it directly as follows:
import json

# Contains the output json file
resultfile = open('resultdata.json', 'wt')

data = []
with open('data.json') as f:
    for line in f:
        data.append(json.loads(line))

resultfile.write(json.dumps(data))
resultfile.close()
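As a quick check (assuming the filenames above), the resulting file can now be loaded with a single json.load() call:

import json

with open('resultdata.json') as f:
    data = json.load(f)

print(len(data))  # number of objects from the original JSON Lines file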
I am trying to read a CSV file using the requests library but I am having issues.
import requests
import csv
url = 'https://storage.googleapis.com/sentiment-analysis-dataset/training_data.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
I then tried
for row in reader:
    print(row)
but it gave me this error:
Error: iterator should return strings, not bytes (did you open the file in text mode?)
How should I fix this?
What you probably want is:
text = r.iter_lines(decode_unicode=True)
This will return an iterator of strings instead of an iterator of bytes. (See the requests documentation for iter_lines.)
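Putting it together, a sketch using the same URL as in the question:

import csv
import requests

url = 'https://storage.googleapis.com/sentiment-analysis-dataset/training_data.csv'
r = requests.get(url)

# decode_unicode=True makes iter_lines yield str instead of bytes,
# which is what csv.reader expects
reader = csv.reader(r.iter_lines(decode_unicode=True), delimiter=',')
for row in reader:
    print(row)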
I used this tweepy-based code to pull the tweets of a given user by user_id. I then saved a list of all tweets of a given user (alltweets) to a json file as follows. Note that without "repr", I wasn't able to dump the alltweets list into the json file. The code worked as expected.
with open(os.path.join(output_file_path, '%s_tweets.json' % user_id), 'a') as f:
    json.dump(repr(alltweets), f)
However, I have a side problem with retrieving the tweets after saving them to the json file. I need to access the text in each tweet, but I'm not sure how to deal with the "Status" wrapper that tweepy uses (see the attached sample of the json file).
I tried to iterate over the lines in the file as follows, but the file is being seen as a single line.
with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)
I also tried to iterate over statuses after reading the json file as a string, as follows, but the iteration then takes place over the individual characters in the json file.
with open(fname, 'r') as f:
    x = f.read()
    for status in x:
        """code"""
Maybe not the prettiest solution, but you could just declare Status as a dict and then eval the list (the whole content of the file).
Status = dict

f = open(fname, 'r')
data = eval(f.read())
f.close()

for status in data:
    """ do your stuff"""
I am trying to write a service to read twitter feed stream data and then write it to a file. I am writing each JSON structure to a line in the file. With a different service I need to read each line of the file and load the json structure for further operations.
My problem is that I can read the first line, but then the JSON loader says the rest are not JSON structures. They look fine, and I'm not sure what is going on.
Writing file:
self.output = open(os.path.join(self.outputdir, self.filename), 'w')
self.output.write(status + "\n")
Reading File:
with open(file) as f:
    for line in f:
        line = line.replace("\n", "")
        tweet = json.loads(line)
        print tweet['text']
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Example json file: JSON File
Your file is composed of multiple json objects and empty lines.
You need to load each line as a new json object and ignore empty lines:
>>> with open('streamer.151205-071156.json') as f:
...     data = [json.loads(l) for l in f if len(l) > 1]
...
>>> len(data)
7
>>> print(data[0]['text'])
u'Mnjd \U0001f642\U0001f602 https://t.co/BL5Ezxtt0i'
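On the writing side, dumping each status as one JSON object per line (rather than its string representation) keeps the file parseable this way. A minimal sketch, assuming status is already a plain dict for one tweet and streamer.json is just an example filename:

import json

status = {'text': 'example tweet'}  # placeholder for one tweet's data

# one JSON object per line, so each line can be parsed independently later
with open('streamer.json', 'a') as out:
    out.write(json.dumps(status) + '\n')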