Python reading CSV file using URL giving error - python

I am trying to read a CSV file using the requests library but I am having issues.
import requests
import csv
url = 'https://storage.googleapis.com/sentiment-analysis-dataset/training_data.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
I then tried
for row in reader:
print(row)
but it gave me this error:
Error: iterator should return strings, not bytes (did you open the file in text mode?)
How should I fix this?

What you probably want is:
text = r.iter_lines(decode_unicode=True)
This will return a strings-iterator instead of a bytes-iterator. (See here for documentation.)

Related

Python-3 Trying to iterate through a csv and get http response codes

I am attempting to read a csv file that contains a long list of urls. I need to iterate through the list and get the urls that throw a 301, 302, or 404 response. In trying to test the script I am getting an exited with code 0 so I know it is error free but it is not doing what I need it to. I am new to python and working with files, my experience has been ui automation primarily. Any suggestions would be gladly appreciated. Below is the code.
import csv
import requests
import responses
from urllib.request import urlopen
from bs4 import BeautifulSoup
f = open('redirect.csv', 'r')
contents = []
with open('redirect.csv', 'r') as csvf: # Open file in read mode
urls = csv.reader(csvf)
for url in urls:
contents.append(url) # Add each url to list contents
def run():
resp = urllib.request.urlopen(url)
print(self.url, resp.getcode())
run()
print(run)
Given you have a CSV similar to the following (the heading is URL)
URL
https://duckduckgo.com
https://bing.com
You can do something like this using the requests library.
import csv
import requests
with open('urls.csv', newline='') as csvfile:
errors = []
reader = csv.DictReader(csvfile)
# Iterate through each line of the csv file
for row in reader:
try:
r = requests.get(row['URL'])
if r.status_code in [301, 302, 404]:
# print(f"{r.status_code}: {row['url']}")
errors.append([row['url'], r.status_code])
except:
pass
Uncomment the print statement if you want to see the results in the terminal. The code at the moment appends a list of URL and status code to an errors list. You can print or continue processing this if you prefer.

csv.Error: did you open the file in text mode?

the python file contains the following code
import csv
import urllib.request
url = "https://gist.githubusercontent.com/aparrish/cb1672e98057ea2ab7a1/raw/13166792e0e8436221ef85d2a655f1965c400f75/lebron_james.csv"
stats = list(csv.reader(urllib.request.urlopen(url)))
When I run the above code in python, I get the following exception:
Error
Traceback (most recent call last)
in ()
1 url = "https://gist.githubusercontent.com/aparrish/cb1672e98057ea2ab7a1/raw/13166792e0e8436221ef85d2a655f1965c400f75/lebron_james.csv"
----> 2 stats = list(csv.reader(urllib.request.urlopen(url)))
Error: iterator should return strings, not bytes (did you open the file in text mode?)
How can I fix this problem?
The documentation for urllib recommends using the requests module.
You must pay attention to two things:
you must decode the data you receive from the internet (which is bytes) in order to have text. With requests, using text takes care of the decoding.
csvreader expects a list of lines, not a block of text. Here, we split it with splitlines.
So, you can do it like this:
import csv
import requests
url = "https://gist.githubusercontent.com/aparrish/cb1672e98057ea2ab7a1/raw/13166792e0e8436221ef85d2a655f1965c400f75/lebron_james.csv"
text = requests.get(url).text
lines = text.splitlines()
stats = csv.reader(lines)
for row in stats:
print(row)
# ['Rk', 'G', 'Date', 'Age', 'Tm', ...]
# ['1', '1', '2013-10-29', '28-303', 'MIA',... ]
I don't really know what is that data, but if you interested in separating them with ,, you can try something like this:
stats = list(csv.reader(urllib.request.urlopen(url).read().decode()))
1.It reads from response data
2.Decode from Bytes to String
3.CSV reader
4.Cast CSV object to list
let me know if you want that data somehow differently so i can edit my answer. Good Luck.
You should read the response of urllib.request.urlopen.
stats = list(csv.reader(urllib.request.urlopen(url).read().decode("UTF-8")))
Try the following code.
import csv
import io
import urllib.request
csv_response = urllib.request.urlopen(url)
lst = list(csv.reader(io.TextIOWrapper(csv_response)))

Set up a crawler and downloaded tweets. Unable to parse JSON file

I have been trying to parse a JSON file and it keeps giving me additional data errors. Since I am new to Python, I have no idea how I can resolve this. It seems there are multiple objects within the file. How do I parse it without getting any errors?
Edit: (Not my code but I am trying to work on it)
import json
import csv
import io
'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''
data_json = io.open('filename', mode='r', encoding='utf-8').read() #reads in
the JSON file
data_python = json.loads(data_json)
csv_out = io.open('filename', mode='w', encoding='utf-8') #opens csv file
fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field
names
csv_out.write(fields)
csv_out.write(u'\n')
for line in data_python:
#writes a row and gets the fields from the json object
#screen_name and followers/friends are found on the second level hence two
get methods
row = [line.get('created_at'),
'"' + line.get('text').replace('"','""') + '"', #creates double
quotes
line.get('user').get('screen_name'),
unicode(line.get('user').get('followers_count')),
unicode(line.get('user').get('friends_count')),
unicode(line.get('retweet_count')),
unicode(line.get('favorite_count'))]
row_joined = u','.join(row)
csv_out.write(row_joined)
csv_out.write(u'\n')
csv_out.close()
Edit 2: I found another recipe to parse it but there is no way for me to save the output. Any recommendations?
import json
import re
json_as_string = open('filename.json', 'r')
# Call this as a recursive function if your json is highly nested
lines = [re.sub("[\[\{\]]*", "", one_object.rstrip()) for one_object in
json_as_string.readlines()]
json_as_list = "".join(lines).split('}')
for elem in json_as_list:
if len(elem) > 0:
print(json.loads(json.dumps("{" + elem[::1] + "}")))

Python Error: iterator should return strings, not bytes (did you open the file in text mode?) [duplicate]

I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from National Library of Medicine ftp site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.
I have tried the following code:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]
It results in the following error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
data = [row for row in csvfile]
File "<pyshell#4>", line 1, in <listcomp>
data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I presume I should be working with strings not bytes? Any help with the simple problem, and an explanation as to what is going wrong would be greatly appreciated.
The problem relies on urllib returning bytes. As a proof, you can try to download the csv file with your browser and opening it as a regular file and the problem is gone.
A similar problem was addressed here.
It can be solved decoding bytes to strings with the appropriate encoding. For example:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream.read().decode('utf-8')) # with the appropriate encoding
data = [row for row in csvfile]
The last line could also be: data = list(csvfile) which can be easier to read.
By the way, since the csv file is very big, it can slow and memory-consuming. Maybe it would be preferable to use a generator.
EDIT:
Using codecs as proposed by Steven Rumbalski so it's not necessary to read the whole file to decode. Memory consumption reduced and speed increased.
import csv
import urllib.request
import codecs
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
print(line) # do something with line
Note that the list is not created either for the same reason.
Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urlib.request).
The basis of using codecs.itercode() to solve the original problem is still the same as in the accepted answer.
import codecs
from contextlib import closing
import csv
import requests
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
with closing(requests.get(url, stream=True)) as r:
reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
for row in reader:
print row
Here we also see the use of streaming provided through the requests package in order to avoid having to load the entire file over the network into memory first (which could take long if the file is large).
I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.
Some of the ideas (e.g using closing()) are picked from this similar post
I had a similar problem using requests package and csv.
The response from post request was type bytes.
In order to user csv library, first I a stored them as a string file in memory (in my case the size was small), decoded utf-8.
import io
import csv
import requests
response = requests.post(url, data)
# response.content is something like:
# b'"City","Awb","Total"\r\n"Bucuresti","6733338850003","32.57"\r\n'
csv_bytes = response.content
# write in-memory string file from bytes, decoded (utf-8)
str_file = io.StringIO(csv_bytes.decode('utf-8'), newline='\n')
reader = csv.reader(str_file)
for row_list in reader:
print(row_list)
# Once the file is closed,
# any operation on the file (e.g. reading or writing) will raise a ValueError
str_file.close()
Printed something like:
['City', 'Awb', 'Total']
['Bucuresti', '6733338850003', '32.57']
urlopen will return a urllib.response.addinfourl instance for an ftp request.
For ftp, file, and data urls and requests explicity handled by legacy
URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager...
>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
At this point ftpstream is a file like object, using .read() would return the contents however csv.reader requires an iterable in this case:
Defining a generator like so:
def to_lines(f):
line = f.readline()
while line:
yield line
line = f.readline()
We can create our csv reader like so:
reader = csv.reader(to_lines(ftps))
And with a url
url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"
The code:
for row in reader: print row
Prints
>>>
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']

Write contents of URL request to file

I am trying to fetch a list from a php file using python and save it to a file:
import urllib.request
page = urllib.request.urlopen('http://crypto-bot.hopto.org/server/list.php')
f = open("test.txt", "w")
f.write(str(page))
f.close()
print(page.read())
Output on screen (divided onto four lines for readability):
ALF\nAMC\nANC\nARG\nBQC\nBTB\nBTE\nBTG\nBUK\nCAP\nCGB\nCLR\nCMC\nCRC\nCSC\nDGC\n
DMD\nELC\nEMD\nFRC\nFRK\nFST\nFTC\nGDC\nGLC\nGLD\nGLX\nHBN\nIXC\nKGC\nLBW\nLKY\n
LTC\nMEC\nMNC\nNBL\nNEC\nNMC\nNRB\nNVC\nPHS\nPPC\nPXC\nPYC\nQRK\nSBC\nSPT\nSRC\n
STR\nTRC\nWDC\nXPM\nYAC\nYBC\nZET\n
Output in file:
<http.client.HTTPResponse object at 0x00000000031DAEF0>
Can you tell me what I am doing wrong?
Use urllib.urlretrieve (urllib.request.urlretrieve in Python 3).
In the console:
>>> import urllib
>>> urllib.urlretrieve('http://crypto-bot.hopto.org/server/list.php','test.txt')
('test.txt', <httplib.HTTPMessage instance at 0x101338050>)
This results in a file, test.txt, being saving in the current working directory with the contents
ALF
AMC
ANC
ARG
...etc...
You need to read from the file object before writing to the file. Also you should the same object to both file and screen.
Do this:
import urllib.request
page = urllib.request.urlopen('http://crypto-bot.hopto.org/server/list.php')
f = open("test.txt", "w")
content = page.read()
f.write(content)
f.close()
print(content)
You're not reading the content from the urlopen file-like when you write to the file.
Also, shutil.copyfileobj().

Categories

Resources