What is the proper way to read a text file from the internet?
For example, the text file here: https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt
The code below works, but it produces an extra b' in front of each word:
from urllib.request import urlopen

#url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt'
url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'

l = set()
# it's a file-like object and works just like a file
data = urlopen(url)
for line in data:  # files are iterable
    word = line.strip()
    print(word)
    l.add(word)
print(l)
You have to decode each bytes object to unicode. For that you can use the decode('utf-8') method. Here's the code:
from urllib.request import urlopen

url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'

l = set()
data = urlopen(url)
for line in data:  # files are iterable
    word = line.strip().decode('utf-8')  # decode the line into unicode
    print(word)
    l.add(word)
print(l)
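Alternatively, you can let io.TextIOWrapper from the standard library do the decoding, since urlopen returns a binary file-like object. A minimal sketch using the same URL:

import io
from urllib.request import urlopen

url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'

# TextIOWrapper decodes the byte stream on the fly, so iteration yields str
data = io.TextIOWrapper(urlopen(url), encoding='utf-8')
words = {line.strip() for line in data}
print(words)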
It's simple using pandas. Just execute
import pandas as pd
pd.read_csv('https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt')
and you are all set :)
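One caveat: read_csv treats the first line as a header by default, so the first word of the file would be consumed as a column name. A sketch that keeps every word as data (the 'word' label is my own choice):

import pandas as pd

url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'
# header=None keeps the first line as data; names= supplies an arbitrary label
df = pd.read_csv(url, header=None, names=['word'])
words = set(df['word'])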
So I was trying to make a filter that filters out the crap from this scrape, but I have an issue where it filters out individual words. I would like to filter out the whole line instead of just the words.
from bs4 import BeautifulSoup
import requests
import os

def Scrape():
    page = input("Page: ")
    url = "https://openuserjs.org/?p=" + page
    source = requests.get(url)
    soup = BeautifulSoup(source.text, 'lxml')
    os.system('cls')
    Filter(soup)

def Filter(soup):
    crap = ""
    f = open("Data/Crap.txt", "r")
    for craptext in f:
        crap = craptext
    for Titles in soup.select("a.tr-link-a>b"):
        print(Titles.text.replace(crap, "").strip())

while True:
    Scrape()
Instead of:

print(Titles.text.replace(crap, "").strip())

Try using:

if crap not in Titles.text:
    print(Titles.text.strip())
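Note that your loop over f keeps only the last line of Data/Crap.txt. If that file holds one unwanted phrase per line, a sketch that skips a title when any phrase matches (same file layout assumed):

def Filter(soup):
    # one unwanted phrase per line; strip newlines and skip blanks
    with open("Data/Crap.txt", "r") as f:
        crap = [line.strip() for line in f if line.strip()]
    for Titles in soup.select("a.tr-link-a>b"):
        # drop the whole line if any unwanted phrase occurs in it
        if not any(phrase in Titles.text for phrase in crap):
            print(Titles.text.strip())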
I am trying to save the list that is generated to a file. I see the printout of the list fine, but it will not write to the Compoundlist.csv file. I am not sure what I am doing wrong; I have tried to write after the list is generated and also during the loop, and I have gotten the same result.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
import csv

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

compoundlist = []
soup = make_soup("http://www.genome.jp/dbget-bin/www_bget?ko00020")
i = 1
file = open("Compoundlist.csv", "wb")
for record in soup.findAll("nobr"):
    compound = ''
    if (record.text[0] == "C" and record.text[1] == '0') or (record.text[0] == "C" and record.text[1] == '1'):
        compoundlist = "http://www.genome.jp/dbget-bin/www_bget?cpd:" + record.text
        file.write(compoundlist)
        print(compoundlist)
Try adding the following to the end of your code, to flush the open file buffer into the file:

file.close()
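Note also that the file is opened in binary mode ("wb") while compoundlist is a str, which raises a TypeError under Python 3 (which your imports suggest you are using). A sketch of the same loop using a with block in text mode, so flushing and closing happen automatically:

# text mode ("w") accepts str, and the with block closes (and flushes) the file
with open("Compoundlist.csv", "w") as file:
    for record in soup.findAll("nobr"):
        if record.text[:2] in ("C0", "C1"):  # same C0/C1 prefix test as above
            compound = "http://www.genome.jp/dbget-bin/www_bget?cpd:" + record.text
            file.write(compound + "\n")  # newline so each URL gets its own row
            print(compound)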
My current professor is using Python 2.7 for examples in class, but other professors that I will be taking classes from in the future have suggested I use Python 3.5. I am trying to convert my current professor's examples from 2.7 to 3.5. Right now I'm having an issue with the urllib2 package, which I understand has been split up in Python 3.
The original code in the IPython notebook looks like this:
import csv
import urllib2

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
response = urllib2.urlopen(data_url)
myreader = csv.reader(response)
for i in range(5):
    row = next(myreader)
    print ','.join(row)
Which I have converted to:
import csv
import urllib.request

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
response = urllib.request.urlopen(data_url)
myreader = csv.reader(response)
for i in range(5):
    row = next(myreader)
    print(','.join(row))
But that leaves me with the error:
Error Traceback (most recent call last)
<ipython-input-19-20da479e256f> in <module>()
7 myreader = csv.reader(response)
8 for i in range(5):
----> 9 row = next(myreader)
10 print(','.join(row))
Error: iterator should return strings, not bytes (did you open the file in text mode?)
I'm unsure how to proceed from here. Any ideas?
Wrap response with another iterator that decodes the bytes to strings and yields the strings:
import csv
import urllib.request

def decode_iter(it):
    # iterate line by line
    for line in it:
        # convert bytes to string using `bytes.decode`
        yield line.decode()

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
response = urllib.request.urlopen(data_url)
myreader = csv.reader(decode_iter(response))
for i in range(5):
    row = next(myreader)
    print(','.join(row))
UPDATE
Instead of decode_iter, you can use codecs.iterdecode:
import csv
import codecs
import urllib.request

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
response = urllib.request.urlopen(data_url)
myreader = csv.reader(codecs.iterdecode(response, 'utf-8'))
for i in range(5):
    row = next(myreader)
    print(','.join(row))
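io.TextIOWrapper from the standard library can also do the decoding, since the response is a binary file-like object; a minimal sketch:

import csv
import io
import urllib.request

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
response = urllib.request.urlopen(data_url)
# wrap the byte stream so the csv reader sees str lines instead of bytes
myreader = csv.reader(io.TextIOWrapper(response, encoding='utf-8'))
for i in range(5):
    print(','.join(next(myreader)))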
I'm building a simple scraper in order to learn Python.
After writing the csvWriter function below, I'm having issues. It seems that the scraped text can't be encoded when written to the csv file (I assume this is because of the price information I'm scraping).
Also, I'm wondering if I am correct in thinking that in this case it is best to go from set -> list to get the information zipped and presented in the way that I want before writing.
Also, any general advice on how I am approaching this?
from bs4 import BeautifulSoup
import requests
import time
import csv

response = requests.get('http://website.com/subdomain/logqueryhere')
baseurl = 'http://website.com'
soup = BeautifulSoup(response.text)
hotelInfo = soup.find_all("div", {'class': "hotel-wrap"})

# retrieveLinks: A function to generate a list of hotel URLs to be passed to the price checker.
def retrieveLinks():
    for hotel in hotelInfo:
        urllist = []
        hotelLink = hotel.find('a', attrs={'class': ''})
        urllist.append(hotelLink['href'])
        scraper(urllist)

hotelnameset = set()
hotelurlset = set()
hotelpriceset = set()

# Scraper: A function to scrape from the lists generated above with retrieveLinks
def scraper(inputlist):
    global hotelnameset
    global hotelurlset
    global hotelpriceset
    # Use a set here to avoid any dupes.
    for url in inputlist:
        fullurl = baseurl + url
        hotelurlset.add(str(fullurl))
        hotelresponse = requests.get(fullurl)
        hotelsoup = BeautifulSoup(hotelresponse.text)
        hoteltitle = hotelsoup.find('div', attrs={'class': 'vcard'})
        hotelhighprice = hotelsoup.find('div', attrs={'class': 'pricing'}).text
        hotelpriceset.add(hotelhighprice)
        for H1 in hoteltitle:
            hotelName = hoteltitle.find('h1').text
            hotelnameset.add(str(hotelName))
        time.sleep(2)
    csvWriter()

# csvWriter: A function to write the above mentioned sets/lists to a CSV file.
def csvWriter():
    global hotelnameset
    global hotelurlset
    global hotelpriceset
    csvname = list(hotelnameset)
    csvurl = list(hotelurlset)
    csvprice = list(hotelpriceset)
    # lets zip the values we needed (until we learn a better way to do it)
    zipped = zip(csvname, csvurl, csvprice)
    c = csv.writer(open("hoteldata.csv", 'wb'))
    for row in zipped:
        c.writerow(row)

retrieveLinks()
Error is as follows -
± |Add_CSV_Writer U:2 ✗| → python main.py
Traceback (most recent call last):
File "main.py", line 62, in <module>
retrieveLinks()
File "main.py", line 18, in retrieveLinks
scraper(urllist)
File "main.py", line 44, in scraper
csvWriter()
File "main.py", line 60, in csvWriter
c.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
In Python 2.x the csv writer does not automatically encode unicode for you. You essentially have to write your own, using unicodecsv (https://pypi.python.org/pypi/unicodecsv/0.9.0) or one of the unicode CSV implementations on the web:
import unicodecsv

def csvWriter():
    global hotelnameset
    global hotelurlset
    global hotelpriceset
    csvname = list(hotelnameset)
    csvurl = list(hotelurlset)
    csvprice = list(hotelpriceset)
    # lets zip the values we needed (until we learn a better way to do it)
    zipped = zip(csvname, csvurl, csvprice)
    with open('hoteldata.csv', 'wb') as f_in:
        c = unicodecsv.writer(f_in, encoding='utf-8')
        for row in zipped:
            c.writerow(row)
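If you later move this script to Python 3, the standard csv module handles unicode on its own once the file is opened in text mode. A minimal sketch of the same function under that assumption:

import csv

def csvWriter():
    global hotelnameset
    global hotelurlset
    global hotelpriceset
    # same zipped rows as above
    zipped = zip(list(hotelnameset), list(hotelurlset), list(hotelpriceset))
    # newline='' is what the csv docs recommend to avoid blank rows on Windows
    with open('hoteldata.csv', 'w', encoding='utf-8', newline='') as f_in:
        c = csv.writer(f_in)
        for row in zipped:
            c.writerow(row)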
I'm looking to do the equivalent of grep -B14 MMA.
I have a URL that I open and it spits out many lines.
I want to
find the line that has 'MMa'
then print the 14th line before it
I don't even know where to begin with this.
import urllib
import urllib2

url = "https://longannoyingurl.com"
opts = {
    'action': 'Dump+It'
}
data = urllib.urlencode(opts)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
print response.read()  # gives the full html output
Instead of just doing a bare read on the response object, call readlines instead, and then run a regular expression over each line. If a line matches, print the 14th line before it, but check that you're not indexing negatively. E.g.
import re

lines = response.readlines()
r = re.compile(r'MMa')
for i in range(len(lines)):
    if r.search(lines[i]):
        print lines[max(0, i-14)]
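If the response were too large to hold in memory with readlines, a collections.deque bounded to 14 lines could keep just the trailing window while streaming; a sketch under the same 'MMa' assumption:

import re
from collections import deque

r = re.compile(r'MMa')
window = deque(maxlen=14)  # always holds at most the 14 most recent lines
for line in response:
    if r.search(line) and window:
        print(window[0])  # 14 lines back, or the earliest line seen so far
    window.append(line)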
Thanks to Dan, I got my result:
import urllib
import urllib2
import re

url = "https://somelongannoyingurl/blah/servlet"
opts = {
    'authid': 'someID',
    'action': 'Dump+It'
}
data = urllib.urlencode(opts)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)

lines = response.readlines()
r = re.compile(r'MMa')
for i in range(len(lines)):
    if r.search(lines[i]):
        line = lines[max(0, i-14)].strip()
        junk, mma = line.split('>')
        print mma.strip()
You can split a single string into a list of lines using mystr.splitlines(). You can test whether a line contains your pattern using re.search() (re.match() only matches at the beginning of the string). Once you find the matching line(s), you can index backwards into your list of lines to find the 14th line before.
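A minimal sketch of that approach, assuming the page text is already in mystr and the same 'MMa' pattern as above:

import re

lines = mystr.splitlines()  # mystr: the downloaded page as one string
for i, line in enumerate(lines):
    if re.search(r'MMa', line):
        # clamp at 0 so a match in the first 14 lines doesn't wrap around
        print(lines[max(0, i - 14)])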