Download the content of a url and save it - python

I have this format of data from a url
[{"column1":"something","column2":"something 2","column1":"something3","column2":"something4"}, etc]
I want to download this content from the url and save it into a csv file in this format:
column1 column2
something something2
something3 something4
Can I do it with python using urllib2? Or any other library?

Requests is probably the easiest for the download. You'd just use:
import codecs, requests
req = requests.get(url)
text = req.text  # requests already decodes the body using req.encoding, so don't re-encode it to bytes here
with codecs.open(filename, 'w', encoding='utf-8') as f:
    f.write(text)
As yedpodtrzitko mentioned, however, each object in that sample repeats the keys column1 and column2, so parsing it as JSON will keep only the last value it encounters for each key. You could split it into rows and parse it manually, but strictly speaking input like that isn't valid JSON.
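If the endpoint actually returns one object per row (each with its own column1/column2 pair), a minimal sketch of the download-and-convert step could look like this; the url, the output filename, and the space delimiter (matching the sample output above) are assumptions:
import csv
import requests

url = 'http://example.com/data.json'  # hypothetical endpoint
rows = requests.get(url).json()       # expects a JSON array of {"column1": ..., "column2": ...} objects

with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['column1', 'column2'], delimiter=' ')
    writer.writeheader()
    writer.writerows(rows)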

Related

how to store bytes like b'PK\x03\x04\x14\x00\x08\x08\x08\x009bwR\x00\x00\x00\x00\x00\x00\x00 to dataframe or csv in python

I am requesting a URL and getting the response back as bytes. I want to store this in a dataframe and then write it out to CSV.
# Get data from the URL
import requests

url = "someURL"
req = requests.get(url)
url_content = req.content
csv_file = open('test.txt', 'wb')
print(type(url_content))
print(url_content)
csv_file.write(url_content)
csv_file.close()
I tried many approaches but couldn't find a solution. The above code does write the output to a file, but all I get is the raw bytes shown below. My end objective is to store the data as CSV, send it to Google Cloud, and create a Google BigQuery table from it.
Output:
<class 'bytes'>
b'PK\x03\x04\x14\x00\x08\x08\x08\x009bwR\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x13\x00\x00\x00[Content_Types].xml\xb5S\xcbn\xc20\x10\xfc\x95\xc8\xd76\xf4PU\x15\x81C\x1f\xc7\x16\xa9\xf4\x03\{\x93X\xf8%\xaf\xa1\xf0\xf7]\x078\x94R\x89\nq\xf2cfgfW\xf6d\xb6q\xb6ZCB\x13|\xc3\xc6|\xc4\xf0h\xe3\xbb\x86},^\xea{Va\x96^K\x1b<4\xcc\x076\x9bN\x16\xdb\x08XQ\xa9\xc7\x86\xf59\xc7\x07!P\xf5\xe0$\xf2\x10\xc1\x13\xd2\x86\xe4d\xa6c\xeaD\x94j);\x10\xb7\xa3\xd1\x9dP\xc1g\xf0\xb9\xceE\x83M'O\xd0\xca\x95\xcd\xd5\xe3\xee\xbeH7L\xc6h\x8d\x92\x99R\x89\xb5\xd7G\xa2\xf5^\x90'\xb0\x03\x07{\x13\xf1\x86\x08\xacz\xde\x90\xca\xae\x1bB\x91\x893\x1c\x8e\x0b\xcb\x99\xea\xdeh.\xc9h\xf8W\xb4\xd0\xb6F\x81\x0ej\xe5\xa8\x84CQ\xd5\xa0\xeb\x98\x88\x98\xb2\x81}\xce\xb9L\xf9U:\x12\x14D\x9e\x13\x8a\x82\xa4\xf9%\xde\x87\xb1\xa8\x90\xe0,\xc3B\xbc\xc8\xf1\xa8[\x8c\t\xa4\xc6\x1e ;\xcb\xb1\x97\t\xf4{N\xf4\x98~\x87\xd8X\xf1\x83p\xc5\x1cykOL\xa1\x04\x18\x90kN\x80V\xee\xa4\xf1\xa7\xdc\xbfBZ~\x86\xb0\xbc\x9e\x7fq\x18\xf6\x7f\xd9\x0f \x8aa\x19\x1fr\x88\xe1{O\xbf\x01PK\x07\x08z\x94\xcaq;\x01\x00\x00\x1c\x04\x00\x00PK\x03\x04\x14\x00\x08\x08\x08\x009bwR\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00_rels/.rels\xad\x92\xc1j\xc30\x0c\x86_\xc5\xe8\xde8\xed`\x8cQ\xb7\x972\xe8m\x8c\xee\x014[ILb\xcb\xd8\xda\x96\xbd\xfd\xcc.[K\n\x1b\xec($}\xff\x07\xd2v?\x87I\xbdQ.\x9e\xa3\x81u\xd3\x82\xa2h\xd9\xf9\xd8\x1bx>=\xac\xee#\x15\xc1\xe8p\xe2H\x06"\xc3~\xb7}\xa2\t\xa5n\x94\xc1\xa7\xa2"\x16\x03\x83H\xba\xd7\xba\xd8\x81\x02\x96\x86\x13\xc5\xda\xe98\x07\x94Z\xe6^'\xb4#\xf6\xa47m{\xab\xf3O\x06\x9c3\xd5\xd1\x19\xc8G\xb7\x06u\xc2\xdc\x93\x18\x98'\xfd\xcey|a\x1e\x9b\x8a\xad\x8d\x8fD\xbf\t\xe5\xae\xf3\x96\x0el_\x03EY\xc8\xbe\x98\x00\xbd\xec\xb2\xf9vql\x1f3\xd7ML\xe9\xbfeh\x16\x8a\x8e\xdc*\xd5\x04\xca\xe2\xa9\3\xbaY0\xb2\x9c\xe9oJ\xd7\x8f\xa2\x03\t:\x14\xfc\xa2^\x08\xe9\xb3\x1f\xd8}\x02PK\x07\x08\xa7\x8cz\xbd\xe3\x00\x00\x00I\x02\x00\x00PK\x03\x04\x14\x00\x08\x08\x08\x009bwR\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00docProps/app.xmlM\x8e\xc1\n\xc20\x10D\xef~E\xc8\xbd\xdd\xeaAD\xd2\x94\x82\x08\x9e\xecA? \xa4\xdb6\xd0lB\xb2J?
The original URL (now edited out of the question) suggests that the downloaded file is in .xlsx format. An .xlsx file is essentially one or more XML files in a zip archive (iBug's answer is correct in this respect).
Therefore, if you want to get the file's data into a dataframe, tell Pandas to read it as an Excel file:
import io

import pandas as pd
import requests

url = "someURL"
req = requests.get(url)
url_content = req.content

# Load the .xlsx bytes into a dataframe (wrap them in a file-like object)
df = pd.read_excel(io.BytesIO(url_content))

# Write to csv
df.to_csv('data.csv')
The initial bytes PK\x03\x04 indicate PK Zip format. Try unzipping it first, either with the unzip command-line tool or with Python's built-in zipfile module.
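A minimal sketch of the zipfile route, using the same placeholder URL as the question (the member names in the comment come from the output shown above):
import io
import zipfile

import requests

url = "someURL"                                   # placeholder, as in the question
url_content = requests.get(url).content

# Treat the downloaded bytes as an in-memory zip archive.
archive = zipfile.ZipFile(io.BytesIO(url_content))
print(archive.namelist())                         # e.g. ['[Content_Types].xml', '_rels/.rels', ...]
with archive.open(archive.namelist()[0]) as member:
    print(member.read()[:200])                    # peek at the first member's contents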

After reading urls from a text file, how can I save all the responses into separate files?

I have a script that reads urls from a text file, performs a request and then saves all the responses in one text file. How can I save each response in a different text file instead of all in the same file? For example, if my text file labeled input.txt has 20 urls, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt instead of just one .txt file. So for each request, the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can make a new file by using open('something.txt', 'w'). If the file already exists, its contents will be erased; otherwise a new file named 'something.txt' is created. You can then use file.write() to write your info!
I'm not sure if I understood your problem right.
I would create a list, create an object for each url request and its response, add those objects to the list, and then write each object to a different file, roughly as sketched below.
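A minimal sketch of that idea, assuming the goal is simply one output file per url in input.txt (the file names and the (url, text) tuple structure here are illustrative, not from the original answer):
import requests

# Collect (url, response_text) pairs, then write one file per entry.
results = []
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if line:
            results.append((line, requests.get(line).text))

for i, (url, text) in enumerate(results, start=1):
    with open(f'output{i}.txt', 'w', encoding='utf-8') as f_out:
        f_out.write(url + "\n" + text)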
There are at least two ways you could generate a file name for each url. One, shown below, is to hash some piece of the data. In this case I chose the category text, but you could also hash the whole contents of the page. This produces a fixed-length string that is safe to use as a file name no matter what characters the source text contains.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data from the Internet should not be trusted.
Here's your code with a SHA-256 hash used for the filename. A cryptographic hash is overkill for this purpose, but it's a convenient way to create unique, filesystem-safe filenames.
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup

with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class": 'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the page as well. That way, if anything within the page changes, the previous version isn't lost. In this case I hash both the url and the page contents and concatenate the two hashes, URL hash first, followed by the hash of the contents. That way, all versions of a page appear next to each other when the directory is sorted.
for category in categories:
    data = line + "," + category.text
    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))
    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))
    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)

Python Looping through urls in csv file returns \ufeffhttps://

I am new to python and I am trying to loop through a list of urls in a csv file and grab each website title using BeautifulSoup, which I would then like to save to a file Headlines.csv. But I am unable to grab the webpage title. If I use a variable with a single url as follows:
import requests as req
from bs4 import BeautifulSoup

url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'
resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title.text)
It works just fine and I get the title Japanese capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space
But when I use the loop,
import csv

with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)
I get the following
['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']
and an error message
InvalidSchema: No connection adapters were found for "['\\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']"
I am not sure what I am doing wrong.
You have a byte order mark, \ufeff, at the start of the URL you parse from your file.
It looks like your file was saved with a UTF-8 signature (BOM), i.e. with the utf-8-sig encoding.
You need to read the file with encoding='utf-8-sig'.
Read more here.
As the previous answer has already mentioned, the \ufeff means you need to change the encoding.
The second issue is that when you read a CSV file, you get back a list containing all the columns for each row. The keyword here is list: you are passing requests a list instead of a string.
Based on the example you have given, I would assume that your urls are in the first column of the csv. Python lists start at index 0, not 1, so to extract the url you need to take index 0, which refers to the first column.
import csv

with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url[0])
To read up more on lists, you can refer here.
You can add more columns to the CSV file and experiment to see how the results would appear.
If you would like to refer to the column name while reading each row, you can refer here.
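Putting both fixes together (the utf-8-sig encoding and indexing the first column), a minimal sketch of the corrected loop could look like the following; the csv filename and the 'lxml' parser are taken from the question:
import csv

import requests as req
from bs4 import BeautifulSoup

with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for row in reader:
        url = row[0]                       # the first column holds the url
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)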

Python : Extract exact word from a url

I started learning Python two days ago and I am trying to make a script that extracts some data from a url and saves it, but the problem is that I only want to extract one specific value from a long line.
EX:
{"2019-11-19":{"period":"2019-11-19T00:00:00+00:00","uniqs":"344627","hits":"0","clicked":"4922","pay":126.52971186,"currency":"RON","subs":0},"2019-11-20":{"period":"2019-11-20T00:00:00+00:00","uniqs":"1569983","hits":"0","clicked":"15621","pay":358.43100342,"currency":"RON","subs":0},"2019-11-21":{"period":"2019-11-21T00:00:00+00:00","uniqs":"1699844","hits":"0","clicked":"16172","pay":363.15667371,"currency":"RON","subs":0},"2019-11-22":{"period":"2019-11-22T00:00:00+00:00","uniqs":"1779319","hits":"0","clicked":"17865","pay":384.67092962,"currency":"RON","subs":0},"2019-11-23":{"period":"2019-11-23T00:00:00+00:00","uniqs":"1825346","hits":"0","clicked":"17740","pay":356.72833095,"currency":"RON","subs":0},"2019-11-24":{"period":"2019-11-24T00:00:00+00:00","uniqs":"1732639","hits":"0","clicked":"16870","pay":308.4201041,"currency":"RON","subs":0},"2019-11-25":{"period":"2019-11-25T00:00:00+00:00","uniqs":"1826060","hits":"0","clicked":"17991","pay":346.29137133,"currency":"RON","subs":0},"2019-11-26":{"period":"2019-11-26T00:00:00+00:00","uniqs":"1873961","hits":"0","clicked":"18645","pay":379.17652358,"currency":"RON","subs":0},"2019-11-27":{"period":"2019-11-27T00:00:00+00:00","uniqs":"1734207","hits":"0","clicked":"16187","pay":251.91152953,"currency":"RON","subs":0},"2019-11-28":{"period":"2019-11-28T00:00:00+00:00","uniqs":"1611611","hits":"0","clicked":"12056","pay":158.96447829,"currency":"RON","subs":0},"2019-11-29":{"period":"2019-11-29T00:00:00+00:00","uniqs":"712011","hits":"0","clicked":"6242","pay":85.70053418,"currency":"RON","subs":0},"2019-11-30":{"period":"2019-11-30T00:00:00+00:00","uniqs":"47957","hits":"0","clicked":"427","pay":8.32775435,"currency":"RON","subs":0},"2019-12-01":{"period":"2019-12-01T00:00:00+00:00","uniqs":"1268892","hits":"0","clicked":"11779","pay":217.42321168,"currency":"RON","subs":0},"2019-12-02":{"period":"2019-12-02T00:00:00+00:00","uniqs":"1130724","hits":"0","clicked":"10694","pay":195.44476902,"currency":"RON","subs":0},"2019-12-03":{"period":"2019-12-03T00:00:00+00:00","uniqs":"1058965","hits":"0","clicked":"8123","pay":151.05243751,"currency":"RON","subs":0},"2019-12-04":{"period":"2019-12-04T00:00:00+00:00","uniqs":"1228326","hits":"0","clicked":"12230","pay":230.84154581,"currency":"RON","subs":0},"2019-12-05":{"period":"2019-12-05T00:00:00+00:00","uniqs":"1181029","hits":"0","clicked":"11467","pay":196.21644271,"currency":"RON","subs":0},"2019-12-06":{"period":"2019-12-06T00:00:00+00:00","uniqs":"951828","hits":"0","clicked":"9379","pay":153.35155293,"currency":"RON","subs":0},"2019-12-07":{"period":"2019-12-07T00:00:00+00:00","uniqs":"1172156","hits":"0","clicked":"11776","pay":181.65819439,"currency":"RON","subs":0},"2019-12-08":{"period":"2019-12-08T00:00:00+00:00","uniqs":"912109","hits":"0","clicked":"9240","pay":147.6364827,"currency":"RON","subs":0}}
I am trying to extract the value after "pay": and save it to a file; after that I will write the code that calculates the total amount and gives me the result :D I worked one day on this :D
I use this code to extract and save the data from the link:
from urllib.request import urlopen as uReq
url1 = 'http://link.com'
page = uReq(url1).read().decode()
f = open("dataNEW.txt", "w")
f.write(page)
f.close()
but the problem is that it writes all the details there, and I want to save only what comes after "pay".
That string is in JSON format and can easily be converted to a Python data structure using the json package. Here is an example:
import json
from urllib.request import urlopen as uReq
url1 = 'http://link.com'
page = uReq(url1).read().decode()
data = json.loads(page)
with open("dataNEW.txt", "w") as f:
for sub_dict in data.values():
f.write("{}\n".format(sub_dict["pay"]))
Your dataNEW.txt should then look like the following:
358.43100342
363.15667371
384.67092962
356.72833095
126.52971186
346.29137133
379.17652358
251.91152953
158.96447829
85.70053418
8.32775435
147.6364827
153.35155293
181.65819439
308.4201041
196.21644271
230.84154581
151.05243751
195.44476902
217.42321168
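Since the question also mentions calculating the total afterwards, here is a small sketch of summing the pay values from the parsed JSON (not part of the original answer; url1 is the same placeholder as above, and RON is the currency shown in the data):
import json
from urllib.request import urlopen as uReq

url1 = 'http://link.com'
data = json.loads(uReq(url1).read().decode())

# Sum the "pay" values straight from the parsed JSON.
total = sum(sub_dict["pay"] for sub_dict in data.values())
print("Total pay: {:.2f} RON".format(total))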

Python - How to read the content of an URL twice?

I am using urllib.request.urlopen to read the content of an HTML page. Afterwards, I want to write the content to a local file and then do a certain operation on that page (e.g. construct a parser such as BeautifulSoup).
The problem
After reading the content for the first time (and writing it into a file), I can't read the content a second time in order to do something with it (e.g. construct a parser on it). The second read is just empty, and I can't move the cursor back to the beginning with seek(0).
import urllib.request

response = urllib.request.urlopen("http://finance.yahoo.com")
file = open("myTestFile.html", "w")
file.write(response.read())  # Tried response.readlines(), but that did not help me
# Tried: response.seek(), but that did not work
print(response.read())  # Actually, I want something done here... e.g. construct a parser:
                        # BeautifulSoup(response).
                        # Anyway this is an empty result
file.close()
How can I fix it?
Thank you very much!
You cannot read the response twice, but you can easily reuse the saved content:
content = response.read()
file.write(content)
print(content)
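A slightly fuller sketch of that pattern in the context of the question's code (the BeautifulSoup step is illustrative, since the question only mentions it as the intended follow-up operation):
import urllib.request

from bs4 import BeautifulSoup  # only needed for the follow-up parsing step

response = urllib.request.urlopen("http://finance.yahoo.com")
content = response.read()                     # bytes; read exactly once

with open("myTestFile.html", "wb") as f:      # "wb" because content is bytes
    f.write(content)

soup = BeautifulSoup(content, "html.parser")  # reuse the same bytes
print(soup.title)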
