Writing data to csv or text file using python

I am trying to write some data to a CSV file after checking a condition, as described below.
I have a list of URLs in a text file:
urls.txt
www.example.com/3gusb_form.aspx?cid=mum
www.example_second.com/postpaid_mum.aspx?cid=mum
www.example_second.com/feedback.aspx?cid=mum
Now I go through each URL from the text file, read the content of the URL using the urllib2 module, and search for a string in the HTML page. If the required string is found, I write that URL to a CSV file.
But when I write the URL to the CSV file, each character is saved into its own column, as below, instead of the entire URL going into one column:
h t t p s : / / w w w......
Code.py
import urllib2
import csv

search_string = 'Listen Capcha'
html_urls = open('/path/to/input/file/urls.txt', 'r').readlines()
outputcsv = csv.writer(open('output/path' + 'urls_contaning _%s.csv' % search_string, "wb"), delimiter=',', quoting=csv.QUOTE_MINIMAL)
outputcsv.writerow(['URL'])

for url in html_urls:
    url = url.replace('\n', '').strip()
    if not len(url) == 0:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        if str(search_string) in response.read():
            outputcsv.writerow(url)
So what's wrong with the above code? What needs to be done to save the entire URL (string) into one column of the CSV file?
Also, how can we write the same data to a text file?
Edited
Also, I have a URL like http://www.vodafone.in/Pages/tuesdayoffers_che.aspx,
which actually redirects to http://www.vodafone.in/pages/home_che.aspx?cid=che in the browser, but when I tried it through the code below, the result is the same as the given URL:
import urllib2, httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request("http://www.vodafone.in/Pages/tuesdayoffers_che.aspx")
opener = urllib2.build_opener()
f = opener.open(request)
print f.geturl()
Result
http://www.vodafone.in/pages/tuesdayoffers_che.aspx?cid=che
So, finally, how can I catch the redirected URL with urllib2 and fetch the data from it?

Change the last line to:
outputcsv.writerow([url])
writerow() expects a sequence of column values, and a string is itself a sequence, so passing a bare string writes each character into its own column.
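For the text-file part of the question, you don't need the csv module at all; just write each URL followed by a newline. For the redirect part, urllib2 follows plain HTTP redirects automatically, and response.geturl() then reports the final URL (redirects done via JavaScript or a meta refresh tag are not followed, which can explain why the browser ends up somewhere urllib2 does not). A minimal sketch combining both, reusing the input path and search string from the question (matching_urls.txt is a name I made up):

import urllib2

search_string = 'Listen Capcha'
with open('/path/to/input/file/urls.txt') as infile, \
        open('matching_urls.txt', 'w') as outfile:
    for url in infile:
        url = url.strip()
        if not url:
            continue  # skip blank lines; URLs must include the http:// scheme
        response = urllib2.urlopen(urllib2.Request(url))
        final_url = response.geturl()  # the URL after any HTTP redirects
        if search_string in response.read():
            outfile.write(final_url + '\n')  # one URL per line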

Related

send post request from .txt file

I'm new to Python and looking for some help :)
I created a simple script which checks IP reputation (from lists.txt) on IPVoid:
import requests
import re
URL = "https://www.ipvoid.com/ip-blacklist-check/"
ip = open('lists.txt')
DATA = {"ip":ip}
r = requests.post(url = URL, data = {"ip":ip})
text = r.text
bad_ones= re.findall(r'<i class="fa fa-minus-circle text-danger" aria-hidden="true"></i> (.+?)</td>', text)
print(bad_ones)
The lists.txt contain list of IPs:
8.8.8.8
4.4.4.4
etc..
However, the script only processes the first line of the file; I would like to do "bulk" checking.
Please advise :)
It is not clear whether the IP addresses in the txt file are organized line by line, but I assume that this is the case.
You can do something like the following:
import requests
import re

URL = "https://www.ipvoid.com/ip-blacklist-check/"
bad_ones = []
with open('lists.txt') as f:
    for ip in f.readlines():
        r = requests.post(url=URL, data={"ip": ip.strip()})
        text = r.text
        bad_ones.append(re.findall(r'<i class="fa fa-minus-circle text-danger" aria-hidden="true"></i> (.+?)</td>', text))
print(bad_ones)
The with open('lists.txt') as f statement opens the file and binds the resulting io object to the name f; when the end of the with block is reached, the file is closed without explicitly calling f.close().
Batch mode is then a simple loop over each line of the text file, with a call to strip() on each IP string (each line of the file) to remove the trailing newline character.
I am not even sure your program above works: the ip variable in it is an io object, not a string.
What you need is a for loop that sends a request for each and every IP.
You can do bulk checking if the API accepts it:
import requests
import re

URL = "https://www.ipvoid.com/ip-blacklist-check/"
ips = open('lists.txt')
for ip in ips.readlines():
    DATA = {"ip": ip}
    r = requests.post(url=URL, data=DATA)
    text = r.text
    '''Your processing goes here..'''
Also, explore using the with clause for opening and closing files.
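For example, a minimal restructuring of the loop above with with (same URL and processing placeholder as before):

with open('lists.txt') as ips:  # the file is closed automatically at the end of the block
    for ip in ips:
        r = requests.post(url=URL, data={"ip": ip.strip()})
        text = r.text
        '''Your processing goes here..'''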

Python download image from short url by keeping its own name

I would like to download an image file from a shortened URL, or a generated URL that doesn't contain the file name.
I have tried to use the Content-Disposition header. However, my file name is not ASCII, so it can't print the name.
I have found out I can use urlretrieve or requests to download the file, but then I need to save it under a different name.
I want to download it while keeping its own name.
How can I do this?
matches = re.match(expression, message, re.I)
url = matches[0]
print(url)
original = urlopen(url)
remotefile = urlopen(url)
#blah = remotefile.info()['content-Disposition']
#print(blah)
#value, params = cgi.parse_header(blah)
#filename = params["filename*"]
#print(filename)
#print(original.url)
#filename = "filedown.jpg"
#urlretrieve(url, filename)
These are the things I have tried, but none of them work.
I was able to get this to work with the requests library, because you can use it to get the URL that the shortened URL redirects to. Then I applied your code to the redirected URL and it worked. There might be a way to do this using only urllib (I assume that's what you are using), but I don't know it.
import requests
from urllib.request import urlopen
import cgi

def getFilenameFromURL(url):
    req = requests.request("GET", url)
    # req.url is now the url the shortened url redirects to
    original = urlopen(req.url)
    value, params = cgi.parse_header(original.info()['Content-Disposition'])
    filename = params["filename*"]
    print(filename)
    return filename

getFilenameFromURL("https://shorturl.at/lKOY3")
You can then use urlretrieve with this. It's inefficient, but it works. Also, since you can get the actual URL with the requests library, you can probably get the filename through there as well.
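To finish the job, a minimal sketch of the download step built on the helper above (downloadKeepingName is a name I made up; it requests the URL once more, which is the inefficiency mentioned):

import requests
from urllib.request import urlretrieve

def downloadKeepingName(url):
    req = requests.request("GET", url)  # resolve the shortened URL
    filename = getFilenameFromURL(url)  # helper from the answer above
    urlretrieve(req.url, filename)      # save under the server-supplied name
    return filename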

Loading multiple JSON files

So I am trying to load multiple JSON files with Python HTTP requests, but I can't figure out how to do it correctly.
Loading one JSON file with Python is pretty simple:
response = requests.get(url)
te = response.content.decode()
da = json.loads(te[te.find("{"):te.rfind("}")+1])
But how can I load multiple JSON files?
I have a list of URLs and I tried to request every URL with a loop and then load every line of the result, but it seems this does not work.
This is the code I am using:
t = []
for url in urls:
    re = requests.get(url)
    te = re.content.decode()
    daten = json.loads(te[te.find("{"):te.rfind("}")+1])
    t.append(daten)
But I am getting this error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I am pretty new to JSON, but I do understand that I can't read it line by line in a loop, because that destroys the JSON structure(?).
So how can I read multiple JSON files?
EDIT: Found the error.
Some links do not return valid JSON.
With the requests library, if the endpoint you are requesting returns a well-formed JSON response, all you need to do is call the .json() method on the response object:
t = []
for url in urls:
    re = requests.get(url)
    t.append(re.json())
Then, if you want to handle bad responses, wrap the code above in a try: ... except block.
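A minimal sketch of that error handling, assuming you just want to skip bad responses (.json() raises a JSONDecodeError, which is a subclass of ValueError):

import requests

t = []
for url in urls:
    r = requests.get(url)
    try:
        t.append(r.json())
    except ValueError:  # the response body was not valid JSON
        print("Skipping %s: not valid JSON" % url)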
Even assuming every site returns correct JSON, you still haven't constructed the resulting JSON correctly.
You might write something like the following, which joins the individual JSON objects with commas and wraps them in a single {"data": [...]} document:
t = []
for url in urls:
    t.append(requests.get(url).content.decode('utf-8'))
result = json.loads('{{"data": [{}]}}'.format(','.join(t)))

How to save a file with a URL name?

I'm saving an HTTP request as an HTML page.
How can I save the HTML file under the name of the URL?
I'm using Linux OS.
So the file name would look like this: "http://www.test.com.html"
My code:
url = "http://www.test.com"
page = urllib.urlopen(url).read()
f = open("./file.html", "w")
f.write(page)
f.close()
Unfortunately you cannot save a file under a full URL name: the character "/" is the path separator on Linux, so it cannot appear in a file name (it is not allowed in Windows file names either).
However, you can create a file with the name www.test.com.html with the following line:
file_name = url.split('/')[2] + '.html'
If you need to handle anything like https://www.test.com/posts/11111111, you can replace / with another character that does not usually occur in URLs, such as __:
url = 'https://www.test.com/posts/11111111'
file_name = '__'.join(url.split('/')[2:])
This will result in:
www.test.com__posts__11111111
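Putting the pieces together, a minimal sketch of the full save step (Python 2, matching the code in the question; appending .html is my addition):

import urllib

url = 'https://www.test.com/posts/11111111'
file_name = '__'.join(url.split('/')[2:]) + '.html'  # www.test.com__posts__11111111.html
page = urllib.urlopen(url).read()
with open(file_name, 'w') as f:  # the name contains no "/", so it is a valid file name
    f.write(page)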

Hard to match only URLs from a text file

My text file consists of:
http://www.makemytrip.com/
http://www.makemytrip.com/blog/dil-toh-roaming-hai?intid=Blog_HPHeader_Logo //how do i remove /dil-toh-roaming-hai?intid=Blog_HPHeader_Logo
http://www.makemytrip.com/rewards/?intid=New_ch_mtr_na
javascript:void(0) //how do i remove this
javascript:void(0)
javascript:void(0)
http://www.makemytrip.com/rewards/?intid=new_ch_mtr_dropdwn
https://support.makemytrip.com/MyAccount/MyTripReward/DashBoard
https://support.makemytrip.com/MyAccount/User/User
https://support.makemytrip.com/MyAccount/MyBookings/BookingSummary/
https://support.makemytrip.com/customersupports.aspx?actiontype=PRINTETICKET
How do I go about keeping only the URLs and saving them in another file, so that I can parse them one at a time? I tried this Python code, but it matches and opens only the first URL.
import urllib

with open("s.txt", "r") as file:
    for line in file:
        url = urllib.urlopen(line)
        read = url.read()
        print read
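A minimal sketch of the filtering step, assuming a line counts as a URL when it starts with http:// or https:// (this drops the javascript:void(0) lines; clean_urls.txt is a name I made up). The urlsplit() call keeps only the scheme and host, per the inline comments in the question; drop it if you want the URLs kept unchanged:

import urlparse  # Python 2 name; use urllib.parse in Python 3

with open("s.txt") as infile, open("clean_urls.txt", "w") as outfile:
    for line in infile:
        fields = line.split()
        if not fields or not fields[0].startswith(("http://", "https://")):
            continue  # skips javascript:void(0), comments and blank lines
        parts = urlparse.urlsplit(fields[0])
        outfile.write("%s://%s/\n" % (parts.scheme, parts.netloc))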
