Match only URLs from a text file - Python

My text file consists of:
http://www.makemytrip.com/
http://www.makemytrip.com/blog/dil-toh-roaming-hai?intid=Blog_HPHeader_Logo // how do I remove /dil-toh-roaming-hai?intid=Blog_HPHeader_Logo
http://www.makemytrip.com/rewards/?intid=New_ch_mtr_na
javascript:void(0) // how do I remove this
javascript:void(0)
javascript:void(0)
http://www.makemytrip.com/rewards/?intid=new_ch_mtr_dropdwn
https://support.makemytrip.com/MyAccount/MyTripReward/DashBoard
https://support.makemytrip.com/MyAccount/User/User
https://support.makemytrip.com/MyAccount/MyBookings/BookingSummary/
https://support.makemytrip.com/customersupports.aspx?actiontype=PRINTETICKET
How do I go about keeping only the URLs and saving them to another file, so that I can parse them one at a time? I tried this Python code, but it matches and opens only the first URL.
import urllib

with open("s.txt", "r") as file:
    for line in file:
        url = urllib.urlopen(line)
        read = url.read()
        print read
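One way to approach it: filter the file down to lines that actually look like URLs first, then fetch them one at a time. A minimal sketch (Python 3, so urllib.request replaces the Python 2 urllib above; the output filename urls_only.txt is my own choice):

from urllib.request import urlopen

# Keep only the lines that are real URLs, skipping entries
# such as javascript:void(0).
with open("s.txt") as infile, open("urls_only.txt", "w") as outfile:
    for line in infile:
        line = line.strip()
        if line.startswith(("http://", "https://")):
            outfile.write(line + "\n")

# The filtered file can now be parsed one URL at a time.
with open("urls_only.txt") as urls:
    for url in urls:
        page = urlopen(url.strip()).read()
        print(page[:100])  # e.g. peek at the first 100 bytes of each page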

Related

Search for a word in a webpage and save to TXT in Python

I am trying to: load links from a .txt file, search for a specific word, and if the word exists on that webpage, save the link to another .txt file. But I am getting the error: No scheme supplied. Perhaps you meant http://<_io.TextIOWrapper name='import.txt' mode='r' encoding='cp1250'>?
Note: the links have https://
The code:
import requests

list_of_pages = open('import.txt', 'r+')
save = open('output.txt', 'a+')
word = "Word"
save.truncate(0)

for page_link in list_of_pages:
    res = requests.get(list_of_pages)
    if word in res.text:
        response = requests.request("POST", url)
        save.write(str(response) + "\n")
Can anyone explain why? Thank you in advance!
Try putting http:// in front of the links.
When you use res = requests.get(list_of_pages), you're opening an HTTP connection to list_of_pages. But requests.get takes a URL string as its parameter (e.g. http://localhost:8080/static/image01.jpg), and look at what list_of_pages is: it's an already opened file, not a string. You have to use either the requests library or the file I/O API for it, not both at once.
If you have an already opened local file, you don't need to create an HTTP request at all. You don't need this requests.get(). Parse list_of_pages like a normal, local file.
Or, if you would like to go the other way, don't open this text file into list_of_pages; make it a string holding the URL of that file.
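For completeness, a minimal corrected sketch of that loop, assuming every line in import.txt already carries its https:// scheme as noted: each line is stripped and passed to requests.get as a string, and the matching link itself (not a POST response object) is what gets written out.

import requests

word = "Word"

with open('import.txt', 'r') as list_of_pages, open('output.txt', 'w') as save:
    for page_link in list_of_pages:
        page_link = page_link.strip()  # drop the trailing newline
        if not page_link:
            continue
        res = requests.get(page_link)  # a URL string, not the file object
        if word in res.text:
            save.write(page_link + "\n")  # save the link itself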

Grab a URL from a column and paste in Chrome

I have an Excel file with a column filled with 4000+ URLs, each one in a different cell. I need to use Python to open each one with Chrome, scrape some data from the website, and paste it back into Excel.
And then do the same steps for the next URL. Could you please help me with that?
Export the Excel file to a csv file and read the data from it as:
def data_collector(url):
    # do your code here and return the data that you want to write in place of the url
    return url

with open("myfile.csv") as fobj:
    content = fobj.read()

# the line below will return the urls in the form of a list
urls = content.replace(",", " ").split()

for url in urls:
    data_to_be_write = data_collector(url)
    # added extra quotes to prevent csv from breaking; it is prescribed
    # to use the csv module to write csv files, but for ease of understanding
    # I did it like this, hoping you will correct it yourself
    content = "\"" + content.replace(url, data_to_be_write) + "\""

with open("new_file.csv", "wt") as fnew:
    fnew.write(content)
After running this code you will get new_file.csv; opening it with Excel, you will get your desired data in place of each URL.
If you want the URL together with its data, just append the data to the URL in the string, separated by a colon.
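As the comments admit, the csv module is the prescribed way to write a csv file. A minimal sketch of the same idea using it (myfile.csv, new_file.csv, and the URL sitting in the first column are assumptions):

import csv

def data_collector(url):
    # placeholder: scrape the page here and return the data
    # that should replace the url
    return url

with open("myfile.csv", newline="") as fobj, open("new_file.csv", "w", newline="") as fnew:
    reader = csv.reader(fobj)
    writer = csv.writer(fnew)  # the writer handles quoting for you
    for row in reader:
        if not row:
            continue
        writer.writerow([data_collector(row[0])])  # url assumed in column one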

Can't write into the file in Python and startswith not working

I have a problem. I have a task "download by ID".
This is my previous program, which downloads text (a PDB file):
from urllib.request import urlopen

def download(inf):
    url = xxxxxxxxxxx
    response = urlopen(xxx)
    text = response.read().decode('utf-8')
    return data

new_download = download('154')
It works perfectly, but the function that I must create doesn't write to the file the lines which start with num:
from urllib.request import urlopen  # module for URL processing

with open('new_test', 'w') as a:
    for sent in text:  # for every line in sequences file
        if line.startswith('num'):
            line1.writeline(sent)
You're not iterating over the lines, you're iterating over the characters. Change
for line in data2:
to
for line in data2.splitlines():
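Putting both pieces together, a minimal working sketch of the intended function. The download URL pattern below is hypothetical (substitute your real "download by ID" endpoint); the 'num' prefix and the new_test filename follow the question:

from urllib.request import urlopen

def download(inf):
    # hypothetical endpoint; replace with the real "download by ID" URL
    url = "https://files.rcsb.org/download/%s.pdb" % inf
    response = urlopen(url)
    return response.read().decode('utf-8')

text = download('154')

with open('new_test', 'w') as out:
    for line in text.splitlines():  # iterate over lines, not characters
        if line.startswith('num'):
            out.write(line + '\n')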

Check if a URL path exists, from a text file

I am trying to do the following things:
1 - open a text file containing a list of URLs (http://example.com).
2 - read the text file and check if the path exists.
3 - write the results back to another text file.
I have tried the following code:
import urllib2

file = open('file.txt', 'r')
search = urllib2.urlopen(file + "/js/tools.js")
if search.code == 200:
    print "Exists!"
I appreciate any help provided.
Considering you have a file filename where the links are stored line by line:
import requests

with open(filename, 'r') as file, open(filename2, 'w+') as file2:
    for url in file.readlines():
        check = requests.get(url)
        if check.ok:
            file2.write(url)
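Since the question actually wants to test a path under each base URL (the original attempt appends /js/tools.js), a minimal variant of the same loop might look like this (results.txt and the "Missing" label are my own choices):

import requests

with open('file.txt') as infile, open('results.txt', 'w') as outfile:
    for line in infile:
        base = line.strip().rstrip('/')  # e.g. http://example.com
        if not base:
            continue
        check = requests.get(base + "/js/tools.js")
        outfile.write(base + (" Exists!\n" if check.ok else " Missing\n"))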

Writing data to csv or text file using python

I am trying to write some data to a csv file by checking a condition, as below.
I will have a list of URLs in a text file, as below:
urls.txt
www.example.com/3gusb_form.aspx?cid=mum
www.example_second.com/postpaid_mum.aspx?cid=mum
www.example_second.com/feedback.aspx?cid=mum
Now I will go through each URL from the text file, read the content of the URL using Python's urllib2 module, and search for a string in the entire HTML page. If the required string is found, I will write that URL to a csv file.
But when I try to write the data (URL) to the csv file, it saves each character into its own column, as below, instead of saving the entire URL (data) into one column:
h t t p s : / / w w w......
Code.py
import urllib2
import csv

search_string = 'Listen Capcha'
html_urls = open('/path/to/input/file/urls.txt', 'r').readlines()
outputcsv = csv.writer(open('output/path' + 'urls_contaning _%s.csv' % search_string, "wb"), delimiter=',', quoting=csv.QUOTE_MINIMAL)
outputcsv.writerow(['URL'])

for url in html_urls:
    url = url.replace('\n', '').strip()
    if not len(url) == 0:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        if str(search_string) in response.read():
            outputcsv.writerow(url)
So what's wrong with the above code, and what needs to be done in order to save the entire URL (string) into one column of the csv file?
Also, how can we write the data to a text file in the same way?
Edited
Also, I had a URL like http://www.vodafone.in/Pages/tuesdayoffers_che.aspx;
this URL is redirected to http://www.vodafone.in/pages/home_che.aspx?cid=che in the browser, but when I tried it through the code below, the result is practically the same as the given URL:
import urllib2, httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request("http://www.vodafone.in/Pages/tuesdayoffers_che.aspx")
opener = urllib2.build_opener()
f = opener.open(request)
print f.geturl()
Result
http://www.vodafone.in/pages/tuesdayoffers_che.aspx?cid=che
So finally, how do I catch the redirected URL with urllib2 and fetch the data from it?
Change the last line to:
outputcsv.writerow([url])
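Tying it together, a minimal sketch of the corrected loop, ported to Python 3's urllib.request (the original uses Python 2's urllib2; the output filename is my own choice): writerow gets a list so the whole URL lands in one column, and response.geturl() records the final URL after any redirects, which covers the edited part of the question.

import csv
from urllib.request import urlopen

search_string = 'Listen Capcha'

with open('urls.txt') as infile, open('urls_containing_%s.csv' % search_string, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['URL'])
    for url in infile:
        url = url.strip()
        if not url:
            continue
        response = urlopen(url)  # urlopen follows redirects automatically
        if search_string in response.read().decode('utf-8', 'ignore'):
            # a list, so the URL goes into a single column;
            # geturl() is the final URL after any redirects
            writer.writerow([response.geturl()])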
