Check if many URLs exists from a txt file (Python) [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance.
Closed 9 years ago.
I'm new to Python and want to do what the title says (check whether many URLs listed in a txt file exist), but I don't have any ideas. How can I do this?

From the code in your comment (you should put this in your question), it is with reading the lines from a file that you're struggling.
The idiomatic way of doing this is like so:
with open("hello.txt") as f:
    for line in f:
        print line,
See File Objects in the official Python documentation.
Plugging this into your code (and removing the newline and any spaces from each line with str.strip()):
#!/usr/bin/env python
import mechanize

br = mechanize.Browser()
br.set_handle_redirect(False)

with open('urls.txt') as urls:
    for url in urls:
        stripped = url.strip()
        print '[{}]: '.format(stripped),
        try:
            br.open_novisit(stripped)
            print 'Funfando!'
        except Exception, e:
            print e
Note that URLs must start with a scheme name (commonly called a protocol, such as http), followed by a colon and two slashes. Hence:
[stackoverflow.com]: can't fetch relative reference: not viewing any document
But
[http://stackoverflow.com/]: Funfando!

Open the file. Iterate through the lines. Fetch the files and check for errors.
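The scheme point above can be handled automatically. Here is a small Python 3 sketch (the helper name is my own, not part of mechanize) that strips a line and prepends a default http:// scheme when none is present:

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Strip surrounding whitespace and prepend http:// when no scheme is present."""
    url = url.strip()
    if not urlparse(url).scheme:
        url = 'http://' + url
    return url
```

Feeding each line through a helper like this before calling open_novisit avoids the "can't fetch relative reference" error for bare hostnames.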

Related

Escape characters when joining strings [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
I'm trying to read the content of a .csv file from a folder. Example code:
files_in_folder = os.listdir(r"\\folder1\folder2")
filename_list = []
for filename in files_in_folder:
    if "sometext" in filename:
        filename_list.append(filename)
read_this_file = "\\folder1\folder2" + max(filename_list)
data = pandas.read_csv(read_this_file, sep=',')
Fetching the max filename works, but the Data variable fails:
FileNotFoundError: no such file or directory.
I am able to access the folder, as my first line of code shows, but when I combine the two strings, putting the r in front doesn't work. Any ideas?
You need to add a \\ separator to your path when concatenating:
read_this_file = '\\folder1\\folder2\\' + max(filename_list)
But a better way to avoid that problem is to use
os.path.join("\\folder1\\folder2", max(filename_list))
for a working code, use this:
files_in_folder = os.listdir("folder1/folder2/")
filename_list = []
for filename in files_in_folder:
    if "sometext" in filename:
        filename_list.append(filename)
read_this_file = "folder1/folder2/" + max(filename_list)
data = pd.read_csv(read_this_file, sep=',')
Explanation:
When you put r before a string, the character following a backslash is included in the string without change, and all backslashes are left in the string.
In your example, if you print "\folder1\folder2" without the r prefix, Python reads the '\f' part as an escape sequence (a form feed, just as '\n' is a newline).
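The escape behaviour and the join fix can be seen together in a short sketch (the share and file names are hypothetical):

```python
import os

# '\f' in a normal string is a single form-feed character;
# in a raw string the backslash survives as a literal character
assert len("\f") == 1   # one control character
assert len(r"\f") == 2  # backslash + 'f'

# os.path.join inserts the separator for you, so no trailing
# backslash needs to be remembered when concatenating
path = os.path.join(r"\\folder1\folder2", "report_sometext.csv")
```

This is why os.path.join is the safer habit: the separator bookkeeping disappears entirely.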

How to Empty Lines? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 8 years ago.
I ran into a problem while trying to put lines from a .txt file into a list. I know you get trailing newline characters when you do this, so I used line.split() to remove them.
When I did this, the words I was trying to read became weirdly formatted. Each one looked like this:
['word']
Does anyone know how to remove the trailing newlines without this happening?
Just read all the lines with the readlines() function; you can then get the last n lines with negative slicing: lines[-n:]. As noted in the comments, it's even simpler to call the built-in list() on the open file handle:
with open('test_file.txt', 'r') as f:
    lines = list(f)
or
with open('test_file.txt', 'r') as f:
    lines = f.readlines()
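The ['word'] output in the question comes from calling str.split() on each line, which returns a list of words. To drop only the trailing newline, use rstrip('\n') instead; a minimal sketch (the helper name is my own):

```python
import io

def clean_lines(f):
    """Return the lines of a file-like object with trailing newlines removed.

    rstrip('\\n') strips only the newline, unlike split(), which
    breaks each line into a list of words.
    """
    return [line.rstrip('\n') for line in f]

# works on any iterable of lines: an open file, or an in-memory buffer
sample = io.StringIO("word\nanother word\n")
```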

How to cut link in python? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
I have the following link:
http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg
How to take just this one part of the link:
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
and remove everything else? I also want to keep the extension.
I want to remove this part:
._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_
and keep this part:
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
How can I do this in python?
You could use:
re.sub(r'\._[\w.,-]*(\.(?:jpg|png|gif))$', r'\1', inputurl)
This makes some assumptions but works on your input. The search starts at the ._ sequence, takes anything after that that is a letter, digit, dash, underscore, dot or comma, then matches the extension. I picked an explicit small group of possible extensions; you could also use (\.\w+)$ at the end instead to widen the acceptable extensions to any word characters.
Demo:
>>> import re
>>> inputurl = 'http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg'
>>> re.sub(r'\._[\w.,-]*(\.(?:jpg|png|gif))$', r'\1', inputurl)
'http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg'
url = "http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg"
l = url.split(".")
print(".".join(l[:-2]) + ".{}".format(l[-1]))
prints
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
The following should work:
import re
url = "http://ecx.images-amazon.com/images/I/51JXXb2vpDL._SY344_PJlook-inside-v2,TopRight,1,0_SH20_BO1,204,203,200_.jpg"
print re.sub(r"(https?://.+?)\._.+(\.\w+)", r'\1\2', url)
The above code prints
http://ecx.images-amazon.com/images/I/51JXXb2vpDL.jpg
An important detail: more example links would be needed to find the correct pattern. I'm currently assuming you want everything up to the first ._
url = re.sub(r"(/[^./]+)\.[^/]*?(\.[^.]+)$", r"\1\2", url)
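The answers above all use regular expressions. Under the same assumption (the junk starts at the first ._ in the filename), a plain string-based sketch also works; the function name here is my own:

```python
import posixpath

def strip_modifiers(url):
    """Cut the URL at the first '._' and re-attach the original extension.

    Assumes the modifier block starts at the first '._' in the
    filename, the same assumption as the regex answers above.
    """
    head, sep, _ = url.partition('._')
    if not sep:
        return url  # no modifier block found; leave the URL unchanged
    # posixpath.splitext splits on the last dot of the final path
    # component, giving the original extension, e.g. '.jpg'
    return head + posixpath.splitext(url)[1]
```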

Remove certain type of tag from HTML (without string operations) with Python and BeautifulSoup [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I have a long html file, and it has a bunch of "span" elements throughout (i.e. they begin with <span and end with /span>). Is there a way with BeautifulSoup to eliminate all of those span elements and end up with the remaining html?
My alternative would be to use a series of complex string operations to scrub the text before sending it through BeautifulSoup but I would love to avoid that if possible.
EDIT:
I attempted the decompose() function like this:
soup = BeautifulSoup(myhtml)
soup.span.decompose()
print soup.prettify()
and all the parts are still in there. It doesn't seem to have altered the html at all.
It is rather easy...
from bs4 import BeautifulSoup

page = "<span>Hello world</span><h1>Nice to see you</h1><span>no</span><span>Hello babe</span>"
soup = BeautifulSoup(page)
while len(soup.find_all('span')) > 0:
    soup.span.extract()
print soup
I think the method you're looking for is unwrap():
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap
There are many ways to implement this, so how you execute it depends on your particular situation, but unwrap() seems like it should work.
I have done some pretty extensive work with this, check out my post here:
http://bobbyrussell.io/posting-to-wordpress-programatically/
Hope that helps!
"a series of complex string operations"
Actually, with regular expressions it is quite straightforward:
import re
re.sub(r'<span.*?</span>', '', html_txt, flags=re.DOTALL)
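Note that the regex above removes the spans together with their text. If you instead want the unwrap() behaviour mentioned earlier (drop only the tags, keep the contents), a similar sketch works, with the usual caveat that regex on HTML is fragile; the function name is my own:

```python
import re

def unwrap_spans(html):
    """Remove <span ...> and </span> tags but keep the text between them.

    Like all regex-on-HTML tricks, this breaks on unusual markup
    (e.g. a '>' inside an attribute value); for robust handling,
    stick with BeautifulSoup's unwrap().
    """
    return re.sub(r'</?span[^>]*>', '', html)
```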

Combine variable with variable and text in Python [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
I'm wondering what I am doing wrong here.
The issue is this line: final = 'PAT_' SID '.txt'
where SID is a variable.
Can anybody have a quick look? I am sure I am doing something stupid.
Below is the complete code...
#!/usr/bin/env python
import os

global SID
global final

with open('sampleID.txt', 'r') as inF:
    for line in inF:
        if 'Sample ID:' in line:
            SID = line.split(':')[1]

final = 'PAT_' SID '.txt'
os.rename("sampleID.txt", final)
To concatenate variables, you need to add (+) them:
final = 'PAT_' + SID + '.txt'
You can also use the built-in function str.format() here:
final = 'PAT_{}.txt'.format(SID)
Or even the old way of string formatting, which still works in Python 3 (though str.format is generally preferred):
final = 'PAT_%s.txt' % SID
By the way, your global statements aren't needed: a with statement does not introduce a new scope, so at module level everything assigned inside it is already a global variable.
Use + to concatenate strings in Python.
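In Python 3, an f-string is the usual way to write this today. One detail worth noting from the original code: SID as extracted by line.split(':')[1] still carries its trailing newline, so a .strip() is worth adding (the sample value below is hypothetical):

```python
sid = " 12345\n"  # hypothetical value, as line.split(':')[1] would return it
final = f"PAT_{sid.strip()}.txt"  # .strip() drops the newline and spaces
```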
