I have a script which checks each domain listed in a text file and finds the email addresses on it. I want to add multiple domain names (one per line) so that the script takes each domain, runs the function, and then moves on to the next line after finishing. I tried googling for a specific solution but I'm not sure how to find the appropriate answer.
import re
import urllib.request
from fake_useragent import UserAgent  # assuming `ua` comes from fake_useragent

ua = UserAgent()

def extractUrl(url):
    try:
        print("Searching emails... please wait")
        count = 0
        listUrl = []
        req = urllib.request.Request(
            url,
            data=None,
            headers={
                'User-Agent': ua.random
            })
        try:
            conn = urllib.request.urlopen(req, timeout=10)
            status = conn.getcode()
            contentType = conn.info().get_content_type()
            html = conn.read().decode('utf-8')
            emails = re.findall(
                r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}', html)
            for email in emails:
                if email not in listUrl:
                    count += 1
                    print(str(count) + " - " + email)
                    listUrl.append(email)
            print(str(count) + " emails were found")
        except Exception as e:  # the original snippet was cut off here
            print("request failed: {}".format(e))
    except Exception as e:
        print("error: {}".format(e))

f = open("demo.txt", "r")
url = f.readline()  # reads only the first line - hence the question
extractUrl(url)
Python files are iterable, so it's basically as simple as:
for line in f:
    extractUrl(line)
But you may want to do it right (ensure the file is closed whatever happens, ignore possible empty lines, etc.):
# use `with open(...)` to ensure the file will be correctly closed
with open("demo.txt", "r") as f:
    # use `enumerate` to get line numbers too -
    # we might need them for information
    for lineno, line in enumerate(f, 1):
        # make sure the line is clean (no leading / trailing whitespace)
        # and not empty:
        line = line.strip()
        # skip empty lines
        if not line:
            continue
        # ok, this one _should_ match - but something could go wrong
        try:
            extractUrl(line)
        except Exception as e:
            # mentioning the line number in the error report might help debugging
            print("oops, failed to get urls for line {} ('{}') : {}".format(lineno, line, e))
I am trying to extract IPv4 addresses from a text file and save them as a list to a new file. However, I cannot use regex to parse the file; instead, I have to check the characters individually. I'm not really sure where to start with that: everything I find seems to have import re as the first line.
So far this is what I have:
# Opens and prints wireShark txt file
fileObject = open("wireShark.txt", "r")
data = fileObject.read()
print(data)

# Save IP addresses to new file
with open('wireShark.txt') as fin, open('IPAdressess.txt', 'wt') as fout:
    list(fout.write(line) for line in fin if line.rstrip())

# Opens and prints IPAdressess txt file
fileObject = open("IPAdressess.txt", "r")
data = fileObject.read()
print(data)

# Close files (redundant here: the `with` statement above already closed them)
fin.close()
fout.close()
So I open the file, and I have created the file that I will put the extracted IPs in. I just don't know how to pull them without using regex.
Thanks for the help.
Here is a possible solution.
The function find_first_digit positions the index at the next digit in the text, if any, and returns True; otherwise it returns False.
The functions get_num and get_dot read a number/dot, leave the index at the position just after the number/dot, and return the number/dot as a str. If one of these functions fails to get the number/dot, it raises a MissMatch exception.
The main loop finds a digit, saves the index, and then tries to get an IP.
On success, the IP is written to the output file.
If any of the called functions raises a MissMatch exception, the current index is set to the saved index plus one and the search starts over.
class MissMatch(Exception): pass

INPUT_FILE_NAME = 'text'
OUTPUT_FILE_NAME = 'ip_list'

def find_first_digit():
    while True:
        c = input_file.read(1)
        if not c:  # EOF found!
            return False
        elif c.isdigit():
            input_file.seek(input_file.tell() - 1)
            return True

def get_num():
    num = input_file.read(1)  # 1st digit
    if not num.isdigit():
        raise MissMatch
    if num != '0':
        for i in range(2):  # 2nd, 3rd digits
            c = input_file.read(1)
            if c.isdigit():
                num += c
            else:
                input_file.seek(input_file.tell() - 1)
                break
    return num

def get_dot():
    if input_file.read(1) == '.':
        return '.'
    else:
        raise MissMatch

with open(INPUT_FILE_NAME) as input_file, open(OUTPUT_FILE_NAME, 'w') as output_file:
    while True:
        ip = ''
        if not find_first_digit():
            break
        saved_position = input_file.tell()
        try:
            ip = get_num() + get_dot() \
                 + get_num() + get_dot() \
                 + get_num() + get_dot() \
                 + get_num()
        except MissMatch:
            input_file.seek(saved_position + 1)
        else:
            output_file.write(ip + '\n')
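For instance, if the file text contained the single (made-up) line

src=192.168.0.1 dst=10.0.0.12 ttl=64

then ip_list would come out as

192.168.0.1
10.0.0.12

with the lone 64 rejected, because get_dot raises MissMatch right after it.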
input.txt -
I am Hungry
call the shopping mall
connected drive
I want to read input.txt line by line, send each line as a request to the server, and later save the respective response. How do I read and write the data line by line?
My code below works for just one input within input.txt (e.g. "I am Hungry"). Can you please help me make it work for multiple inputs?
Request:
fileInput = os.path.join(scriptPath, "input.txt")
if not os.path.exists(fileInput):
    print "error message"
    Error_Status = 1
    sys.exit(Error_Status)
else:
    content = open(fileInput, "r").read()
    if len(content):
        TEXT_TO_READ["tts_input"] = content
        TEXT_TO_READ = json.dumps(TEXT_TO_READ)
    else:
        print "error message 2"
request = Request()
Response:
res = h.getresponse()
data = """MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=--Nuance_NMSP_vutc5w1XobDdefsYG3wq
""" + res.read()
msg = email.message_from_string(data)

for index, part in enumerate(msg.walk(), start=1):
    content_type = part.get_content_type()
    payload = part.get_payload()
    if content_type == "audio/x-wav" and len(payload):
        # note: a '{}' placeholder is needed for .format(index) to have any effect
        with open('Sound_File_{}.pcm'.format(index), 'wb') as f_pcm:
            f_pcm.write(payload)
    elif content_type == "application/json":
        with open('TTS_Response_{}.txt'.format(index), 'w') as f_json:
            f_json.write(payload)
To keep it stupid simple, let's implement your own broad description of what should happen ("I want to read the input.txt line by line and send that as a request to the server and later save the response respectively"):
for line in readLineByLine('input.txt'):
    sendAsRequest(line)
    saveResponse()
From what I can gather from your question, you basically already have the functions sendAsRequest(line) and saveResponse() (maybe under other names), but you are missing the function readLineByLine('input.txt'). Here it is:
def readLineByLine(filename):
    # Use a with statement to correctly close the file once all lines are read.
    with open(filename, 'r') as f:
        # Iterate over the file handle directly to minimize memory use.
        for line in f:
            # Yield from a generator, stripping the trailing newline
            # as it is not part of the command.
            yield line.strip('\n')
Basically you can simply:
with open('filename') as f:
    for line in f.readlines():
        print line
The output will be:
I am Hungry
call the shopping mall
connected drive
Now for an explanation about the "with" statement you can read here:
http://effbot.org/zone/python-with-statement.htm
I was trying to implement the block of code from "Generator not working to split string by particular identifier. Python 2", but I found two bugs in it that I can't seem to fix.
Input:
#m120204
CTCT
+
~##!
#this_one_has_an_at_sign
CTCTCT
+
#jfik9
#thisoneisempty
+
#empty line after + and then empty line to end file (2 empty lines)
The two bugs are:
(i) when a '#' starts the first line after the '+' line, as in the 2nd entry (#this_one_has_an_at_sign)
(ii) when the line following the #identification_line, or the line following the '+' line, is empty, as in the 3rd entry (#thisoneisempty)
I would like the output to be the same as in the post that I referenced:
yield (name, body, extra)
in the case of #this_one_has_an_at_sign
name= this_one_has_an_at_sign
body= CTCTCT
quality= #jfik9
in the case of #thisoneisempty
name= thisoneisempty
body= ''
quality= ''
I tried using flags but I can't seem to fix this issue. I know how to do it without using a generator, but I'm going to be using big files, so I don't want to go down that path. My current code is:
def organize(input_file):
    name = None
    body = ''
    extra = ''
    for line in input_file:
        line = line.strip()
        if line.startswith('#'):
            if name:
                body, extra = body.split('+', 1)
                yield name, body, extra
                body = ''
            name = line
        else:
            body = body + line
    body, extra = body.split('+', 1)
    yield name, body, extra
for line in organize(file_path):
    print line
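Not an answer from the thread, but for readers following along, one possible way around both bugs is a small state machine that treats the single line right after '+' as the quality line, whatever it starts with, and that only treats '#' as a record header while waiting for one. A rough sketch under those assumptions:

def organize(input_file):
    # states: waiting for a '#name' header, collecting the body,
    # or expecting the single quality line that follows '+'
    name, body = None, ''
    state = 'header'
    for raw in input_file:
        line = raw.rstrip('\n')
        if state == 'header':
            if line.startswith('#'):
                name = line[1:]
                body = ''
                state = 'body'
        elif state == 'body':
            if line == '+':
                state = 'quality'
            else:
                body += line
        else:
            # the line right after '+' is the quality string,
            # even if it is empty or starts with '#'
            yield name, body, line
            state = 'header'

With the sample input above, this yields ('this_one_has_an_at_sign', 'CTCTCT', '#jfik9') for the second record and ('thisoneisempty', '', '') for the third. One caveat: a record whose '+' line is the very last line of the file (with no line after it) would never be yielded.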
I am using the code below to read some .htm files.
from bs4 import BeautifulSoup
import os

BASEDIR = "C:\\designers"
aa = os.listdir(BASEDIR)

text_file = open(os.path.join(BASEDIR, 'all htm.txt'), "w")
for b in aa:
    if b.endswith('.htm'):
        c = os.path.join(BASEDIR, b)
        text_file.write(c)
        text_file.write('\n')
text_file.close()

list_open = open(os.path.join(BASEDIR, 'all htm.txt'))
read_list = list_open.read()
line_in_list = read_list.split('\n')

for i, ef in enumerate(line_in_list):
    page = open(ef)
    soup = BeautifulSoup(page.read())
    print i
    print soup
However, it only reads the first file and then gives this error:
IOError: [Errno 22] invalid mode ('r') or filename: ''
What went wrong?
Thanks.
'kev' pointed out the problem: there are unwanted lines in the txt file.
There are many ways to remove empty lines from a txt file.
In addition to that, the last part can be changed to:
for i, ef in enumerate(line_in_list):
    if '.htm' in ef:  # or 'len(ef) > 1' etc...
        page = open(ef)
        soup = BeautifulSoup(page.read())
        print i
        print soup
Because you write \n at the end of every line when you create 'all htm.txt' (regardless of whether it is the last line), you end up with an empty line at the end of your file. You thus end up with an empty string at the end of line_in_list when you split on the newline character.
Instead, do enumerate(line_in_list[:-1]), which will ignore the final (empty) element.
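You can see that trailing empty string directly in the interpreter:

>>> 'line1\nline2\n'.split('\n')
['line1', 'line2', '']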
Alternatively, you could make your code more robust by putting a try/except block around each iteration of your loop and gracefully handling or ignoring exceptions as they occur. This will protect you against future problems in your code:
For example:
for i, ef in enumerate(line_in_list):
    try:
        page = open(ef)
        soup = BeautifulSoup(page.read())
        print i
        print soup
    except IOError:
        print 'ignoring file %s' % ef
    except Exception:
        print 'an unhandled exception occurred for file %s' % ef
It would be interesting to know in which line of the code the error occurs.
Be careful with the lines b read from aa: if they end with a newline \n, the if condition will never be true and you will produce an empty file all htm.txt.
Try

x = b.strip()
if x.endswith(".htm"):
    ....

This will cut any whitespace (anything like space, carriage return, tab, newline) at the beginning and end of b.
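For example (designer1.htm is just a made-up name):

>>> 'designer1.htm\n'.endswith('.htm')
False
>>> 'designer1.htm\n'.strip().endswith('.htm')
True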
I have a script to clean URLs to get base domains, from example.com/example1 and example.com/example2 down to example.com. My issue is that when it goes through the file of URLs, it produces duplicate base domains. I want to remove the duplicates while printing the URLs to a file. Below is the code I currently have.
from Tkinter import *
import tkFileDialog
import re

def main():
    fileOpen = Tk()
    fileOpen.withdraw()  # hiding tkinter window
    file_path = tkFileDialog.askopenfilename(
        title="Open file", filetypes=[("txt file", ".txt")])
    if file_path != "":
        print "you chose file with path:", file_path
    else:
        print "you didn't open anything!"
    fin = open(file_path)
    fout = open("URL Cleaned.txt", "wt")
    for line in fin.readlines():
        editor = (line.replace('[.]', '.')
                      .replace('[dot]', '.')
                      .replace('hxxp://www.', '')
                      .replace('hxxps://www.', '')
                      .replace('hxxps://', '')
                      .replace('hxxp://', '')
                      .replace('www.', '')
                      .replace('http://www.', '')
                      .replace('https://www.', '')
                      .replace('https://', '')
                      .replace('http://', ''))
        editor = re.sub(r'/.*', '', editor)

if __name__ == '__main__':
    main()
Any help is appreciated. I have scoured the posts and tried all of the suggestions for my issue and have not found one that works.
You can use a regular expression to find the base domains.
If you have one URL per line in your file:
import re

def main():
    file = open("url.txt", 'r')
    domains = set()
    # will work for any url like https://www.domain.com/something/somethingmore...,
    # also without www, without https, or just www.domain.org
    matcher = re.compile(r"(h..ps?://)?(?P<domain>(www\.)?[^/]*)/?.*")
    for line in file:
        # make here any replace you need for obfuscated urls, e.g.: line = line.replace('[.]', '.')
        if line[-1] == '\n':  # remove "\n" from the end of the line if present
            line = line[0:-1]
        match = matcher.search(line)
        if match is not None:  # if a url has been found
            domains.add(match.group('domain'))
    print domains
    file.close()

main()
For example, with this file, it will print:
set(['platinum-shakers.net', 'wmi.ns01.us', 'adservice.no-ip.org', 'samczeruno.pl', 'java.ns1.name', 'microsoft.dhcp.biz', 'ids.us01.us', 'devsite.quostar.com', 'orlandmart.com'])
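To see what the named group captures, you can try the same pattern on a single made-up URL in the interpreter:

>>> import re
>>> matcher = re.compile(r"(h..ps?://)?(?P<domain>(www\.)?[^/]*)/?.*")
>>> matcher.search("hxxp://www.example.com/some/path").group('domain')
'www.example.com'

(note that an optional www. prefix is kept inside the domain group).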
Perhaps you could use a regular expression:
import re

p = re.compile(r".*\.com/(.*)")  # to get, for instance, 'example1' or 'example2' etc.

with open(file_path) as fin, open("URL Cleaned.txt", "wt") as fout:
    lines = fin.readlines()
    bases = set(re.search(p, line).groups()[0] for line in lines if len(line) > 1)
    for b in bases:
        fout.write(b + '\n')  # newline so each base lands on its own line
Using with open(...) auto-closes the files after the block of code executes.
Output:
Using a text file with:
www.example.com/example1
www.example.com/example2
# blank lines are accounted for
www.example.com/example3
www.example.com/example4
www.example.com/example4 # as are duplicates
as the lines, I got the output:
example1
example2
example3
example4