Python search txt file for urls - python

I'm trying to search a .txt file and return any objects found that match my criteria. I would like to get the entire line and place the URLs in a set or list.
What is the best way to search the .txt file and return those objects?
Here is what I have so far:
# Target file to search
target_file = 'randomtextfile.txt'
# Open the target file in Read mode
target_open = open(target_file, 'r')
# Start loop. Only return possible url links.
for line in target_open:
    if '.' in line:
        print(target_open.readline())
And here is the sample .txt file:
This is a file:
Sample file that contains random urls. The goal of this
is to extract the urls and place them in a list or set
with python. Here is a random link ESPN.com
Links will have multiple extensions but for the most part
will be one or another.
python.org
mywebsite.net
firstname.wtf
creepy.onion
How to find a link in the middle of line youtube.com for example

Unless you have restrictions that require you to parse the URLs manually rather than using built-in Python libraries, the re module can help accomplish this.
Using an answer from Regular expression to find URLs within a string
# Target file to search
target_file = 'randomtextfile.txt'
# Open the target file in read mode
target_open = open(target_file, 'r')
# Read the text from the file
text = target_open.read()
# Import the regex module
import re
# Optional scheme, then anything that looks like host.tld
urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
result:
['ESPN.com', 'python.org', 'mywebsite.net', 'firstname.wtf', 'creepy.onion', 'youtube.com']
Unfortunately, searching with if '.' in line: matches ordinary sentence punctuation as well as URLs, e.g. "urls. The", "python. Here" and "another.".
Python's re module lets you describe the shape of a URL, so only URLs are matched and sentence punctuation is ignored.
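Since the question asks for the results in a set or list, here is a minimal follow-up sketch: the same pattern as above, deduplicated with a set and using a with block so the file is closed.
import re

# Collect unique URL-like matches from the file into a set
with open('randomtextfile.txt', 'r') as target_open:
    text = target_open.read()

url_pattern = r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+'
unique_urls = set(re.findall(url_pattern, text))
print(unique_urls)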
Hope this helps.

Related

Using regex in python on multiple text files to parse and collect data to add to excel

Hi, I am extremely new to Python and I need to work with regex.
I have multiple .txt files in a directory that I need to parse. Each of these .txt files has multiple occurrences of the word "instruction" in it. I need to grab the number that follows the word "instruction" and add it to a list that I will display in Excel, so that I end up with a column of instruction numbers and a row of all the .txt file names, with a yes or no next to each instruction number indicating whether it is present in a particular .txt file.
I want to know how to grab the number that follows the word "instruction" and add it to a list (maybe), and use this list later to build the Excel file. What is the way to write this regex instruction?
This is my code so far
import csv
import re
import glob
import os

inst_num = []
os.chdir(r"C:\Users\10002\Desktop\work\scripts")
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        for line in f:
            inst = re.compile('instruction:(\d+)', line)
            if inst.search(line) is not None:
                inst_num = inst.search(line).group(1)
First, re.compile does not take the text string that is to be searched as a second argument (the optional second argument is the set of flags to use, e.g. re.IGNORECASE). Second, the call to compile should be taken out of the loop, or else you are defeating the purpose of pre-compiling the regular expression. Third, you are asking multiple questions, which is generally frowned upon. I will show you how to create a list of numbers; if you have a separate question on how to create a CSV file from that, post a separate question.
import csv
import re
import glob
import os

inst_num = []
inst = re.compile(r'instruction:(\d+)')   # compiled regex
os.chdir(r"C:\Users\10002\Desktop\work\scripts")
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        for line in f:
            match = inst.search(line)     # do the search once
            if match:
                inst_num.append(match.group(1))   # add to list
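If you later also want the per-file yes/no view the question describes, one possible extension is sketched below; the dict layout and the printed row format are assumptions, not part of the answer above.
import re
import glob
import os

inst = re.compile(r'instruction:(\d+)')
os.chdir(r"C:\Users\10002\Desktop\work\scripts")

numbers_by_file = {}                      # {filename: set of instruction numbers found}
for file in glob.glob("*.txt"):
    found = set()
    with open(file, 'r') as f:
        for line in f:
            match = inst.search(line)
            if match:
                found.add(match.group(1))
    numbers_by_file[file] = found

# Every instruction number seen in any file, then yes/no per file
all_numbers = sorted(set().union(*numbers_by_file.values()))
for number in all_numbers:
    print(number, ['yes' if number in numbers_by_file[name] else 'no'
                   for name in sorted(numbers_by_file)])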

delete everything except URL with python [duplicate]

This question already has answers here:
Extracting a URL in Python
I have a JSON file that contains metadata for 900 articles. I want to delete all the data except for the lines that contain URLs and resave the file as .txt.
I created this code but I couldn't continue the saving phase:
import re

with open(r"path\url_example.json") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)
A part of the results:
['http://www.google.com.']
['https://www.tutorialspoint.com']
Another issue is that the results are wrapped in [' '] and may end with a period, which I don't need. My expected result is:
http://www.google.com
https://www.tutorialspoint.com
If you know which key your URLs will be found under in your JSON, you might find an easier approach is to deserialize the JSON using the JSON module from the Python standard library and work with a dict instead of using regex.
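For example, assuming the metadata is a list of objects and the links sit under a hypothetical "url" key, that approach might look like:
import json

# "url" is a hypothetical key name -- adjust it to the real structure of the metadata
with open(r"path\url_example.json") as file:
    articles = json.load(file)

for article in articles:
    print(article["url"])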
However, if you want to work with regex, remember urls is a list of regex matches. If you know there's definitely only going to be one match per line, then just print the first entry and rstrip off the terminal ".", if it's there.
import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls[0].rstrip('.'))
If you expect to see multiple matches per line:
import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        for url in urls:
            print(url.rstrip('.'))
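The question also mentions resaving the kept URLs as a .txt file; here is a minimal sketch of that last step, collecting the cleaned URLs and then writing them out (the output file name is only an example):
import re

cleaned = []
with open(r"path\url_example.json") as file:
    for line in file:
        for url in re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line):
            cleaned.append(url.rstrip('.'))   # drop a trailing period, as above

with open("urls_only.txt", "w") as out:       # example output name
    out.write("\n".join(cleaned))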
Without further information on the file you have (txt, json?) and the kind of input line you are looping through, here is a simple attempt without re.findall().
with open("path\url_example.txt") as handle:
for line in handle:
if not re.search('http'):
continue
spos = line.find('http')
epos = line.find(' ', spos)
url = line[spos:epos]
print(url)

Searching for and manipulating the content of a keyword in a huge file

I have a huge HTML file that I have converted to a text file. (The file is the Facebook home page's source.) Assume the text file has a specific keyword in some places of it, for example: "some_keyword: [bla bla]". How would I print all the different bla blas that follow some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names with this format in the page. How would I print all the names that follow "name:", considering the text is very large and the program crashes when you read() it or try to search through its lines?
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment, since you are the person responsible for writing the data to the file, write the data in JSON format and read it back from the file with json.loads():
import json

json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)

for item in json_data:
    print(item['name'])
Explanation:
Let's say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which changes dynamically in the part of your code where you write to the file. Instead, append each record to a list:
a = []
for item in page_content:
    # data = some xy logic on the HTML file
    a.append(data)
Now write this list to the file using json.dump().
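A minimal sketch of that write step (the file name and the sample record are only examples, reusing the record from the question):
import json

# Suppose "a" is the list of records built in the loop above
a = [{"id": "1126830890", "name": "Hillary Clinton", "firstName": "Hillary"}]

# Serialize the list as valid JSON so json.loads() can read it back later
with open('profiles.json', 'w') as out_file:
    json.dump(a, out_file)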
I just wanted to throw this out there, even though I agree with all the comments about dealing with the HTML directly or using Facebook's API (probably the safest way): open file objects in Python can be used as generators that yield lines without reading the entire file into memory, and the re module can be used to extract information from text.
This can be done like so:
import re

regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")

with open("filename.txt", "r") as fp:
    for line in fp:
        for match in regex.findall(line):
            print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
Here are the Python 2 docs for the re module.
Here are the Python 3 docs for the re module.
I cannot find documentation that details the iterator behaviour of file objects in Python; it seems to be one of those well-known secrets... Please feel free to edit and remove this paragraph if you know where in the Python docs this is covered.
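For the name:"..." format shown in the question's sample, a variant of the same idea might look like this (the pattern here is an assumption about how the values are quoted):
import re

# Capture whatever follows name:" up to the next double quote
name_regex = re.compile(r'name:"(.*?)"')

with open("filename.txt", "r") as fp:
    for line in fp:
        for match in name_regex.findall(line):
            print(match)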

Create new list from old using re.sub() in python 2.7

My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, then work on the remaining text.
I started with this, which works to remove the XML tags, but only from the entire XML file:
from urllib import urlopen
import re
url = [URL of XML FILE HERE] #the url of the file to search
raw = urlopen(url).read() #open the file and read it into variable
exp = re.compile(r'<.*?>')
text_only = exp.sub('',raw).strip()
I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).
But I can't figure out how to apply the regex to the list resulting from the soup.find_all. I've tried to use text_only = [item for item in text2 if exp.sub('',item).strip()] and variations but I keep getting this error: TypeError: expected string or buffer
What am I doing wrong?
You don't want to regex this. Instead just use BeautifulSoup's existing support for grabbing text:
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
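The TypeError in the question comes from passing the Tag objects that find_all() returns into exp.sub(), which expects a string; get_text() avoids the regex entirely. For completeness, here is a minimal end-to-end sketch under the assumption that BeautifulSoup 4 is installed, keeping the question's Python 2.7 urlopen; the URL and parser choice are placeholders.
from urllib import urlopen                  # Python 2.7, as in the question's imports
from bs4 import BeautifulSoup               # assumes BeautifulSoup 4 is installed

url = 'http://example.com/some.xml'         # placeholder for the real XML file URL
raw = urlopen(url).read()

soup = BeautifulSoup(raw, 'html.parser')    # parser choice is an assumption
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
print(text_chunks)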

Reading Regular Expressions from a text file

I'm currently trying to write a function that takes two inputs:
1 - The URL for a web page
2 - The name of a text file containing some regular expressions
My function should read the text file line by line (each line being a different regex) and then execute each regex against the web page's source code. However, I've run into trouble doing this.
Example:
Suppose I want the address contained on a Yelp page with the URL http://www.yelp.com/biz/liberty-grill-cork,
where the regex is \<address\>\s*([^<]*)\\b\s*<. In Python, I then run:
address = re.search('\<address\>\s*([^<]*)\\b\s*<', web_page_source_code)
The above will work; however, if I just write the regex in a text file as is, and then read the regex from the text file, then it won't work. So reading the regex from a text file is what is causing the problem. How can I rectify this?
EDIT: This is how I'm reading the regexes from the text file:
with open("test_file.txt","r") as file:
for regex in file:
address = re.search(regex, web_page_source_code)
Just to add, the reason I want to read regexes from a text file is so that my function code can stay the same and I can alter my list of regexes easily. If anyone can suggest any other alternatives that would be great.
Your string has some backslashes and other characters escaped to avoid their special meaning in a Python string literal, in addition to the escaping the regex itself needs.
You can easily verify what happens by printing the string you load from the file; if the backslashes come out doubled, you did it wrong.
The text you want in the file is:
\<address\>\s*([^<]*)\b\s*<
Here's how you can check it
In [1]: a = open('testfile.txt')
In [2]: line = a.readline()
-- this is the line as you'd see it in python code when properly escaped
In [3]: line
Out[3]: '\\<address\\>\\s*([^<]*)\\b\\s*<\n'
-- this is what it actually means (what re will use)
In [4]: print(line)
\<address\>\s*([^<]*)\b\s*<
OK, I managed to get it working. For anyone who wants to read regular expressions from text files, you need to do the following:
Ensure that the regex in the text file is entered in the right format (thanks to MightyPork for pointing that out).
Remove the newline '\n' character at the end of the line.
So overall, your code should look something like:
a = open("test_file.txt","r")
line = a.readline()
line = line.strip('\n')
result = re.search(line,page_source_code)
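To keep the original goal of trying every regex in the file rather than just the first line, the same strip() fix can go inside the loop from the question; here is a sketch (the sample page source is only a placeholder):
import re

# Placeholder page source -- in practice this is the downloaded web page
web_page_source_code = '<address> 123 Some Street, Cork </address>'

with open("test_file.txt", "r") as file:
    for raw_line in file:
        pattern = raw_line.strip()          # remove the trailing '\n' and stray whitespace
        if not pattern:
            continue                        # skip blank lines
        result = re.search(pattern, web_page_source_code)
        if result:
            print(result.group(0))          # or a specific group, depending on the regex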
