Delete everything except URLs with Python [duplicate]

This question already has answers here:
Extracting a URL in Python
(10 answers)
Closed 2 years ago.
I have a JSON file that contains metadata for 900 articles. I want to delete all the data except for the lines that contain URLs and resave the file as .txt.
I wrote the code below, but I got stuck on the saving step:
import re

with open(r"path\url_example.json") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)
Part of the output:
['http://www.google.com.']
['https://www.tutorialspoint.com']
Another issue: the results are wrapped in ['...'] and may end with a trailing period, which I don't want. My expected output is:
http://www.google.com
https://www.tutorialspoint.com

If you know which key your URLs will be found under in your JSON, you might find an easier approach is to deserialize the JSON using the JSON module from the Python standard library and work with a dict instead of using regex.
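For illustration, here is a minimal sketch of that approach. It assumes the file holds a list of article objects and that the URL sits under a hypothetical "url" key; adjust the key to match your actual metadata:
import json

# Parse the whole JSON document into Python objects.
with open(r"path\url_example.json") as f:
    articles = json.load(f)

# Collect the (hypothetical) "url" value from each article and write
# one URL per line to a plain .txt file.
with open(r"path\urls.txt", "w") as out:
    for article in articles:
        if "url" in article:
            out.write(article["url"].rstrip(".") + "\n")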
However, if you want to work with regex, remember urls is a list of regex matches. If you know there is definitely going to be only one match per line, just print the first entry and rstrip off the trailing ".", if it's there.
import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls[0].rstrip('.'))
If you expect to see multiple matches per line:
import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        for url in urls:
            print(url.rstrip('.'))

Without further information on the file you have (txt? json?) and on the kind of lines you are looping through, here is a simple attempt without re.findall():
with open(r"path\url_example.txt") as handle:
    for line in handle:
        if 'http' not in line:
            continue
        spos = line.find('http')
        epos = line.find(' ', spos)
        if epos == -1:  # URL runs to the end of the line
            epos = len(line)
        url = line[spos:epos].rstrip()
        print(url)

Related

Python search txt file for urls

I'm trying to search a .txt file and return any objects that match my criteria. I would like to get the entire line and place the URLs in a set or list.
What is the best way to search the txt file and return objects?
Here is what I have so far:
# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in Read mode
target_open = open(target_file, 'r')

# Start loop. Only return possible url links.
for line in target_open:
    if '.' in line:
        print(target_open.readline())
And here is the sample .txt file:
This is a file:
Sample file that contains random urls. The goal of this
is to extract the urls and place them in a list or set
with python. Here is a random link ESPN.com
Links will have multiple extensions but for the most part
will be one or another.
python.org
mywebsite.net
firstname.wtf
creepy.onion
How to find a link in the middle of line youtube.com for example
Unless you have restrictions that require you to parse the URLs manually rather than using built-in Python libraries, the re module can help accomplish this.
Using an answer from Regular expression to find URLs within a string
# import regex module
import re

# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in Read mode
target_open = open(target_file, 'r')

# Read the text from the file
text = target_open.read()

urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
result:
['ESPN.com', 'python.org', 'mywebsite.net', 'firstname.wtf', 'creepy.onion', 'youtube.com']
Unfortunately, searching with if '.' in line: will also match sentence punctuation such as "urls. The", "python. Here" and "another.".
Python's regex module helps specify the pattern of url syntax so only urls are matched and not sentence punctuation.
Hope this helps.

Searching for and manipulating the content of a keyword in a huge file

I have a huge HTML file that I have converted to a text file. (The file is the Facebook home page's source.) Assume the text file has a specific keyword in some places, for example: "some_keyword: [bla bla]". How would I print all the different bla blas that follow some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names in this format in the page. How would I print all the names that follow "name:", given that the text is so large that the program crashes when you read() it whole or search through its lines?
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment: since you are the person responsible for writing the data to the file, write the data in JSON format and read it back with json.loads():
import json

json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)

for item in json_data:
    print(item['name'])
Explanation:
Let's say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which changes dynamically within your code wherever you perform the write operation on the file. Instead of writing it directly, append it to a list:
a = []
for item in page_content:
    # data = some xy logic on HTML file
    a.append(data)
Now write this list to the file using json.dump():
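A minimal sketch of that write step, assuming a is the list built above and the same file path:
import json

# Serialize the whole list as one JSON document so that the
# json.loads() snippet above can read it back.
with open('/path/to/your_file', 'w') as f:
    json.dump(a, f)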
I just wanted to throw this out there, even though I agree with all the comments about dealing with the HTML directly or using Facebook's API (probably the safest way). Open file objects in Python can be iterated over, yielding lines without reading the entire file into memory, and the re module can be used to extract information from the text.
This can be done like so:
import re

regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")

with open("filename.txt", "r") as fp:
    for line in fp:
        for match in regex.findall(line):
            print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
here are the Python 2 docs for the re module
here are the Python 3 docs for the re module
As for the "generator" behaviour of file objects: they are iterators, and the io module documentation notes that IOBase objects support the iterator protocol, yielding one line of the stream at a time.

Create new list from old using re.sub() in python 2.7

My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, then work on the remaining text.
I started with this, which works to remove the XML tags, but only from the entire XML file:
from urllib import urlopen
import re
url = [URL of XML FILE HERE] #the url of the file to search
raw = urlopen(url).read() #open the file and read it into variable
exp = re.compile(r'<.*?>')
text_only = exp.sub('',raw).strip()
I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).
But I can't figure out how to apply the regex to the list resulting from the soup.find_all. I've tried to use text_only = [item for item in text2 if exp.sub('',item).strip()] and variations but I keep getting this error: TypeError: expected string or buffer
What am I doing wrong?
You don't want to regex this. Instead just use BeautifulSoup's existing support for grabbing text:
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
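Putting it together with the question's setup, a minimal sketch (written for Python 3, where urlopen lives in urllib.request; "html.parser" is an assumption, swap in "xml" if lxml is installed):
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "..."  # the URL of the XML file, as in the question

# Download and parse once; BeautifulSoup extracts the text, no regex needed.
raw = urlopen(url).read()
soup = BeautifulSoup(raw, "html.parser")

quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]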

Python - How to export each item in a list to individual text file

I have a csv file of a couple dozen web pages that I am trying to loop over.
The goal is to get the text from the web page, take out the html markup (using html2text), and then save the clean text as a .txt file. My idea was to save the clean text of each webpage as an item in the list, then export each item in the list to a txt file.
I can get the program to loop over the urls and take out the html, but saving to individual txt files keeps throwing an error. Can anyone give me some ideas on how to do this?
Code:
from stripogram import html2text
import urllib
import csv

text_list = []

urls = csv.reader(open('web_links2.csv'))

for url in urls:
    response = urllib.urlopen(url[0])
    html = response.read()
    text = html2text(html)
    text_list.append(text)

print text_list

for item in text_list:
    f = open('c:\users\jacob\documents\txt_files\%s.txt'%(item,), 'w')
    f.write(item)
    f.close
It looks like you are using the same value (item) for both the names of the files and their contents, so unless these files are single words, you are likely generating illegal file names.
Plus, in order to call close, you need to supply the parentheses.
Your main problem is that you are not escaping the \t; use a raw string (prefix r):
open(r'c:\users\jacob\documents\txt_files\%s.txt'%(item,), 'w')
\t is a tab character, so use a raw string as in the example, double backslashes (\\), or forward slashes (/) in your file path.
In [11]: s = "\txt_files"
In [12]: print(s)
xt_files
In [13]: s = r"\txt_files"
In [14]: print(s)
\txt_files
f.close <- missing parens to call the method
Use with to open your file, and forgetting to call close will no longer be an issue:
with open(r'c:\users\jacob\documents\txt_files\%s.txt' % (item,), 'w') as f:  # closes your files automatically
    f.write(item)
I think you might not want to put the full item in the filename, since the item is the entire HTML of a webpage. In your case I'd either add some logic to derive a neat website name or just use an index you can iterate over.
Also, the file path definition should be different: use double backslashes (\\) or a raw string instead of single backslashes.
You might want to do something like this:
i = 0
for item in text_list:
    i += 1
    # also use format instead of the %s
    f = open("c:\\users\\jacob\\documents\\txt_files\\{0}.txt".format(i), 'w')
    f.write(item)
    f.close()
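If you would rather name each file after its source page, one option is to keep (url, text) pairs instead of text alone and derive a filesystem-safe name from the URL. A rough sketch; pages is a hypothetical list built in the download loop, and the sanitizing rule is only an illustration:
import re

# pages holds (url, text) pairs, e.g. built with
# pages.append((url[0], text)) inside the download loop.
for page_url, text in pages:
    # Replace anything that is not a letter, digit, dot or hyphen
    # so the URL becomes a legal Windows file name.
    safe_name = re.sub(r'[^A-Za-z0-9.-]', '_', page_url)
    with open(r'c:\users\jacob\documents\txt_files\%s.txt' % safe_name, 'w') as f:
        f.write(text)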

Read text file as a whole [duplicate]

This question already has answers here:
Does reading an entire file leave the file handle open?
(4 answers)
Closed 8 years ago.
I need your help. I want to read a text file as a whole, not line by line, because line by line my regex doesn't work well; it needs the whole text. So far this is what I have been doing:
with open(r"AllText.txt") as fp:
for line in fp:
for i in re.finditer(regexp_v3, line):
print i.group()
I need to open my file, read it all, search if for my regex and print my results. How can I accomplish this?
To get all the content of a file, just use file.read():
all_text = fp.read() # Within your with statement.
all_text is now a single string containing the data in the file.
Note that this will contain newline characters, but if you are extracting things with a regex they shouldn't be a problem.
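For example, applying the pattern from the question to the whole string (regexp_v3 is the asker's own pattern):
import re

with open(r"AllText.txt") as fp:
    all_text = fp.read()

# The regex now sees the entire file at once, so matches that span
# line breaks can be found too.
for m in re.finditer(regexp_v3, all_text):
    print(m.group())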
For that, use read():
with open("AllText.txt") as fp:
    whole_file_text = fp.read()
Note, however, that your text will contain \n where the newlines used to be.
For example, if this was your text file:
#AllText.txt
Hello
How
Are
You
Your whole_file_text string will be as follows:
>>> whole_file_text
'Hello\nHow\nAre\nYou'
You can do either of the following:
>>> whole_file_text.replace('\n', ' ')
'Hello How Are You'
>>> whole_file_text.replace('\n', '')
'HelloHowAreYou'
If you don't want to read the entire file into memory, you can use mmap.
Memory-mapped file objects behave like both strings and like file objects.
import re, mmap

with open(r'AllText.txt', 'r+') as f:
    data = mmap.mmap(f.fileno(), 0)
    mo = re.finditer(regexp_v3, data)
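Note that on Python 3 the file must be opened in binary mode and the pattern must be a bytes pattern for re to search an mmap. A minimal sketch; the URL pattern is only an illustrative stand-in for regexp_v3:
import re, mmap

with open(r'AllText.txt', 'rb') as f:
    # mmap exposes the file contents as a bytes-like object.
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for m in re.finditer(br'https?://\S+', data):  # bytes pattern
        print(m.group().decode())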
