I want to delete all characters in all lines after the # sign.
I wrote some piece of code:
#!/usr/bin/env python
import sys, re, urllib2
url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
html = document.read()
html2 = html[0]
for x in html.rsplit('#'):
    print x
But it only deletes the # sign and copies the rest of the characters onto the next line.
So how can I modify this code to delete all characters after # in every line?
Should I use a regex?
You are splitting too many times; use str.rpartition() instead and just ignore the part after #. Do this per line:
for line in html.splitlines():
    cleaned = line.rpartition('#')[0]
    print cleaned
or, for older Python versions, limit str.rsplit() to just 1 split, and again only take the first result:
for line in html.splitlines():
    cleaned = line.rsplit('#', 1)[0]
    print cleaned
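To make the difference concrete, here is what the two calls return for a made-up sample line (the address below is hypothetical, not taken from the actual file):
>>> 'someuser@example.com  # a comment'.rpartition('#')
('someuser@example.com  ', '#', ' a comment')
>>> 'someuser@example.com  # a comment'.rsplit('#', 1)
['someuser@example.com  ', ' a comment']
Taking element [0] in either case gives you everything before the #.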
I used str.splitlines() to cleanly split a text regardless of newline style. You can also loop directly over the urllib2 response file object:
url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
for line in document:
    cleaned = line.rpartition('#')[0]
    print cleaned
Demo:
>>> import urllib2
>>> url = 'http://varenhor.st/wp-content/uploads/emails.txt'
>>> document = urllib2.urlopen(url)
>>> for line in document:
...     cleaned = line.rpartition('#')[0]
...     print cleaned
...
ADAKorb...
AllisonSarahMoo...
Artemislinked...
BTBottg...
BennettLee...
Billa...
# etc.
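The snippets above are Python 2 (urllib2 and print statements). If you are on Python 3, a rough equivalent — just a sketch, assuming the same URL — uses urllib.request and decodes each line, since the response yields bytes:
from urllib.request import urlopen

url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urlopen(url)
for line in document:
    # each line is bytes in Python 3, so decode before partitioning
    cleaned = line.decode('utf-8').rpartition('#')[0]
    print(cleaned)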
You can use Python's slice notation:
import re
import sys
import urllib2
url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
html = document.read()
for line in html.splitlines():
    at_index = line.index('#')
    print line[:at_index]
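One caveat: str.index() raises a ValueError for any line that contains no '#'. A slightly more defensive variant (a sketch, not part of the original answer) uses str.find(), which returns -1 instead of raising:
for line in html.splitlines():
    at_index = line.find('#')
    if at_index == -1:
        # no '#' on this line, keep it unchanged
        print line
    else:
        print line[:at_index]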
Since strings are sequences, you can slice them. For instance,
hello_world = 'Hello World'
hello = hello_world[:5]
world = hello_world[6:]
Bear in mind, slicing returns a new sequence and doesn't modify the original sequence.
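A quick check makes that concrete:
hello_world = 'Hello World'
hello = hello_world[:5]
print hello        # Hello
print hello_world  # Hello World -- the original string is unchanged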
Since you already imported re, you can use it:
document = urllib2.urlopen(url)
reg_ptn = re.compile(r'#.*')
for line in document:
    print reg_ptn.sub('', line)
In this script I am writing to learn Python, I would like to just use a wildcard instead of rewriting this whole block just to change line 2. What would be the most efficient way to consolidate this into a loop, so that it uses all d.entries[0-99].content and repeats until finished? if, while, for?
Also, my try/except does not perform as expected.
What gives?
import feedparser, base64
from urlextract import URLExtract
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
print (d.entries[3].title)
sr = str(d.entries[3].content)
spl1 = sr.split("<p>")
ss = str(spl1)
spl2 = ss.split("</p>")
try:
    st = str(spl2[0])
    # print(st)
except:
    binascii.Error
    st = str(spl2[1])
    print(st)
#st = str(spl2[0])
spl3 =st.split("', '")
stringnow=str(spl3[1])
b64s1 = stringnow.encode('ascii')
b64s2 = base64.b64decode(b64s1)
stringnew = b64s2.decode('ascii')
print(stringnew)
## but line 15 does nothing; how do I fix it, and also loop through all d.entries[?].content?
The loop part is done simply by doing the following:
import feedparser, base64
from urlextract import URLExtract
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
# loop from 0 to 99
# range(100) goes from 0 and up to and not including 100
for i in range(100):
    print(d.entries[i].title)
    sr = str(d.entries[i].content)
    # << the rest of your code here >>
The data returned from d.entries[i].content is a dictionary, but you are converting it to a string, so you may want to check that you are really doing what you intend. Also, .split() produces a list of the split items, but you then convert that back to a string again (a few times). You may want to take another look at that part of the code.
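For example, instead of casting the content list to a string and splitting it, you could index into it directly — a sketch, assuming the entries keep the structure feedparser normally returns (a list of dicts with a 'value' key):
entry_content = d.entries[i].content    # a list of dicts
html_value = entry_content[0]['value']  # the HTML string inside the first dict
print(html_value)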
I haven't used regex much, but I decided to play with it and got this to work. I retrieved the contents of the 'value' key from the dictionary, then used regex to get the base64 info. I only tried it for the first 5 items (i.e., I changed range(100) to range(5)). Hope it helps; if not, I enjoyed doing this. I left in all of the print statements I used while working down the code.
import feedparser, base64
from urlextract import URLExtract
import re
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
for i in range(100):
    print(d.entries[i].title)
    # .content is a list.
    # print("---------")
    # print(type(d.entries[i].content))
    print(d.entries[i].content)
    print("---------")
    # gets the contents of key 'value' in the dictionary that is the 1st item in the list
    string_value = d.entries[i].content[0]['value']
    print(string_value)
    print("---------")
    # this assumes there is always a space between the 1st </p> and the 2nd <p>
    # grabs the text in between using re.search
    pattern = "<p>(.*?)</p>"
    substring = re.search(pattern, string_value).group(1)
    print(substring)
    print("---------")
    print("---------")
    print("---------")
    # rest of your code here
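If the goal is still to base64-decode what sits between the <p> tags, as in your original code, the captured substring can be fed straight into base64 — a sketch, assuming the substring really is plain base64 text:
decoded = base64.b64decode(substring.encode('ascii')).decode('ascii')
print(decoded)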
I have:
"[15765,22832,15289,15016,15017]"
I want:
[15765,22832,15289,15016,15017]
What should I do to convert this string to list?
P.S. The post was edited without my permission and lost an important part: the line that looks like a list is actually of type 'bytes', not a string.
P.S. №2. My initial code was:
import urllib.request, re
f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
    m = re.match('var\s+(\w+)\s*=\s*\[\\s*(.+)\s*\]\;', line.decode('windows-1251'))
    if m is not None:
        varname = m.group(1)
        if varname == "aEmitentIds":
            aEmitentIds = line  # its type is 'bytes', not 'string'
I need to get a list from line.
The line from the web page looks like:
[15765, 22832, 15289, 15016, 15017]
Assuming s is your string, you can just use split and then cast each number to integer:
s = [int(number) for number in s[1:-1].split(',')]
For detailed information about split function:
Python3 split documentation
What you have is a stringified list. You could use a JSON parser to parse it into the corresponding list:
import json
test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str) # List that you need.
Or another way to do this would be to use ast
import ast
test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)
The result is
[15765, 22832, 15289, 15016, 15017]
To understand why using eval() is bad practice you could refer to this answer
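Since you mention that what you actually have is bytes rather than a str, decode it first and then parse — a minimal sketch, assuming the windows-1251 encoding from your original code:
import ast

raw = b"[15765,22832,15289,15016,15017]"  # what the response line looks like
numbers = ast.literal_eval(raw.decode('windows-1251'))
print(numbers)  # [15765, 22832, 15289, 15016, 15017]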
You can also use regex to pull out numeric values from the string as follows:
import re
lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall(r'\d+', lst)]
Output of the above code is,
[15765, 22832, 15289, 15016, 15017]
I am trying to get the data from a URL; below is the format of the data.
What I am trying to do:
1) Read line by line and find whether the line contains the desired keyword.
2) If yes, then store the previous line's content "GETCONTENT" in a list.
<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT
a <http://www.example.com/XYZ/mount/v1#NNNN> ,
<http://www.w3.org/2002/w#Individual> ;
<http://www.w3.org/2000/01/rdf-schema#label>
"some content , "some url content ;
<http://www.example.com/XYZ/log/v1#hasRelation>
<http://www.example.com/XYZ/data/v1#Change> ;
<http://www.example.com/XYZ/log/v1#ServicePage>
<https://dev.org.net/apis/someLabel> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some API Content .
<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a <http://www.w3.org/01/07/w#BBBBBB> ;
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#xyz> ;
<http://www.w3.org/2000/01/schema#label1>
"some content , "some url content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.w3.org/2001/XMLSchema#boolean> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some description .
<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
a <http://www.w3.org/01/07/w#AAAAAA> ;
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#Version> ;
<http://www.w3.org/2000/01/schema#label>
"some content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.example.com/XYZ/data/v1#uuu> .
<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
a <http://www.w3.org/01/07/w#CCCCCC ,
<http://www.w3.org/2002/07/w#Name>
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#xyz> ;
<http://www.w3.org/2000/01/schema#label1>
"some content , "some url content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.w3.org/2001/XMLSchema#boolean> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some description .
Below is the code I tried so far, but it prints all the content of the file:
import re
def read_from_url():
    try:
        from urllib.request import urlopen
    except ImportError:
        from urllib2 import urlopen
    url_link = "examle.com"
    html = urlopen(url_link)
    previous = None
    for line in html:
        previous = line
        line = re.search(r"^(\s*a\s*)|\#GETBBBBBB|#GETAAAAAA|#GETCCCCCC\b",
                         line.decode('UTF-8'))
        print(previous)

if __name__ == '__main__':
    read_from_url()
Expected output:
GETBBBBBB , GETAAAAAA , GETCCCCCC
Thanks in advance!!
When it comes to reading data from URLs, the requests library is much simpler:
import requests
url = "https://www.example.com/your/target.html"
text = requests.get(url).text
If you haven't got it installed you could use the following to do so:
pip3 install requests
Next, why go through the hassle of shoving all of your words into a single regular expression when you could use a word array and then use a for loop instead?
For example:
import re

search_words = "hello word world".split(" ")
matching_lines = []
for (i, line) in enumerate(text.splitlines()):
    line = line.strip()
    if len(line) < 1:
        continue
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Then you'd output the result, like this:
print(matching_lines)
Running this where the text variable equals:
"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""
Should output:
[
"this word will save the line",
"hello my friend!"
]
You could make the search case insensitive by using the lower method, like this:
search_words = [word for word in "hello word world".lower().split(" ")]
matching_lines = []
for (i, line) in enumerate(text.splitlines()):
    line = line.strip()
    if len(line) < 1:
        continue
    line = line.lower()
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Notes and information:
the break statement stops checking further words once a line has matched, so a line is only added once
the enumerate function allows us to iterate over the index and the current line
I lowercased the search words outside of the loop so lower isn't called again for every word and every line
I didn't call lower on the line until after the empty-line check because there's no point in lowercasing an empty line
Good luck.
I'm puzzled about a few things, and answering them may help the community better assist you. Specifically, I can't tell what form the file is in (i.e. is it a txt file, or a URL you're making a request to and parsing the response of?). I also can't tell whether you're trying to get the entire line, just the URL, or just the bit that follows the hash symbol.
Nonetheless, you stated you were looking for the program to output GETBBBBBB , GETAAAAAA , GETCCCCCC, and here's a quick way to get those specific values (assuming the values are in the form of a string):
search = re.findall(r'#(GET[ABC]{6})>', string)
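For instance, combined with the requests idea from the other answer (the URL below is only a placeholder), that pattern could produce the expected output like this:
import re
import requests

text = requests.get("https://www.example.com/your/target.html").text  # placeholder URL
matches = re.findall(r'#(GET[ABC]{6})>', text)
print(' , '.join(matches))  # e.g. GETBBBBBB , GETAAAAAA , GETCCCCCC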
Otherwise, if you're reading from a txt file, this may help:
with open('example_file.txt', 'r') as file:
    lst = []
    for line in file:
        search = re.findall(r'#(GET[ABC]{6})', line)
        if search != []:
            lst += search
    print(lst)
Of course, these are just some quick suggestions in case they may be of help. Otherwise, please answer the questions I mentioned at the beginning of my response and maybe it can help someone on SO better understand what you're looking to get.
Any help as to why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so it just says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]
It's hard to tell without seeing the repr(filename), but I think your problem is the confusion of real newline characters with escaped newline characters.
Compare and contrast the examples below:
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'
If your goal is just to get the stripped string from within the <td> tag, you can let BeautifulSoup do it for you via the tag's stripped_strings generator:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))  # finds the first td in the html whose text matches
filename_string = next(filename_tag.stripped_strings)  # stripped_strings yields the tag's whitespace-stripped strings
print filename_string
If you want to extract further strings from tags of the same type you can then use findNext to extract the next td tag after the current one:
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))  # the next matching td after the current one
filename_string = next(filename_tag.stripped_strings)
print filename_string
And then loop through...
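A rough sketch of that loop, using find_all so BeautifulSoup collects every matching <td> at once (this swaps findNext for find_all, so treat it as one possible reading of what you want):
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
for td in soup.find_all("td", text=re.compile(r"\.[a-z]{3,}")):
    # get_text(strip=True) returns the tag's text with surrounding whitespace removed
    print td.get_text(strip=True)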
In Python, is it possible to cut out a section of text in a document when you only know the beginning and end words?
For example, using the bill of rights as the sample document, search for "Amendment 3" and remove all the text until you hit "Amendment 4" without actually knowing or caring what text exists between the two end points.
The reason I'm asking is I would like to use this Python script to modify my other Python programs when I upload them to the client's computer -- removing sections of code that exists between a comment that says "#chop-begin" and "#chop-end". I do not want the client to have access to all of the functions without paying for the better version of the code.
You can use Python's re module.
I wrote this example script for removing the sections of code in a file:
import re
# Create regular expression pattern
chop = re.compile('#chop-begin.*?#chop-end', re.DOTALL)
# Open file
f = open('data', 'r')
data = f.read()
f.close()
# Chop text between #chop-begin and #chop-end
data_chopped = chop.sub('', data)
# Save result
f = open('data', 'w')
f.write(data_chopped)
f.close()
With data.txt
do_something_public()
#chop-begin abcd
get_rid_of_me() #chop-end
#chop-beginner this should stay!
#chop-begin
do_something_private()
#chop-end The rest of this comment should go too!
but_you_need_me() #chop-begin
last_to_go()
#chop-end
the following code
import re
class Chopper(object):
    def __init__(self, start='\\s*#ch'+'op-begin\\b', end='#ch'+'op-end\\b.*?$'):
        super(Chopper, self).__init__()
        self.re = re.compile('{0}.*?{1}'.format(start, end), flags=re.DOTALL+re.MULTILINE)

    def chop(self, s):
        return self.re.sub('', s)

    def chopFile(self, infname, outfname=None):
        if outfname is None:
            outfname = infname
        with open(infname) as inf:
            data = inf.read()
        with open(outfname, 'w') as outf:
            outf.write(self.chop(data))
ch = Chopper()
ch.chopFile('data.txt')
results in data.txt
do_something_public()
#chop-beginner this should stay!
but_you_need_me()
Use regular expressions:
import re
string = re.sub('#chop-begin.*?#chop-end', '', string, flags=re.DOTALL)
.*? will match everything in between, non-greedily.
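To see why the non-greedy .*? matters here, compare it with the greedy .* on a string containing two chop blocks (a toy example):
import re

s = "keep #chop-begin a #chop-end keep #chop-begin b #chop-end keep"
print(re.sub('#chop-begin.*?#chop-end', '', s, flags=re.DOTALL))  # 'keep  keep  keep'
print(re.sub('#chop-begin.*#chop-end', '', s, flags=re.DOTALL))   # 'keep  keep' -- greedy spans both blocks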