Extract Email from Bulk Text - Error - python

I would like to extract all the email addresses included in an HTML code. I wrote this very simple code (I'm a super basic python writer, I'm just trying to learn):
#coding=utf-8
import urllib
import re
html = urllib.urlopen('http://giacomobonvini.com').read()
r = re.compile(r'(\b[\w.]+#+[\w.]+.+[\w.]\b)')
results = r.findall(html)
emails = ""
for x in results:
    emails += str(x) + "\n"
print emails
The problem is that, even though the code works, the emails are printed like this:
"giacomo.bonvini#gmail.com < / span"
"giacomo.bonvini#gmail.com < br"
I would like not to have "</span>" and "<br>" in the output.
Do you have any idea?
Thanks
Giacomo

r'(\b[\w.]+#+[\w.]+.+[\w.]\b)'
The problem is likely the .+ combination, which matches anything. Maybe you meant to match a single dot instead? If so, use for example [.]
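For example, a corrected pattern along these lines (assuming the page obfuscates @ as #, as in the question) stops at the TLD instead of swallowing trailing markup:

```python
import re

# A hedged sketch: keep the '#' the page uses in place of '@', but
# replace the greedy '.+' with an explicit escaped dot and a TLD class
# so the match ends before trailing markup like '</span>' or '<br>'.
pattern = re.compile(r'\b[\w.]+#[\w.]+\.[a-zA-Z]{2,}\b')
html = 'giacomo.bonvini#gmail.com </span> other#example.co.uk <br>'
print(pattern.findall(html))
```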

Related

How to convert text from shell to HTML?

It's probably a really easy and stupid question, but I'm new to Python, so bear with me.
What I need to do: execute a command in the shell and publish its output as a telegra.ph page.
Problem: the telegra.ph API ignores \n, so all the output text ends up on one line.
I used this Python Telegraph API wrapper - https://github.com/python273/telegraph
I understand I need to convert my text to HTML-like markup and remove <p> tags. I've tried some scripts, but my program gave me this error:
telegraph.exceptions.NotAllowedTag: span tag is not allowed
So I removed all span tags and got the same result as if I had posted it without converting.
Then I tried to use replace("\n", "<p>") but got stuck on the closing tags...
Code:
import subprocess
from telegraph import Telegraph
telegraph = Telegraph()
telegraph.create_account(short_name='1111')
tmp = subprocess.run("arp", capture_output=True, text=True, shell=True).stdout
print( '\n\n\n'+tmp+'\n\n\n\n') ### debug line
response = telegraph.create_page(
    'Random',
    html_content='<p>' + tmp + '</p>'
)
print('https://telegra.ph/{}'.format(response['path']))
The closest html equivalent to \n is the "hard break" <br/> tag.
It does not require closing, because it contains nothing, and directly signifies line break.
Assuming it is supported by telegra.ph, you could simply:
tmp = tmp.replace('\n', '<br/>')
Note that str.replace returns a new string, so the result must be assigned back.
Add this line to convert all intermediate newlines to individual <p>-sections:
tmp = "</p><p>".join(tmp.split("\n"))
tmp.split("\n") splits the string into an array of lines.
"</p><p>".join(...) glues everything together again, closing the previous <p>-section and starting a new one.
This way, the example works for me and line breaks are correctly displayed on the page.
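A minimal standalone sketch of that transformation, outside telegra.ph:

```python
tmp = "line one\nline two\nline three"
# split into lines, then glue with closing/opening <p> tags
tmp = "</p><p>".join(tmp.split("\n"))
html_content = "<p>" + tmp + "</p>"
print(html_content)  # <p>line one</p><p>line two</p><p>line three</p>
```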
EDIT: As the other answer suggests, of course you can also use tags. It depends on what you want to achieve!
It is not clear to me why the telegraph module replaces newlines with spaces. In this case it seems reasonable to disable this behaviour.
import subprocess
import re
import telegraph
from telegraph import Telegraph
telegraph.utils.RE_WHITESPACE = re.compile(r'([ ]{10})', re.UNICODE)
telegraph = Telegraph()
telegraph.create_account(short_name='1111')
tmp = subprocess.run("/usr/sbin/arp",
                     capture_output=True,
                     text=True,
                     shell=True).stdout
response = telegraph.create_page(
    'Random',
    html_content='<pre>' + tmp + '</pre>'
)
print('https://telegra.ph/{}'.format(response['path']))
This outputs text that comes close to the actual formatted arp output.

Extracting follower count from Instagram

I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method within Requests; however, the string that I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it will no longer run.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive a "-1" error, which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and first ' in end...this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as the answer of Derek Eden above from 2019 does not work anymore, as stated in its comments.
The solution was to add the r' before the regular expression in the re.search like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' is really important: without it, Python treats the expression as a regular string, which leads to the query not giving any results.
Also, the Instagram page seems to have backslashes in the object we look for (at least in my tests), so the code example I use is the following, in Python 3.10 and working as of July 2022:
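A small demonstration of the raw-string prefix in action, against a sample string with literal backslashes shaped like the escaped JSON in the page source (the count value here is made up):

```python
import re

# Sample text containing literal backslashes, like the escaped JSON
# embedded in the Instagram page source (count is a made-up number):
text = r'\"edge_followed_by\":{\"count\":110070},\"fbid\"'

# With r'', each '\\' in the pattern reaches the regex engine as an
# escaped backslash and matches the literal backslash in the text.
m = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', text)
print(m.group(1))  # 110070
```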
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
        # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0

print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines to avoid hammering Instagram's web server and getting blocked while testing, by saving the response to a file.
This is a slice of the original html content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in regexr, it seems to work just fine at this point in time.
There are many posts about the regex r prefix like this one
Also the documentation of the re package shows clearly that this is the issue with the code above.

Using regular expressions to find URL not containing certain info

I'm working on a scraper/web crawler using Python 3.5 and the re module where one of its functions requires retrieving a YouTube channel's URL. I'm using the following portion of code that includes the matching of regular expression to accomplish this:
href = re.compile("(/user/|/channel/)(.+)")
What it should return is something like /user/username or /channel/channelname. It does this successfully for the most part, but every now and then it grabs a type of URL that includes more information like /user/username/videos?view=60 or something else that goes on after the username/ portion.
In an attempt to address this issue, I rewrote the bit of code above as
href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")
along with other variations with no success. How can I rewrite my code so that it fetches URLS that do not include videos?view=60 anywhere in the URL?
Use the following approach with a specific regex pattern:
import re

user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'
pattern = re.compile(r'(/user/|/channel/)([^/]+)')
m = re.match(pattern, user_url)
print(m.group())  # /user/username
m = re.match(pattern, channel_url)
print(m.group())  # /channel/channelname
I used this approach and it seems to do what you want.
import re

user = '/user/username/videos?view=60'
channel = '/channel/channelname/videos?view=60'
pattern = re.compile(r"(/user/|/channel/)[\w]+/")
user_match = re.search(pattern, user)
if user_match:
    print(user_match.group())
else:
    print("Invalid pattern")
pattern_match = re.search(pattern, channel)
if pattern_match:
    print(pattern_match.group())
else:
    print("Invalid pattern")
Hope this helps!
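For completeness, the negative-lookahead variant the question was reaching for can also be made to work; a sketch, assuming the unwanted suffix may appear anywhere after the prefix (note the `?` in `videos?view=60` must be escaped):

```python
import re

# Match /user/... or /channel/... only when 'videos?view=60' does not
# occur anywhere after the prefix; '\?' escapes the literal '?'.
pattern = re.compile(r'(/user/|/channel/)(?!.*videos\?view=60)(.+)')
print(bool(pattern.match('/user/username')))                 # True
print(bool(pattern.match('/user/username/videos?view=60')))  # False
```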

Python-JSON - How to parse API output?

I'm pretty new.
I wrote this python script to make an API call from blockr.io to check the balance of multiple bitcoin addresses.
The contents of btcaddy.txt are bitcoin addresses separated by commas. For this example, let it parse this.
import urllib2
import json
btcaddy = open("btcaddy.txt","r")
urlRequest = urllib2.Request("http://btc.blockr.io/api/v1/address/info/" + btcaddy.read())
data = urllib2.urlopen(urlRequest).read()
json_data = json.loads(data)
balance = float(json_data['data''address'])
print balance
raw_input()
However, it gives me an error. What am I doing wrong? For now, how do I get it to print the balance of the addresses?
You've done multiple things wrong in your code. Here's my fix. I recommend a for loop.
import json
import urllib
addresses = open("btcaddy.txt", "r").read()
base_url = "http://btc.blockr.io/api/v1/address/info/"
request = urllib.urlopen(base_url+addresses)
result = json.loads(request.read())['data']
for balance in result:
    print balance['address'], ":", balance['balance'], "BTC"
You don't need an input at the end, either.
Your question is clear, but your attempts are not.
You said you have a file with more than one record, so you need to retrieve the lines of this file.
with open("btcaddy.txt","r") as a:
    addresses = a.readlines()
Now you can iterate over the records and make a request to this URI. The urllib module is enough for this task.
import json
import urllib.request

base_url = "http://btc.blockr.io/api/v1/address/info/%s"
for address in addresses:
    request = urllib.request.urlopen(base_url % address)
    result = json.loads(request.read().decode('utf8'))
    print(result)
HTTP responses arrive as bytes, so you should use decode('utf8') to turn the data into a string.
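A minimal illustration with a hypothetical payload shaped roughly like the blockr.io response (the address and balance are made up):

```python
import json

# urlopen().read() returns bytes; decode to str before json.loads
# (required on older Python 3 versions, harmless on newer ones).
raw = b'{"status": "success", "data": {"address": "1ExampleAddr", "balance": 0.5}}'
result = json.loads(raw.decode('utf8'))
print(result['data']['balance'])  # 0.5
```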

Python IndexError: no such group

I started learning Python earlier today and as my first project I wanted to make a script that shows me today's weather forecast.
My script:
import urllib2, re
url = urllib2.urlopen('http://www.wetter.com/wetter_aktuell/wettervorhersage/heute/deutschland/oberhausen/DE0007740.html')
html = url.read()
url.close()
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
x = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), x.group('temp')
I used this as a template. When I run this script I get an IndexError: no such group.
You are overwriting x.
Maybe you want:
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
y = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), y.group('temp')
And I can't believe that the site you linked advocates using regular expressions for extracting information from HTML.
