I'm experimenting with parsing web pages in Python, and I thought I would write a script that searches for links to my favorite shows online. I want the program to search sidereel.com for a good link to a given show and return the links to me. I know that the site saves the links in the following format:
'watch-freeseries.mu', then some long string that I need to ignore, followed by '14792088'
So I need to find this string in the text of the page and return only the 8 digits at the end of it. I'm not sure how to get at the digits, and I need them because they are the link number. Any help would be much appreciated.
You could use a regular expression to do this fairly easily.
>>> import re
>>> text = "watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088"
>>> expr = re.compile(r"watch\-freeseries\.mu.*?(\d{8})")
>>> expr.findall(text)
['14792088']
A breakdown of the expression:
watch\-freeseries\.mu - Match the start of the expected expression. Escape any possible special characters by preceding them with \.
.*? - Match any run of characters. . matches any character (except a newline) and * repeats it zero or more times. The ? makes the match non-greedy, so it stops at the first run of 8 digits and does not swallow a second URL that shows up in the same string.
(\d{8}) - Match and save the last 8 digits
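To see why the non-greedy ? matters, here is a small sketch that runs the same expression with and without it on a string holding two links. The second link and its number are made up for illustration.

```python
import re

# Same expression as above, once non-greedy and once greedy.
LAZY = re.compile(r"watch-freeseries\.mu.*?(\d{8})")
GREEDY = re.compile(r"watch-freeseries\.mu.*(\d{8})")

# Two links in one string; the second one is a hypothetical example.
text = ("watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088 "
        "watch-freeseries.mu=qwertyuiop12345678")

lazy_ids = LAZY.findall(text)      # one id per link
greedy_ids = GREEDY.findall(text)  # .* runs to the end, only the last id survives

print(lazy_ids)    # ['14792088', '12345678']
print(greedy_ids)  # ['12345678']
```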
Note: If you're trying to parse links out of a webpage there are easier ways. I've seen many recommendations on StackOverflow for the BeautifulSoup package in particular. I've never used it myself so YMMV.
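As a rough sketch of the "parse the links instead" idea, the following uses the standard-library html.parser module rather than BeautifulSoup, so it runs without extra packages. The sample HTML, the class name LinkIdCollector and the assumption that the id sits inside an href attribute are all mine, not something the site confirms.

```python
import re
from html.parser import HTMLParser

LINK_ID = re.compile(r"watch-freeseries\.mu.*?(\d{8})")

class LinkIdCollector(HTMLParser):
    """Collects the 8-digit id from every <a href=...> that matches LINK_ID."""

    def __init__(self):
        super().__init__()
        self.link_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                match = LINK_ID.search(value)
                if match:
                    self.link_ids.append(match.group(1))

# Hypothetical page fragment, only for illustration.
sample_html = (
    '<a href="http://watch-freeseries.mu/goto/lklsflam14792088">Episode 1</a>'
    '<a href="http://example.com/about">About</a>'
)

collector = LinkIdCollector()
collector.feed(sample_html)
print(collector.link_ids)  # ['14792088']
```

Only the href values are searched, so text elsewhere on the page that happens to contain 8 digits is ignored.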
Related
I would like suggestions on extracting a substring from a range of URLs. The code I'm writing should extract this piece of info (the actual id of the URL) from URLs in incoming events from our web tracker.
Take these URLs (the URLs that contain the substrings I'm looking for is in the format of the first three)
https://www.rbnett.no/sport/i/LA8gxP/_
https://www.itromso.no/sport/sprek/i/GGobq6/derfor-vraker-tromsoes-beste-loeper-sesongens-eneste-konkurranse-det-er-for-risikabelt-aa-delta
https://www.adressa.no/sport/fotball/i/9vyQGW/brann-treneren-ferdig-avsluttet-pressekonferansen-med-aa-sitere-max-manus
https://www.rbnett.no/dakapo/banner/
https://www.adressa.no/search/
where I want to extract the substrings "LA8gxP", "GGobq6" and "9vyQGW" from the three former URLs respectively, without hitting "dakapo", "banner" or "search" from the latter two.
I'm asking for suggestions on a regexp to extract that piece of info. As far as I know, the substrings only contain a-z, A-Z, and 0-9. The substrings seem to be only 6 chars long, but that will probably change over time.
The best solution (using Python) I have found so far is this:
match = re.search(r"/i/([a-zA-Z0-9]+)/", url)
substring = match.group(1)
It works, but I don't find it to be very elegant.
Also, it's relying on having the /i/-pattern as a prefix. Even though it looks like a consistent pattern, I'm not 100% sure if it is.
The only other alternative I can think of is:
\/i\/(.+)\/
Here is the demo: https://regex101.com/r/2iOyCE/1
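One possible way to make the /i/ approach a bit more robust is to run the pattern against the URL path only and to handle the no-match case explicitly. This is a sketch under the assumption that the id is always the path segment right after /i/; the function name extract_article_id is made up.

```python
import re
from urllib.parse import urlparse

# The id is the alphanumeric segment directly after "/i/",
# terminated by the next "/" or by the end of the path.
ID_SEGMENT = re.compile(r"/i/([A-Za-z0-9]+)(?:/|$)")

def extract_article_id(url):
    """Return the id after /i/ in the URL path, or None if there is none."""
    path = urlparse(url).path
    match = ID_SEGMENT.search(path)
    return match.group(1) if match else None

urls = [
    "https://www.rbnett.no/sport/i/LA8gxP/_",
    "https://www.adressa.no/sport/fotball/i/9vyQGW/brann-treneren-ferdig-avsluttet-pressekonferansen-med-aa-sitere-max-manus",
    "https://www.rbnett.no/dakapo/banner/",
    "https://www.adressa.no/search/",
]
ids = [extract_article_id(u) for u in urls]
print(ids)  # ['LA8gxP', '9vyQGW', None, None]
```

Returning None avoids the AttributeError that match.group(1) raises when re.search finds nothing, and the + quantifier keeps working if the ids grow beyond 6 characters.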
I've got a Python script that loops through a list of websites/domains to scrape phone numbers and e-mails from my clients' websites. 99% of the scrapes work fine, but some websites just hang, and I can't even force-break the operation; it behaves like an infinite loop. Below is an example. Could anyone help me improve or fix this?
import requests, re
try:
    r = requests.Session()
    f = r.get('http://www.poffoconsultoria.com.br', verify=False, allow_redirects=False, timeout=(5, 5))
    s = f.text
    tels = set(re.findall(r"\s?\(?0?[1-9][1-9]\)?[-\.\s][2-5]\d{3}\.?-?\s?\d{4}", s))
    emails = set(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s))
    print(tels)
    print(emails)
except Exception as e:
    print(e)
You should remove the \s? from the first regex (you do not really need a whitespace at the start of the match), or replace with (?<!\S) if you want to only match after a whitespace or start of string.
The real problem is with the second regex where . resides in a character class that is quantified with +. The \. that follows it also matches a . and that makes it a problem when no matching text appears in the string. This is catastrophic backtracking.
Since the matches you expect are whole words, I suggest enhancing the pattern by 1) adding word boundaries, 2) making all adjoining subpatterns match different types of chars.
Use
r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b'
to match emails.
See the (?:[A-Za-z0-9-]+\.)+ part: it matches one or more repetitions of 1 or more alphanumeric/hyphen chars followed with a dot, and there is no \. after this pattern, there is an alpha character class, so there should be no problem like the one present before.
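A quick sketch of the suggested pattern, written with a literal @ separator, on a page-like string that contains long runs with no address in them. The sample addresses and the size of the filler are made up.

```python
import re

# Word boundaries plus a domain part that cannot overlap with the dot.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b")

# Long runs of characters from the local-part class, but without any "@".
blob = "A" * 200000
page = blob + " contact: info@example.com.br, sales@my-shop.com " + blob

emails = set(EMAIL.findall(page))
print(sorted(emails))  # ['info@example.com.br', 'sales@my-shop.com']
```

Because of the leading \b, the engine only attempts a match at the start of each long run instead of at every character inside it, which keeps the scan roughly linear in the page length.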
So, I fetched the website data fine in Python 2.7:
>>> string = requests.get('http://www.poffoconsultoria.com.br').text
I then took the length of the string:
>>> len(string)
474038
That's a really high value.
So for problems like these when one sees regex take such a long time (really, after getting the length of the page), you should visit the page in your browser and inspect the page source
When I inspected the page in my browser I found these:
The second regex, which starts with [A-Za-z0-9._%+-]+, will definitely hang (really, take a long time) because that quantifier is unbounded and has to search through those ginormous portions.
You either need to chunk the page or limit your regex. Or you could write a function that discards dictionary data if you suspect that what you need to return won't appear inside of them; basically though, those huge dictionaries above are causing the regex you posted to take a long time.
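One way to "chunk the page" could look like the sketch below: find each @ with str.find and run a bounded regex only on a small window around it. The function name emails_near_at_signs, the window of 100 characters and the 64-character cap on the local part are my assumptions, not values from the question.

```python
import re

# Bounded local part, so a single attempt can never scan a huge run.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]{1,64}@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}")

def emails_near_at_signs(page, window=100):
    """Search for e-mails only in a small window around each '@'."""
    found = set()
    at = page.find("@")
    while at != -1:
        start = max(0, at - window)
        chunk = page[start:at + window]
        for match in EMAIL.finditer(chunk):
            # Keep only the match that covers this '@', so addresses that
            # are cut off at the window edge are not reported half-way.
            if match.start() <= at - start < match.end():
                found.add(match.group(0))
        at = page.find("@", at + 1)
    return found

# Hypothetical page: one address buried between two very large blobs.
page = "x" * 300000 + ' {"mail": "joe@example.com"} ' + "y" * 300000
result = emails_near_at_signs(page)
print(result)  # {'joe@example.com'}
```

The large blobs are never handed to the regex engine at all, so their size stops mattering.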
Use a regex that validates the e-mail address:
(?i)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#$%&'*+/=?^`{}|~\w])*)?[0-9a-z]@))(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][-a-z0-9]{0,22}[a-z0-9]))
I am looking to find all matches in a string and print all substrings until I match these strings to a new line.
e.g.
"123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
should print:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
where "ABC" is the pattern match which is recurring.
Is there an efficient way I can do so using findall?
New to Python here, using python version 2.4.3
Edit just an F.Y.I:
What I am trying to do is basically I have a 250+Gb file which has control characters showing start and end of line but these Ctrl Characters (because of issues.. mostly network) are embedded within these lines i.e. in between the start/end indicating control characters.
With that, there is no specific distinction between the start/end control chars and the ones that come in between these messages.
So I am basically removing these control chars, and I wish to have one complete message per line, each matching some specific regex.
The regex here is not necessarily ABC or in order for all of these messages.
I have tried using findall and am able to find all the matches; I just did not know how to get the strings following these until I find the next match. (The regex here can be either -ABC=35nga|DEF=64325:dfaf:1234| or **ABC=35632|DEF=61 and many different forms.)
And I have to break each line, including the ones that have multiple lines embedded within a line.
Using re.findall:
See the regex in action on regex101.
s = "123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
re.findall("ABC.*?(?=ABC|$)",s)
which gives a list:
['ABC97edf', 'ABCaaabbdd1234', 'ABC0009ui50', 'ABC_1234']
And if you wanted to print the elements in this list, you could simply do:
for sub in re.findall("ABC.*?(?=ABC|$)", s):
    print(sub)
which would output:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
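Since the question says the marker is not always ABC, the same lookahead idea can be wrapped in a small helper that takes the marker as a regex. This is only a sketch: the name split_at_markers is made up, it is written for Python 3 (so it would need adapting for 2.4), and it works on one string in memory rather than on a 250 GB file.

```python
import re

def split_at_markers(text, marker_regex):
    """Return each marker together with everything up to the next marker."""
    pattern = re.compile(
        r"(?:%s).*?(?=(?:%s)|$)" % (marker_regex, marker_regex),
        re.DOTALL,  # let '.' cross embedded newlines as well
    )
    # group(0) is used instead of findall so that capturing groups
    # inside marker_regex do not change the shape of the result.
    return [m.group(0) for m in pattern.finditer(text)]

s = "123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
parts = split_at_markers(s, "ABC")
print(parts)  # ['ABC97edf', 'ABCaaabbdd1234', 'ABC0009ui50', 'ABC_1234']

mixed = split_at_markers("xxABC1DEF2ABC3", "ABC|DEF")
print(mixed)  # ['ABC1', 'DEF2', 'ABC3']
```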
I am using beautifulsoup to scrape different data in websites.
I am trying to scrape the source, but not all the source, just the substring which is important for me.
For example, in this item, I would like to pick just the string between / and .png (which in this case is "nyt") and to save it in a list.
<image width="185" height="26"
xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:href="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt-logo-185x26.svg" src="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt.png" border="0"></image>
I have been trying with several regular expressions like re.search('[a-z]*.png',src).group(0) but nothing works well.
Can anyone tell me what would be the right way to scrape that info??
If you want to find the name of the png inside of the src attribute you can use this regular expression:
src=\s*(\"|\')[^"']+?([^/]+?)\.png\1
You will have to capture the second group in Python in this case.
Here is the explanation:
src=\s* literal to find all "src=" literals followed by any number of optional spaces
(\"|\') group with either a double or single quote.
[^"']+? anything that is not a double or single quote (non greedy).
([^/]+?) anything that is not a forward slash (non greedy).
\.png literal ".png"
\1 back reference to the first group (\"|\')
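Applied in Python, the expression explained above could be used roughly like this. The tag is the one from the question, joined onto fewer lines. The second half is an alternative I am adding for the case where BeautifulSoup has already handed you the src value as a plain string, so no regex is needed there.

```python
import re
import posixpath
from urllib.parse import urlparse

SRC_PNG = re.compile(r"""src=\s*(\"|\')[^"']+?([^/]+?)\.png\1""")

tag = ('<image width="185" height="26" '
       'xlink:href="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt-logo-185x26.svg" '
       'src="https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt.png" border="0"></image>')

# Group 1 is the quote character, group 2 is the file name without ".png".
match = SRC_PNG.search(tag)
name_from_regex = match.group(2) if match else None
print(name_from_regex)  # nyt

# Alternative: split the URL path instead of using a regex.
src = "https://a1.nyt.com/assets/shell/20160613-034030/images/foundation/logos/nyt.png"
name_from_path = posixpath.splitext(posixpath.basename(urlparse(src).path))[0]
print(name_from_path)  # nyt
```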
I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script at the moment to parse some XML files, but have run into a bit of a speed bump. I will show an example of my XML:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant and non relevant xml code. Focus primarily on the CustomerID and OrderId.
My issue lies in parsing a string, similar to the above statement. I have a regexParse definition that works perfectly. However it is not intuitive. I need to match only the part of the string that contains 44444444.
My Current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits, 2) if the text after the word boundary that follows them is non-numeric and matches CustomerId, return them.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you can either help me out with my issue, or point me in the right path (in words of a guide or something along the lines). Any help is appreciated.
Mods if this is in the wrong area apologies, I wanted to post this in the Python discussion because I am not sure if Python regex supports this functionality.
Thanks again all,
darcmasta
import re

txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
# Capture each tag name together with its all-digit content.
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a lookbehind assertion in your regex to make it easy for a human to read:
>>> import re
>>> a = re.compile("(?<=OrderId>)\\d{6}")
>>> a.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['123456']
>>> b = re.compile("(?<=CustomerId>)\\d{8}")
>>> b.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['44444444']
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
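A short sketch of the difference, using 8 digits because that is what the question asks for. The xml string is a trimmed version of the example from the question, and the last pattern, which ties the digits to the CustomerId tag with a lookbehind and a lookahead, is my own addition.

```python
import re

xml = "<OrderId>123456</OrderId><CustomerId>44444444</CustomerId>"

# In a plain literal "\b" is one character, the backspace (0x08).
plain_b = "\b"
raw_b = r"\b"
print(plain_b == "\x08", len(plain_b))  # True 1
print(len(raw_b))                       # 2

# With backspaces instead of word boundaries nothing matches.
broken = re.findall("\b" + r"\d{8}" + "\b", xml)
# With a raw string the re module sees real \b word boundaries.
working = re.findall(r"\b\d{8}\b", xml)
print(broken)   # []
print(working)  # ['44444444']

# Tie the digits to the CustomerId tag without including the tag in the match.
tagged = re.findall(r"(?<=<CustomerId>)\d+(?=</CustomerId>)", xml)
print(tagged)   # ['44444444']
```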