I'm trying to parse YouTube descriptions of songs to compile into a .csv.
Currently I can isolate timecodes, but isolating the song and artist is proving trickier.
First, I catch the whitespace:
# catches whitespace
pattern = re.compile(r'\s+')
Second, the timecodes (to make the string simpler to deal with)
# catches timecodes
pattern1 = re.compile(r'[\d\.-]+:[\d.-]+:[\d\.-]+')
Then I use sub to remove both from the string.
I then try to capture all strings between \n, as this is how the tracklist is formatted:
songBeforeDash = re.search(r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*[\\n]*)+$', description)
The format follows \n[string]-[string]\n
Using this excellent visualiser, I've been able to tweak it so it catches the first result; however, any subsequent results don't match.
Is this a case of stopping at the first result and not catching the others?
Here's a sample of what I'm trying to catch
\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n
You can do that with split()
t = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'
liste = t.split('\n')
liste = liste[1:-1]  # drop the empty strings produced by the leading and trailing \n
print(liste)
re.search only returns the first match in the string.
What you want is to use re.findall which returns all matches.
EDIT - Because each match would consume the trailing newline that the next match needs as its leading newline, your matches would overlap, and findall only returns non-overlapping matches. I suggest editing the regex so that it captures only up to the next newline:
r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*)+$'
If what you want is for them to overlap (meaning you want to capture the newlines too), I suggest looking here to see how to capture overlapping regex patterns.
Also, as suggested by @ChatterOne, using the str.split(separator) method would work well here, assuming no other type of information is present.
description.split('\n')
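To tie the two suggestions together, here is a minimal sketch of the findall route. It assumes the description contains real newline characters (not a literal backslash followed by n) and that every track line has exactly the artist-title shape from the sample; `track_pattern` and the CSV column names are illustrative choices, not something from the question.

```python
import csv
import io
import re

description = (
    "\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\n"
    "Yasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\n"
    "sadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n"
)

# re.MULTILINE makes ^ and $ match at every line boundary, so each line is
# matched independently and no match has to "share" a newline with the next.
track_pattern = re.compile(r"^([^\n-]+)-([^\n]+)$", re.MULTILINE)

# With two capture groups, findall returns a list of (artist, title) tuples.
tracks = track_pattern.findall(description)

# Write the pairs as CSV into an in-memory buffer (swap in a real file as needed).
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["artist", "title"])  # assumed header names
writer.writerows(tracks)
csv_text = buffer.getvalue()
print(csv_text)
```

The first group stops at the first dash, so a title that itself contains a dash would still end up whole in the second group.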
Related
I need to get the sequence at the end of many urls to label csv files. The approach I have taken gives me the result I want, but I am struggling to understand how I might use a positive lookbehind to capture all the characters after the word 'series' in the url while ignoring any metacharacters? I know I can use re.sub() to delete them, however, I am interested in learning how I can complete the whole process in one regex.
I have searched through many posts on how I might do this, and experimented with lots of different approaches but I haven't been able to figure it out. Mainly with replacing the .+ after the (?<=series\-) with something to negate that - but it hasn't worked.
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
res = re.search(r"(?<=series\-).+", url).group(0)
re.sub('-', '', res)
Which gives the desired result 'kbw10a'
Is it possible to strip out the metacharacter '-' in the positive lookbehind? Is there a better approach to this without the lookaround?
More examples:
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
You cannot "ignore" chars in a lookaround the way you describe, because in order to match a part of a string, the regex engine needs to consume the part, from left to right, matching all subsequent subpatterns in your regex.
The only way to achieve that is through an additional step: removing the hyphens once the match is found. Note that you do not need another regex to remove hyphens, .replace('-', '') will suffice:
import re
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
resObj = re.search(r"series-(.+)", url)
if resObj:
    res = resObj.group(1).replace('-', '')
Note it is much safer to first run re.search to get the match object and only then access .group(); otherwise, when there is no match, re.search returns None and calling .group() on it raises an AttributeError.
Also, there is no need for any lookarounds in the pattern; a capturing group works just as well.
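Applied to the example URLs from the question, the same two-step approach could look like the sketch below. The helper name `series_suffix` is made up for illustration, and it assumes 'series-' occurs only once in each URL.

```python
import re

urls = [
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
]


def series_suffix(url):
    """Return the text after 'series-' with hyphens removed, or None if absent."""
    match = re.search(r"series-(.+)", url)
    if match is None:
        return None
    return match.group(1).replace('-', '')


labels = [series_suffix(u) for u in urls]
print(labels)
```

Returning None for a URL without 'series-' lets the caller decide how to label such files instead of crashing midway through a batch.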
I have a URL: http://200.73.81.212/.CREDIT-UNION/update.php. None of the regular expressions I've found or developed myself work. I'm working on a dataset of phishing emails, and there are lots of strange hyperlinks. This is one of my attempts:
https?:\/\/([a-zA-z0-9]+.)+)|(www.[a-zA-Z0-9]+.([a-zA-Z0-9]+\.[a-zA-Z0-9]+)+)(((/[\.A-Za-z0-9]+))+/?.
Of course no success. I work in Python.
EDIT:
I need a regex to catch this kind of URL and also any ordinary hyperlinks, like:
https://cnn.com/
www.foxnews.com/story/122345678
Any thoughts?
What about something like this?
import re
phish = re.compile(r'''(?P<http>http\://)
                   (?P<ipaddress>(([0-9]*(\.)?)[0-9]*)*)/\.
                   (?P<name>(\.)?([A-Za-z]*)(\-)?([A-Za-z]*))/
                   (?P<ending>(update\.php))''', re.VERBOSE)
example_string = 'http://200.73.81.212/.CREDIT-UNION/update.php'
found_matches = []
# check that matches actually exist in input string
if phish.search(example_string):
    # in case there are many matches, iterate over them
    for mtch in phish.finditer(example_string):
        # and append matches to master list
        found_matches.append(mtch.group(0))
print(found_matches)
# ['http://200.73.81.212/.CREDIT-UNION/update.php']
This is flexible enough that, if you have endings other than update.php, you can simply include them in the named capture group by separating the alternatives with |, i.e.
(update\.php|remove\.php|...)
Furthermore, the ipaddress named capture group accepts any number of digit groups such as 123.23.123.12; it doesn't have to be a fixed number of digits followed by a period. Each part of an IPv4 address has at most three digits, so you could pin that down with curly brackets to make sure you are matching the right kind of numbers:
[0-9]{1,3}\. # minimum of 1 digit, maximum of 3
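As a rough illustration of the curly-bracket idea, the sketch below looks for four dot-separated groups of one to three digits. The name `ip_pattern` is illustrative, and the pattern only checks the shape of the address, not that each number is within 0-255.

```python
import re

# Three "digits + dot" groups followed by a final group of digits.
ip_pattern = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")

example_string = 'http://200.73.81.212/.CREDIT-UNION/update.php'
match = ip_pattern.search(example_string)
ip_address = match.group(0) if match else None
print(ip_address)
```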
While @datawrestler's answer works for the original question, I had to extend it to catch a wider group of URLs (I've edited the question). This regex seems to work for the task:
(r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|"
 r"(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|"
 r"(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})")
Three alternatives: https?://www.domain, https?://domain, www.domain
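For anyone wanting to try the three-alternative pattern, here is a small sketch run against the URLs from the question. The groups are rewritten as non-capturing and the sample sentence is made up; finditer with group(0) is used because findall would return tuples of groups rather than whole URLs.

```python
import re

url_pattern = re.compile(
    r"https?://www\.[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)+(?:/[a-zA-Z0-9.#-]+){0,20}"
    r"|https?://[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)+(?:/[a-zA-Z0-9.#-]+){0,20}"
    r"|www\.[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)+(?:/[a-zA-Z0-9.#-]+){0,20}"
)

# Hypothetical email text containing the three URL styles from the question.
text = (
    "Visit http://200.73.81.212/.CREDIT-UNION/update.php or "
    "https://cnn.com/ or www.foxnews.com/story/122345678 today"
)

# group(0) is the whole match, regardless of which alternative matched.
found = [m.group(0) for m in url_pattern.finditer(text)]
print(found)
```

Note that a bare trailing slash (as in https://cnn.com/) is not part of the match, because each path segment needs at least one character after the slash.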
Could you tell me how to print only the part of the line matched by '\w+.226.\w.+'?
Code
import re
VSP = input("VSP number (four digits): ")
a = re.compile(r'\w+.226.\w.+'+VSP)
b = re.search(a, open('Sample.txt').read())
print(b.group())
VSP number (four digits): 1020
10.226.27.60 1020
After I have found the intended line associated with my variable "VSP" in the txt file, how can I exclude the VSP from the output, printing only "10.226.27.60"?
You will need to modify your regex slightly to separate the trailing characters in the IP and the spaces that separate it from VSP. Adding a capture group will let you select the portion with just the IP address. The updated regex looks like this:
r'(\d+\.226\.\S+)\s+' + VSP
\S (uppercase S) matches any non-whitespace, while \s (lowercase s) matches all whitespace. I replaced the first \w with the more specific \d (digits), and . (any character at all) with \. (actual period). The second \w is now \S, but you could use \d+\.\d+ if you wanted to be more specific.
Using the first capture group will give you the IP address:
print(b.group(1))
If you are looking for a single IP address once, not compiling your regex is fine. Also, reading in a small file in its entirety is OK as long as the file is small. If either is not the case, I would recommend compiling the regex and going through the file line by line. That will allow you to discard most lines much faster than using a regex would do.
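A line-by-line version could be sketched as follows. The function name `find_ip` and the sample lines are hypothetical, and re.escape plus the trailing \b are additions of mine so that the user input is treated literally and 1020 does not match inside a longer number.

```python
import re


def find_ip(lines, vsp):
    """Return the IP from the first line whose trailing number equals vsp."""
    # Compile once, then reuse the pattern for every line.
    pattern = re.compile(r"(\d+\.226\.\S+)\s+" + re.escape(vsp) + r"\b")
    for line in lines:
        match = pattern.search(line)
        if match:
            return match.group(1)  # first capture group: the IP only
    return None


# Stand-in for the contents of Sample.txt.
sample_lines = ["10.226.27.59 1019\n", "10.226.27.60 1020\n"]
ip = find_ip(sample_lines, "1020")
print(ip)
```

Since a file object iterates over its lines, the same function should work with `with open('Sample.txt') as f: find_ip(f, VSP)` without reading the whole file into memory.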
I see you already have an answer. You can also try this regex if you want to separate the two groups by the whitespace:
import re
a = re.compile(r'(.+?)\s+(.+)')  # edit: added ? to avoid the greedy behaviour
                                 # of the first .+; otherwise multiple spaces
                                 # after the address would be caught into
                                 # b.group(1), as per @Mad's comment
b = re.search(a, '10.226.27.60 1020')
print (b.group(0))
print (b.group(1))
print (b.group(2))
or customize the first group regexp to your needs.
Edit:
This was not meant to be a proper answer but more of a comment, which I didn't think would be readable as such; I am only trying to show group separation using regex, which it seems the OP didn't know about or didn't use.
That is why I am not matching .226., because the OP can do that. I also removed the file read part, which isn't needed for demonstration. Please read @Mad's answer because it's quite complete and in fact also shows how to use groups.
I want to find all matches in a string and print each match, together with the text that follows it up to the next match, on a new line.
e.g.
"123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
should print:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
where "ABC" is the pattern match which is recurring.
Is there an efficient way I can do so using findall?
New to Python here, using python version 2.4.3
Edit just an F.Y.I:
What I am trying to do: I have a 250+ GB file that uses control characters to mark the start and end of each line, but (because of issues, mostly network) these control characters also end up embedded within the lines, i.e. between the start/end markers.
As a result, there is no specific distinction between the start/end control chars and the ones that appear inside the messages.
So I am basically removing these control chars, and I want to end up with one complete message per line, identified by some specific regex.
The regex here is not necessarily ABC, or in the same order, for all of these messages.
I have tried using findall and am able to find all the matches; I just did not know how to get the strings following them until I find the next match. (The regex here can be either -ABC=35nga|DEF=64325:dfaf:1234| or **ABC=35632|DEF=61, and many other forms.)
And I have to break at each line, including the ones that have multiple lines embedded within a line.
Using re.findall:
See the regex in action on regex101.
s = "123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
re.findall("ABC.*?(?=ABC|$)",s)
which gives a list:
['ABC97edf', 'ABCaaabbdd1234', 'ABC0009ui50', 'ABC_1234']
And if you wanted to print the elements in this list, you could simply do:
for sub in re.findall("ABC.*?(?=ABC|$)",s):
    print(sub)
which would output:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
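Since the question mentions that the marker is not always ABC, the same lookahead idea can be wrapped in a small helper that takes the marker as a parameter. This is only a sketch: `split_records` is a made-up name, the marker is assumed to be a valid regex fragment, and re.DOTALL is my addition so that a record may contain newlines.

```python
import re


def split_records(text, marker):
    """Return each marker plus the text after it, up to the next marker or the end."""
    # Lazy .*? stops as soon as the lookahead sees another marker (or the end).
    pattern = re.compile("(?:%s).*?(?=(?:%s)|$)" % (marker, marker), re.DOTALL)
    return [m.group(0) for m in pattern.finditer(text)]


s = "123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
records = split_records(s, "ABC")
for record in records:
    print(record)
```

For a very large file this would still need to be fed in chunks or lines; finditer only avoids building intermediate lists within the text it is given.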
My script works fine doing this:
images = re.findall(r"src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall(r"\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
Demo on Regex Tester
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to get rid of all parentheses and directly matching what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
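If you still need to know which links are images and which are videos after a single pass, one option is named groups with finditer, sketched below. The `doc` string is a made-up fragment built from the two URLs in the question (the real page markup may differ), and the periods and the src=" prefix are matched literally here, which is an assumption about what was intended.

```python
import re

# Hypothetical stand-in for the downloaded page.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg"> '
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/'
    'tumblr_lo8i76CWSP1qi02cl" type="video/mp4">'
)

link_pattern = re.compile(
    r'src="(?P<image>\S*?media\.tumblr\S*?tumblr_\S*?jpg)'
    r'|(?P<video>http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)

images, videos = [], []
# One scan of the document; the named group tells us which alternative matched.
for match in link_pattern.finditer(doc):
    if match.group("image"):
        images.append(match.group("image"))
    else:
        videos.append(match.group("video"))

print(images)
print(videos)
```

This also sidesteps a common surprise with findall and alternation: with several capture groups it returns tuples in which the group that did not match is an empty string.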