Adjust this regex to remove ":" from matches - python

I've written a very long, very specific regex to match any Japanese text in a file. Right now I'm testing it on json files and one of the problems I'm running into is trying to exclude a specific string ":" while avoiding removing : from possible matches.
Regex: (?<=[\"])[―一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤~セクハラ゚゛!?+()【】『』←→↓↑←→、<>…・・。◎■◆×★= ”0-9A-Za-z.?!:;&/^%$##*_+\-\\\[\]\(\)\"\'\ ]+(?=[\"])
Line: "8": { "name":"礼礼礼礼礼礼", "礼礼礼:name礼":"礼礼礼", "M1":"", "M2":"礼 礼礼礼 礼礼 礼礼礼", "S1":""},
Expected Matches: 8 name 礼礼礼礼礼礼 礼礼礼:name礼 礼礼礼 M1 M2 礼 礼礼礼 礼礼 礼礼礼 S1
Actual Matches: 8 name":"礼礼礼礼礼礼 礼礼礼:name礼":"礼礼礼 M1":" M2":"礼 礼礼礼 礼礼 礼礼礼 S1":"
Edit:
Forgot to mention this, but it also needs to handle escaped quotes, e.g. "礼礼礼礼礼礼\"礼礼礼礼礼礼礼礼礼\"礼礼礼"
Please note I'm interested in using this regex in multiple filetypes, not just json, which is why I'm not simply using a key/value regex. Though that would probably make things much easier.

I recommend implementing a separate class for each file type you want to read. I'd also recommend using whatever tools are available to you that let you avoid writing a regex like this.
That being said, you could use a non-greedy quantifier with your existing regex and simply drop the elements you don't need.
Your regex (non-greedy):
(?<=\")[―一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤~セクハラ゚゛!?+()【】『』←→↓↑←→、<>…・・。◎■◆×★= ”0-9A-Za-z.?!:;&/^%$##*_+\-\\\[\]\(\)\"\'\ ]+?(?=[\"])
would result in several matches of :, which you might want to drop if that's a viable option.
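As a different route from the non-greedy approach above, here is a minimal Python sketch that extracts the contents of double-quoted strings with an escape-aware pattern and drops empty results. It assumes every string of interest is wrapped in double quotes and that quotes inside a string are escaped with a backslash. The names QUOTED and quoted_strings are made up for illustration.

```python
import re

# A double-quoted string: any run of characters that are neither a quote
# nor a backslash, or a backslash followed by any single character.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def quoted_strings(text):
    """Return the non-empty contents of every double-quoted string in text."""
    return [s for s in QUOTED.findall(text) if s]


line = ('"8": { "name":"礼礼礼礼礼礼", "礼礼礼:name礼":"礼礼礼", "M1":"", '
        '"M2":"礼 礼礼礼 礼礼 礼礼礼", "S1":""},')
print(quoted_strings(line))
```

Because the pattern consumes each string from its opening quote to its closing quote, the ":" separators between strings never show up as matches, while a colon inside a string (as in 礼礼礼:name礼) is kept.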

Related

Issues with re.search and unicode in python [duplicate]

I have been trying to extract certain text from PDF converted into text files. The PDF came from various sources and I don't know how they were generated.
The pattern I was trying to extract was simply two digits, followed by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected that to work.
However, when I tested it I found that it missed some hits. Later I noticed that at least two other characters were being used as hyphens: \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is: since I am going to process so many PDFs that I can't know what other hyphen variations are out there, is there a regex expression that covers all "hyphens", and hopefully looks better than the [-\u2212\xad] expression?
The solution you ask for in the question title implies a whitelisting approach and means that you need to find the chars that you think are similar to hyphens.
You may refer to the Punctuation, Dash category (Pd); that Unicode category lists all the Unicode hyphens possible.
You may use a PyPi regex module and use \p{Pd} pattern to match any Unicode hyphen.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode chars that contain minus in their Unicode names, see this list.
A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).
Note that the "soft hyphen", U+00AD, is not included in the \p{Pd} category and won't be matched by that construct. To include it, create a character class and add it (the first form is for the regex module, the second for re):
[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
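If you are limited to re and would rather not maintain the list by hand, one option is to generate the class from the standard library's unicodedata tables. The sketch below is only an illustration under that assumption: the names dashes, DASH_CLASS and PAIR are invented here, and the exact set of Pd characters depends on the Unicode version bundled with your Python build.

```python
import re
import sys
import unicodedata

# Collect every code point in the Unicode "Pd" (Punctuation, Dash) category.
dashes = [chr(cp) for cp in range(sys.maxunicode + 1)
          if unicodedata.category(chr(cp)) == "Pd"]

# Add the two characters from the question that are not in Pd:
# U+00AD (soft hyphen, category Cf) and U+2212 (minus sign, category Sm).
dashes += ["\u00ad", "\u2212"]

# re.escape keeps "-" from being read as a range inside the class.
DASH_CLASS = "[" + "".join(re.escape(c) for c in dashes) + "]"
PAIR = re.compile(r"\d\d" + DASH_CLASS + r"\d\d")

sample = "12-34 56\u221278 90\u00ad12 11\u201322 33x44"
print(PAIR.findall(sample))
```

Scanning the whole code point range takes a moment, so build the class once at import time rather than on every call.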
This is also a possible solution, if your regex engine allows it
/\p{Dash}/u
This will include all these characters.

how to find occurrence of special character using regex

I have a URL like this
http://foo.com/bar_by_baz.html
Now I want to extract baz from that URL using a regex, but so far I have only managed to write this much:
[_]+?\w[^.]+
This gives me
_by_baz
as output. How can I match a special character exactly once, or what would be the best approach to solve this using a regex?
I am trying this on Python 3.x.
Here's your regex: [_]+?([^_.]+). On the sample URL the first match captures by and the last match captures baz, so take the group of the last match. The concept is to keep underscores and dots out of the captured text.
In another case, this works based on capturing only the alphanumerics [_]+?([A-Za-z0-9]+)
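A short Python sketch of how that pattern behaves on the sample URL, plus one possible anchored variant. The anchored pattern is an assumption of mine (it presumes the wanted piece sits right before the final file extension), not something taken from the answer.

```python
import re

url = "http://foo.com/bar_by_baz.html"

# The answer's pattern matches once per underscore-prefixed piece,
# so findall returns every piece and baz is the last one.
pieces = re.findall(r"[_]+?([^_.]+)", url)
last_piece = pieces[-1]

# Assumed variant: require the captured piece to be followed by
# ".extension" at the very end of the string.
anchored = re.search(r"_([^_.]+)\.[^./]+$", url)

print(pieces, last_piece, anchored.group(1))
```

The anchored form avoids relying on match order, at the cost of failing on URLs that have no extension.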
I am going to assume from your profile that you are seeking a javascript-friendly solution (you should update your question & tags).
For javascript, you could use this pattern: /[^_]+(?=\.[a-z]+$)/
The pattern matches the substring containing no underscores that is followed by a dot and then one or more lowercase letters up to the end of the string.
There will be several ways to accomplish your task. Finding the best/most efficient one can only be achieved if you provide more information about the coding environment/language and a few more sample strings.

Improving accuracy/brevity of regex for inconsistent url filtering

So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of urls to retrieve only the parent domain (ex. "domain.com|org|etc"). The only caveat is that they are not all nice and matching in format.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search(r'\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks up the first "/" at the end of the url, then returns the two "." separated fields before it.
So, my query, would anyone in the stack hive mind have any wisdom to shed on how this could be done with better/shorter regex, or regex that doesn't rely on a forward lookup of the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than 2-letter country codes and 3-letter others. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 letters and up. Also, you may have URLs without any slashes, and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches possible multiple non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
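The breakdown above can be tried out with a few lines of Python. This is a minimal sketch: the pattern is the one from the answer, while the helper name parent_domain and the sample list are assumptions made for the example.

```python
import re

DOMAIN = re.compile(r"^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})")


def parent_domain(url):
    """Return the first capture group, or None when the URL does not match."""
    m = DOMAIN.match(url)
    return m.group(1) if m else None


samples = [
    "www.foobar.com/fizz/buzz",
    "http://www.domain.com/urlstuff",
    "hello.in.con.sistent.urls.com/urlstuff",
    "www.domain.co.uk/urlstuff",
]
for url in samples:
    print(url, "->", parent_domain(url))
```

Note that the pattern always keeps only the last two dot-separated labels, so www.domain.co.uk yields co.uk rather than domain.co.uk. Handling multi-part suffixes like that needs a list of known suffixes, which a regex alone cannot supply.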
You can use this regex instead:
import re
firsturl = 'www.foobar.com/fizz/buzz'
domain = re.match(r"(.+?)/", firsturl).group(1)
Notice, though, that this will only work without 'http://', and that it returns the full host name (www.foobar.com here) rather than just the parent domain.

How to combine multiple regular expressions into one line?

My script works fine doing this:
images = re.findall(r"src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall(r"\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to get rid of all parentheses and directly matching what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and drop the uppercase range from the character class:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
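To keep images and videos apart while still scanning the document only once, one possibility is to give each alternative a named group and iterate with re.finditer. The sketch below makes several assumptions: the group names image and video and the variable LINKS are invented here, and doc is a tiny made-up stand-in for the real page that reuses the two URLs from the question.

```python
import re

# One pattern, two alternatives; each alternative has its own named group.
LINKS = re.compile(
    r'src.\"(?P<image>\S*?media.tumblr\S*?tumblr_\S*?jpg)'
    r'|(?P<video>http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)

# Hypothetical stand-in for the real document.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg"> '
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">'
)

images, videos = [], []
for m in LINKS.finditer(doc):
    # lastgroup is the name of the group that took part in this match.
    if m.lastgroup == "image":
        images.append(m.group("image"))
    else:
        videos.append(m.group("video"))

print(images)
print(videos)
```

Using re.findall with the same pattern would instead return a tuple per match with an empty string for the alternative that did not match, which then has to be filtered.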

Regex pattern to match two datetime formats

I am doing a directory listing and need to get all directory names that follow one of two patterns: Feb14-2014 and 14022014-sometext. The directory names must not contain dots, so I don't want to match 14022014-sometext.more. As you can see, I want to match just the directories that follow the pattern %b%d-%Y or %d%m%Y-textofanylengthWithoutDots.
For the first case it should be something like [a-zA-Z]{3}\d{2}. I don't know how to write the rest because my regex skills are poor, sorry. So I hope someone can tell me what the correct patterns look like. Thanks.
I am assuming each directory name in the listing is on its own line:
([A-Z]\w{2}\d{1,2}\-\d{4}|\d{7,8}\-\w+)$
Will match both cases and will match the text only if it is uninterrupted (by dots or anything else for that matter) until it hits the end of the line.
Some notes:
If you want to match everything except dot you may replace the final \w+ with [^.]+.
You need the multiline modifier /m (re.MULTILINE in Python) for this to work; otherwise the $ will match only at the end of the string.
I've not added a ^ to the start of the regex, but you may do so if each line contains a single directory
Of course you may expand this regex to include (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) instead of [A-Z]\w{2}. I've not done this to keep it readable. I would also suggest you store the month names in a Python list and insert them dynamically into your regex for maintainability's sake.
See it in action: http://regex101.com/r/pS6iY9
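Putting those notes together in Python might look like the sketch below. It is an illustration under stated assumptions rather than the answer's exact pattern: it anchors each line with ^, builds the month alternation from a list, requires exactly eight digits for the second format, and uses [^.\n]+ so that dots are rejected. The names MONTHS, DIR_NAME and listing are made up for the example.

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_alt = "|".join(MONTHS)

# Either MonDD-YYYY or DDMMYYYY-text, each filling a whole line.
DIR_NAME = re.compile(
    rf"^(?:(?:{month_alt})\d{{1,2}}-\d{{4}}|\d{{8}}-[^.\n]+)$",
    re.MULTILINE,  # let ^ and $ match at every line boundary
)

# Hypothetical directory listing, one name per line.
listing = "Feb14-2014\n14022014-sometext\n14022014-sometext.more\nXyz14-2014\n"
print(DIR_NAME.findall(listing))
```

The pattern only checks the shape of the name. Validating that the digits form a real date would need a follow-up step such as datetime.strptime.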
That's quite easy.
The best one I can make is:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d)|(\d\d\d\d\d\d\d\d-\w+)
The first part ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d) matches the first kind of dates and the second part (\d\d\d\d\d\d\d\d-\w+) - the second kind.
