Regular expression to match a string when reading a file - python

I have files which contain information in the following manner
2458813.92557 10 20 30 #00FA0040000000010100005AB9000FFE86000F3596109000000703000100001000000000000036404E000000004000000020000000000032*
All of this is in the same line. I am only interested in getting only the portions in bold. I have the following regular expression to get what i want:
^(\d{7}\.\d{5}).*#([\dA-Z]+)\*
The regex works fine but when i use it this in python it does not include the # and the * in the second bold string. I am using re.match(r'^(\d{7}\.\d{5}).*#([\dA-Z]+\*)') in python. I would love to know why this is and what would be the solution to it.
Thanks

Grouping was wrong, use below regex.
^(\d{7}\.\d{5}).*(#[\dA-Z]+\*)

Related

How to find filenames with a specific extension using regex?

How can I grab 'dlc3.csv' & 'spongebob.csv' from the below string via the absolute quickest method - which i assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops but its slowing my program down way too much.
I would post an example of my current code but its got a load of other stuff in it so it would only cause you to ask more questions.
In a nutshell im opening a large 6,000 line .csv file and im then using nested for loops to iterate through each line and using .split() to find specific parts in each line. I have many files where i need to scan specific things on each line and atm ive only implemented a couple features into my Qt program and its already taking upto 5 seconds to load some things and up to 10 seconds for others. All of which is due to the nested loops. Ive looked at where to use range, where not to, and where to use enumerate. I also use time.time() and loggin.info() to show each code changes speed. And after asking around ive been told that using a regex is the best option for me as it would remove the need for many of my for loops. Problem is i have no clue how to use regex. I of course plan on learning it but if someone could help me out with this it'll be much appreciated.
Thanks.
Edit: just to point out that when scanning each line the filename is unknown. ".csv" is the only thing that isnt unknown. So i basically need the regex to grab every filename before .csv but of course without grabbing the crap before the filename.
Im currently looking for .csv using .split('/') & .split('|'), then checking if .csv is in list index to grab the 'unknown' filename. And some lines will only have 1 filename whereas others will have 2+ so i need the regex to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash (or newline)
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO

Adjust this regex to remove ":" from matches

I've written a very long, very specific regex to match any Japanese text in a file. Right now I'm testing it on json files and one of the problems I'm running into is trying to exclude a specific string ":" while avoiding removing : from possible matches.
Regex: (?<=[\"])[―一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤~セクハラ゚゛!?+()【】『』←→↓↑←→、<>…・・。◎■◆×★= ”0-9A-Za-z.?!:;&/^%$##*_+\-\\\[\]\(\)\"\'\ ]+(?=[\"])
Line: "8": { "name":"礼礼礼礼礼礼", "礼礼礼:name礼":"礼礼礼", "M1":"", "M2":"礼 礼礼礼 礼礼 礼礼礼", "S1":""},
Expected Matches: 8 name 礼礼礼礼礼礼 礼礼礼:name礼 礼礼礼 M1 M2 礼 礼礼礼 礼礼 礼礼礼 S1
Actual Matches: 8 name":"礼礼礼礼礼礼 礼礼礼:name礼":"礼礼礼 M1":" M2":"礼 礼礼礼 礼礼 礼礼礼 S1":"
Edit:
Forgot to mention this but it also needs to be able to handle escaped quotes. i.e "礼礼礼礼礼礼\"礼礼礼礼礼礼礼礼礼\"礼礼礼"
Please note I'm interested in using this regex in multiple filetypes, not just json, which is why I'm not simply using a key/value regex. Though that would probably make things much easier.
I recommend you implement another class for any file type you want to read. Also I'd recommend using whatever tools are available for you that let you avoid writing regex like this.
That being said you could use a non-greedy quantifier with you existing regex and simply drop the elements you don't need.
Your regex (non-greedy):
(?<=\")[―一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤~セクハラ゚゛!?+()【】『』←→↓↑←→、<>…・・。◎■◆×★= ”0-9A-Za-z.?!:;&/^%$##*_+\-\\\[\]\(\)\"\'\ ]+?(?=[\"])
would result in several matches of :, which you might want to drop if that's a viable option.

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌​au\/ns\/business\/wi‌​ki","author":null,"d‌​escription":null,"fi‌​leAssetId":"034b9317‌​-60d9-45c2-b6d6-0f24‌​b59e1991","filename"‌​:"Reports.pdf"},"cre‌​atedBy":1531,"create‌​dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌​bat.png","id":3041,"‌​inheritedPermissions‌​":false,"name":"map"‌​,"permissions":[23,8‌​7,35,49,65],"type":3‌​,"viewLevel":2},{"__‌​type":"WikiNode:http‌​:\/\/samplesite.com.‌​au\/ns\/business\/wi‌​ki","children":[],"c‌​ontent":
I am wanting to get the "fileAssetId" and filename". Ive tried to load the like with Pythons JSON module but I get an error
For the FileAssetid I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But i get the following 034b9317‌​, 60d9, 45c2, b6d6, 0f24‌​b59e1991
Im not to sure how to get the data as its displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator with the objects.
You can use python's walk method and check each entry with re.match.
In case that the string you got is not convertable to a python dict, you can use just regex:
print re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_pattern).group(1)
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
executing this yields:
34b9317‌​-60d9-45c2-b6d6-0f24‌​b59e1991
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(? &object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change last line in json_pattern to match not document but individual objects replacing (?&document) by (?&object). I think the regex is easier than I expected, but I did not run extensive tests on this. It works fine for me and I have tested hundreds of files. I wil try to improve my answer in case I find any issue when running it.

How to combine multiple regular expressions into one line?

My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
Demo on Regex Tester
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to get rid of all parentheses and directly matching what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*

Extracting a string from a txt file

So im just experimenting, trying to parse through the web using python and i thought i would try to make a script that would search for my favorite links to watch shows online. Im trying to now have my program search through sidereel.com for a good link to my desired show and return to me the links. I know that the site saves the links in the following format:
watch-freeseries.mu'then some long string that i need to ignore followed by '14792088'
So what i need to be able to do is to find this string in the txt file of the site and return to me only the 8 numbers at the end of the string. I not sure how i can get to the numbers and i need them because they are the link number. Any help would be much appreciated
You could use a regular expression to do this fairly easily.
>>> import re
>>> text = "watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088"
>>> expr = re.compile("watch\-freeseries\.mu.*?(\d{8})")
>>> expr.findall(text)
['14792088']
A breakdown of the expression:
watch\-freeseries\.mu - Match the start of the expected expression. Escape any possible special characters by preceding them with \.
.*? - Match any character. . means any character and * means that appear one after the other an infinite amount of times. The ? is to perform a non-greedy match so that the match will not overlap if two or more urls show up in the same string.
(\d{8}) - Match and save the last 8 digits
Note: If you're trying to parse links out of a webpage there are easier ways. I've seen many recommendations on StackOverflow for the BeautifulSoup package in particular. I've never used it myself so YMMV.

Categories

Resources