split() not producing expected results

split() not producing expected results - python

I have a problem with python split which I can't figure out what I am missing that results in the split function not to work properly. I have been using similar splits before and they worked just fine.
content=open(file).read)()
Sep = content.split(r'Document [a-zA-Z0-9]{25}\n')
The file I am reading is something very easy as:
"I like coffee.
Document CLASSAR020181030eeat0000l
I like tea as well.
Document CLASSAR020181030eeat0000l
I like both coffee and tea."

str.split() splits using a fixed delimiter, not a regular expression. You need to use re.split().
import re
sep = re.split(r'Document [a-zA-Z0-9]{25}\n', content)

Error - regular expression syntax on string methods
content is a string. You cannot call the split method on this variable as it will invoke a string-based method that expects a separator. This separator must be a fixed string, and not a regular expression.
Solution - Use re module
You can instead use methods within the regular expression module, as you're using regular expression syntax:
import re
with open(file, 'r') as fp:
content = fp.read()
pattern = re.compile(r'Document \w{25}\n')
separated = pattern.split(content)
The with block is just best practice for opening files in python. It
is a context manager that automatically closes your file when you're
finished. You may run into problems in the future if you don't use
this.
The regular expression I have used is slightly different to yours. It
does exactly the same thing. However, \w is short for
[a-zA-Z0-9]. I.e., it matches any alphanumeric character.
We are using the split method again. However, this split method is part of the re module, not string, as our pattern variable is an re object.

Related

Pandas ignores separator passed as parameter

I am working on a function that among other tasks, is supposed to read a csv in pandas. As one of the parameters, I would like to pass the separator as a string. However, for some reason, probably something to do with regular expressions, pandas is totally ignoring my passed parser and defaults to '\t', which does not parse my data correcty.
import pandas as pd
def open_df(separator):
df = pd.read_csv('filename.csv', sep=separator)
return df
Question is, how am I suppose to pass the separator parameter in this case?

Please Check this link:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can,
meaning the latter will be used and automatically detect the separator
by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators
longer than 1 character and different from '\s+' will be interpreted
as regular expressions and will also force the use of the Python
parsing engine. Note that regex delimiters are prone to ignoring
quoted data. Regex example: '\r\t'.

I passed the seperator string as "raw" string and that worked fine for me.
I you use a raw string \ is interpreted as a normal character and also \t will work
When you call open_df() you have to write a r before the string quotes like open_df(r"\t")
Example:
test_string = r"\t\n"
print(test_string)
\t\n
And I also passed "python" as engine parameter in order to not display the parser warning :-).

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","author":null,"d‌escription":null,"fi‌leAssetId":"034b9317‌-60d9-45c2-b6d6-0f24‌b59e1991","filename"‌:"Reports.pdf"},"cre‌atedBy":1531,"create‌dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌bat.png","id":3041,"‌inheritedPermissions‌":false,"name":"map"‌,"permissions":[23,8‌7,35,49,65],"type":3‌,"viewLevel":2},{"__‌type":"WikiNode:http‌:\/\/samplesite.com.‌au\/ns\/business\/wi‌ki","children":[],"c‌ontent":
I am wanting to get the "fileAssetId" and filename". Ive tried to load the like with Pythons JSON module but I get an error
For the FileAssetid I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But i get the following 034b9317‌, 60d9, 45c2, b6d6, 0f24‌b59e1991
Im not to sure how to get the data as its displayed.

How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator with the objects.

You can use python's walk method and check each entry with re.match.
In case that the string you got is not convertable to a python dict, you can use just regex:
print re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_pattern).group(1)
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
executing this yields:
34b9317‌-60d9-45c2-b6d6-0f24‌b59e1991

Try adding \n to the string that you are entering in to the file (\n means new line)

Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(? &object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change last line in json_pattern to match not document but individual objects replacing (?&document) by (?&object). I think the regex is easier than I expected, but I did not run extensive tests on this. It works fine for me and I have tested hundreds of files. I wil try to improve my answer in case I find any issue when running it.

How to clean a string using python regular expression

I have the following string which have to clean
#import re
addr="abcd&^fhj"
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$#\,\. \t\r\n]')
re.search(problemchars,addr)

In that case use re.sub searching \W (non-alphanum) and replacing by nothing.
import re
addr="abcd&^fhj"
print(re.sub("\W","",addr))
("\W+" works too, but not sure it would be more performant)

you could use the filter function as well if you don't want to go with regex
line = "abcd&^fhj"
line = filter(str.isalpha, line)
print line # Change for python3
Output :
abcdfhj
Edit: For python 3 you could change the print statement like this since the filter function returns an iterable.
print(''.join(list(line)))

python regular expression not matching file contents with re.match and re.MULTILINE flag

I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a django query to run regexes based on the query results values. My regex seems like it should be working, and works if I copy the values returned by the query, but for some reason isn't matching when all the parts are working together that ends like this
My code is:
with open("/path_to_my_file") as myfile:
data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
regexString = "^\s*"+item.feature_key+":"
print regexString #to verify its what I want it to be, ie debug
pq = re.compile(regexString, re.M)
if pq.match(data):
#do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into a big old string, and copy the value(s) printed by the print regexString line, it does match, so I'm thinking theres some esoteric python/django thing going on (or maybe not so esoteric as python isnt my first language).
And for examples sake, the output of print regexString is :
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, dumped the types of both item.feature and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head off the desk after working this for a couple hours, so any help is appreciated. Cheers!

According to documentation, re.match never allows searching at the beginning of a line:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use a re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s+"): in this case, it is equivalent to "\s+" because there is no \s escape sequence (like \r or \n), thus, Python treats it as a raw string literal. Still, it is safer to always declare regex patterns with raw string literals in Python (and with corresponding notations in other languages, too).

Loading regular expression patterns from external source?

I have a series of regular expression patterns defined for automated processing of text. Due to the design of the program, it's better to have these patterns separate in a text file, namely a JSON file. The pattern in Python is of r'' type, but all I can provide is a string. I'd like to retain functionalities such as grouping. I'd like to have features such as entities ([A-z]), so I'm not talking about escaping everything.
I'm using Python 3.4. How do I properly load these patterns into the re module? And what kind of escaping problem should I watch out for?

I am not sure what you want but have a look at this.:
If you have a file called input.txt containing \d+
Then you can use it this way:
import re
f=open("input.txt","r")
x="asasd3243sdfdsf23234sdsdf"
print re.findall(r""+f.readline(),x)
Output:['3243', '23234']
When you use r mode you need not escape anything.

The r'' thing in Python is not a different type than simple ''. The r'' syntax simply creates a string that looks exactly like the one you typed, so the \n sequence stays as \n, and isn't turned into a new line (same thing happens to other special characters). This little r simply escapes everything you type.
Check it yourself with this two simple lines in the console:
print('test \n test')
print(r'test \n test')
print(type(r''))
print(type(''))
Now, while you read lines from JSON file, the escaping is done for you. I don't know how will you create the JSON file, but you should take a look at the json module, and the load method, that will allow you to read a JSON file.

You can use re.escape to escape the strings. However this is escaping everything and you might want some special chars. I'd just use the strings and be careful about placing \ in the right places.
BTW: If you have many regular expressions, matching might get slow. You might want to consider some alternatives such esmre.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.