I have a string (from an API call) that looks something like this:
val=
{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},
{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}
I need to do temp=json.loads(val), but the problem is that the string is not valid JSON: the keys and values do not have quotes around them. Adding the quotes by hand worked.
How can I programmatically add the quotes to such a string before reading it as JSON?
Also, how can I replace the numbers in scientific notation with decimals? E.g. 0d-2 becomes "0" and 8.0d-1 becomes "0.8".
You could catch anything that's a string with a regex and replace it accordingly.
Assuming the strings that need quotes:
start with a letter
can have numbers at the end
never start with numbers
never have numbers or special characters in between them
This would be a regex code to catch them:
([a-z]*\d*):
You can try it out here. Or learn more about regex here.
Let's do it in python:
import re
# catch a string in json
# note the single quotes, and keep the literal on one line!
json_string = '{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}'
# search the strings according to our rule
# (raw string, so that \d reaches the regex engine unchanged)
string_search = re.search(r'([a-z]*\d*):', json_string)
# extract the first capture group; so everything we matched in brackets
# this is to exclude the colon at the end from the found string as
# we don't want to enquote the colons as well
extracted_strings = string_search.group(1)
This is a solution in case you want to build a loop later.
However, if you just want to catch all matching strings as a Python list, you can simply do the following instead:
import re
# catch ALL strings in json
# note the single quotes, and keep the literal on one line!
json_string = '{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}'
extract_all_strings = re.findall(r'([a-z]*\d*):', json_string)
# note that this by default catches only our capture group in brackets
# so no extra step required
That covers the regex basics of finding everything.
From here you could either use re.sub to replace every match with itself in quotes, or generate a list of replacements first to verify that everything went right (probably something you'd rather do with this somewhat fragile approach).
Note that this is why I wrote a more comprehensive answer instead of just pointing you to a "re.sub" one-liner.
You can approach your scientific number notation problem in the same way.
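As a minimal sketch of the re.sub route, assuming that bare words consist only of letters, digits and underscores, that already-quoted strings contain no spaces, and that the d-notation always has a digit on both sides of the d. The function name parse_loose_json is made up for illustration. Since JSON accepts an e exponent, swapping d for e is enough to let json.loads read the numbers as floats:

```python
import json
import re

val = ('{input:a,matches:[{in:["w","x","y","z"],output:{num1:0d-2,num2:7.0d-1}},'
       '{in:["w","x"],output:{num1:0d-2,num2:8.0d-1}}]}')

def parse_loose_json(text):
    # hypothetical helper, not part of any library
    # 1) turn Fortran-style exponents (7.0d-1) into JSON exponents (7.0e-1)
    text = re.sub(r'(\d)[dD]([+-]?\d)', r'\1e\2', text)
    # 2) quote every bare word that is not already next to a quote
    #    (assumption: quoted strings contain no spaces)
    text = re.sub(r'(?<![\w"])([A-Za-z_]\w*)(?![\w"])', r'"\1"', text)
    return json.loads(text)

result = parse_loose_json(val)
print(result["input"])                 # a
print(result["matches"][1]["output"])  # {'num1': 0.0, 'num2': 0.8}
```

Printing the intermediate string before calling json.loads is a cheap way to check the replacements on real data first.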
I ultimately want to split a string by a certain character. I tried a regex, but it started escaping \, so I want to avoid that with another approach (all my attempts at unescaping the string failed). So, I want to get all positions of a character char in a string that are not within quotes, so I can split the string accordingly.
For example, given the phrase hello-world:la\test, I want to get back 11 if char is :, as that is the only : in the string, and it is at index 11. However, re does split it, but I get ['hello-world,lat\\test'].
EDIT:
@BoarGules made me realize that re didn't actually change anything; it's just how Python displays backslashes.
Here's a function that works:
import re

def split_by_char(string, char=':'):
    # each match is a run of plain characters and complete quoted sections
    pattern = re.compile(rf'''((?:[^{re.escape(char)}"']|"[^"]*"|'[^']*')+)''')
    return [m.group(0) for m in pattern.finditer(string)]
string = 'hello-world:la\test'
char = ':'
print(string.find(char))
Prints
11
char_index = string.find(char)
string[:char_index]
Returns
'hello-world'
string[char_index+1:]
Returns
'la\test'
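Note that str.find does not know about quotes. If the positions of char outside quoted sections are what is needed, a small scanner can track whether it is currently inside quotes. This is only a sketch: the function name unquoted_positions is made up, and it assumes quotes are never escaped or nested.

```python
def unquoted_positions(text, char=":"):
    """Return the indices of char that are not inside '...' or "..."."""
    positions = []
    quote = None  # the quote character we are currently inside, if any
    for i, c in enumerate(text):
        if quote:
            if c == quote:   # closing quote found
                quote = None
        elif c in "\"'":     # opening quote
            quote = c
        elif c == char:
            positions.append(i)
    return positions

print(unquoted_positions(r'hello-world:la\test'))  # [11]
print(unquoted_positions('a:"b:c":d'))             # [1, 7]
```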
Solution for the case you're likely encountering (a pseudo-CSV format you're hand-rolling a parser for; if you're not in that situation, it's still a likely situation for people finding this question later):
Just use the csv module.
import csv
import io
test_strings = ['field1:field2:field3', 'field1:"field2:with:embedded:colons":field3']
for s in test_strings:
    for row in csv.reader(io.StringIO(s), delimiter=':'):
        print(row)
which outputs:
['field1', 'field2', 'field3']
['field1', 'field2:with:embedded:colons', 'field3']
correctly ignoring the colons within the quoted field, requiring no kludgy, hard-to-verify hand-written regexes.
I have a string where I am trying to replace ["{\" with [{" and all \" with ".
I am struggling to find the right syntax for this. Does anyone have a solid understanding of how to do it?
I am working with JSON, and I am inserting a string into the JSON properties. This put single quotes around the data inserted from my variable, and I need those single quotes gone. I tried calling json.dumps() on the data and doing a string replace, but it does not work.
Any help is appreciated. Thank you.
You can use the replace method.
See documentation and examples here
I would recommend maybe posting more of your code below so we can suggest a better answer. Just based on the information you have provided, I would say that what you are looking for are escape characters. I may be able to help more once you provide us with more info!
Use the target/replacement strings as arguments to replace().
The general format is mystring = mystring.replace("old_text", "new_text")
Since your target strings have backslashes, you also probably want to use raw strings to prevent them from being interpreted as special characters.
mystring = "something"
mystring = mystring.replace(r'["{\"', '[{"')
mystring = mystring.replace(r'\"', '"')
If you want to do this by hand for a multi-character target, check for the first character and then for the following ones, each of which must come directly after the previous one. Whenever the condition is satisfied, shorten the array by 3 elements in the first case, and delete the \ from the array in the second case.
You can also find a particular substring with a built-in function and use replace() to insert the string you want in its place.
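If the escaped quotes come from inserting a string that already contains JSON text into another structure (an assumption, since the question does not show the code), parsing that string first avoids the replacements altogether. A minimal sketch with made-up data:

```python
import json

# hypothetical data: `inner` is a string that already holds JSON text
inner = '{"name": "x"}'

# inserting the string as-is makes json.dumps escape its quotes
double_encoded = json.dumps({"items": [inner]})

# parsing it first inserts a real object, so nothing needs replacing later
clean = json.dumps({"items": [json.loads(inner)]})

print(double_encoded)  # {"items": ["{\"name\": \"x\"}"]}
print(clean)           # {"items": [{"name": "x"}]}
```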
I want to develop a regex in Python where a component of the pattern is defined in a separate variable and combined to a single string on-the-fly using Python's .format() string method. A simplified example will help to clarify. I have a series of strings where the space between words may be represented by a space, an underscore, a hyphen etc. As an example:
new referral
new-referal
new - referal
new_referral
I can define a regex string to match these possibilities as:
space_sep = r'[\s\-_]+'
(The hyphen is escaped to ensure it is not interpreted as defining a character range.)
I can now build a bigger regex to match the strings above using:
myRegexStr = "new{spc}referral".format(spc = space_sep)
The advantage of this method for me is that I need to define lots of reasonably complex regexes where there may be several different commonly-occurring strings that occur multiple times and in an unpredictable order; defining commonly-used patterns beforehand makes the regexes easier to read and allows the strings to be edited very easily.
However, a problem occurs if I want to define the number of occurrences of other characters using the {m,n} or {n} structure. For example, to allow for a common typo in the spelling of 'referral', I need to allow either 1 or 2 occurrences of the letter 'r'. I can edit myRegexStr to the following:
myRegexStr = "new{spc}refer{1,2}al".format(spc = space_sep)
However, now all sorts of things break due to confusion over the use of curly braces (either a KeyError in the case of {1,2} or an IndexError: tuple index out of range in the case of {n}).
Is there a way to use the .format() string method to build longer regexes whilst still being able to define number of occurrences of characters using {n,m}?
You can double the { and } to escape them or you can use the old-style string formatting (% operator):
my_regex = "new{spc}refer{{1,2}}al".format(spc="hello")
my_regex_old_style = "new%(spc)srefer{1,2}al" % {"spc": "hello"}
print(my_regex) # newhellorefer{1,2}al
print(my_regex_old_style) # newhellorefer{1,2}al
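To check that the doubled braces behave as intended with the real separator pattern, here is a short sketch that compiles the regex from the question and runs it against the four sample strings (the names pattern and samples are only illustrative):

```python
import re

space_sep = r'[\s\-_]+'

# {{1,2}} survives .format() as the regex quantifier {1,2}
pattern = re.compile("new{spc}refer{{1,2}}al".format(spc=space_sep))

samples = ["new referral", "new-referal", "new - referal", "new_referral"]
for s in samples:
    print(s, bool(pattern.fullmatch(s)))  # True for all four samples
```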
I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is printing all the text set in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time...
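For reference, the JSON-based idea can be sketched in Python 3 as follows. This assumes each line holds one JSON object preceded by some non-JSON prefix, and it uses a shortened, made-up value instead of the long one from the question. The helper name extract_jol is hypothetical. Note that it returns the whole value, including the trailing .tr:

```python
import json

def extract_jol(line):
    """Return the value stored under "JOL", or None if the line can't be parsed."""
    start = line.find('{"')  # skip whatever prefix precedes the real object
    if start == -1:
        return None
    try:
        data = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    return data.get("JOL")

line = '{:{"JOL":"abc%2Bdef.tr","LAPTOP":"error"}'
print(extract_jol(line))  # abc%2Bdef.tr
```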
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'
My program is a simple calculator, so I need to parse the expression the user types, to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need to transform an input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these single quotes inside the parentheses. How can I do that?
UPDATE:
To be clearer, I want to add a single quote after every occurrence of "MM(" and before the ")" that follows it, and after every occurrence of "func(" and before the "," that follows it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answers to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that it could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for a literal MM( (written MM\( in the pattern), followed by any number (including 0) of characters that aren't a close-parenthesis ([^\)]*), which are stored as a group (because of the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma, and it has to run on the result of the first substitution so that both changes are kept.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
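To make the limitation concrete, here is a small sketch that chains the two substitutions in a helper (the name quote_arguments is made up) and runs it on the input from the question and on an input with nested parentheses:

```python
import re

def quote_arguments(expr):
    # quote the argument of MM(...), assuming it contains no ')'
    expr = re.sub(r"MM\(([^)]*)\)", r"MM('\1')", expr)
    # quote the first argument of func(...), assuming it contains no ','
    return re.sub(r"func\(([^,]*),", r"func('\1',", expr)

print(quote_arguments("23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"))
# 23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))

print(quote_arguments("MM((H2O))"))
# MM('(H2O')) -- the closing quote lands before the inner ')'
```

The second call shows why nested parentheses need a real parser rather than a regex.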