If I run an Athena query in AWS, the data I get back has structs with key/value pairs that look like this:
{
"events": "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
}
I can use regular expressions to parse this, but things like special characters make that de-serialization very error-prone.
For example, {deviceType=Android, date=2022-01-01} will run into issues with delimiters if I use regex.
Is there an existing de-serializer for this type of thing?
EDIT:
This is the de-serialize regex I have:
import json
import re
def deserialize(s):
    # Surround any word with "
    s1 = re.sub(r'(\w+)', r'"\g<1>"', s)
    # Replace = with :
    s2 = re.sub('=', ':', s1)
    return json.loads(s2)
This hits issues when the value contains special characters like "-" or ".". The \w+ pattern stops at those characters, so the regex can't tell where the "word" ends and the enclosing quotes land in the wrong place.
The data inside the quotes is almost JSON but it's missing the quotes around keys and values. With a few judiciously chained .replace() method calls, you should be able to convert it from almost-JSON to JSON and then deserialize it using the json module:
import json
obj = {"events": "[{deviceType=Android, date=2022-01-01}]"}
events = obj['events']
events_json = events.replace(', ', ',').replace('{', '{"').replace('}', '"}').replace('=', '":"').replace(',', '","').replace('}","{','},{')
parsed = json.loads(events_json)
print(parsed[0])
print(parsed[0]['deviceType']) # prints 'Android'
print(parsed[0]['date']) # prints '2022-01-01'
*Edit to fix an issue raised by MisterMiyagi.
Instead of parsing this not-quite-JSON I recommend casting maps and arrays to JSON in your queries:
SELECT CAST(events AS JSON) AS events …
This has the added benefit of making the output less ambiguous to parse (e.g. without casting to JSON there is no way to know if "[1, 2, 3]" was an array of integers or strings, or if "[hello, world]" was an array of two elements, or one element with a comma inside).
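As a rough illustration of the client side, here is a minimal Python sketch of reading such a column once it has been cast. The string below is a made-up example of what the column might contain, assuming events is an array of maps with string values; the exact text you get back depends on your column types.

```python
import json

# Hypothetical value of the "events" column after CAST(events AS JSON);
# the real text depends on the column's type in your table.
row = {"events": '[{"deviceType":"Android","date":"2022-01-01"}]'}

# Because the value is real JSON, the standard parser handles '-' and '.' safely.
events = json.loads(row["events"])
print(events[0]["date"])
```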
Given the data as shown, you can isolate the strings between curly brackets with RE then further split those strings into their component parts. Here's an example:
import re
d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}
for t in re.findall(r'(?<={).+?(?=})', d['events']):
    for p in t.split(','):
        print(p)
Output:
deviceType=Android
logins=400
deviceType=iPhone
logins=550
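Building on the loop above, the pairs can be collected into dictionaries instead of printed. This is only a sketch: parse_events is a hypothetical helper name, and it assumes that neither keys nor values contain commas or curly brackets.

```python
import re

def parse_events(raw):
    """Turn '[{k=v,k=v},{...}]' into a list of dicts (values stay strings)."""
    records = []
    # Each non-greedy match is the text between one pair of curly brackets.
    for body in re.findall(r'\{(.+?)\}', raw):
        record = {}
        for pair in body.split(','):
            # partition splits on the first '=' only, so '-' and '.' in values are untouched.
            key, _, value = pair.partition('=')
            record[key.strip()] = value.strip()
        records.append(record)
    return records

events = parse_events("[{deviceType=Android, date=2022-01-01},{deviceType=iPhone,logins=550}]")
print(events)
```

All values come back as strings, so numeric fields such as logins would still need an explicit int() conversion.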
Related
I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
I understand that the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)
print(s)
it should print:
xyz
without regex, keep in mind that this solution removes all digits and not only the ones at the end of a string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
xyz
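If the goal is specifically to drop the caret-plus-number markers described in the question, a single substitution may be enough. This is a sketch that assumes each marker is a literal ^ followed by one or more digits; the sample text is made up.

```python
import re

text = "abc^0def^1ghi^12"

# \^ matches a literal caret; \d+ matches the number that follows it.
cleaned = re.sub(r"\^\d+", "", text)
print(cleaned)
```

Unlike the digit-stripping examples above, this leaves digits that are not preceded by a caret untouched.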
How can I copy data from changing string?
I tried to slice, but length of slice is changing.
For example in one case I should copy number 128 from string '"edge_liked_by":{"count":128}', in another I should copy 15332 from "edge_liked_by":{"count":15332}
You could use a regular expression:
import re
string = '"edge_liked_by":{"count":15332}'
number = re.search(r'{"count":(\d*)}', string).group(1)
Really depends on the situation, however I find regular expressions to be useful.
To grab the numbers from the string without caring about their location, you would do as follows:
import re
def get_string(string):
    return re.search(r'\d+', string).group(0)
>>> get_string('"edge_liked_by":{"count":128}')
'128'
To get numbers only from the end of the string, you can use an anchor so that the match is pulled from the far end. The following example grabs a sequence of unbroken digits that is preceded by a colon and ends within 5 characters of the end of the string:
import re
def get_string(string):
    rval = None
    string_match = re.search(r':(\d+).{0,5}$', string)
    if string_match:
        rval = string_match.group(1)
    return rval
>>> get_string('"edge_liked_by":{"count":128}')
'128'
>>> get_string('"edge_liked_by":{"1321":1}')
'1'
In the above example, adding the colon will ensure that we only pick values and don't match keys such as the "1321" that I added in as a test.
If you just want anything after the last colon, but excluding the bracket, try combining split with slicing:
>>> '"edge_liked_by":{"count":128}'.split(':')[-1][0:-1]
'128'
Finally, considering this looks like a JSON object, you can add curly brackets to the string and treat it as such. Then it becomes a nested dict you can query:
>>> import json
>>> string = '"edge_liked_by":{"count":128}'
>>> string = '{' + string + '}'
>>> string = json.loads(string)
>>> string.get('edge_liked_by').get('count')
128
The regex and split approaches return a string, while the final one returns a number because the value is parsed as JSON.
It looks like the type of string you are working with is read from JSON, maybe you are getting it as the output of some API you are working with?
If it is JSON, you've probably gone one step too far in atomizing it to a string like this. I'd work with the original output, if possible, if I were you.
If not, I'd make it valid JSON by wrapping it in {} and then parse it with json.loads.
import json
string = '"edge_liked_by":{"count":15332}'
string = "{"+string+"}"
json_obj = json.loads(string)
count = json_obj['edge_liked_by']['count']
count will have the desired output. I prefer this option to using regular expressions because you can rely on the structure of the data and reuse the code in case you wish to parse out other attributes, in a very intuitive way. With regular expressions, the code you use will change if the data are decimal, or negative, or contain non-numeric characters.
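To illustrate the reuse point, the same wrapping trick can be put in a small helper. This is a sketch: get_field is a hypothetical name, and it assumes the fragment becomes valid JSON once wrapped in braces.

```python
import json

def get_field(fragment, *path):
    """Wrap a '"key":value' fragment in braces, parse it, and walk the given keys."""
    obj = json.loads("{" + fragment + "}")
    for key in path:
        obj = obj[key]
    return obj

likes = get_field('"edge_liked_by":{"count":15332}', "edge_liked_by", "count")
# Negative or decimal values need no change to the code.
delta = get_field('"edge_liked_by":{"count":-1.5}', "edge_liked_by", "count")
print(likes, delta)
```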
Does this help ?
a='"edge_liked_by":{"count":128}'
import re
b=re.findall(r'\d+', a)[0]
b
Out[16]: '128'
I am trying to parse a file where some of the lines may contain a combination of single quotes, double quotes, and contractions. Each observation includes a string as shown above. When trying to parse the data, I am running into problems when trying to parse the reviews. For example:
\'text\' : \'This is the first time I've tried really "fancy food" at a...\'
or
\'text\' : \'I' be happy to go back "next hollidy"\'
Preprocess your string with a simple double replace - first escape all the quotation marks, then replace all the escaped apostrophes with a quotation mark - which will simply invert the escapes, e.g.:
# we'll define it as an object to keep the validity
src = "{\\'text\\' : \\'This is the first time I've tried really \"fancy food\" at a...\\'}"
# The double escapes are just so we can type it properly in Python.
# It's still the same underneath:
# {\'text\' : \'This is the first time I've tried really "fancy food" at a...\'}
preprocessed = src.replace("\"", "\\\"").replace("\\'", "\"")
# Now it looks like:
# {"text" : "This is the first time I've tried really \"fancy food\" at a..."}
Which is now a valid JSON (and a Python dictionary, incidentally) so you can go ahead and parse it:
import json
parsed = json.loads(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}
Or:
import ast
parsed = ast.literal_eval(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}
UPDATE:
Based on the posted line, you actually have a (valid) representation of a 7-element tuple containing a string representation of a dictionary as its third element, so you don't need to preprocess the string at all. What you need is to first evaluate the tuple, then post-process the inner dict with another level of evaluation, i.e.:
import ast
# lets first read the data from a 'input.txt' file so we don't have to manually escape it
with open("input.txt", "r") as f:
    data = f.read()
data = ast.literal_eval(data) # first evaluate the main structure
data = data[:2] + (ast.literal_eval(data[2]), ) + data[3:] # .. and then the inner dict
# this gives you `data` containing your 'serialized' tuple, i.e.:
print(data[4]) # 31.328237,-85.811893
# and you can access the children of the inner dict as well, i.e.:
print(data[2]["types"]) # ['restaurant', 'food', 'point_of_interest', 'establishment']
print(data[2]["opening_hours"]["weekday_text"][3]) # Thursday: 7:00 AM – 9:00 PM
# etc.
That being said, I'd suggest tracking down whoever is generating data like this and convincing them to use some proper form of serialization; even the most basic JSON would be better than this.
I have a set of strings as follows:
text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'
I extracted this data from an XLS file and converted it to strings.
Now I have to extract the data inside the single quotes and put it in a list.
The expected output looks like:
[MUC-EC-099_SC-Memory-01_TC-25, MUC-EC-099_SC-Memory-01_TC-26,MUC-EC-099_SC-Memory-01_TC-27]
Thanks in advance.
Use re.findall:
>>> import re
>>> strs = """text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'"""
>>> re.findall(r"'(.*?)'", strs, re.DOTALL)
['MUC-EC-099_SC-Memory-01_TC-25',
'MUC-EC-099_SC-Memory-01_TC-26',
'MUC-EC-099_SC-Memory-01_TC-27'
]
You can use the following expression:
(?<=')[^']+(?=')
This matches one or more characters that are not ' and that are enclosed between ' and '.
Python Code:
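import re
quoted = re.compile("(?<=')[^']+(?=')")
# `row` and the list `i` are assumed to come from the surrounding spreadsheet-reading code
for value in quoted.findall(str(row[1])):
    i.append(value)
print(i)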
That text: prefix seems a little familiar. Are you using xlrd to extract it? In that case, the reason you have the prefix is because you're getting the wrapped Cell object, not the value in the cell. For example, I think you're doing something like
>>> sheet.cell(2,2)
number:4.0
>>> sheet.cell(3,3)
text:u'C'
To get the unwrapped object, use .value:
>>> sheet.cell(3,3).value
u'C'
(Remember that the u here is simply telling you the string is unicode; it's not a problem.)
I have an input file in Fortran "namelist" format that I would like to parse with Python regular expressions. The easiest way to demonstrate is with a fictitious example:
$VEHICLES
CARS= 1,
TRUCKS = 0,
PLAINS= 0, TRAINS = 0,
LIB='AUTO.DAT',
C This is a comment
C Data variable spans multiple lines
DATA=1.2,2.34,3.12,
4.56E-2,6.78,
$END
$PLOTTING
PLOT=T,
PLOT(2)=12,
$END
So the keys can contain regular variable-name characters as well as parentheses and numbers. The values can be strings, booleans (T, F, .T., .F., TRUE, FALSE, .TRUE., .FALSE. are all possible), integers, floating-point numbers, or comma-separated lists of numbers. Keys are connected to their values with equal signs. Key-value pairs are separated by commas, but can share a line. Values can span multiple lines for long lists of numbers. Comments are any line beginning with a C. There is generally inconsistent spacing before and after '=' and ','.
I have come up with a working regular expression for parsing the keys and values and getting them into an Ordered Dictionary (need to preserve order of inputs).
Here's my code so far. I've included everything from reading the file to saving to a dictionary for thoroughness.
import re
from collections import OrderedDict
with open('file.dat','r') as f:
    file_str=f.read()
#Compile regex pattern for requested namelist
name='Vehicles'
p_namelist = re.compile(r"\$"+name.upper()+r"(.*?)\$END",flags=re.DOTALL|re.MULTILINE)
#Execute regex on file string and get a list of captured tokens
m_namelist = p_namelist.findall(file_str)
#Check for a valid result
if m_namelist:
    #The text of the desired namelist is the first captured token
    namelist=m_namelist[0]
    #Split into lines
    lines=namelist.splitlines()
    #List comprehension which returns the list of lines that do not start with "C"
    #Effectively remove comment lines
    lines = [item for item in lines if not item.startswith("C")]
    #Re-combine now that comment lines are removed
    namelist='\n'.join(lines)
    #Create key-value parsing regex
    p_item = re.compile(r"([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)",flags=re.DOTALL|re.MULTILINE)
    #Execute regex
    items = p_item.findall(namelist)
    #Initialize namelist ordered dictionary
    n = OrderedDict()
    #Remove undesired characters from value
    for item in items:
        n[item[0]] = item[1].strip(',\r\n ')
My question is whether I'm going about this correctly. I realize there is a ConfigParser library, which I have not yet attempted. My focus here is the regular expression:
([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)
but I went ahead and included the other code for thoroughness and to demonstrate what I'm doing with it. For my Regular Expression, because the values can contain commas, and the key-value pairs are also separated by commas, there is no simple way to isolate the pairs. I chose to use a forward look-ahead to find the next key and "=". This allows everything between the "=" and the next key to be the value. Finally, because this doesn't work for the last pair, I threw in "|$" into the forward look-ahead meaning that if another "VALUE=" isn't found, look for the end of the string. I figured matching the value with [^=]+ followed by a look-ahead was better than trying to match all possible value types.
While writing this question I came up with an alternative Regular Expression that takes advantage of the fact that numbers are the only value that can be in lists:
([^\s,\=]+?)\s*=\s*((?:\s*\d[\d\.\E\+\-]*\s*,){2,}|[^=,]+)
This one matches either a list of 2 or more numbers with (?:\s*\d[\d\.\E\+\-]*\s*,){2,} or anything before the next comma with [^=,].
Are these somewhat messy Regular Expressions the best way to parse a file like this?
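To make the look-ahead pattern concrete, here is a small sketch of it run on a trimmed-down namelist body. The sample text and the final split of DATA into a list are illustrative assumptions, not part of the code above.

```python
import re
from collections import OrderedDict

# Trimmed-down namelist body (comment lines already removed); sample data only.
namelist = "CARS= 1,\nTRUCKS = 0,\nDATA=1.2,2.34,\n4.56E-2,6.78,\n"

# Same pattern as in the question: a key, '=', then everything up to the
# next "KEY =" (found by the look-ahead) or the end of a line/string.
p_item = re.compile(
    r"([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)",
    flags=re.DOTALL | re.MULTILINE,
)

n = OrderedDict()
for key, value in p_item.findall(namelist):
    n[key] = value.strip(',\r\n ')

# Illustrative extra step: split a multi-line list value into its items.
data_items = [v.strip() for v in n['DATA'].split(',')]
print(list(n.keys()))
print(data_items)
```

The multi-line DATA value survives as one entry because the value group is allowed to run across newlines until the look-ahead sees the next key or the end of the input.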
I would suggest developing a slightly more sophisticated parser.
I stumbled upon a project on Google Code that implements very similar parser functionality, Fortran Namelist parser for Python prog/scripts, but it was built for a slightly different format.
I played with it a little and updated it to support the structure of the format in your example.
Please see my version on gist:
Updated Fortran Namelist parser for python https://gist.github.com/4506282
I hope this parser will help you with your project.
Here is example output produced by the script after parsing the FORTRAN example:
{'PLOTTING':
{'par':
[OrderedDict([('PLOT', ['T']), ('PLOT(2) =', ['12'])])],
'raw': ['PLOT=T', 'PLOT(2)=12']},
'VEHICLES':
{'par':
[OrderedDict([('TRUCKS', ['0']), ('PLAINS', ['0']), ('TRAINS', ['0']), ('LIB', ['AUTO.DAT']), ('DATA', ['1.2', '2.34', '3.12', '4.56E-2', '6.78'])])],
'raw':
['TRUCKS = 0',
'PLAINS= 0, TRAINS = 0',
"LIB='AUTO.DAT'",
'DATA=1.2,2.34,3.12',
'4.56E-2,6.78']}}
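The parsed values in this output are all strings. Below is a sketch of converting them to Python types, following the value kinds listed in the question (booleans, integers, floats, quoted strings). convert_value is a hypothetical helper, not part of the gist, and it does not handle Fortran double-precision exponents such as 1.0D0.

```python
def convert_value(token):
    """Convert one namelist token (a string) to bool, int, float, or str."""
    text = token.strip()
    # Normalise .TRUE./.T./TRUE/T style booleans by dropping the dots.
    upper = text.upper().strip('.')
    if upper in ('T', 'TRUE'):
        return True
    if upper in ('F', 'FALSE'):
        return False
    # Quoted strings: drop the surrounding single quotes.
    if len(text) >= 2 and text[0] == text[-1] == "'":
        return text[1:-1]
    # Try the narrowest numeric type first, then fall back to float.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    # Anything else is left as a plain string.
    return text

converted = [convert_value(v) for v in ['T', '12', '4.56E-2', "'AUTO.DAT'"]]
print(converted)
```

One consequence of this ordering is that a bare T or F is always read as a boolean, so it would not suit a namelist where those letters can also be unquoted string values.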