How can I extract data from a string whose length changes?
I tried slicing, but the length of the slice I need varies.
For example, in one case I need to copy the number 128 from the string '"edge_liked_by":{"count":128}', and in another I need to copy 15332 from '"edge_liked_by":{"count":15332}'.
You could use a regular expression:
import re
string = '"edge_liked_by":{"count":15332}'
number = re.search(r'\{"count":(\d+)\}', string).group(1)
It really depends on the situation, but I find regular expressions useful for this.
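One thing the snippet above does not cover is input where the pattern is missing: re.search then returns None and .group(1) raises AttributeError. A minimal sketch of a more defensive version follows; the helper name extract_count and the precompiled COUNT_PATTERN are illustrative choices, not part of the original answer.

```python
import re

# Hypothetical helper: matches the digits that follow "count":
COUNT_PATTERN = re.compile(r'"count":(\d+)')

def extract_count(text):
    """Return the count as an int, or None if the pattern is absent."""
    match = COUNT_PATTERN.search(text)
    return int(match.group(1)) if match else None

print(extract_count('"edge_liked_by":{"count":128}'))    # 128
print(extract_count('"edge_liked_by":{"count":15332}'))  # 15332
print(extract_count('"edge_liked_by":{}'))               # None
```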
To grab the numbers from the string without caring about their location, you would do as follows:
import re
def get_string(string):
    return re.search(r'\d+', string).group(0)
>>> get_string('"edge_liked_by":{"count":128}')
'128'
To get only the number at the end of the string, you can use an anchor to ensure the result is pulled from the far end. The following example grabs an unbroken sequence of digits that is preceded by a colon and ends within 5 characters of the end of the string:
import re
def get_string(string):
    rval = None
    string_match = re.search(r':(\d+).{0,5}$', string)
    if string_match:
        rval = string_match.group(1)
    return rval
>>> get_string('"edge_liked_by":{"count":128}')
'128'
>>> get_string('"edge_liked_by":{"1321":1}')
'1'
In the above example, adding the colon will ensure that we only pick values and don't match keys such as the "1321" that I added in as a test.
If you just want anything after the last colon, but excluding the bracket, try combining split with slicing:
>>> '"edge_liked_by":{"count":128}'.split(':')[-1][0:-1]
'128'
Finally, considering this looks like a JSON object, you can add curly brackets to the string and treat it as such. Then it becomes a nested dict you can query:
>>> import json
>>> string = '"edge_liked_by":{"count":128}'
>>> string = '{' + string + '}'
>>> string = json.loads(string)
>>> string.get('edge_liked_by').get('count')
128
The regex and split approaches return a string; the final one returns an integer because the value is parsed as JSON.
It looks like the string you are working with comes from JSON; maybe you are getting it as the output of some API?
If it is JSON, you've probably gone one step too far in atomizing it to a string like this. I'd work with the original output, if possible.
If that isn't possible, I'd turn the fragment into valid JSON by wrapping it in {} and then parse it with json.loads.
import json
string = '"edge_liked_by":{"count":15332}'
string = "{"+string+"}"
json_obj = json.loads(string)
count = json_obj['edge_liked_by']['count']
count will have the desired output. I prefer this option to using regular expressions because you can rely on the structure of the data and reuse the code in case you wish to parse out other attributes, in a very intuitive way. With regular expressions, the code you use will change if the data are decimal, or negative, or contain non-numeric characters.
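To illustrate the point about negative and decimal values, here is a small sketch. The helper name parse_fragment and the "score"/"value" keys are made up for the example; only the "edge_liked_by" fragment comes from the question.

```python
import json

def parse_fragment(fragment):
    """Wrap a '"key": value' fragment in braces and parse it as JSON."""
    return json.loads('{' + fragment + '}')

likes = parse_fragment('"edge_liked_by":{"count":15332}')
print(likes['edge_liked_by']['count'])  # 15332

# Hypothetical fragment: the same code handles a negative decimal unchanged.
score = parse_fragment('"score":{"value":-3.5}')
print(score['score']['value'])  # -3.5
```

The parsing code stays the same in both cases, whereas a pattern such as \d+ would need to be rewritten to accept the sign and the decimal point.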
Does this help?
import re
a = '"edge_liked_by":{"count":128}'
b = re.findall(r'\d+', a)[0]
print(b)  # 128
If I run an Athena query in AWS, the data I get back has structs with key/value pairs that look like this:
{
"events": "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
}
I can use regular expressions to parse this, but things like special characters make that de-serialization very error-prone.
For example, {deviceType=Android, date=2022-01-01} will run into issues with delimiters if I use regex.
Is there an existing de-serializer for this type of thing?
EDIT:
This is the de-serialize regex I have:
import json
import re

def deserialize(s):
    # Surround any word with "
    s1 = re.sub(r'(\w+)', r'"\g<1>"', s)
    # Replace = with :
    s2 = re.sub('=', ':', s1)
    return json.loads(s2)
This hits issues when the value contains special characters such as "-" or ".": \w+ does not treat them as part of the word, so the enclosing quotes end up in the wrong place.
The data inside the quotes is almost JSON but it's missing the quotes around keys and values. With a few judiciously chained .replace() method calls, you should be able to convert it from almost-JSON to JSON and then deserialize it using the json module:
import json
obj = {"events": "[{deviceType=Android, date=2022-01-01}]"}
events = obj['events']
events_json = events.replace(', ', ',').replace('{', '{"').replace('}', '"}').replace('=', '":"').replace(',', '","').replace('}","{','},{')
parsed = json.loads(events_json)
print(parsed[0])
print(parsed[0]['deviceType']) # prints 'Android'
print(parsed[0]['date']) # prints '2022-01-01'
*Edit to fix an issue raised by MisterMiyagi.
Instead of parsing this not-quite-JSON I recommend casting maps and arrays to JSON in your queries:
SELECT CAST(events AS JSON) AS events …
This has the added benefit of making the output less ambiguous to parse (e.g. without casting to JSON there is no way to know if "[1, 2, 3]" was an array of integers or strings, or if "[hello, world]" was an array of two elements, or one element with a comma inside).
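The ambiguity is easy to see once the values are JSON-encoded. The strings below are hypothetical examples of what such a column could contain after the cast; the exact shape depends on the column type and the engine.

```python
import json

# Hypothetical JSON-encoded column values.
one_element = json.loads('["hello, world"]')
two_elements = json.loads('["hello", "world"]')
numbers = json.loads('[1, 2, 3]')
number_strings = json.loads('["1", "2", "3"]')

print(len(one_element), len(two_elements))  # 1 2
print(numbers, number_strings)              # [1, 2, 3] ['1', '2', '3']
```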
Given the data as shown, you can isolate the strings between curly brackets with RE then further split those strings into their component parts. Here's an example:
import re
d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}
for t in re.findall(r'(?<={).+?(?=})', d['events']):
    for p in t.split(','):
        print(p)
Output:
deviceType=Android
logins=400
deviceType=iPhone
logins=550
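If you want dictionaries rather than printed pairs, the same idea can be taken one step further. This is only a sketch under the assumption that values never contain a comma or a curly bracket; parse_events is a made-up name, and all values are kept as strings.

```python
import re

def parse_events(raw):
    """Turn '[{k=v,k=v},{...}]' into a list of dicts with string values."""
    records = []
    for body in re.findall(r'\{(.+?)\}', raw):
        record = {}
        for pair in body.split(','):
            # partition splits on the first '=' only, so '-' or '.' in values is fine
            key, _, value = pair.partition('=')
            record[key.strip()] = value.strip()
        records.append(record)
    return records

raw = "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
print(parse_events(raw))
# [{'deviceType': 'Android', 'logins': '400'}, {'deviceType': 'iPhone', 'logins': '550'}]
```

A value such as date=2022-01-01 comes through intact, but a value that itself contains a comma would still be split incorrectly.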
I have a string from which I want to take the values within the parentheses, and then get the values that are separated by commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall(r'([^(,)]+)(?!.*\()', string)
print(l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a sequence of characters other than (, , and ). To prevent the leading x from being picked up, the match may not be followed by a ( anywhere further in the string. This also makes it work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
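A quick check of that behaviour, wrapped in a hypothetical helper called split_args. The last call shows a limitation worth knowing about: text after the closing parenthesis is matched as well, because nothing in the pattern excludes it.

```python
import re

PATTERN = re.compile(r'([^(,)]+)(?!.*\()')

def split_args(text):
    """Return the comma-separated items, skipping anything before the last '('."""
    return PATTERN.findall(text)

print(split_args("x(142,1,23ERWA31)"))     # ['142', '1', '23ERWA31']
print(split_args("func(142,1,23ERWA31)"))  # ['142', '1', '23ERWA31']
print(split_args("x(1,2) tail"))           # ['1', '2', ' tail']
```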
You can make this a lot simpler. Your first regex already captures what you need, but you are not using the captured group. You want .group(1) (the part inside the brackets), not .group(0) (the whole match). Once you have that, you can just split it on commas:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
This looks a little wonky, and it assumes there will always be exactly 3 values in the parentheses, but try this regex:
\((.*?),(.*?),(.*?)\)
To extract all the group matches into a single object, your code would then look like:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult tuple like a list:
>>> print(firstResult[2])
23ERWA31
I was wondering if anyone has a simpler solution for extracting a few letters from the middle of a string. I want to retrieve the 3 letters (in this case, GMB), and all the entries follow the same pattern. I'm struggling to find a simpler way of doing this.
Here is an example of what I've been using:
entry = "entries-alphabetical.jsp?raceid13=GMB$20140313A"
symbol = entry.strip('entries-alphabetical.jsp?raceid13=')
symbol = symbol[0:3]
print symbol
thanks
First of all, the argument passed to str.strip is not a prefix or suffix; it is a set of characters, and any of them are stripped from both ends of the string.
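A short illustration of the difference. The string "raceid13=east" and the helper symbol_after_equals are invented for the example; the entry value is the one from the question, where strip only happens to work because the capital letters are not in the stripped set.

```python
entry = "entries-alphabetical.jsp?raceid13=GMB$20140313A"

# strip() removes any of the given characters from both ends, not the literal prefix.
print("raceid13=east".strip("raceid13="))  # st (not east)

def symbol_after_equals(text, length=3):
    """Return the first `length` characters after the first '='."""
    _, _, tail = text.partition('=')
    return tail[:length]

print(symbol_after_equals(entry))  # GMB
```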
Since the string looks like a URL, you can use urlparse.parse_qsl:
>>> import urlparse
>>> urlparse.parse_qsl(entry)
[('entries-alphabetical.jsp?raceid13', 'GMB$20140313A')]
>>> urlparse.parse_qsl(entry)[0][1][:3]
'GMB'
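The urlparse module shown above is Python 2. In Python 3 the same functions live in urllib.parse. A rough equivalent, assuming the parameter is always called raceid13 (the function name race_symbol is made up), could look like this:

```python
from urllib.parse import parse_qs, urlsplit

entry = "entries-alphabetical.jsp?raceid13=GMB$20140313A"

def race_symbol(url, key="raceid13", length=3):
    """Return the first `length` characters of the `key` query parameter."""
    query = urlsplit(url).query      # 'raceid13=GMB$20140313A'
    values = parse_qs(query)[key]    # ['GMB$20140313A']
    return values[0][:length]

print(race_symbol(entry))  # GMB
```

Splitting off the query string first means the key is the plain parameter name rather than the path plus the parameter name.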
This is what regular expressions are for. http://docs.python.org/2/library/re.html
import re
# Capture the three capital letters that follow the '='
val = re.search(r'=([A-Z]{3})', entry)
print(val.group(1))
I have a set of strings as follows:
text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'
I extracted this data from an XLS file and converted it to strings.
Now I have to extract the data inside the single quotes and put it in a list.
The expected output looks like:
[MUC-EC-099_SC-Memory-01_TC-25, MUC-EC-099_SC-Memory-01_TC-26,MUC-EC-099_SC-Memory-01_TC-27]
Thanks in advance.
Use re.findall:
>>> import re
>>> strs = """text:u'MUC-EC-099_SC-Memory-01_TC-25'
text:u'MUC-EC-099_SC-Memory-01_TC-26'
text:u'MUC-EC-099_SC-Memory-01_TC-27'"""
>>> re.findall(r"'(.*?)'", strs, re.DOTALL)
['MUC-EC-099_SC-Memory-01_TC-25',
'MUC-EC-099_SC-Memory-01_TC-26',
'MUC-EC-099_SC-Memory-01_TC-27'
]
You can use the following expression:
(?<=')[^']+(?=')
This matches one or more characters that are not ' and that are enclosed between ' and '.
Python Code:
quoted = re.compile("(?<=')[^']+(?=')")
for value in quoted.findall(str(row[1])):
    i.append(value)
print i
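One caveat with the lookaround version: because the quotes are not consumed, it also matches the text between two quoted values when a single string contains more than one of them. It is fine when applied cell by cell, as above. The sample text below is made up to show the difference from a capture-group pattern.

```python
import re

text = "text:u'A' text:u'B'"

# Lookarounds leave the quotes unconsumed, so the gap between values matches too.
lookaround = re.findall(r"(?<=')[^']+(?=')", text)
# A capture group consumes both quotes of each pair.
captured = re.findall(r"'([^']+)'", text)

print(lookaround)  # ['A', ' text:u', 'B']
print(captured)    # ['A', 'B']
```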
That text: prefix seems a little familiar. Are you using xlrd to extract it? In that case, the reason you have the prefix is because you're getting the wrapped Cell object, not the value in the cell. For example, I think you're doing something like
>>> sheet.cell(2,2)
number:4.0
>>> sheet.cell(3,3)
text:u'C'
To get the unwrapped object, use .value:
>>> sheet.cell(3,3).value
u'C'
(Remember that the u here is simply telling you the string is unicode; it's not a problem.)
I am learning regular expressions and don't understand how to match the following pattern:
" myArray = ["Var1","Var2"]; "
Ideally I want to get the data in the array and convert it into a Python list.
Are the array items guaranteed to be surrounded by double-quotes?
This is a quick and dirty method:
re.findall('"([^,]+)"', source)
where source is your string.
I didn't escape the double-quotes in the regex since you can also use single-quotes in Python.
This returns a list of each item surrounded by double quotes
so in your example: ['Var1', 'Var2']
The complexity of the regular expression depends heavily on how much the input varies. The simplest expressions that match the given string are:
>>> from re import search, findall
>>> s = ' myArray = ["Var1","Var2"]; '
>>> name, body = search(r'\s*(\w*)\s*=\s*\[(.*)\]', s).groups(0)
>>> contents = findall(r'"(\w*)"', body)
>>> name, contents
('myArray', ['Var1', 'Var2'])
"Converting" to python array can be done like this:
>>> globals().update({name: contents})
>>> myArray
['Var1', 'Var2']
However, this is actually a bad idea, as it pollutes globals with arbitrary names. Instead, store the result in a separate dictionary.
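A minimal sketch of the dictionary alternative, reusing the same two regular expressions. The names parse_assignment and variables are illustrative only.

```python
import re

def parse_assignment(line):
    """Parse 'name = ["a","b"];' into (name, list of strings)."""
    match = re.search(r'(\w+)\s*=\s*\[(.*)\]', line)
    if match is None:
        raise ValueError('no array assignment found')
    name, body = match.groups()
    return name, re.findall(r'"(\w*)"', body)

# Keep parsed names in an ordinary dict instead of globals().
variables = {}
name, contents = parse_assignment(' myArray = ["Var1","Var2"]; ')
variables[name] = contents
print(variables)  # {'myArray': ['Var1', 'Var2']}
```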
If you are interested in just getting the data in the array, you can skip using regex and use eval instead.
Consider this:
myArray = eval('["Var1","Var2"]')
If you must use the whole line you gave in the example, you can also use exec. However, both eval and exec run arbitrary code, so they are dangerous and should only be used on input you fully trust.
Without using an re you could use builtin string methods and literal_eval which given your example returns a usable list object:
from ast import literal_eval
text = ' myArray = ["Var1","Var2"]; '
name, arr_text = (el.strip('; ') for el in text.split('='))
arr = literal_eval(arr_text)
print name, arr
Then do what you want with name and arr...
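The snippet above uses the Python 2 print statement. Below is a Python 3 sketch of the same idea, with a hypothetical wrapper parse_array_assignment, plus a check showing that literal_eval refuses anything that is not a plain literal, which is what makes it safer than eval.

```python
from ast import literal_eval

def parse_array_assignment(text):
    """Split 'name = [...];' on the first '=' and evaluate the right side as a literal."""
    name, _, arr_text = text.partition('=')
    return name.strip(), literal_eval(arr_text.strip('; '))

name, arr = parse_array_assignment(' myArray = ["Var1","Var2"]; ')
print(name, arr)  # myArray ['Var1', 'Var2']

# A function call is not a literal, so literal_eval raises ValueError instead of running it.
try:
    literal_eval('__import__("os").getcwd()')
except ValueError:
    print('rejected: not a literal')
```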