I've been using Graphite for some time now to power our Python backend. As part of that, I need to sum (using sumSeries) different metrics using wildcards.
The thing is, I need to group them according to a pattern; say I have the following range of metric names:
group.*.item.*
I need to sum the values of all items for a given group (meaning group.1.item.*, group.2.item.*, etc).
Unfortunately, I do not know the set of existing group values in advance, so what I do right now is query metrics/index.json, parse the list, and generate the desired query (manually creating sumSeries(group.NUMBER.item.*) for every NUMBER I find in the metrics index).
I was wondering if there is a way to have Graphite do this for me and save the first roundtrip, as the communication and pre-processing are costly (taking more than half the time of the entire process).
Thanks in advance!
If you want a separate line for each group, you could use the groupByNode function.
groupByNode(group.*.item.*, 1, "sumSeries")
Where '1' is the node you're selecting (indexed from 0) and "sumSeries" is the function you are feeding each group into.
You can read more about this here: http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.groupByNode
If you want to restrict the second node to only numeric values you can use a character range. You do this by specifying the range in square brackets [...]. A character range is indicated by 2 characters separated by a dash (-).
group.[0-9].item.*
You can read more about this here:
http://graphite.readthedocs.io/en/latest/render_api.html#paths-and-wildcards
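Putting the two tips together, a single render call might look like the sketch below. This is only an illustration: the hostname, the time range, and the use of the requests library are assumptions, not part of the question.

import requests

# Hypothetical Graphite host; the target combines groupByNode with the [0-9] character range.
url = "http://graphite.example.com/render"
params = {
    "target": 'groupByNode(group.[0-9].item.*, 1, "sumSeries")',
    "format": "json",
    "from": "-1h",  # example time window
}
resp = requests.get(url, params=params)
series = resp.json()  # one entry per group, each already summed across its items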
I have a Python package installed that displays Google geography info from a search term. I cannot apply the function across all the rows in my data. I can confirm that it works for a single result:
But how can I apply this across every row in the 7th column?
I understand I need a for loop:
But how do I apply my function from the package to each row?
Thanks a lot
You can use the apply function which is faster:
data2['geocode_result'] = data2['ghana_city'].apply(lambda x: gmaps.geocode(x))
The result will be JSON strings in the geocode_result column, so you may have to define a custom function to extract the information you want from them. At least apply will be a step in the right direction.
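For example, a minimal sketch of such an extraction function, assuming the results follow the usual Google Maps geocode layout (a list of dicts with a geometry/location entry); extract_latlng is just an illustrative name:

def extract_latlng(geocode_result):
    # geocode_result is the list returned by gmaps.geocode(); take the first hit, if any.
    if not geocode_result:
        return None
    location = geocode_result[0]['geometry']['location']
    return location['lat'], location['lng']

data2['latlng'] = data2['geocode_result'].apply(extract_latlng)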
pd.Series.map is one way to vectorise your algorithm:
data2['geocode_result'] = data2['ghana_city'].map(gmaps.geocode)
This is still not truly vectorised, as gmaps.geocode() will be applied to every element in the series. But, similar to ScratchNPurr's approach, you should see an improvement.
I have a list of strings that follow a specific pattern. Here's an example
['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
I'm trying to end up with a glob pattern that will represent this list, like the following:
'ratelimiter:foobar:201401011*'
I know the first two fields ahead of time. The third field is a time stamp and I want to find the column at which they start to have different values from other values in the column.
In the example given the timestamp ranges from 2014-01-01-11:57 to 2014-01-01-12:00 and the column that's different is the third to the last column where 1 changes to 2. If I can find that, then I can slice the string to [:-3] += '*' (for this example)
Every time I try and tackle this problem I end up with loops everywhere. I just feel like there's a better way of doing this.
Or maybe someone knows a better way of doing this with redis. I'm doing this because I'm trying to get keys from redis and I don't want to make a request for every key but rather make a batch request using the pattern parameter. Maybe there's a better way of doing this but haven't found anything yet.
Thanks
Sticking with the pattern approach (converting to a timestamp is probably best, though), here is what I would do to find the longest common prefix:
items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

print(items[0][:[len(set(x)) == 1 for x in zip(*items)].index(False)] + '*')
# ratelimiter:foobar:201401011*
Which reads as: cut the first element of items at the point where the nth characters of all items are no longer equal.
[len(set(x)) == 1 for x in zip(*items)] returns a list of booleans that is True at position i if all items have the same character at index i.
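As an aside, the standard library can do the common-prefix part for you: os.path.commonprefix works character by character on any list of strings, so a short sketch would be:

import os

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

print(os.path.commonprefix(items) + '*')
# ratelimiter:foobar:201401011*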
This is what I would do:
convert the timestamp to numbers
find the max and min (if your list is not ordered)
take the difference between max and min and convert it back to pattern.
For example, in your case, the difference between max and min is 43. Since the min already ends in 57, you can quickly deduce that if the min ends with ***157, the max should be ***200. And you know the pattern.
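A rough sketch of that idea, assuming the key layout from the question (a fixed prefix, then a timestamp after the last colon):

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

# Strip the known prefix and treat the timestamps as plain integers.
stamps = [int(key.rsplit(':', 1)[1]) for key in items]
lo, hi = min(stamps), max(stamps)
print(lo, hi, hi - lo)  # 201401011157 201401011200 43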
You almost never want to use the '*' parameter in Redis in production because it is very slow -- much slower than making a request for each key individually in the vast majority of cases. Unless you're requesting so many keys that your bottleneck becomes the sheer amount of data you're transferring over the network (in which case you should really convert things to Lua and run the logic server-side), a pipeline is really what you want.
The reason you want a pipeline is that you're probably getting hit by the cost of transferring data back and forth to your Redis server in separate hops right now. A pipeline, in contrast, queues up a bunch of commands to run against Redis and then executes them all at once, when you're ready. Assuming you're using redis-py (if you're not, you really should be) and r is your connection to your Redis server, you can do this like so:
import redis

r = redis.Redis(...)
pipe = r.pipeline()

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

for item in items:
    pipe.get(item)

# All the values for each item you're getting from Redis will be here.
item_values = pipe.execute()
Note: this will only make one call to Redis and will be much faster than either getting each value individually or running a pattern selection.
All of the other answers so far are good Python answers, but you're dealing with a Redis problem. You need a Redis answer.
I am new to Python and struggling with one trivial task. I have one alphanumeric column known as region. It has both entries beginning with / (such as /health/blood pressure) and integer values. So typically a few observations look like:
/health/blood pressure
/health/diabetes
7867
/fitness
9087
/health/type1 diabetes
Now I want to remove all the rows/cases with integer values. After importing the data set into the Python shell, it shows region as object. I intended to solve this problem with a sort of regular expression, so I did the following:
pattern = '/'
data.region = Series(data.region)
matches = data.region.str.match(pattern)
matches
Here it gives a boolean object indicating whether each value matches the pattern or not. So I get something like this:
0     True
1    False
2     True
3     True
...
and so on.
Now I am stuck on how to remove the rows where the boolean is False. An if statement is not working. If anyone can offer some sort of assistance, that would be great!!
Thanks!!
It seems like you are using the pandas framework, so I am not completely sure if this will work:
You can try:
import re
matches = [i for i in data.region if re.match(pattern, str(i))]
In Python this is called a list comprehension: it goes through every entry in data.region, checks your pattern, and puts the entry in the list if the pattern matches (i.e. the expression after 'if' is true).
See: https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
If you want to map those for every region you can try to create a dictionary that maps the regions to the lists with the following dict-comprehension:
matches = {region: [i for i in data.region if re.match(pattern, str(i))] for region in data}
See: https://docs.python.org/2/tutorial/datastructures.html#dictionaries
However, you are definitely leaving the realm of the pandas framework. This could eventually fail if region is not an integer/string but a list itself (as I said, I don't know pandas well enough to judge).
In that case you could try:
matches = {}
for region in list_of_regions:
    matches[region] = [i for i in data.region if re.match(pattern, str(i))]
which is basically the same, just with a given list of regions and the dict comprehension made explicit as a for loop.
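Since this is pandas, a more idiomatic sketch would keep everything in the DataFrame and use boolean indexing; the example data below is just the sample from the question:

import pandas as pd

data = pd.DataFrame({'region': ['/health/blood pressure', '/health/diabetes',
                                7867, '/fitness', 9087, '/health/type1 diabetes']})

# True where the value is a string starting with '/', False for the integer rows.
mask = data['region'].astype(str).str.startswith('/')

# Boolean indexing keeps only the rows where the mask is True.
data = data[mask]
print(data)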
I need to create a BASH script, ideally using SED, to find and replace value lists in href URL link constructs within HTML site files, looking values up in a map (old to new values), for a given URL construct. There are around 25K site files to look through, and the map has around 6,000 entries that I have to search through.
All old and new values have 6 digits.
The URL construct is:
One value:
HREF=".*jsp\?.*N=[0-9]{1,}.*"
List of values:
HREF=".*\.jsp\?.*N=[0-9]{1,}+N=[0-9]{1,}+N=[0-9]{1,}...*"
The list of values is delimited by the + PLUS symbol, and the list can be 1 to n values in length.
I want to ignore a construct such as this:
HREF=".*\.jsp\?.*N=0.*"
i.e. the list is only N=0.
Effectively I'm only interested in URLs that include one or more values that are in the map file and that are not flagged with CHANGED -- i.e. the list requires updating.
PLEASE NOTE: in the above construct examples, .* means any character that isn't a digit; I'm just interested in any 6 digit values in the list of values after N=. I'm trying to isolate the N= list from the rest of the URL construct, and it should be noted that this N= list can appear anywhere within the URL construct.
Initially, I want to create a script that will create a report of all links that fulfil the above criteria and have a 6 digit OLD value that's in the map file, along with the file path, to get an understanding of the links impacted. E.g.:
Filename link
filea.jsp /jsp/search/results.jsp?N=204200+731&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
filea.jsp /jsp/search/BROWSE.jsp?Ntx=mode+matchallpartial&N=213890+217867+731&
fileb.jsp /jsp/search/results.jsp?N=0+450+207827+213767&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
Lastly, I'd like to find and replace all 6 digit numbers within the URL construct lists, as outlined above, as efficiently as possible (I'd like it to be reasonably fast, as there could be around 25K files, with 6K values to look up, and potentially multiple values in each list).
PLEASE NOTE: there is an additional issue when finding and replacing: an old value could have been assigned a new value that is itself an old value elsewhere in the map, and that may also have to be replaced.
E.G. If the map file is as below:
MAP-FILE.txt
OLD NEW
214865 218494
214866 217854
214867 214868
214868 218633
... ...
and there is a HREF link such as:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868
214867 changes to 214868 - this would need to be flagged to show that this value has already been changed and should not be replaced again, otherwise what was 214867 would end up as 218633, since every 214868 would be changed to 218633. Hope this makes sense - I would then need to run through the file and skip any 6 digit numbers that had been marked with the flag, such that the link would become:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214868CHANGED+218633CHANGED
Unless there's a better way to manage these infile changes.
Could someone please help me with this? I'm not an expert with these kinds of changes, so help would be massively appreciated.
Many thanks in advance,
Alex
I will write the outline for the code in some kind of pseudocode, as I don't remember Python well enough to quickly write the code in Python.
First, find what type it is (if it contains N=0 then type 3, if it contains "+" then type 2, else type 1) and get a list of strings containing "N=..." by splitting (what PHP calls explode) on the "+" sign.
The first loop is over links. The second loop is over each N= number. The third loop looks in the map file and finds the replacement value. Load the data of the map file into a variable before all the loops; file reading is the slowest operation you have in programming.
You replace the value in the third loop, then join (what PHP calls implode) the list of new strings into a new link when returning to the first loop.
You probably have several files with links, so you need another loop over the files.
When dealing with repeated codes you need a while loop until a spare number is found, and you need to save the numbers that are already used in a list.
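A rough Python sketch of that outline, assuming the map file format shown in the question (the file name, the regex, and the function names are illustrative only). In Python, explode/implode become split/join, and because each value is looked up only once per link, the CHANGED flag from the question is not needed in this variant:

import re

# Load the old -> new map once, before all the loops (file reading is slow).
mapping = {}
with open('MAP-FILE.txt') as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():  # skips the "OLD NEW" header
            mapping[parts[0]] = parts[1]

def replace_n_list(match):
    # match.group(0) looks like "N=0+450+214867+214868".
    # Each value is looked up exactly once, so a chained mapping
    # (214867 -> 214868 -> 218633) can never replace a value twice.
    values = match.group(0)[2:].split('+')
    return 'N=' + '+'.join(mapping.get(value, value) for value in values)

def rewrite_link(link):
    return re.sub(r'N=[0-9+]+', replace_n_list, link)

link = '/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868'
print(rewrite_link(link))
# /jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214868+218633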
I am trying to apply filters on two different properties, but GAE isn't allowing me to do this. What would the solution be? Here is the code snippet:
if searchParentX:
    que.filter("parentX >=", searchParentX).filter("parentX <=", unicode(searchParentX) + u"\ufffd")
    que.order('parentX')
if searchParentY:
    que.filter("parentY >=", searchParentY).filter("parentY <=", unicode(searchParentY) + u"\ufffd")
The solution would be to do the filtering in memory:
You can run two queries (filtering on one property each) and do an intersection on the results (depending on the size of the data, you may need to limit the results of one query but not the other so they can fit in memory); see the sketch at the end of this answer.
Or run one query and filter on the other property in memory (in this case it helps to know which property would return the more restrictive result set).
Alternatively, if your data is structured in such a way that you can break the data into sets, you can perform equality filters on a set and finish filtering in memory. For example, if you are searching on strings but you know the strings to be a fixed length (say 6 characters), you can create a "lookup" field with the beginning 3/4 characters. Then, when you need to search on this field, you do so by matching on the first few characters and finish the search in memory. Another example: when searching for integer ranges, if you can define common groupings of ranges (say decades for a year, or price ranges), then you can define a "range" field to do equality searches on and continue filtering in memory.
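For example, a rough sketch of the first option (two single-inequality queries intersected in memory), using the old db API from the question's snippet; MyModel, the property types, and the fetch limits are placeholders:

from google.appengine.ext import db

class MyModel(db.Model):  # placeholder model with the two properties from the question
    parentX = db.StringProperty()
    parentY = db.StringProperty()

qx = MyModel.all().filter("parentX >=", searchParentX) \
                  .filter("parentX <=", unicode(searchParentX) + u"\ufffd")
qy = MyModel.all().filter("parentY >=", searchParentY) \
                  .filter("parentY <=", unicode(searchParentY) + u"\ufffd")

# Fetch bounded result sets so they fit in memory, then intersect by entity key.
x_keys = set(entity.key() for entity in qx.fetch(1000))
results = [entity for entity in qy.fetch(1000) if entity.key() in x_keys]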
Inequality filters are limited to at most one property. I think this restriction exists because the data in Bigtable is stored in lexically sorted form, so only one such search can be performed at a time.
https://developers.google.com/appengine/docs/python/datastore/queries#Restrictions_on_Queries