I am attempting to integrate some ReactiveX concepts into an existing project, thinking it might be good practice and a way to make certain tasks cleaner.
I open a file, create an Observable from its lines, then do some filtering until I get just the lines I want. Now, I want to extract some information from two of those lines using re.search() to return particular groups. I can't for the life of me figure out how to get such values out of an Observable (without assigning them to globals).
train = 'ChooChoo'

with open(some_file) as fd:
    line_stream = Observable.from_(fd.readlines())
    a_stream = line_stream.skip_while(
        # Begin at dictionary
        lambda x: 'config = {' not in x
    ).skip_while(
        # Begin at train key
        lambda x: "'" + train.lower() + "'" not in x
    ).take_while(
        # End at closing brace of dict value
        lambda x: '}' not in x
    ).filter(
        # Filter sdk and clang lines only
        lambda x: "'sdk'" in x or "'clang'" in x
    ).subscribe(lambda x: match_some_regex(x))
In place of .subscribe() at the end of that stream, I have tried using .to_list() to get a list over which I can iterate "the normal way," but it only returns a value of type:
<class 'rx.anonymousobservable.AnonymousObservable'>
What am I doing wrong here?
Every Rx example I have ever seen does nothing but print results. What if I want them in a data structure I can use synchronously?
For the short term, I implemented the feature I wanted using itertools (as suggested by #jonrsharpe). Still, the problem grated at the back of my mind, so I came back to it today and figured it out.
This is not a good example of Rx, since it only uses a single thread, but at least now I know how to break out of "the monad" when need be.
#!/usr/bin/env python
from __future__ import print_function
from rx import *

def my_on_next(item):
    print(item, end="", flush=True)

def my_on_error(throwable):
    print(throwable)

def my_on_completed():
    print('Done')

def main():
    foo = []
    # Create an observable from a list of numbers
    a = Observable.from_([14, 9, 5, 2, 10, 13, 4])
    # Keep only the even numbers
    b = a.filter(lambda x: x % 2 == 0)
    # For every item, call a function that appends the item to a local list
    c = b.map(lambda x: foo.append(x))
    c.subscribe(lambda x: x, my_on_error, my_on_completed)
    # Use the list outside the monad!
    print(foo)

if __name__ == "__main__":
    main()
This example is rather contrived, and all the intermediate observables aren't necessary, but it demonstrates that you can easily do what I originally described.
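Incidentally, the reason .to_list() seemed to return nothing useful is that it does not return a plain list: it returns another Observable that emits the collected list as a single item, so you still have to subscribe to it. A minimal sketch (RxPY 1.x API, same Observable.from_ as above):

from rx import Observable

result = []
Observable.from_([14, 9, 5, 2, 10, 13, 4]) \
    .filter(lambda x: x % 2 == 0) \
    .to_list() \
    .subscribe(result.extend)  # on_next receives the whole list once

print(result)  # [14, 2, 10, 4]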
How can I change multiple parameter values in this URL: https://google.com/?test=sadsad&again=tesss&dadasd=asdaas
You can see my code below: I can only change 2 values!
This is the response: https://google.com/?test=aaaaa&dadasd=howwww
The again parameter is not in the response! How can I change its value and add it back to the URL?
def between(value, a, b):
    pos_a = value.find(a)
    if pos_a == -1: return ""
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]

def before(value, a):
    pos_a = value.find(a)
    if pos_a == -1: return ""
    return value[0:pos_a]

def after(value, a):
    pos_a = value.rfind(a)
    if pos_a == -1: return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= len(value): return ""
    return value[adjusted_pos_a:]

test = "https://google.com/?test=sadsad&again=tesss&dadasd=asdaas"

if "&" in test:
    print(test.replace(between(test, "=", "&"), 'aaaaa').replace(after(test, "="), 'howwww'))
else:
    print(test.replace(after(test, "="), 'test'))
Thanks!
From your code it seems like you are probably fairly new to programming, so first of all congratulations on having attempted to solve your problem.
As you might expect, there are language features you may not know about yet that can help with problems like this. (There are also libraries specifically for parsing URLs, but pointing you to those wouldn't help your progress in Python quite as much; if you are just trying to get some job done, they might be a godsend.)
Since the question is a little unclear (don't worry - I can only speak and write English, so you are ahead of me there), I'll try to explain a simpler approach to your problem. From the last block of your code I understand your intent to be:
"If there are multiple parameters, replace the value of the first with 'aaaaa' and the others with 'howwww'. If there is only one, replace its value with 'test'."
Your code is a fair attempt (at what I think you want to do). I hope the following discussion will help you. First, set url to your example initially.
>>> url = "https://google.com/?test=sadsad&again=tesss&dadasd=asdaas"
While the code deals with multiple arguments or one, it doesn't deal with no arguments at all. This may or may not matter, but I like to program defensively, having made too many silly mistakes in the past. Further, detecting that case early simplifies the remaining logic by eliminating an "edge case" (something the general flow of your code does not handle). If I were writing a function (good when you want to repeat actions) I'd start it with something like
if "?" not in url:
return url
I skipped this here because I know what the sample string is and I'm not writing a function. Once you know there are arguments, you can split them out quite easily with
>>> stuff, args = url.split("?", 1)
The second argument to split is another defensive measure, telling it to ignore all but the first question mark. Since we know there is at least one, this guarantees there will always be two elements in the result, and Python won't complain about a mismatched number of names and values in that assignment. Let's confirm their values:
>>> stuff, args
('https://google.com/', 'test=sadsad&again=tesss&dadasd=asdaas')
Now that we have the arguments alone, we can split them into a list:
>>> key_vals = args.split("&")
>>> key_vals
['test=sadsad', 'again=tesss', 'dadasd=asdaas']
Now you can create a list of key,value pairs:
>>> kv_pairs = [kv.split("=", 1) for kv in key_vals]
>>> kv_pairs
[['test', 'sadsad'], ['again', 'tesss'], ['dadasd', 'asdaas']]
At this point you can do whatever is appropriate to the keys and values - deleting elements, changing values, changing keys, and so on. You could create a dictionary from them, but beware of repeated keys. I assume you can change kv_pairs to reflect the final URL you want.
Once you have made the necessary changes, putting the return value back together is relatively simple: we have to put an "=" between each key and value, then a "&" between each resulting string, then join the stuff back up with a "?". One step at a time:
>>> [f"{k}={v}" for (k, v) in kv_pairs]
['test=sadsad', 'again=tesss', 'dadasd=asdaas']
>>> "&".join(f"{k}={v}" for (k, v) in kv_pairs)
'test=sadsad&again=tesss&dadasd=asdaas'
>>> stuff + "?" + "&".join(f"{k}={v}" for (k, v) in kv_pairs)
'https://google.com/?test=sadsad&again=tesss&dadasd=asdaas'
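Putting those steps together as a function (just a sketch of the intent stated above; the replacement values are the ones from your example, and the function name is my own invention):

def replace_query_values(url, first="aaaaa", rest="howwww", only="test"):
    # hypothetical helper: combines the split/rebuild steps shown above
    if "?" not in url:
        return url
    stuff, args = url.split("?", 1)
    kv_pairs = [kv.split("=", 1) for kv in args.split("&")]
    if len(kv_pairs) == 1:
        kv_pairs[0][1] = only   # single parameter
    else:
        kv_pairs[0][1] = first  # first parameter
        for pair in kv_pairs[1:]:
            pair[1] = rest      # all the others
    return stuff + "?" + "&".join(f"{k}={v}" for k, v in kv_pairs)

print(replace_query_values(url))
# https://google.com/?test=aaaaa&again=howwww&dadasd=howwww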
I would use urllib since it handles this for you.
First, let's break down the URL.
import urllib.parse

u = urllib.parse.urlparse('https://google.com/?test=sadsad&again=tesss&dadasd=asdaas')
# ParseResult(scheme='https', netloc='google.com', path='/', params='',
#             query='test=sadsad&again=tesss&dadasd=asdaas', fragment='')
Then let's isolate the query element.
data = dict(urllib.parse.parse_qsl(u.query))
# {'test': 'sadsad', 'again': 'tesss', 'dadasd': 'asdaas'}
Now let's update some elements.
data.update({
    'test': 'foo',
    'again': 'fizz',
    'dadasd': 'bar'})
Now we should encode it back to the proper format.
encoded = urllib.parse.urlencode(data)
# 'test=foo&again=fizz&dadasd=bar'
And finally, let's assemble the whole URL back together.
new_parts = (u.scheme, u.netloc, u.path, u.params, encoded, u.fragment)
final_url = urllib.parse.urlunparse(new_parts)
# 'https://google.com/?test=foo&again=fizz&dadasd=bar'
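As a side note, ParseResult is a namedtuple, so its standard _replace method offers a slightly shorter route to the same result (a small variation, not required):

final_url = urllib.parse.urlunparse(u._replace(query=encoded))
# 'https://google.com/?test=foo&again=fizz&dadasd=bar'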
Is it necessary to do it from scratch? If not, use the urllib module already included in vanilla Python.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

url = "https://google.com/?test=sadsad&again=tesss&dadasd=asdaas"

parsed_url = urlparse(url)
qs = dict(parse_qsl(parsed_url.query))
# {'test': 'sadsad', 'again': 'tesss', 'dadasd': 'asdaas'}

if 'again' in qs:
    del qs['again']
# {'test': 'sadsad', 'dadasd': 'asdaas'}

parts = list(parsed_url)
parts[4] = urlencode(qs)
# ['https', 'google.com', '/', '', 'test=sadsad&dadasd=asdaas', '']

new_url = urlunparse(parts)
# https://google.com/?test=sadsad&dadasd=asdaas
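Since the question actually asked to change the again value rather than drop it, the same pattern covers that too. A sketch, starting again from the parsed URL, with a made-up replacement value:

qs = dict(parse_qsl(parsed_url.query))
qs['again'] = 'newvalue'  # hypothetical replacement value

parts = list(parsed_url)
parts[4] = urlencode(qs)
new_url = urlunparse(parts)
# https://google.com/?test=sadsad&again=newvalue&dadasd=asdaas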
import re

arr1 = ['2018.07.17 11:30:00,-0.19', '2018.07.17 17:55:00,0.86']
arr2 = ['2018.07.17 11:34:00,-0.39', '2018.07.17 17:59:01,0.85']

def combine_strats_lambda(*strats):
    """
    Takes *strats in date,return format
    combines infinite amount of strats with date, return and packs them into
    one single sorted array
    >> RETURN: combined list
    """
    temp = []
    # create combined list
    for v in enumerate(strats):
        i = 0
        while i < len(v[1]):
            temp.append(v[1][i])
            # k = re.findall(r"[\w']+", temp)[:6]
            i += 1
    temp2 = sorted(timestamps, key=lambda d: tuple(map(int, re.findall(r"[\w']+", d[0]))))
    return temp2
Hi,
I've been trying to finish this function, which should combine multiple lists of date/percentage-return strings and sort them.
I've come across a solution with lambda, but all I get is this message:
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
Do you know an easier solution to the problem or what the error is caused by? I can't seem to figure it out.
Anything appreciated :)
The most basic error in your code is in this line:

for v in enumerate(strats):

You have apparently forgotten that enumerate(...) yields two values at a
time: the index and the current value from the iterable.
So, with just a single v, each v is the whole (index, value) tuple, not the
value alone.
Another important point is that if the datetime strings are written as
yyyy.MM.dd hh:mm:ss, you can sort them using just string sort.
So, to gather the strings, you need a list comprehension, with 2 nested
loops.
And to sort them, you should use sorted function, specifying as the sort
key the "initial" (date / time) part, before the comma.
To sum up, to get the sorted list of strings, taken from a couple of
arguments of your function, sorted on the date / time part,
you can use the following program, written using version 3.6 of Python:
arr1 = ['2018.07.17 11:30:00,-0.19', '2018.07.17 17:55:00,0.86']
arr2 = ['2018.07.17 11:34:00,-0.39', '2018.07.17 17:59:01,0.85']

def combine_strats_lambda(*strats):
    temp = [v2 for v1 in strats for v2 in v1]
    return sorted(temp, key=lambda v: v.split(',')[0])

res = combine_strats_lambda(arr1, arr2)
for x in res:
    parts = x.split(',')
    print("{:20s} {:>6s}".format(parts[0], parts[1]))
It does not even use the re module.
Using Python 3
This is very basic, I'm sure. The code should return the country codes from the price strings provided; essentially I need the first two letters of each entry in the input.
The code I've written so far only outputs the first country code.
def get_country_codes(prices):
    c = prices.split(',')
    for char in c:
        return char[:2]

print(get_country_codes("NZ$300, KR$1200, DK$5"))
output:
NZ
Wanted output:
NZ, KR, DK
Very easy to write as a one-liner:
>>> def get_country_codes(prices):
return [cc.strip()[:2] for cc in prices.split(',')]
>>> print(get_country_codes("NZ$300, KR$1200, DK$5"))
['NZ', 'KR', 'DK']
>>>
What your program was doing was executing the for loop, but when return is called, it terminates the function. Your implementation looked as though you wanted a generator (i.e. using yield), which is doable, but probably more cumbersome than necessary for this.
def get_country_codes(prices):
    values = []
    price_codes = prices.split(',')
    for price_code in price_codes:
        values.append(price_code.strip()[0:2])
    return values  # output: ['NZ', 'KR', 'DK']
    # or, for the exact wanted output, use this instead:
    # return ', '.join(values)  # output: NZ, KR, DK

print(get_country_codes("NZ$300, KR$1200, DK$5"))
output:
['NZ', 'KR', 'DK']
Basically, your method was returning the first value from that split list.
You need to iterate over the split list, save each value in another list, and return that.
Another approach:
country_price_values = "NZ$300, KR$1200, DK$5"
country_codes = [val.strip()[0:2] for val in country_price_values.split(',')]
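If the goal is the exact comma-separated string from the question rather than a list, joining the codes finishes the job:

print(', '.join(country_codes))  # NZ, KR, DK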
I am new to python and really to programming in general, and am learning python through a website called rosalind.info, which aims to teach through problem solving.
Here is the problem, wherein you're asked to calculate the percentage of guanine and cytosine in the string of DNA given for each ID, then return the ID of the sample with the greatest percentage.
I'm working on the sample problem on the page and am experiencing some difficulty. I know my code is probably really inefficient and cumbersome but I take it that's to be expected for those who are new to programming.
Anyway, here is my code.
gc = open("rosalind_gcsamp.txt", "r")
biz = gc.readlines()

i = 0
gcc = 0
d = {}

for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n", "")
        biz[i+1] = biz[i+1].replace("\n", "") + biz[i+2].replace("\n", "")
        del biz[i+2]
What I'm trying to accomplish here is, given input such as this:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
Break what's given into a list based on the lines and concatenate the two lines of DNA like so:
['>Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG', 'TCCCACTAATAATTCTGAGG\n']
And delete the entry two indices after the ID, which is >Rosalind. What I do with it later I still need to figure out.
However, I keep getting an index error and can't, for the life of me, figure out why. I'm sure it's a trivial reason, I just need some help.
I've even attempted the following to limited success:
for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n", "")
        biz[i+1] = biz[i+1].replace("\n", "") + biz[i+2].replace("\n", "")
    elif biz[i].startswith("A" or "C" or "G" or "T") and biz[i+1].startswith(">"):
        del biz[i]
which still gives me an index error but at least gives me the biz value I want.
Thanks in advance.
This is very easy to do with itertools.groupby, using lines that start with > as the keys and as the delimiters:
from itertools import groupby

with open("rosalind_gcsamp.txt", "r") as gc:
    # group elements using lines that start with ">" as the delimiter
    groups = groupby(gc, key=lambda x: not x.startswith(">"))
    d = {}
    for k, v in groups:
        # if k is False we have a non-match to our not x.startswith(">"),
        # so use the value v as the key and call next on the grouper object
        # to get the next value (the sequence lines)
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1]))
            d[key] = val

print(d)
{'>Rosalind_0808': 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT', '>Rosalind_5959': 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC', '>Rosalind_6404': 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'}
If you need order, use a collections.OrderedDict in place of d.
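For the next step the question mentions (finding the sample with the highest GC percentage), here is a short sketch on top of the dict built above:

# A sketch of the follow-on computation, assuming d as built above.
def gc_content(seq):
    # percentage of G and C bases in the sequence
    return 100.0 * sum(1 for ch in seq if ch in "GC") / len(seq)

best = max(d, key=lambda name: gc_content(d[name]))
print(best.lstrip(">"), gc_content(d[best]))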
You are looping over the length of biz. So in your last iteration biz[i+1] and biz[i+2] don't exist. There is no item after the last.
Hi
I need to filter out all rows that don't contain symbols from a huge "necessary" list; example code:
def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...]  # huge list of 10 000 members

f = open("huge_file", "r")  ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have python 2.4, so I can't use built-in any().
I wait a long time for this filtering, but is there some way to optimize it? For example, rows 1 and 4 contain the "RED.." pattern; once we have found that the "RED.." pattern is OK, can we skip the search through the 10,000-member list for the same pattern on row 4?
Is there some another way to optimize filtering?
Thank you.
Edit: See real example data in the comments to this post. I'm also interested in sorting the result by "fruit". Thanks!
If you organized the necessary list as a trie, then you could look in that trie to check if the fruit starts with a valid prefix. That should be faster than comparing the fruit against every prefix.
For example (only mildly tested):
import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        if len(value) == 0:  # ran out of input without hitting a stored prefix
            return False
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']
trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
This changes your algorithm from O(N * k), where N is the number of elements of necessary and k is the length of fruit, to just O(k) (more or less). It does take more memory, but that might be a worthwhile trade-off for your case.
I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.
#!/usr/bin/env python
import re
from trieMatch import PrefixMatch  # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ])  # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time

f = open("huge_file.txt", "r")  ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))
For brevity, the implementation of PrefixMatch is published here.
If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time.
update (on sorted results)
According to the changelog for Python 2.4:
key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys.
also, in the source code, line 1792:
/* Special wrapper to support stable sorting using the decorate-sort-undecorate
pattern. Holds a key which is used for comparisons and the original record
which is returned during the undecorate phase. By exposing only the key
.... */
This means that your regex pattern is only evaluated once for each entry (not once for each compare), hence it should not be too expensive to do:

sorted_generator = sorted(filtered, key=lambda line: regexp.match(line).group(1))
I personally like your code as is, since you consider "fruit=COLOR" as a pattern while the others do not. I think you want some solution like memoization that would let you skip the test for an already-solved case, but I guess this is not such a case.
from itertools import ifilter

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]
predicate = lambda line: any_it("fruit=" + color in line for color in necessary)
filtered = ifilter(predicate, open("testest"))
Tested (but unbenchmarked) code:
import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue  # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)
Differences from the code in the question:

We do not first load the whole file into memory (as happens when you use file.readlines()). Instead, we process each line as the file is read in. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop (see the sketch after this list).

We stop iterating through the necessary list once a match is found.

We modified the regex pattern and use re.match instead of re.findall. That assumes each line contains only one "fruit=..." entry.
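For reference, the readline-based variant alluded to above would look roughly like this (a sketch, not benchmarked, reusing regexp, necessary and filtered from above; no any() since the asker is on Python 2.4):

# Same loop as above, without the fileinput module.
f = open("test.txt")
line = f.readline()
while line:
    m = regexp.match(line)
    if m:
        for p in necessary:
            if m.group(1).startswith(p):
                filtered.append(line)
                break
    line = f.readline()
f.close()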
update
If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.
try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except:
    continue  # no match
filtered = []
for line in open('huge_file'):
    found = regexp.findall(line)
    if found:
        fruit = found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break
or maybe:
necessary = ['fruit=%s' % x for x in necessary]

filtered = []
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break
I'd make a simple list of ['fruit=RED', 'fruit=GREEN', ...] etc. with ['fruit=' + n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.
filtered = (line for line in f if any(a in line for a in necessary_simple))
(The any() function is doing the same thing as your any_it() function)
Oh, and get rid of file.readlines(); just iterate over the file.
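Putting those two suggestions together for Python 2.4 (any() only arrived in 2.5, so your any_it stands in for it; a sketch):

# Sketch: prefix list + plain substring test, iterating the file directly.
necessary_simple = ['fruit=' + n for n in necessary]
f = open('huge_file')
filtered = [line for line in f if any_it(a in line for a in necessary_simple)]
f.close()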
Untested code:
filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value not in necessary:
        filtered.append(line)
That should be faster than pattern matching 10 000 patterns onto a line.
Possibly there are even faster ways. :)
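One such way worth noting: a membership test against a 10,000-element list scans the whole list each time, while a set does it in roughly constant time, so a one-line change should help a lot (a sketch; note this tests exact values, not prefixes, just like the loop above):

# set() membership is O(1) on average, versus O(N) for a list
necessary = set(necessary)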
It shouldn't take too long to iterate through 100,000 strings, but I see you have a 10,000-string list, which means you iterate 10,000 * 100,000 = 1,000,000,000 times over the strings, so I don't know what you expected...
As for your question: if you encounter a word from the list and you only need 1 or more matches (if you want exactly 1, you need to iterate through the whole list), you can skip the rest; that should optimize the search operation.