I have a function that uses the b(t-1) variable, like:
def test_b(a, b_1):
    return a + b_1
Assume the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': np.nan})
I am assigning the initial b_1 value:
df['b'].ix[0]=0
and then (using my Matlab experience), I use the loop:
for i in range(1, len(df)):
    df['b'].ix[i] = test_b(df['a'].ix[i], df['b'].ix[i-1])
Output:
   a  b
0  1  0
1  2  2
2  3  5
Is there a more elegant way to do the same?
You never want to do assignments like this, as this is chained indexing.
This is a recurrence relation, so there is no easy way at the moment to do this in a very performant manner, though see here.
Here is an open issue about this, with a pointer to this, which uses ifilter to solve the relation.
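As a minimal sketch of what the loop in the question could look like without chained indexing (column names and the initial value are taken from the question; it is the same row-by-row recurrence, just written through .loc):

import numpy as np
import pandas as pd

def test_b(a, b_1):
    return a + b_1

df = pd.DataFrame({'a': [1, 2, 3], 'b': np.nan})
df.loc[0, 'b'] = 0  # initial value, set without chained indexing

# roll the recurrence forward one row at a time
for i in range(1, len(df)):
    df.loc[i, 'b'] = test_b(df.loc[i, 'a'], df.loc[i - 1, 'b'])

print(df)

For this particular additive recurrence the loop collapses to a cumulative sum of a (with the first element replaced by 0), but the loop form works for an arbitrary test_b.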
I have some code that is calculating the value of a large number of discrete actions and outputting the best action and its value.
A_max = 0
for i in ...:
    A = f(i)
    if A > A_max:
        x = i
        A_max = A
I'd like to parallelize this code in order to save time. Now, my understanding is that as calculating f(i) doesn't depend on calculating f(j) first, I can just use joblib.Parallel for that part of the code and get something like:
results = Parallel(n_jobs=-1)(delayed(f)(i) for i in...)
A_max = max(results)
x = list.index(A_max)
Is this correct?
My next issue is that my code contains a dictionary that the function f alters as it does its calculation. My understanding is that if the code is parallelized, each concurrent process will be altering the same dictionary. Is this correct, and if so, would creating copies of the dictionary at the beginning of f solve the issue?
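Purely to illustrate the copy idea being asked about (not a claim that it is the right fix), a hypothetical f could take its own copy of the shared dictionary before mutating anything; the names and the dummy calculation below are made up:

import copy

def f(i, shared_dict):
    local = copy.deepcopy(shared_dict)  # private copy, so this call cannot affect others
    local[i] = i * i                    # stand-in for the real calculation that mutates the dict
    return sum(local.values())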
Finally, in the documentation I'm seeing references to backends called "loky" and "threading". What is the difference between these backends?
I just don't know how to explain what I need. I'm not looking for any code, just a tutorial and direction to get to where I need to be.
Example: I have numbers in a CSV file; a and b are in different columns:
header1,header2
a,b
a1,b1
a2,b2
a3,b3
a4,b4
a5,b5
a6,b6
So how would I create something like
[a*b + a1*b1 + a2*b2 + ... + a6*b6] divided by [sum of all b values]?
OK, so I know how to code the denominator using pandas, but how would I code the numerator?
What is this process called, and where can I find a tutorial for it?
I don't know if this is the best method, but it should work. You can create a new column in pandas which is the product of a and b:
df['product'] = df['a']*df['b']
You can then simply use sum() to get the sum of column b and of the product column, and then divide the product sum by the b sum:
ans = df['product'].sum() / df['b'].sum()
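As a self-contained sketch, assuming the CSV is saved as something like data.csv with the header1/header2 columns from the question (the file name is an assumption):

import pandas as pd

# 'data.csv' is an assumed file name; the columns follow the question's header1/header2
df = pd.read_csv('data.csv')

df['product'] = df['header1'] * df['header2']    # element-wise a*b for each row
ans = df['product'].sum() / df['header2'].sum()  # numerator divided by the sum of b
print(ans)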
Not sure if this is the best method to use, but you could use list comprehensions along with the zip() function. With these two, you can get the numerator like this:
[a*b for a, b in zip(df['header1'], df['header2'])]
Chapter 3 of Dive into Python 3 has more on list comprehensions. Here is the documentation on zip() and here are a few examples of its usage.
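Putting the pieces together, and assuming the data is already in a dataframe df with the header1/header2 columns as above, the whole ratio might look like this:

# numerator: sum of the element-wise products a*b
numerator = sum(a * b for a, b in zip(df['header1'], df['header2']))

# denominator: sum of all b values
denominator = sum(df['header2'])

result = numerator / denominator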
I have a pandas dataframe column (series) containing indices to a single character of interest inside the string elements of another column. Is there a way for me to access these characters of interest based on the index column in a vectorized manner, similar to the dataframe['name'].str.* functions? [edit: see comment below] If not (or regardless, really), what would you say is the preferred approach here?
[Edit: this assumption was wrong, as pointed out by jpp, but I'm leaving it here for traceability]
I'm trying to avoid being unnecessarily verbose, such as applying a translation function using map or having to construct a separate indexing recipe (like a dictionary containing the indices) in order to do something like
myDataFrame['myDesiredResult'] = \
    myDataFrame['myStrCol'].apply(myCharacterExtractionFunction, myIndexingRecipe)
I'd prefer sticking to numpy and pandas and not mix in more modules if at all possible.
Illustration of what the data might look like:
myStrCol myIndices myDesiredResult
0 ABC 1 B
1 DEF 0 D
2 GHI 2 I
Also, maybe useful for understanding how the numpy array actually behaves inside the pandas wrapper: it would be great if someone could explain whether it makes a difference to have a separate numpy array containing the indices, like this:
>>> import pandas
>>> import numpy
>>> myPandasStringSeries = pandas.Series(['ABC', 'DEF', 'GHI'])
>>> myPandasStringSeries
0    ABC
1    DEF
2    GHI
dtype: object
>>> myNumpyIndexArray = numpy.array([1, 0, 2])
>>> myNumpyIndexArray
array([1, 0, 2])
It seems to me that what I want is very similar to this suggestion relating to substrings but there doesn't seem to be a solution there yet. Apart from that, all I have found relates to the Series.str methods which operate using the same parameter for all elements of the Series like so:
myDataFrame['newColumn'] = myDataFrame['oldColumn'].str.split('_').str.get(0)
Is there a way for me to access these characters of interest based on
the index column in a vectorized manner, similar to the
dataframe['name'].str.* functions?
There is a misunderstanding here. Despite the documentation, pd.Series.str methods are not vectorised in the conventional sense. They operate in a high-level loop and often reflect the functionality in Python's built-in str methods.
In fact, pd.Series.str methods generally underperform simple list comprehensions when manipulating strings stored in Pandas dataframes. The convenient syntax should not be taken as a sign the underlying implementation is vectorised. This is often the case for series with dtype object.
One approach is to use a list comprehension:
df['myDesiredResult'] = [i[k] for i, k in zip(df['myStrCol'], df['myIndices'])]
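Applied to the example data from the question (column names as above), a quick check of that comprehension might look like this:

import pandas as pd

df = pd.DataFrame({'myStrCol': ['ABC', 'DEF', 'GHI'],
                   'myIndices': [1, 0, 2]})

# pick the character sitting at each row's own index
df['myDesiredResult'] = [s[k] for s, k in zip(df['myStrCol'], df['myIndices'])]

print(df['myDesiredResult'].tolist())  # ['B', 'D', 'I']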
I need help making a function sumcounts(D), where D is a dictionary with numbers as values, that returns the sum of all the values. Sample output should look like this:
>>> sumcounts({"a":2.5, "b":7.5, "c":100})
110.0
>>> sumcounts({ })
0
>>> sumcounts(strcount("a a a b"))
4
It's already there:
sum(d.values())
Or maybe this:
def sumcounts(d):
    return sum(d.values())
I'm not sure what you've been taught about dictionaries, so I'll assume the bare minimum; the steps below are sketched in code right after this list.
Create a total variable and set it to 0.
Iterate over all keys in your dictionary using the normal for x in y syntax.
For every key, fetch its respective value from your dictionary using your_dict[key_name].
Add that value to total.
Get an A on your assignment.
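A minimal sketch of those steps (the variable names are just illustrative):

def sumcounts(d):
    total = 0            # step 1: create a total variable set to 0
    for key in d:        # step 2: iterate over all keys
        total += d[key]  # steps 3-4: fetch each value and add it to total
    return total

print(sumcounts({"a": 2.5, "b": 7.5, "c": 100}))  # 110.0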
Michael already posted the regular Pythonic solution.
The answer given by Michael above is spot on!
I want to suggest that, if you are going to work with large data sets, you look at the most excellent pandas Python framework. (Maybe overkill for your problem, but worth a look.)
It accepts dictionaries and transforms them into a data set, for instance:
import pandas

yourdict = {"a": 2.5, "b": 7.5, "c": 100}
dataframe = pandas.Series(yourdict)
You now have a very powerful data structure that you can really do a lot of neat stuff with, including getting the sum:
total = dataframe.sum()
You can also easily plot it, save it to Excel or CSV, get the mean, standard deviation, etc.:
dataframe.plot()              # plots it with matplotlib
dataframe.mean()              # gets the mean
dataframe.std()               # gets the standard deviation
dataframe.to_csv('name.csv')  # writes to a csv file
I can really recommend pandas. It changed the way I do data business with Python... It compares well with the R data frame, by the way.
I've written some code to find all the items that are in one iterable and not another, and vice versa. I was originally using the built-in set difference, but the computation was rather slow as there were millions of items being stored in each set. Since I know there will be at most a few thousand differences I wrote the below version:
import itertools

def differences(a_iter, b_iter):
    a_items, b_items = set(), set()

    def remove_or_add_if_none(a_item, b_item, a_set, b_set):
        if a_item is None:
            if b_item in a_set:
                a_set.remove(b_item)
            else:
                b_set.add(b)

    def remove_or_add(a_item, b_item, a_set, b_set):
        if a in b_set:
            b_set.remove(a)
            if b in a_set:
                a_set.remove(b)
            else:
                b_set.add(b)
            return True
        return False

    # pair the iterables element by element, padding the shorter one with None
    for a, b in itertools.izip_longest(a_iter, b_iter):
        if a is None or b is None:
            remove_or_add_if_none(a, b, a_items, b_items)
            remove_or_add_if_none(b, a, b_items, a_items)
            continue
        if a != b:
            if remove_or_add(a, b, a_items, b_items) or \
               remove_or_add(b, a, b_items, a_items):
                continue
            a_items.add(a)
            b_items.add(b)
    return a_items, b_items
However, the above code doesn't seem very pythonic so I'm looking for alternatives or suggestions for improvement.
Here is a more pythonic solution:
def differences(a_iter, b_iter):
    a, b = set(a_iter), set(b_iter)
    return a - b, b - a
Pythonic does not mean fast, but rather elegant and readable.
Here is a solution that might be faster:
def differences(a_iter, b_iter):
    a, b = set(a_iter), set(b_iter)
    # Get all the candidate return values
    symdif = a.symmetric_difference(b)
    # Since symdif has far fewer elements, these operations might be faster
    return symdif - b, symdif - a
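As a quick sanity check of the symmetric-difference variant on tiny inputs (purely illustrative values):

a, b = set([1, 2, 3, 4]), set([3, 4, 5])
symdif = a.symmetric_difference(b)            # {1, 2, 5}
only_in_a, only_in_b = symdif - b, symdif - a
# only_in_a == {1, 2}, only_in_b == {5}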
Now, about writing custom “fast” algorithms in Python instead of using the built-in operations: it's a very bad idea.
The set operators are heavily optimized, and written in C, which is generally much, much faster than Python.
You could write an algorithm in C (or Cython), but then keep in mind that Python's set algorithms were written and optimized by world-class geniuses.
Unless you're extremely good at optimization, it's probably not worth the effort. On the other hand, if you do manage to speed things up substantially, please share your code; I bet it'd have a chance of getting into Python itself.
For a more realistic approach, try eliminating calls to Python code. For instance, if your objects have a custom equality operator, figure out a way to remove it.
But don't get your hopes up. Working with millions of pieces of data will always take a long time. I don't know where you're using this, but maybe it's better to make the computer busy for a minute than to spend the time optimizing set algorithms?
I think your code is broken: try it with [1,1] and [1,2] and you'll get that 1 is in one set but not the other.
> print differences([1,1],[1,2])
(set([1]), set([2]))
You can trace this back to the effect of the if a != b test (which assumes something about ordering that is not present in simple set differences).
Without that test, which probably discards many values, I don't think your method is going to be any faster than built-in sets. The argument goes something like this: you really do need to create one set in memory to hold all the data (your bug came from not doing that). A naive set approach creates two sets, so the best you can do is save half the time, and you also have to do, in Python, the work of what is probably efficient C code.
I would have thought Python set operations would be the best performance you could get out of the standard library.
Perhaps it's the particular implementation you chose that's the problem, rather than the data structures and attendant operations themselves. Here's an alternate implementation that should give you better performance.
For sequence comparison tasks in which the sequences are large, avoid, if at all possible, putting the objects that comprise the sequences into the containers used for the comparison; it is better to work with indices instead. If the objects in your sequences are unordered, then sort them.
So, for instance, I use NumPy, the numerical Python library, for this sort of task:
import numpy as NP

# a, b are 'fake' index arrays of type boolean
a, b = NP.random.randint(0, 2, 10), NP.random.randint(0, 2, 10)
a, b = NP.array(a, dtype=bool), NP.array(b, dtype=bool)

# items a and b have in common:
NP.sum(NP.logical_and(a, b))

# the converse (the differences):
NP.sum(NP.logical_xor(a, b))