I've been trying, and failing, to use iterrows with if/else statements to return calculated values from DataFrame columns, and I'm starting to think it's the wrong method.
In this example I have two variables x and y, and a DataFrame:
  category  number
0      one      13
1      two      14
2      one       7
3    three       8
4      one       3
5      two       8
6     four       9
If the category is one or two, divide the corresponding number by 2 and assign half the value to variable x and half to variable y. But if the category is three or four, assign the whole corresponding number to just variable y. x and y would then be the summed result, as in:
x = 22.5
(Because: 13/2 + 14/2 + 7/2 + 3/2 + 8/2 = 22.5)
y = 39.5
(Because: 13/2 + 14/2 + 7/2 + 8 + 3/2 + 8/2 + 9 = 39.5)
I haven't found any example of iterrows being used like this. Are these types of calculations even possible using iterrows, or is there a better way?
You can use .loc to slice by each case you're looking at, and then aggregate as appropriate.
case1 = ['one', 'two']
case2 = ['three', 'four']
x = df.loc[df.category.isin(case1), 'number'].sum()/2
y = x + df.loc[df.category.isin(case2), 'number'].sum()
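For reference, here is a self-contained version of this approach that rebuilds the example frame and confirms the expected totals (same names as above):

import pandas as pd

df = pd.DataFrame({
    'category': ['one', 'two', 'one', 'three', 'one', 'two', 'four'],
    'number':   [13, 14, 7, 8, 3, 8, 9],
})

case1 = ['one', 'two']     # these numbers are split evenly between x and y
case2 = ['three', 'four']  # these numbers go entirely to y

x = df.loc[df.category.isin(case1), 'number'].sum() / 2
y = x + df.loc[df.category.isin(case2), 'number'].sum()
print(x, y)  # 22.5 39.5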
Is there any way of multiplying a column of pandas with every other column and get a list of sum of products?
I'll try to keep my problem brief. Let's assume my data set is:
Label  X  Y1  Y2
================
abc    4   0   1
xyz    0   1   0
...    2   3   2
...    3   2   4
...    4   4   3
...    2   1   0
I would require a list list_sum
list_sum = [sum(X*Y1), sum(X*Y2)]
Every element in column X is multiplied by the corresponding element in column Y1 and the results are summed; the same is done with Y2. In this case list_sum should be:
list_sum = [30, 32]
But my requirement is for a dataframe containing n columns, without iterating using a for loop, as that could really cost computation time.
If I am missing anything or the info is insufficient, I'll be sure to update on notice.
More generic way to do it:
# list of columns to be multiplied by X
columns = list(set(df.columns).difference(["X"]))  # pass a list; a bare string is treated as an iterable of characters
print(columns)
sum_list = [sum(df["X"]*df[i]) for i in columns]
print(sum_list)
Edit
columns = list(set(df.columns).difference(["X"]))
df_ = df[columns].multiply(df["X"], axis="index")
print(df_.sum().tolist())
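As a quick sanity check against the sample data from the question (note that a set has no guaranteed order, so the order of the result follows whatever order `columns` came out in):

import pandas as pd

df = pd.DataFrame({'X':  [4, 0, 2, 3, 4, 2],
                   'Y1': [0, 1, 3, 2, 4, 1],
                   'Y2': [1, 0, 2, 4, 3, 0]})

columns = list(set(df.columns).difference(["X"]))
df_ = df[columns].multiply(df["X"], axis="index")
print(df_.sum().tolist())  # [30, 32] when columns == ['Y1', 'Y2']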
Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function that is:
def someThingSpecial(x, y):
    # z = do something special with x, y
    return z
I now want to create a new column in dfX that bears the computed z value.
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
Which returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x, y):
    return x + y
df.apply(lambda x: someThingSpecial(x.A, x.B), axis=1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
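For completeness, a minimal runnable version of the vectorised approach, using the toy x + y body from above as a stand-in for the real function:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 6, 9]})

def someThingSpecial(x, y):
    return x + y  # stand-in for the real computation

v = np.vectorize(someThingSpecial)  # v now accepts whole columns
df['C'] = v(df.A, df.B)             # C is [3, 10, 16]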
If your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
To store it:
dfX['C'] = dfX['A'].apply(your_func)
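A minimal illustration, with your_func as a hypothetical single-argument function:

import pandas as pd

dfX = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 6, 9]})

def your_func(a):  # hypothetical: any function of one value
    return a * 2

dfX['C'] = dfX['A'].apply(your_func)  # C is [2, 8, 14]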
Let us assume we are given the below function:
def f(x, y):
    y = x + y
    return y
The function f(x,y) sums two numbers (but it could be any more or less complicated function of two arguments). Let us now consider the following:
import pandas as pd
import random
import numpy as np
random.seed(1234)
df = pd.DataFrame({'first': random.sample(range(0, 9), 5),
                   'second': np.NaN}, index=None)
y = 1
df
first second
0 7 NaN
1 1 NaN
2 0 NaN
3 6 NaN
4 4 NaN
For the scope of the question, the second column of the data frame is irrelevant, so we can without loss of generality assume it to be NaN. Let us apply f(x,y) to each row of the data frame, with the variable y initialised to 1. The first iteration returns 7 + 1 = 8; when applying the function to the second row, we want y to have been updated to the previously calculated 8, so that the result is 1 + 8 = 9, and so on and so forth.
What is the pythonic way to handle this? I want to avoid looping and re-assigning the variables inside the loop, thus my guess would be something along the lines of
def apply_to_df(df, y):
    result = df['first'].apply(lambda s: f(s, y))
    return result
However, one may easily see that the above does not use the updated values; it performs all calculations with the original value y = 1.
print(apply_to_df(df,y))
0 8
1 2
2 1
3 7
4 5
Note, you can probably solve this specific case with an existing cumulative function. However, in the general case, you could just hack it by relying on global state:
In [7]: y = 1
In [8]: def f(x):
   ...:     global y
   ...:     y = x + y
   ...:     return y
   ...:
In [9]: df['first'].apply(lambda s: f(s))
Out[9]:
0 8
1 9
2 9
3 15
4 19
Name: first, dtype: int64
I want to avoid looping and re-assigning the variables inside the loop
Note, pd.DataFrame.apply is a vanilla Python loop under the hood, and it's actually less efficient because it does a lot of checking/validation of inputs. It is not meant to be efficient, but convenient. So if you care about performance, you've already given up by relying on .apply.
Honestly, I think I would rather write the explicit loop over the rows inside of a function than rely on global state.
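To make the cumulative-function remark above concrete: for this particular f, the running result is just a cumulative sum with the seed folded in, so no stateful apply is needed at all (a sketch against the example data):

import pandas as pd

df = pd.DataFrame({'first': [7, 1, 0, 6, 4]})
y = 1
df['second'] = df['first'].cumsum() + y
print(df['second'].tolist())  # [8, 9, 9, 15, 19] -- matches the output above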
You could use a generator function to remember the prior calculation result:
def my_generator(series, foo, y_seed=0):
    y = y_seed  # Seed value for `y`.
    for x in series:  # Iterate directly over the series values.
        # Call the function on the next `x` value together with the most recent `y` value.
        y = foo(x=x, y=y)
        yield y
df = df.assign(new_col=list(my_generator(series=df['first'], foo=f, y_seed=1)))
>>> df
first second new_col
0 8 NaN 9
1 3 NaN 12
2 0 NaN 12
3 5 NaN 17
4 4 NaN 21
The following question is a generalization to the question posted here:
Counting the intersection of equivalent rows in two tables
I have two FITS files. For example, the first file has 100 rows and 2 columns. The second file has 1000 rows and 3 columns.
FITS FILE 1    FITS FILE 2
A  B           C  D  E
1  2           1  2  0.1
1  3           1  2  0.3
2  4           1  2  0.9
I need to take the first row of the first file, i.e. 1 and 2, and check how many rows in the second file have C = 1 and D = 2, weighting each pair (C,D) by the corresponding value in column E.
In the example, 3 rows in the second file have C = 1 and D = 2, with weights E = 0.1, 0.3, and 0.9, respectively. Weighting by the values in E, I need to associate the value 0.1 + 0.3 + 0.9 = 1.3 with the pair (A,B) = (1,2) of the first file. Then I need to do the same for the second row (first file), i.e. 1 and 3, and find out how many rows in the second file have 1 and 3, again weighting by the value in column E, and so on.
The first file does not have duplicates (all the rows have different pairs, none are identical, only file 2 has many identical pairs which I need to find).
I finally need the weighted number of rows in the second file that match the values in each row of the first FITS file.
The result should be:
A B Number
1 2 1.3 # 1 and 2 occurs 1.3 times
1 3 4.5 # 1 and 3 occurs 4.5 times
and so on for all pairs in A and B columns.
I know from the post cited above that the solution for weights in column E all equal to 1 involves Counter, as follows:
from collections import Counter
# Create frequency table of (C,D) column pairs
file2freq = Counter(zip(C,D))
# Look up frequency value for each row of file 1
for a, b in zip(A, B):
    # and print out the row and frequency data
    print(a, b, file2freq[a, b])
To answer the question I need to include the weights in E when I use Counter:
file2freq = Counter(zip(C,D))
I was wondering if it is possible to do that.
Thank you very much for your help!
I'd follow up on the suggestion made by Iguananaut in the comments to that question. I believe numpy is an ideal tool for this.
import numpy as np
fits1 = np.genfromtxt('fits1.csv')
fits2 = np.genfromtxt('fits2.csv')
summed = np.zeros(fits1.shape[0])
for ind, row in enumerate(fits1):
    condition = (fits2[:, :2] == row).all(axis=1)
    # Change the assignment operator to += if the rows in fits1 are not unique.
    summed[ind] = fits2[condition, -1].sum()
After the import, the first 2 lines will load the data from the files. That will return an array of floats, which comes with the warning: comparing one float to another is prone to bugs. In this case it will work though, because both the columns in fits1.csv and the first 2 columns in fits2.csv are integers and parsed in the same manner by genfromtxt.
Then, in the for-loop the variable condition is created, which marks every row of fits2 whose first two columns match the current row of fits1 (the result is a boolean array).
Then, finally, for the current row index ind, set the value of the array summed to the sum of all the values in column 3 of fits2, where the condition was True.
For a mini example I made, I got this:
oliver@armstrong:/tmp/sto$ cat fits1.csv
1 2
1 3
2 4
oliver@armstrong:/tmp/sto$ cat fits2.csv
1 2 .1
1 2 .3
1 2 .9
2 4 .3
1 5 .5
2 4 .7
# run the above code:
# summed is:
# array([ 1.3, 0. , 1. ])
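As an aside on the Counter idea from the question: Counter's constructor cannot take weights, but its values may be floats, so you can accumulate column E by hand (a sketch, assuming A, B, C, D, E are the parsed columns as in the quoted snippet):

from collections import Counter

file2freq = Counter()
for c, d, e in zip(C, D, E):
    file2freq[c, d] += e  # accumulate the weight instead of counting 1

for a, b in zip(A, B):
    print(a, b, file2freq[a, b])  # missing pairs report 0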
I am looking for the right approach to solve the following task (using Python):
I have a dataset which is a 2D matrix. Let's say:
1 2 3
5 4 7
8 3 9
0 7 2
From each row I need to pick one number which is not 0 (I can also make it NaN if that's easier).
I need to find the combination with the lowest total sum.
So far so easy. I take the lowest value of each row.
The solution would be:
1 x x
x 4 x
x 3 x
x x 2
Sum: 10
But: there is a variable minimum and maximum sum allowed for each column. So just choosing the minimum of each row may lead to an invalid combination.
Let's say min is defined as 2 in this example, no max is defined. Then the solution would be:
1 x x
5 x x
x 3 x
x x 2
Sum: 11
I need to choose 5 in row two as otherwise column one would be below the minimum (2).
I could use brute force and test all possible combinations. But due to the amount of data that needs to be analyzed (the number of data sets, not the size of each data set), that's not possible.
Is this a common problem with a known mathematical/statistical or other solution?
Thanks
Robert
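A hedged aside: this reads as a small integer linear program (an assignment-style problem with column-sum side constraints), so a MILP solver handles it without brute force. Below is a sketch using scipy.optimize.milp (available in SciPy >= 1.9), with the example matrix and the per-column minimum of 2; the names (M, col_min, row_A, col_A) are illustrative, not from the question.

import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

M = np.array([[1, 2, 3],
              [5, 4, 7],
              [8, 3, 9],
              [0, 7, 2]], dtype=float)
n_rows, n_cols = M.shape
col_min = 2.0  # minimum allowed column sum (from the example; a max would just add an ub)

c = M.ravel()  # one binary pick-variable per cell; minimise the sum of picked values

# Each row picks exactly one cell.
row_A = np.zeros((n_rows, n_rows * n_cols))
for i in range(n_rows):
    row_A[i, i * n_cols:(i + 1) * n_cols] = 1

# Each column's picked values must sum to at least col_min.
col_A = np.zeros((n_cols, n_rows * n_cols))
for j in range(n_cols):
    col_A[j, j::n_cols] = M[:, j]

res = milp(
    c,
    constraints=[LinearConstraint(row_A, lb=1, ub=1),
                 LinearConstraint(col_A, lb=col_min, ub=np.inf)],
    integrality=np.ones(n_rows * n_cols),
    # Forbid picking 0-cells by clamping their variables to 0.
    bounds=Bounds(0, (M.ravel() != 0).astype(float)),
)
picks = res.x.reshape(n_rows, n_cols).round().astype(bool)  # assumes res.success
print(M[picks].sum())  # 11.0 -- matches the expected solution above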