I have a dataframe df of the form
         class_1_frequency  class_2_frequency
group_1                 20                 10
group_2                 60                 25
...
group_n                 50                 15
Suppose class_1 has a total of 70 members and class_2 has 30.
For each row (group_1, group_2,..group_n) I want to create contingency tables (preferably dynamically) and then carry out a chisquare test to evaluate p-values.
For example, for group_1, the contingency table under the hood would look like:
                 class_1  class_2
group_1_present       20       10
group_1_absent     70-20    30-10
Also, I know scipy.stats.chi2_contingency() is the appropriate function for the chi-square test, but I have not been able to apply it in my context. I have looked at previously discussed questions such as: here and here.
What is the most efficient way to achieve this?
You can take advantage of the apply function on pd.DataFrame. It allows you to apply arbitrary functions to the columns or rows of a DataFrame. Using your example:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([[20, 10], [60, 25], [50, 15]])
To produce the contingency tables, you can use a lambda and some vector operations:
>>> members = np.array([70, 30])
>>> df.apply(lambda x: np.array([x, members-x]), axis=1)
0 [[20, 10], [50, 20]]
1 [[60, 25], [10, 5]]
2 [[50, 15], [20, 15]]
And this can of course be wrapped with the scipy function.
df.apply(lambda x: chi2_contingency(np.array([x, members-x])), axis=1)
This produces all possible return values, but by slicing the output you can keep only the values you want, leaving out e.g. the expected-frequency arrays. The resulting Series can also be converted to a DataFrame.
>>> s = df.apply(lambda x: chi2_contingency(np.array([x, members-x]))[:-1], axis=1)
>>> s
0 (0.056689342403628114, 0.8118072280034329, 1)
1 (0.0, 1.0, 1)
2 (3.349031920460492, 0.06724454934343391, 1)
dtype: object
>>> s.apply(pd.Series)
          0         1    2
0  0.056689  0.811807  1.0
1  0.000000  1.000000  1.0
2  3.349032  0.067245  1.0
Now I don't know about the execution efficiency of this approach, but I'd trust the people who implemented these functions, and most likely speed is not that critical here. It is at least efficient in the sense that it is easy to understand and fast to write.
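For completeness, here is a minimal end-to-end sketch that keeps the group labels and collects the statistic, p-value and degrees of freedom into named columns (the group labels and the column names chi2/p_value/dof are placeholders I chose, not from the question):
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame(
    {"class_1_frequency": [20, 60, 50], "class_2_frequency": [10, 25, 15]},
    index=["group_1", "group_2", "group_n"],
)
members = np.array([70, 30])  # total size of class_1 and class_2

def row_chi2(row):
    # Build the 2x2 contingency table for this group and run the test,
    # keeping only the statistic, p-value and degrees of freedom
    table = np.array([row.values, members - row.values])
    chi2, p, dof, _ = chi2_contingency(table)
    return pd.Series({"chi2": chi2, "p_value": p, "dof": dof})

results = df.apply(row_chi2, axis=1)
print(results)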
Related
I currently have a file where I create a hierarchy from the product and calculate the percentage split based on the previous level.
My code looks like this:
data = [['product1', 'product1a', 'product1aa', 10],
['product1', 'product1a', 'product1aa', 5],
['product1', 'product1a', 'product1aa', 15],
['product1', 'product1a', 'product1ab', 10],
['product1', 'product1a', 'product1ac', 20],
['product1', 'product1b', 'product1ba', 15],
['product1', 'product1b', 'product1bb', 15],
['product2', 'product2_a', 'product2_aa', 30]]
df = pd.DataFrame(data, columns = ["Product_level1", "Product_Level2", "Product_Level3", "Qty"])
prod_levels = ["Product_level1", "Product_Level2", "Product_Level3"]
df = df.groupby(prod_levels).sum("Qty")
df["Qty ratio"] = df["Qty"] / df["Qty"].sum(level=prod_levels[-2])
print(df)
This gives me this as a result:
Qty Qty ratio
Product_level1 Product_Level2 Product_Level3
product1 product1a product1aa 30 0.500000
product1ab 10 0.166667
product1ac 20 0.333333
product1b product1ba 15 0.500000
product1bb 15 0.500000
product2 product2_a product2_aa 30 1.000000
According to my version of pandas (1.3.2), I'm getting a FutureWarning that level is deprecated and that I should use a groupby instead.
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum()
Unfortunately, I cannot seem to figure out the correct syntax to get the same results using groupby, to make sure this will keep working with future versions of pandas. I've tried variations of what's below, but none worked.
df["Qty ratio"] = df.groupby(["Product_level1", "Product_Level2", "Product_Level3"]).sum("Qty") / df.groupby(level=prod_levels[-1]).sum("Qty")
Can anyone suggest how I could approach this?
Thank you
The level keyword on many aggregation functions was deprecated in pandas 1.3; see Deprecate: level parameter for aggregations in DataFrame and Series #39983.
The following functions are affected:
any
all
count
sum
prod
max
min
mean
median
skew
kurt
sem
var
std
mad
The level argument was always rewritten internally into a groupby operation. For this reason, it was deprecated to increase clarity and reduce redundancy in the library.
The general pattern is: whatever level arguments were passed to the aggregation should be moved to groupby instead.
Sample Data:
import pandas as pd
df = pd.DataFrame(
    {'A': [1, 1, 2, 2],
     'B': [1, 2, 1, 2],
     'C': [5, 6, 7, 8]}
).set_index(['A', 'B'])
     C
A B
1 1  5
  2  6
2 1  7
  2  8
With aggregate over level:
df['C'].sum(level='B')
B
1 12
2 14
Name: C, dtype: int64
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead.
This now becomes groupby over level:
df['C'].groupby(level='B').sum()
B
1 12
2 14
Name: C, dtype: int64
In this specific example:
df["Qty ratio"] = df["Qty"] / df["Qty"].sum(level=prod_levels[-2])
Becomes
df["Qty ratio"] = df["Qty"] / df["Qty"].groupby(level=prod_levels[-2]).sum()
(just move the level argument to groupby)
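Applied to your example, a variant that sidesteps any index-alignment subtleties is transform('sum'), which broadcasts each group sum back to the original rows (just a sketch of an alternative spelling; the one-line fix above is enough on its own):
import pandas as pd

data = [['product1', 'product1a', 'product1aa', 10],
        ['product1', 'product1a', 'product1aa', 5],
        ['product1', 'product1a', 'product1aa', 15],
        ['product1', 'product1a', 'product1ab', 10],
        ['product1', 'product1a', 'product1ac', 20],
        ['product1', 'product1b', 'product1ba', 15],
        ['product1', 'product1b', 'product1bb', 15],
        ['product2', 'product2_a', 'product2_aa', 30]]
prod_levels = ["Product_level1", "Product_Level2", "Product_Level3"]
df = pd.DataFrame(data, columns=prod_levels + ["Qty"])
df = df.groupby(prod_levels).sum()

# transform('sum') returns a Series aligned to df's index, so the division lines up row by row
df["Qty ratio"] = df["Qty"] / df.groupby(level=prod_levels[-2])["Qty"].transform("sum")
print(df)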
I have a data frame called df and would like to add a column "Return" based on the existing columns, using a lambda function. For each row, if the value of "Field_3" is < 50, then the "Return" value should be the value of "Field_1"; otherwise it should be the "Field_2" value. My code raised a ValueError: Wrong number of items passed 7, placement implies 1. I'm a Python beginner; any help would be appreciated.
values_list = [[15, 2.5, 100], [20, 4.5, 50], [25, 5.2, 80],
[45, 5.8, 48], [40, 6.3, 70], [41, 6.4, 90],
[51, 2.3, 111]]
df = pd.DataFrame(values_list, columns=['Field_1', 'Field_2', 'Field_3'])
df["Return"] = df["Field_3"].apply(lambda x: df['Field_1'] if x < 50 else df['Field_2'])
The syntax here is a little tricky. You want:
def col_sorter(x, y, z):
    if z < 50:
        return x
    else:
        return y
df['return'] = df[['Field_1', 'Field_2', 'Field_3']].apply(lambda x: col_sorter(*x), axis=1)
out:
   Field_1  Field_2  Field_3  return
0       15      2.5      100     2.5
1       20      4.5       50     4.5
2       25      5.2       80     5.2
3       45      5.8       48    45.0
4       40      6.3       70     6.3
5       41      6.4       90     6.4
6       51      2.3      111     2.3
Here's what's going on:
Define a function col_sorter that takes the values from one row of the dataframe and does what you want with them. (This isn't strictly necessary, but it's a good habit to form, since not every transformation you want will be as simple as this one, and this syntax scales.)
Then call apply on the columns you want from the dataframe with axis=1, and unpack each row into your function via the lambda (that's what the *x is doing).
This pattern lets you create a new column from arbitrarily complex calculations based on other columns in the dataframe, and while it is not as fast as vectorization, it is still reasonably fast.
For a lot more detail and discussion, see here.
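As an aside, for a simple two-way choice like this one, numpy's where gives a fully vectorized alternative (a small sketch, not a replacement for the pattern above):
import numpy as np
import pandas as pd

values_list = [[15, 2.5, 100], [20, 4.5, 50], [25, 5.2, 80],
               [45, 5.8, 48], [40, 6.3, 70], [41, 6.4, 90],
               [51, 2.3, 111]]
df = pd.DataFrame(values_list, columns=['Field_1', 'Field_2', 'Field_3'])

# Take Field_1 where Field_3 < 50, otherwise Field_2, in one vectorized call
df['Return'] = np.where(df['Field_3'] < 50, df['Field_1'], df['Field_2'])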
Welcome to Python and Stack Overflow!
So I'm slicing my timeseries data, but for some of the columns I need to be able to get the sum of the elements that were sliced over. For example, if you had
s = pd.Series([10, 30, 21, 18])
s = s[::-2]
I need to get the sum of a range of elements, so in this situation I would need
3    39
1    40
as the output. I've seen things like .cumsum(), but I can't find anything to sum a range of elements.
I don't quite understand what the first column represents, but the second column seems to be the sum result.
If you have the correct slice, it's easy to get the sum with sum(), like this:
import numpy as np
import pandas as pd
data = np.arange(0, 10).reshape(-1, 2)
pd.DataFrame(data).iloc[2:].sum(axis=1)
Output is:
2 9
3 13
4 17
dtype: int64
The answer based only on your title would be df[-15:].sum(), but it seems you're actually looking to perform a calculation per slice.
To address this problem, pandas provides the window utilities. So, you can simply do:
s = pd.Series([10, 30, 21, 18])
s.rolling(2).sum()[::-2].astype(int)
which returns:
3 39
1 40
dtype: int64
Also, it's scalable, since you can replace 2 with any other window size, and the .rolling method also works on DataFrame objects.
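To illustrate the DataFrame case (a minimal sketch with made-up columns):
import pandas as pd

df = pd.DataFrame({"a": [10, 30, 21, 18], "b": [1, 2, 3, 4]})
# Rolling sums are computed column-wise, then every second row is taken in reverse
print(df.rolling(2).sum()[::-2])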
The title may come across as confusing (honestly, not quite sure how to summarize it in a sentence), so here is a much better explanation:
I'm currently handling a DataFrame A of different attributes, and I used a .groupby().count() on the age column to create a list of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values resulting from an element-wise division. The division I would like to perform is each element's value divided by the sum of all the elements whose index is greater than or equal to that element's index. In other words, for example, for age 3 it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation should apply to all the elements. Ideally, the outcome should be in the similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np
ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = sum(ages[k:])
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The coefficients initialization at 0 and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
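Putting that edit together, the whole computation collapses to (same numbers as above):
import numpy as np

ages = np.asarray([316, 249, 221, 219, 262])
# Reversed cumulative sum: element k holds ages[k:].sum()
coefficients = np.flip(np.cumsum(np.flip(ages)))
output = ages / coefficients  # array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1.])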
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
No reason to use numpy, pandas already includes everything we need.
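Since A itself isn't shown, here is a self-contained toy version of the same idea (the counts below are made up and stand in for A['age'].value_counts()):
import pandas as pd

A_sub = pd.Series([316, 249, 221, 219, 262], index=[1, 2, 3, 4, 5], name="age")
A_sub = A_sub.sort_index(ascending=False)
# Each count divided by the running total of counts with index >= its own
print((A_sub / A_sub.cumsum()).iloc[::-1])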
A_sub seems to be a Series with age as the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but it can easily be modified to work with DataFrames.
import numpy as np
import pandas as pd

s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")
I'm trying to do some data crunching in Python and I have a nested loop that does some arithmetic calculations. The inner loop is executed 20,000 times, so the following piece of code takes a long time:
for foo in foo_list:
    # get bar_list for foo
    for bar in bar_list:
        # do calculations w/ foo & bar
Could this loop be faster using Numpy or Scipy?
Use Numpy:
import numpy as np
foo = np.array(foo_list)[:,None]
bar = np.array(bar_list)[None,:]
Then
foo + bar
or some other operation creates an array of shape (len(foo_list), len(bar_list)) with the respective results.
Example:
>>> foo_list = [10, 20, 30]
>>> bar_list = [4, 5]
>>> foo = np.array(foo_list)[:,None]
>>> bar = np.array(bar_list)[None,:]
>>> 2 * foo + bar
array([[24, 25],
[44, 45],
[64, 65]])
I've used numpy for image processing. Before, I was using for(x in row) { for y in column } (or vice versa, you get the idea).
That was fine for small images, but it would happily consume RAM. Instead I switched to numpy.array. Much faster.
Depending on what is actually going on in your loop, yes.
numpy provides arrays and matrices, which support indexing that makes your code execute faster and, in some cases, can eliminate looping entirely.
Indexing example:
import magic_square as ms
a = ms.magic(5)
print(a)  # a is an array
[[17 24 1 8 15]
[23 5 7 14 16]
[ 4 6 13 20 22]
[10 12 19 21 3]
[11 18 25 2 9]]
# Indexing example.
b = a[a[:,1]>10]*10
print(b)
[[170, 240, 10, 80, 150],
[100, 120, 190, 210, 30],
[110, 180, 250, 20, 90]]
It should be clear how indexing can substantially improve your speed when analyzing one or more arrays. It's a powerful tool...
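If you don't have a magic_square module available, the same boolean-indexing idea can be tried on a plain array (a sketch using the matrix printed above):
import numpy as np

a = np.array([[17, 24,  1,  8, 15],
              [23,  5,  7, 14, 16],
              [ 4,  6, 13, 20, 22],
              [10, 12, 19, 21,  3],
              [11, 18, 25,  2,  9]])

# Keep only the rows whose second column is greater than 10, then scale them by 10
b = a[a[:, 1] > 10] * 10
print(b)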
If these are aggregation statistics, consider using Python Pandas. For example, if you want to do something to all the different (foo, bar) pairs, you can just group-by those items and then apply the vectorized NumPy operations:
import pandas, numpy as np
df = pandas.DataFrame(
{'foo':[1,2,3,3,5,5],
'bar':['a', 'b', 'b', 'b', 'c', 'c'],
'colA':[1,2,3,4,5,6],
'colB':[7,8,9,10,11,12]})
print(df.to_string())
# Computed average of 'colA' weighted by values in 'colB', for each unique
# group of (foo, bar).
weighted_avgs = df.groupby(['foo', 'bar']).apply(lambda x: (1.0*x['colA']*x['colB']).sum()/x['colB'].sum())
print(weighted_avgs.to_string())
This prints the following for just the data object:
bar colA colB foo
0 a 1 7 1
1 b 2 8 2
2 b 3 9 3
3 b 4 10 3
4 c 5 11 5
5 c 6 12 5
And this is the grouped, aggregated output
foo bar
1 a 1.000000
2 b 2.000000
3 b 3.526316
5 c 5.521739