I need to find out how many of the first rows of a dataframe are needed to make up (just over) 50% of the sum of a column's values.
Here's an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 1), columns=list("A"))
0 0.681991
1 0.304026
2 0.552589
3 0.716845
4 0.559483
5 0.761653
6 0.551218
7 0.267064
8 0.290547
9 0.182846
therefore
sum_of_A = df["A"].sum()
4.868260213425804
and with this example I need to find, starting from row 0, how many rows I need to get a sum of at least 2.43413 (approximating 50% of sum_of_A).
Of course I could iterate through the rows and sum and break when I get over 50%, but is there a more concise/Pythonic/efficient way of doing this?
I would use .cumsum(). The expression below selects every row where the running total is still below half of the column sum, so the number of rows you need is that count plus one:
df[df["A"].cumsum() < df["A"].sum() / 2]
Related
Right now I am using this python code using pandas Library
grouped = df.groupby('EmployeeID')
temp = grouped.apply(lambda x: x.sample(frac=0.1))
Scenario 1: If there are 15 rows for EmployeeID: 1, I will get 2 sample rows as a result.
(15 rows *10%)
Scenario 2: If there are 12 rows for employeeID 2, I will get 1 sample row. (12 rows * 10%)
My question is: for scenario 2, how do I round up so that I get 2 rows instead of 1? In other words, treat the 12 rows as if they were 20, so that 20 rows * 10% = 2 rows.
IIUC you can use math.ceil like so:
from math import ceil
grouped = df.groupby('EmployeeID')
temp = grouped.apply(lambda g: g.sample(n=ceil(0.1 * len(g))))
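A quick check of the rounding behaviour for the two hypothetical group sizes from the question:
from math import ceil
print(ceil(0.1 * 15), ceil(0.1 * 12))  # 2 2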
I have a pandas dataframe similar to this structure:
a b c
1 0 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 0 0 0
I want to know whether the sum of each row is != 0, so I tried a for loop that iterates over the rows, sums each one with the built-in .sum() function, and checks the condition.
The problem is that 99% of the data (>200,000 records) is filled with 0s, and my goal is to find the indices of the rows whose sum is > 0.
I've tried this:
for x in range(len(people_killed)):
print("Checking row"+str(x))
if people_killed.iloc[x].sum() == 0:
people_killed = people_killed.drop(x, axis=0)
but it will take a long time to get through every row.
What would be the best way to do this?
Thanks a lot beforehand!
You can use sum and then find the non-zero positions as follows:
np.flatnonzero(people_killed.sum(1))
#[0, 2]
Or, if you want the rows themselves rather than their positions:
people_killed[people_killed.apply(sum, axis = 1) != 0]
A brief note on the logic of this problem: you do not need to sum every element in a row. Since all the numbers are non-negative, as soon as you find a single value greater than 0 in a row you can stop scanning it; that row's sum cannot be zero.
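A vectorized form of that idea, as a sketch (assuming all entries are non-negative, as in the example):
# True for every row that contains at least one value > 0, without summing whole rows
mask = people_killed.gt(0).any(axis=1)
people_killed[mask]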
To answer your first question (how to print the sum of the columns in each row), run:
people_killed.sum(axis=1)
The result is:
1 1
2 0
3 1
4 0
5 0
dtype: int64
The left column is the index and the right column holds the sum of each row.
And as far as your second question is concerned, note that:
people_killed.sum(axis=1).ne(0) generates a Series of bool,
answering the question: does this row have a non-zero sum?
people_killed[people_killed.sum(axis=1).ne(0)] retrieves all
rows with sum != 0 (an example of boolean indexing).
So to get your result, only one more step is needed: retrieve just the
index of these rows:
people_killed[people_killed.sum(axis=1).ne(0)].index
The result is Int64Index([1, 3], dtype='int64'), so it is a list
of index values of the "wanted" rows, not integer positions of these rows
(as the solution by Ehsan generates).
My solution computes just what you asked for: indices.
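Putting the pieces together as a small runnable sketch, using the example frame from the question:
import pandas as pd

people_killed = pd.DataFrame(
    {"a": [0, 0, 1, 0, 0], "b": [1, 0, 0, 0, 0], "c": [0, 0, 0, 0, 0]},
    index=[1, 2, 3, 4, 5],
)
# index labels of the rows whose sum is non-zero -> Int64Index([1, 3])
print(people_killed[people_killed.sum(axis=1).ne(0)].index)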
I have a dataframe (df) with 2 columns:
Out[2]:
0 1
0 1 2
1 4 5
2 3 6
3 10 12
4 1 2
5 4 5
6 3 6
7 10 12
I would like to calculate, for every element of df[0], a function of that element and the df[1] column:
def custom_fct_2(x,y):
res=stats.percentileofscore(y.values,x.iloc[-1])
return res
I get the following error: TypeError:
("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats
def custom_fct_2(x,y):
res=stats.percentileofscore(y.values,x.iloc[-1])
return res
df= pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc']=df.rolling(3).apply(custom_fct_2(df[0],df[1]))
Can someone help me with that? (I am new to Python.)
Out[2]:
0 1
...
5 4 5
6 3 6
7 10 12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that rolling().apply() cannot give you a segment of 3 rows across all the columns at once. Instead, it passes you a Series for column 0 first, and then a Series for column 1.
Maybe there are better solutions, but here is one that at least works.
from scipy import stats
import pandas as pd

df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def custom_fct_2(s):
    # s is a 3-row rolling window (a Series) taken from column 1
    score = df[0][s.index.values[1]]  # use .values[-1] if you want the last element instead
    a = s.values
    return stats.percentileofscore(a, score)
I'm using the same data you provided, but I modified your custom_fct_2() function. Here s is a Series of 3 rolling values from column 1. Fortunately, this Series keeps its index, so we can get the score from column 0 via the "middle" index of the window. BTW, in Python [-1] means the last element of a collection, but from your explanation I believe you actually want the middle one.
Then, apply the function.
# remove the shift() if you want the result aligned to the last value of the rolling window
df['perc'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift is optional. It depends on your requirements: whether perc should be aligned with the row whose column-0 score was ranked (the middle one) or with the last row of the rolling window on column 1. From your description, I assume you need it.
I have a pandas Data Frame which is a 50x50 correlation matrix. In the following picture you can see what I have as an example
What I would like to do, if it's possible of course, is to make a new data frame which has only the elements of the old one that are higher than 0.5 or lower than -0.5, indicating a strong linear relationship, but not 1, to avoid the variance parts.
I don't think what I'm asking is exactly possible because, of course, variable x0 won't have the same strong relationships that x1 has, etc., so the new data frame won't look very clean.
But is there any way to scan fast through this data frame, find the values I mentioned and maybe at least insert them into an array?
Any insight would be helpful. Thanks
You can't really keep the square correlation-matrix shape if you want to drop the pairs whose correlation is too weak. One thing you could do is stack the frame and keep only the relevant correlation pairs.
having (randomly generated as an example):
0 1 2 3 4
0 0.038142 -0.881054 -0.718265 -0.037968 -0.587288
1 0.587694 -0.135326 -0.529463 -0.508112 -0.160751
2 -0.528640 -0.434885 -0.679416 -0.455866 0.077580
3 0.158409 0.827085 0.018871 -0.478428 0.129545
4 0.825489 -0.000416 0.682744 0.794137 0.694887
you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 5)))
df = df.stack()
df = df[((df > 0.5) | (df < -0.5)) & (df != 1)]
0 1 -0.881054
2 -0.718265
4 -0.587288
1 0 0.587694
2 -0.529463
3 -0.508112
2 0 -0.528640
2 -0.679416
3 1 0.827085
4 0 0.825489
2 0.682744
3 0.794137
4 0.694887
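On a real correlation matrix (symmetric, with 1.0 on the diagonal) you may also want to keep each pair only once; a sketch of that, assuming your 50x50 matrix is called corr_df:
import numpy as np

# keep only the upper triangle (k=1 drops the diagonal of exact 1.0),
# then stack and keep the pairs with |r| > 0.5
upper = corr_df.where(np.triu(np.ones(corr_df.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() > 0.5]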
I want to implement a calculation for a simple scenario:
value computed as the sum of daily data during the previous N days (set N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
and compute something like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Now I have done this with a for loop, like:
N = 3
df['new'] = [0] * len(df)
for idx in df.index:
    loc = df.index.get_loc(idx)
    if (loc - N) >= 0:
        # sum of the previous N rows (positions loc-N .. loc-1)
        total = df['value'].iloc[loc - N:loc].sum()
    else:
        total = 0
    df.loc[idx, 'new'] = total
But when the dataframe is long or N is large, this calculation becomes very slow. How can I implement it faster, with a built-in function or some other approach?
Also, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values, excluding the current one, you can use rolling_apply over a window of four and sum up all but the last value.
new = rolling_apply(df, 4, lambda x:sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)