Regarding getting the rowsum of a python data frame - python

I have following python data frame
data={'1':[1,1,1,1],'2':[1,1,1,1],'3':[1,1,1,1]}
df=pd.DataFrame(data)
I need to get the sum of the rows in such a way that my final output looks like this:
In this desired output, the second column should contain the row sum up to the second column of the original data frame, and so on.
To get this output, I wrote the following code,
sum_mat=np.zeros(shape=(3,3))
numOfIteration=3
itr=list(range(0,numOfIteration))
for i in range(0,3):
    for j in range(0,3):
        while i <= itr[i]:
            sum_mat[i,j]+= df.iloc[i,j]
print (sum_mat)
I am not getting any output because the code runs forever (maybe an infinite loop).
Can anyone suggest how to get the desired output?
Maybe there is a more efficient and easier way to do the same thing.
Thank you
UPDATE:
I updated the for loop as follows:
for i in range(0,3):
    for j in range(0,3):
        while i <= itr[i]:
            sum_mat[i,j] = df.iloc[:,0:i].sum(axis=1)
but it gives the following error:
sum_mat[i,j] = df.iloc[:,0:i].sum(axis=1)
ValueError: setting an array element with a sequence.

This could also work:
for i,row in df.iterrows(): #go through each row
    df.loc[i]=df.loc[i].cumsum() #assign each row as the cumulative sum of the row
output:
>>> df
1 2 3
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
EDIT
You can just do:
df=df.cumsum(axis=1)

sum_mat=np.zeros(shape=(3,3))
numOfIteration=3
itr=list(range(0,numOfIteration))
for i in range(0,3):
    for j in range(0,3):
        if j==0:
            sum_mat[i,0]=df.iloc[i,0]
        else:
            sum_mat[i,j]=df.iloc[i,j]+sum_mat[i,j-1]
print (sum_mat)
This should work
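The nested loops above build a running total across each row. A minimal sketch of the same idea without explicit loops, assuming the df from the question and swapping in NumPy's cumsum along axis=1 (note that it covers all four rows of df, whereas the loop above only fills a 3x3 matrix):

```python
import numpy as np
import pandas as pd

data = {'1': [1, 1, 1, 1], '2': [1, 1, 1, 1], '3': [1, 1, 1, 1]}
df = pd.DataFrame(data)

# Running total across the columns of each row (axis=1)
sum_mat = np.cumsum(df.to_numpy(), axis=1)
print(sum_mat)
```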

Use the cumsum() function with axis=1 to find the cumulative sum of the values seen so far along the column axis.
Example:
import pandas as pd
data = {'1': [1, 1, 1, 1], '2': [1, 1, 1, 1], '3': [1, 1, 1, 1]}
df = pd.DataFrame(data)
print("before")
print(df)
df = df.cumsum(axis=1)
print("after")
print(df)
Output:
before
1 2 3
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
after
1 2 3
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3

Related

Combining looping and conditional to make new columns on dataframe

I want to make a function with a loop and a conditional that counts only when Actual Result = 1.
So the number should increase by 1 each time Actual Result = 1.
This is my dataframe:
This is my code, but it doesn't produce the result that I want:
def func_count(x):
    for i in range(1,880):
        if x['Actual Result']==1:
            result = i
        else:
            result = '-'
        return result
X_machine_learning['Count'] = X_machine_learning.apply(lambda x:func_count(x),axis=1)
When I check and filter with count != '-', the result looks like this:
The number is always equal to 1 and does not increase by 1 every time the actual result = 1. Any solution?
Try something like this:
import pandas as pd
df = pd.DataFrame({
'age': [30,25,40,12,16,17,14,50,22,10],
'actual_result': [0,1,1,1,0,0,1,1,1,0]
})
count = 0
lst_count = []
for i in range(len(df)):
    if df['actual_result'][i] == 1:
        count+=1
        lst_count.append(count)
    else:
        lst_count.append('-')
df['count'] = lst_count
print(df)
Result
age actual_result count
0 30 0 -
1 25 1 1
2 40 1 2
3 12 1 3
4 16 0 -
5 17 0 -
6 14 1 4
7 50 1 5
8 22 1 6
9 10 0 -
Actually, you don't need to loop over the dataframe; looping is usually a pandas anti-pattern that is best avoided. With df as your dataframe, you could try the following instead:
m = df["Actual Result"] == 1
df["Count"] = m.cumsum().where(m, "-")
Result for the following dataframe
df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
is
Actual Result Count
0 1 1
1 1 2
2 0 -
3 1 3
4 1 4
5 1 5
6 0 -
7 0 -
8 1 6
9 0 -
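Mixing integers with the string "-" makes the Count column an object column, which can get in the way of later arithmetic. A minimal sketch of a variant that keeps the column numeric, assuming missing values (NaN) are an acceptable placeholder instead of "-":

```python
import pandas as pd

df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})

m = df["Actual Result"] == 1
# Without a replacement value, where() fills the non-matching rows with NaN,
# so the column ends up with a float dtype rather than object.
df["Count"] = m.cumsum().where(m)
print(df)
```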

How to create a new column through a specific condition?

I have a column like this:
1
0
0
0
0
1
0
0
0
1
0
0
and need as result:
1
1
1
1
1
2
2
2
2
3
3
3
I need a method/algorithm that splits the column into groups, each running from one 1 up to the next 1, and numbers those groups successively.
Any idea?
You can loop through the list and use a counter to update the column value, incrementing it every time you find the number 1.
def rank(lst):
    counter = 0
    for i, column in enumerate(lst):
        if column == 1:
            counter+=1
        lst[i] = counter
def fill_arr(arr):
    curr = 1
    for i in range(1, len(arr)):
        arr[i] = curr
        if i < len(arr)-1 and arr[i+1] == 1:
            curr += 1
    return arr
A quick test
arr = [1,0,0,0,1,0,0,0,1,0,0]
fill_arr(arr)
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
The idea is as follows:
Keep track of the number of 1s we have encountered in curr by looking ahead, and increment it as we see new 1s.
Set the element at the current index to curr.
Start at index 1, since we know that there is a 1 at index zero. This reduces edge cases and makes the algorithm easier to manage.
What you are looking for is usually called the cumulative sum (or running total); as a verb, you want to accumulate the values in the list.
For a python list:
import itertools
l1 = [1,0,0,0,1,0,0,0,1,0,0]
l2 = list(itertools.accumulate(l1))
print(l2)
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
For a numpy array:
import numpy
a1 = numpy.array([1,0,0,0,1,0,0,0,1,0,0])
a2 = a1.cumsum()
print(a2)
# [1 1 1 1 2 2 2 2 3 3 3]
For a column in a pandas dataframe:
import pandas
df = pandas.DataFrame({'col1': [1,0,0,0,1,0,0,0,1,0,0]})
df['col2'] = df['col1'].cumsum()
print(df)
# col1 col2
# 0 1 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 1 2
# 5 0 2
# 6 0 2
# 7 0 2
# 8 1 3
# 9 0 3
# 10 0 3
Documentation:
itertools.accumulate;
numpy.cumsum;
numpy.ndarray.cumsum;
pandas.DataFrame.cumsum;
pandas.Series.cumsum.

Assign a value to the first row of a sub-dataframe

I have a Pandas dataframe that looks like this:
df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})
gp_id A
0 1 1
1 2 2
2 1 3
3 2 4
I want to assign the value -1 to the first row of the group with the id 2 (gp_id = 2), to get the following output:
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
To do this, I've tried the following code:
df[df.gp_id == 2].A.iloc[0] = -1
But this doesn't do anything as I'm assigning a value in the sub-dataframe df[df.gp_id == 2] and I'm not modifying the original dataframe df.
Is there an easy way to solve this problem?
You could do:
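df.loc[(df.gp_id == 2).idxmax(), 'A'] = -1
as pd.Series.idxmax returns the index label of the first maximum (in recent pandas versions, argmax returns an integer position instead, which is not what .loc expects).
If you are not sure that the value is present in the dataframe, you could do:
cond = (df.gp_id == 2)
if cond.sum():
    df.loc[cond.idxmax(), 'A'] = -1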
A general solution, for when the mask may match no rows, is to chain another mask built from the cumulative sum of the first mask (combined with & for bitwise AND) and set the values with DataFrame.loc:
m = df.gp_id == 2
df.loc[m & (m.cumsum() == 1), 'A'] = -1
This works well when there is no match: nothing is assigned, no error is raised, and no incorrect assignment happens:
m = df.gp_id == 7
df.loc[m & (m.cumsum() == 1), 'A'] = -1
If the mask always matches at least one row, a solution is:
idx = df[df.gp_id == 2].index[0]
df.loc[idx, 'A'] = -1
print (df)
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
If there is no match, this solution raises an error rather than making an incorrect assignment.
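The mask-based approach can be wrapped in a small function. This is only a sketch: set_first_match is a hypothetical helper name, not a pandas API, and the string index is an assumption chosen to show that the approach does not depend on a default integer index.

```python
import pandas as pd

def set_first_match(df, mask, column, value):
    """Hypothetical helper: set `column` to `value` in the first row where `mask` is True."""
    # mask.cumsum() == 1 holds from the first True up to (not including) the second True;
    # combining it with mask keeps only the first matching row.
    first = mask & (mask.cumsum() == 1)
    df.loc[first, column] = value
    return df

df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]}, index=list('abcd'))
set_first_match(df, df.gp_id == 2, 'A', -1)
set_first_match(df, df.gp_id == 7, 'A', -1)  # no match: nothing is changed
print(df)
```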

Count values in previous rows that are greater than current row value

I want to count, for each row, the number of previous rows that have a greater value than the current row in a column, and store it in a new column. It would be like a rolling countif that goes back to the beginning of the column. The desired example output below shows the given value column and the count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large dataframe so the fastest way possible is appreciated.
We can use subtract.outer from NumPy, take the lower triangle, test which values are less than 0, and sum the matches per row:
a = np.sum(np.tril(np.subtract.outer(df.Value.values,df.Value.values), k=0)<0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a
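The one-liner above can be unpacked step by step. This is a sketch using the sample values from the question; the intermediate names (diff, lower, is_greater) are only for illustration. Note that the outer difference builds an n-by-n array, so memory use grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [5, 7, 4, 12, 3, 4, 1]})
v = df['Value'].to_numpy()

diff = np.subtract.outer(v, v)  # diff[i, j] = v[i] - v[j]
lower = np.tril(diff, k=0)      # zero out j > i, keeping only earlier rows and the diagonal
is_greater = lower < 0          # True where an earlier value v[j] exceeds v[i]
df['Count'] = is_greater.sum(axis=1)
print(df)
```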
IMPORTANT: this only works with pandas < 1.0.0 and the error seems to be a pandas bug. An issue is already created at https://github.com/pandas-dev/pandas/issues/35203
We can do this with expanding and applying a function which checks for values that are higher than the last element in the expanding array.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0))).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
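On newer pandas versions, one possible workaround is to pass raw=True so that the function receives a plain NumPy array, for which x[-1] is ordinary positional indexing. This is a sketch under that assumption, not something taken from the linked issue:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([5, 7, 4, 12, 3, 4, 1], columns=['Value'])

# raw=True hands each expanding window to the function as a NumPy array,
# so x[-1] is the current (last) value of the window.
df['Count'] = (
    df['Value']
    .expanding(1)
    .apply(lambda x: np.sum(x > x[-1]), raw=True)
    .astype('int')
)
print(df)
```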
values = df['Value'].tolist()  # assuming the Value column from the question
counts = []
for i in range(len(values)):
    count = 0
    for j in values[:i]:
        if values[i] < j:
            count += 1
    counts.append(count)
The below generator will do what you need. You may be able to further optimize this if needed.
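def generator(data):
    i = 0
    count_dict = {}
    m = max(data)  # largest value; this approach assumes integer data
    while i < len(data):
        v = data[i]
        count_dict[v] = count_dict[v] + 1 if v in count_dict else 1
        # count earlier values in v+1..m inclusive, so the maximum itself is counted
        t = sum([(count_dict[j] if j in count_dict else 0) for j in range(v + 1, m + 1)])
        i += 1
        yield t
d = [1, 5, 7, 3, 5, 8]
foo = generator(d)
result = [b for b in foo]
print(result)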
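One possible optimization, swapped in here as an alternative technique rather than taken from the answers above, is to keep the values seen so far in a sorted list and use binary search from the standard bisect module. The function name count_previous_greater is hypothetical. Each lookup is a binary search, although inserting into a Python list still shifts elements, so this is a sketch rather than a guaranteed fastest solution.

```python
import bisect

def count_previous_greater(values):
    """For each value, count how many earlier values are strictly greater."""
    seen = []    # earlier values, kept in sorted order
    counts = []
    for v in values:
        # bisect_right returns how many earlier values are <= v
        pos = bisect.bisect_right(seen, v)
        counts.append(len(seen) - pos)
        bisect.insort(seen, v)
    return counts

values = [5, 7, 4, 12, 3, 4, 1]
counts = count_previous_greater(values)
print(counts)
```

The result can then be assigned to a new column, for example df['Count'] = counts.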

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to mark the positions that fall within the first 2 elements of each consecutive run (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
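To make the combination easier to follow, here is a sketch that spells out the intermediate steps on the column from the question. The names run_id, pos_in_run, and limit are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change points labels each run.
run_id = (df['A'] != df['A'].shift()).cumsum()

# Position of each row inside its run (0, 1, 2, ...)
pos_in_run = df.groupby(run_id).cumcount()

# Keep a 1 only if it is among the first `limit` elements of its run
df['B'] = ((pos_in_run < limit) & (df['A'] == 1)).astype(int)
print(df.assign(run_id=run_id, pos_in_run=pos_in_run))
```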
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy(copy=True)  # as_matrix() was removed in pandas 1.0
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
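A quick check of the convolution idea on the column from the question. This sketch uses a variant named trim_runs_copy (a hypothetical name) that works on a copy, so the input is left untouched; it assumes the values are all 0 or 1 and that cutoff is at least 1.

```python
import numpy as np

def trim_runs_copy(array, cutoff):
    """Zero out every element of a run of 1s beyond the first `cutoff` elements."""
    a = np.array(array, dtype=int)            # np.array makes a copy
    window = np.ones(cutoff + 1, dtype=int)
    # run_count[i] = number of 1s among a[i - cutoff] .. a[i]
    run_count = np.convolve(a, window)[:len(a)]
    a[run_count > cutoff] = 0
    return a

original = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
trimmed = trim_runs_copy(original, 2)
print(trimmed)
```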
