how to select partial data from a pandas dataframe - python

I have a dataframe, where flag = np.sign(value):

ticker  value  flag  cumsum(flag)
A1       -1.5    -1            -1
B1        1.4     1             0
C3       -2.4    -1            -1
D4       -1.8    -1            -2
E6       -1.6    -1            -3
I have a variable, cutoff = 1 (it is always positive; it's a modulus).
How can I best select the tickers where abs(cumsum(flag)) <= 1?
i.e., the expected output is [A1, B1, C3];
i.e., I want to keep going down the cumsum list until I find the LAST 1 or -1.
I tried a loop:

ticker_list_keep = []
for y in range(0, len(df['cumsum']), 1):
    if abs(df['cumsum'][y]) < abs(cutoff) + 1:
        ticker_list_keep.append(df.index[y])
but this would give me only A1 and C3, and would miss out B1.
Thanks
Per a note in the comments:
#Vaishali - The question is not a duplicate. I wanted ALL the values in the ticker list, up until we get to the final -1 in the cumsum list.
Above, we get to the final abs(val) = 1 at C3, so my list is A1, B1, C3.
The solution in the thread you pointed me to gives only A1 and C3.
Notice that A1 is not the final -1 in the cumsum list, so A1 alone doesn't suffice. C3 is where the final +/-1 occurs, therefore the required list is A1, B1, C3.
Thanks!!

You can find the last valid index that satisfies your condition and then create a slice up to it.
idx = df[df['cumsum(flag)'].abs() <= 1].last_valid_index()
df.loc[:idx, :]
   ticker  value  flag  cumsum(flag)
0      A1   -1.5    -1            -1
1      B1    1.4     1             0
2      C3   -2.4    -1            -1
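For a fully self-contained illustration (a sketch only: the DataFrame construction and the general cutoff variable are assumed from the question, not part of the original answer):

import numpy as np
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({'ticker': ['A1', 'B1', 'C3', 'D4', 'E6'],
                   'value': [-1.5, 1.4, -2.4, -1.8, -1.6]})
df['flag'] = np.sign(df['value']).astype(int)
df['cumsum(flag)'] = df['flag'].cumsum()

cutoff = 1  # always positive, per the question

# Last row where |cumsum(flag)| <= cutoff, then slice up to it (inclusive).
idx = df[df['cumsum(flag)'].abs() <= cutoff].last_valid_index()
ticker_list_keep = df.loc[:idx, 'ticker'].tolist()
print(ticker_list_keep)  # ['A1', 'B1', 'C3']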

Related

Custom function + groupby Pandas with different conditions on grouped by variables

I want to generate some weights using groupby on data that originally looks like this:
V1  V2  MONTH  CHOICES  PRIORITY
X   T1  M1     C1       1
X   T1  M1     C2       0
X   T1  M1     C3       0
X   T2  M1     C1       1
X   T2  M1     C5       0
X   T2  M1     C6       0
X   T2  M1     C2       1
X   T1  M2     C1       1
X   T1  M2     C2       0
X   T1  M2     C3       0
X   T2  M2     C1       0
X   T2  M2     C5       1
X   T2  M2     C6       0
X   T2  M2     C2       1
Basically, when the MONTH is different from M1, I want flagged choices to get weights equal to double those of any non-flagged choice.
Example: if you have (C1, C2, C3) and C1 is the only one flagged, the weights would be 0.5 / 0.25 / 0.25.
At the same time, for the first month, I want the weights to be focused solely on the flagged choices. The previous example would become (1 / 0 / 0).
A note about the data:
For a given tuple (V1, V2, MONTH), we can have at most two choices flagged as priorities (no priority at all is also a possibility).
Here's what I've tried:

def weights_preferences(data):
    if (data.MONTH.values != 'M1'):
        data['WEIGHTS'] = 1/(len(data) + data[data.PRIORITY==1].shape[0])
        data['WEIGHTS'] = data.apply(lambda x: 2*x.WEIGHTS if x.PRIORITY==1 else x.WEIGHTS, axis=1)
    elif data.MONTH.values == 'M1' & data[data.PRIORITY==1].shape[0]==0:
        data['WEIGHTS'] = 1/(len(data))
    else:
        if data[data.PREFERENCE==1].shape[0]==1:
            data['WEIGHTS'] = [1 if x[1].PRIORITY==1 else 0 for x in data.iterrows()]
        else:
            data['WEIGHTS'] = [0.5 if x[1].PRIORITY==1 else 0 for x in data.iterrows()]
    return data

tmp = tmp.groupby(['V1','V2','MONTH']).apply(weights_preferences)
The problem is that since I group by 'MONTH', it seems that that column no longer appears in the data on which weights_preferences is applied.
P.S.: The output would look like this:

V1  V2  MONTH  CHOICES  PRIORITY  WEIGHTS
X   T1  M1     C1       1         1
X   T1  M1     C2       0         0
X   T1  M1     C3       0         0
X   T2  M1     C1       1         0.5
X   T2  M1     C5       0         0
X   T2  M1     C6       0         0
X   T2  M1     C2       1         0.5
X   T1  M2     C1       1         0.5
X   T1  M2     C2       0         0.25
X   T1  M2     C3       0         0.25
X   T2  M2     C1       0         0.16
X   T2  M2     C5       1         0.33
X   T2  M2     C6       0         0.16
X   T2  M2     C2       1         0.33
Any suggestions are very welcome!
Thanks.
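A hedged sketch of one possible implementation of the rules above (not taken from the thread; names are illustrative, and it assumes the group keys are indeed unavailable as columns inside apply, as the question reports, so MONTH is recovered from the group's .name attribute):

import pandas as pd

def weights_preferences(group):
    # The groupby keys may not survive as columns inside the group,
    # so recover MONTH from the group's name, a (V1, V2, MONTH) tuple.
    _, _, month = group.name
    n = len(group)
    n_flagged = int((group['PRIORITY'] == 1).sum())
    if month == 'M1':
        if n_flagged == 0:
            group['WEIGHTS'] = 1 / n
        else:
            # All weight goes to the flagged choices, split evenly.
            group['WEIGHTS'] = group['PRIORITY'] / n_flagged
    else:
        # Flagged choices weigh double the non-flagged ones.
        base = 1 / (n + n_flagged)
        group['WEIGHTS'] = base * (1 + group['PRIORITY'])
    return group

tmp = tmp.groupby(['V1', 'V2', 'MONTH'], group_keys=False).apply(weights_preferences)

This reproduces the weights in the expected output: 0.5 / 0.25 / 0.25 for one flag outside M1, 1 / 0 / 0 for one flag in M1, and 0.33 / 0.16 when two of four choices are flagged outside M1.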

Finding equal variables in non solvable multi-variables linear equations

I am trying to find an algorithm to solve the following problem. I have multiple unknown variables (F1, F2, F3, ..., Fx) and (R1, R2, R3, ..., Rx) and multiple equations like this:
F1 + R1 = a
F1 + R2 = a
F2 + R1 = b
F3 + R2 = b
F2 + R3 = c
F3 + R4 = c
where a, b and c are known numbers. I am trying to find all equal variables in such equations. For example, in the equations above I can see that F2 and F3 are equal and that R3 and R4 are equal:
the first pair of equations tells us that R1 and R2 are equal, the second pair that F2 and F3 are equal, and the third pair that R3 and R4 are equal.
For a more complex scenario, is there any known algorithm that can find all equal (F and R) variables?
(I will edit the question if it is not clear enough)
Thanks
For the general situation, row echelon form is probably the way to go. If every equation has only two variables, then you can consider each variable to be in a partition. Every time two variables appear in an equation together, their partitions are joined. To begin with, each variable is in its own partition. After the first equation, there is a partition that contains F1 and R1. After the second equation, that partition is replaced by one that contains F1, R1 and R2.

You should have the variables in some sort of order, and when two partitions are joined, put all the variables except the first one in terms of the first one (it doesn't really matter how you order the variables, you just need some way of deciding which is the "first"). So, for instance, after the first equation you have R1 = a - F1, and after the second equation you have R1 = a - F1 and R2 = a - F1. Each variable can then be represented by two numbers: some number times the first variable in its partition, plus a constant. At the end, you go through each partition and look for variables that are represented by the same two numbers.
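A minimal sketch of that partition idea, assuming every equation has exactly two variables of the form x + y = k (class and variable names are illustrative, not from the answer):

from collections import defaultdict

class Partition:
    def __init__(self):
        self.parent = {}   # variable -> representative
        self.sign = {}     # variable = sign * representative + offset
        self.offset = {}

    def add(self, v):
        if v not in self.parent:
            self.parent[v], self.sign[v], self.offset[v] = v, 1, 0.0

    def find(self, v):
        # Resolve v to (root, sign, offset), compressing the path as we go.
        if self.parent[v] == v:
            return v, 1, 0.0
        root, s, c = self.find(self.parent[v])
        # v = sign[v]*parent + offset[v] and parent = s*root + c
        self.sign[v], self.offset[v] = self.sign[v] * s, self.sign[v] * c + self.offset[v]
        self.parent[v] = root
        return root, self.sign[v], self.offset[v]

    def union(self, x, y, k):
        # Record the equation x + y = k.
        self.add(x); self.add(y)
        rx, sx, cx = self.find(x)
        ry, sy, cy = self.find(y)
        if rx != ry:
            # sx*rx + cx + sy*ry + cy = k  =>  ry = sy*(k - cx - cy) - sx*sy*rx
            self.parent[ry] = rx
            self.sign[ry] = -sx * sy
            self.offset[ry] = sy * (k - cx - cy)

# The example system, with (a, b, c) = (1, 2, 3) as placeholder constants
# (with concrete numbers, coincidental equalities could also show up).
eqs = [('F1', 'R1', 1), ('F1', 'R2', 1), ('F2', 'R1', 2),
       ('F3', 'R2', 2), ('F2', 'R3', 3), ('F3', 'R4', 3)]
p = Partition()
for x, y, k in eqs:
    p.union(x, y, k)

# Variables that resolve to the same (root, sign, offset) are equal.
groups = defaultdict(list)
for v in list(p.parent):
    groups[p.find(v)].append(v)
print([vs for vs in groups.values() if len(vs) > 1])
# e.g. [['R1', 'R2'], ['F2', 'F3'], ['R3', 'R4']]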
Here's a hint: you have defined a system of linear equations with 7 variables and 6 equations. Here's a crude matrix/vector notation:
1 0 0 1 0 0 0     F1     a
1 0 0 0 1 0 0     F2     a
0 1 0 1 0 0 0  *  F3  =  b
0 0 1 0 1 0 0     R1     b
0 1 0 0 0 1 0     R2     c
0 0 1 0 0 0 1     R3     c
                  R4
If you do the Gaussian elimination manually, you can see that e.g. first row minus the second row results in
(0 0 0 1 -1 0 0) * (F1 F2 F3 R1 R2 R3 R4)^T = a - a
R1 - R2 = 0
R1 = R2
Which implies that R1 and R2 are what you call equivalent. There are many different methods to solve the system or interpret the results. Maybe you will find this thread useful: Is there a standard solution for Gauss elimination in Python?
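As a hedged illustration of that elimination idea (assuming SymPy is available; the snippet is not from the linked thread), a CAS can do the elimination and you can then group variables whose parametric expressions coincide:

import sympy as sp
from collections import defaultdict

a, b, c = sp.symbols('a b c')
variables = list(sp.symbols('F1 F2 F3 R1 R2 R3 R4'))
F1, F2, F3, R1, R2, R3, R4 = variables

eqs = [sp.Eq(F1 + R1, a), sp.Eq(F1 + R2, a),
       sp.Eq(F2 + R1, b), sp.Eq(F3 + R2, b),
       sp.Eq(F2 + R3, c), sp.Eq(F3 + R4, c)]

# Parametric solution of the underdetermined system; free variables stay symbolic.
solution, = sp.linsolve(eqs, variables)

# Variables whose expressions coincide are provably equal.
groups = defaultdict(list)
for var, expr in zip(variables, solution):
    groups[sp.simplify(expr)].append(var)

print([vs for vs in groups.values() if len(vs) > 1])
# e.g. [[R1, R2], [F2, F3], [R3, R4]] (ordering may differ)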

Find the minimal rows with maximum 1s column

I have a numpy 2D array of zeros and ones, and I want the rows that together have at least one 1 in every column. For example:
PROBLEM STATEMENT: Find a minimal set of rows that gives maximum 1s across all columns.
INPUT1:
A B C D E
t1 0 0 0 1 1
t2 0 1 1 0 1
t3 0 1 1 0 1
t4 1 0 1 0 1
t5 1 0 1 0 1
t6 1 1 1 1 0
Here, there are multiple answers like (t6, t1), (t6, t2), (t6, t3), (t6, t4), (t6, t5).
INPUT2:
A B C D E
t1 0 0 0 1 1
t2 0 1 1 0 1
t3 0 1 1 0 1
t4 1 0 1 0 1
t5 1 0 1 0 1
t6 1 1 1 1 1
Answer: t6
I don't want to use brute force method as my original matrix is very big. Is there a smart way to do this?
Naive solution, worst-case O(2^n)
This iterates over all possible choices of rows, starting with as few rows as possible, making average cases usually low-polynomial time.
from itertools import combinations
import numpy as np

def minimum_rows(arr):
    out_list = []
    rows = arr.shape[0]
    for x in range(1, rows + 1):
        for combo in combinations(range(rows), x):
            # Do the chosen rows, OR-ed together, cover every column?
            if np.logical_or.reduce(arr[list(combo)]).all():
                out_list.append(combo)
        # Return as soon as some combination of the current size works,
        # so only minimal-size answers are reported.
        if out_list:
            return out_list
I wrote this entirely on my phone without much testing, so it may or may not work. It employs no tricks, but is fairly fast. Note that it will be slower when the ratio of columns to rows is larger or the probability of a given element being True is smaller, as that makes it less likely for few rows to meet the required condition, causing x to increase, which in turn increases the number of combinations iterated through.
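For instance, a quick check against INPUT2 above (the array literal is just a transcription of that table):

arr = np.array([[0, 0, 0, 1, 1],
                [0, 1, 1, 0, 1],
                [0, 1, 1, 0, 1],
                [1, 0, 1, 0, 1],
                [1, 0, 1, 0, 1],
                [1, 1, 1, 1, 1]], dtype=bool)

print(minimum_rows(arr))  # [(5,)] -> row t6 alone already covers every column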

Add new column with existing column names

I'm dealing with a dataframe which looks like:
FID geometry Code w1 w2
0 12776 POLYGON ((-1.350000000000025 53.61540813717482... 12776 0 1
1 13892 POLYGON ((6.749999999999988 52.11964001623148,... 13892 1 0
2 14942 POLYGON ((-3.058896639907732e-14 51.3958198431... 14942 1 1
3 18964 POLYGON ((8.549999999999974 45.26941059233587,... 18964 0 1
4 19863 POLYGON ((-0.4500000000000305 44.6337746953077... 19863 0 1
My objective is to add a column, labeled 'Max', containing the name of the w column (w1 or w2) with the higher frequency.
So far I've only managed to add a column that contains the maximum frequency itself, instead of the name of the column where it appears.
The desired output would be something like this:
FID geometry Code w1 w2 Max
0 12776 ... 12776 0 1 w2
1 13892 ... 13892 1 0 w1
2 14942 ... 14942 1 1 0
3 18964 ... 18964 0 1 w2
4 19863 ... 19863 0 1 w2
Furthermore, I'd like to fill with zeros whenever the frequencies are the same, if it's possible, at the same time.
Any help would be appreciated! :-)
Use np.where to choose 0 when they are equal and idxmax(1) when they are not.
df['Max'] = np.where(df.w1 == df.w2, 0, df[['w1', 'w2']].idxmax(1))
df
FID geometry Code w1 w2 Max
0 12776 ... 12776 0 1 w2
1 13892 ... 13892 1 0 w1
2 14942 ... 14942 1 1 0
3 18964 ... 18964 0 1 w2
4 19863 ... 19863 0 1 w2
Something like this should work:
(df['w1'] == df['w2']).map({True: 0}).fillna(df[['w1', 'w2']].idxmax(axis=1))
Out[26]:
0 w2
1 w1
2 0
3 w2
4 w2
dtype: object
How it works:
The main part is with idxmax:
df[['w1', 'w2']].idxmax(axis=1)
Out[27]:
0 w2
1 w1
2 w1
3 w2
4 w2
dtype: object
This first selects the relevant columns, and returns the index of the maximum (axis=1 for columns). However, it returns the first index in case of ties.
(df['w1'] == df['w2']).map({True: 0}) fills a series with 0 when w1==w2. Remaining values are NaN. So those are filled with idxmax values.
Note: np.where is definitely the more logical (and probably faster) choice here. I just like to experiment with other alternatives.
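For completeness, a small self-contained reproduction (only the relevant columns are rebuilt; the geometry column is omitted, and the values are copied from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Code': [12776, 13892, 14942, 18964, 19863],
                   'w1':   [0, 1, 1, 0, 0],
                   'w2':   [1, 0, 1, 1, 1]})

# 0 where the counts tie, otherwise the name of the larger column.
df['Max'] = np.where(df.w1 == df.w2, 0, df[['w1', 'w2']].idxmax(axis=1))
print(df)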

conditional sums for pandas aggregate

I just recently made the switch from R to Python and have been having some trouble getting used to data frames again, as opposed to using R's data.table. The problem I've been having is that I'd like to take a list of strings, check for a value, then sum the count of that string, broken down by user. So I would like to take this data:
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
And return:
   A_id_grouped  sum_up  sum_down  ...  over_200_up
1:           a1       1         0  ...            0
2:           a2       0         1  ...            0
3:           a3       2         0  ...            1
4:           a4       0         0  ...            0
5:           a5       0         0  ...            0
Before, I did this with R code (using data.table):
> DT[, list(A_id_grouped, sum_up = sum(B == "up"),
+           sum_down = sum(B == "down"),
+           ...,
+           over_200_up = sum(B == "up" & C > 200)), by = list(A_id)]
However all of my recent attempts with Python have failed me:
DT.agg({"D": [np.sum(DT[DT["B"]=="up"]),np.sum(DT[DT["B"]=="up"])], ...
"C": np.sum(DT[(DT["B"]=="up") & (DT["C"]>200)])
})
Thank you in advance! It seems like a simple question; however, I couldn't find an answer anywhere.
To complement unutbu's answer, here's an approach using apply on the groupby object.
>>> df.groupby('A_id').apply(lambda x: pd.Series(dict(
...     sum_up=(x.B == 'up').sum(),
...     sum_down=(x.B == 'down').sum(),
...     over_200_up=((x.B == 'up') & (x.C > 200)).sum()
... )))
over_200_up sum_down sum_up
A_id
a1 0 0 1
a2 0 1 0
a3 1 0 2
a4 0 0 0
a5 0 0 0
There might be a better way; I'm pretty new to pandas, but this works:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A_id': 'a1 a2 a3 a3 a4 a5'.split(),
                   'B': 'up down up up left right'.split(),
                   'C': [100, 102, 100, 250, 100, 102]})

# Boolean flag for rows that are 'up' and over 200; summing it gives the count.
df['D'] = (df['B'] == 'up') & (df['C'] > 200)
grouped = df.groupby(['A_id'])

def sum_up(grp):
    return np.sum(grp == 'up')

def sum_down(grp):
    return np.sum(grp == 'down')

def over_200_up(grp):
    return np.sum(grp)

result = grouped.agg({'B': [sum_up, sum_down],
                      'D': [over_200_up]})

# Keep only the function names from the (column, function) MultiIndex.
result.columns = [col[1] for col in result.columns]
print(result)
yields
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
This is an old question, but I feel a better way, avoiding apply, is to create a new dataframe before grouping and aggregating:
df = df.set_index('A_id')
outcome = {'sum_up': df.B.eq('up'),
           'sum_down': df.B.eq('down'),
           'over_200_up': df.B.eq('up') & df.C.gt(200)}
outcome = pd.DataFrame(outcome).groupby(level=0).sum()
outcome
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
Another option would be to unstack before grouping; however, I feel it is a longer, unnecessary process:
(df
 .set_index(['A_id', 'B'], append=True)
 .C
 .unstack('B')
 .assign(gt_200=lambda df: df.up.gt(200))
 .groupby(level='A_id')
 .agg(sum_up=('up', 'count'),
      sum_down=('down', 'count'),
      over_200_up=('gt_200', 'sum'))
)
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
Here is what I have recently learned, using DataFrame.assign and numpy's where method:
df3=
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
df3.assign(sum_up=np.where(df3['B'] == 'up', 1, 0),
           sum_down=np.where(df3['B'] == 'down', 1, 0),
           over_200_up=np.where((df3['B'] == 'up') & (df3['C'] > 200), 1, 0)
          ).groupby('A_id', as_index=False).agg({'sum_up': sum, 'sum_down': sum, 'over_200_up': sum})
outcome=
A_id sum_up sum_down over_200_up
0 a1 1 0 0
1 a2 0 1 0
2 a3 2 0 1
3 a4 0 0 0
4 a5 0 0 0
This also resembles SQL's CASE expression, if you are familiar with it and want to apply the same logic in pandas:
select a,
sum(case when B='up' then 1 else 0 end) as sum_up
....
from table
group by a
