I am performing record linkage on a dataframe such as:
ID_1  ID_2  Predicted Link  Probability
1     0     1               0.9
1     1     1               0.5
1     2     0               0
2     1     1               0.8
2     5     1               0.8
3     1     0               0
3     2     1               0.5
When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link), I want to resolve the conflicts based on the Probability value. If one predicted link has a higher probability than the others, I want to keep a 1 for that row but set the other predicted link values for that ID_1 to 0. If the (highest) probabilities are equal, I want to set all the predicted link values to 0. If there is only one predicted link, the values should be left as they are.
The resulting dataframe would look like this:
ID_1  ID_2  Predicted Link  Probability
1     0     1               0.9
1     1     0               0.5
1     2     0               0
2     1     0               0.8
2     5     0               0.8
3     1     0               0
3     2     1               0.5
I am grouping via pandas.groupby, and tried some variations with numpy.select and numpy.where, but without luck. Any help much appreciated!
For each ID_1, you want at most one row to keep its predicted link. Thus, grouping is a good start.
First, let's construct the data:
import pandas as pd
from io import StringIO
csvfile = StringIO(
"""ID_1\tID_2\tPredicted Link\tProbability
1\t0\t1\t0.9
1\t1\t1\t0.5
1\t2\t0\t0
2\t1\t1\t0.8
2\t5\t1\t0.8
3\t1\t0\t0
3\t2\t1\t0.5""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
We want one group per value of ID_1 and then to look for the row holding the maximum Probability within that group. Let's create a mask:
mask_max_proba = df.groupby("ID_1")["Probability"].transform(lambda x: x.eq(x.max()))
mask_max_proba
Out[196]:
0 True
1 False
2 False
3 True
4 True
5 False
6 True
Name: Probability, dtype: bool
Considering your rules, rows 0, 1, 2 and rows 5, 6 are fine (only one maximum for their ID_1 value), but rows 3 and 4 are not. Let's build a mask that combines these two conditions: True if the value is the maximum and if that maximum is unique.
To be more precise, for each ID_1, if a Probability value is duplicated then it cannot be a candidate for that maximum. We will therefore build a mask that excludes duplicated Probability values within each ID_1:
mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform(lambda x : len(x) == 1)
mask_unique
Out[284]:
0 True
1 True
2 True
3 False
4 False
5 True
6 True
Name: Probability, dtype: bool
Finally, let's combine our two masks:
df.loc[:, "Predicted Link"] = 1 * (mask_max_proba & mask_unique)
df
Out[285]:
ID_1 ID_2 Predicted Link Probability
0 1 0 1 0.9
1 1 1 0 0.5
2 1 2 0 0.0
3 2 1 0 0.8
4 2 5 0 0.8
5 3 1 0 0.0
6 3 2 1 0.5
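For reference, the same logic can be written a bit more compactly with a single "unique maximum" test. This is a sketch of my own, assuming the df constructed above, not part of the original answer:
grp = df.groupby("ID_1")["Probability"]
# True where the row holds its group's maximum Probability
is_max = df["Probability"].eq(grp.transform("max"))
# True where that maximum occurs exactly once in the group (no tie)
max_is_unique = grp.transform(lambda s: (s == s.max()).sum() == 1)
df["Predicted Link"] = (is_max & max_is_unique).astype(int)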
I am having trouble with Pandas.
I am trying to compare each value in a row to another value in the same row.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater I want to turn it into 1, and into 0 if it is lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method but this doesn't work.
It returns a pandas Series (attached below).
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
   A  B  CAC 40
0  0  2       9
1  1  3       9
2  2  4       1
3  3  5       2
4  4  7       2
With just a for loop:
import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
   A  B  CAC 40
0  0  0       9
1  0  0       9
2  1  1       1
3  1  1       2
4  1  1       2
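As an aside, a loop-free variant is possible; this is a sketch I'm adding (not part of the original answer), assuming the same df:
# Compare every other column to the CAC 40 column in one vectorized call
result = df.drop(columns="CAC 40").gt(df["CAC 40"], axis=0).astype(int)
result["CAC 40"] = df["CAC 40"]  # keep the reference column, as in the loop version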
I have a DataFrame that looks like this table:
index     x  y  value_1  cumsum_1  cumsum_2
    0   0.1  1       12        12         0
    1   1.2  1       10        12        10
    2  0.25  1        7        19        10
    3   1.0  2        3         0         3
    4  0.72  2        5         5         3
    5   1.5  2       10         5        13
So my aim is to calculate the cumulative sum of value_1. But there are two conditions that must be taken into account.
First: if the value of x is less than 1, the cumsum() is written to column cumsum_1; if x is greater than or equal to 1, to column cumsum_2.
Second: column y indicates groups (1, 2, 3, ...). When the value in y changes, the cumsum() operation starts all over again. I think the groupby() method would help.
Does somebody have any idea?
You can use .where() on the conditions x < 1 and x >= 1 to temporarily set the values of value_1 to 0 according to the condition, and then do a grouped cumsum, as follows.
The second condition is handled by .groupby(), while the first condition is handled by .where(), detailed below:
.where() keeps the column values where the condition is True and replaces them (with 0 in this case) where the condition is False. Thus, for the first condition, x < 1, value_1 keeps its values and feeds them to the subsequent cumsum step. For rows where x < 1 is False, value_1 is masked to 0, and passing these zeros to cumsum has the same effect as excluding the original values of value_1 from the accumulation into column cumsum_1.
The second line of code accumulates value_1 into column cumsum_2 with the opposite condition, x >= 1. In effect, these two lines allocate value_1 to cumsum_1 and cumsum_2 according to x < 1 and x >= 1, respectively.
(Thanks to @tdy for the suggestion to simplify the code.)
df['cumsum_1'] = df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()
df['cumsum_2'] = df['value_1'].where(df['x'] >= 1, 0).groupby(df['y']).cumsum()
Result:
print(df)
x y value_1 cumsum_1 cumsum_2
0 0.10 1 12 12 0
1 1.20 1 10 12 10
2 0.25 1 7 19 10
3 1.00 2 3 0 3
4 0.72 2 5 5 3
5 1.50 2 10 5 13
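To make the masking step above concrete, here is an illustration I'm adding (not part of the original answer) of what .where() produces before the grouped cumsum:
print(df['value_1'].where(df['x'] < 1, 0))
# 0    12
# 1     0
# 2     7
# 3     0
# 4     5
# 5     0
# Name: value_1, dtype: int64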
Here is another approach using a pivot:
(df.assign(ge1=df['x'].ge(1).map({True: 'cumsum_2', False: 'cumsum_1'}))
.pivot(columns='ge1', values='value_1').fillna(0).groupby(df['y']).cumsum()
.astype(int)
)
output:
ge1 cumsum_1 cumsum_2
0 12 0
1 12 10
2 19 10
3 0 3
4 5 3
5 5 13
full code:
df[['cumsum_1', 'cumsum_2']] = (df.assign(ge1=df['x'].ge(1).map({True: 'cumsum_2', False: 'cumsum_1'}))
.pivot(columns='ge1', values='value_1').fillna(0).groupby(df['y']).cumsum()
.astype(int)
)
(or use pd.concat to concatenate the result back to df)
output:
index x y value_1 cumsum_1 cumsum_2
0 0 0.10 1 12 12 0
1 1 1.20 1 10 12 10
2 2 0.25 1 7 19 10
3 3 1.00 2 3 0 3
4 4 0.72 2 5 5 3
5 5 1.50 2 10 5 13
Similar to the approaches above, but a little more chained.
df[['cumsum_1a', 'cumsum_2a']] = (df.
    assign(
        v1 = lambda temp: temp.x >= 1,
        v2 = lambda temp: ~ temp.v1 * temp.value_1,  # value_1 where x < 1, else 0
        v3 = lambda temp: temp.v1 * temp.value_1     # value_1 where x >= 1, else 0
    ).
    groupby('y')[['v2', 'v3']].
    cumsum()
)
I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, I will check whether there is another row with the same ProjektID, Jahr and Week where the freq is not 0. If this is true I want a new column "other" with value 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no approach, can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep="\s\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If ProjektID, Jahr and Week are duplicated and any of the freq values is larger than zero, then the rows that are duplicated (keep=False also captures the first duplicated row) and where freq is zero get their other value set to 1. Change any() to all() if you need all values to be larger than zero.
if (df.loc[df[['ProjektID', 'Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID', 'Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), ['other']] = 1
else:
    print("Other stays zero")
Output:
   MitarbeiterID  ProjektID  Jahr  Monat  Week   mean  freq  last  other
0            583      83224  2020      1     2  3.875     4     0      0
1            373      17364  2020      1     3  5.000     0     4      1
2            923      19234  2020      1     4  5.000     3     3      0
3            643      17364  2020      1     3  4.000     2     2      0
I think the best way to solve this is not to use pandas too much :-); converting things to sets and tuples should make it fast enough.
The idea is to make a set of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0, and then check, for all lines with freq == 0, whether their triple belongs to this set or not. In code, I'm creating a dummy dataset with:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1.values, which is not a pandas DataFrame but rather a numpy array, so each row in there can be converted to a tuple. This is necessary because dataframe rows, numpy arrays and lists are mutable objects and cannot be hashed in a set otherwise. Using a set instead of e.g. a list (which doesn't have this restriction) is for efficiency purposes.
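A tiny illustration of that hashing point (my addition, not from the original answer): a tuple can go into a set, while a list cannot:
s = set()
s.add((0, 0, 1))       # tuples are hashable, so this works
try:
    s.add([0, 0, 1])   # lists are not hashable
except TypeError as err:
    print(err)         # unhashable type: 'list'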
Next, we define a boolean variable which is True if a triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda x: tuple(x) in triplets, axis=1)
We are basically done; this is the extra column you want, except that we also need to require freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) is to get values 0 and 1, as you were asking, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
Looks like I am too late ...:
# Use (ProjektID, Jahr, Week) as the index so rows of the same group share the same index label
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
# For rows where freq == 0, set other to whether another row with the same
# (ProjektID, Jahr, Week) index has freq != 0
df.other.mask(df.freq == 0,
              df.freq[df.freq == 0].index.isin(df.freq[df.freq != 0].index),
              inplace=True)
df.other = df.other.astype('int')
df.reset_index(drop=False, inplace=True)
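For comparison, a shorter sketch of the same idea using groupby.transform; this is my assumption, not one of the answers above:
# Flag freq == 0 rows whose (ProjektID, Jahr, Week) group also contains a row with freq > 0
has_nonzero = df.groupby(['ProjektID', 'Jahr', 'Week'])['freq'].transform(lambda s: (s > 0).any())
df['other'] = ((df['freq'] == 0) & has_nonzero).astype(int)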
I have this dataframe
Id,ProductId,Product
1,100,a
1,100,x
1,100,NaN
2,150,NaN
3,150,NaN
4,100,a
4,100,x
4,100,NaN
Here I want to remove some of the rows which contain NaN, but not all of them.
The removal criteria are as follows.
I want to remove only those NaN rows whose Id already has a value in the Product column.
For example, here Id 1 already has values in the Product column and still contains NaN, so I want to remove that row.
But for Id 2, there is only NaN in the Product column, so I don't want to remove that one. Similarly for Id 3, there are only NaN values in the Product column and I want to keep that one too.
Final Output would be like this
Id,ProductId,Product
1,100,a
1,100,x
2,150,NaN
3,150,NaN
4,100,a
4,100,x
Don't use groupby if an alternative exists, because it is slow.
vals = df.loc[df['Product'].notnull(), 'Id'].unique()
df = df[~(df['Id'].isin(vals) & df['Product'].isnull())]
print (df)
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
Explanation:
First get all Id values with some non-missing Product values:
print (df.loc[df['Product'].notnull(), 'Id'].unique())
[1 4]
Then check these groups with missing values:
print (df['Id'].isin(vals) & df['Product'].isnull())
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
dtype: bool
Invert boolean mask:
print (~(df['Id'].isin(vals) & df['Product'].isnull()))
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 False
dtype: bool
And last filter by boolean indexing:
print (df[~(df['Id'].isin(vals) & df['Product'].isnull())])
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
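If you prefer, the same mask can be written in one line; this is my rewrite of the filter above, logically equivalent by De Morgan's law:
print(df[df['Product'].notnull() | ~df['Id'].isin(df.loc[df['Product'].notnull(), 'Id'])])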
You can group the dataframe by Id (set as the index here) and drop the NaN rows if the group has more than one element:
>>> df.set_index('Id').groupby(level='Id', group_keys=False
...                           ).apply(lambda x: x.dropna() if len(x) > 1 else x)
ProductId Product
Id
1 100 a
1 100 x
2 150 NaN
3 150 NaN
4 100 a
4 100 x
Calculate groups (Id) where values (Product) are all null, then remove required rows via Boolean indexing with loc accessor:
nulls = df.groupby('Id')['Product'].apply(lambda x: x.isnull().all())
nulls_idx = nulls[nulls].index
df = df.loc[~(~df['Id'].isin(nulls_idx) & df['Product'].isnull())]
print(df)
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
Use groupby + transform with 'count', and then boolean indexing using isnull on the Product column:
count = df.groupby('Id')['Product'].transform('count')
df = df[~(count.ne(0) & df.Product.isnull())]
print(df)
Id ProductId Product
0 1 100 a
1 1 100 x
3 2 150 NaN
4 3 150 NaN
5 4 100 a
6 4 100 x
In a pandas dataframe, how can I drop a random subset of rows that obey a condition?
In other words, if I have a Pandas dataframe with a Label column, I'd like to drop 50% (or some other percentage) of rows where Label == 1, but keep all of the rest:
Label  A    ->   Label  A
0      1         0      1
0      2         0      2
0      3         0      3
1      10        1      11
1      11        1      12
1      12
1      13
I'd love to know the simplest and most pythonic/panda-ish way of doing this!
Edit: This question provides part of an answer, but it only talks about dropping rows by index, disregarding the row values. I'd still like to know how to drop only from rows that are labeled a certain way.
Use the frac argument of sample; it keeps a random fraction of the rows:
df.sample(frac=.5)
If you define the fraction you want to drop in a variable n:
n = .5
df.sample(frac=1 - n)
To include the condition, use drop:
df.drop(df.query('Label == 1').sample(frac=.5).index)
Label A
0 0 1
1 0 2
2 0 3
4 1 11
6 1 13
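An equivalent way to express the same idea, selecting the rows to keep instead of dropping them; a sketch of my own, assuming the same df:
import pandas as pd

# Keep every Label != 1 row plus a random 50% of the Label == 1 rows
kept = pd.concat([df[df.Label.ne(1)], df[df.Label.eq(1)].sample(frac=.5)])
kept = kept.sort_index()  # optional: restore the original row order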
Using drop with sample; here sample(2) picks an exact number of rows to drop rather than a fraction:
df.drop(df[df.Label.eq(1)].sample(2).index)
Label A
0 0 1
1 0 2
2 0 3
3 1 10
5 1 12