How to apply a command to multiple column elements? - python

I have the tables below and would like to apply one command to compare and eliminate duplicate values in rows n and n+1 in multiple dataframes (df1, df2).
Command suggestion: .diff().ne(0)
How can I apply this command only to the elements of columns A, C and D, using def, lambda or apply?
df1:
A   B
22  33
22   4
3   55
1   55
df2:
C    D
5    2.3
45   33
7    33
7    11
The expected output is:
df1:
A    B
22   33
NaN  4
3    55
1    55
df2:
C    D
5    2.3
45   33
7    NaN
NaN  11
The other desired option would be to delete the duplicated rows, keeping the first value:
df1:
A   B
22  33
(row deleted)
3   55
(row deleted)
df2:
C   D
5   2.3
45  33
(row deleted)
(row deleted)

Based on this answer, you can create a mask for a single column in your dataframe (here for example for column A) with
mask1 = df['A'].shift() == df['A']
Since this shows True if there was a duplicate, you need to slice the DataFrame with the negation of the mask
df = df[~mask1]
To do this for multiple columns, make a mask for each column and use NumPy's logical_or to combine the masks. Then slice df with the final mask.
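For instance, a minimal sketch of that multi-column mask (the helper name drop_consecutive_duplicates is only illustrative):
import numpy as np

def drop_consecutive_duplicates(df, cols):
    # one duplicate-mask per column, OR-ed together with NumPy,
    # then the frame is sliced with the negated combined mask
    masks = [df[c].shift() == df[c] for c in cols]  # True where a value repeats from row n to n+1
    combined = np.logical_or.reduce(masks)          # duplicate in ANY of the listed columns
    return df[~combined]

# e.g. df1 = drop_consecutive_duplicates(df1, ['A'])
#      df2 = drop_consecutive_duplicates(df2, ['C', 'D'])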

With your suggested command .diff().ne(0) (or .diff().eq(0)):
Option 1: set NaN to duplicate values
# For one column (assumes import numpy as np)
df1.loc[df1['A'].diff().eq(0), 'A'] = np.nan
print(df1)
A B
0 22.0 33
1 NaN 4
2 3.0 55
3 1.0 55
# For multiple columns
df2 = df2.apply(lambda x: x[x.diff().ne(0)])
print(df2)
C D
0 5.0 2.3
1 45.0 33.0
2 7.0 NaN
3 NaN 11.0
Option 2: delete rows
>>> df1[df1.diff().ne(0).all(axis=1)]
A B
0 22 33
2 3 55
>>> df2[df2.diff().ne(0).all(axis=1)]
C D
0 5 2.3
1 45 33.0
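To limit the operation to specific columns (A, C and D in the question) rather than every column, one option is a small helper plus a column list; a hedged sketch (the name mask_consecutive_duplicates is only illustrative):
import numpy as np

def mask_consecutive_duplicates(df, cols):
    # set NaN where a value equals the value in the previous row, only in the listed columns
    out = df.copy()
    for c in cols:
        out.loc[out[c].diff().eq(0), c] = np.nan
    return out

df1 = mask_consecutive_duplicates(df1, ['A'])
df2 = mask_consecutive_duplicates(df2, ['C', 'D'])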

Related

How to slice different rows in multiple columns in a dataframe, according to indexes in another dataframe? [duplicate]

I'm frequently using pandas for merge (join) by using a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id, B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile, the only way I found in pandas that avoids loops is to create a dummy column in both tables, join on it (equivalent to a cross join) and then filter out the unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
Another solution I had is to apply, for each value x of A, a search on B using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B
import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The "easiest" way is to use NumPy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while, at the same time, A_value is less than or equal to B_high. Note that broadcasting materializes a len(A) x len(B) boolean matrix, so with two frames of roughly 2M rows each this can be very memory-hungry.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1).append(
A[~np.in1d(np.arange(len(A)), np.unique(i))],
ignore_index=True, sort=False
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
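Note that DataFrame.append was later deprecated and removed from pandas; a roughly equivalent sketch of the same left-join-style result using pd.concat (reusing i, j, A and B from above) would be:
import numpy as np
import pandas as pd

matched = pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)

# rows of A that matched nothing in B, stacked under the matched rows
unmatched = A[~np.isin(np.arange(len(A)), np.unique(i))]
result = pd.concat([matched, unmatched], ignore_index=True, sort=False)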
I'm not sure whether this is more efficient, but you can use SQL directly with pandas (via the sqlite3 module, for instance), inspired by this question:
import sqlite3

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)
You can adapt the query as needed for your application.
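For the range join from the question itself, the same sqlite3 route could look roughly like this (an untested sketch reusing the A and B dataframes from the setup above):
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)

# inner join where A_value falls in [B_low, B_high]
qry = "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high"
result = pd.read_sql_query(qry, conn)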
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects, called pandasql. Its documentation explicitly states that joins are supported. This might at least be easier to read, since SQL syntax is very readable.
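A minimal sketch with pandasql, assuming its sqldf(query, env) entry point and the A and B dataframes from the setup above:
# pip install pandasql
from pandasql import sqldf

qry = """
SELECT *
FROM A
JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high
"""
result = sqldf(qry, locals())  # A and B are looked up in the supplied namespace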
conditional_join from pyjanitor may be helpful for abstraction/convenience:
# pip install pyjanitor
import pandas as pd
import janitor
inner join
A.conditional_join(B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
left join
A.conditional_join(
B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<='),
how = 'left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Let's take a simple example:
df=pd.DataFrame([2,3,4,5,6],columns=['A'])
returns
A
0 2
1 3
2 4
3 5
4 6
Now let's define a second dataframe:
df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]
results in
B_low B_high
0 1 2
1 6 8
2 2 4
3 3 6
4 5 6
Here we go; we want the output to be index 3 with an A value of 5. Note that this approach aligns the two frames by index, so it compares row i of df with row i of df2 rather than performing a true join.
df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()
results in
A
3 5.0
I know this is an old question, but for newcomers there is now the pandas.merge_asof function, which performs a join based on the closest match.
In case you want to do a merge so that a column of one DataFrame (df_right) falls between two columns of another DataFrame (df_left), you can do the following:
df_left = pd.DataFrame({
"time_from": [1, 4, 10, 21],
"time_to": [3, 7, 15, 27]
})
df_right = pd.DataFrame({
"time": [2, 6, 16, 25]
})
df_left
time_from time_to
0 1 3
1 4 7
2 10 15
3 21 27
df_right
time
0 2
1 6
2 16
3 25
First, find matches of the right DataFrame that are closest but largest than the left boundary (time_from) of the left DataFrame:
merged = pd.merge_asof(
left=df_left,
right=df_right.rename(columns={"time": "candidate_match_1"}),
left_on="time_from",
right_on="candidate_match_1",
direction="forward"
)
merged
time_from time_to candidate_match_1
0 1 3 2
1 4 7 6
2 10 15 16
3 21 27 25
As you can see the candidate match in index 2 is wrongly matched, as 16 is not between 10 and 15.
Then, find matches of the right DataFrame that are closest but smaller than the right boundary (time_to) of the left DataFrame:
merged = pd.merge_asof(
left=merged,
right=df_right.rename(columns={"time": "candidate_match_2"}),
left_on="time_to",
right_on="candidate_match_2",
direction="backward"
)
merged
time_from time_to candidate_match_1 candidate_match_2
0 1 3 2 2
1 4 7 6 6
2 10 15 16 6
3 21 27 25 25
Finally, keep the matches where the candidate matches are the same, meaning that the value of the right DataFrame lies between the values of the two columns of the left DataFrame:
merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]
merged
time_from time_to candidate_match_1 candidate_match_2 match
0 1 3 2 2 2
1 4 7 6 6 6
2 10 15 16 6 None
3 21 27 25 25 25
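If only the rows whose value truly lies inside the interval are wanted (an inner-join-like result), one hedged follow-up is to keep just the rows where the two candidates agree:
# keep the rows where both asof candidates agree, i.e. the value lies in [time_from, time_to]
inner = merged[merged["candidate_match_1"] == merged["candidate_match_2"]]
inner = (inner.drop(columns=["candidate_match_2"])
              .rename(columns={"candidate_match_1": "time"}))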

overwrite and append pandas data frames on column value

I have a base dataframe df1:
id name count
1 a 10
2 b 20
3 c 30
4 d 40
5 e 50
Here I have a new dataframe with updates df2:
id name count
1 a 11
2 b 22
3 f 30
4 g 40
I want to overwrite and append these two dataframes on column name.
For example: a and b are present in df1 but also in df2 with updated count values, so we update df1 with the new counts for a and b. Since f and g are not present in df1, we append them.
Here is an example after the desired operation:
id name count
1 a 11
2 b 22
3 c 30
4 d 40
5 e 50
3 f 30
4 g 40
I tried df.merge and pd.concat, but nothing seems to give me the output that I require. Can anyone help?
Using combine_first
df2=df2.set_index(['id','name'])
df2.combine_first(df1.set_index(['id','name'])).reset_index()
Out[198]:
id name count
0 1 a 11.0
1 2 b 22.0
2 3 c 30.0
3 3 f 30.0
4 4 d 40.0
5 4 g 40.0
6 5 e 50.0
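combine_first upcasts count to float because of the index alignment; if the original integer dtype matters, a small hedged follow-up is to cast it back (out is just an illustrative name):
out = df2.combine_first(df1.set_index(['id', 'name'])).reset_index()
out['count'] = out['count'].astype(int)  # restore the integer dtype lost during alignment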

Pandas: compare one column's values to another dataframe's column, find matching rows

I have a database from which I am bringing in a SQL table of events and alarms (df1), and I have a txt file of alarm codes and properties (df2) to watch for. I want to cross-check each value of one column of df2 against an entire column of df1, and output the full matching rows into another dataframe df3.
df1:
     A   B  C  D
0  100  20  1  1
1  101  30  1  1
2  102  21  2  3
3  103  15  2  3
4  104  40  2  3
df2:
    0  1    2    3    4
0  21  2    2    3    3
1  40  0  NaN  NaN  NaN
Output into df3 the entire rows of df1 whose column B matches any of the values in df2's column 0.
df3:
     A   B  C  D
0  102  21  2  3
1  104  40  2  3
I was able to get single results using:
df1[df1['B'] == df2.iloc[0,0]]
But I need something that will do this on a larger scale.
Method 1: merge
Use merge on B and 0, then select only the df1 columns:
df1.merge(df2, left_on='B', right_on='0')[df1.columns]
A B C D
0 102 21 2 3
1 104 40 2 3
Method 2: loc
Alternatively, use loc to find the rows in df1 where B has a match in df2's column 0, using .isin:
df1.loc[df1.B.isin(df2['0'])]
A B C D
2 102 21 2 3
4 104 40 2 3
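One caveat, sketched below as an assumption: if df2 was read from the txt file without a header, its column labels are the integers 0-4 rather than the strings '0'-'4', in which case the lookups use df2[0] instead of df2['0']:
# if df2's columns are integer-labelled (e.g. read with header=None)
df3 = df1.loc[df1['B'].isin(df2[0])].reset_index(drop=True)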

Setting with enlargement - updating transaction DF

Looking for ways to achieve the following updates on a dataframe:
dfb is the base dataframe that I want to update with dft transactions.
Any common index rows should be updated with values from dft.
Indexes only in dft should be appended to dfb.
Looking at the documentation, setting with enlargement looked perfect, but then I realized it only works with a single row. Is it possible to use setting with enlargement for this update, or is there another method you could recommend?
dfb = pd.DataFrame(data={'A': [11,22,33], 'B': [44,55,66]}, index=[1,2,3])
dfb
Out[70]:
A B
1 11 44
2 22 55
3 33 66
dft = pd.DataFrame(data={'A': [0,2,3], 'B': [4,5,6]}, index=[3,4,5])
dft
Out[71]:
A B
3 0 4
4 2 5
5 3 6
# Updated dfb should look like this:
dfb
Out[75]:
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
You can use combine_first, then convert the float columns back to int with astype:
dft = dft.combine_first(dfb).astype(int)
print (dft)
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
Another solution: find the indexes present in both DataFrames with Index.intersection, drop them from the first DataFrame dfb, and then use concat:
idx = dfb.index.intersection(dft.index)
print (idx)
Int64Index([3], dtype='int64')
dfb = dfb.drop(idx)
print (dfb)
A B
1 11 44
2 22 55
print (pd.concat([dfb, dft]))
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
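Another possible sketch, using DataFrame.update for the shared labels and pd.concat for the genuinely new ones:
dfb.update(dft)                                      # overwrite values on the common label (index 3)
new_rows = dft.loc[dft.index.difference(dfb.index)]  # labels present only in dft (4 and 5)
dfb = pd.concat([dfb, new_rows]).astype(int)         # astype in case update upcast ints to float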

Iterate through the rows of a dataframe and reassign minimum values by group

I am working with a dataframe that looks like this.
id time diff
0 0 34 nan
1 0 36 2
2 1 43 7
3 1 55 12
4 1 59 4
5 2 2 -57
6 2 10 8
What is an efficient way to find the minimum value of 'time' for each id and then set 'diff' to NaN at those minimums? I am looking for a solution that results in:
id time diff
0 0 34 nan
1 0 36 2
2 1 43 nan
3 1 55 12
4 1 59 4
5 2 2 nan
6 2 10 8
Use groupby('id') with idxmin to find the locations of the minimum values of 'time', then use loc to assign np.nan:
import numpy as np

df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan
df
You can group time by id and compute a boolean vector that is True where the time is the minimum within its group and False otherwise, then use that vector to assign NaN to the corresponding rows:
import numpy as np
import pandas as pd
df.loc[df.groupby('id')['time'].apply(lambda g: g == min(g)), "diff"] = np.nan
df
# id time diff
#0 0 34 NaN
#1 0 36 2.0
#2 1 43 NaN
#3 1 55 12.0
#4 1 59 4.0
#5 2 2 NaN
#6 2 10 8.0
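An equivalent sketch using transform, which avoids the Python-level lambda (and, unlike idxmin, marks every row that ties for the minimum):
import numpy as np

# True on the rows holding each id's minimum time, then blank out 'diff' there
is_min = df['time'].eq(df.groupby('id')['time'].transform('min'))
df.loc[is_min, 'diff'] = np.nan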
