Remove rows from pandas dataframe with condition - python

I have a dataframe that looks like this:
import pandas as pd
### create toy data set
data = [[1111,'10/1/2021',21,123],
        [1111,'10/1/2021',-21,123],
        [1111,'10/1/2021',21,123],
        [2222,'10/2/2021',15,234],
        [2222,'10/2/2021',15,234],
        [3333,'10/3/2021',15,234],
        [3333,'10/3/2021',15,234]]
df = pd.DataFrame(data,columns = ['Individual','date','number','cc'])
What I want to do is remove rows where Individual, date, and cc are the same, but number is a negative value in one case and a positive in the other case. For example, in the first three rows, I would remove rows 1 and 2 (because 21 and -21 values are equal in absolute terms), but I don't want to remove row 3 (because I have already accounted for the negative value in row 2 by eliminating row 1). Also, I don't want to remove duplicated values if the corresponding number values are positive. I have tried a variety of duplicated() approaches, but just can't get it right.
Expected results would be:
Individual date number cc
0 1111 10/1/2021 21 123
1 2222 10/2/2021 15 234
2 2222 10/2/2021 15 234
3 3333 10/3/2021 15 234
4 3333 10/3/2021 15 234
Thus, the first two rows are removed, but not the third row, since the negative value is already accounted for.
Any assistance would be appreciated. I am trying to do this without a loop, but it may be unavoidable. It seems similar to this question, but I can't figure out how to make it work in my case, as I am trying to avoid loops.

This may not match your expected output exactly, but you could try the below. Create a separate df called n that contains the rows with a negative 'number' and join it to the original with indicator=True.
n = df.loc[df.number.le(0)].drop('number',axis=1)
df = pd.merge(df,n,'left',indicator=True)
>>> df
Individual date number cc _merge
0 1111 10/1/2021 21 123 both
1 1111 10/1/2021 -21 123 both
2 1111 10/1/2021 21 123 both
3 2222 10/2/2021 15 234 left_only
4 2222 10/2/2021 15 234 left_only
5 3333 10/3/2021 15 234 left_only
6 3333 10/3/2021 15 234 left_only
This lets us identify the Individual/date/cc groups that contain a negative 'number' row.
Then you can locate the rows with 'both' in _merge, and only use those to perform a groupby.head(2), concatenating that with the rest of the df:
out = pd.concat([df.loc[df._merge.eq('both')].groupby(['Individual','date','cc']).head(2),
                 df.loc[df._merge.ne('both')]]).drop('_merge', axis=1)
Which prints:
Individual date number cc
0 1111 10/1/2021 21 123
1 1111 10/1/2021 -21 123
3 2222 10/2/2021 15 234
4 2222 10/2/2021 15 234
5 3333 10/3/2021 15 234
6 3333 10/3/2021 15 234
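If you need to reproduce the expected output above exactly (dropping both halves of every positive/negative pair and keeping any leftover positives), here is a minimal sketch of one way to do it, starting from the original df before the merge. It assumes each negative should cancel exactly one positive of the same absolute value within an Individual/date/cc group; the abs_number helper column and the pairing logic are illustrative additions, not part of the answer above.
keys = ['Individual', 'date', 'cc']
df['abs_number'] = df['number'].abs()
neg = df['number'].lt(0)
# count positives and negatives per (keys, magnitude) group
grp = df.groupby(keys + ['abs_number'])
n_neg = grp['number'].transform(lambda s: s.lt(0).sum())
n_pos = grp['number'].transform(lambda s: s.gt(0).sum())
# rank rows within their own sign so each negative cancels at most one positive
rank_in_sign = df.groupby(keys + ['abs_number', neg]).cumcount()
# drop a row only while unmatched rows of the opposite sign remain
drop = rank_in_sign < n_neg.where(~neg, n_pos)
out = df.loc[~drop].drop(columns='abs_number')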


Go through every row in a dataframe, search for these values in a second dataframe, and if they match, get a value from df1 and another value from df2

I have two dataframes:
Researchers: a list of all researchers and their id_number
Samples: a list of samples and all researchers related to them; there may be several researchers in the same cell.
I want to go through every row in the researchers table and check whether each researcher occurs in any row of the samples table. If they do, I want to get their id from the researchers table and the sample number from the samples table.
Table researcher
id_researcher full_name
0 1 Jack Sparrow
1 2 Demi moore
2 3 Bickman
3 4 Charles Darwin
4 5 H. Haffer
Table samples
sample_number collector
230 INPA A 231 Haffer
231 INPA A 232 Charles Darwin
232 INPA A 233 NaN
233 INPA A 234 NaN
234 INPA A 235 Jack Sparrow; Demi Moore ; Bickman
Output I want:
id_researcher num_samples
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
I was able to do it with a loop in regular Python with the following code, but it is extremely slow and quite long. Does anyone know a faster and simpler way, perhaps with pandas apply?
id_researcher = []
id_num_sample = []
for c in range(len(data_researcher)):
    for a in range(len(data_samples)):
        if not pd.isna(data_samples['collector'].iloc[a]) and data_researcher['full_name'].iloc[c] in data_samples['collector'].iloc[a]:
            id_researcher.append(data_researcher['id_researcher'].iloc[c])
            id_num_sample.append(data_samples['sample_number'].iloc[a])
data_researcher_sample = (pd.DataFrame.from_dict({'id_researcher': id_researcher, 'num_sample': id_num_sample})
                          .sort_values(by='num_sample'))
You have a few data cleaning jobs to do, such as 'moore' being lowercase in one table, 'Haffer' having a first-name initial in one table and none in the other, etc. After normalizing your two dataframes, you can split and explode the collector column and use merge:
samples['collector'] = samples['collector'].str.split(';')
samples = samples.explode('collector')
samples['collector'] = samples['collector'].str.strip()
out = (researchers.merge(samples, right_on='collector', left_on='full_name', how='left')
       [['id_researcher','sample_number']]
       .sort_values(by='sample_number')
       .reset_index(drop=True))
Output:
id_researcher sample_number
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
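The normalization step mentioned above is not shown. As a minimal sketch, assuming lowercasing and stripping the join keys is enough for your data (it would fix 'Demi moore' vs 'Demi Moore', but not 'Haffer' vs 'H. Haffer', which still needs manual cleaning), run this before the merge:
researchers['full_name'] = researchers['full_name'].str.strip().str.lower()
samples['collector'] = samples['collector'].str.strip().str.lower()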

I want to create a new column territory based on the city column

Data Frame :
city Temperature
0 Chandigarh 15
1 Delhi 22
2 Kanpur 20
3 Chennai 26
4 Manali -2
0 Bengalaru 24
1 Coimbatore 35
2 Srirangam 36
3 Pondicherry 39
I need to create another column in the data frame, which contains a boolean value for each city to indicate whether it's a union territory or not. Chandigarh, Pondicherry and Delhi are the only three union territories here.
I have written the below code:
import numpy as np
conditions = [df3['city'] == 'Chandigarh',df3['city'] == 'Pondicherry',df3['city'] == 'Delhi']
values =[1,1,1]
df3['territory'] = np.select(conditions, values)
Is there any easier or efficient way that I can write?
You can use isin:
union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]
df3["territory"] = df3["city"].isin(union_terrs).astype(int)
which checks each entry in the city column: if it is in union_terrs it gives True, otherwise False. The astype converts True/False to 1/0, to get
city Temperature territory
0 Chandigarh 15 1
1 Delhi 22 1
2 Kanpur 20 0
3 Chennai 26 0
4 Manali -2 0
0 Bengalaru 24 0
1 Coimbatore 35 0
2 Srirangam 36 0
3 Pondicherry 39 1
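If a boolean column is acceptable instead of 1/0, you can simply drop the astype(int):
df3['territory'] = df3['city'].isin(union_terrs)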

Create A New DataFrame Based on Conditions of Multiple DataFrames

I have two datasets: one with cancer positive patients (df_pos), and the other with the cancer negative patients (df_neg).
df_pos
id
0 123
1 124
2 125
df_neg
id
0 234
1 235
2 236
I want to compile these datasets into one, with an extra column indicating whether or not the patient has cancer (yes or no).
Here is my desired outcome:
id outcome
0 123 yes
1 124 yes
2 125 yes
3 234 no
4 235 no
5 236 no
What would be a smarter approach to compile these?
Any suggestions would be appreciated. Thanks!
Use pandas.DataFrame.append and pandas.DataFrame.assign:
>>> df_pos.assign(outcome='Yes').append(df_neg.assign(outcome='No'), ignore_index=True)
id outcome
0 123 Yes
1 124 Yes
2 125 Yes
3 234 No
4 235 No
5 236 No
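Note that DataFrame.append is deprecated in recent pandas releases (and removed in pandas 2.0), so the same result can be written with pd.concat instead:
pd.concat([df_pos.assign(outcome='Yes'), df_neg.assign(outcome='No')], ignore_index=True)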
Alternatively, mark the outcome with a boolean column and concatenate:
df_pos['outcome'] = True
df_neg['outcome'] = False
df = pd.concat([df_pos, df_neg]).reset_index(drop=True)

Compare columns and generate duplicate rows in MySQL (or) Python pandas

I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users with two phone numbers. I want to compare the two phone-number columns and generate a new row if the data differs between them; otherwise retain the row and make no changes.
The processed data would look like the second table.
Is there any way to achieve this in MySQL?
I also don't mind doing the transformation in a dataframe and then loading it into a table.
id username primary_phone landline
1 John 222 222
2 Michael 123 121
3 lucy 456 456
4 Anderson 900 901
Thanks!!!
Use DataFrame.melt, drop the variable column, and de-duplicate with DataFrame.drop_duplicates:
df = (df.melt(['id','username'], value_name='phone')
        .drop('variable', axis=1)
        .drop_duplicates()
        .sort_values('id'))
print (df)
id username phone
0 1 John 222
1 2 Michael 123
5 2 Michael 121
2 3 lucy 456
3 4 Anderson 900
7 4 Anderson 901
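If you do the reshaping in pandas and then want to load the result back into a MySQL table, one possible sketch uses DataFrame.to_sql; the connection string, credentials and table name below are placeholders, and this assumes SQLAlchemy plus a MySQL driver such as pymysql are installed:
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')  # placeholder credentials/database
df.to_sql('user_phones', con=engine, if_exists='replace', index=False)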

pandas nlargest lost one column

I have this dataset:
Id query count
001 abc 20
001 bcd 30
001 ccd 100
002 ace 13
002 ahhd 30
002 ahe 28
I want to find the Top2 query for each Id, based on the count. So I want to see:
Id query count
001 ccd 100
001 bcd 30
002 ahhd 30
002 ahe 28
I tried this line of code: df.groupby('Id')['count'].nlargest(2), but the "query" column is lost in the result, which is not what I wanted. How do I keep query in my result?
Id count
001 100
001 30
002 30
002 28
Use set_index with the missing column(s):
df = df.set_index('query').groupby('Id')['count'].nlargest(2).reset_index()
print (df)
Id query count
0 001 ccd 100
1 001 bcd 30
2 002 ahhd 30
3 002 ahe 28
I use a groupby and apply the method pd.DataFrame.nlargest. This differs from pd.Series.nlargest in that I have to specify a set of columns to consider when choosing my n rows. This solution keeps the original index values that are attached to the rows, if that is at all important to the OP or end user.
df.groupby('Id', group_keys=False).apply(
    pd.DataFrame.nlargest, n=2, columns='count')
Id query count
2 1 ccd 100
1 1 bcd 30
4 2 ahhd 30
5 2 ahe 28
You could do this with groupby still:
df.sort_values('count', ascending = False).groupby('Id').head(2)
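If you want the result ordered by Id again afterwards, you can chain one more sort, e.g.:
df.sort_values('count', ascending=False).groupby('Id').head(2).sort_values(['Id', 'count'], ascending=[True, False])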
