Delete duplicate rows using pandas - python

I want to delete all records matching a condition.
import pandas as pd
import numpy as np
# Create a DataFrame
d = {'Name': ['Alisa','Bobby','jodha','jack','raghu','Cathrine',
              'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
     'Age': [26,24,23,22,23,24,26,24,22,23,24,24],
     'Score': [85,63,55,74,31,77,85,63,42,62,89,77]}
df = pd.DataFrame(d, columns=['Name','Age','Score'])
df
I want to remove the duplicate records for "Alisa", who is duplicated with Score = 85.
I have tried the code below, but it still displays "Alisa":
df1 = df[df['Score']==85]
df.drop_duplicates(['Name'])

If you want to drop all duplicates where 'Score' equals 85, you can use the following solution:
df1 = df[df['Score'] == 85].drop_duplicates(keep='last')
df.drop(df1.index, inplace=True)
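An equivalent single-step approach, sketched on the question's own data, builds a boolean mask of the rows to drop with duplicated instead of a second DataFrame:

```python
import pandas as pd

d = {'Name': ['Alisa','Bobby','jodha','jack','raghu','Cathrine',
              'Alisa','Bobby','kumar','Alisa','Alex','Cathrine'],
     'Age': [26,24,23,22,23,24,26,24,22,23,24,24],
     'Score': [85,63,55,74,31,77,85,63,42,62,89,77]}
df = pd.DataFrame(d, columns=['Name','Age','Score'])

# Mark later full-row duplicates that also have Score == 85, then drop them;
# the first "Alisa" row with Score 85 is kept.
mask = (df['Score'] == 85) & df.duplicated(keep='first')
result = df[~mask]
```

This avoids mutating df with inplace=True, which can be convenient when the original frame is still needed.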

DataFrame returns empty after .update()

I am trying to create a new DataFrame containing a calculation derived from an original DataFrame.
To that end, I run a for loop with the calculation for each column, but I still get the empty DataFrame and I don't see the source of the error.
May I ask for some help here?
import yfinance as yf
import pandas as pd
df = yf.download(["YPFD.BA", "GGAL.BA"], period='6mo')
df2 = pd.DataFrame()
for i in ["YPFD.BA", "GGAL.BA"]:
    df2.update(df["Volume"][i] * df["Close"][i])
df2
I expected a new DataFrame that keeps the original index but holds the calculation derived from the original DataFrame.
DataFrame.update only overwrites values at row/column labels that already exist in the caller; since df2 starts empty, there is nothing to update. Assign new columns instead. I think this is what you are looking to do:
import yfinance as yf
import pandas as pd
df = yf.download(["YPFD.BA", "GGAL.BA"], period='6mo')
df2 = pd.DataFrame()
for i in ["YPFD.BA", "GGAL.BA"]:
    df2[i] = df["Volume"][i] * df["Close"][i]
df2
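The failure can be reproduced without yfinance. The sketch below (with a made-up source frame and a hypothetical turnover column name) shows that update on an empty DataFrame is a silent no-op, while column assignment creates the column:

```python
import pandas as pd

# Stand-in for the downloaded data: two aligned numeric columns
src = pd.DataFrame({"Volume": [10, 20], "Close": [1.5, 2.5]})

# update() only writes to labels that already exist in the caller,
# so on an empty DataFrame it changes nothing.
empty = pd.DataFrame()
empty.update(src["Volume"] * src["Close"])

# Column assignment, by contrast, creates the column and its index.
out = pd.DataFrame()
out["turnover"] = src["Volume"] * src["Close"]
```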

How to assign new column to existing DataFrame in pandas

I'm new to pandas. I'm trying to add a new column to my existing DataFrame, but it's not getting assigned and I don't know why. Can anyone explain what I'm missing? This is what I tried:
import pandas as pd
df = pd.DataFrame(data={"test": ["mkt1","mkt2","mkt3"],
                        "test2": ["cty1","cty2","cty3"]})
print("Before",df.columns)
df.assign(test3="Hello")
print("After",df.columns)
Output
Before Index(['test', 'test2'], dtype='object')
After Index(['test', 'test2'], dtype='object')
The pandas assign method returns a new DataFrame with the added column; it does not modify the original in place.
import pandas as pd
df = pd.DataFrame(data={"test": ["mkt1","mkt2","mkt3"],
                        "test2": ["cty1","cty2","cty3"]})
print("Before",df.columns)
df = df.assign(test3="Hello") # <--- Note the variable reassignment
print("After",df.columns)
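If you do want in-place behavior, plain column assignment is the usual alternative to assign, as this sketch shows:

```python
import pandas as pd

df = pd.DataFrame({"test": ["mkt1", "mkt2", "mkt3"],
                   "test2": ["cty1", "cty2", "cty3"]})

# Direct column assignment mutates df itself; no reassignment needed
df["test3"] = "Hello"
```

assign is still handy when you want method chaining; direct assignment is simpler for a one-off column.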

Select python data-frame columns with similar names

I have a data frame named df1 like this:
as_id TCGA_AF_2687 TCGA_AF_2689_Norm TCGA_AF_2690 TCGA_AF_2691_Norm
31 1 5 9 2
I want to select all the columns ending with "Norm". I have tried the code below:
import os
import pandas as pd

print(os.getcwd())
os.chdir('E:/task')
df1 = pd.read_table('haha.txt')
Norms = []
for s in df1.columns:
    if s.endswith('Norm'):
        Norms.append(s)
print(Norms)
but I only get a list of names. What can I do to select the columns themselves, including their values, rather than just the column names? I know it may be a silly question, but I am a beginner and really need some help. Thank you for your kindness and your time.
df1[Norms] will get the actual columns from df1.
In fact, the whole code can be simplified to:
import os
import pandas as pd
os.chdir('E:/task')
df1 = pd.read_table('haha.txt')
norm_df = df1[[column for column in df1.columns if column.endswith('Norm')]]
One can also use the built-in filter higher-order function:
newdf = df1[list(filter(lambda x: x.endswith("Norm"), df1.columns))]
print(newdf)
Output:
TCGA_AF_2689_Norm TCGA_AF_2691_Norm
0 5 2
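For completeness, pandas also has a DataFrame.filter method that does the label matching itself via a regular expression; a sketch on the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({"as_id": [31],
                    "TCGA_AF_2687": [1],
                    "TCGA_AF_2689_Norm": [5],
                    "TCGA_AF_2690": [9],
                    "TCGA_AF_2691_Norm": [2]})

# filter(regex=...) keeps only columns whose labels match the pattern;
# "Norm$" anchors the match to the end of the label.
norm_df = df1.filter(regex=r"Norm$")
```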

pandas sample based on criteria

I would like to use the pandas sample function with a criterion, without grouping or filtering the data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df.sample(n=100))
This samples 100 rows, but what if I want to sample 50 rows where df['a'] contains 0 and 50 rows where it contains 1?
You can use the == operator to make a list* of boolean values, and when that list is passed to the indexer ([]) it filters the rows. You can then use n=50 to draw a sample of 50 rows.
New code
df[df['a']==1].sample(n=50)
Full code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df[df['a']==1].sample(n=50))
*"List" isn't literally a Python list in this context; technically it is a boolean Series, aligned to the DataFrame's index, that maps each row to True or False.
Other DataFrame sampling variants
If you want to sample all 50 where a is 1 or 0:
print(df[(df['a']==1) | (df['a']==0)].sample(n=50))
And if you want to sample 50 of each:
df1 = df[df['a']==1].sample(n=50)
df0 = df[df['a']==0].sample(n=50)
print(pd.concat([df1,df0]))
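The per-group variant can also be written without the two manual samples and the concat, assuming pandas 1.1 or newer where GroupBy.sample is available:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(low=0, high=5, size=(10000, 2)),
                  columns=["a", "b"])

# groupby(...).sample draws n rows from each group in one call
sampled = df[df["a"].isin([0, 1])].groupby("a").sample(n=50)
```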

Pandas: Fill new column by condition row-wise

import pandas as pd
import numpy as np
df = pd.DataFrame([np.random.rand(100),100*[0.1],100*[0.3]]).T
df.columns = ["value","lower","upper"]
df.head()
How can I create a new column indicating whether value is between lower and upper?
You can use between for this purpose.
df['new_col'] = df['value'].between(df['lower'], df['upper'])
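Putting it together on the question's frame; note that between is inclusive on both bounds by default (inclusive='both' in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([np.random.rand(100), 100 * [0.1], 100 * [0.3]]).T
df.columns = ["value", "lower", "upper"]

# True where lower <= value <= upper, elementwise per row
df["new_col"] = df["value"].between(df["lower"], df["upper"])
```

Pass inclusive='neither' (or 'left'/'right') if the bounds should be excluded.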
