Comparing two pandas DataFrames in Python

I have two dataframes and I need to compare them: if a row appears in both dataframes, I should find that row. I have written code, but it does not run correctly.
def color(row):
    if (df3_2['dt'].values == df_condition['dt'].values) & \
       (df3_2['platform'].values == df_condition['platform'].values) & \
       (df3_2['network'].values == df_condition['network'].values) & \
       (df3_2['currency'].values == df_condition['currency'].values) & \
       (df3_2['cost'].values == df_condition['cost'].values):
        return "asdas"
    else:
        return "wewq"

df3_2.loc[:, 'demo'] = df3_2.apply(color, axis=1)
The code wrote "asdas" for all rows in the dataframe, even though the dataframes have different columns.
Thanks
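One possible direction (a sketch only, assuming df3_2 and df_condition are the frames from the question and both contain the columns listed in color()): instead of comparing whole columns inside apply, merge with indicator=True, which marks the rows of df3_2 that also exist in df_condition.

import numpy as np
import pandas as pd

# Columns used by color() in the question.
cols = ['dt', 'platform', 'network', 'currency', 'cost']

# Left-merge with indicator: '_merge' is 'both' for rows that also appear in df_condition.
merged = df3_2.merge(df_condition[cols].drop_duplicates(),
                     on=cols, how='left', indicator=True)

# Same labels the question uses for matching / non-matching rows.
df3_2['demo'] = np.where(merged['_merge'].eq('both'), 'asdas', 'wewq')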

Related

How do I compare rows in a pandas dataframe with a moving window?

I would like to compare a group of values with a moving window. To explain it better: I have a column in a pandas dataframe and I want to test whether 5 consecutive rows are the same, but I want to do this check in a moving window. That is, I want to compare rows 0 to 5, then rows 1 to 6, and so on, in order to make certain changes. I would like to know a better way to do this than my current approach, because I used the iterrows method.
my method:
for idx, row in df[2:-2].iterrows():
    previous2 = df.loc[idx - 2, 'speed_limit']
    previous1 = df.loc[idx - 1, 'speed_limit']
    now = row['speed_limit']
    next1 = df.loc[idx + 1, 'speed_limit']
    next2 = df.loc[idx + 2, 'speed_limit']
    if (next1 == next2) & (previous1 == previous2) & (previous1 == next1) & (now != previous1):
        df.at[idx, 'speed_limit'] = previous1
Thank you for your patience. I would appreciate any suggestions. I wish you a great day.
I want to post my solution based on numpy select and pandas shift, which is faster than the previous one.
import numpy as np
import pandas as pd

def noise_remove(df, speed_limit_column):
    speed_data_column = df[speed_limit_column]
    previous_1 = speed_data_column.shift(-1)
    previous_2 = speed_data_column.shift(-2)
    next_1 = speed_data_column.shift(+1)
    next_2 = speed_data_column.shift(+2)
    conditions = [(previous_1 == previous_2) &
                  (next_1 == next_2) &
                  (previous_1 == next_1) &
                  (speed_data_column == previous_1),
                  (previous_1 == previous_2) &
                  (next_1 == next_2) &
                  (previous_1 == next_1) &
                  (speed_data_column != previous_1)]
    choices = [speed_data_column, previous_1]
    df[speed_limit_column] = np.select(conditions, choices, default=speed_data_column)
    return df
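For context, a minimal usage sketch (the column name 'speed_limit' comes from the question; the sample values are invented just to show the call):

import pandas as pd

# Hypothetical sample data: a single noisy value inside a stable stretch.
df = pd.DataFrame({'speed_limit': [50, 50, 50, 70, 50, 50, 50]})

df = noise_remove(df, 'speed_limit')
print(df['speed_limit'].tolist())  # the lone 70 is replaced by 50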
If you have any suggestions, I'd appreciate them. Have a great day!

Trying to filter a CSV file with multiple variables using pandas in Python

import pandas as pd
import numpy as np

df = pd.read_csv("adult.data.csv")

print("data shape: " + str(data.shape))
print("number of rows: " + str(data.shape[0]))
print("number of cols: " + str(data.shape[1]))
print(data.columns.values)

datahist = {}
for index, row in data.iterrows():
    k = str(row['age']) + str(row['sex']) + \
        str(row['workclass']) + str(row['education']) + \
        str(row['marital-status']) + str(row['race'])
    if k in datahist:
        datahist[k] += 1
    else:
        datahist[k] = 1

uniquerows = 0
for key, value in datahist.items():
    if value == 1:
        uniquerows += 1
print(uniquerows)

for key, value in datahist.items():
    if value == 1:
        print(key)

df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
I have been trying to get the above code to work.
I have limited experience in coding but it seems like the issue lies with some of the columns being objects. The int64 columns work just fine when it comes to filtering.
Any assistance will be much appreciated!
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
Firstly, you are attempting to use a Male variable; you probably meant a string, i.e. it should be 'Male'. Secondly, observe the [ and ] placement: you are extracting the part of the DataFrame with age equal to 58, then extracting the part of the DataFrame with sex equal to Male, and then trying to combine them with bitwise and. You should instead use & on the conditions rather than on pieces of the DataFrame, that is:
df.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
The int64 columns work just fine because you've specified the condition correctly as:
data['age'] == 58
However, the object column condition data['sex'] == Male should be specified as a string:
data['sex'] == 'Male'
Also, I noticed that you have loaded the dataframe df = pd.read_csv("adult.data.csv"). Do you mean this instead?
data = pd.read_csv("adult.data.csv")
The query at the end includes 2 conditions, each of which should be enclosed in parentheses inside the square-bracket [ ] filter. If the dataframe name is data (instead of df), it should be:
data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
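As a side note (a sketch, not part of either answer), the same filter can also be written with DataFrame.query, which avoids repeating the frame name; data is assumed to be the frame loaded from adult.data.csv.

# Equivalent filter using query on the 'age' and 'sex' columns.
filtered = data.query("age == 58 and sex == 'Male'")
print(filtered.shape)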

How to combine df.loc with for-loop to calculate new columns in pandas

I would like to learn how to use df.loc and a for-loop to calculate new columns for the dataframe below.
Problem: from df_G, for T = 400, take the value of each Go_j as input.
Then add a new column "G_ads_400" to dataframe df, computed as df['Adsorption_energy_eV'] - Go_h2o.
(The dataframes df_G and df were shown as screenshots in the original post.)
here is my code for each Temperature
Go_co2 = df_G.loc[df_G.index == "400" & df_G.Go_CO2]
Go_o2= df_G.loc[df_G.index == "400" & df_G.Go_O2]
Go_co= df_G.loc[df_G.index == "400" & df_G.Go_CO]
df.loc[df['Adsorbates'] == "CO2", "G_ads_400"] = df.Adsorption_energy_eV-Go_co2
df.loc[df['Adsorbates'] == "CO", "G_ads_400"] = df.Adsorption_energy_eV-Go_co
df.loc[df['Adsorbates'] == "O2", "G_ads_400"] = df.Adsorption_energy_eV-Go_o2
I am not sure why I keep getting errors, and I would like to know how to put this in a for-loop so it looks less messy.
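No answer was posted for this one, but here is a sketch of one likely fix (column and index names are taken from the snippets above; df_G is assumed to be indexed by temperature, with the string "400" as in the question). The error probably comes from operator precedence in df_G.index == "400" & df_G.Go_CO2, where & binds before ==. Selecting a single scalar with .loc[row, column] and looping over a species-to-column mapping avoids both the error and the repetition:

# Sketch only: names are assumptions based on the question's snippets.
species_to_col = {'CO2': 'Go_CO2', 'O2': 'Go_O2', 'CO': 'Go_CO'}
T = "400"

for species, g_col in species_to_col.items():
    go_value = df_G.loc[T, g_col]          # scalar Go_j at this temperature
    mask = df['Adsorbates'] == species
    df.loc[mask, f'G_ads_{T}'] = df.loc[mask, 'Adsorption_energy_eV'] - go_value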

Filling missing age in titanic dataset

I know there are several hundred solutions for this, but I was wondering if there is a smarter way to fill the missing values in the Age column of the pandas dataframe based on lengthy conditions such as the following.
mean_value = df[(df["Survived"]== 1) & (df["Pclass"] == 1) & (df["Sex"] == "male")
& (df["Embarked"] == "C") & (df["SibSp"] == 0) & (df["Parch"] == 0)].Age.mean().round(2)
df = df.assign(
Age=np.where(df.Survived.eq(1) & df.Pclass.eq(1) & df.Sex.eq("male") & df.Embarked.eq("C") &
df.SibSp.eq(0) & df.Parch.eq(0) & df.Age.isnull(), mean_value, df.Age)
)
Repeating this for all 6 columns above, with all categorical combinations, is too long and bulky; I was wondering if there is a smarter way to do this?
Edit (regarding @Ben.T's answer):
If I understood your method correctly, this is the "verbose version" of it?
for a in np.unique(df.Survived):
    for b in np.unique(df.Pclass):
        for c in np.unique(df.Sex):
            for d in np.unique(df.SibSp):
                for e in np.unique(df.Parch):
                    for f in np.unique(df.Embarked):
                        mean_value = df[(df["Survived"] == a) & (df["Pclass"] == b) & (df["Sex"] == c)
                                        & (df["SibSp"] == d) & (df["Parch"] == e) & (df["Embarked"] == f)].Age.mean()
                        df = df.assign(Age=np.where(df.Survived.eq(a) & df.Pclass.eq(b) & df.Sex.eq(c) & df.SibSp.eq(d) &
                                                    df.Parch.eq(e) & df.Embarked.eq(f) & df.Age.isnull(), mean_value, df.Age))
which is equivalent to this?
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))
You can create a variable that combines all of your criteria, and then you can use the ampersand to add more criteria later.
Note, in the seaborn titanic dataset, where I got the data from, the column names are lowercase.
criteria = ((df["survived"]== 1) &
(df["pclass"] == 1) &
(df["sex"] == "male") &
(df["embarked"] == "C") &
(df["sibsp"] == 0) &
(df["parch"] == 0))
fillin = df.loc[criteria, 'age'].mean()
df.loc[criteria & (df['age'].isnull()), 'age'] = fillin
I guess groupby.transform can do it. For each row it creates the mean over the group defined by all the columns in the groupby, and it does this for all possible combinations at once. Then using fillna with the series created will fill each missing value with the mean of the group with the same characteristics.
l_col = ['Survived','Pclass','Sex','Embarked','SibSp','Parch']
df['Age'] = df['Age'].fillna(df.groupby(l_col)['Age'].transform('mean'))
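To see what the transform produces, here is a tiny illustration (invented data, not the Titanic set, and only two grouping columns to keep it short):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'Pclass': [1, 1, 2, 2],
                    'Sex': ['male', 'male', 'female', 'female'],
                    'Age': [20.0, np.nan, 30.0, 40.0]})

# One mean per (Pclass, Sex) group, broadcast back to every row of that group.
group_means = toy.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
toy['Age'] = toy['Age'].fillna(group_means)
# The NaN in the (1, 'male') group becomes 20.0, the mean of that group.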

Trying different ways to do a conditional and the output is not what I expected (Python)

I'm trying to modify some cells of a column based on a double condition. The code should write "none" in the column "FCH_REC" (this column is originally filled with different dates) if column "LL" is equal to "201" and column "DEC" is equal to "RC". I also want to write "none" in the column "FCH_REC" if column "LL" is equal to "400" and column "DEC" is equal to "RCLA". Here is what I tried.
First, I convert those columns to strings:
table["FCH_REC"] = table["FCH_REC"].astype(str)
table["LL"] = table["LL"].astype(str)
table["DEC"] = table["DEC"].astype(str)
Second I tried this:
table.loc[(tabla['LL'] == '201') & (tabla['DEC'] == "RC" ), "FCH_REC"] = None
table.loc[(tabla['LL'] == '400') & (tabla['DEC'] == "RCLA" ), "FCH_REC"] = None
Third I tried this:
table.columns = table.columns.str.replace(' ', '')
table.loc[(tabla['LL'] == '201') & (table['DEC'] == "RC" ), "FCH_REC"] = "None"
table.loc[(renove['LL'] == '400') & (table['DEC'] == "RCLA" ), "FCH_REC"] = "None"
Fourth, I tried this (this one has syntax problems):
table["FCH_REC"] = table[["FCH_REC","LL"]].apply(lambda row:
row["FCH_REC"] = "None" if row["LL"] in ("RC", "RCLA") else row["FCH_REC"] )
Fifth I tried this:
for i in list(range(0, table.shape[0])):
    if table.loc[i, "DEC"] in ("RC", "RCLA"):
        table.loc[i, "FCH_REC"] == "NONE"
I don't know what's going on.
Thanks for the help!
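No answer was posted for this one, but a sketch of a likely fix (assuming the dataframe is really called table): the second attempt is the right idea, except that it mixes the names tabla, renove and table, so the conditions are built from objects that are not the frame being assigned to. The fifth attempt also compares with == instead of assigning with =. With one consistent name, and np.nan (or None) as the cleared value, it would look like this:

import numpy as np

# Sketch: same logic as the second attempt, with one consistent frame name.
table.loc[(table['LL'] == '201') & (table['DEC'] == 'RC'), 'FCH_REC'] = np.nan
table.loc[(table['LL'] == '400') & (table['DEC'] == 'RCLA'), 'FCH_REC'] = np.nan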
