vlookup/index(+match) alternative in pandas df - python

Well. In this thread, answers are needed only if you have a faster solution to this case =)
I have a dataframe with the following columns: a status column (containing the values "ok" and "not_responding"), a concat column (containing a concatenated value that can be treated as an id), and columns holding results for specific periods, named in the format "week number + R/S/F/U suffix". The dataframe initially contains a certain number of duplicates, because a single id can have one row with status "ok" and another with status "not_responding".
At one stage of working with the frame I need to match it against another frame on the concat column (adding the W* R/S/F/U columns; the W* A columns were already there). And here the fun begins: I do not want duplicated results in the week-number columns. If the duplicated rows for a concat value have equal results in both the "ok" and the "not_responding" status, I need to delete the result from the "not_responding" row.
After a lot of research I could not find an alternative to INDEX(+MATCH) from Google Sheets in pandas, so I wrote a clumsy function that cycled through the entire frame for several minutes, checking the results and the status of the rows by id. This is how it looked:
for i in range(4):
    wnum1 = f'{week[i]} R'
    wnum2 = f'{week[i]} S'
    wnum3 = f'{week[i]} F'
    # ids and results split by status, re-indexed so positions line up
    ind1 = df.loc[df.status == 'ok', 'concat'].reset_index(drop=True)
    ind2 = df.loc[df.status == 'not_responding', 'concat'].reset_index(drop=True)
    D1 = df.loc[df.status == 'ok', wnum1].reset_index(drop=True)
    D2 = df.loc[df.status == 'not_responding', wnum1].reset_index(drop=True)
    for j in ind2:
        # only ids that appear more than once can be duplicates
        if len(df.loc[df.concat == j]) > 1:
            # same result in both statuses -> blank out the 'not_responding' row
            if D2.iloc[ind2[ind2 == j].index[0]] == D1.iloc[ind1[ind1 == j].index[0]]:
                df.loc[(df.status == 'not_responding') & (df.concat == j), wnum1] = pd.NA
                df.loc[(df.status == 'not_responding') & (df.concat == j), wnum2] = pd.NA
                df.loc[(df.status == 'not_responding') & (df.concat == j), wnum3] = pd.NA
However, in the end I found an option that runs in fractions of a second using masks. In general, I hope this helps some newbie like me find a faster solution =)
for i in range(4):
    wnum1 = f'{week[i]} R'
    wnum2 = f'{week[i]} S'
    wnum3 = f'{week[i]} U'
    # mark every repeated (concat, result) pair after its first occurrence
    mask1 = df.loc[:, ['concat', wnum1]].duplicated(keep='first')
    mask2 = df.loc[:, ['concat', wnum2]].duplicated(keep='first')
    mask3 = df.loc[:, ['concat', wnum3]].duplicated(keep='first')
    # blank out the duplicated results
    df.loc[mask1, wnum1] = pd.NA
    df.loc[mask2, wnum2] = pd.NA
    df.loc[mask3, wnum3] = pd.NA
That's the example df:
That's the expected result:
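For readers looking for the general INDEX/MATCH (VLOOKUP) analogue in pandas, the usual idiom is a left merge on the key column. A minimal sketch with made-up frames and column names:
import pandas as pd

# made-up frames: 'concat' plays the role of the lookup key, 'W36 R' is a hypothetical week column
left = pd.DataFrame({'concat': ['a1', 'b2', 'c3'], 'status': ['ok', 'ok', 'not_responding']})
lookup = pd.DataFrame({'concat': ['a1', 'b2'], 'W36 R': [5, 7]})

# a left merge keeps every row of `left` and pulls in the matching columns,
# much like VLOOKUP / INDEX+MATCH with exact matching; missing keys become NaN
result = left.merge(lookup, on='concat', how='left')
print(result)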

Related

Trying to filter a CSV file with multiple variables using pandas in python

import pandas as pd
import numpy as np

df = pd.read_csv("adult.data.csv")
print("data shape: " + str(data.shape))
print("number of rows: " + str(data.shape[0]))
print("number of cols: " + str(data.shape[1]))
print(data.columns.values)

datahist = {}
for index, row in data.iterrows():
    k = str(row['age']) + str(row['sex']) + \
        str(row['workclass']) + str(row['education']) + \
        str(row['marital-status']) + str(row['race'])
    if k in datahist:
        datahist[k] += 1
    else:
        datahist[k] = 1

uniquerows = 0
for key, value in datahist.items():
    if value == 1:
        uniquerows += 1
print(uniquerows)

for key, value in datahist.items():
    if value == 1:
        print(key)

df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
I have been trying to get the above code to work.
I have limited experience in coding but it seems like the issue lies with some of the columns being objects. The int64 columns work just fine when it comes to filtering.
Any assistance will be much appreciated!
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
First, you are attempting to use a Male variable; you probably meant a string, i.e. it should be 'Male'. Secondly, note the [ and ] placement: you are extracting the part of the DataFrame with age equal to 58, then extracting the part with sex equal to Male, and then trying to combine them with bitwise and. You should use & on the conditions rather than on pieces of the DataFrame, that is:
df.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
The int64 columns work just fine because you've specified the condition correctly as:
data['age'] == 58
However, the object column condition data['sex'] == Male should be specified as a string:
data['sex'] == 'Male'
Also, I noticed that you have loaded the dataframe df = pd.read_csv("adult.data.csv"). Do you mean this instead?
data = pd.read_csv("adult.data.csv")
The query at the end includes two conditions; each should be enclosed in parentheses inside the square-bracket [ ] filter. If the dataframe name is data (instead of df), it should be:
data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
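For reference, a tiny self-contained illustration of combining the two conditions with & (the frame below is made up and only stands in for the adult dataset):
import pandas as pd

# small made-up frame standing in for the adult dataset
data = pd.DataFrame({'age': [58, 58, 40], 'sex': ['Male', 'Female', 'Male']})

# combine the two boolean conditions with &, each condition in parentheses
subset = data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
print(subset)  # only the first row matches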

How to combine df.loc with for-loop to calculate new columns in pandas

I would like to learn how to use df.loc and a for-loop to calculate new columns for the dataframes below.
Problem: from df_G, for T = 400, take the value of each Go_j as input.
Then add a new column "G_ads_400" to the dataframe df, computed as df['Adsorption_energy_eV'] - Go_h2o.
df_G
df
Here is my code for each temperature:
Go_co2 = df_G.loc[df_G.index == "400" & df_G.Go_CO2]
Go_o2 = df_G.loc[df_G.index == "400" & df_G.Go_O2]
Go_co = df_G.loc[df_G.index == "400" & df_G.Go_CO]
df.loc[df['Adsorbates'] == "CO2", "G_ads_400"] = df.Adsorption_energy_eV - Go_co2
df.loc[df['Adsorbates'] == "CO", "G_ads_400"] = df.Adsorption_energy_eV - Go_co
df.loc[df['Adsorbates'] == "O2", "G_ads_400"] = df.Adsorption_energy_eV - Go_o2
I am not sure why I keep getting errors, and I would like to know how to put it in a for-loop so it looks less messy.
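A minimal sketch of one way the per-temperature block could be folded into loops, using made-up numbers in place of the frames shown above; the adsorbate-to-column mapping is hypothetical, and if df_G's index holds strings the lookup would be df_G.loc["400", go_col] instead:
import pandas as pd

# made-up stand-ins for the frames in the question
df_G = pd.DataFrame({'Go_CO2': [-0.5], 'Go_O2': [-0.3], 'Go_CO': [-0.2]}, index=[400])
df = pd.DataFrame({'Adsorbates': ['CO2', 'CO', 'O2'],
                   'Adsorption_energy_eV': [1.2, 0.8, 0.9]})

# hypothetical mapping from adsorbate label to its reference-energy column in df_G
go_columns = {'CO2': 'Go_CO2', 'O2': 'Go_O2', 'CO': 'Go_CO'}

for T in [400]:  # extend this list to cover more temperatures
    new_col = f'G_ads_{T}'
    for adsorbate, go_col in go_columns.items():
        go_value = df_G.loc[T, go_col]  # scalar reference energy at this temperature
        mask = df['Adsorbates'] == adsorbate
        df.loc[mask, new_col] = df.loc[mask, 'Adsorption_energy_eV'] - go_value
print(df)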

Check for each row in several columns and append for each row if the requirement is met or not. python

I have the following example of my dataframe:
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
The conditions I want to check are:
the cust_num is equal for two rows in the column,
the Title is equal for both rows in the dataframe,
the second_date in one row <= the end_date in the other row.
If all these requirements are met the value True should be appended to a new column in the original row.
Because I'm working with a big dataset I'm looking for an efficient way to do this.
In this case only the first record should get a true value.
I have checked the apply with lambda and the groupby function in Python but couldn't find a way to make these work.
Try this (spontaneously I cannot come up with a faster method):
import pandas as pd
import numpy as np

df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df["new col"] = False

for cust in set(df["cust_num"]):
    indices = df.index[df["cust_num"] == cust].tolist()
    if len(indices) > 1:
        sub_df = df.loc[indices]
        for title in set(df.loc[indices]["Title"]):
            indices_title = sub_df.index[sub_df["Title"] == title]
            if len(indices_title) > 1:
                for i in indices_title:
                    if sub_df.loc[indices_title]["second_date"][i] <= sub_df.loc[indices_title]["end_date"][i]:
                        # mark only the row where the date condition first holds
                        df.loc[i, "new col"] = True
                        break
First you need to make all date columns comparable with each other by casting them to datetime. Then create the additional column you want.
Now create a set of all unique customer numbers and iterate through them. For each customer number, get a list of all row indices with this customer number. If this list is longer than 1, you have several rows with the same customer number. Then create a sub-df of your dataframe with all rows with that customer number. Then iterate through the set of all titles. For each title, check whether the same title occurs somewhere else in the sub-df (len > 1). If this is the case, iterate through those rows and write True in the additional column in the row where the date condition is met for the first time.
This should work. Also, from reading the comments, I am assuming that all cust_num values are unique.
import pandas as pd

df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})

df["second_date"] = pd.to_datetime(df["second_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
df['Value'] = False

# compare every pair of distinct rows
for i in range(len(df)):
    for j in range(len(df)):
        if i != j:
            if df.loc[j, 'end_date'] >= df.loc[i, 'second_date']:
                if df.loc[i, 'cust_num'] == df.loc[j, 'cust_num']:
                    if df.loc[i, 'Title'] == df.loc[j, 'Title']:
                        df.loc[i, 'Value'] = True
Tell me if this code works, and let me know of any errors.
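For larger frames, a vectorized alternative (not from either answer) is a self-merge on cust_num and Title followed by a single date comparison. The sketch below assumes the reading that a row is marked True when some other matching row's second_date falls on or before that row's end_date, which reproduces the expected result stated in the question (only the first record True):
import pandas as pd

df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
df['second_date'] = pd.to_datetime(df['second_date'], format='%d-%m-%Y')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d-%m-%Y')

# pair every row with every other row that shares cust_num and Title
pairs = df.reset_index().merge(df.reset_index(), on=['cust_num', 'Title'],
                               suffixes=('', '_other'))
pairs = pairs[pairs['index'] != pairs['index_other']]

# a row gets True when some other row's second_date is on or before its end_date
hit = pairs.loc[pairs['second_date_other'] <= pairs['end_date'], 'index'].unique()
df['new_col'] = df.index.isin(hit)
print(df)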

pandas fillna is not working on subset of the dataset

I want to impute the missing values for df['box_office_revenue'] with the median specified by df['release_date'] == x and df['genre'] == y .
Here is my median finder function below.
def find_median(df, year, genre, col_year, col_rev):
    median = df[(df[col_year] == year) & (df[col_rev].notnull()) & (df[genre] > 0)][col_rev].median()
    return median
The median function works. I checked. I did the code below since I was getting some CopyValue error.
pd.options.mode.chained_assignment = None # default='warn'
I then go through the years and genres, col_name = ['is_drama', 'is_horror', etc] .
i = df['release_year'].min()
while (i < df['release_year'].max()):
    for genre in col_name:
        median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
        df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
    print(i)
    i += 1
However, nothing changed!
len(df['box_office_revenue'].isnull())
The output was 35527. Meaning none of the null values in df['box_office_revenue'] had been filled.
Where did I go wrong?
Here is a quick look at the data: The other columns are just binary variables
You mentioned
I did the code below since I was getting some CopyValue error...
The warning is important. You did not give your data, so I cannot actually check, but the problem is likely due to:
df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(..)
Let's break this down:
First you select some rows with:
df[(df['release_year'] == i) & (df[genre] > 0)]
Then from that, you select a column with:
...['box_office_revenue']
And now you have a problem...
Why?
The problem is that when you selected some rows (i.e. not all), pandas was forced to create a copy of your dataframe. You then select a column of the copy! Then you fillna() on the copy. Not super useful.
How do I fix it?
Select the column first:
df['box_office_revenue'][(df['release_year'] == i) & (df[genre] > 0)].fillna(..)
By selecting the entire column first, pandas is not forced to make a copy, and thus subsequent operations should work as desired.
This is not elegant, but I think it works. Basically, I calculate the means conditioned on genre and year, and then join the data to a dataframe containing the imputing values. Then, wherever the revenue data is null, replace the null with the imputed value
import pandas as pd
import numpy as np
#Fake Data
rev = np.random.normal(size = 10_000,loc = 20)
rev_ix = np.random.choice(range(rev.size), size = 100 )
rev[rev_ix] = np.NaN
year = np.random.choice(range(1950,2018), replace = True, size = 10_000)
genre = np.random.choice(list('abc'), size = 10_000, replace = True)
df = pd.DataFrame({'rev':rev,'year':year,'genre':genre})
imputing_vals = df.groupby(['year','genre']).mean()
s = df.set_index(['year','genre'])
s.rev.isnull().any() #True
#Creates dataframe with new column containing the means
s = s.join(imputing_vals, rsuffix = '_R')
s.loc[s.rev.isnull(),'rev'] = s.loc[s.rev.isnull(),'rev_R']
new_df = s['rev'].reset_index()
new_df.rev.isnull().any() #False
This URL describing chained assignments seems useful for such a case: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
As seen in the above URL:
Hence, instead of doing (in your 'for' loop):
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
You can try:
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df.loc[(df['release_year'] == i) & (df[genre] > 0) & (df['box_office_revenue'].isnull()), 'box_office_revenue'] = median
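If you prefer to avoid the explicit loop altogether, a groupby-transform fill does the same kind of per-group median imputation in one pass. This is not from the answers above, and it assumes a single genre label column instead of the question's binary is_* columns, so treat it as a sketch with made-up data:
import numpy as np
import pandas as pd

# made-up data: one genre label per row instead of the binary is_* columns
df = pd.DataFrame({
    'release_year': [2000, 2000, 2000, 2001, 2001],
    'genre': ['drama', 'drama', 'drama', 'horror', 'horror'],
    'box_office_revenue': [10.0, np.nan, 30.0, np.nan, 50.0],
})

# median revenue within each (year, genre) group, broadcast back to every row
group_median = df.groupby(['release_year', 'genre'])['box_office_revenue'].transform('median')
df['box_office_revenue'] = df['box_office_revenue'].fillna(group_median)
print(df)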

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift and
thanks to this previous question and answer: How to speed up Pandas multilevel dataframe shift by group? I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index group to NaN, and now I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
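For reference, the last-N-rows-per-group masking can also be done without the groupby.apply round trip by counting each row's position from the end of its group with cumcount. This is not from the original answer; a minimal sketch on a toy multi-index frame in modern pandas:
import numpy as np
import pandas as pd

# toy multi-index frame: 3 groups of 4 rows each
idx = pd.MultiIndex.from_product([list('abc'), range(4)], names=['category', 'step'])
df = pd.DataFrame({'colA': np.random.randn(12), 'colB': np.random.randn(12)}, index=idx)

N = 2  # how many trailing rows per group to blank out
df['tmpShift'] = df['colB'].shift(-N)  # forward-looking shift done globally

# position of each row counted from the end of its group: last row is 0, next-to-last is 1, ...
pos_from_end = df.groupby(level=0).cumcount(ascending=False)
df.loc[pos_from_end < N, 'tmpShift'] = np.nan  # drop values that leaked in from the next group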
