Compare consecutive dataframe rows based on columns in Python

I have a dataframe with data about suppliers. If the name of the supplier and the group are the same, the number of units should ideally be the same. However, sometimes that is not the case. I am writing code that imports the data from SQL into Python and then compares these numbers.
This is for Python 3. Importing the data into Python was easy. I am a Python rookie. To make things easier for myself, I created individual dataframes for each supplier to compare numbers instead of looking at the whole dataframe at once.
supp = data['Supplier']
supplier = []
for s in supp:
    if s not in supplier:
        supplier.append(s)

su = "Authentic Brands Group LLC"
deal = defaultdict(list)
blist = []
glist = []
columns = ['Supplier', 'ID', 'Units', 'Grp']
df3 = pd.DataFrame(columns=columns)

def add_row(df3, row):
    df3.loc[-1] = row
    df3.index = df3.index + 1
    return df3.sort_index()
for row in data.itertuples():
    for x in supplier:
        s1 = row.Supplier
        if s1 == su:
            if row.Supplier_Group not in glist:
                glist.append(row.Supplier_Group)
            for g in range(len(glist)):
                if glist[g] == row.Supplier_Group:
                    supp = x
                    blist = []
                    blist.append(row.ID)
                    blist.append(row.Units)
                    blist.append(glist[g])
                    add_row(df3, [s1, row.ID, row.Units, glist[g]])
                    break
            break
        break
for i in range(1, len(df3)):
    if df3.Supplier.loc[i] == df3.Supplier.loc[i-1] and df3.Grp.loc[i] == df3.Grp.loc[i-1]:
        print(df3.Supplier, df3.Grp)
This gives me a small subset that looks like this:
Now I want to look at the supplier name and Grp: if they are the same as in other rows of the dataframe, Units should also be the same. In this case, row 2 is incorrect; Units should be 100. I want to add another column to this dataframe that says 'False' when the number of Units is correct. This is the tricky part for me. I can iterate over the rows, but I'm unsure how to compare them and add the column.
I'm stuck at this point.
Any help is highly appreciated. Thank you.

If you have all of your data in a single dataframe, df, you can do the following:
grp_by_cols = ['Supplier', 'ID', 'Grp']
all_cols = grp_by_cols + ['Unit']
res_df = df.assign(first_unit=lambda df: df.loc[:, all_cols]
                                           .groupby(grp_by_cols)
                                           .transform('first'))\
           .assign(incorrect=lambda df: df['Unit'] != df['first_unit'])\
           .assign(incorrect=lambda df: df.loc[:, grp_by_cols + ['incorrect']]
                                          .groupby(grp_by_cols)
                                          .transform(np.any))
The first call to assign adds a single new column (called 'first_unit') that is the first value of "Unit" for each group of Supplier/ID/Grp (see grp_by_cols).
The second call to assign adds a column called 'incorrect' that is True when 'Unit' doesn't equal 'first_unit'. The third and final assign call overwrites that column to be True if any rows in that group are True. You can remove that if that's not what you want.
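As an illustration, here is a minimal sketch of the same idea on a small made-up dataframe (the column is called 'Unit' to match the answer; the question's data uses 'Units'):

import numpy as np
import pandas as pd

# Hypothetical data: three rows for one Supplier/ID/Grp, one with the wrong Unit value
df = pd.DataFrame({'Supplier': ['Authentic Brands Group'] * 3,
                   'ID': [2, 2, 2],
                   'Grp': ['A', 'A', 'A'],
                   'Unit': [100, 80, 100]})

grp_by_cols = ['Supplier', 'ID', 'Grp']

# Same approach in two explicit steps: first Unit per group, then a mismatch flag
df['first_unit'] = df.groupby(grp_by_cols)['Unit'].transform('first')
df['incorrect'] = df['Unit'] != df['first_unit']
print(df)  # the middle row (Unit == 80) is flagged True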
Then, if you want to look at the results for a single supplier, you can do something like:
res_df.query('Supplier == "Authentic Brands Group"')

Related

Replacing large dataset Multiple Conditions Loop with faster alternative in Pandas Dataframe

I'm trying to perform a nested loop on a DataFrame, but I'm encountering serious speed issues. Essentially, I have a list of unique values that I want to loop through, and for each one I need to work on four different columns. The code is shown below:
def get_avg_val(temp_df, col):
    temp_df = temp_df.replace(0, np.NaN)
    avg_val = temp_df[col].mean()
    return (0 if math.isnan(avg_val) else avg_val)

Final_df = pd.DataFrame(rows_list, columns=col_names)

""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str) + "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()

""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread', 'Effective Duration', 'Spread Duration', 'Effective Convexity']

for unique_val in unique_list:
    temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]
    for col in col_list:
        amended_val = get_avg_val(temp_df, col)
        """ The below identifies rows where the Unique code matches and there is a NaN - via mask; afterwards np.where replaces the value in the cell with the amended value"""
        mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        Final_df[col] = np.where(mask, amended_val, Final_df[col])
The mask specifies when the two conditions are fulfilled in the dataframe, and np.where then replaces the values in the identified cells with amended_val (which is computed by a function that returns an average value).
This normally works, but with over 400k rows and a dozen columns it is really slow. Is there any recommended way to improve the two for loops? I believe they are the reason the code takes so long.
Thanks all!
I am not certain if this is what you are looking for, but if your goal is to impute missing values of a series with the average value of that series within a particular group, you can do it as follows:
for col in col_list:
    Final_df[col] = Final_df.groupby('Group_SecCode')[col].transform(
        lambda x: x.fillna(x.mean()))
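For illustration, here is a minimal self-contained sketch of that idiom on made-up data (the values and group codes are assumptions, not the real dataset):

import numpy as np
import pandas as pd

# Two groups with some missing values to impute
df = pd.DataFrame({'Group_SecCode': ['A_X', 'A_X', 'A_X', 'B_Y', 'B_Y'],
                   'Effective Duration': [1.0, np.nan, 3.0, np.nan, 5.0]})

# Each NaN is replaced by the mean of its own group:
# group A_X mean = 2.0, group B_Y mean = 5.0
df['Effective Duration'] = (df.groupby('Group_SecCode')['Effective Duration']
                              .transform(lambda x: x.fillna(x.mean())))
print(df)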
UPDATE - Found an alternative way to perform the amendments via a dictionary; the task now takes 1.5 min rather than 35 min.
Code below. The different approach here filters the DataFrame into smaller ones, on which a series of operations are carried out. The new data is then stored in a dictionary, with the loop adding more data to it. Finally, the dictionary is transferred back to the initial DataFrame, replacing it entirely with the updated dataset.
""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}
for unique_val in unique_list:
condition = Final_df['Group_SecCode'].isin([unique_val])
temp_df = Final_df[condition].replace(0, np.NaN)
for col in col_list:
""" Perform Amendments at Filtered Dataframe - by column """
""" 1. Replace NaN values with Median for the Datapoints encountered """
#amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
#mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
#Final_df[col] = np.where(mask, amended_val, Final_df[col])
amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
mask = temp_df[col].isnull()
temp_df[col] = np.where(mask, amended_val, temp_df[col])
""" 2. Perform Validation Checks via Function defined on line 36 """
temp_df = val_checks (temp_df,col)
""" Updates Dictionary with updated data at Unique Value level """
final_dict.update(temp_df.to_dict('index')) #Updates Dictionary with Unique value Dataframe
""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)
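For comparison, the median-based imputation in the loop above can also be written with the groupby/transform idiom from the earlier answer. This is only a sketch: it assumes Final_df and col_list from the snippet above, skips the val_checks step, and leaves a group that is entirely missing as NaN rather than 0.

# Sketch: per-group median imputation without the explicit nested loops
clean = Final_df.replace(0, np.NaN)
for col in col_list:
    Final_df[col] = (clean.groupby('Group_SecCode')[col]
                          .transform(lambda x: x.fillna(x.median())))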

Extract and match items dealing with multiple data-frames using Python

I have two dataframes which can be created using the code shown below
df1 = pd.DataFrame({'home': [1, np.nan, 2, np.nan, 3, 4],
                    'PERSONAL INFORMATION': ['Study Number', 'Study ID',
                                             'Age when interview done',
                                             'Derived using date of birth',
                                             'Gender', 'ethnicity'],
                    'VARIABLE': ['studyid', 'dummy', 'age_interview', 'dummy',
                                 'gender', 'Chinese'],
                    'Remarks': [2000000001, 20005000001, 4265453, 0, 4135376, 2345678]})
df2 = pd.DataFrame({'level_0': ['studyid', 'age_interview', 'gender', 'dobyear',
                                'ethderived', 'smoke', 'alcohol'],
                    '0': ['tmp001', 56, 'Female', 1950, 'Chinese', 'No', 'Yes']})
Aim
1) My objective is to take the values from the 'level_0' column of df2 and look for them in the 'VARIABLE' column of df1 to fetch their 'Remarks' column value, provided the below condition is satisfied
a) The 'home' column of df1 should contain digits as part of its value (e.g. 1, 2, 3, 4, B1.5, C1.9, D1.2 are all valid values for the 'home' column)
2) My objective is the same as above, but here I would like to take the values from the '0' column of df2 and look for them in the 'PERSONAL INFORMATION' column of df1 to fetch their 'Remarks' value, provided the below condition is satisfied
a) 'VARIABLE' column of df1 should contain 'dummy' as a value
For the above two scenarios, I have written the below code but for some reason I feel that it is quite lengthy/inefficient. There should be some easy way to do this.
Scenario - 1
qconc_id = []
missed_items = []
col_list = []
for i in df7.index:
    ques = df7['level_0'][i]
    col_list.append(ques)
    try:
        qindex = (int(df[df['VARIABLE'] == ques].index[0]),
                  df.columns.get_loc('VARIABLE'))
        pos_qindex = qindex[0]
        ques_value = df['home'][pos_qindex]
        result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", ques_value)
        while result is None:
            pos_qindex = pos_qindex - 1
            ques_value = df['home'][pos_qindex]
            result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", ques_value)
        qconc_id.append(df['Remarks'][pos_qindex])
    except:
        missed_items.append(ques)
Scenario - 2
aconc_id = []
missed_items = []
ans_list = []
for i in df7.index:
    ans = df7[0][i]
    print("ans is ", ans)
    ans_list.append(ans)
    idx = 0
    try:
        aindex = df[df['PERSONAL INFORMATION'].str.contains(ans, case=False,
                                                            regex=False)].index
        print(aindex)
        pos_aindex = aindex[idx]
        while ((df['VARIABLE'][pos_aindex] != 'dummy') and
               (df['PERSONAL INFORMATION'].str.contains('Yes|No', regex=True)[pos_aindex] == False)):
            pos_aindex = aindex[idx + 1]
        print("The value is ", df['Remarks'][pos_aindex])
        aconc_id.append(df['Remarks'][pos_aindex])
    except:
        print("Goes to Exception")
        aconc_id.append('0')
        missed_items.append(ans)
Please note these two things
a) I have used a while loop because the values might repeat. For example, we might find a matching value of 'No', but df1['VARIABLE'] may not be 'dummy'. So I increase the index values in both scenarios to check whether the next occurrence of 'No' has the 'dummy' value in the VARIABLE column. The same applies to scenario 1 as well.
b) How can I handle scenarios where "No" also matches inside "Notes" or "Nocase"? As you can see from my code, I am using regex, but I am still encountering errors here.
As you can see, I am making some modifications to the code and writing it twice. How can I make it elegant and efficient? I am sure there must be a very easy and simple way to do this.
Any suggestions/ideas on an alternative approach, e.g. changing the format of the source data or using a merge/join approach, are also welcome.
I expect the output, the 'Remarks' values, to be stored in a list. Please see the screenshot of what I have done.
You should avoid explicit loops in pandas as much as possible, because they will not be vectorized (optimized, in pandas and numpy terms). Here you could merge your dataframes:
Scenario 1:
# extract values where df2.level_0 == df1.VARIABLE
tmp = pd.merge(pd.DataFrame(df2.level_0), df1.loc[:, ['home', 'VARIABLE', 'Remarks']],
               left_on=['level_0'], right_on=['VARIABLE'])
# drop lines where home would not contain a digit
tmp.drop(tmp.loc[~tmp.home.astype(np.str_).str.contains(r'\d')].index,
         inplace=True)
# extract the Remarks column into a list
lst = tmp.Remarks.tolist()
With your example data I get [2000000001, 4265453, 4135376]
Scenario 2:
tmp = pd.merge(pd.DataFrame(df2['0']), df1.loc[:, ['PERSONAL INFORMATION',
                                                   'VARIABLE', 'Remarks']],
               left_on=['0'], right_on=['PERSONAL INFORMATION'])
tmp.drop(tmp.loc[~(tmp['VARIABLE'] == 'dummy')].index, inplace=True)
lst.extend(tmp.Remarks.tolist())
With your example data I get no additional values because from the first step, tmp is an empty dataframe.

take count of values from a dataframe which are separated by a comma

I have a dataframe 'genres', where the value in each row of the column is separated by ','. I need to take a count of each value, such that Comedy 2, Drama 7, and so on. I tried many methods but failed.
I tried genres = trending.groupby(['genre']).size() but this line considers a value like 'Comedy,Crime,CriticallyAcclaimed' as one. I'm new to Python, please help me.
genre
Comedy,Crime,CriticallyAcclaimed
Comedy,Drama,Romance
Drama
Drama
Drama,Hollywood
Drama,Romance
Drama,Romance
Drama,Romance,Classic
I've got the answer:
genres = pd.DataFrame(genres.genre.str.split(',', expand=True).stack(), columns= ['genre'])
genres = genres.reset_index(drop = True)
genre_count = pd.DataFrame(genres.groupby(by = ['genre']).size(),columns = ['count'])
genre_count = genre_count.reset_index()
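For reference, newer pandas versions (0.25+) can get the same counts more directly with str.split, explode and value_counts. A small sketch using the sample genre column from the question:

import pandas as pd

trending = pd.DataFrame({'genre': ['Comedy,Crime,CriticallyAcclaimed',
                                   'Comedy,Drama,Romance',
                                   'Drama', 'Drama', 'Drama,Hollywood',
                                   'Drama,Romance', 'Drama,Romance',
                                   'Drama,Romance,Classic']})

# Split each row into a list, explode to one genre per row, then count occurrences
genre_count = trending['genre'].str.split(',').explode().value_counts()
print(genre_count)  # Drama 7, Romance 4, Comedy 2, ...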
The following code assumes that you already know the maximum number of items in one row. This means you need to read the file once to find this information (here we assume this number is 3, based on your example).
max_num_of_items_in_one_row = 3
cols = range(max_num_of_items_in_one_row)
df = pd.read_csv('genre.txt', names=cols, engine='python', skiprows=1)
df = df.applymap(lambda x: 'NA' if x==None else x)
all_ = df.values.flatten()
genres = np.unique(all_)
for y in genres:
    tmp = df.applymap(lambda x: 1 if x == y else 0)
    print(y, tmp.values.flatten().sum())
The code reads the file into a dataframe, gets rid of None values, finds all the unique values in the dataframe, and counts the number of their occurrences.
If you are using pandas, which, even though not stated in the OP, can be guessed, you could do something similar to this:
from collections import Counter

# Code where you get the trending variable
genreCount = Counter()
for row in trending.itertuples():
    genreCount.update(row[0].split(","))  # Change the 0 to the position of the genre column

print(genreCount)        # Works like a dict where keys are the genres and values the appearances
print(dict(genreCount))  # You can also turn it into a dict, but the Counter already works as one

DataFrame change doesn't save when iterating

I am trying to read a certain DF from file and add to it two more columns containing, say, the year and the week derived from other columns in the DF. When I apply the code to generate a single new column, all works great. But when there are a few columns to be created, the change does not apply. Specifically, the new columns are created but their values are not what they are supposed to be.
I know that this happens because I first set all new values to a certain initial string and then change some of them, but I don't understand why it works on a single column and is "nulled" for multiple columns, leaving only the latest column changed... Help please?
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    bad_ind = list(np.where(tbl[date_cols[i]] == 'No Fill')[0])
    tbl_ind = list(range(len(tbl)))
    for j in range(len(bad_ind)):
        tbl_ind.remove(bad_ind[j])
    tmp = pd.to_datetime(tbl[date_cols[i]][tbl_ind])
    tbl[tmp_col_name][tbl_ind] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
If I try the following lines, disregarding possible "empty data values", everything works...
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    tmp = pd.to_datetime(tbl[date_cols[i]])
    tbl[tmp_col_name] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
It has to do with not changing all the data values, but I don't understand why the change does not apply. After all, before the second iteration begins, the DF seems to be updated, and then tbl[tmp_col_name] = 'No Week' in the second iteration "deletes" the changes made in the first iteration, but only partially: it leaves the new column in place but filled with 'No Week' values...
Many thanks to @EdChum! Performing chained indexing may or may not work. In the case of creating multiple new columns and then filling in only some of their values, it doesn't work. More precisely, it does work, but only on the last updated column. Using the loc, iloc or ix accessors to set the data works. In the case of the above code, to make it work, one needs to cast tbl_ind into an np.array, using tbl[col_name[j]].iloc[np.array(tbl_ind)] = tmp.apply(lambda x: x.year)
Many thanks and credit for the answer to @EdChum.
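A minimal sketch of the difference on a made-up frame (not the OP's data): chained indexing performs two separate lookups, so the assignment may land on a temporary copy, while a single .loc call selects rows and column in one indexing operation and writes to the original.

import pandas as pd

tbl = pd.DataFrame({'Col1_WEEK': ['No Week'] * 4})
rows = [0, 2]

# Chained indexing: may raise SettingWithCopyWarning and leave tbl unchanged
tbl['Col1_WEEK'][rows] = '2019+01'

# Single .loc call: the assignment is applied to tbl itself
tbl.loc[rows, 'Col1_WEEK'] = '2019+01'
print(tbl)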

Pandas For Loop, If String Is Present In ColumnA Then ColumnB Value = X

I'm pulling JSON data from the Binance REST API; after formatting, I'm left with the following...
I have a dataframe called Assets with 3 columns [Asset,Amount,Location],
['Asset'] holds ticker names for crypto assets e.g.(ETH,LTC,BNB).
However when all or part of that asset has been moved to 'Binance Earn' the strings are returned like this e.g.(LDETH,LDLTC,LDBNB).
['Amount'] can be ignored for now.
['Location'] is initially empty.
I'm trying to set the value of ['Location'] to 'Earn' if the string in ['Asset'] includes 'LD'.
This is how far I got, but I can't remember how to apply the change to only the current item; it's been ages since I've used Pandas or for loops.
I'm only able to apply it to the entire column rather than to the row in the current iteration.
for Row in Assets['Asset']:
    if Row.find('LD') == 0:
        print('Earn')
        Assets['Location'] = 'Earn'  # <---- How to apply this to the current row only?
    else:
        print('???')
        Assets['Location'] = '???'   # <---- How to apply this to the current row only?
The print statements work correctly, but currently the whole column gets populated with the same value (whichever was last) as you might expect.
So (LDETH,HOT,LDBTC) returns ('Earn','Earn','Earn') rather than the desired ('Earn','???','Earn')
Any help would be appreciated...
np.where() fits here. If the Asset starts with LD, then return Earn, else return ???:
Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
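For example, with the three tickers mentioned in the question (a small made-up frame), this produces the desired ('Earn', '???', 'Earn'):

import numpy as np
import pandas as pd

# Illustrative data using the example tickers from the question
Assets = pd.DataFrame({'Asset': ['LDETH', 'HOT', 'LDBTC'],
                       'Amount': [1.0, 2.0, 3.0],
                       'Location': ''})

Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
print(Assets['Location'].tolist())  # ['Earn', '???', 'Earn']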
You could run a lambda in df.apply to check whether 'LD' is in df['Asset']:
df['Location'] = df['Asset'].apply(lambda x: 'Earn' if 'LD' in x else None)
One possible solution:
def get_loc(row):
    asset = row['Asset']
    if asset.find('LD') == 0:
        print('Earn')
        return 'Earn'
    print('???')
    return '???'

Assets['Location'] = Assets.apply(get_loc, axis=1)
Note, you should almost never iterate over a pandas dataframe or series.
