I have two dataframes which can be created using the code shown below
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'home': [1, np.nan, 2, np.nan, 3, 4],
                    'PERSONAL INFORMATION': ['Study Number', 'Study ID',
                                             'Age when interview done',
                                             'Derived using date of birth',
                                             'Gender', 'ethnicity'],
                    'VARIABLE': ['studyid', 'dummy', 'age_interview',
                                 'dummy', 'gender', 'Chinese'],
                    'Remarks': [2000000001, 20005000001, 4265453, 0, 4135376, 2345678]})
df2 = pd.DataFrame({'level_0': ['studyid', 'age_interview', 'gender', 'dobyear',
                                'ethderived', 'smoke', 'alcohol'],
                    '0': ['tmp001', 56, 'Female', 1950, 'Chinese', 'No', 'Yes']})
Aim
1) My objective is to take the values from the 'level_0' column of df2 and look for them in the 'VARIABLE' column of df1, fetching the corresponding 'Remarks' value provided the below condition is satisfied
a) The 'home' column of df1 should contain a digit somewhere in its value (e.g. 1, 2, 3, 4, B1.5, C1.9, D1.2 are all valid values for 'home')
2) My objective is the same as above, but here I would like to take the values from the '0' column of df2 and look for them in the 'PERSONAL INFORMATION' column of df1, fetching the corresponding 'Remarks' value provided the below condition is satisfied
a) The 'VARIABLE' column of df1 should contain 'dummy' as its value
For the above two scenarios, I have written the code below, but it feels quite lengthy and inefficient. There must be an easier way to do this.
Scenario - 1
import re

qconc_id = []
missed_items = []
col_list = []
for i in df2.index:
    ques = df2['level_0'][i]
    col_list.append(ques)
    try:
        qindex = (int(df1[df1['VARIABLE'] == ques].index[0]),
                  df1.columns.get_loc('VARIABLE'))
        pos_qindex = qindex[0]
        ques_value = df1['home'][pos_qindex]
        result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", str(ques_value))
        while result is None:
            pos_qindex = pos_qindex - 1
            ques_value = df1['home'][pos_qindex]
            result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", str(ques_value))
        qconc_id.append(df1['Remarks'][pos_qindex])
    except Exception:
        missed_items.append(ques)
Scenario - 2
aconc_id = []
missed_items = []
ans_list = []
for i in df2.index:
    ans = df2['0'][i]
    print("ans is ", ans)
    ans_list.append(ans)
    idx = 0
    try:
        aindex = df1[df1['PERSONAL INFORMATION']
                     .str.contains(ans, case=False, regex=False)].index
        print(aindex)
        pos_aindex = aindex[idx]
        while (df1['VARIABLE'][pos_aindex] != 'dummy') and \
              not df1['PERSONAL INFORMATION'].str.contains('Yes|No', regex=True)[pos_aindex]:
            idx = idx + 1
            pos_aindex = aindex[idx]
        print("The value is ", df1['Remarks'][pos_aindex])
        aconc_id.append(df1['Remarks'][pos_aindex])
    except Exception:
        print("Goes to Exception")
        aconc_id.append('0')
        missed_items.append(ans)
Please note these two things
a) I have used a while loop because the values might repeat. For example, we might find a matching value of 'No', but df1['VARIABLE'] for that row may not be 'dummy'. So I increment the index in both scenarios to check whether the next occurrence of 'No' has 'dummy' as its VARIABLE value. The same applies to scenario 1 as well.
b) How can I handle cases where "No" also matches inside words like "Notes" or "Nocase"? As you can see from my code, I am using regex, but I am still encountering errors here.
As you can see, I am writing essentially the same code twice with minor modifications. How can I make it elegant and efficient? I am sure there must be a very simple way to do this.
Any suggestions/ideas on alternative approaches, with respect to changing the format of the source data or using a merge/join approach, are also welcome.
I expect the output, the 'Remarks' values, to be stored in a list. Please find the screenshot of what I have done.
You should avoid explicit loops in pandas as much as possible, because they will not be vectorized (optimized, in pandas and numpy wording). Here you could merge your dataframes:
Scenario 1:
# extract values where df2.level_0 == df1.VARIABLE
tmp = pd.merge(pd.DataFrame(df2.level_0),
               df1.loc[:, ['home', 'VARIABLE', 'Remarks']],
               left_on=['level_0'], right_on=['VARIABLE'])
# drop lines where home does not contain a digit
tmp.drop(tmp.loc[~tmp.home.astype(str).str.contains(r'\d')].index,
         inplace=True)
# extract the Remarks column into a list
lst = tmp.Remarks.tolist()
With your example data I get [2000000001, 4265453, 4135376]
Scenario 2:
tmp = pd.merge(pd.DataFrame(df2['0']),
               df1.loc[:, ['PERSONAL INFORMATION', 'VARIABLE', 'Remarks']],
               left_on=['0'], right_on=['PERSONAL INFORMATION'])
tmp.drop(tmp.loc[tmp['VARIABLE'] != 'dummy'].index, inplace=True)
lst.extend(tmp.Remarks.tolist())
With your example data I get no additional values, because the merge in the first step produces an empty dataframe (none of the '0' values appear in 'PERSONAL INFORMATION').
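If you want to avoid writing the merge twice, both scenarios could be wrapped in a small helper. This is only a sketch against the df1/df2 built above; the function name and its parameters are my own choice:
import pandas as pd

def fetch_remarks(keys, key_col, keep_mask):
    # keys: Series of lookup values from df2; key_col: the df1 column to match on;
    # keep_mask: boolean Series indexed like df1, selecting the rows that qualify.
    merged = pd.merge(keys.to_frame(name=key_col),
                      df1.loc[keep_mask, [key_col, 'Remarks']],
                      on=key_col)
    return merged['Remarks'].tolist()

# Scenario 1: 'home' must contain a digit somewhere in its value.
lst = fetch_remarks(df2['level_0'], 'VARIABLE',
                    df1['home'].astype(str).str.contains(r'\d'))
# Scenario 2: 'VARIABLE' must be 'dummy'.
lst += fetch_remarks(df2['0'], 'PERSONAL INFORMATION',
                     df1['VARIABLE'] == 'dummy')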
Related
I am trying to get the first non-null value of the list inside each row of the Emails column and write it to Emails_final1. Then I want the next value of the list inside each row of Emails, if there is one, to go to Emails_final2; otherwise Emails_final2 should take the Emails 2 value if it is not blank and doesn't equal 'Emails', and be left blank otherwise. Lastly, if a value from Emails 2 was already written to Emails_final1, then Emails_final2 should be None. I have tried many different ways to achieve this to no avail; here is what I have so far, including pseudo-code:
My Current Code:
df = pd.DataFrame({'Emails': [['jjj@gmail.com', 'jp@gmail.com', 'jc@gmail.com'],
                              [None, 'www@gmail.com'],
                              [None, None, None]],
                   'Emails 2': ['sss@gmail.com', 'zzz@gmail.com', 'ccc@gmail.com'],
                   'num_specimen_seen': [10, 2, 3]},
                  index=['falcon', 'dog', 'cat'])
df['Emails_final1'] = df['Emails'].explode().groupby(level=0).first()
# pseudo code: .next() doesn't exist, but I want it to take the next value of
# 'Emails' before falling back to the 'Emails 2' value
df['Emails_final2'] = df['Emails'].explode().groupby(level=0).next()
Desired Output:
        Emails_final1  Emails_final2
falcon  jjj@gmail.com  jp@gmail.com
dog     www@gmail.com  zzz@gmail.com
cat     ccc@gmail.com  None
Any direction on how to approach a problem like this would be appreciated.
It looks a bit messy, but it works. Basically, we keep a boolean mask from the first step of filling "Emails_final1" and use it in the second step when filling "Emails_final2".
To fill the second column, the idea is to use groupby + nth to get the second elements and keep them if they don't match the previously selected email (as in the first row); if they do match, or are missing, fall back to the "Emails 2" column, unless that value was already selected in the first step (as in the 3rd row):
exp_g = df['Emails'].explode().groupby(level=0)
df['Emails_final1'] = exp_g.first()    # first non-null email from each list
msk = df['Emails_final1'].notna()      # rows where the Emails list provided a value
df['Emails_final1'] = df['Emails_final1'].fillna(df['Emails 2'])
df['Emails_final2'] = exp_g.nth(1)     # second element of each Emails list
df['Emails_final2'] = df['Emails_final2'].mask(
    lambda x: ((x == df['Emails_final1']) | x.isna()) & msk, df['Emails 2'])
The relevant columns are:
        Emails_final1  Emails_final2
falcon  jjj@gmail.com  jp@gmail.com
dog     www@gmail.com  zzz@gmail.com
cat     ccc@gmail.com  None
Please forgive me if this has been answered; I couldn't find anything that quite fit the need I have here.
I have a dataframe with a mix of float and string fields. It looks something like -
df = pd.DataFrame({'df_to_look_at': ['A', 'B'], 'data_to_use': [100, 200]})
I have a function that uses the first column to pick a dataframe and the second column to find a value to return. I want to make a new column with the returned value.
My function is something like -
def find_value(col_a, col_b):
    # set lookup table based on argument
    if col_a == 'A':
        table = table_a
    elif col_a == 'B':
        table = table_b
    # find the value based on col_b, set it to the adj variable, and return it
    adj = table.loc[table['Term'] == col_b, 'Adj'].values
    return adj
The line I want to run to make the new column looks something like this:
df['new_val'] = df.apply(find_value, col_a = df['df_to_look_at'], col_b = df['data_to_use'])
In my real code I have about 5 arguments in my function: one to pick the dataframe to look at, and the others are criteria for looking up the value needed for the new column. So far I have learned how to pass kwargs into an apply call, but only as fixed values like 3 or 'A', not as "the value in this row in column X".
This is my first question; I hope I got the point across.
Thanks for your help!
I would take a slightly different approach. apply sends columns or rows to the function you define, so the following would be another way to solve your problem (and probably a faster one).
def find_value(row):
    # set lookup table based on the value in the first column
    if row['df_to_look_at'] == 'A':
        table = table_a
    elif row['df_to_look_at'] == 'B':
        table = table_b
    # look up the adjustment value using the second column
    adj = table.loc[table['Term'] == row['data_to_use'], 'Adj'].values
    return adj

# apply across rows
df['new_val'] = df.apply(find_value, axis=1)
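As a usage sketch, here is how the row-wise version could be called end to end. The lookup tables below are purely hypothetical (their names, 'Term' values and 'Adj' values are made up) just to make the example self-contained; find_value is the function defined above:
import pandas as pd

# Hypothetical lookup tables; the values are made up for illustration.
table_a = pd.DataFrame({'Term': [100, 200], 'Adj': [0.10, 0.20]})
table_b = pd.DataFrame({'Term': [100, 200], 'Adj': [0.30, 0.40]})

df = pd.DataFrame({'df_to_look_at': ['A', 'B'], 'data_to_use': [100, 200]})
df['new_val'] = df.apply(find_value, axis=1)   # uses find_value defined above
# each row now holds the 'Adj' value(s) looked up from table_a or table_b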
I have a dataframe. It has data about suppliers. If the name of the supplier and group are same, number of units should ideally be the same. However, sometimes that is not the case. I am writing code that imports data from SQL into Python then compares for these numbers.
This is for Python 3. Importing the data into Python was easy. I am a Python rookie. To make things easier for myself, I created individual dataframes for each supplier to compare numbers instead of looking at the whole dataframe at once.
supp = data['Supplier']
supplier = []
for s in supp:
    if s not in supplier:
        supplier.append(s)
su = "Authentic Brands Group LLC"
deal = defaultdict(list)
blist = []
glist = []
columns = ['Supplier','ID','Units','Grp']
df3 = pd.DataFrame(columns=columns)
def add_row(df3, row):
df3.loc[-1] = row
df3.index = df3.index + 1
return df3.sort_index()
for row in data.itertuples():
for x in supplier:
s1 = row.Supplier
if s1 == su:
if row.Supplier_Group not in glist:
glist.append(row.Supplier_Group)
for g in range(len(glist)):
if glist[g]==row.Supplier_Group:
supp = x
blist=[]
blist.append(row.ID)
blist.append(row.Units)
blist.append(glist[g])
add_row(df3,[b1,row.ID,row.Units,glist[g]])
break
break
break
for i in range(1, len(df3)):
    if df3.Supplier.loc[i] == df3.Supplier.loc[i-1] and df3.Grp.loc[i] == df3.Grp.loc[i-1]:
        print(df3.Supplier, df3.Grp)
This gives me a small subset that looks like this:
Now I want to look at the supplier name and Grp: if they are the same as in other rows of the dataframe, Units should be the same. In this case, row 2 is incorrect; Units should be 100. I want to add another column to this dataframe that flags whether the number of Units is correct. This is the tricky part for me: I can iterate over the rows, but I'm unsure how to compare them and add the column.
I'm stuck at this point.
Any help is highly appreciated. Thank you.
If you have all of your data in a single dataframe, df, you can do the following:
grp_by_cols = ['Supplier', 'ID', 'Grp']
all_cols = grp_by_cols + ['Unit']
res_df = (df.assign(first_unit=lambda d: d.loc[:, all_cols]
                                          .groupby(grp_by_cols)['Unit']
                                          .transform('first'))
            .assign(incorrect=lambda d: d['Unit'] != d['first_unit'])
            .assign(incorrect=lambda d: d.loc[:, grp_by_cols + ['incorrect']]
                                         .groupby(grp_by_cols)['incorrect']
                                         .transform('any')))
The first call to assign adds a single new column (called 'first_unit') that is the first value of "Unit" for each group of Supplier/ID/Grp (see grp_by_cols).
The second call to assign adds a column called 'incorrect' that is True when 'Unit' doesn't equal 'first_unit'. The third and final assign call overwrites that column so it becomes True for every row of a group in which any row is True. You can remove that last call if that's not what you want.
Then, if you want to look at the results for a single supplier, you can do something like:
res_df.query('Supplier == "Authentic Brands Group LLC"')
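For a quick check with the question's column names, here is a minimal sketch on made-up data. The numbers are purely illustrative, and the grouping is reduced to Supplier and Grp, since the question states units should match within the same supplier and group:
import pandas as pd

# Illustrative data shaped like df3 above; the values are made up.
df3 = pd.DataFrame({'Supplier': ['Authentic Brands Group LLC'] * 3,
                    'ID': [11, 12, 13],
                    'Units': [100, 90, 100],
                    'Grp': ['G1', 'G1', 'G1']})

grp = ['Supplier', 'Grp']
# True where Units agrees with the first value seen for that supplier/group
df3['units_ok'] = df3['Units'] == df3.groupby(grp)['Units'].transform('first')
# units_ok is True, False, True here: the second row deviates from the group's value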
I am trying to read a certain DF from a file and add two more columns to it containing, say, the year and the week derived from other columns in the DF. When I apply the code to generate a single new column, everything works great. But when several columns are to be created, the change does not apply. Specifically, the new columns are created, but their values are not what they are supposed to be.
I know this happens because I first set all new values to a certain initial string and then change some of them, but I don't understand why it works for a single column and is "nulled" for multiple columns, leaving only the latest column changed... Help please?
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    bad_ind = list(np.where(tbl[date_cols[i]] == 'No Fill')[0])
    tbl_ind = list(range(len(tbl)))
    for j in range(len(bad_ind)):
        tbl_ind.remove(bad_ind[j])
    tmp = pd.to_datetime(tbl[date_cols[i]][tbl_ind])
    tbl[tmp_col_name][tbl_ind] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
If I try the following lines, disregarding possible "empty data values", everything works...
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    tmp = pd.to_datetime(tbl[date_cols[i]])
    tbl[tmp_col_name] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
It has to do with not changing all of the data values, but I don't understand why the change does not apply. After all, before the second iteration begins the DF seems to be updated, and then tbl[tmp_col_name] = 'No Week' in the second iteration "deletes" the changes made in the first iteration, but only partially: it leaves the new column created but filled with 'No Week' values...
Performing chained indexing may or may not work. In the case of creating multiple new columns and then filling in only some of their values, it doesn't work; more precisely, it works only on the last updated column. Using the loc, iloc or ix accessors to set the data works. In the case of the above code, to make it work one needs to cast tbl_ind into an np.array, using tbl[col_name[j]].iloc[np.array(tbl_ind)] = tmp.apply(lambda x: x.year)
Many thanks and credit for the answer to @EdChum.
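A minimal sketch of how the same loop could avoid chained indexing altogether with a single .loc assignment, assuming the same file and the 'No Fill'/'No Week' placeholders used above:
import pandas as pd

tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']

for col in date_cols:
    week_col = col + '_WEEK'
    tbl[week_col] = 'No Week'
    # boolean mask of the rows that actually contain a date
    filled = tbl[col] != 'No Fill'
    dates = pd.to_datetime(tbl.loc[filled, col])
    # a single .loc assignment avoids chained indexing entirely
    tbl.loc[filled, week_col] = dates.apply(
        lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))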
I'm pulling JSON data from the Binance REST API; after formatting, I'm left with the following...
I have a dataframe called Assets with 3 columns [Asset,Amount,Location],
['Asset'] holds ticker names for crypto assets e.g.(ETH,LTC,BNB).
However when all or part of that asset has been moved to 'Binance Earn' the strings are returned like this e.g.(LDETH,LDLTC,LDBNB).
['Amount'] can be ignored for now.
['Location'] is initially empty.
I'm trying to set the value of ['Location'] to 'Earn' if the string in ['Asset'] includes 'LD'.
This is how far I got, but I can't remember how to apply the change to only the current item; it's been ages since I've used Pandas or for loops.
At the moment I'm only able to apply it to the entire column rather than to the row in each iteration.
for Row in Assets['Asset']:
    if Row.find('LD') == 0:
        print('Earn')
        Assets['Location'] = 'Earn'  # <---- How to apply this to the current row only?
    else:
        print('???')
        Assets['Location'] = '???'   # <---- How to apply this to the current row only?
The print statements work correctly, but currently the whole column gets populated with the same value (whichever was last) as you might expect.
So (LDETH,HOT,LDBTC) returns ('Earn','Earn','Earn') rather than the desired ('Earn','???','Earn')
Any help would be appreciated...
np.where() fits here. If the Asset starts with LD, then return Earn, else return ???:
Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
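For example, with the three tickers from the question (the Amount values below are made up):
import numpy as np
import pandas as pd

Assets = pd.DataFrame({'Asset': ['LDETH', 'HOT', 'LDBTC'],
                       'Amount': [1.0, 2.0, 3.0],   # made-up amounts
                       'Location': ''})
Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
print(Assets['Location'].tolist())  # ['Earn', '???', 'Earn']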
You could also apply a lambda that checks whether 'LD' is in each value of Assets['Asset'] (note this matches 'LD' anywhere in the string, not just at the start):
Assets['Location'] = Assets['Asset'].apply(lambda x: 'Earn' if 'LD' in x else None)
One possible solution:
def get_loc(row):
    asset = row['Asset']
    if asset.find('LD') == 0:
        print('Earn')
        return 'Earn'
    print('???')
    return '???'

Assets['Location'] = Assets.apply(get_loc, axis=1)
Note, you should almost never iterate over a pandas dataframe or series.