Alternative way to merge two dataframes in Python

Let's take a simple example. I have this first dataframe:
df = pd.DataFrame(dict(Name=['abc','def','ghi'],NoMatter=['X','X','X']))
df
Name NoMatter
0 abc X
1 def X
2 ghi X
For some reasons, I would like to use a for loop which adds a column Value to df (taken from another dataframe that changes at each iteration) and then does some processing:
# structure of the for loop I would like to use:
for i in range(something):
    # add the column Value to df from df_value
    # other treatment, not useful here
# appearance of df_value (which changes at each iteration of the for loop):
Name Value
0 abc 1
1 def 2
2 ghi 3
However, I would prefer not to use merging, because that would require deleting the Value column added in the previous iteration before adding the one from the current iteration. Is there a way to add the Value column to df with just an assignment, starting like this:
df['Value'] = XXX
Expected output :
Name NoMatter Value
0 abc X 1
1 def X 2
2 ghi X 3
[EDIT]
I don't want to use merging because at the fourth iteration of the for loop, df would have the columns:
Name NoMatter Value1 Value2 Value3 Value4
Whereas I just want to have :
Name NoMatter Value4
I could delete the previous column each time, but that does not seem very efficient. This is why I'm just looking for a way to assign values to the Value column rather than add a new column each time, like an equivalent of Excel's VLOOKUP function applied to df using the df_value data.

3 ways to join dataframes
df1.append(df2)                      # stacks the rows of df2 below the rows of df1 (columns should be identical)
pd.concat([df1, df2], axis=1)        # places the columns of df2 next to the columns of df1 (rows should be identical)
df1.join(df2, on=col1, how='inner')  # SQL-style join of the columns in df1 with the columns in df2
                                     # where the rows for col1 have identical values; how can be one of
                                     # 'left', 'right', 'inner' or 'outer'
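A small runnable illustration of the three calls on made-up toy frames (note that DataFrame.append is deprecated in recent pandas, so pd.concat is used for the row case; all names here are invented for the example):
import pandas as pd

df1 = pd.DataFrame({'Name': ['abc', 'def'], 'Value': [1, 2]})
df2 = pd.DataFrame({'Name': ['ghi', 'jkl'], 'Value': [3, 4]})
stacked = pd.concat([df1, df2], ignore_index=True)        # rows of df2 below the rows of df1

extra = pd.DataFrame({'NoMatter': ['X', 'X']})
side_by_side = pd.concat([df1, extra], axis=1)            # columns of extra next to df1 (same number of rows)

lookup = pd.DataFrame({'Name': ['abc', 'def'], 'Other': [10, 20]})
joined = df1.join(lookup.set_index('Name'), on='Name', how='inner')  # SQL-style join on Name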

Here's the solution for your problem.
import pandas as pd
df = pd.DataFrame(dict(Name=['abc','def','ghi'],NoMatter=['X','X','X']))
df1 = pd.DataFrame(dict(Name=['abc','def','ghi'],Value=[1,2,3]))
new_df=pd.merge(df, df1, on='Name')
new_df

The correct way is #UmerRana's answer, because iterating over a dataframe has terrible performance. If you really have to do it, it is possible to address an individual cell, but don't take this as a recommendation:
df = pd.DataFrame(dict(Name=['abc','def','ghi'],NoMatter=['X','X','X']))
df1 = pd.DataFrame(dict(Name=['abc','def','ghi'],Value=[1,2,3]))
df['Value'] = 0 # initialize a new column of integers (hence the 0)
ix = df.columns.get_loc('Value')
for i in range(len(df)):  # perf is terrible!
    df.iloc[i, ix] = df1['Value'][i]
After seeing your example code, and if you cannot avoid the loop, I think that this would be the least bad way:
import numpy as np

newcol = np.zeros(something, dtype='int')  # set the correct type
for i in range(something):
    # compute a value
    newcol[i] = value_for_i_iteration
df['Value'] = newcol  # assign the array to the new column

Maybe not the best way, but this solution works and replaces the Value column at each iteration (no need to delete the Value column before each new iteration):
# similar to Excel's VLOOKUP function
def vlookup(df, ref, col_ref, col_goal):
    return pd.DataFrame(df[df.apply(lambda x: ref == x[col_ref], axis=1)][col_goal]).iloc[0, 0]

df['Value'] = df['Name'].apply(lambda x: vlookup(df_value, x, 'Name', 'Value'))
# Output:
Name NoMatter Value
0 abc X 1
1 def X 2
2 ghi X 3
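A more direct pandas counterpart of Excel's VLOOKUP (a sketch, assuming df_value always contains exactly one row per Name) is to index df_value by Name and use Series.map; repeated in the loop, it simply overwrites the Value column each time:
import pandas as pd

df = pd.DataFrame(dict(Name=['abc','def','ghi'], NoMatter=['X','X','X']))
df_value = pd.DataFrame(dict(Name=['abc','def','ghi'], Value=[1,2,3]))

# look up each Name in df_value and overwrite (or create) the Value column
df['Value'] = df['Name'].map(df_value.set_index('Name')['Value'])
print(df)
#   Name NoMatter  Value
# 0  abc        X      1
# 1  def        X      2
# 2  ghi        X      3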

Related

How to join pandas dataframes to add values to corresponding rows by ID, and add new rows with data from the second dataframe when the ID is not in the first df

I need to join two pandas dataframes to add a new column with values for corresponding rows, and to add new rows with the data from the second dataframe when the ID is not in the first one.
example:
df1
ID DATA1 DATA2 SAMPLE_X
0 A a 1 X
1 B b 1 X
2 C c 1 X
df2
ID DATA1 DATA2 SAMPLE_Y
0 A a 1 Y
1 C Z 1 Y
2 D d 1 Y
joined df1+df2
ID DATA1 DATA2 SAMPLE_X SAMPLE_Y
0 A a 1 X Y
1 B b 1 X -
2 C c,Z 2 X Y
3 D d 1 - Y
So there is a new column with the new data set, empty when the ID was not present. And the joined df contains new rows for rows that appear in df2 and not in df1 (based on ID). Also, whenever the DATA1 values for the same ID do not match, they should be appended to each other, and the value of DATA2 should be increased by one each time this happens.
I need to add multiple more samples in this fashion. I would really appreciate any help.
I have tried playing around with the info found at https://sparkbyexamples.com/pandas/pandas-merge-two-dataframes-on-multiple-columns/#:~:text=You%20can%20pass%20two%20DataFrame,and%20df1%20assigns%20to%20merged_df%20.
but I can't find a way to do what I want. Thank you.
The task can be split into two parts: first you merge the two dataframes, and then you update the new values. Merging is quite straightforward:
halfway = pd.concat([df1, df2], ignore_index=True).groupby(['ID', 'DATA1']).last().reset_index()
It is better to use concat() rather than merge(), because the columns that are not shared, such as SAMPLE_X and SAMPLE_Y, would otherwise be left out.
last() fills the NaN values and selects one row only (another option could be, e.g., fillna(method='bfill'), but then you would also need to drop the duplicate rows, if any).
reset_index() brings ID back from the index to a regular column, as we are going to use it in the next step, which is as follows.
to_update = halfway.groupby('ID').filter(lambda x: len(x)>1)
agg_vals = to_update.agg({'DATA1': lambda x: ",".join(x.to_list()), 'DATA2': sum})
halfway.loc[to_update.index[0], 'DATA1'] = agg_vals['DATA1']
halfway.loc[to_update.index[0], 'DATA2'] = agg_vals['DATA2']
First, we select those rows which need to be updated (groupby + filter). Then we compute the new values and get rid of NaNs.
The problem was that agg() gives the new values of DATA1 and DATA2, but doesn't return the values of the other columns unchanged. So I assigned the new values one by one (although I don't think that's a great solution).
Finally, we propagate the values of X and Y as before:
halfway.groupby('ID').first().reset_index()
This solution assumes that the only value of SAMPLE_X is X and, likewise, that SAMPLE_Y only has the value Y, as shown in your example.
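For reference, a self-contained sketch that strings the steps above together on the example frames (values typed out from the question; NaN appears where the expected output shows '-'):
import pandas as pd

df1 = pd.DataFrame({'ID': ['A', 'B', 'C'], 'DATA1': ['a', 'b', 'c'],
                    'DATA2': [1, 1, 1], 'SAMPLE_X': ['X', 'X', 'X']})
df2 = pd.DataFrame({'ID': ['A', 'C', 'D'], 'DATA1': ['a', 'Z', 'd'],
                    'DATA2': [1, 1, 1], 'SAMPLE_Y': ['Y', 'Y', 'Y']})

# step 1: merge, keeping the non-shared SAMPLE_X / SAMPLE_Y columns
halfway = pd.concat([df1, df2], ignore_index=True).groupby(['ID', 'DATA1']).last().reset_index()

# step 2: update the IDs whose DATA1 values differ between the two frames
to_update = halfway.groupby('ID').filter(lambda x: len(x) > 1)
agg_vals = to_update.agg({'DATA1': lambda x: ",".join(x.to_list()), 'DATA2': sum})
halfway.loc[to_update.index[0], 'DATA1'] = agg_vals['DATA1']
halfway.loc[to_update.index[0], 'DATA2'] = agg_vals['DATA2']

# collapse back to one row per ID, propagating SAMPLE_X and SAMPLE_Y
result = halfway.groupby('ID').first().reset_index()
print(result)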

How to concatenate values from many columns into one column when one doesn't know how many columns there will be

My application saves an indeterminate number of values in different columns. As a result, I have a data frame with a certain number of columns at the beginning, but then from a particular column (that I know) onwards I will have an uncertain number of columns holding the same kind of data.
Example:
known1 known2 know3 unknow1 unknow2 unknow3 ...
1 3 3 data data2 data3
The result I would like to get should be something like this
known1 known2 know3 all_unknow
1 3 3 data,data2,data3
How can I do this when I don't know the number of unknown columns? What I do know is that they start (in this example) from the 4th column.
IIUC, use filter to select the columns by keyword:
cols = list(df.filter(like='unknow'))
# ['unknow1', 'unknow2', 'unknow3']
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
or take all columns from the 4th one:
cols = df.columns[3:]
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
output:
known1 known2 know3 all_unknow
0 1 3 3 data,data2,data3
df['all_unknown'] = df.iloc[:, 3:].apply(','.join, axis=1)
if you also want to drop all columns after the 4th:
cols = df.columns[3:-1]
df = df.drop(cols, axis=1)
the -1 is to avoid dropping the new column
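For reference, a tiny end-to-end run of the filter-based variant, on a one-row frame built to match the question's layout (column names assumed from the example):
import pandas as pd

df = pd.DataFrame({'known1': [1], 'known2': [3], 'know3': [3],
                   'unknow1': ['data'], 'unknow2': ['data2'], 'unknow3': ['data3']})

cols = list(df.filter(like='unknow'))                # ['unknow1', 'unknow2', 'unknow3']
df['all_unknow'] = df[cols].apply(','.join, axis=1)  # join the unknown columns row-wise
df = df.drop(columns=cols)
print(df)
# output (roughly):
#    known1  known2  know3        all_unknow
# 0       1       3      3  data,data2,data3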

Pandas groupby.sum for all columns

I have a dataset with a set of columns I want to sum for each row. The columns in question all follow a specific naming pattern that I have been able to group in the past via the .sum() function:
pd.DataFrame.sum(data.filter(regex=r'_name$'),axis=1)
Now, I need to do the same thing, but grouped by the values of a column:
data.groupby('group').sum(data.filter(regex=r'_name$'),axis=1)
However, this does not appear to work, as the .sum() function no longer accepts any filtered columns. Is there another way to approach this while keeping my data.filter() code?
Example toy dataset. The real dataset contains over 500 columns, which are not cleanly ordered:
toy_data = ({'id':[1,2,3,4,5,6],
'group': ["a","a","b","b","c","c"],
'a_name': [1,6,7,3,7,3],
'b_name': [4,9,2,4,0,2],
'c_not': [5,7,8,4,2,5],
'q_name': [4,6,8,2,1,4]
})
df = pd.DataFrame(toy_data, columns=['id','group','a_name','b_name','c_not','q_name'])
Edit: Missed this in the original post. My objective is to get a variable 'sum' holding the summation of all the selected columns, as shown below:
You can filter first and then pass df['group'] instead of 'group' to groupby, and finally add the sum column by DataFrame.assign:
df1 = (df.filter(regex=r'_name$')
         .groupby(df['group']).sum()
         .assign(sum=lambda x: x.sum(axis=1)))
An alternative is to filter the column names and pass them after groupby:
cols = df.filter(regex=r'_name$').columns
df1 = df.groupby('group')[cols].sum()
Or:
cols = df.columns[df.columns.str.contains(r'_name$')]
df1 = df.groupby('group')[cols].sum().assign(sum = lambda x: x.sum(axis=1))
print (df1)
a_name b_name q_name sum
group
a 7 13 10 30
b 10 6 10 26
c 10 2 5 17

Finding mean of consecutive column data

I have the following data:
(the data given here is just representational)
I want to do the following with this data:
I want to keep only the columns from 201 onwards,
i.e. I want to remove the 200-1 to 200-4 column data.
One way to do this is to retrieve only the required columns while reading the data from Excel, but I want to know how we can filter the column names on the basis of a particular pattern, since the 200-1 to 200-4 column names follow the pattern 200-*.
I want to make a column after 202-4 which stores the values in the following way:
201q1 = mean of (201-1 and 201-2)
201q2 = mean of (201-3 and 201-4)
Similarly, if 202-1 to 202-4 data had been there, similar columns should have been formed.
Please help.
Thanks in advance for your support.
This is a rough example but it will get you close. The example assumes that there are always four columns per group:
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randn(2,12), columns=['200-1','200-2','200-3','200-4', '201-1', '201-2', '201-3','201-4', '202-1', '202-2', '202-3','202-4'])
# remove 200-* columns
df2 = df[df.columns[~df.columns.str.contains('200-')]]
# use np.arange to create groups of two adjacent columns
new = df2.groupby(np.arange(len(df2.columns))//2, axis=1).mean()
# rename columns
new.columns = [f'{v}{k}' for v,k in zip([x[:3] for x in df2.columns[::2]], ['q1','q2']*int(len(df2.columns[::2])/2))]
# join
df2.join(new)
201-1 201-2 201-3 201-4 202-1 202-2 202-3 \
0 0.865408 -2.301539 1.744812 -0.761207 0.319039 -0.249370 1.462108
1 -0.172428 -0.877858 0.042214 0.582815 -1.100619 1.144724 0.901591
202-4 201q1 201q2 202q1 202q2
0 -2.060141 -0.718066 0.491802 0.034834 -0.299016
1 0.502494 -0.525143 0.312514 0.022052 0.702043
For step 1, you can get away with a list comprehension and the pandas drop function:
dropcols = [x for x in df.columns if '200-' in x]
df.drop(dropcols, axis=1, inplace=True)
Steps 3 and 4 are similar; you could calculate the rolling mean of the columns:
df2 = df.rolling(2, axis = 1).mean() # creates rolling mean
df2.columns = [x.replace('-', 'q') for x in df2.columns] # renames the columns
dfans = pd.concat([df, df2], axis = 1) # concatenate the columns together
Now, you just need to remove the columns that you don't want and rename them.
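To make that clean-up concrete, here is a hedged sketch on invented data (only the 201 and 202 groups, four columns each, as in the assumption above); it keeps just the rolling-mean windows that fall inside a group and renames them to the q1/q2 scheme:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(2, 8),
                  columns=['201-1', '201-2', '201-3', '201-4',
                           '202-1', '202-2', '202-3', '202-4'])

# rolling mean over adjacent columns (transposed form, since rolling(axis=1) is deprecated in recent pandas)
rolled = df.T.rolling(2).mean().T

# keep only the 2nd and 4th column of each group: those windows never cross a group boundary
quarters = rolled.iloc[:, 1::2].copy()
quarters.columns = [f"{c[:3]}q{int(c[-1]) // 2}" for c in quarters.columns]  # e.g. '201-2' -> '201q1'

result = pd.concat([df, quarters], axis=1)
print(result)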

How to apply a function to multiple columns to create multiple columns in Pandas?

I am trying to apply a function on multiple columns and in turn create multiple columns to count the length of each entry.
Basically I have 5 columns with indexes 5, 7, 9, 13 and 15, and each entry in those columns is a string of the form 'WrappedArray(|2008-11-12, |2008-11-12)'. In my function I try to strip the WrappedArray part, split the values, and count (length - 1) using the following:
def updates(row, num_col):
    strp = row[num_col].strip('WrappedAway')
    lis = list(strp.split(','))
    return len(lis) - 1
where num_col is the index of the column and can take the values 5, 7, 9, 13, 15.
I have done this but only for 1 column:
fn = lambda row: updates(row,5)
col = df.apply(fn, axis=1)
df = df.assign(**{'count1':col.values})
I basically want to apply this function to ALL the columns with the indexes mentioned (not just 5 as above), and then create a separate count column associated with each of columns 5, 7, 9, 13 and 15, all in compact code instead of doing it separately for each value.
I hope I made sense.
Regarding finding the number of elements in the list, it looks like you could simply use str.count() to count the ',' characters in the strings. And in order to apply a defined function to a set of columns, you could do something like:
cols = [5, 7, 9, 13, 15]
for col in cols:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(lambda x: x.count(','))}
    df = df.assign(**col_counts)
Alternatively, you can also use strip('WrappedAway').split(',') as you were using:
def count_elements(x):
    return len(x.strip('WrappedAway').split(',')) - 1

for col in cols:
    col_counts = {'{}_count'.format(col):
                  df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
So for example with the following dataframe:
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
'C': ['WrappedArray(|2008-11-12|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Redefining the set of columns on which we want to count the number of elements:
for col in [0, 1, 2]:
    col_counts = {'{}_count'.format(col):
                  df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
Would yield:
A \
0 WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1 WrappedArray(|2008-11-12, |2008-11-12)
B \
0 WrappedArray(|2008-11-12,|2008-11-12)
1 WrappedArray(|2008-11-12, |2008-11-12)
C 0_count 1_count 2_count
0 WrappedArray(|2008-11-12|2008-11-12) 2 1 0
1 WrappedArray(|2008-11-12|2008-11-12) 1 1 0
You are confusing row-wise and column-wise operations by trying to do both in one function. Choose one or the other. Column-wise operations are usually more efficient and you can utilize Pandas str methods.
Setup
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Logic
# perform operations on strings in a series
def calc_length(series):
    return series.str.strip('WrappedAway').str.split(',').str.len() - 1
# apply to each column and join to original dataframe
df = df.join(df.apply(calc_length).add_suffix('_Length'))
Result
print(df)
A \
0 WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1 WrappedArray(|2008-11-12, |2008-11-12)
B A_Length B_Length
0 WrappedArray(|2008-11-12,|2008-11-12) 2 1
1 WrappedArray(|2008-11-12|2008-11-12) 1 0
I think we can use pandas str.count()
df = pd.DataFrame({
    "col1": ['WrappedArray(|2008-11-12, |2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)'],
    "col2": ['WrappedArray(|2008-11-12, |2008-11-12,|2008-11-12,|2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)']})
df["col1"].str.count(',')
