I have two pandas dataframes that I'm trying to merge on their ID number. However, in df1 the ID is used multiple times, while in df2 it is used only once. I therefore want the final dataframe to include all the results separated by commas, with an index value in front of each one. I made a simple example to help explain what I'm asking.
df1:
df2:
Merged Goal:
I've tried merging them the way I usually do:
MergedGoal= pd.merge(df1, df2, on='ID', how='left')
But I get a KeyError for ID, probably because there are duplicates. How can I combine them? If anyone could also give me some insight into how to add an index for each value added, that would be amazing. But if it's not possible to add the index numbers, that's totally fine; I just need all of the values in the same entry separated by commas.
I created df1 the following way:
df1 = pd.DataFrame(data=[
        [ 1, 'Manchester', 'NH', 3108 ],
        [ 1, 'Bedford', 'NH', 3188 ],
        [ 6, 'Boston', 'MA', 23718 ],
        [ 1, 'Austin', 'TX', 20034 ]],
    columns=['ID', 'City', 'State', 'Zip'])
df1.Zip = df1.Zip.astype(str).str.zfill(5)
Note that I converted the source Zips (which, as I see, are plain integers) to strings, because you want them to keep their leading zeroes.
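For example, a quick check of the converted column:
df1.Zip.tolist()
# -> ['03108', '03188', '23718', '20034']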
To create df2 I used:
df2 = pd.DataFrame(data=[[ 1, 'Best Cities', 'xxx' ], [ 6, 'Worst Cities', 'yyy' ]],
                   columns=['ID', 'Title', 'Description'])
As a preparation step, let's define a function that will be used to aggregate columns from df1:
def fn(src):
    lst = [f'{idx}) {val}' for idx, val in enumerate(src, start=1)]
    return ', '.join(lst)
The first step of this function is a list comprehension, where enumerate iterates over src (the content of the current column in the current group) and yields:
idx - the index of the current element, starting from 1,
val - the current element itself.
An f-string formats each resulting item.
The result is a list of, e.g., city names with numbers in front of them.
The return statement joins this list into a single string, inserting ", " between the items.
So, e.g., for the ID == 1 group and the City column, the source values are ['Manchester', 'Bedford', 'Austin'] and the result is: 1) Manchester, 2) Bedford, 3) Austin.
And the actual processing can be performed with a single instruction:
pd.merge(df2, df1.groupby('ID').agg(fn), how='left',
left_on='ID', right_index=True).fillna('')
As you can see:
I reversed the order of the merged DataFrames. This way the result contains the columns from df2 first, then those from df1.
The City, State and Zip columns from df1 are first grouped by ID and aggregated using the fn function. Then they are merged with df2.
I added fillna('') to replace NaN values with an empty string; these would occur for IDs present only in df2.
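Putting it together with the sample data above, the aggregated part of the result contains (shown as comments for reference):
result = pd.merge(df2, df1.groupby('ID').agg(fn), how='left',
                  left_on='ID', right_index=True).fillna('')
# ID 1 -> City:  '1) Manchester, 2) Bedford, 3) Austin'
#         State: '1) NH, 2) NH, 3) TX'
#         Zip:   '1) 03108, 2) 03188, 3) 20034'
# ID 6 -> City:  '1) Boston', State: '1) MA', Zip: '1) 23718'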
This is a bit tricky to put into words, but I'll give it a try. I have a dataframe with duplicated indices as provided below.
a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291, 0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568, 1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a' : a, 'b' : b})
df.index = [1,1,1,1,1,2,2,3,3,3,4]
I want the rows for the first duplicate of each index value to be appended to df1, the rows for the second duplicate to be appended to df2, and so on: the first time indices 1, 2, 3, 4... n have a duplicate, those rows go into dataframe 1; the second time indices 1, 2, 3, 4... n have a duplicate, those rows go into dataframe 2, and so on. Ideally, it would look something like this if concatenated for the first three duplicates under the 'index' column:
Any idea how to go about this? I've tried running df[df.duplicated(subset = ['index'])] in a for loop to whittle the df down to the very first duplicates, but it doesn't seem to work the way I think it will.
Slicing out the duplicate indices via cumcount and using concat to stitch together the resulting sub-dataframes will do the job.
cols = df.columns
df['id'] = df.index
# cumcount numbers the repetitions of each index value (0, 1, 2, ...)
occurrence = df.groupby('id').cumcount()
# + 1 so that the last repetition group is not dropped by range()
pd.concat([df[occurrence == i][cols] for i in range(occurrence.max() + 1)], axis=1)
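Equivalently, you can skip the helper column and group on the index itself; a minimal sketch, assuming the same df as above:
occ = df.groupby(level=0).cumcount().to_numpy()
# one sub-dataframe per repetition number: dfs[0] holds the first
# occurrence of every index value, dfs[1] the second, and so on
dfs = [g for _, g in df.groupby(occ)]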
I want to find a way in Python to merge the files on 'seq' but return all the rows with the same id; in this example, only the lines with id 2 would be removed.
File one:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSDLILYYEQYF,2
CASSDLILYYTQYF,2
CASSGSYEQYF,3
CASSGSYEQYY,3
File two:
seq
CSVGPPNNEQFF
CASRGEAAGFYEQYF
CASSGSYEQYY
Output:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSGSYEQYF,3
CASSGSYEQYY,3
I have tried:
df3 = df1.merge(df2.groupby('seq',as_index=False)[['seq']].agg(','.join),how='right')
Output:
seq,id
CASRGEAAGFYEQYF,1
CASSGSYEQYY,3
CSVGPPNNEQFF,0
Does anyone have any advice on how to solve this?
Do you want to merge the two dataframes, or just take the subset of the first dataframe whose id values are included in the second dataframe (via seq)? Either way, this gives the required result.
df1 = pd.DataFrame({
    'seq': [
        'CSVGPPNNEQFF',
        'CTVGPPNNEQFF',
        'CTVGPPNNERFF',
        'CASRGEAAGFYEQYF',
        'RASRGEAAGFYEQYF',
        'CASRGGAAGFYEQYF',
        'CASSDLILYYEQYF',
        'CASSDLILYYTQYF',
        'CASSGSYEQYF',
        'CASSGSYEQYY'
    ],
    'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
})
df2 = pd.DataFrame({
    'seq': [
        'CSVGPPNNEQFF',
        'CASRGEAAGFYEQYF',
        'CASSGSYEQYY'
    ]
})
df3 = df1.loc[df1['id'].isin(df1['id'][df1['seq'].isin(df2['seq'])])]
Explanation: df1['id'][df1['seq'].isin(df2['seq'])] takes those values of id from df1 that contain at least one seq that is included in df2. Then all rows with those values of id are taken from df1.
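To see the intermediate steps on the sample data:
df1['seq'].isin(df2['seq'])             # True for rows 0, 3 and 9
df1['id'][df1['seq'].isin(df2['seq'])]  # the matching ids: 0, 1 and 3
df3  # all rows of df1 whose id is 0, 1 or 3, i.e. everything except id 2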
You can use the pandas isin() method; the code would look as follows:
df1.loc[df1['seq'].isin(df2['seq'])]
This assumes both objects are pandas dataframes and that 'seq' is a column in each.
I have two data frames, let's say A and B. A has the columns ['Name', 'Age', 'Mobile_number'] and B has the columns ['Cell_number', 'Blood_Group', 'Location'], with 'Mobile_number' and 'Cell_number' having common values. I want to join only the 'Location' column onto A, based on the common values in 'Mobile_number' and 'Cell_number', so the final DataFrame A would have the columns ['Name', 'Age', 'Mobile_number', 'Location'].
a = {'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33,43,22,45], 'Mobile_number':[332,554,234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832,554,123,333], 'Blood_group': ['O', 'A', 'AB', 'AB'], 'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)
Please suggest an approach. A colleague suggested using pd.Join, but I don't understand how.
Thank you for your time.
The way I see it, you want to merge a dataframe with part of another dataframe, based on some common column.
First, you have to make sure the common column shares the same name:
B['Mobile_number'] = B['Cell_number']
Then create a dataframe that contains only the relevant columns (the indexing column and the relevant data column):
B1 = B[['Mobile_number', 'Location']]
And at last you can merge them:
merged_df = pd.merge(A, B1, on='Mobile_number')
Note that this usage of pd.merge keeps only rows whose Mobile_number value exists in both dataframes.
You can look at the documentation of pd.merge to change how exactly the merge is done, what to include, etc.
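For reference, a minimal end-to-end sketch with the sample A and B from the question; how='left' keeps every row of A (an assumption about the desired behaviour), and rename is used here instead of copying the column:
B1 = B.rename(columns={'Cell_number': 'Mobile_number'})[['Mobile_number', 'Location']]
merged_df = pd.merge(A, B1, on='Mobile_number', how='left')
# Paul (554) gets 'AZ' and King (832) gets 'TX';
# Jake (332) and Logan (234) get NaN in Location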
I have columns in two dataframes representing interacting partners in a biological system, so if gene_A interacts with gene_B, the entry in column 'gene_pair' would be {gene_A, gene_B}. I want to do an inner join, but trying:
pd.merge(df1, df2, how='inner', on=['gene_pair'])
throws the error
TypeError: type object argument after * must be a sequence, not itertools.imap
I need to merge on the unordered pair, so as far as I can tell I can't merge on a combination of two individual columns with gene names. Is there another way to achieve this merge?
Some example dfs:
gene_pairs1 = [
    set(['gene_A','gene_B']),
    set(['gene_A','gene_C']),
    set(['gene_D','gene_A'])
]
df1 = pd.DataFrame({'r_name': ['r1','r2','r3'], 'gene_pair': gene_pairs1})
gene_pairs2 = [
    set(['gene_A','gene_B']),
    set(['gene_F','gene_A']),
    set(['gene_C','gene_A'])
]
df2 = pd.DataFrame({'function': ['f1','f2','f3'], 'gene_pair': gene_pairs2})
pd.merge(df1,df2,how='inner',on=['gene_pair'])
and I would like entry 'r1' to line up with 'f1' and 'r2' to line up with 'f3'.
Pretty simple in the end: I used frozenset, rather than set.
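A minimal sketch of that fix, assuming the df1 and df2 built above (frozensets are hashable, so the merge goes through):
df1['gene_pair'] = df1['gene_pair'].apply(frozenset)
df2['gene_pair'] = df2['gene_pair'].apply(frozenset)
pd.merge(df1, df2, how='inner', on='gene_pair')
# 'r1' lines up with 'f1' and 'r2' with 'f3', as desired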
I suggest you add an extra id column for each pair and then join on that. For example:
df2['gp'] = df2.gene_pair.apply(lambda x: list(x)[0][-1]+list(x)[1][-1])
df1['gp'] = df1.gene_pair.apply(lambda x: list(x)[0][-1]+list(x)[1][-1])
pd.merge(df1, df2[['function','gp']],how='inner',on=['gp']).drop('gp', axis=1)
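Note that the last-character key above works for this example but collides whenever two different genes share a final character. A more robust key (a sketch, not what the answer above uses) is the sorted, joined pair:
df1['gp'] = df1.gene_pair.apply(lambda x: '|'.join(sorted(x)))
df2['gp'] = df2.gene_pair.apply(lambda x: '|'.join(sorted(x)))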
I am creating a dataframe from a CSV file. I have gone through the docs, multiple SO posts and links, as I have just started with Pandas, but didn't get it. The CSV file has multiple columns with the same name, say a.
So after forming the dataframe, when I do df['a'], which value will it return? It does not return all the values.
Also, only one of those columns will have a string value; the rest will be None. How can I get that particular column?
The relevant parameter is mangle_dupe_cols.
From the docs:
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
By default, your duplicate 'a' columns get mangled into distinct names as described above (in practice 'a', 'a.1', 'a.2', ...).
If you used mangle_dupe_cols=False, importing this CSV would produce an error.
You can get all of your 'a' columns with:
df.filter(like='a')
Demonstration:
from io import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
I had a similar issue, not from reading a CSV, but because I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
That's what I usually do with my gene expression datasets, where the same gene name can occur more than once because of slightly different genetic sequences of the same gene:
First, create a list of the duplicated columns in the dataframe (i.e., column names that appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list
Then use the .index() method, which finds the first occurrence of the duplicated name on each pass, and append a suffix to it:
for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop suffixes all of the duplicated columns, so every column now has a distinct name.
This specific code works for columns that appear exactly twice, but it can be modified for columns that appear more than twice in your dataframe (see the sketch after this answer).
Finally, rename your columns with the underscored elements:
df.columns = list_of_all_columns
That's it, I hope it helps :)
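If a name can occur more than twice, a hedged generalization of the same idea is to suffix every occurrence of a duplicated name with a running counter:
all_columns = list(df.columns)
counts = {}
new_columns = []
for column in all_columns:
    if all_columns.count(column) > 1:
        # duplicated name: append a running counter, e.g. id_1, id_2, id_3
        counts[column] = counts.get(column, 0) + 1
        new_columns.append(column + '_' + str(counts[column]))
    else:
        new_columns.append(column)
df.columns = new_columns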
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind corresponds to the position of the column according to how the columns are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
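For example, with the two 'id' columns from the previous example:
indices = [i for i, x in enumerate(df.columns) if x == 'id']
first_id = df.iloc[:, indices[0]]   # the first 'id' column
second_id = df.iloc[:, indices[1]]  # the second 'id' column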