I would like to merge three dataframes. I have tried to simplify the problem to explain: I have df with columns ['1', '2', '3'] and df1 with columns ['1', '2', '3'] and df2 with columns ['1', '2', '3'].
I want to merge the dataframes on keys 1 & 2.
I have tried the following (simplified):
new = pd.merge(df, df1, how = 'left', on = [ '1', '2'])
new1 = pd.merge(new, df2, how = 'left', on = ['1', '2'])
This gives:
new with columns ['1', '2', '3_x', '3_y']
new1 with columns ['1', '2', '3_x', '3_y', '3_z']
while I would like
new with columns ['1', '2', '3']
new1 with columns ['1', '2', '3']
Any help is welcome! I don't want to use a loop.
Thanks in advance.
merge joins rows only on the columns you specify in on. That is, this is working as expected.
If you have matching columns 1 and 2 but different values in column 3 and you make the merge, what should be in each column of the output? 1 and 2 will be whatever they were in both of the originals, but there are two different candidates for column 3. merge resolves this by keeping both under separate, suffixed column names.
What you may want here instead is to stack the dataframes, putting one below another, as explained in the pandas merging guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-using-append. Older pandas versions offered DataFrame.append for this; it was deprecated in 1.4 and removed in 2.0, so use pd.concat.
This will give you an output that has the columns '1', '2', '3'.
new = pd.concat([df, df1])
new1 = pd.concat([new, df2])
You can also pass all three at once: new1 = pd.concat([df, df1, df2]).
You did say you want to just merge on columns 1 and 2, so I may be completely missing your point here. What would you want with the data in column 3 in such a case? You may be able to achieve that by stacking the dataframes and then removing some duplicates or otherwise cleaning up this output.
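The "stack, then remove duplicates" idea could look roughly like the sketch below. The three frames are made-up sample data, and keeping the first row per key pair is only one possible answer to the open question of what column 3 should contain.

```python
import pandas as pd

# Hypothetical sample frames sharing the columns '1', '2', '3'.
df = pd.DataFrame({'1': ['a', 'b'], '2': ['x', 'y'], '3': [10, 20]})
df1 = pd.DataFrame({'1': ['b', 'c'], '2': ['y', 'z'], '3': [21, 30]})
df2 = pd.DataFrame({'1': ['c', 'd'], '2': ['z', 'w'], '3': [31, 40]})

# Stack the frames vertically; ignore_index renumbers the rows 0..n-1.
stacked = pd.concat([df, df1, df2], ignore_index=True)

# Keep one row per ('1', '2') key pair; keep='first' prefers the earliest frame.
result = stacked.drop_duplicates(subset=['1', '2'], keep='first').reset_index(drop=True)
print(result)
```

If a later frame should win instead, keep='last' is the corresponding choice.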
The data I have to work with is a bit messy. It has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
     0    1    2
0    1    2    3
1  foo  bar  baz
2    4    5    6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1  foo  bar  baz
0    1    2    3
2    4    5    6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1  foo  bar  baz
0    1    2    3
2    4    5    6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
This works (pandas v'0.19.2'):
df.rename(columns=df.iloc[0])
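It would be easier to recreate the data frame.
This rebuilds the columns from scratch, although values taken from a mixed-type frame generally stay as object dtype unless you convert them, for example with infer_objects().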
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace = True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace = True)
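Putting the rename and drop steps together, a minimal sketch might look like this. The frame raw is invented for illustration, and the infer_objects() call is an optional extra step that is only useful when the remaining rows hold uniform types.

```python
import pandas as pd

# Hypothetical frame whose first row holds the real column names.
raw = pd.DataFrame([('a', 'b', 'c'), (1, 2, 3), (4, 5, 6)])

# Keep only the data rows and renumber them from 0.
promoted = raw.iloc[1:].reset_index(drop=True)
# Use the first row's values as the new column labels.
promoted.columns = raw.iloc[0].tolist()
# The columns are still object dtype at this point; ask pandas to re-infer them.
promoted = promoted.infer_objects()

print(promoted)
print(promoted.dtypes)
```

Passing a plain list to columns avoids carrying the old row label over as the name of the column index.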
You can specify the row index in read_csv or read_html via the header parameter, which gives the row number(s) to use as the column names and marks the start of the data. This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
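# skipinitialspace strips the blank after each comma so column names have no leading space
df = pd.read_csv(StringIO(csv), header=2, skipinitialspace=True)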
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
Keeping it Python simple
The pandas DataFrame constructor has a columns argument. Why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
['testsala.cxf', '86', '95', '92', '87'],
['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
['630.cxf', '18', '8', '11', '18'],
['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])
Or, if the header is not the first row but, for instance, the one at index 10:
columns = table.pop(10)
data = pd.DataFrame(table,columns=columns)
df_1 = pd.DataFrame({'budget_id':['1', '2', '3', '4'],
'budget_amount':[200, 300, 400, 500]})
df_2 = pd.DataFrame({'budget_id':['1', '2', '3', '4', '5'],
'budget_amount':[200, 300, 400, 550, 700]})
df_1.compare(df_2, align_axis=0, keep_equal=True).rename(index={'self': 'Prev', 'other': 'New'}, level=1)
Desired output of df.compare():
budget_id budget_amount
4 550
5 700
I have two data frames that I wish to compare using df.compare. They both have the same columns and index labels.
However, I cannot guarantee they have the same number of rows. This causes issues, as compare expects two DataFrames with the same shape.
I need to know if a new row has been added as part of the compare.
Would the best solution be to append blank rows to either data frame until they're equal? How would you do that?
Is there a more elegant way?
Does merge work for you?
(df_1.merge(df_2, on='budget_id', how='right')
.query('budget_amount_x != budget_amount_y')
)
Output:
  budget_id  budget_amount_x  budget_amount_y
3         4            500.0              550
4         5              NaN              700
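A variation on the merge idea, sketched below, uses the indicator option of merge to tell added rows apart from changed ones. The suffix names _prev and _new are arbitrary choices made for this example.

```python
import pandas as pd

df_1 = pd.DataFrame({'budget_id': ['1', '2', '3', '4'],
                     'budget_amount': [200, 300, 400, 500]})
df_2 = pd.DataFrame({'budget_id': ['1', '2', '3', '4', '5'],
                     'budget_amount': [200, 300, 400, 550, 700]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = df_1.merge(df_2, on='budget_id', how='outer',
                    suffixes=('_prev', '_new'), indicator=True)

# Rows present only in df_2 are additions.
added = merged[merged['_merge'] == 'right_only']

# Rows present in both frames whose amount differs are changes.
changed = merged[(merged['_merge'] == 'both')
                 & (merged['budget_amount_prev'] != merged['budget_amount_new'])]

print(added)
print(changed)
```

Rows that were removed from df_2 would show up as 'left_only' in the same column.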
This is the solution I wrote based on Giovanni Frison's comment.
def compare_dataframes(df_1, df_2):
    if df_1.equals(df_2):
        return pandas.DataFrame()
    else:
        # Get indexes of rows present in df_2, but not in df_1
        new_row_indexes = df_2.index.difference(df_1.index)
        # Copy so that adding a column does not touch a slice of df_2
        new_rows = df_2.loc[new_row_indexes].copy()
        # Create second index level to match df.compare output
        new_rows[''] = 'New'
        new_rows = new_rows.set_index('', append=True)
        # Drop new rows from df_2 to create same shape for df.compare
        df_2 = df_2.drop(new_row_indexes)
        compare_df = df_1.compare(df_2, align_axis=0, keep_equal=True).rename(index={'self': 'Prev', 'other': 'New'}, level=1)
        # DataFrame.append was removed in pandas 2.0; concat is the replacement
        compare_df = pandas.concat([compare_df, new_rows])
        return compare_df
I want to add a list that is longer than the dataframe as a new column, but I get the error ValueError: Length of values (4) does not match length of index (3)
import pandas as pd
df = pd.DataFrame({'Data': ['1', '2', '3']})
df['Data2'] =['1', '2', '3', '4']
print(df)
How can I fix it?
Use DataFrame.reindex to extend the index to the larger of the two lengths before assigning. This covers a list that is longer than, or the same length as, the DataFrame. A plain list that is shorter would still raise the same ValueError, so wrap it in pd.Series in that case:
df = pd.DataFrame({'Data': ['1', '2', '3']})
L = ['1', '2', '3', '4']
df = df.reindex(range(max(len(df), len(L))))
df['Data2'] = L
print (df)
Data Data2
0 1 1
1 2 2
2 3 3
3 NaN 4
If the list is always longer:
df = df.reindex(range(len(L)))
df['Data2'] = L
You can also use pd.concat here: convert your list to a Series first, then concatenate along the columns.
l = ['1', '2', '3', '4']
pd.concat([df, pd.Series(l, name='Data2')], axis=1)
Data Data2
0 1 1
1 2 2
2 3 3
3 NaN 4
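Both answers can be folded into one small helper that works whether the list is longer or shorter than the frame. This is only a sketch: add_column is a made-up name, and it assumes the frame has the default 0..n-1 index.

```python
import pandas as pd

def add_column(df, name, values):
    """Return a copy of df with a new column, padded with NaN where lengths differ."""
    # Extend the index if the list is longer than the frame (assumes a 0..n-1 index).
    out = df.reindex(range(max(len(df), len(values))))
    # A Series aligns on the index, so a shorter list leaves NaN in the tail.
    out[name] = pd.Series(values, index=range(len(values)))
    return out

df = pd.DataFrame({'Data': ['1', '2', '3']})
longer = add_column(df, 'Data2', ['1', '2', '3', '4'])
shorter = add_column(df, 'Data2', ['1', '2'])
print(longer)
print(shorter)
```

reindex returns a new object, so the original df is left unchanged.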
I have two pandas dataframes.
I am going row by row and trying to check if a value in df1[column] is in df2[column], and record this info to df1.
I have a 'toy' example below. But my actual dataset has 150,000 rows.
The code below runs fine, but on the larger dataset I had to stop the kernel because it was taking too long.
df1= pd.DataFrame([['1', 'a'],
['2', 'b'],
['3', 'b'],
['4', 'z'],
['5', 'e']], columns=['num', 'num_letter'])
# adding an extra column to record result of check for duplicates
df1['dupe'] = None
df2= pd.DataFrame([['1', 'a'],
['2', 'b'],
['3', 'b'],
['4', 'd'],
['5', 'e']], columns=['num', 'num_letter'])
for i in range(len(df1)):
    for k in df1['num_letter']:
        # if value from df1 is found in df2 column,
        # record the word 'dupe' to corresponding empty cell in df1.
        if k in df2['num_letter'].values:
            df1.loc[i,'dupe'] = 'dupe'
        else:
            df1.loc[i,'dupe'] = 'not_dupe'
Is there a more efficient way to do this?
Thanks folks
NumPy's isin and where (np.in1d is the older name for the same check and is deprecated as of NumPy 2.0)
df1.assign(dupe=np.where(np.isin(df1.num_letter, df2.num_letter), 'dupe', 'not_dupe'))
num num_letter dupe
0 1 a dupe
1 2 b dupe
2 3 b dupe
3 4 z not_dupe
4 5 e dupe
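The same membership test is also available in pandas itself as Series.isin. A self-contained sketch using the toy data from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'num': ['1', '2', '3', '4', '5'],
                    'num_letter': ['a', 'b', 'b', 'z', 'e']})
df2 = pd.DataFrame({'num': ['1', '2', '3', '4', '5'],
                    'num_letter': ['a', 'b', 'b', 'd', 'e']})

# One vectorized membership test per row instead of a nested Python loop.
mask = df1['num_letter'].isin(df2['num_letter'])

# Translate the boolean mask into the two labels used in the question.
df1['dupe'] = np.where(mask, 'dupe', 'not_dupe')
print(df1)
```

Each row is checked against the set of values in df2 once, which avoids the loop inside a loop from the question.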
I have the pandas.DataFrame below:
One of the columns from the Dataframe, pontos, holds a dict for each of the rows.
What I want to do is add one column to the DataFrame for each key from this dict. So the new columns would be, in this example: rodada, mes, etc, and for each row, these columns would be populated with the respective value from the dict.
So far I've tried the following for one of the keys:
df_times["rodada"] = [df_times["pontos"].get('rodada') for d in df_times["pontos"]]
However, as a result I'm getting a new column rodada filled with None values:
Any hints on what I'm doing wrong?
You can create a new dataframe and concat it to the current one like:
Code:
df2 = pd.concat([df, pd.DataFrame(list(df.pontos))], axis=1)
Test Code:
import pandas as pd
df = pd.DataFrame([
['A', dict(col1='1', col2='2')],
['B', dict(col1='3', col2='4')],
], columns=['X', 'D'])
print(df)
df2 = pd.concat([df, pd.DataFrame(list(df.D))], axis=1)
print(df2)
Results:
X D
0 A {'col2': '2', 'col1': '1'}
1 B {'col2': '4', 'col1': '3'}
X D col1 col2
0 A {'col2': '2', 'col1': '1'} 1 2
1 B {'col2': '4', 'col1': '3'} 3 4
You just need a slight change in your comprehension to extract that data.
It should be:
df_times["rodada"] = [d.get('rodada') for d in
df_times["pontos"]]
You want the values of the dictionary key 'rodada' to be the basis of your new column. So you iterate over the dictionaries in the column (each one is d in the comprehension) and extract the value by key to make the new column.
You can also use join and the pandas apply function:
df=df.join(df['pontos'].apply(pd.Series))
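For larger frames, pd.json_normalize may be a faster way to expand the dicts than apply(pd.Series). The sketch below uses an invented stand-in for df_times; only the pontos column and the keys rodada and mes come from the question, the time column and the values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for df_times; values are made up for illustration.
df_times = pd.DataFrame({
    'time': ['A', 'B'],
    'pontos': [{'rodada': 1, 'mes': 3}, {'rodada': 2, 'mes': 4}],
})

# Build one column per dict key from the list of dicts.
expanded = pd.json_normalize(df_times['pontos'].tolist())

# Reuse the original index so the join lines up even if it is not 0..n-1.
expanded.index = df_times.index

# Replace the dict column with the expanded columns.
result = df_times.drop(columns='pontos').join(expanded)
print(result)
```

Keys missing from some rows end up as NaN in the corresponding column.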