Pandas concat not concatenating, but appending - python

I'm hoping for some help.
I am trying to concatenate three dataframes in pandas with a MultiIndex. Two of them work fine, but the third keeps appending instead of concatenating.
They all have the same MultiIndex (I have checked this with df1.index.names == df2.index.names).
This is what I have tried:
df_final = pd.concat([df1, df2], axis = 1)
example:
df1
       X
A B
0 1    3
  2    4
df2
       Y
A B
0 1   20
  2   30
What I want to get is this:
df_final
       X   Y
A B
0 1    3  20
  2    4  30
But what I keep getting is this:
df_final
         X    Y
A B
0 1      3  NaN
  2      4  NaN
0 1    NaN   20
  2    NaN   30
Any ideas? I have also tried
df_final = pd.concat([df1, df2], axis = 1, keys = ['A', 'B'])
But then df2 doesn't appear at all.
Thanks!

First way (and the better one in this case):
use merge:
pd.merge(left=df1, right=df2, on=['A','B'], how='inner')
Second way:
If you prefer using concat you can use groupby after it:
df_final = pd.concat([df1, df2])
df_final = df_final.groupby(['A','B']).first()
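For the toy frames above, here is a minimal runnable sketch of both approaches. It assumes A and B are the two levels of the MultiIndex; pandas (0.23+) resolves the names passed to on= and groupby() against index levels:
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0], 'B': [1, 2], 'X': [3, 4]}).set_index(['A', 'B'])
df2 = pd.DataFrame({'A': [0, 0], 'B': [1, 2], 'Y': [20, 30]}).set_index(['A', 'B'])

# way 1: merge on the named index levels
merged = pd.merge(left=df1, right=df2, on=['A', 'B'], how='inner')

# way 2: stack the rows, then collapse each (A, B) group to its first non-null values
stacked = pd.concat([df1, df2])
collapsed = stacked.groupby(['A', 'B']).first()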

Thank you everyone for your help!
With your suggestions, I tried merging, but I got a new error:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
That led me to discover that one of the index levels in the misbehaving dataframe had object dtype instead of integer. I've fixed that, and now the concat works!
This has taken me days to get through...
So thank you again!
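For anyone hitting the same thing, a sketch of the check and fix (the level names and positions here are assumptions based on the example above):
# compare the dtype of each index level on both frames
for name in df1.index.names:
    print(name,
          df1.index.get_level_values(name).dtype,
          df2.index.get_level_values(name).dtype)

# rebuild the offending level (assumed here to be 'B', the second level) as integers
df2.index = df2.index.set_levels(df2.index.levels[1].astype(int), level='B')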

Try doing
pd.merge(df1, df2)
join() may also be used for your problem, provided the join key is in the index of all your dataframes (a sketch follows).
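Since both frames in this question are already indexed by (A, B), a minimal sketch of that:
# join aligns on the index by default
df_final = df1.join(df2, how='inner')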

Related

Merging 3 dataframes with Pandas

I have 3 dataframes that share an ID column. I want to combine them into a single dataframe, using inner-join logic as in SQL. When I try the code below, it joins the first two dataframes correctly, but even though the ID column matches, the last one comes out wrong. How can I fix this? Thank you for your help in advance.
from functools import reduce

dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
SOLVED: The data type of the ID column in DF1 was int, while in the others it was str. Before asking the question I had converted DF1's ID column to str, which produced the wrong result. When I instead converted all of them to int, I got the result I wanted.
Your IDs are not the same dtype:
>>> DF1
   ID  A
0  10  1
1  20  2
2  30  3
>>> DF2
   ID  K
0  30  3
1  10  1
2  20  2
>>> DF3
   ID  P
0  20  2
1  30  3
2  10  1
Your code:
dfs = [DF1, DF2, DF3]
df_final = reduce(lambda left, right: pd.merge(left, right, on=["ID"], how="outer"), dfs)
The output:
>>> df_final
   ID  A  K  P
0  10  1  1  1
1  20  2  2  2
2  30  3  3  3
Use join:
# use set index to add 'join' key into the index and
# create a list of dataframes using list comprehension
l = [df.set_index('ID') for df in [DF1, DF2, DF3]]
# pd.DataFrame.join accepts a list of dataframes as 'other'
l[0].join(l[1:])
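Putting this together with the dtype issue the asker found, a sketch of the full flow (the cast direction is a choice; casting all three keys to int works just as well):
import pandas as pd

# make the key dtypes agree first -- DF1's ID was int, the others were str
DF1['ID'] = DF1['ID'].astype(str)

l = [df.set_index('ID') for df in [DF1, DF2, DF3]]
df_final = l[0].join(l[1:])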

How can I properly use pivot on this pandas dataframe?

I have the following df:
Item  Service  Damage  Type  Price
A     Fast     3.5     1     15.48403728
A     Slow     3.5     1     17.41954194
B     Fast     5       1     19.3550466
B     Slow     5       1     21.29055126
C     Fast     5.5     1     23.22605592
and so on
I want to turn this into this format:
Item  Damage  Type  Price_Fast  Price_Slow
So the first row would be:
Item  Damage  Type  Price_Fast   Price_Slow
A     3.5     1     15.4840..    17.41954...
I tried:
df.pivot(index=['Item', 'Damage', 'Type'],columns='Service', values='Price')
but it threw this error:
ValueError: Length of passed values is 2340, index implies 3
To get exactly the dataframe layout you want, use
dfData = dfRaw.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
as @CJR suggested, followed by
dfData.reset_index(inplace=True)
to flatten the dataframe, and
dfData.rename(columns={'Fast': 'Price_Fast'}, inplace=True)
dfData.rename(columns={'Slow': 'Price_Slow'}, inplace=True)
to get your desired column names.
Then use
dfData.columns = dfData.columns.values
to get rid of the custom index label, and you are done. (Thanks to @Akaisteph7 for pointing out that I was not quite done with my previous solution.)
You can do it with the following code:
# You should use pivot_table as it handles multiple column pivoting and duplicates aggregation
df2 = df.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
# Make the pivot indexes back into columns
df2.reset_index(inplace=True)
# Change the columns' names
df2.rename(columns=lambda x: "Price_"+x if x in ["Fast", "Slow"] else x, inplace=True)
# Remove the unneeded column Index name
df2.columns = df2.columns.values
print(df2)
Output:
  Item  Damage  Type  Price_Fast  Price_Slow
0    A     3.5     1   15.484037   17.419542
1    B     5.0     1   19.355047   21.290551
2    C     5.5     1   23.226056         NaN
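A compact variant of the same pipeline, collapsing the rename and the column-index cleanup into one chain (add_prefix is standard pandas; the rest mirrors the steps above):
df2 = (
    df.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
      .add_prefix('Price_')   # 'Fast' -> 'Price_Fast', 'Slow' -> 'Price_Slow'
      .reset_index()
)
df2.columns.name = None       # drop the leftover 'Service' label on the columns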

pandas: low level concatenation of DataFrames along axis=1

The problem:
one has 2 DataFrames
one knows that the two have identical (MultiIndex) indices
(just in case it helps) both indices are sorted
both DataFrames have columns which do not intersect
How can I concatenate the 2 DataFrames very efficiently by just slapping their memory blocks together, i.e. do the equivalent of
pd.concat([df1, df2], axis=1, sort=False)
but somehow force it to completely disregard the index values of both DataFrames, to make it very fast? I want it to be essentially as close as possible to a memory-copy operation (no merges).
import pandas as pd
df1 = pd.DataFrame(data={'i1': ['a', 'a', 'b', 'b'],
                         'i2': [0, 1, 0, 1],
                         'x': [1., 2., 3., 4.]})
df1.set_index(['i1','i2'], inplace=True)
df1.sort_index(inplace=True)
df2 = pd.DataFrame(data={'y':[5,6,7,8]}, index=df1.index)
pd.concat([df1, df2], axis=1, sort=False)
         x  y
i1 i2
a  0   1.0  5
   1   2.0  6
b  0   3.0  7
   1   4.0  8
# copy each column's values across, bypassing index alignment entirely
for col in df2:
    df1[col] = df2[col].values
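If even the per-column loop above is too slow, one can drop to NumPy and build the result in one shot. This is a sketch, not an official pandas fast path: hstack forces a single common dtype (here the float x and int y both become float64), so it only fits when that upcast is acceptable:
import numpy as np
import pandas as pd

# assumes identical row order and non-overlapping columns
out = pd.DataFrame(
    np.hstack([df1.to_numpy(), df2.to_numpy()]),
    index=df1.index,
    columns=df1.columns.append(df2.columns),
)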

Number of rows changes even after `pandas.merge` with `left` option

I am merging two data frames using pandas.merge. Even after specifying the how='left' option, I found that the merged data frame has more rows than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like there is more than one row in the right frame whose 'name2' matches a given key on the left. Using how='left' with pandas.DataFrame.merge() only means:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
   A    B
0  a  AAA
1  b  BBA
2  c  CCF
and then another DF that looks like this (notice that there is more than one entry for the key you want to match on the left):
In [360]: df_3
Out[360]:
  key  value
0   a      1
1   a      2
2   b      3
3   a      4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
   A    B  key  value
0  a  AAA    a    1.0
1  a  AAA    a    2.0
2  a  AAA    a    4.0
3  b  BBA    b    3.0
4  c  CCF  NaN    NaN
This happened even though I merged with how='left': as you can see above, there was simply more than one row to merge, and so the resulting pd.DataFrame has more rows than the pd.DataFrame on the left.
I hope this helps!
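To confirm this is what is happening in the asker's data, one can count the duplicated keys on the right-hand frame before merging (a sketch using the asker's variable names):
# rows in prof_2000 whose join key occurs more than once
dup_mask = prof_2000['name2'].duplicated(keep=False)
print(prof_2000.loc[dup_mask, 'name2'].value_counts())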
The doubling of rows after a merge() (of any kind, 'inner' or 'left') is usually caused by duplicates in one of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
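A related safeguard: merge accepts a validate argument that raises a MergeError instead of silently multiplying rows when a key is not unique. Applied to the asker's merge, that would look like:
temp_2000 = pd.merge(
    panel, prof_2000,
    left_on='Candidate_u', right_on='name2',
    how='left',
    validate='many_to_one',  # fail loudly if 'name2' contains duplicates
)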
If you do not have any duplicates, as discussed in the answer above, double-check the key values of the rows that fail to match. In my case I discovered that the entries were spelled inconsistently between df1 and df2, and I solved the problem by:
df1["col1"] = df2["col2"]

How to remove duplicate columns from a dataframe using python pandas

After grouping by two columns I made some changes, then generated a file using Python; the result contained 2 duplicate columns. How can I remove duplicate columns from a dataframe?
It's probably easiest to use a groupby (assuming they have duplicate names too):
In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4
In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4
If they have different names you can drop_duplicates on the transpose:
In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4
In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4
Usually read_csv will ensure the columns have different names...
Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442
This is the best I found so far.
import numpy as np

remove = []
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    # compare this column's values against every later column
    for j in range(i + 1, len(cols)):
        if np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])
df.drop(remove, axis=1, inplace=True)
https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code
It's already answered here python pandas remove duplicate columns.
The idea is that df.columns.duplicated() generates a boolean vector where each value says whether the column has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Let's take its inversion and call it column_selector.
Using this vector with the loc method of df, which selects rows and columns, we can remove the duplicate columns: df.loc[:, column_selector] selects only the columns we want to keep.
column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
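A minimal end-to-end sketch of this approach, using the ["Col1", "Col2", "Col1"] example from above:
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['Col1', 'Col2', 'Col1'])
print(df.columns.duplicated())        # [False False  True]
df = df.loc[:, ~df.columns.duplicated()]
print(df.columns.tolist())            # ['Col1', 'Col2']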
I understand that this is an old question, but I recently had the same issue and none of these solutions worked for me, and the looping suggestion seemed like overkill. In the end, I simply found the index of the unwanted duplicate column and dropped it. Provided you know the column's index (which you can find via debugging or print statements), this works:
df.drop(df.columns[i], axis=1)
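One caveat worth noting: df.drop(df.columns[i], axis=1) drops by label, so if the duplicates share a name it removes every column with that name, not just the i-th one. A strictly positional drop keeps the other copy (a sketch):
# keep every column position except i
df = df.iloc[:, [j for j in range(df.shape[1]) if j != i]]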
A fast solution for datasets without NaNs:
share = 0.05  # fraction of rows to sample
# find duplicate columns cheaply on a small row sample...
dfx = df.sample(int(df.shape[0] * share))
dfx = dfx.T.drop_duplicates().T
# ...then keep only the surviving columns in the full frame
df = df[dfx.columns]
