pandas: low-level concatenation of DataFrames along axis=1

The problem:
- one has two DataFrames
- one knows that the two have identical (MultiIndex) indices
- (just in case it helps) both indices are sorted
- both DataFrames have columns which do not intersect
How can I concatenate the two DataFrames very efficiently by just slapping their memory blocks together, i.e. do the equivalent of
pd.concat([df1, df2], axis=1, sort=False)
but somehow force it to completely disregard the index values of both DataFrames to make it very fast? I want it to be essentially as close as possible to a memory copy operation (no merges).
import pandas as pd
df1 = pd.DataFrame(data={'i1': ['a', 'a', 'b', 'b'],
                         'i2': [0, 1, 0, 1],
                         'x': [1., 2., 3., 4.]})
df1.set_index(['i1','i2'], inplace=True)
df1.sort_index(inplace=True)
df2 = pd.DataFrame(data={'y':[5,6,7,8]}, index=df1.index)
pd.concat([df1, df2], axis=1, sort=False)
         x  y
i1 i2
a  0   1.0  5
   1   2.0  6
b  0   3.0  7
   1   4.0  8

Assigning the raw values column by column skips index alignment entirely:
for col in df2:
    # .values hands pandas the bare NumPy array, so no index lookup happens
    df1[col] = df2[col].values
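If you would rather build a new frame than mutate df1 in place, a minimal sketch under the question's assumptions (same row order, non-intersecting columns) is to stack the raw arrays and reuse one index. This is an alternative sketch, not the answer above, and note that np.hstack upcasts mixed dtypes to a common type:
import numpy as np

# Bypass all index alignment by gluing the underlying arrays together.
# Caveat: hstack yields one homogeneous array, so the int column 'y'
# is upcast to float alongside 'x'.
out = pd.DataFrame(
    np.hstack([df1.to_numpy(), df2.to_numpy()]),
    index=df1.index,  # reuse df1's index as-is
    columns=list(df1.columns) + list(df2.columns),
)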

Related

How to avoid few columns in a data frame while merging data with an other data frame?

I have two data frames df1 and df2.
df1 =
A B C D
1 2 3 7
.
.
df2 =
A E F G
1 5 4 5
.
.
When I usually want to merge specific columns from two data frames using pandas I do this:
import pandas as pd
df3 = pd.merge(df1[['A','B']], df2[['A','G']], on='A', how='inner')
However, I am interested in knowing how to avoid a few columns in a data frame and merge the rest. For example, I want to avoid the columns C and D in df1 and the columns E and F in df2 while merging, so the resultant df3 has only the A, B, G columns.
It is the reverse approach. When there are only a few columns in each data frame the first method is enough, but when working with hundreds of columns where only a few are to be avoided, the second approach is helpful.
How about drop:
df1.drop(['C','D'], axis=1).merge(df2.drop(['E','F'], axis=1), on='A')
try this:
df3 = df1.merge(df2, on='A', how='inner')
df3 = df3.drop(['E', 'F', 'C', 'D'], axis=1)
This works, but it is less efficient, so dropping before merging is the better option.
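When the avoid-list is long, a minimal sketch (assuming the same df1/df2 as above) is to compute the kept columns instead of typing them out:
# Derive the columns to keep from an explicit avoid-set; this scales to
# hundreds of columns. 'A' is the join key and is never in the avoid-set.
avoid = {'C', 'D', 'E', 'F'}
keep1 = [c for c in df1.columns if c not in avoid]
keep2 = [c for c in df2.columns if c not in avoid]
df3 = pd.merge(df1[keep1], df2[keep2], on='A', how='inner')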

Pandas concat not concatenating, but appending

I'm hoping for some help.
I am trying to concatenate three dataframes in pandas with a multiindex. Two of them work fine, but the third keeps appending, instead of concatenating.
They all have the same multiindex (I have tested this with df1.index.names == df2.index.names)
This is what I have tried:
df_final = pd.concat([df1, df2], axis = 1)
example:
df1
     X
A B
0 1  3
  2  4

df2
      Y
A B
0 1  20
  2  30
What I want to get is this:
df_final
      X   Y
A B
0 1   3  20
  2   4  30
But what I keep getting is this:
df_final
       X    Y
A B
0 1    3  NaN
  2    4  NaN
0 1  NaN   20
  2  NaN   30
Any ideas? I have also tried
df_final = pd.concat([df1, df2], axis = 1, keys = ['A', 'B'])
But then df2 doesn't appear at all.
Thanks!
First way (and the better one in this case):
use merge:
pd.merge(left=df1, right=df2, on=['A','B'], how='inner')
Second way:
If you prefer using concat you can use groupby after it:
df_final = pd.concat([df1, df2])
df_final = df_final.groupby(['A','B']).first()
Thank you everyone for your help!
With your suggestions, I tried merging, but I got a new error:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
Which led me to find that one of the indexes in the dataframe that was appending was an object instead of an integer. So I've changed that and now the concat works!
This has taken me days to get through...
So thank you again!
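For reference, a minimal sketch of the kind of dtype fix described above, assuming (hypothetically) that the mismatched level is the second level of df2's MultiIndex and should be an integer:
# Cast one MultiIndex level to int64 so both frames' indexes carry the
# same dtype and concat aligns rows instead of appending them.
# The level position (1) is a hypothetical assumption for illustration.
df2.index = df2.index.set_levels(
    df2.index.levels[1].astype('int64'), level=1
)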
Try doing
pd.merge(df1, df2)
join() may also be used for your problem, provided you add the 'key' column to all your dataframes.
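Since both frames in this question already share the ['A', 'B'] MultiIndex, a minimal sketch of the join() route needs no extra key column:
# DataFrame.join aligns on the index by default, so two frames sharing
# the same MultiIndex combine directly.
df_final = df1.join(df2, how='inner')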

How can I properly use pivot on this pandas dataframe?

I have the following df:
Item  Service  Damage  Type  Price
A     Fast     3.5     1     15.48403728
A     Slow     3.5     1     17.41954194
B     Fast     5       1     19.3550466
B     Slow     5       1     21.29055126
C     Fast     5.5     1     23.22605592
and so on
I want to turn this into this format:
Item  Damage  Type  Price_Fast  Price_Slow
So the first row would be:
Item  Damage  Type  Price_Fast  Price_Slow
A     3.5     1     15.4840..   17.41954...
I tried:
df.pivot(index=['Item', 'Damage', 'Type'],columns='Service', values='Price')
but it threw this error:
ValueError: Length of passed values is 2340, index implies 3
To get exactly the dataframe layout you want, use
dfData = dfRaw.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
as @CJR suggested, followed by
dfData.reset_index(inplace=True)
to flatten the dataframe, and
dfData.rename(columns={'Fast': 'Price_fast'}, inplace=True)
dfData.rename(columns={'Slow': 'Price_slow'}, inplace=True)
to get your desired column names. Then use
dfData.columns = dfData.columns.values
to get rid of the custom index label, and you are done. (Thanks to @Akaisteph7 for pointing out that I was not quite done with my previous solution.)
You can do it with the following code:
# You should use pivot_table as it handles multiple column pivoting and duplicates aggregation
df2 = df.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
# Make the pivot indexes back into columns
df2.reset_index(inplace=True)
# Change the columns' names
df2.rename(columns=lambda x: "Price_"+x if x in ["Fast", "Slow"] else x, inplace=True)
# Remove the unneeded column Index name
df2.columns = df2.columns.values
print(df2)
Output:
  Item  Damage  Type  Price_Fast  Price_Slow
0    A     3.5     1   15.484037   17.419542
1    B     5.0     1   19.355047   21.290551
2    C     5.5     1   23.226056         NaN
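The same steps also fit in a single chain. A small sketch using add_prefix and rename_axis in place of the manual renames (standard pandas methods, though not the ones used in the answers above):
# Pivot, prefix only the pivoted value columns, drop the leftover
# 'Service' columns-axis name, then restore Item/Damage/Type as columns.
df2 = (
    df.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
      .add_prefix('Price_')
      .rename_axis(columns=None)
      .reset_index()
)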

Added column to existing dataframe but entered all numbers as NaN

So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried using append, merge and concat and none of them worked, so then I tried simply:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut on properly but all the values were entered as NaN instead of the numerical values that they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If they're the same row counts and you just want to tack it on the end, the indexes either need to match, or you need to just pass the underlying values. In the example below, columns 3 and 5 are the index matching & value versions, and 4 is what you're running into now:
In [57]: import numpy as np

In [58]: df = pd.DataFrame(np.random.random((3,3)))

In [59]: df
Out[59]:
          0         1         2
0  0.670812  0.500688  0.136661
1  0.185841  0.239175  0.542369
2  0.351280  0.451193  0.436108

In [61]: df2 = pd.DataFrame(np.random.random((3,1)))

In [62]: df2
Out[62]:
          0
0  0.638216
1  0.477159
2  0.205981

In [64]: df[3] = df2

In [66]: df.index = ['a', 'b', 'c']

In [68]: df[4] = df2

In [70]: df[5] = df2.values

In [71]: df
Out[71]:
          0         1         2         3   4         5
a  0.670812  0.500688  0.136661  0.638216 NaN  0.638216
b  0.185841  0.239175  0.542369  0.477159 NaN  0.477159
c  0.351280  0.451193  0.436108  0.205981 NaN  0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
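A minimal sketch of that merge, assuming (hypothetically) that both frames carry month, day and hour key columns; the key names here are illustrative, not from the original post:
# When row counts differ, merge on explicit key columns instead of relying
# on positional alignment. The key names below are hypothetical.
merged = x1GBaverage.merge(x2_cut, on=['month', 'day', 'hour'], how='left')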

Pandas column bind (cbind) two data frames

I've got a dataframe df_a with id information:
    unique_id lacet_number
15    5570613  TLA-0138365
24    5025490  EMP-0138757
36    4354431  DXN-0025343
and another dataframe df_b, with the same number of rows that I know correspond to the rows in df_a:
     latitude  longitude
0  -93.193560  31.217029
1  -93.948082  35.360874
2 -103.131508  37.787609
What I want to do is simply concatenate the two horizontally (similar to cbind in R) and get:
   unique_id lacet_number    latitude  longitude
0    5570613  TLA-0138365  -93.193560  31.217029
1    5025490  EMP-0138757  -93.948082  35.360874
2    4354431  DXN-0025343 -103.131508  37.787609
What I have tried:
df_c = pd.concat([df_a, df_b], axis=1)
which gives me an outer join.
    unique_id lacet_number    latitude  longitude
0         NaN          NaN  -93.193560  31.217029
1         NaN          NaN  -93.948082  35.360874
2         NaN          NaN -103.131508  37.787609
15    5570613  TLA-0138365         NaN        NaN
24    5025490  EMP-0138757         NaN        NaN
36    4354431  DXN-0025343         NaN        NaN
The problem is that the indices of the two dataframes do not match. I read the documentation for pandas.concat and saw that there is an ignore_index option, but that only applies to the concatenation axis (in my case the columns), so it is certainly not the right choice here. So my question is: is there a simple way to achieve this?
If you're sure the index row values correspond, then to avoid index alignment just call reset_index(drop=True) first; this resets the index values back to start from 0:
df_c = pd.concat([df_a.reset_index(drop=True), df_b], axis=1)
DataFrame.join
While concat is fine, it's simpler to join:
C = A.join(B)
This still assumes aligned indexes, so reset_index as needed. In OP's example, B's index is already default, so we only need to reset A:
C = A.reset_index(drop=True).join(B)
# unique_id lacet_number latitude longitude
# 0 5570613 TLA-0138365 -93.193560 31.217029
# 1 5025490 EMP-0138757 -93.948082 35.360874
# 2 4354431 DXN-0025343 -103.131508 37.787609
You can use set_axis to make the index labels of one frame the same as the other's, then concatenate horizontally or join. Unlike reset_index, this preserves the index labels of one of the dataframes.
joined_df = pd.concat([df_a.set_axis(df_b.index), df_b], axis=1)
# or using `join`
joined_df = df_a.set_axis(df_b.index).join(df_b)
