I am merging two data frames using pandas.merge. Even after specifying how = left option, I found the number of rows of merged data frame is larger than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like having more than one rows in right under 'name2' that match the key you have set for the left. Using option 'how='left' with pandas.DataFrame.merge() only means that:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
A B
0 a AAA
1 b BBA
2 c CCF
and then another DF that looks like this (notice that there are more than one entry for your desired key on the left):
In [360]: df_3
Out[360]:
key value
0 a 1
1 a 2
2 b 3
3 a 4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
A B key value
0 a AAA a 1.0
1 a AAA a 2.0
2 a AAA a 4.0
3 b BBA b 3.0
4 c CCF NaN NaN
This happened even though I merged with how='left' as you can see above, there were simply more than one rows to merge and as shown here the result pd.DataFrame has in fact more rows than the pd.DataFrame on the left.
I hope this helps!
The problem of doubling of rows after each merge() (of any type, 'both' or 'left') is usually caused by duplicates in any of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
If you do not have any duplication, as indicated in the above answer. You should double-check the names of removed entries. In my case, I discovered that the names of removed entries are inconsistent between the df1 and df2 and I solved the problem by:
df1["col1"] = df2["col2"]
Related
I have the following df:
Item Service Damage Type Price
A Fast 3.5 1 15.48403728
A Slow 3.5 1 17.41954194
B Fast 5 1 19.3550466
B Slow 5 1 21.29055126
C Fast 5.5 1 23.22605592
and so on
I want to turn this into this format:
Item Damage Type Price_Fast Price_slow
So the first row would be:
Item Damage Type Price_Fast Price_slow
A 3.5 1 15.4840.. 17.41954...
I tried:
df.pivot(index=['Item', 'Damage', 'Type'],columns='Service', values='Price')
but it threw this error:
ValueError: Length of passed values is 2340, index implies 3
To get exactly the dataframe layout you want use
dfData = dfRaw.pivot_table(index=['Item', 'Damage', 'Type'],columns='Service', values='Price')
like #CJR suggested followed by
dfData.reset_index(inplace=True)
to flatten dataframe and
dfData.rename(columns={'Fast': 'Price_fast'}, inplace=True)
dfData.rename(columns={'Slow': 'Price_slow'}, inplace=True)
to get your desired column names.
Then use
dfNew.columns = dfNew.columns.values
to get rid of custom index label and your are done (Thanks to #Akaisteph7 for pointing that out that I was not quite done with my previous solution.)
You can do it with the following code:
# You should use pivot_table as it handles multiple column pivoting and duplicates aggregation
df2 = df.pivot_table(index=['Item', 'Damage', 'Type'], columns='Service', values='Price')
# Make the pivot indexes back into columns
df2.reset_index(inplace=True)
# Change the columns' names
df2.rename(columns=lambda x: "Price_"+x if x in ["Fast", "Slow"] else x, inplace=True)
# Remove the unneeded column Index name
df2.columns = df2.columns.values
print(df2)
Output:
Item Damage Type Price_Fast Price_Slow
0 A 3.5 1 15.484037 17.419542
1 B 5.0 1 19.355047 21.290551
2 C 5.5 1 23.226056 NaN
I'm just getting into pandas and I am trying to add a new column to an existing dataframe.
I have two dataframes where the index of one data frame links to a column in another dataframe. Where these values are equal I need to put the value of another column in the source dataframe in a new column of the destination column.
The code section below illustrates what I mean. The commented part is what I need as an output.
I guess I need the .loc[] function.
Another, minor, question: is it bad practice to have a non-unique indexes?
import pandas as pd
d = {'key':['a', 'b', 'c'],
'bar':[1, 2, 3]}
d2 = {'key':['a', 'a', 'b'],
'other_data':['10', '20', '30']}
df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')
print df2
## other_data new_col
##key
##a 10 1
##a 20 1
##b 30 2
Use rename index by Series:
df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
Or map:
df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
If want better performance, the best is avoid duplicates in index. Also some function like reindex failed in duplicates index.
You can use join
df2.join(df.set_index('key'))
other_data bar
key
a 10 1
a 20 1
b 30 2
One way to rename the column in the process
df2.join(df.set_index('key').bar.rename('new'))
other_data new
key
a 10 1
a 20 1
b 30 2
Another, minor, question: is it bad practice to have a non-unique
indexes?
It is not great practice, but depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This engenders the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DMBS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a cartesian product--probably not what you're looking for.
For example:
df = pd.DataFrame(np.random.randn(10,2),
index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
0 1 a b
a 0.73737 1.49073 0.73737 1.49073
a 0.73737 1.49073 -0.25562 -2.79859
a -0.25562 -2.79859 0.73737 1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583 1.17583 -0.93583 1.17583
b -0.93583 1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583 1.17583
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post.
When index is unique, pandas use a hashtable to map key to value O(1).
When index is non-unique and sorted, pandas use binary search O(logN),
when index is random ordered pandas need to check all the keys in the
index O(N).
A word on .loc
Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10,2),
index=2*list('abcde'))
print(df.loc['a'])
0 1
a 0.73737 1.49073
a -0.25562 -2.79859
With the help of .loc
df2['new'] = df.set_index('key').loc[df2.index]
Output :
other_data new
key
a 10 1
a 20 1
b 30 2
Using combine_first
In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
bar other_data
key
a 1.0 10
a 1.0 20
b 2.0 30
Or, using map
In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
other_data bar
key
a 10 1
a 20 1
b 30 2
So I created two dataframes from existing CSV files, both consisting of entirely numbers. The second dataframe consists of an index from 0 to 8783 and one column of numbers and I want to add it on as a new column to the first dataframe which has an index consisting of a month, day and hour. I tried using append, merge and concat and none worked and then tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut on properly but all the values were entered as NaN instead of the numerical values that they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If they're the same row counts and you just want to tack it on the end, the indexes either need to match, or you need to just pass the underlying values. In the example below, columns 3 and 5 are the index matching & value versions, and 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
0 1 2
0 0.670812 0.500688 0.136661
1 0.185841 0.239175 0.542369
2 0.351280 0.451193 0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
0
0 0.638216
1 0.477159
2 0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
0 1 2 3 4 5
a 0.670812 0.500688 0.136661 0.638216 NaN 0.638216
b 0.185841 0.239175 0.542369 0.477159 NaN 0.477159
c 0.351280 0.451193 0.436108 0.205981 NaN 0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
I searched archive, but did not find what I wanted (probably because I don't really know what key words to use)
Here is my problem: I have a bunch of dataframes need to be merged; I also want to update the values of a subset of columns with the sum across the dataframes.
For example, I have two dataframes, df1 and df2:
df1=pd.DataFrame([ [1,2],[1,3], [0,4]], columns=["a", "b"])
df2=pd.DataFrame([ [1,6],[1,4]], columns=["a", "b"])
a b a b
0 1 2 0 1 5
1 1 3 2 0 6
2 0 4
after merging, I'd like to have the column 'b' updated with the sum of matched records, while column 'a' should be just like df1 (or df2, don't really care) as before:
a b
0 1 7
1 1 3
2 0 10
Now, expand this to merging three or more data frames.
Are there straightforward, build-in tricks to do this? or I need to process one by one, line by line?
===== Edit / Clarification =====
In the real world example, each data frame may contain indexes that are not in the other data frames. In this case, the merged data frame should have all of them and update the shared entries/indexes with sum (or some other operation).
Only partial, not complete solution yet. But the main point is solved:
df3 = pd.concat([df1, df2], join = "outer", axis=1)
df4 = df3.b.sum(axis=1)
df3 will have two 'a' columns, and two 'b' columns. the sum() function on df3.b add two 'b' columns and ignore NaNs. Now df4 has column 'b' with sum of df1 and df2's 'b' columns, and all the indexes.
did not solve the column 'a' though. In my real case, there are quite few number of NaN in df3.a , while others in df3.a should be the same. I haven't found a straightforward way to make a column 'a' in df4 and fill value with non-NaN. Now searching for a "count" function to get occurance of elements in rows of df3.a (imagine it has a few dozens column 'a').
By grouping two columns I made some changes.
I generated a file using python, it resulted in 2 duplicate columns. How to remove duplicate columns from a dataframe?
It's probably easiest to use a groupby (assuming they have duplicate names too):
In [11]: df
Out[11]:
A B B
0 a 4 4
1 b 4 4
2 c 4 4
In [12]: df.T.groupby(level=0).first().T
Out[12]:
A B
0 a 4
1 b 4
2 c 4
If they have different names you can drop_duplicates on the transpose:
In [21]: df
Out[21]:
A B C
0 a 4 4
1 b 4 4
2 c 4 4
In [22]: df.T.drop_duplicates().T
Out[22]:
A B
0 a 4
1 b 4
2 c 4
Usually read_csv will usually ensure they have different names...
Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442
This is the best I found so far.
remove = []
cols = df.columns
for i in range(len(cols)-1):
v = df[cols[i]].values
for j in range(i+1,len(cols)):
if np.array_equal(v,df[cols[j]].values):
remove.append(cols[j])
df.drop(remove, axis=1, inplace=True)
https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code
It's already answered here python pandas remove duplicate columns.
Idea is that df.columns.duplicated() generates boolean vector where each value says whether it has seen the column before or not. For example, if df has columns ["Col1", "Col2", "Col1"], then it generates [False, False, True]. Let's take inversion of it and call it as column_selector.
Using the above vector and using loc method of df which helps in selecting rows and columns, we can remove the duplicate columns. With df.loc[:, column_selector] we can select columns.
column_selector = ~df.columns().duplicated()
df = df.loc[:, column_selector]
I understand that this is an old question, but I recently had this same issue and none of these solutions worked for me, or the looping suggestion seemed a bit overkill. In the end, I simply found the index of the undesirable duplicate column and dropped that column index. So provided you know the index of the column this will work (which you could probably find via debugging or print statements):
df.drop(df.columns[i], axis=1)
The fast solution for dataset without NANs:
share = 0.05
dfx = df.sample(int(df.shape[0]*share))
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]