Adding values in a new column based on indexes with pandas in Python

I'm just getting into pandas, and I am trying to add a new column to an existing dataframe.
I have two dataframes, where the index of one dataframe links to a column in the other. Where these values are equal, I need to put the value of another column from the source dataframe into a new column of the destination dataframe.
The code section below illustrates what I mean. The commented part is what I need as output.
I guess I need the .loc[] function.
Another, minor, question: is it bad practice to have a non-unique index?
import pandas as pd

d = {'key': ['a', 'b', 'c'],
     'bar': [1, 2, 3]}
d2 = {'key': ['a', 'a', 'b'],
      'other_data': ['10', '20', '30']}

df = pd.DataFrame(d)
df2 = pd.DataFrame(data=d2)
df2 = df2.set_index('key')
print(df2)
##     other_data  new_col
## key
## a           10        1
## a           20        1
## b           30        2

Use rename, passing a Series as the index mapper:
df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print(df2)
     other_data  new
key
a            10    1
a            20    1
b            30    2
Or use map:
df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print(df2)
     other_data  new
key
a            10    1
a            20    1
b            30    2
If you want better performance, it is best to avoid duplicates in the index. Some functions, such as reindex, also fail on a duplicated index.
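To illustrate the reindex caveat, here is a minimal sketch (the exact error message varies with the pandas version):
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Reindexing over a duplicated axis raises a ValueError.
try:
    s.reindex(['a', 'b'])
except ValueError as err:
    print(err)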

You can use join:
df2.join(df.set_index('key'))
     other_data  bar
key
a            10    1
a            20    1
b            30    2
One way to rename the column in the process:
df2.join(df.set_index('key').bar.rename('new'))
     other_data  new
key
a            10    1
a            20    1
b            30    2
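For completeness, here is a merge-based variant of my own (not part of the answer above) that yields the same result while keeping 'key' as the index:
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'other_data': ['10', '20', '30']}).set_index('key')

# Merge on the key column, then restore it as the index.
out = (df2.reset_index()
          .merge(df.rename(columns={'bar': 'new'}), on='key', how='left')
          .set_index('key'))
print(out)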

Another, minor, question: is it bad practice to have a non-unique index?
It is not great practice, but it depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This raises the question: if your Index has duplicate values, does it really need to be an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a Cartesian product, which is probably not what you're looking for.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2),
                  index=2 * list('abcde'))
df2 = df.rename(columns={0: 'a', 1: 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
         0        1        a        b
a  0.73737  1.49073  0.73737  1.49073
a  0.73737  1.49073 -0.25562 -2.79859
a -0.25562 -2.79859  0.73737  1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583  1.17583 -0.93583  1.17583
b -0.93583  1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583  1.17583
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post.
When the index is unique, pandas uses a hash table to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N). When the index is randomly ordered, pandas needs to check all the keys in the index, O(N).
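A rough way to see the difference yourself (a sketch only; absolute timings depend on your pandas version and hardware):
import numpy as np
import pandas as pd
from timeit import timeit

n = 100_000
unique = pd.Series(np.arange(n), index=np.arange(n))   # unique index

rng = np.random.default_rng(0)
labels = rng.integers(0, n // 10, size=n)              # duplicated, unsorted labels
dupes = pd.Series(np.arange(n), index=labels)

# Hash-table lookup on the unique index vs. a scan over the duplicated one.
print(timeit(lambda: unique.loc[n // 2], number=100))
print(timeit(lambda: dupes.loc[labels[0]], number=100))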
A word on .loc
Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10, 2),
                  index=2 * list('abcde'))
print(df.loc['a'])
         0        1
a  0.73737  1.49073
a -0.25562 -2.79859

With the help of .loc:
df2['new'] = df.set_index('key').loc[df2.index]
Output:
     other_data  new
key
a            10    1
a            20    1
b            30    2
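A small variation I would suggest (my own hedge, not from the answer above): select the 'bar' column explicitly, so you assign a plain array rather than a one-column DataFrame, which is more robust across pandas versions:
df2['new'] = df.set_index('key').loc[df2.index, 'bar'].to_numpy()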

Using combine_first:
In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
     bar other_data
key
a    1.0         10
a    1.0         20
b    2.0         30
Or, using map:
In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
     other_data  bar
key
a            10    1
a            20    1
b            30    2
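Note that bar comes back as float (1.0) because the unmatched row for 'c' is NaN before dropna(). A small sketch to restore the integer dtype afterwards (assuming no NaNs remain):
out = df2.combine_first(df.set_index('key')).dropna()
out['bar'] = out['bar'].astype(int)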

Related

Joining 101 columns from a dictionary of dataframes

For the love of God! I have 101 single-column features and I just want to join, or merge, or concatenate them so they all share the index of the first frame. I have all the frames in a dict already! I thought that would be the hard part.
Below I've done manually what I'd like to do. What I'd like to do is loop through the dict and get all 101 columns.
a = ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/1byd.xls']
b = ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/2byd.xls']
c = ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/3byd.xls']

d = a.join(b['Value'], lsuffix='_caller')
f = d.join(c['Value'], lsuffix='_caller')
f
You will need to:

1. Create a first flag and set it to True. The first time we iterate through our dict() we don't have anything to merge our dataframe with, so we just assign that value to a variable.
2. Set the first flag to False, so that on later iterations we merge the dataframes together.
3. Call df.merge() with the left_index and right_index parameters set to True, so that the join happens on the indexes.

Below is a sample code.
Input
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})
df1 = pd.DataFrame({'col2': [11, 12, 13, 14]})
df2 = pd.DataFrame({'col3': [111, 112, 113, 114]})

d = {'df': df, 'df1': df1, 'df2': df2}

first = True
for key, value in d.items():
    if first:
        n = value
        first = False
    else:
        n = n.merge(value, left_index=True, right_index=True)

n.head()
Output
   col1  col2  col3
0     1    11   111
1     2    12   112
2     3    13   113
3     4    14   114
See the merge() documentation for more information.
I would like to add that, if you want to keep the keys of the dictionary as the column headers of the final dataframe, you just need to add this at the end:
n.columns=d.keys()
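If all the frames share the same index, a one-call alternative (my own suggestion, not from the answer above) is pd.concat along axis=1:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})
df1 = pd.DataFrame({'col2': [11, 12, 13, 14]})
df2 = pd.DataFrame({'col3': [111, 112, 113, 114]})
d = {'df': df, 'df1': df1, 'df2': df2}

# Concatenate every frame in the dict side by side, aligned on the index.
n = pd.concat(list(d.values()), axis=1)
print(n.head())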

Reindexing a pandas DataFrame using a dict (python3)

Is there a way, without the use of loops, to reindex a DataFrame using a dict? Here is an example:
df = pd.DataFrame([[1,2], [3,4]])
dic = {0:'first', 1:'second'}
I want to apply something efficient to df for obtaining:
        0  1
first   1  2
second  3  4
Speed is important, as the index in the actual DataFrame I am dealing with has a huge number of unique values. Thanks
You need the rename function:
df.rename(index=dic)
#         0  1
# first   1  2
# second  3  4
(I modified dic to dic = {0: 'first', 1: 'second'} to get these results.)
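If speed is a concern, another option (an assumption on my part; benchmark it on your own data) is to map the index labels in place, which avoids building a renamed copy of the frame:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
dic = {0: 'first', 1: 'second'}

# Replace the index labels directly; labels missing from dic become NaN.
df.index = df.index.map(dic)
print(df)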

Number of rows changes even after `pandas.merge` with `left` option

I am merging two data frames using pandas.merge. Even after specifying the how='left' option, I found that the merged data frame has more rows than the original. Why does this happen?
panel = pd.read_csv(file1, encoding='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on='name2', how='left')
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like there being more than one row in the right frame whose 'name2' matches the key you have set for the left. Using the how='left' option with pandas.DataFrame.merge() only means:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
   A    B
0  a  AAA
1  b  BBA
2  c  CCF
and then another DF that looks like this (notice that there is more than one entry matching your desired key from the left):
In [360]: df_3
Out[360]:
  key  value
0   a      1
1   a      2
2   b      3
3   a      4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
   A    B  key  value
0  a  AAA    a    1.0
1  a  AAA    a    2.0
2  a  AAA    a    4.0
3  b  BBA    b    3.0
4  c  CCF  NaN    NaN
This happened even though I merged with how='left'. As you can see above, there was simply more than one row to merge, and the resulting pd.DataFrame has in fact more rows than the pd.DataFrame on the left.
I hope this helps!
The problem of rows doubling after a merge() (of any type, e.g. 'inner' or 'left') is usually caused by duplicates in either of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
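To see which side introduces the extra rows before merging, you can count duplicated join keys first (a quick sketch reusing the frames from the question):
# How many duplicated join keys does each side carry?
print(panel['Candidate_u'].duplicated().sum())
print(prof_2000['name2'].duplicated().sum())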
If you do not have any duplicates, as discussed in the answer above, you should double-check the names of the removed entries. In my case, I discovered that the names of the removed entries were inconsistent between df1 and df2, and I solved the problem with:
df1["col1"] = df2["col2"]

Added column to existing dataframe but entered all numbers as NaN

So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried using append, merge and concat and none of them worked, so I tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut properly, but all the values were entered as NaN instead of the numerical values they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If the row counts are the same and you just want to tack it on the end, the indexes either need to match, or you need to pass just the underlying values. In the example below, columns 3 and 5 are the index-matching and raw-values versions, and column 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
          0         1         2
0  0.670812  0.500688  0.136661
1  0.185841  0.239175  0.542369
2  0.351280  0.451193  0.436108

In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
          0
0  0.638216
1  0.477159
2  0.205981

In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
          0         1         2         3   4         5
a  0.670812  0.500688  0.136661  0.638216 NaN  0.638216
b  0.185841  0.239175  0.542369  0.477159 NaN  0.477159
c  0.351280  0.451193  0.436108  0.205981 NaN  0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
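Here is a minimal sketch of that merge fallback, with made-up frames and a hypothetical 'key' column:
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'c'], 'power': [10.0, 30.0]})

# Join on an explicit key column instead of relying on positional order.
print(left.merge(right, on='key', how='left'))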

How do I find duplicate indices in a DataFrame?

I have a pandas DataFrame with a multi-level index ("instance" and "index"). I want to find all the first-level ("instance") index values which are non-unique and to print out those values.
My frame looks like this:
                A
instance index
a        1     10
         2     12
         3      4
b        1     12
         2      5
         3      2
b        1     12
         2      5
         3      2
I want to find "b" as the duplicate 0-level index and print its value ("b") out.
You can use the get_duplicates() method:
>>> df.index.get_level_values('instance').get_duplicates()
[0, 1]
(In my example data 0 and 1 both appear multiple times.)
The get_level_values() method can accept a label (such as 'instance') or an integer and retrieves the relevant part of the MultiIndex.
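Note that get_duplicates() was deprecated and later removed in newer pandas versions; an equivalent sketch for recent releases (assuming the MultiIndex from the question) is:
idx = df.index.get_level_values('instance')
print(idx[idx.duplicated()].unique())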
Assuming that your df has an index made of 'instance' and 'index' you could do this:
df1 = df.reset_index().pivot_table(index=['instance','index'], values='A', aggfunc='count')
df1[df1 > 1].index.get_level_values(0).drop_duplicates()
Which yields:
Index([u'b'], dtype='object')
Adding .values at the end (.drop_duplicates().values) will make an array:
array(['b'], dtype=object)
Or the same with one line using .groupby:
df[df.groupby(level=['instance','index']).count() > 1].dropna().index.get_level_values(0).drop_duplicates()
This should give you the whole row which isn't quite what you asked for but might be close enough:
df[df.index.get_level_values('instance').duplicated()]
You want the duplicated method (here applied to the 'instance' index level):
df.index.get_level_values('instance').duplicated()
