I have a DataFrame with MultiIndex columns. Suppose it is this:
index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame({'col': np.arange(1.0, 5.0)}, index=index)
df = df.unstack(1)
(I know this definition could be more direct). I now want to set a new level 0 column based on a DataFrame. For example
df['col2'] = df['col'].applymap(lambda x: int(x < 3))
This does not work. The only method I have found so far is to add each column separately:
Pandas: add a column to a multiindex column dataframe
, or some sort of convoluted joining process.
The desired result is a new level 0 column 'col2' with two level 1 subcolumns: 'a' and 'b'
Any help would be much appreciated, Thank you.
I believe you need a solution with no unstack and stack - filter by boolean indexing, rename values to avoid duplicates, and finally use DataFrame.append:
df2 = df[df['col'] < 3].rename({'one':'one1', 'two':'two1'}, level=0)
print (df2)
        col
one1 a  1.0
     b  2.0
df = df.append(df2)
print (df)
        col
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
one1 a  1.0
     b  2.0
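Note that DataFrame.append was removed in pandas 2.0. For the columns question as asked (a new level-0 column 'col2' with sub-columns 'a' and 'b'), a minimal sketch using pd.concat along the columns (the helper name new is mine):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame({'col': np.arange(1.0, 5.0)}, index=index)
df = df.unstack(1)

# build the new block, stamp 'col2' as its level-0 label, and concat on columns
new = (df['col'] < 3).astype(int)
new.columns = pd.MultiIndex.from_product([['col2'], new.columns])
df = pd.concat([df, new], axis=1)
# df now has level-0 columns 'col' and 'col2', each with sub-columns 'a' and 'b'
```

pd.MultiIndex.from_product puts 'col2' above the existing 'a'/'b' sub-columns, so no stack/unstack round-trip or per-column assignment is needed.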
I have two dataframes with meaningless indexes but carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant, but what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["Second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnyThingAtAll" and "SeriouslyIdontCare" I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your Series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
    {
        'x': ['a','b'],
        'First': [1,3],
    })
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
    {
        'x': ['c','d'],
        'Second': [2, 4],
    })
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index controls whether the output keeps the original labels along the concatenation axis; if True, they are discarded and replaced by 0 to n-1. With axis=1 that axis is the columns, which is why you see the column headers 0 and 1 in your result while the row index is left alone.
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4
I have a dataframe such as
multiindex1 = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df1 = pd.DataFrame(np.random.randn(5, 3), index=multiindex1)
multiindex2 = pd.MultiIndex.from_product([['s'], np.arange(1, 6)])
df2 = pd.DataFrame(np.random.randn(5, 3), index=multiindex2)
multiindex3 = pd.MultiIndex.from_product([['d'], np.arange(2, 7)])
df3 = pd.DataFrame(np.random.randn(5, 3), index=multiindex3)
df = pd.concat([df1, df2, df3])
df.index.names = ['contract', 'index']
df.columns = ['z', 'x', 'c']
>>>
z x c
contract index
a 3 0.354879 0.206557 0.308081
4 0.822102 -0.425685 1.973288
5 -0.801313 -2.101411 -0.707400
6 -0.740651 -0.564597 -0.975532
7 -0.310679 0.515918 -1.213565
s 1 -0.175135 0.777495 0.100466
2 2.295485 0.381226 -0.242292
3 -0.753414 1.172924 0.679314
4 -0.029526 -0.020714 1.546317
5 0.250066 -1.673020 -0.773842
d 2 -0.602578 -0.761066 -1.117238
3 -0.935758 0.448322 -2.135439
4 0.808704 -0.604837 -0.319351
5 0.321139 0.584896 -0.055951
6 0.041849 -1.660013 -2.157992
Now I want to replace the index level 'index' with the column c. That is to say, I want the result as
z x
contract c
a 0.308081 0.354879 0.206557
1.973288 0.822102 -0.425685
-0.707400 -0.801313 -2.101411
-0.975532 -0.740651 -0.564597
-1.213565 -0.310679 0.515918
s 0.100466 -0.175135 0.777495
-0.242292 2.295485 0.381226
0.679314 -0.753414 1.172924
1.546317 -0.029526 -0.020714
-0.773842 0.250066 -1.673020
d -1.117238 -0.602578 -0.761066
-2.135439 -0.935758 0.448322
-0.319351 0.808704 -0.604837
-0.055951 0.321139 0.584896
-2.157992 0.041849 -1.660013
I implemented it one way:
df.reset_index().set_index(['contract', 'c']).drop(['index'], axis=1)
But it seems there are some duplicate steps, because I manipulate the index three times. Is there a more elegant way to achieve that?
Try this
# convert column "c" into an index and remove "index" from index
df.set_index('c', append=True).droplevel('index')
Explanation:
Pandas' set_index method has an append argument that controls whether to append columns to the existing index; setting it to True appends column "c" as a new index level. The droplevel method removes an index level (it can remove a column level too, but it works on the index by default).
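A self-contained sketch of that one-liner on a cut-down version of the frame above:

```python
import numpy as np
import pandas as pd

multiindex = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df = pd.DataFrame(np.random.randn(5, 3), index=multiindex)
df.index.names = ['contract', 'index']
df.columns = ['z', 'x', 'c']

# move column 'c' into the index, then drop the old 'index' level
out = df.set_index('c', append=True).droplevel('index')
```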
Given two dataframes, df1 and df2, I want to take the last column of df2 and add it to df1 based on column 'a' they both have.
That is, for every row in df2, if df1['a'] contains it, then I want to add it to the new column. The rows of df1['a'] that aren't in df2['a'] get NaN. If there is a value in df2['a'] that isn't in df1['a'], we ignore it.
Additionally, while adding a column, I was hoping to update df1['b'] and df1['c'] to the values of df2['b'] and df2['c'].
For the first part, this is the best I've gotten:
df1 = df1.merge(df2, how='outer', on='a')
df1 = df1.drop_duplicates('a')
This needlessly creates duplicates and doesn't even update the existing columns.
Try using a left join:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6],'c':[7,8,9]})
df2 = pd.DataFrame({'a':[2,3,4],'b':[5,6,7],'c':[8,9,10],'new_column_from_df2':[11,12,13]})
df1['a'].to_frame().merge(df2, how='left', on='a')
Output:
Out[190]:
a b c new_column_from_df2
0 1 NaN NaN NaN
1 2 5.0 8.0 11.0
2 3 6.0 9.0 12.0
Note the last row of df2 being ignored/excluded because it is not in df1['a']. Columns 'b' and 'c' are "updated" with df2 values.
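If you also want df1 itself updated in place, keeping all of df1's rows, a sketch using set_index plus DataFrame.update, assuming the same toy frames (the helper name d2 is mine):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [2, 3, 4], 'b': [5, 6, 7], 'c': [8, 9, 10],
                    'new_column_from_df2': [11, 12, 13]})

# align both frames on 'a' so update and assignment match rows by key
df1 = df1.set_index('a')
d2 = df2.set_index('a')
df1.update(d2[['b', 'c']])                              # overwrite b/c where 'a' matches
df1['new_column_from_df2'] = d2['new_column_from_df2']  # NaN where no match
df1 = df1.reset_index()
```

DataFrame.update only fills in where the other frame has non-NaN values, so rows of df1 with no match in df2 keep their original 'b' and 'c'; rows of df2 whose 'a' is not in df1 are ignored.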
I'm searching and haven't found an answer to this question, can you perform a merge of pandas dataframes using OR logic? Basically, the equivalent of a SQL merge using "where t1.A = t2.A OR t1.A = t2.B".
I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.
As stated above, in cases where df1.col1 and df2.col1 match, this would work...
df = df1.merge(df2, on='col1', how='left')
However, they don't always have the same value and sometimes have the synonyms. I thought about creating df3 based on when df3.col1 was in df1.col1 OR df3.col1_alias was in df1.col1. Then, creating a single list of values from df3.col1 and df3.col1_alias (list1) and selecting df2 based on df2.col1 in list1. This would give me the rows from df2 I need but, that still wouldn't put me in position to merge df1 and df2 matching the appropriate rows. I think if there an OR merge option, I can step through this and make it work, but all of the following threw a syntax error:
df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))
and
df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')
and
df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')
and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?
I think I would do this as two merges:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])
In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")
In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))
In [15]: res
Out[15]:
A B C D
0 1 2 1.0 7.0
1 3 4 4.0 9.0
2 5 6 NaN NaN
As you can see this picks A = 1 -> D = 7 rather than B = 2 -> D = 8.
Note: For more extensibility (matching different columns) it might make sense to pull out a single column, although they're both the same in this example:
In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]
In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])
In [23]: res
Out[23]:
0 1.0
1 4.0
2 NaN
Name: C, dtype: float64
# Will this work?
df = pd.concat([df1.merge(df3, left_on='col1', right_on='col1', how='left'),
                df1.merge(df3, left_on='col1', right_on='col1_alias', how='left')])
df = pd.DataFrame({'a':[2,3,5], 'b':[1,2,3], 'c':[12,13,14]})
df.set_index(['a','b'], inplace=True)
display(df)
s = df.iloc[1]
# How to get 'a' and 'b' value from s?
It is annoying that once columns become indexes we cannot simply use df['colname'] to fetch their values.
Does this encourage using set_index(drop=False)?
When I print s I get
In [8]: s = df.iloc[1]
In [9]: s
Out[9]:
c 13
Name: (3, 2), dtype: int64
which has a and b in the name part, which you can access with:
s.name
Something else that you can do is
df.index.values
and specifically for your iloc[1]
df.index.values[1]
Does this help? Other than this I am not sure what you are looking for.
If you want to get the names "a" and "b":
df.index.names
gives:
FrozenList(['a', 'b'])
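Putting the two answers together, a short sketch that recovers both the level names and the row's index values from the frame in the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 5], 'b': [1, 2, 3], 'c': [12, 13, 14]})
df.set_index(['a', 'b'], inplace=True)

s = df.iloc[1]
# s.name holds the row's index tuple; df.index.names holds the level names
labels = dict(zip(df.index.names, s.name))
# labels == {'a': 3, 'b': 2}
```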