I would like to add columns into a Pandas multiindex dataframe, which will contain the result of an operation performed on other columns.
I have a dataframe similar to this one:
first  bar     baz
second one two one two
A        5   2   9   2
B        6   4   7   6
C        5   4   5   1
Now, for each group in the dataframe, I'd like to add a column "three" which equals column "one" minus column "two":
first  bar           baz
second one two three one two three
A        5   2     3   9   2     7
B        6   4     2   7   6     1
C        5   4     1   5   1     4
In reality my dataframe is much larger. I'm struggling to find the answer to this (hopefully) easy question. Any suggestions are appreciated.
Use DataFrame.xs to select the one and two levels and subtract them, then create a MultiIndex for the columns with MultiIndex.from_product:
df1 = df.xs('one', axis=1, level=1) - df.xs('two', axis=1, level=1)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['three']])
print (df1)
    bar   baz
  three three
A     3     7
B     2     1
C     1     4
Then concat to the original and, to change the ordering, reindex with a helper MultiIndex:
mux = pd.MultiIndex.from_product([['bar','baz'], ['one','two','three']],
names=df.columns.names)
df = pd.concat([df, df1], axis=1).reindex(columns=mux)
print (df)
first  bar           baz
second one two three one two three
A        5   2     3   9   2     7
B        6   4     2   7   6     1
C        5   4     1   5   1     4
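The steps above can be put together into one runnable sketch. The sample frame is rebuilt from the question, and the target column order is derived from the existing first level instead of hard-coding 'bar' and 'baz'. The names diff, groups, order and result are only illustrative.

```python
import pandas as pd

# Rebuild the sample frame from the question.
columns = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(
    [[5, 2, 9, 2], [6, 4, 7, 6], [5, 4, 5, 1]],
    index=["A", "B", "C"],
    columns=columns,
)

# Subtract the "two" cross-section from the "one" cross-section.
diff = df.xs("one", axis=1, level="second") - df.xs("two", axis=1, level="second")
diff.columns = pd.MultiIndex.from_product(
    [diff.columns, ["three"]], names=df.columns.names
)

# Derive the target column order from the data (illustrative choice).
groups = df.columns.get_level_values("first").unique()
order = pd.MultiIndex.from_product(
    [groups, ["one", "two", "three"]], names=df.columns.names
)
result = pd.concat([df, diff], axis=1).reindex(columns=order)
print(result)
```

Deriving groups from the frame should help when the real dataframe has many first-level groups, as the question suggests.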
Create the DataFrame to append by using a MultiIndex:
s=pd.DataFrame([[1,2],[2,3],[3,4]],columns=pd.MultiIndex.from_arrays([['bar','baz'],['three','three']]))
s
Out[458]:
    bar   baz
  three three
0     1     2
1     2     3
2     3     4
Then concatenate:
yourdf=pd.concat([df,s],axis=1).sort_index(level=0,axis=1)
If the order matters, you can reindex, or consider factorizing the level.
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number because I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
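As a self-contained sketch of the same idea, the snippet below wraps the .loc slice in a small helper. The name drop_between is hypothetical, not a pandas function. Label slices with .loc include both endpoints, so the columns named first and last are dropped too; this assumes the column labels are unique.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [3, 2, 1], "C": [3, 4, 5], "D": [6, 9, 8], "E": [0, 1, 4]}
)

def drop_between(frame, first, last):
    """Return a copy of frame without the columns from first through last (inclusive)."""
    # .loc label slicing is inclusive on both ends.
    to_drop = frame.loc[:, first:last].columns
    return frame.drop(columns=to_drop)

trimmed = drop_between(df, "B", "D")
print(trimmed.columns.tolist())
```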
Given this DF:
a b c d
1 2 1 4
4 3 4 2
foo bar foo yes
What is the best way to delete columns that have identical values but different names in a large pandas DF? For example:
a b d
1 2 4
4 3 2
foo bar yes
Column c was removed from the above dataframe because a and c were the same column but with different names. So far I tried:
df = df.iloc[:, ~df.columns.duplicated()]
However it is not clear to me how to check the row values inside the DF?
Use transpose as below:
df.T.drop_duplicates().T
I tried a straightforward approach: loop through the column names and compare each column with the remaining ones, using np.all for an exact match. This approach took only 336 ms.
import numpy as np

repeated_columns = []
for i, column in enumerate(df.columns):
    # Only compare against the columns to the right of the current one.
    r_columns = df.columns[i+1:]
    for r_c in r_columns:
        if np.all(df[column] == df[r_c]):
            repeated_columns.append(r_c)
new_columns = [x for x in df.columns if x not in repeated_columns]
df[new_columns]
It will give you the following output:
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
df.loc[:,~df.T.duplicated()]
a b d
0 1 2 4
1 4 3 2
2 foo bar yes
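The two transpose-based answers above can behave differently with respect to dtypes, which the following sketch tries to illustrate. The sample frame here is made up for the comparison (it is not the one from the question): transposing a frame with mixed column types and transposing back generally leaves object columns, while using the transpose only to build a boolean mask keeps the original columns untouched.

```python
import pandas as pd

# Hypothetical frame: "a" and "c" hold the same values under different names.
df = pd.DataFrame({"a": [1, 4, 7], "b": ["x", "y", "z"], "c": [1, 4, 7]})

# Variant 1: transpose, drop duplicate rows, transpose back.
via_transpose = df.T.drop_duplicates().T

# Variant 2: use the transpose only to compute a mask over the columns.
via_mask = df.loc[:, ~df.T.duplicated()]

print(via_transpose.dtypes)
print(via_mask.dtypes)
```

Both variants keep columns a and b here; the mask variant is usually the safer choice when the numeric dtypes matter afterwards.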
My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results. I don't even know if this works if you have multiple columns whose row order you need to preserve. I find that some of the entries become NaNs for some reason. I don't know how to fix it. I also tried data.join(A1, A2) but the compiler printed out that it couldn't join these two dataframes.
import pandas as pd
# Create data frames df and df1
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,0,11,12]},index=[0,3,5,6])
df1 = pd.DataFrame({'a':[13,14,15,16],'b':[17,18,19,20],'c':[21,22,23,24]},index=[1,2,4,7])
# Concatenate df and df1, then sort by index.
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job.)
df2 = pd.concat([df, df1])
print(df2.sort_index())
I have N dataframes with different numbers of columns. I want to get one dataframe with two columns, x and y, where x is the data from the columns of the input dataframes and y is the column name itself. I have many such dataframes that I need to concat (N is of the order of 10^2), so efficiency is a priority. A numpy way rather than a pandas way is also welcome.
For example,
df1:
one two
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
df2:
  three four
0   NaN
1  None    f
2          g
3     6    7
Final Output Dataframe:
x y
0 1 one
1 2 one
2 3 one
3 4 one
4 5 one
5 a two
6 b two
7 c two
8 d two
9 e two
10 6 three
11 f four
12 g four
13 7 four
Note: I'm ignoring empty strings, NaNs and Nones in the final dataframe.
If I understand correctly, you can use melt() before concatenating:
final = (pd.concat([df1.melt(), df2.dropna().melt()])
           .rename(columns={'variable': 'y', 'value': 'x'})
           .reindex(['x', 'y'], axis=1))
print(final)
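One thing to watch with the snippet above: dropna() before melt() removes whole rows, so a row with a missing value in one column also loses its valid values in the others, and empty strings are not removed at all. A possible variant, sketched below, melts first and filters afterwards. The helper name stack_frames is hypothetical, and the layout of df2 (which cells are empty strings) is an assumption reconstructed from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"one": [1, 2, 3, 4, 5], "two": list("abcde")})
# Assumed layout of df2: blanks in the question are treated as empty strings.
df2 = pd.DataFrame({"three": [np.nan, None, "", 6], "four": ["", "f", "g", 7]})

def stack_frames(frames):
    """Melt each frame to (x, y) pairs, then drop NaN, None and empty strings."""
    long = pd.concat(
        [f.melt(var_name="y", value_name="x") for f in frames],
        ignore_index=True,
    )
    # Filter per value, not per row, so valid neighbours are kept.
    keep = long["x"].notna() & (long["x"] != "")
    return long.loc[keep, ["x", "y"]].reset_index(drop=True)

final = stack_frames([df1, df2])
print(final)
```

Since stack_frames takes any list of frames, the same call should work for the N dataframes mentioned in the question.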
I have a DataFrame with two columns "A" and "B".
A B
0 foo one
1 bar one
2 foo two
3 bar one
4 foo two
5 bar two
6 foo one
7 foo one
8 xyz one
For each group in "A", I'm trying to get the count of each value of "B", i.e. each sub-group of B, but aggregated on the grouping of "A".
The result should look like this:
A B countOne countTwo
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
I have tried several approaches to no avail. So far I'm using this approach:
A_grouped = df.groupby(['A', 'B'])['A'].count()
A_grouped_ones = A_grouped[:,'one']
A_grouped_twos = A_grouped[:,'two']
df['countOne'] = df['A'].map(lambda a: A_grouped_ones[a] if a in A_grouped_ones else 0)
df['countTwo'] = df['A'].map(lambda a: A_grouped_twos[a] if a in A_grouped_twos else 0)
However, this seems horribly inefficient to me. Is there a better solution?
You can use unstack with add_prefix to build a new DataFrame, then join it to the original:
df1 = df.groupby(['A', 'B'])['A'].count().unstack(fill_value=0).add_prefix('count_')
print (df1)
B count_one count_two
A
bar 2 1
foo 3 2
xyz 1 0
df = df.join(df1, on='A')
print (df)
A B count_one count_two
0 foo one 3 2
1 bar one 2 1
2 foo two 3 2
3 bar one 2 1
4 foo two 3 2
5 bar two 2 1
6 foo one 3 2
7 foo one 3 2
8 xyz one 1 0
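For comparison, here is a sketch of a different technique that is not part of the answer above: pd.crosstab builds the same table of counts in one call, which can then be joined in the same way. The frame is rebuilt from the question; counts and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo", "xyz"],
    "B": ["one", "one", "two", "one", "two", "two", "one", "one", "one"],
})

# Cross-tabulate A against B; missing combinations are filled with 0.
counts = pd.crosstab(df["A"], df["B"]).add_prefix("count_")

# Attach the per-group counts to every row of the original frame.
result = df.join(counts, on="A")
print(result)
```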
Another alternative is to use size:
df1 = df.groupby(['A', 'B']).size().unstack(fill_value=0).add_prefix('count_')
The difference is that size counts every row in a group, including rows with NaN values, while count skips NaN values.
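A small sketch of that difference, using a made-up frame with a hypothetical value column C that contains one NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "foo" has one missing value in column C.
df = pd.DataFrame({"A": ["foo", "foo", "bar"], "C": [1.0, np.nan, 2.0]})

sizes = df.groupby("A").size()          # number of rows per group
counts = df.groupby("A")["C"].count()   # number of non-NaN values of C per group

print(sizes)
print(counts)
```

In the question's data the counted column has no missing values, so both approaches should give the same table there.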