EDIT: I noticed that I simplified my problem too much. This is probably because I assumed that the proposed solutions would work in a similar way to my original brute-force solution. I changed the MultiIndex to better show my problem. My apologies to those who have already put effort into it, thank you so much!
I have a pandas dataframe that is multi-indexed. Let's say the column index has three levels and the second level contains the name of a color. I know that in each row, all columns that have blue in the index contain NaN except a single one, so it looks like this:
import pandas as pd
import numpy as np
iterables = [['bar', 'baz', 'foo', 'qux'], ["red", "blue", "green"], ['one', 'two']]
mi = pd.MultiIndex.from_product(iterables)
df = pd.DataFrame(np.random.randn(5, 24), columns=mi)
df[("bar", "blue","one")] = [2 , np.nan, np.nan, 3 , np.nan]
df[("baz", "blue","two")] = [np.nan, 4.4 , np.nan, np.nan, 5 ]
df[("qux", "blue","one")] = [np.nan, np.nan, 1 , np.nan, np.nan]
Output:
bar ... qux
red blue green ... red blue green
one two one two one two ... one two one two one two
0 0.046326 -0.999092 2.0 0.073113 0.958438 0.276653 ... -0.258202 -0.772636 NaN -0.639735 1.438262 -0.033578
1 0.257776 -2.499286 NaN 0.854263 -0.037380 -0.571258 ... 1.656198 -1.110911 NaN 0.757692 0.498118 1.070371
2 -0.314146 0.941367 NaN 0.265850 -0.153231 -1.092106 ... -0.208089 -0.363624 1.0 0.046457 -2.158373 0.572496
3 -1.198977 0.605490 3.0 -0.790985 0.000563 -0.958261 ... 1.339086 -1.057270 NaN -0.355639 1.050980 -1.727684
4 -0.562230 -1.721894 NaN 0.856543 -1.137364 1.185481 ... 0.986215 1.028128 NaN -0.264889 0.571484 -0.505340
Now I want to create a new dataframe that, for each row, contains the single non-NaN blue value together with the names of the other index levels at which that value was found.
word number blue
0 bar one 2.0
1 baz two 4.4
2 qux one 1.0
3 bar one 3.0
4 baz two 5.0
i.e. the word and number entries of the new dataframe should be the index levels at which the original dataframe had the non-NaN value, and the new blue column should contain those values.
I have a brute-force solution where I iterate over basically every entry, but my final dataframe will contain around 2000 columns, which will then take very long to run.
Select the blue columns with DataFrame.xs, reshape with DataFrame.stack, remove the first MultiIndex level with reset_index(level=0, drop=True), and finally convert the Series to a 2-column DataFrame with Series.rename_axis and Series.reset_index:
df = (df.xs('blue', axis=1, level=1)
        .stack()
        .reset_index(level=0, drop=True)
        .rename_axis('number')
        .reset_index(name='blue'))
print (df)
number blue
0 1 2.0
1 2 4.4
2 3 1.0
3 1 3.0
4 2 5.0
EDIT: For the edited question the solution is similar; first keep only the columns containing at least one NaN with DataFrame.isna and DataFrame.any via DataFrame.loc, then use DataFrame.stack on both remaining MultiIndex levels:
df1 = (df.loc[:, df.isna().any()]
         .xs('blue', axis=1, level=1)
         .stack([0, 1])
         .reset_index(level=0, drop=True)
         .rename_axis(('word', 'number'))
         .reset_index(name='blue'))
print (df1)
word number blue
0 bar one 2.0
1 baz two 4.4
2 qux one 1.0
3 bar one 3.0
4 baz two 5.0
You could stack a single level, keep only the blue column, and drop NaN values:
resul = df.stack(level=0)['blue'].reset_index(level=1).rename(columns={'level_1': 'number'}).dropna()
It gives:
number blue
0 1 2.0
1 2 4.4
2 3 1.0
3 1 3.0
4 2 5.0
For the edited question, it looks like you want to process only the columns containing NaN values and keep only the non-NaN ones. This could do the trick:
df.loc[:,df.isna().any()].stack(level=[0,2])[['blue']].dropna()
It gives:
blue
0 bar one 2.0
1 baz two 4.4
2 qux one 1.0
3 bar one 3.0
4 baz two 5.0
NB: if you keep the other columns, you will get many more results for the blue values...
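To illustrate that caveat (a quick sketch using the original df from the question), dropping the NaN filter keeps every non-NaN blue cell, so many more rows survive instead of one per row of the frame:
# Without the loc filter, every non-NaN blue cell survives the dropna
df.stack(level=[0, 2])[['blue']].dropna()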
You may try chaining two stack calls:
df.stack().stack().reset_index()
level_0 level_1 level_2 0
0 0 blue 1 2.2
1 1 blue 2 5.0
2 2 blue 1 44.0
3 3 blue 3 3.3
4 4 blue 1 1.0
5 5 blue 3 1.0
Related
I have a dataframe with duplicate columns (number not known a priori) like this example:
   a    a  a  b  b
0  1    1  1  1  1
1  1  nan  1  1  1
I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.
I have tried this code:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()
The result I get is as follows, but it does not return NaN in the second row of column 'a'.
   a  b
0  3  2
1  2  2
In the documentation of pandas.DataFrame.sum there is the skipna parameter, which might suit my case. But I am using pandas.core.groupby.GroupBy.sum, which does not have this parameter; it does have min_count, which does what I want, but the required count is not known in advance and would be different for each group of duplicate columns.
For example, a min_count=3 solves the problem for column 'a', but obviously returns NaN on the whole of column 'b'.
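To illustrate that point (a small sketch with the example df; min_count is a real parameter of GroupBy.sum):
# min_count=3 works for 'a' (three duplicates) but blanks out 'b' (only two duplicates)
df.groupby(axis=1, level=0).sum(min_count=3)
#      a   b
# 0  3.0 NaN
# 1  NaN NaN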
The result I want to achieve is:
     a  b
0    3  2
1  nan  2
One workaround might be to use apply so that DataFrame.sum can be called with skipna=False:
df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))
Output:
a b
0 3.0 2.0
1 NaN 2.0
Another possible solution:
cols, ldf = df.columns.unique(), len(df)
pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (ldf, len(cols))),
    columns=cols)
Output:
a b
0 3.0 2.0
1 NaN 2.0
What is the proper way to go from a df like this:
>>>df
treatment mutation_1 mutation_2 resistance frequency
0 a hpc abc 1.2 3
1 a awd jda 2.1 4
2 b abc hpc 1.2 5
To this:
mutation_1 mutation_2 resistance frequency_a frequency_b
0 hpc abc 1.2 3 5
1 awd jda 2.1 4 0
Please notice that the order in columns a & b does not matter.
Edit: Changed column names in my example for clarity
Edit2: I added the resistance column which is important for me to keep.
First you want to sort the columns of interest horizontally, and pivot:
cols = ['mutation_1', 'mutation_2']
df[cols] = np.sort(df[cols], axis=1)

(df.pivot_table(index=cols,
                columns='treatment',
                values='frequency')
   .rename(columns=lambda x: f'frequency_{x}')  # rename as needed
   .reset_index())                              # reset index to columns
Output:
treatment mutation_1 mutation_2 frequency_a frequency_b
0 abc hpc 3.0 5.0
1 awd jda 4.0 NaN
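If, per Edit2, the resistance column must be kept as well, one option is to add it to the pivot index. A sketch, assuming resistance is constant within each sorted mutation pair:
(df.pivot_table(index=cols + ['resistance'],
                columns='treatment',
                values='frequency')
   .rename(columns=lambda x: f'frequency_{x}')
   .reset_index())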
I have a single series with 2 columns that looks like
1 5.3
2 2.5
3 1.6
4 3.8
5 2.8
...and so on. I would like to take this series and break it into 6 columns of different sizes. So (for example) the first column would have 30 items, the next 31, the next 28, and so on. I have seen plenty of examples for same-sized columns but have not seen a way to make multiple custom-sized columns.
Based on the comments, you can try using the index of the series to fill your dataframe:
s = pd.Series([5, 2, 1, 3, 2])
df = pd.DataFrame([], index=s.index)
df['col1'] = s.loc[:2]
df['col2'] = s.loc[3:3]
df['col3'] = s.loc[4:]
Result:
col1 col2 col3
0 5.0 NaN NaN
1 2.0 NaN NaN
2 1.0 NaN NaN
3 NaN 3.0 NaN
4 NaN NaN 2.0
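If the real split sizes are known up front (e.g. 30, 31, 28, ...), the same idea can be generalised. A rough sketch, assuming the sizes are given as a list and keeping the staircase layout from above:
import numpy as np

sizes = [3, 1, 1]  # e.g. [30, 31, 28, ...] for the real series
bounds = np.cumsum([0] + sizes)
df = pd.DataFrame(index=s.index)
for i, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:]), start=1):
    df[f'col{i}'] = s.iloc[start:stop]  # rows outside the slice stay NaN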
Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber.
However, if I simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])
then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.
In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.
So basically this is an either/or match condition (pseudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
then merge
I hope my question makes sense...
Any help is really really appreciated!
As suggested, I looked into this post:
Python pandas merge with OR logic
But it is not completely the same issue, I think, as the OP from that post has a mapping file and so can simply do 2 merges to solve this. I don't have a mapping file; rather, I have two dfs with the same key columns (ShipNumber, TrackNumber).
Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).
df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})
df1 df2
A B A B
0 3 7 0 1 4
1 2 8 1 5 1
2 1 9 2 6 8
3 4 5 3 4 5
With these data frames we should see:
df1.loc[0] matches A on df2.loc[0]
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
1 4.0 NaN NaN NaN 5.0 5.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.
dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
I would suggest this alternate way of doing the merge; it seems easier to me.
table1["id_to_be_merged"] = table1.apply(
lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
You can add the same column to table2 as well if needed, and then use it in left_on or right_on based on your requirement.
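For example, a rough sketch of that suggestion (same hypothetical column name, merging on the new key):
table2["id_to_be_merged"] = table2.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
merged = table1.merge(table2, how="left", on="id_to_be_merged")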
(Please note that there's a question Pandas: group by and Pivot table difference, but this question is different.)
Suppose you start with a DataFrame
df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]:
a b val
0 x 0 0
1 x 1 1
2 y 0 2
3 y 1 3
Now suppose you want to make the index a, the columns b, the values in a cell val, and specify what to do if there are two or more values in a resulting cell:
b 0 1
a
x 0 1
y 2 3
Then you can do this either through
df.val.groupby([df.a, df.b]).sum().unstack()
or through
pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')
So it seems to me that there's a simple correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.
Is there something I've missed?
Are there operations that can be performed using one and not the other?
Are there, perhaps, operations easier to perform using one over the other?
If not, why not deprecate pivot_table? groupby seems much more general.
If I understood the source code for pivot_table(index, columns, values, aggfunc) correctly, it's a tuned-up equivalent of:
df.groupby(index + columns).agg(aggfunc).unstack(columns)
plus:
margins (subtotals and grand totals, as @ayhan has already said; see the small sketch after the demo below)
pivot_table() also removes extra multi-levels from columns axis (see example below)
convenient dropna parameter: Do not include columns whose entries are all NaN
Demo (I took this DF from the pivot_table() docstring):
In [40]: df
Out[40]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
Pay attention to the extra top-level column D in the groupby version:
In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
D
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
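And for the margins point from the list above, a minimal sketch with the same df (margins=True adds 'All' subtotal rows and columns):
df.pivot_table(index=['A', 'B'], columns='C', values='D', aggfunc='sum', margins=True)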
why not deprecate pivot_table? groupby seems much more general.
IMO, because it's very easy to use and very convenient!
;)