The problem is simple and so must be solution but I am not able to find it.
I want to find which row and column in Pandas DataFrame has minimum value and how much is it.
I have tried following code (in addition to various combinations):
df = pd.DataFrame(data=[[4,5,6],[2,1,3],[7,0,5],[2,5,3]],
index = ['R1','R2','R3','R4'],
columns=['C1','C2','C3'])
print(df)
print(df.loc[df.idxmin(axis=0), df.idxmin(axis=1)])
The dataframe (df) being searched is:
C1 C2 C3
R1 4 5 6
R2 2 1 3
R3 7 0 5
R4 2 5 3
Output for the loc command:
C1 C2 C2 C1
R2 2 1 1 2
R3 7 0 0 7
R2 2 1 1 2
What I need is:
C2
R3 0
How can I get this simple result?
Use:
a, b = df.stack().idxmin()
print(df.loc[[a], [b]])
C2
R3 0
Another #John Zwinck solution working with missing values - use numpy.nanargmin:
df = pd.DataFrame(data=[[4,5,6],[2,np.nan,3],[7,0,5],[2,5,3]],
index = ['R1','R2','R3','R4'],
columns=['C1','C2','C3'])
print(df)
C1 C2 C3
R1 4 5.0 6
R2 2 NaN 3
R3 7 0.0 5
R4 2 5.0 3
#https://stackoverflow.com/a/3230123
ri, ci = np.unravel_index(np.nanargmin(df.values), df.shape)
print(df.iloc[[ri], [ci]])
C2
R3 0.0
I'd get the index this way:
np.unravel_index(np.argmin(df.values), df.shape)
This is much faster than df.stack().idxmin().
It gives you a tuple such as (2, 1) in your example. Pass that to df.iloc[] to get the value.
Or min+min+dropna+T+dropna+T:
>>> df[df==df.min(axis=1).min()].dropna(how='all').T.dropna().T
C2
R3 0.0
>>>
Related
I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
And what I need to get is the SEQ that's separated between the different BEG_GAP and END_GAP. I already have worked it out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have multiple.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv("..\path_to_the_csv.csv")
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don't really exist because they actually have another gap in the middle.
I.e what it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on, with the other sequences. As you can see, there are some slices that are not being generated and some that are wrong, because I don't know how to tell the code to have in mind all the gaps while analyzing the sequence.
All advice is appreciated, I hope I was clear!
Let's try defining a function and apply:
def truncate(data):
seq = data.SEQ.iloc[0]
ll = data.LEN.iloc[0]
return [seq[x:y] for x,y in zip([0]+list(data.END_GAP),
list(data.BEG_GAP)+[ll])]
(df.groupby('ID').apply(truncate)
.explode().reset_index(name='Seq_slice')
)
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA
In one line:
df.groupby('ID').agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max}).apply(lambda x: [x['SEQ'][b: e] for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])], axis=1).explode()
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA
I am trying to do the same as this answer, but with the difference that I'd like to ignore NaN in some cases. For instance:
#df1
c1 c2 c3
0 a b 1
1 a c 2
2 a nan 1
3 b nan 3
4 c d 1
5 d e 3
#df2
c1 c2 c4
0 a nan 1
1 a c 2
2 a x 1
3 b nan 3
4 z y 2
#merged output based on [c1, c2], dropping instances
#with `NaN` unless both dataframes have `NaN`.
c1 c2 c3 c4
0 a b 1 1 #c1,c2 from df1 because df2 has a nan in c2
1 a c 2 2 #in both
2 a x 1 1 #c1,c2 from df2 because df1 has a nan in c2
3 b nan 3 3 #c1,c2 as found in both
4 c d 1 nan #from df1
5 d e 3 nan #from df1
6 z y nan 2 #from df2
NaNs may come from either c1 or c2, but for this example I kept it simpler.
I'm not sure what's the cleanest way to do this. I was thinking to merge based on [c1,c2], and then loop by rows with nan, but this will not be so direct. Do you see a better way to do it?
Edit - clarifying conditions
1. No duplicates are found anywhere.
2. No combination is performed between two rows if they both have values. c1 may not be combined with c2, so order must be respected.
3. For the cases where one of the 2 dfs has a nan in either c1 or c2, find the rows in the other dataframe that don't have a full match on both c1+c2, and use it. For instance:
(a,c) has a match in both so it is no longer discussed.
(a,b) is only in df1. No b is found in df2.c2. The only row in df2 with a known key and a nan is row 0 so it is combined with this one. Note that order must be respected this is why (a,b) #df1 cannot be combined with any other row of df2 that also contains a b.
(a,x) is only in df2. No x is found in df1.c2. The only row in df1 with one of the known keys with a nan is row with index 2.
I'm trying to duplicate rows of a pandas DataFrame (v.0.23.4, python v.3.7.1) based on an int value in one of the columns. I'm applying code from this question to do that, but I'm running into the following data type casting error: TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'. Basically, I'm not understanding why this code is attempting to cast to int32.
Starting with this,
dummy_dict = {'c1': ['a','b','c'],
'c2': [0,1,2]}
dummy_df = pd.DataFrame(dummy_dict)
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
I'm doing this
dummy_df_test = dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2']))
I want this at the end. However, I'm getting the above error instead.
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
3 c 2 textC
Just a workaround:
pd.concat([dummy_df[dummy_df.c2.eq(0)],dummy_df.loc[dummy_df.index.repeat(dummy_df.c2)]])
Another fantastic suggestion courtesy #Wen
dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2'].clip(lower=1)))
c1 c2
0 a 0
1 b 1
2 c 2
2 c 2
I believe the answer as to why it's happening can be found here:
https://github.com/numpy/numpy/issues/4384
Specifying the dtype as int32 should solve the problem as highlighted in the original comment.
In the first attempt all rows are duplicated, and in the second attempt just the row with the index 2. Thanks to the concat function.
df2 = pd.concat([df]*2, ignore_index=True)
print(df2)
df3= pd.concat([df, df.iloc[[2]]])
print(df3)
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
3 a 0 textA
4 b 1 textB
5 c 2 textC
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
2 c 2 textC
If you plan to reset the index at the end
df3=df3.reset_index(drop=True)
I have a very large csv file with following structure:
a1 b1 c1 a2 b2 c2 a3 b3 c3 ..... a999 b999 c999
0 5 4 2 3 2 2 6 7 9 ....................
1 2 1 4 4 6 9 3 5 9 ....................
.
.
What I want to do is to group the columns in sets of N, for a, b and c, and check when the index of maximum value (argmax) of the set changes, in each row.
So in the above example, for N = 3, a1, b1, c1 is the first set in row 0, and argmax is 0, 2nd set is a2, b2, c2 and argmax is still 0, 3rd set is a3, b3, c3 but now the argmax is 2. I deally I am looking for a script that parses the whole csv file and returns [c3, c1]. c3 because thats where the argmax changes in row 0 and c1 becuase argmax doesn't change in row 1 but c1 is the largest value in that set.
I am doing this right now by using two for loops and its slow and looks very ugly, is there a better pandas pythonic way of doing this? I feel there must be.
I tried to keep to code as simple as possible. You can translate your dataframe and group by the sliced column name:
df = df.T.reset_index()
idx = df.groupby(df['index'].str.slice(1,2)).idxmax()
Output:
0 1
index
1 0 2
2 3 5
3 8 8
That means that for row 0 the max for group 1 is at index 0, the max group 2 is at index 3 (or 0 is you take the mod 3), the max for group 3 is at index 8, (or 2 if you take mod 3). Same reading for row 1 :)
If you need the actual column name:
df.columns[idx.values.flatten(order='F')]
Output:
['a1', 'a2', 'c3', 'c1', 'c2', 'c3']
You can groupby sets of columns and use .idxmax to find the column where the maximum occurs within each set. You can find where the first letter changes (if it ever does) to get your list.
n = 3
df2 = df.groupby([x//n for x in range(len(df.columns))], axis=1).idxmax(1)
mask = df2.applymap(lambda x: x[0]) # Case of 1-letter column prefix
## If possibility of words with different length ending in digits try
# import string
# mask = df2.applymap(lambda x: x.strip(string.digits))
df2.lookup(df2.index,
(mask.ne(mask.shift(-1, axis=1)).idxmax(1)+1) % (len(mask.columns))).tolist()
Sample Data
print(df)
a1 b1 c1 a2 b2 c2 a3 b3 c3
0 5 4 2 3 2 2 6 7 9
1 2 1 4 4 6 9 3 5 9
2 2 1 4 10 6 9 3 5 9
3 2 1 4 1 6 9 3 10 9
n = 3
df2 = df.groupby([x//n for x in range(len(df.columns))], axis=1).idxmax(1)
print(df2)
# 0 1 2
#0 a1 a2 c3
#1 c1 c2 c3
#2 c1 a2 c3
#3 c1 c2 b3
mask = df2.applymap(lambda x: x[0])
df2.lookup(df2.index, (mask.ne(mask.shift(-1, axis=1)).idxmax(1)+1) % (len(mask.columns))).tolist()
#['c3', 'c1', 'a2', 'b3']
I'd like to convert a Pandas DataFrame that is derived from a pivot table into a row representation as shown below.
This is where I'm at:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'goods': ['a', 'a', 'b', 'b', 'b'],
'stock': [5, 10, 30, 40, 10],
'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06', '2014-02-09', '2014-03-09'])
})
# we don't care about year in this example
df['month'] = df['date'].map(lambda x: x.month)
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
print piv
which results in
stock
goods a b
category c1 c2 c1 c2
month
1 5 0 30 0
2 5 10 30 40
3 5 10 10 40
And this is where I want to get to.
goods category month stock
a c1 1 5
a c1 2 0
a c1 3 0
a c2 1 0
a c2 2 10
a c2 3 0
b c1 1 30
b c1 2 0
b c1 3 10
b c2 1 0
b c2 2 40
b c2 3 0
Previously, I used
piv = piv.stack()
piv = piv.reset_index()
print piv
to get rid of the multi-indexes, but this results in this because I pivot now on two columns (["goods", "category"]):
month category stock
goods a b
0 1 c1 5 30
1 1 c2 0 0
2 2 c1 5 30
3 2 c2 10 40
4 3 c1 5 10
5 3 c2 10 40
Does anyone know how I can get rid of the multi-index in the column and get the result into a DataFrame of the exemplified format?
>>> piv.unstack().reset_index().drop('level_0', axis=1)
goods category month 0
0 a c1 1 5
1 a c1 2 5
2 a c1 3 5
3 a c2 1 0
4 a c2 2 10
5 a c2 3 10
6 b c1 1 30
7 b c1 2 30
8 b c1 3 10
9 b c2 1 0
10 b c2 2 40
11 b c2 3 40
then all you need is to change last column name from 0 to stock.
It seems to me that melt (aka unpivot) is very close to what you want to do:
In [11]: pd.melt(piv)
Out[11]:
NaN goods category value
0 stock a c1 5
1 stock a c1 5
2 stock a c1 5
3 stock a c2 0
4 stock a c2 10
5 stock a c2 10
6 stock b c1 30
7 stock b c1 30
8 stock b c1 10
9 stock b c2 0
10 stock b c2 40
11 stock b c2 40
There's a rogue column (stock), that appears here that column header is constant in piv. If we drop it first the melt works OOTB:
In [12]: piv.columns = piv.columns.droplevel(0)
In [13]: pd.melt(piv)
Out[13]:
goods category value
0 a c1 5
1 a c1 5
2 a c1 5
3 a c2 0
4 a c2 10
5 a c2 10
6 b c1 30
7 b c1 30
8 b c1 10
9 b c2 0
10 b c2 40
11 b c2 40
Edit: The above actually drops the index, you need to make it a column with reset_index:
In [21]: pd.melt(piv.reset_index(), id_vars=['month'], value_name='stock')
Out[21]:
month goods category stock
0 1 a c1 5
1 2 a c1 5
2 3 a c1 5
3 1 a c2 0
4 2 a c2 10
5 3 a c2 10
6 1 b c1 30
7 2 b c1 30
8 3 b c1 10
9 1 b c2 0
10 2 b c2 40
11 3 b c2 40
I know that the question has already been answered, but for my dataset multiindex column problem, the provided solution was unefficient. So here I am posting another solution for unpivoting multiindex columns using pandas.
Here is the problem I had:
As one can see, the dataframe is composed of 3 multiindex, and two levels of multiindex columns.
The desired dataframe format was:
When I tried the options given above, the pd.melt function didn't allow to have more than one column in the var_name attribute. Therefore, every time that I tried a melt, I would end up losing some attribute from my table.
The solution I found was to apply a double stacking function over my dataframe.
Before the coding, it is worth notice that the desired var_name for my unpivoted table column was "Populacao residente em domicilios particulares ocupados" (see in the code below). Therefore, for all my value entries, they should be stacked in this newly created var_name new column.
Here is a snippet code:
import pandas as pd
# reading my table
df = pd.read_excel(r'my_table.xls', sep=',', header=[2,3], encoding='latin3',
index_col=[0,1,2], na_values=['-', ' ', '*'], squeeze=True).fillna(0)
df.index.names = ['COD_MUNIC_7', 'NOME_MUN', 'TIPO']
df.columns.names = ['sexo', 'faixa_etaria']
df.head()
# making the stacking:
df = pd.DataFrame(pd.Series(df.stack(level=0).stack(), name='Populacao residente em domicilios particulares ocupados')).reset_index()
df.head()
Another solution I found was to first apply a stacking function over the dataframe and then apply the melt.
Here is an alternative code:
df = df.stack('faixa_etaria').reset_index().melt(id_vars=['COD_MUNIC_7', 'NOME_MUN','TIPO', 'faixa_etaria'],
value_vars=['Homens', 'Mulheres'],
value_name='Populacao residente em domicilios particulares ocupados',
var_name='sexo')
df.head()
Sincerely yours,
Philipe Riskalla Leal