Flatten out a pandas dataframe? - python

Here's some data from another question:
     A  B    C
0    s  s  NaN
1  NaN  x    x
Trying to experiment, I would like to transform the dataframe to something like this:
     0
A    s
A  NaN
B    s
B    x
C  NaN
C    x
As a dataframe or series. This amounts to a transpose followed by a reshape. How would I do this?

Another method is to use melt:
In [184]: df.melt().set_index('variable')
Out[184]:
         value
variable
A            s
A          NaN
B            s
B            x
C          NaN
C            x
The set_index step is needed because melt returns a plain RangeIndex, as the intermediate result shows:
In [188]: df.melt()
Out[188]:
  variable value
0        A     s
1        A   NaN
2        B     s
3        B     x
4        C   NaN
5        C     x
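For reference, here is a minimal runnable sketch of the melt approach, rebuilding the sample frame from the question:
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({'A': ['s', np.nan], 'B': ['s', 'x'], 'C': [np.nan, 'x']})

# melt() keeps NaN cells, so no values are lost; set_index then
# moves the column labels into the index to match the desired shape.
result = df.melt().set_index('variable')
print(result)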

You can use unstack after transposing the df, i.e.:
df.T.unstack().to_frame().reset_index(level=0, drop=True).sort_index()
Output:
     0
A    s
A  NaN
B    s
B    x
C  NaN
C    x
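For intuition, here is the intermediate result on the sample frame (a sketch): DataFrame.unstack() on a single-level index returns a Series keyed by (column, index), which is why level 0 is dropped afterwards.
df.T.unstack()
# 0  A      s
#    B      s
#    C    NaN
# 1  A    NaN
#    B      x
#    C      x
# dtype: object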

Or simply
df.stack(dropna=False).to_frame().reset_index(level=0, drop=True).sort_index()
Out[44]:
     0
A    s
A  NaN
B    s
B    x
C  NaN
C    x

This should work for you:
df.stack().reset_index(level=0, drop=True)
DataFrame.stack() is the flattening method for dataframes. To preserve the data, however, it leaves you with a MultiIndex. Since your output frame does not need the original row index, you can drop that level with reset_index.
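One caveat, sketched below on the question's sample frame: stack() drops NaN cells by default, so the NaN rows from the desired output disappear unless you pass dropna=False as in the answer above (note that newer pandas versions deprecate the dropna argument, so check your version's documentation):
# stack() silently drops the NaN cells:
df.stack().reset_index(level=0, drop=True)
# A    s
# B    s
# B    x
# C    x
# dtype: object

# Keep them with dropna=False:
df.stack(dropna=False).reset_index(level=0, drop=True).sort_index()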

Related

How to replace DataFrame.append with pd.concat to append a Series as row?

I have a data frame with numeric values, such as
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
and I append a single row with all the column sums
totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)
Simple enough.
Here are the values of df, totals, and df_append
>>> df
   A  B
0  1  2
1  3  4
>>> totals
A    4
B    6
Name: totals, dtype: int64
>>> df_append
        A  B
0       1  2
1       3  4
totals  4  6
Unfortunately, in newer versions of pandas the method DataFrame.append is deprecated (it was removed entirely in pandas 2.0). The advice is to replace it with pandas.concat.
Now, using pd.concat naively as follows
df_concat_bad = pd.concat([df, totals])
produces
>>> df_concat_bad
     A    B    0
0  1.0  2.0  NaN
1  3.0  4.0  NaN
A  NaN  NaN  4.0
B  NaN  NaN  6.0
Apparently, with df.append the Series object got interpreted as a row, but with pd.concat it got interpreted as a column.
You cannot fix this by calling pd.concat with axis=1, because that would add totals as a column:
>>> pd.concat([df, totals], axis=1)
     A    B  totals
0  1.0  2.0     NaN
1  3.0  4.0     NaN
A  NaN  NaN     4.0
B  NaN  NaN     6.0
(In this case, the result looks the same as using the default axis=0, because the indexes of df and totals are disjoint, as are their column names.)
How to handle this (elegantly and efficiently)?
The solution is to convert totals (a Series object) into a single-column DataFrame using to_frame() and then transpose it with T:
df_concat_good = pd.concat([df, totals.to_frame().T])
yields the desired
>>> df_concat_good
        A  B
0       1  2
1       3  4
totals  4  6
I prefer using df.loc rather than pd.concat to solve this problem:
df.loc["totals"] = df.sum()

Pandas: Same indices for each column. Is there a better way to solve this?

Sorry for the lousy wording; I can't come up with a concise way to ask this question.
I have a dataframe (variable df) such as the one below:
df

   ID    A    B    C
0   1    m  NaN  NaN
1   2    n  NaN  NaN
2   3    b  NaN  NaN
3   1  NaN    t  NaN
4   2  NaN    e  NaN
5   3  NaN    r  NaN
6   1  NaN  NaN    y
7   2  NaN  NaN    u
8   3  NaN  NaN    i
The desired output is:

   ID  A  B  C
0   1  m  t  y
1   2  n  e  u
2   3  b  r  i
I solved this by running the following lines:
new_df = pd.DataFrame()
for column in df.columns:
    new_df = pd.concat([new_df, df[column].dropna()], join='outer', axis=1)
And then I figured this would be faster:
empty_dict = {}
for column in df.columns:
    empty_dict[column] = df[column].dropna()
new_df = pd.DataFrame.from_dict(empty_dict)
However, the dropna could be a problem if, for example, there is a missing value in one of the rows that should supply a column's values. E.g. if df.loc[2, 'A'] were nan, that dictionary key would only hold 2 values, causing a misalignment with the rest of the columns. So I'm not convinced.
I have the feeling pandas must have a builtin function that will do a better job than either of my two solutions. Is there? If not, is there any better way of solving this?
Looks like you only need groupby().first():
df.groupby('ID', as_index=False).first()
Output:
   ID  A  B  C
0   1  m  t  y
1   2  n  e  u
2   3  b  r  i
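A self-contained sketch, with the question's frame reconstructed as a hypothetical dict:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3] * 3,
                   'A': ['m', 'n', 'b'] + [np.nan] * 6,
                   'B': [np.nan] * 3 + ['t', 'e', 'r'] + [np.nan] * 3,
                   'C': [np.nan] * 6 + ['y', 'u', 'i']})

# first() returns the first non-null value per group and column,
# which collapses the three staggered blocks into one row per ID.
print(df.groupby('ID', as_index=False).first())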
Use stack() followed by unstack(), as suggested by @QuangHoang, if ID is the index:
>>> df.stack().unstack()
    A  B  C
ID
1   m  t  y
2   n  e  u
3   b  r  i
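If ID is a regular column instead (as in the reconstructed frame above), set it as the index first. A sketch: after stack() drops the NaNs, every (ID, column) pair is unique, so the round trip works:
df.set_index('ID').stack().unstack()
#     A  B  C
# ID
# 1   m  t  y
# 2   n  e  u
# 3   b  r  i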
You can use melt and pivot:
>>> df.melt('ID').dropna().pivot('ID', 'variable', 'value') \
...    .rename_axis(columns=None).reset_index()
   ID  A  B  C
0   1  m  t  y
1   2  n  e  u
2   3  b  r  i
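Note that DataFrame.pivot takes keyword-only arguments in pandas 2.0+, so on newer versions the same call would be spelled out as:
out = (df.melt('ID').dropna()
         .pivot(index='ID', columns='variable', values='value')
         .rename_axis(columns=None).reset_index())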

How to sum the data in specific level after groupby or pivot_table with multi-index dataframe

Here is a schematic diagram of my question.
After I run groupby or pivot_table on my dataframe as below, I get dataframe ② with an ("A", "B") MultiIndex. However, I don't need that much detail, so I would like to aggregate the data at a specific index level.
Then I can get the final dataframe ③: b should summarise the e/f index data, 1 + 1 = 2.
This is just like the fold (outline) behavior in Excel. I checked the documentation (pivot_table), but the margins parameter does not fit this requirement.
Thanks!!
I don't know of such a function, but here is a workaround:
df
     C
A B
a p  1
  q  1
d r  2
  s  2
b t  3
  v  3
  w  3
c x  4
  y  4
fld = ["d", "b", "c"]
nfidx = df.index.levels[0].difference(fld)
Index(['a'], dtype='object', name='A')

df2 = df.loc[fld].groupby("A").sum()
   C
A
b  9
c  8
d  4
mi = pd.MultiIndex.from_product([df2.index, [np.nan]], names=["A", "B"])
MultiIndex([('b', nan),
            ('c', nan),
            ('d', nan)],
           names=['A', 'B'])

df2.index = mi
       C
A B
b NaN  9
c NaN  8
d NaN  4
pd.concat([df.loc[nfidx], df2])
       C
A B
a p    1
  q    1
b NaN  9
c NaN  8
d NaN  4
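Putting the workaround together as one runnable sketch (the frame below is reconstructed from the displays above):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 'p'), ('a', 'q'), ('d', 'r'), ('d', 's'), ('b', 't'),
     ('b', 'v'), ('b', 'w'), ('c', 'x'), ('c', 'y')], names=['A', 'B'])
df = pd.DataFrame({'C': [1, 1, 2, 2, 3, 3, 3, 4, 4]}, index=idx)

fld = ['d', 'b', 'c']                       # levels of A to collapse
nfidx = df.index.levels[0].difference(fld)  # levels of A to keep as-is

# Sum the collapsed levels, then park the sums under a NaN B-label
# so they can be concatenated back with the untouched rows.
df2 = df.loc[fld].groupby('A').sum()
df2.index = pd.MultiIndex.from_product([df2.index, [np.nan]],
                                       names=['A', 'B'])

print(pd.concat([df.loc[nfidx], df2]))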

How to pick out all non-NULL value from multiple columns in Python Dataframe

I have a DataFrame like the one below:
  column-a column-b column-c
0      Nan        A        B
1        A      Nan        C
2      Nan      Nan        C
3        A        B        C
I hope to create a new column-d that captures all non-null values from column-a to column-c:
  column-d
0      A,B
1      A,C
2        C
3    A,B,C
Thanks!
You need to change the 'Nan' strings to np.nan, then use stack with a groupby join:
df = df.replace('Nan', np.nan)
df.stack().groupby(level=0).agg(','.join)
Out[570]:
0      A,B
1      A,C
2        C
3    A,B,C
dtype: object
# df['column-d'] = df.stack().groupby(level=0).agg(','.join)
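A self-contained version of this approach; the frame is reconstructed on the assumption that the 'Nan' entries are literal strings, as in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'column-a': ['Nan', 'A', 'Nan', 'A'],
                   'column-b': ['A', 'Nan', 'Nan', 'B'],
                   'column-c': ['B', 'C', 'C', 'C']})
df = df.replace('Nan', np.nan)

# stack() drops the NaNs; grouping on level 0 (the row label)
# then re-joins the surviving values per row.
df['column-d'] = df.stack().groupby(level=0).agg(','.join)
print(df['column-d'])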
After fixing the nans:
df = df.replace('Nan', np.nan)
Collect all non-null values in each row into a list and join the list items:
df['column-d'] = df.apply(lambda x: ','.join(x[x.notnull()]), axis=1)
# 0      A,B
# 1      A,C
# 2        C
# 3    A,B,C
Surprisingly, this solution is somewhat faster than the stack/groupby solution by Wen, at least for the posted dataset.
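If you want to check that claim on your own data, here is a small timing harness (it assumes the df with np.nan from above; results vary with machine, pandas version, and frame size):
import timeit

apply_t = timeit.timeit(
    lambda: df.apply(lambda x: ','.join(x[x.notnull()]), axis=1), number=1000)
stack_t = timeit.timeit(
    lambda: df.stack().groupby(level=0).agg(','.join), number=1000)
print(f'apply: {apply_t:.3f}s   stack/groupby: {stack_t:.3f}s')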

How to combine multiple rows of same category to one in pandas?

I'm trying to get from table 1 to table 2 in the image, but I can't seem to get it right. I tried a pivot table to change columns A-D from rows to columns. Then I tried groupby, but it doesn't give me one row; it messes up my dataframe instead.
You can fill the null values with the value in the column and drop the duplicates. With:
import numpy as np

df = pd.DataFrame([["A", np.nan, np.nan, "Y", "Z"],
                   [np.nan, "B", np.nan, "Y", "Z"],
                   [np.nan, np.nan, "C", "Y", "Z"]], columns=list("ABCDE"))
df
     A    B    C  D  E
0    A  NaN  NaN  Y  Z
1  NaN    B  NaN  Y  Z
2  NaN  NaN    C  Y  Z
df.ffill().bfill().drop_duplicates()
   A  B  C  D  E
0  A  B  C  Y  Z
df.ffill().bfill() gives:
   A  B  C  D  E
0  A  B  C  Y  Z
1  A  B  C  Y  Z
2  A  B  C  Y  Z
As per your comment, you could define a function that fills the missing values of the first row with the unique value found elsewhere in the same column.
def fillna_uniq(df, col):
    if isinstance(col, list):
        for c in col:
            df.loc[df.index[0], c] = df[c].dropna().iloc[0]
    else:
        df.loc[df.index[0], col] = df[col].dropna().iloc[0]
    return df.iloc[[0]]
You could then do:
fillna_uniq(df.copy(), ["B", "C", "D"])
       A  B   C     D       E     F
0  Hello  I  am  lost  Pandas  Data
It is a bit faster, I think. You can modify df in place by passing the dataframe directly rather than a copy.
HTH
One way you can do this is using apply and dropna:
Assuming those blanks in your table above are really nulls:
df = pd.DataFrame({'A': ['Hello', np.nan, np.nan, np.nan],
                   'B': [np.nan, 'I', np.nan, np.nan],
                   'C': [np.nan, np.nan, 'am', np.nan],
                   'D': [np.nan, np.nan, np.nan, 'lost'],
                   'E': ['Pandas'] * 4,
                   'F': ['Data'] * 4})
print(df)
       A    B    C     D       E     F
0  Hello  NaN  NaN   NaN  Pandas  Data
1    NaN    I  NaN   NaN  Pandas  Data
2    NaN  NaN   am   NaN  Pandas  Data
3    NaN  NaN  NaN  lost  Pandas  Data
Using apply, you can apply the lambda function to each column of the dataframe, first dropping the null values and then taking the max:
df.apply(lambda x: x.dropna().max()).to_frame().T
       A  B   C     D       E     F
0  Hello  I  am  lost  Pandas  Data
Or if your blanks are really empty strings, then you can do this:
df1 = df.replace(np.nan, '')
df1
       A  B   C     D       E     F
0  Hello                Pandas  Data
1         I             Pandas  Data
2            am         Pandas  Data
3               lost    Pandas  Data
df1.apply(lambda x: x[x != ''].max()).to_frame().T
       A  B   C     D       E     F
0  Hello  I  am  lost  Pandas  Data
