Unmelt Pandas DataFrame - python

I have a pandas dataframe with two id variables:
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                   'num': [10, 10, 12, 13, 14, 15],
                   'q': ['a', 'b', 'd', 'a', 'b', 'z'],
                   'v': [2, 4, 6, 8, 10, 12]})
id num q v
0 1 10 a 2
1 1 10 b 4
2 1 12 d 6
3 2 13 a 8
4 2 14 b 10
5 3 15 z 12
I can pivot the table with:
df.pivot(index='id', columns='q', values='v')
And end up with something close:
q a b d z
id
1 2 4 6 NaN
2 8 10 NaN NaN
3 NaN NaN NaN 12
However, what I really want is (the original unmelted form):
id num a b d z
1 10 2 4 NaN NaN
1 12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
2 14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
In other words:
'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' used as the index, but I need both since I'm trying to retrieve the original unmelted form)
'q' are my columns
'v' are my values in the table
Update
I found a close solution from Wes McKinney's blog:
df.pivot_table(index=['id','num'], columns='q')
v
q a b d z
id num
1 10 2 4 NaN NaN
12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
However, the format is not quite the same as what I want above.

You could use set_index and unstack
In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
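If the leftover q label on the columns bothers you, it can be dropped in the same chain; a minimal sketch of the same approach:
df.set_index(['id', 'num', 'q'])['v'].unstack().rename_axis(columns=None).reset_index()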

You're really close, slaw. Just rename your column index to None and you've got what you want.
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=False)
Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, pandas will error out with:
DataError: No numeric types to aggregate
To resolve this, you can specify your own aggregation function using a custom lambda:
df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc=lambda x: x)
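For instance, if 'v' held strings (df_str below is a hypothetical variant of the sample frame), the default mean aggregation would fail; since each (id, num, q) combination is unique here, aggfunc='first' is another option. A sketch:
df_str = df.assign(v=df['v'].astype(str))  # make 'v' non-numeric
df2 = df_str.pivot_table(index=['id', 'num'], columns='q', aggfunc='first')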

You can remove the columns name q by replacing the column index with a plain list:
df1.columns = df1.columns.tolist()
Zero's answer plus removing q:
df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns = df1.columns.tolist()
id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0

This might work just fine:
Pivot (passing values as a list keeps the column MultiIndex):
df2 = df.pivot_table(index=['id', 'num'], columns='q', values=['v']).reset_index()
Concatenate the first-level column names with the second:
df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]

I came up with a close solution:
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
I still can't figure out how to drop 'q' from the dataframe.

It can be done in three steps:
#1: Prepare an auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])
#2: pivot is almost an inverse of melt:
df = df.pivot(index='id_num', columns='q', values='v').reset_index()
df.columns.name = ''
#3: Bring back the 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])
This is the result, but with a different order of columns:
a b d z id num
0 2.0 4.0 NaN NaN 1 10
1 NaN NaN 6.0 NaN 1 12
2 8.0 NaN NaN NaN 2 13
3 NaN 10.0 NaN NaN 2 14
4 NaN NaN NaN 12.0 3 15
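If only the column order is off, you could also just reorder explicitly at the end (column list taken from this example):
df = df[['id', 'num', 'a', 'b', 'd', 'z']]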
Alternatively, with the proper column order:
def multiindex_pivot(df, columns=None, values=None):
    # inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()   # added to the original snippet
    df.columns.name = ''    # added to the original snippet
    return df
df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')

Related

Merge almost identical rows after removing nan values

I have a dataframe like,
pri_col col1 col2 Date
r1 3 4 2020-09-10
r1 4 1 2020-09-11
r1 2 7 2020-09-12
r1 6 4 2020-09-13
Note: there are many more unique values in the 'pri_col' column; this is just a sample, so I'm showing a single value. Also, for a single value of 'pri_col' the value of 'Date' is always unique.
I need the dataframe like,
pri_col col1_2020-09-10 col1_2020-09-11 col1_2020-09-12 col1_2020-09-13 col2_2020-09-10 col2_2020-09-11 col2_2020-09-12 col2_2020-09-13
r1 3 4 2 6 4 1 7 4
Based on a previous solution, I have tried:
df = (df.reset_index()
        .melt(id_vars=['index', 'pri_col', 'Date'],
              var_name='cols',
              value_name='val')
        .pivot(index=['index', 'pri_col'],
               columns=['cols', 'Date'],
               values='val'))
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1).rename_axis(None)
print(df)
But this is the resulting dataframe:
pri_col col1_2020-09-10 col1_2020-09-11 col1_2020-09-12 col1_2020-09-13 col2_2020-09-10 col2_2020-09-11 col2_2020-09-12 col2_2020-09-13
r1 3 NaN NaN NaN 4 NaN NaN NaN
r1 NaN 4 NaN NaN NaN 1 NaN NaN
r1 NaN NaN 2 NaN NaN NaN 7 NaN
r1 NaN NaN NaN 6 NaN NaN NaN 4
How do I solve the issue?
Also, I had asked a question recently that may sound similar.
IIUC, use pandas.DataFrame.set_index with unstack:
new_df = df.set_index(['pri_col', 'Date']).unstack()
new_df.columns = ["%s_%s" % (i, j) for i, j in new_df.columns]
print(new_df)
Output:
col1_2020-09-10 col1_2020-09-11 col1_2020-09-12 col1_2020-09-13 \
pri_col
r1 3 4 2 6
col2_2020-09-10 col2_2020-09-11 col2_2020-09-12 col2_2020-09-13
pri_col
r1 4 1 7 4
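An equivalent sketch via pivot, relying on the stated uniqueness of the (pri_col, Date) pairs:
new_df = df.pivot(index='pri_col', columns='Date')
new_df.columns = ["%s_%s" % (i, j) for i, j in new_df.columns]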

How to fill and merge df with 10 empty rows?

How do I fill a df with empty rows, or create a df with empty rows? I have:
df = pd.DataFrame(columns=["naming","type"])
How do I fill this df with empty rows?
Specify index values:
df = pd.DataFrame(columns=["naming","type"], index=range(10))
print (df)
naming type
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
If you need empty strings:
df = pd.DataFrame('', columns=["naming","type"], index=range(10))
print (df)
naming type
0
1
2
3
4
5
6
7
8
9
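And to append 10 empty rows to a frame that already holds data, one sketch (assuming the default RangeIndex) is to reindex past the current length:
df = df.reindex(range(len(df) + 10))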

Map values from one dataframe to new columns in other based on column values - Pandas

I have a problem with mapping values from another dataframe.
These are samples of two dataframes:
df1
product class_1 class_2 class_3
141A 11 13 5
53F4 12 11 18
GS24 14 12 10
df2
id product_type_0 product_type_1 product_type_2 product_type_3 measure_0 measure_1 measure_2 measure_3
1 141A GS24 NaN NaN 1 3 NaN NaN
2 53F4 NaN NaN NaN 1 NaN NaN NaN
3 53F4 141A 141A NaN 2 2 1 NaN
4 141A GS24 NaN NaN 3 2 NaN NaN
What I'm trying to get is the following:
I need to add new columns called "Max_Class_1", "Max_Class_2" and "Max_Class_3", whose values are taken from df1.
For each order number (_1, _2, _3), look at the corresponding product_type column (for example product_type_1) and take the row from df1 where 'product' has the same value. Then look at the corresponding measure column (for example measure_1): if its value is 1 (at most four different values are possible in the original data), the new column "Max_Class_1" gets the class_1 value for that product, in this case 11.
I think it's a little simpler than I explained it.
Desired output
id product_type_0 product_type_1 product_type_2 product_type_3 measure_0 measure_1 measure_2 measure_3 max_class_0 max_class_1 max_class_2 max_class_3
1 141A GS24 NaN NaN 1 3 NaN NaN 1 10 NaN NaN
2 53F4 NaN NaN NaN 1 NaN NaN NaN 12 NaN NaN NaN
3 53F4 141A 141A NaN 2 2 1 NaN 11 13 11 NaN
4 141A GS24 NaN NaN 3 2 NaN NaN 5 12 NaN NaN
The code I have tried:
df2['max_class_1'] = None
df2['max_class_2'] = None
df2['max_class_3'] = None
def get_max_class(product_df, measure_df, product_type_column, measure_column, max_class_columns):
    for index, row in measure_df.iterrows():
        product_df_new = product_df[product_df['product'] == row[product_type_column]]
        for ind, r in product_df_new.iterrows():
            if row[measure_column] == 1:
                row[max_class_columns] = r['class_1']
            elif row[measure_column] == 2:
                row[max_class_columns] = r['class_2']
            elif row[measure_column] == 3:
                row[max_class_columns] = r['class_3']
            else:
                row[max_class_columns] = "There is no measure or type"
    return measure_df
# And the function calls
first_class = get_max_class(product_df=df1, measure_df=df2, product_type_column='product_type_1', measure_column='measure_1', max_class_columns='max_class_1')
second_class = get_max_class(product_df=df1, measure_df=first_class, product_type_column='product_type_2', measure_column='measure_2', max_class_columns='max_class_2')
third_class = get_max_class(product_df=df1, measure_df=second_class, product_type_column='product_type_3', measure_column='measure_3', max_class_columns='max_class_3')
I'm pretty sure there is a simpler solution, but I don't know why this isn't working: I'm getting all None values, and nothing changes.
pd.DataFrame.lookup is the standard method for lookups by row and column labels. (Incidentally, your loop leaves everything None because iterrows yields copies: assigning to row never writes back into measure_df.)
Your problem is complicated by the existence of null values, but this can be accommodated by modifying your input mapping dataframe.
Step 1
Rename columns in df1 to integers and add an extra row / column. We will use the added data later to deal with null values.
def rename_cols(x):
    return x if not x.startswith('class') else int(x.split('_')[-1])
df1 = df1.rename(columns=rename_cols)
df1 = df1.set_index('product')
df1.loc['X'] = 0
df1[0] = 0
Your mapping dataframe now looks like:
print(df1)
1 2 3 0
product
141A 11 13 5 0
53F4 12 11 18 0
GS24 14 12 10 0
X 0 0 0 0
Step 2
Iterate the number of categories and use pd.DataFrame.lookup. Notice how we fillna with X and 0, exactly what we used for additional mapping data in Step 1.
n = df2.columns.str.startswith('measure').sum()
for i in range(n):
    rows = df2['product_type_{}'.format(i)].fillna('X')
    cols = df2['measure_{}'.format(i)].fillna(0).astype(int)
    df2['max_{}'.format(i)] = df1.lookup(rows, cols)
Result
print(df2)
id product_type_0 product_type_1 product_type_2 product_type_3 measure_0 \
0 1 141A GS24 NaN NaN 1
1 2 53F4 NaN NaN NaN 1
2 3 53F4 141A 141A NaN 2
3 4 141A GS24 NaN NaN 3
measure_1 measure_2 measure_3 max_0 max_1 max_2 max_3
0 3.0 NaN NaN 11 10 0 0
1 NaN NaN NaN 12 0 0 0
2 2.0 1.0 NaN 11 13 11 0
3 2.0 NaN NaN 5 12 0 0
You can convert the 0 to np.nan if required. This will be at the expense of converting your series from int to float, since NaN is considered float.
Of course, if X and 0 are valid values, you can use alternative filler values from the start.
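Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions, a sketch of an equivalent using positional indexing (same rows/cols as above):
for i in range(n):
    rows = df2['product_type_{}'.format(i)].fillna('X')
    cols = df2['measure_{}'.format(i)].fillna(0).astype(int)
    df2['max_{}'.format(i)] = df1.to_numpy()[
        df1.index.get_indexer(rows), df1.columns.get_indexer(cols)]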

pandas returning the unnamed columns

The following is an example of the data I have in an Excel sheet.
A B C
1 2 3
4 5 6
I am trying to get the column names using the following code:
p1 = list(df1t.columns.values)
The output is like this:
['A', 'B', 'C', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', ...]
I checked the Excel sheet; there are only three columns named A, B, and C. The other columns are blank. Any suggestions?
Just in case anybody stumbles over this problem: The issue can also arise if the excel sheet contains empty cells that are formatted with a background color:
import pandas as pd
df1t = pd.read_excel('test.xlsx')
print(df1t)
A B C Unnamed: 3
0 1 2 3 NaN
1 4 5 6 NaN
One option is to drop the 'Unnamed' columns as described here:
https://stackoverflow.com/a/44272830/11826257
df1t = df1t[df1t.columns.drop(list(df1t.filter(regex='Unnamed:')))]
print(df1t)
A B C
0 1 2 3
1 4 5 6
The problem is that some cells are not empty but contain whitespace.
If you need the column names with the Unnamed ones filtered out:
cols = [col for col in df if not col.startswith('Unnamed:')]
print (cols)
['A', 'B', 'C']
Sample with file:
df = pd.read_excel('https://dl.dropboxusercontent.com/u/84444599/file_unnamed_cols.xlsx')
print (df)
A B C Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 4.0 6.0 8.0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
cols = [col for col in df if not col.startswith('Unnamed:')]
print (cols)
['A', 'B', 'C']
Another solution:
cols = df.columns[~df.columns.str.startswith('Unnamed:')]
print (cols)
Index(['A', 'B', 'C'], dtype='object')
And to select all of those columns from the df use:
print (df[cols])
A B C
0 4.0 6.0 8.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
And if necessary, remove the all-NaN rows:
print (df[cols].dropna(how='all'))
A B C
0 4.0 6.0 8.0
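If the sheet layout is fixed, you could also avoid reading the stray columns at all; a sketch assuming the real data lives in spreadsheet columns A:C:
df = pd.read_excel('test.xlsx', usecols='A:C')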

Pandas: concatenate and reindex dataframes

I would like to combine two pandas dataframes into a new third dataframe using a new index. Suppose I start with the following:
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(25).reshape((5,5))*2,index = ['A','B','C','D','E'])
df[2] = np.nan
df1[3] = np.nan
df[4] = np.nan
df1[4] = np.nan
I would like the least convoluted way to achieve the following result:
NewIndex OldIndex df df1
1 A 1 2
2 B 1 2
3 C 1 2
4 D 1 2
5 E 1 2
6 A 1 2
7 B 1 2
8 C 1 2
9 D 1 2
10 E 1 2
11 A NaN 2
12 B NaN 2
13 C NaN 2
14 D NaN 2
15 E NaN 2
16 A 1 NaN
17 B 1 NaN
18 C 1 NaN
19 D 1 NaN
20 E 1 NaN
What's the best way to do this?
You have to unstack your dataframes and then reindex the concatenated dataframe.
import numpy as np
import pandas as pd
# test data
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(25).reshape((5,5))*2,index = ['A','B','C','D','E'])
df[2] = np.nan
df1[3] = np.nan
df[4] = np.nan
df1[4] = np.nan
# unstack tables and concat
newdf = pd.concat([df.unstack(),df1.unstack()], axis=1)
# reset multiindex for level 1
newdf.reset_index(1, inplace=True)
# rename columns
newdf.columns = ['OldIndex','df','df1']
# drop old index
newdf = newdf.reset_index().drop('index', axis=1)
# set index from 1
newdf.index = np.arange(1, len(newdf) + 1)
# rename new index
newdf.index.name='NewIndex'
print(newdf)
Output:
OldIndex df df1
NewIndex
1 A 1.0 2.0
2 B 1.0 2.0
3 C 1.0 2.0
4 D 1.0 2.0
5 E 1.0 2.0
6 A 1.0 2.0
7 B 1.0 2.0
8 C 1.0 2.0
9 D 1.0 2.0
10 E 1.0 2.0
11 A NaN 2.0
12 B NaN 2.0
13 C NaN 2.0
14 D NaN 2.0
15 E NaN 2.0
16 A 1.0 NaN
17 B 1.0 NaN
18 C 1.0 NaN
19 D 1.0 NaN
20 E 1.0 NaN
21 A NaN NaN
22 B NaN NaN
23 C NaN NaN
24 D NaN NaN
25 E NaN NaN
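Note the result has five more rows than the target above because column 4 is all-NaN in both frames. If you want to match the 20-row output exactly, one option is to drop those rows after renaming the columns and before the new index is assigned, e.g. newdf = newdf.dropna(how='all', subset=['df', 'df1']).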
