Python convert long to wide data frame (strings) [duplicate] - python

I have data in long format and am trying to reshape to wide, but there doesn't seem to be a straightforward way to do this using melt/stack/unstack:
Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2
Becomes:
Salesman Height product_1 price_1 product_2 price_2 product_3 price_3
Knut 6 bat 5 ball 1 wand 3
Steve 5 pen 2 NA NA NA NA
I think Stata can do something like this with the reshape command.

Here's another, more fleshed-out solution, taken from Chris Albon's site.
Create "long" dataframe
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
Make a "wide" data
df.pivot(index='patient', columns='obs', values='score')
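With the raw_data above, this pivot should produce roughly the following (patient 2 has no obs 3, so that cell is NaN and the values become floats):
obs           1        2       3
patient
1        6252.0  24243.0  2345.0
2        2342.0  23525.0     NaN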

A simple pivot might be sufficient for your needs, but this is what I did to reproduce your desired output:
df['idx'] = df.groupby('Salesman').cumcount()
Just adding a within-group counter/index will get you most of the way there, but the column labels will not be as you desired:
print(df.pivot(index='Salesman', columns='idx')[['product', 'price']])
product price
idx 0 1 2 0 1 2
Salesman
Knut bat ball wand 5 1 3
Steve pen NaN NaN 2 NaN NaN
To get closer to your desired output I added the following:
df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)
product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')
reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)
product_0 product_1 product_2 price_0 price_1 price_2 Height
Salesman
Knut bat ball wand 5 1 3 6
Steve pen NaN NaN 2 NaN NaN 5
Edit: if you want to generalize the procedure to more variables, I think you could do something like the following (although it might not be efficient enough):
df['idx'] = df.groupby('Salesman').cumcount()
tmp = []
for var in ['product', 'price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman', columns='tmp_idx', values=var))
reshape = pd.concat(tmp, axis=1)
@Luke said:
I think Stata can do something like this with the reshape command.
You can, but I think you also need a within-group counter for the reshape in Stata to get your desired output:
+-------------------------------------------+
| salesman idx height product price |
|-------------------------------------------|
1. | Knut 0 6 bat 5 |
2. | Knut 1 6 ball 1 |
3. | Knut 2 6 wand 3 |
4. | Steve 0 5 pen 2 |
+-------------------------------------------+
If you add idx, then you can do the reshape in Stata:
reshape wide product price, i(salesman) j(idx)

Karl D.'s solution gets at the heart of the problem, but I find it far easier to pivot everything (with .pivot_table, because of the two index columns) and then sort and reassign the columns to collapse the MultiIndex:
df['idx'] = df.groupby('Salesman').cumcount()+1
df = df.pivot_table(index=['Salesman', 'Height'], columns='idx',
                    values=['product', 'price'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x,y in df.columns]
df = df.reset_index()
Output:
Salesman Height price_1 product_1 price_2 product_2 price_3 product_3
0 Knut 6 5.0 bat 1.0 ball 3.0 wand
1 Steve 5 2.0 pen NaN NaN NaN NaN

A bit old, but I will post this for other people.
What you want can be achieved, but you probably shouldn't want it ;)
Pandas supports hierarchical indexes for both rows and columns.
In Python 2.7.x ...
import pandas as pd
from StringIO import StringIO

raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep='\s+')
print dff.set_index(['Salesman', 'Height', 'product']).unstack('product')
This produces a representation that is probably more convenient than the one you were looking for:
price
product ball bat pen wand
Salesman Height
Knut 6 1 5 NaN 3
Steve 5 NaN NaN 2 NaN
The advantage of using set_index and unstack over a single function like pivot is that you can break the operation down into clear, small steps, which simplifies debugging.
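For instance, you can inspect the intermediate indexed frame before unstacking (a small illustration, reusing the dff defined above):
indexed = dff.set_index(['Salesman', 'Height', 'product'])
print(indexed)                       # still long, but with a hierarchical row index
print(indexed.unstack('product'))    # the 'product' level is moved into the columns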

pivoted = df.pivot(index='salesman', columns='product', values='price')
(p. 192, Python for Data Analysis)

An old question; this is an addition to the already excellent answers. pivot_wider from pyjanitor may be helpful as an abstraction for reshaping from long to wide (it is a wrapper around pd.pivot):
# pip install pyjanitor
import pandas as pd
import janitor
idx = df.groupby(['Salesman', 'Height']).cumcount().add(1)
(df.assign(idx=idx)
   .pivot_wider(index=['Salesman', 'Height'], names_from='idx')
)
Salesman Height product_1 product_2 product_3 price_1 price_2 price_3
0 Knut 6 bat ball wand 5.0 1.0 3.0
1 Steve 5 pen NaN NaN 2.0 NaN NaN

Related

More effective method to test a large dataframe and add values based on another column's value (different sizes / not a merge)

There are lots of answers on merging and filling a full column, but I can't figure out a more effective method for my situation.
Current versions of Python, pandas, and numpy; the file format is parquet.
Simply put: if col1 == x, then col10 = 1, col11 = 2, etc.
look1 = 'EMPLOYEE'
look2 = 'CHESTER'
look3 = "TONY'S"
look4 = "VICTOR'S"
tgt1 = 'inv_group'
tgt2 = 'acc_num'

for x in range(len(df['ph_name'])):
    if df['ph_name'][x] == look1:
        df[tgt1][x] = 'MEMORIAL'
        df[tgt2][x] = 12345
    elif df['ph_name'][x] == look2:
        df[tgt1][x] = 'WALMART'
        df[tgt2][x] = 45678
    elif df['ph_name'][x] == look3:
        df[tgt1][x] = 'TONYS'
        df[tgt2][x] = 27359
    elif df['ph_name'][x] == look4:
        df[tgt1][x] = 'VICTOR'
        df[tgt2][x] = 45378
basic sample:
unit_name tgt1 tgt2
0 EMPLOYEE NaN NaN
1 EMPLOYEE NaN NaN
2 TONY'S NaN NaN
3 CHESTER NaN NaN
4 VICTOR'S NaN NaN
5 EMPLOYEE NaN NaN
GOAL:
unit_name tgt1 tgt2
0 EMPLOYEE MEMORIAL 12345
1 EMPLOYEE MEMORIAL 12345
2 TONY'S TONYS 27359
3 CHESTER WALMART 45678
4 VICTOR'S VICTOR 45378
5 EMPLOYEE MEMORIAL 12345
So this works... I get the custom column values added. It's not the fastest under the sun, but it works.
It takes 6.2429744 on 28,896 rows; I'm concerned that when I put it to the grind, it's going to start dragging me down.
The other downside is that I get this annoyance... Yes, I can silence it, but I feel like it might be due to a bad practice that I should know how to curtail:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Basically...
Is there a way to optimize this?
Is this warning due to a bad habit, my ignorance, or do I just need to silence it?
Given: (It's silly to have all NaN columns)
unit_name
0 EMPLOYEE
1 EMPLOYEE
2 TONY'S
3 CHESTER
4 VICTOR'S
5 EMPLOYEE
df = pd.DataFrame({'unit_name': {0: 'EMPLOYEE', 1: 'EMPLOYEE', 2: "TONY'S", 3: 'CHESTER', 4: "VICTOR'S", 5: 'EMPLOYEE'}})
Doing: (Let's use pd.Series.map and create a dictionary for easier future modification)
looks = ['EMPLOYEE', 'CHESTER', "TONY'S", "VICTOR'S"]
new_cols = {
    'inv_group': ["MEMORIAL", "WALMART", "TONYS", "VICTOR"],
    'acc_num': [12345, 45678, 27359, 45378]
}
for col, values in new_cols.items():
    df[col] = df['unit_name'].map(dict(zip(looks, values)))
print(df)
Output: (I assumed you'd typed the column names wrong)
unit_name inv_group acc_num
0 EMPLOYEE MEMORIAL 12345
1 EMPLOYEE MEMORIAL 12345
2 TONY'S TONYS 27359
3 CHESTER WALMART 45678
4 VICTOR'S VICTOR 45378
5 EMPLOYEE MEMORIAL 12345
Flying blind here since I don't see your data:
import numpy as np

cond_list = [df["ph_name"] == look for look in [look1, look2, look3, look4]]
# Rows whose ph_name is outside of the list keep their original values
# (the default= arguments assume the target columns already exist)
df[tgt1] = np.select(cond_list, ["MEMORIAL", "WALMART", "TONYS", "VICTOR"], default=df[tgt1])
df[tgt2] = np.select(cond_list, [12345, 45678, 27359, 45378], default=df[tgt2])

Python pandas concat to multi index groupby

I'm new to pandas and I need help. I have the two following reports, which are quite simple.
$ cat test_report1
ID;TYPE;VAL
1;USD;5
2;EUR;10
3;PLN;3
$ cat test_report2
ID;TYPE;VAL
1;USD;5
2;EUR;10
3;PLN;1
Then I'm using concat to join the two reports on a unique index:
A=pd.read_csv('test_report1', delimiter=';', index_col=False)
B=pd.read_csv('test_report2', delimiter=';', index_col=False)
C=pd.concat([A.set_index('ID'), B.set_index('ID')], axis=1, keys=['PRE','POST'])
print(C)
Which gives me following output:
PRE POST
TYPE VAL TYPE VAL
ID
1 USD 5 USD 5
2 EUR 10 EUR 10
3 PLN 3 PLN 1
I find this pretty good, but I would actually rather have:
STATE TYPE VAL
ID
1 PRE USD 5
POST USD 5
2 PRE EUR 10
POST EUR 10
3 PRE PLN 3
POST PLN 1
Then it would be perfect with a diff like:
STATE TYPE VAL
ID
1 PRE NaN NaN
POST NaN NaN
2 PRE NaN NaN
POST NaN NaN
3 PRE PLN 3
POST PLN 1
I know this is doable, but I've been stuck digging for a solution for three days.
Use DataFrame.rename_axis with DataFrame.stack, and then sort the levels of the MultiIndex:
df = (df.rename_axis(['STATE', None], axis=1)
        .stack(0)
        .sort_index(level=[0, 1], ascending=[True, False])
      )
print(df)
TYPE VAL
ID STATE
1 PRE USD 5
POST USD 5
2 PRE EUR 10
POST EUR 10
3 PRE PLN 3
POST PLN 1
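For the diff part of the question, one possible follow-up (a sketch, not part of the original answer) is to blank out the IDs whose PRE and POST rows are identical, using the wide frame C from the question to find them (df here is the stacked frame produced above):
import numpy as np

unchanged = C['PRE'].eq(C['POST']).all(axis=1)   # True for IDs where PRE == POST
unchanged_ids = unchanged[unchanged].index
diff = df.copy()
diff.loc[diff.index.get_level_values('ID').isin(unchanged_ids), :] = np.nan
print(diff)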

Update a dataframe based on values from another; a traditional UPSERT task with a new indicator column

I am trying to do an UPSERT task over two dataframes.
Here I am updating df2 with df1.
I have used something like this:
final_df = df1.set_index('EmpID').combine_first(df2.set_index('EmpID'))
final_df = final_df.reset_index()
My result here is:
EmpID Name Salary Status
0 A John 1000.0 Left
1 B Mary 2000.0 Working
2 C Samie 3000.0 Left
3 D Doe 4000.0 NaN
4 E Lance 2500.0 Contractor
Also, I am not able to add the 'Indicator' column.
I did this and almost achieved my goal, but is there a better way? And what should I do about inserting that column?
df=pd.concat([df1, df2[~df2.EmpID.isin(df1.EmpID)]])
df=df.set_index('EmpID').join(df2.set_index('EmpID'),how='outer',rsuffix='_R')
df[['Name','Salary','Status_R']].reset_index()
EmpID Name Salary Status_R
0 A John 1000.0 Left
1 B Mary 2000.0 Working
2 C Samie NaN Left
3 D Doe 4000.0 NaN
4 E Lance 2500.0 Contractor
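Since the question does not show df1 and df2, here is a hedged sketch of one way to add an indicator column, using merge(indicator=True) on hypothetical inputs that roughly reproduce the output above (the input data and the indicator labels are assumptions, not from the original question):
import pandas as pd

# Hypothetical inputs (assumed; the original question does not show them)
df1 = pd.DataFrame({'EmpID': ['A', 'C', 'E'],
                    'Name': ['John', 'Samie', 'Lance'],
                    'Salary': [1000.0, 3000.0, 2500.0],
                    'Status': ['Left', 'Left', 'Contractor']})
df2 = pd.DataFrame({'EmpID': ['A', 'B', 'C', 'D'],
                    'Name': ['John', 'Mary', 'Sam', 'Doe'],
                    'Salary': [1000.0, 2000.0, None, 4000.0],
                    'Status': ['Working', 'Working', 'Working', None]})

# Upsert: values from df1 win; rows only present in df2 are kept
final_df = (df1.set_index('EmpID')
               .combine_first(df2.set_index('EmpID'))
               .reset_index())

# Indicator: was the EmpID in both frames, only in df1, or only in df2?
flags = df1[['EmpID']].merge(df2[['EmpID']], on='EmpID', how='outer', indicator=True)
final_df = final_df.merge(flags, on='EmpID', how='left')
final_df['Indicator'] = final_df['_merge'].map(
    {'both': 'updated', 'left_only': 'inserted', 'right_only': 'unchanged'})
final_df = final_df.drop(columns='_merge')
print(final_df)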

Pandas, Dataframe, conditional sum of column for each row

I am new to Python and am trying to move some of my work from Excel to Python. I want an Excel SUMIFS equivalent in pandas, for example something like:
SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0)
In my case, I have 6 columns: a unique trade ID, an issuer, a trade date, a release date, a trader, and a quantity. I want to add a column which shows the sum of the quantity available for release at each row, something like the below:
A B C D E F G
ID Issuer TradeDate ReleaseDate Trader Quantity SumOfAvailableRelease
1 Horse 1/1/2012 13/3/2012 Amy 7 0
2 Horse 2/2/2012 15/5/2012 Dave 2 0
3 Horse 14/3/2012 NaN Dave -3 7
4 Horse 16/5/2012 NaN John -4 9
5 Horse 20/5/2012 10/6/2012 John 2 9
6 Fish 6/6/2013 20/6/2013 John 11 0
7 Fish 25/6/2013 9/9/2013 Amy 4 11
8 Fish 8/8/2013 15/9/2013 Dave 5 11
9 Fish 25/9/2013 NaN Amy -3 20
Usually, in Excel, I just pull the SUMIFS formula down the whole column and it works; I am not sure how I can do this in Python.
Many thanks!
What you could do is use df.where.
So, for example, you could say:
Qdf = df.where(df["Quantity"] >= 5)
and then do your sum. I don't know exactly what you want, since I have no knowledge of Excel, but I hope this helps.
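For a closer SUMIFS equivalent, here is a hedged sketch of a row-wise approach, assuming the columns shown in the question and that TradeDate/ReleaseDate are already parsed as datetimes (rows with a missing ReleaseDate are simply excluded by the comparison):
# For each row, sum the quantities of the same issuer whose release date is
# on or before this row's trade date and whose quantity is positive.
def sum_available(row, frame):
    mask = ((frame['Issuer'] == row['Issuer'])
            & (frame['ReleaseDate'] <= row['TradeDate'])
            & (frame['Quantity'] > 0))
    return frame.loc[mask, 'Quantity'].sum()

df['SumOfAvailableRelease'] = df.apply(sum_available, axis=1, args=(df,))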

Chained conditional count in Pandas

I have a dataframe that looks at how a form has been filled out. Here's an example:
ID Name Postcode Street Employer Salary
1 John NaN Craven Road NaN NaN
2 Sue TD2 NAN NaN 15000
3 Jimmy MW6 Blake Street Bank 40000
4 Laura QE2 Mill Lane NaN 20000
5 Sam NW2 Duke Avenue Farms 35000
6 Jordan SE6 NaN NaN NaN
7 NaN CB2 NaN Startup NaN
I want to return a count of successively filled out columns on the condition that all previous columns have been filled. The final output should look something like:
Name Postcode Street Employer salary
6 5 3 2 2
Is there a good pandas way of doing this? I suppose there could be a way of applying a mask so that if any previous boolean is zero, the current column is also zero, and then counting that, but I'm not sure if that is the best way.
Thanks!
I think you can use notnull and cummin:
In [99]: df.notnull().cummin(axis=1).sum(axis=0)
Out[99]:
Name 6
Postcode 5
Street 3
Employer 2
Salary 2
dtype: int64
Although note that I had to replace your NAN (Sue's street) with a float NaN before I did that, and I assumed that ID was your index.
The cumulative minimum is one way to implement "applying a mask so that if any previous boolean is given as zero the current column is also zero", as you predicted would work.
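The masking effect can be seen on a single row (a small illustration, with pandas imported as pd):
row = pd.Series([True, True, False, True])
print(row.cummin().tolist())   # [True, True, False, False]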
Maybe cumprod. BTW, you have the string 'NAN' in your df, so I treat it as not-null here:
df.notnull().cumprod(1).sum()
Out[59]:
ID 7
Name 6
Postcode 5
Street 4
Employer 2
Salary 2
dtype: int64
