I have this dataset:
Account  lookup  FY11USD  FY12USD  FY11local  FY12local
Sales    CA      1000     5000     800        4800
Sales    JP      5000     6500     10         15
I'm trying to get the data into this format (the example below has two years of data, but the number of years can vary):
Account  lookup  Year  USD   Local
Sales    CA      FY11  1000  800
Sales    CA      FY12  5000  4800
Sales    JP      FY11  5000  10
Sales    JP      FY12  6500  15
I tried the script below, but it doesn't separate USD and local values for the same year. How should I go about that?
df.melt(id_vars=["Account", "lookup"],
        var_name="Year",
        value_name="Value")
You can piece it together by melting the USD and local columns separately and concatenating the results; both melts emit rows in the same order, so the Local column lines up with the USD rows:
dfn = pd.concat(
    [df[["Account", "lookup", "FY11USD", "FY12USD"]]
        .melt(id_vars=["Account", "lookup"], var_name="Year", value_name="USD"),
     df[["Account", "lookup", "FY11local", "FY12local"]]
        .melt(id_vars=["Account", "lookup"], var_name="Year", value_name="Local")[["Local"]]],
    axis=1)
dfn['Year'] = dfn['Year'].str[:4]
Output
Account lookup Year USD Local
0 Sales CA FY11 1000 800
1 Sales JP FY11 5000 10
2 Sales CA FY12 5000 4800
3 Sales JP FY12 6500 15
One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value placeholder; .value determines which parts of the column names remain as headers:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
    index=['Account', 'lookup'],
    names_to=('Year', '.value'),
    names_pattern=r"(FY\d+)(.+)")
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales JP FY11 5000 10
2 Sales CA FY12 5000 4800
3 Sales JP FY12 6500 15
Another option is to use stack:
temp = df.set_index(['Account', 'lookup'])
# split each header into the FY part and the currency part
temp.columns = temp.columns.str.split(r'(FY\d+)', expand=True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales CA FY12 5000 4800
2 Sales JP FY11 5000 10
3 Sales JP FY12 6500 15
You can also pull it off with pd.wide_to_long after reshaping the columns:
index = ['Account', 'lookup']
temp = df.set_index(index)
# move the FY part to the end of each header, e.g. 'FY11USD' -> 'USDFY11'
temp.columns = (temp
                .columns
                .str.split(r'(FY\d+)')
                .str[::-1]
                .str.join('')
                )
(pd.wide_to_long(
    temp.reset_index(),
    stubnames=['USD', 'local'],
    i=index,
    j='Year',
    suffix='.+')
 .reset_index()
)
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales CA FY12 5000 4800
2 Sales JP FY11 5000 10
3 Sales JP FY12 6500 15
I have a data frame with missing values:
import pandas as pd
data = {'Brand':['residential','unclassified','tertiary','residential','unclassified','primary','residential'],
'Price': [22000,25000,27000,"NA","NA",10000,"NA"]
}
df = pd.DataFrame(data, columns = ['Brand', 'Price'])
print (df)
Resulting in this data frame:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential NA
4 unclassified NA
5 primary 10000
6 residential NA
I would like to fill in the missing values for residential and unclassified in the Price column with fixed values (residential = 1000, unclassified = 2000). However, I don't want to lose any values that are already present in the Price column for residential or unclassified, so the output should look like this:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential 1000
4 unclassified 2000
5 primary 10000
6 residential 1000
What's the easiest way to get this done?
We can do this with map plus fillna. PS: you need to make sure that in your df the "NA" entries are actually NaN:
df.Price.fillna(df.Brand.map({'residential':1000,'unclassified':2000}),inplace=True)
df
Brand Price
0 residential 22000.0
1 unclassified 25000.0
2 tertiary 27000.0
3 residential 1000.0
4 unclassified 2000.0
5 primary 10000.0
6 residential 1000.0
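Since the sample data above stores the missing entries as the literal string "NA" rather than NaN, a small sketch for converting them first (assuming the Price column should end up numeric) could be:

import numpy as np
import pandas as pd

# Coerce the "NA" strings to real NaN so that fillna can act on them
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
# or, keeping the column as object: df['Price'] = df['Price'].replace('NA', np.nan)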
I have a dataframe where I need to check, group by, and sum all the data.
I have used a regex to find and group the rows whose Countries value starts with the respective country name.
Suppose I have this dataset:
Countries 31-12-17 1-1-18 2-1-18 3-1-18 Sum
India-Basic 1200 1100 800 900 4000
Sweden-Basic 1500 1300 700 1500 5000
Norway-Basic 800 400 900 900 3000
India-Exp 600 1400 300 200 2500
Sweden-Exp 1800 400 600 700 3500
Norway-Exp 1300 1600 1100 1500 4500
Expected Output :
Countries Sum
India 6500
Sweden 8500
Norway 7500
For a regex solution, use Series.str.extract and aggregate with sum:
df1 = (df.groupby(df['Countries'].str.extract('(.*)-', expand=False), sort=False)['Sum']
         .sum()
         .reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
An alternative is to split Countries by - and select the first part of each split with str[0]:
df1 = (df.groupby(df['Countries'].str.split('-').str[0], sort=False)['Sum']
         .sum()
         .reset_index())
print (df1)
Countries Sum
0 India 6500
1 Sweden 8500
2 Norway 7500
This could work - note that I only kept the relevant columns:
(df.filter(['Countries', 'Sum'])
   .assign(Countries=lambda x: x.Countries.str.split('-').str.get(0))
   .groupby('Countries')
   .agg('sum')
)
Sum
Countries
India 6500
Norway 7500
Sweden 8500
I have a dataframe which looks as follows (more columns have been dropped):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with the existing shipping_country value for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure of the most efficient way to do this on a large-scale dataset. Perhaps a vectorized groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()

res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
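Note that ffill/bfill only treat real NaN as missing. If the blanks in your frame are empty strings, as the printed sample suggests, a small sketch for converting them first could be:

import numpy as np

# Turn the empty strings into NaN so that ffill/bfill see them as missing
df['shipping_country'] = df['shipping_country'].replace('', np.nan)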
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you, and it also has the behavior that if a memberID group only has empty string values ('') in shipping_country, those are retained in the output df:
import numpy as np

df['shipping_country'] = (df.replace('', np.nan)
                            .groupby('memberID')['shipping_country']
                            .transform('first')
                            .fillna(''))
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with `NaN` first (pd.np is no longer available in recent pandas, so use numpy directly):
import numpy as np
df['shipping_country'] = df['shipping_country'].replace('', np.nan)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'] = df['shipping_country'].replace('', np.nan)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN
Let's say I have a DataFrame that looks like this:
Bank Name House This Wk
Barc Germany 100
Barc UK 300
Barc UK 500
JPM Japan 200
JPM NYC 100
BOA LA 900
BOA LA 50
BOA LA 50
DB Italy 45
I would like to group by Bank Name while outputting the largest House value as well as the total value.
For example, using the example above would result in:
Bank Name Total House This Wk
Barc 900 UK 500
JPM 300 Japan 200
BOA 1000 LA 900
DB 45 Italy 45
Essentially, it groups the total by Bank Name, but also outputs the House that contributed the most to the total and the amount it contributed in This Wk.
How can I go about doing this?
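For reproducibility, the sample frame above can be constructed like this (a minimal sketch; values copied from the table):

import pandas as pd

df = pd.DataFrame({
    'Bank Name': ['Barc', 'Barc', 'Barc', 'JPM', 'JPM', 'BOA', 'BOA', 'BOA', 'DB'],
    'House': ['Germany', 'UK', 'UK', 'Japan', 'NYC', 'LA', 'LA', 'LA', 'Italy'],
    'This Wk': [100, 300, 500, 200, 100, 900, 50, 50, 45],
})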
In [121]: df.groupby('Bank Name', group_keys=False) \
...: .apply(lambda x: x.nlargest(1, 'This Wk').assign(Total=x['This Wk'].sum())) \
...: [['Bank Name','Total','House','This Wk']]
...:
Out[121]:
Bank Name Total House This Wk
5 BOA 1000 LA 900
2 Barc 900 UK 500
8 DB 45 Italy 45
3 JPM 300 Japan 200
You can consider df.groupby with a list of GroupBy.agg functions:
In [732]: out = df.groupby('Bank Name')['This Wk'].agg(['sum', 'idxmax', 'max'])\
                  .rename(columns={'sum': 'Total', 'idxmax': 'House', 'max': 'This Wk'})\
                  .reset_index()
In [734]: out['House'] = df.loc[out['House'], 'House'].values; out
Out[734]:
Bank Name Total House This Wk
0 BOA 1000 LA 900
1 Barc 900 UK 500
2 DB 45 Italy 45
3 JPM 300 Japan 200
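On newer pandas (0.25+), the same idea can be written with named aggregation; a sketch, with the House lookup done the same way as above:

out = (df.groupby('Bank Name')
         .agg(Total=('This Wk', 'sum'),
              House=('This Wk', 'idxmax'),
              **{'This Wk': ('This Wk', 'max')})
         .reset_index())
out['House'] = df.loc[out['House'], 'House'].values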
Another way using apply would be
In [17]: (df.groupby('Bank Name', sort=False)
            .apply(lambda x: pd.Series(
                [x['This Wk'].sum(),
                 x.loc[x['This Wk'].idxmax(), 'House'],
                 x['This Wk'].max()],
                index=['Total', 'House', 'This Wk']))
            .reset_index())
Out[17]:
Bank Name Total House This Wk
0 Barc 900 UK 500
1 JPM 300 Japan 200
2 BOA 1000 LA 900
3 DB 45 Italy 45
I need to combine two pandas dataframes where df1.date falls within the two months prior to df2.date. I then want to calculate how many traders had traded the same stock during that period and count the total shares purchased.
I have tried the approach listed below, but found it far too complicated. I believe there would be a smarter/simpler solution.
Pandas: how to merge two dataframes on offset dates?
A sample dataset is below:
DF1 (team_1):
date shares symbol trader
31/12/2013 154 FDX Max
30/06/2016 2367 GOOGL Max
21/07/2015 293 ORCL Max
18/07/2015 304 ORCL Sam
DF2 (team_2):
date shares symbol trader
23/08/2015 345 ORCL John
04/07/2014 567 FB John
06/12/2013 221 ACER Sally
31/11/2012 889 HP John
05/06/2010 445 ABBV Kate
Required output:
date shares symbol trader team_2_traders team_2_shares_bought
23/08/2015 345 ORCL John 2 597
04/07/2014 567 FB John 0 0
06/12/2013 221 ACER Sally 0 0
31/11/2012 889 HP John 0 0
05/06/2010 445 ABBV Kate 0 0
This adds 2 new columns...
'team_2_traders' = count of how many traders from team_1 traded the same stock during the previous 2 months from the date listed on DF2.
'team_2_shares_bought' = count of the total shares purchased by team_1 during the previous 2 months from the date listed on DF2.
If anyone is willing to give this a crack, please use the snippet below to set up the dataframes. Please keep in mind that the actual dataset contains millions of rows and 6,000 company stocks.
import pandas as pd

team_1 = {'symbol': ['FDX', 'GOOGL', 'ORCL', 'ORCL'],
          'date': ['31/12/2013', '30/06/2016', '21/07/2015', '18/07/2015'],
          'shares': [154, 2367, 293, 304],
          'trader': ['Max', 'Max', 'Max', 'Sam']}
df1 = pd.DataFrame(team_1)

team_2 = {'symbol': ['ORCL', 'FB', 'ACER', 'HP', 'ABBV'],
          'date': ['23/08/2015', '04/07/2014', '06/12/2013', '31/11/2012', '05/06/2010'],
          'shares': [345, 567, 221, 889, 445],
          'trader': ['John', 'John', 'Sally', 'John', 'Kate']}
df2 = pd.DataFrame(team_2)
Appreciate the help - thank you.
Please check my solution.
from pandas.tseries.offsets import MonthEnd

df_ = df2.merge(df1, on=['symbol'])
# the dates are day-first strings, so parse them with dayfirst=True
df_['date_x'] = pd.to_datetime(df_['date_x'], dayfirst=True)
df_['date_y'] = pd.to_datetime(df_['date_y'], dayfirst=True)
df_2m = df_[df_['date_x'] < df_['date_y'] + MonthEnd(2)] \
    .loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']] \
    .groupby('symbol')
df1_ = pd.concat([df_2m['shares_y'].sum(), df_2m['trader_y'].count()], axis=1)
print(df1_)
shares_y trader_y
symbol
ORCL 597 2
print(df2.merge(df1_.reset_index(), on='symbol', how='left').fillna(0))
date shares symbol trader shares_y trader_y
0 23/08/2015 345 ORCL John 597.0 2.0
1 04/07/2014 567 FB John 0.0 0.0
2 06/12/2013 221 ACER Sally 0.0 0.0
3 30/11/2012 889 HP John 0.0 0.0
4 05/06/2010 445 ABBV Kate 0.0 0.0
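Note that the filter above only caps how late a df1 trade may fall relative to the df2 date. If you also want to exclude df1 trades dated after the df2 date, i.e. keep only trades strictly within the previous two months, a sketch of the condition (assuming pd.DateOffset is an acceptable substitute for MonthEnd) could be:

from pandas.tseries.offsets import DateOffset

# date_x comes from df2 and date_y from df1 after the merge above
mask = (df_['date_y'] <= df_['date_x']) & \
       (df_['date_y'] >= df_['date_x'] - DateOffset(months=2))
df_2m = df_[mask].loc[:, ['date_y', 'shares_y', 'symbol', 'trader_y']].groupby('symbol')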