data manipulation example from wide to long in python

I've just posted a similar question here and got an answer, but I realised that when a new column is added to the DataFrame, the presented solution fails because the problem is a bit different.
I want to go from here:
import pandas as pd
df = pd.DataFrame({'ID': [1, 2],
                   'Value_2013': [100, 200],
                   'Value_2014': [245, 300],
                   'Value_2016': [200, float('NaN')]})
print(df)
   ID  Value_2013  Value_2014  Value_2016
0   1         100         245       200.0
1   2         200         300         NaN
to:
df_new = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                       'Year': [2013, 2014, 2016, 2013, 2014],
                       'Value': [100, 245, 200, 200, 300]})
print(df_new)
   ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014
Any ideas how I can tackle this?

The pandas.melt() method gets you halfway there. After that it's just some minor cleaning up.
df = pd.melt(df, id_vars='ID', var_name='Year', value_name='Value')
df['Year'] = df['Year'].map(lambda x: x.split('_')[1])
df = df.dropna().astype(int).sort_values(['ID', 'Year']).reset_index(drop=True)
df = df.reindex(columns=['ID', 'Value', 'Year'])  # reindex_axis was removed in pandas 1.0
print(df)
   ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014

You need to add set_index first:
df = df.set_index('ID')
df.columns = df.columns.str.split('_', expand=True)
df = df.stack().rename_axis(['ID','Year']).reset_index()
df.Value = df.Value.astype(int)
#if order of columns is important
df = df.reindex(columns=['ID','Value','Year'])  # reindex_axis was removed in pandas 1.0
print (df)
   ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014
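Note that Year comes out of the column split as a string here; to match the integer Year in the target frame, a small follow-up is:
df['Year'] = df['Year'].astype(int)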

Leveraging Multi Indexing in Pandas
import pandas as pd
df = pd.DataFrame({'ID': [1, 2],
                   'Value_2013': [100, 200],
                   'Value_2014': [245, 300],
                   'Value_2016': [200, float('NaN')]})
# Set ID column as index
df = df.set_index('ID')
# unstack all columns, swap the levels in the row index
# and convert the series to a DataFrame
df = df.unstack().swaplevel().to_frame().reset_index()
# Rename columns as desired
df.columns = ['ID', 'Year', 'Value']
# Transform the year values from Value_2013 --> 2013 and so on
df['Year'] = df['Year'].apply(lambda x: x.split('_')[1]).astype(int)  # np.int was removed in NumPy 1.24
# Sort by ID and drop the NaN row
df = df.sort_values(by='ID').reset_index(drop=True).dropna()
print(df)
   ID  Year  Value
0   1  2013  100.0
1   1  2014  245.0
2   1  2016  200.0
3   2  2013  200.0
4   2  2014  300.0
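Because the NaN row is only dropped at the very end, Value stays float in this output; a small follow-up sketch to get integers back, matching the target frame:
df['Value'] = df['Value'].astype(int)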

Another option is pd.wide_to_long(). Admittedly it doesn't give you exactly the same output but you can clean up as needed.
pd.wide_to_long(df, stubnames='Value', i='ID', j='Year', sep='_')
         Value
ID Year
1  2013  100.0
2  2013  200.0
1  2014  245.0
2  2014  300.0
1  2016  200.0
2  2016    NaN
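Since the clean-up is left open, one way to finish it (a sketch building on the call above) could be:
out = (pd.wide_to_long(df, stubnames='Value', i='ID', j='Year', sep='_')
         .dropna()
         .astype(int)
         .reset_index()
         .sort_values(['ID', 'Year'])
         .reset_index(drop=True))
print(out)
   ID  Year  Value
0   1  2013    100
1   1  2014    245
2   1  2016    200
3   2  2013    200
4   2  2014    300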

Yet another solution (two steps):
In [31]: x = df.set_index('ID').stack().astype(int).reset_index(name='Value')
In [32]: x
Out[32]:
   ID     level_1  Value
0   1  Value_2013    100
1   1  Value_2014    245
2   1  Value_2016    200
3   2  Value_2013    200
4   2  Value_2014    300
In [33]: x = x.assign(Year=x.pop('level_1').str.extract(r'(\d{4})', expand=False))
In [34]: x
Out[34]:
   ID  Value  Year
0   1    100  2013
1   1    245  2014
2   1    200  2016
3   2    200  2013
4   2    300  2014

One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="ID",
     names_to=(".value", "Year"),
     names_sep="_",
     sort_by_appearance=True)
 .dropna()
)
   ID  Year  Value
0   1  2013  100.0
1   1  2014  245.0
2   1  2016  200.0
3   2  2013  200.0
4   2  2014  300.0
The .value keeps the part of the column associated with it as header, while the rest goes into the Year column.
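For comparison, a rough plain-pandas sketch of the same reshape (not using pyjanitor) could be:
out = (df.melt(id_vars='ID', var_name='Year', value_name='Value')
         .assign(Year=lambda d: d['Year'].str.split('_').str[1].astype(int))
         .dropna(subset=['Value'])
         .sort_values(['ID', 'Year'], kind='stable')
         .reset_index(drop=True))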

Related

(Pandas) How to replace certain values of a column from a different dataset but leave other values in the dataset untouched?

Let's say I have the dataset:
df1 = pd.DataFrame()
df1['number'] = [0,0,0,0,0]
df1["decade"] = ["1970", "1980", "1990", "2000", "2010"]
print(df1)
#output:
   number decade
0       0   1970
1       0   1980
2       0   1990
3       0   2000
4       0   2010
and I want to merge it with another dataset:
df2 = pd.DataFrame()
df2['number'] = [1,1]
df2["decade"] = ["1990", "2010"]
print(df2)
#output:
   number decade
0       1   1990
1       1   2010
such that it gets values only for the decades that have values in df2 and leaves the others untouched, yielding:
   number decade
0       0   1970
1       0   1980
2       1   1990
3       0   2000
4       1   2010
How must one go about doing that in pandas? I've tried join, merge, and concat, but they all seem to either not give the desired result or not work because of the different dimensions of the two datasets. Any suggestions as to which function I should be looking at?
Thank you so much!
You can use pandas.DataFrame.merge with pandas.Series.fillna:
out = (
    df1[["decade"]]
    .merge(df2, on="decade", how="left")
    .fillna({"number": df1["number"]}, downcast="infer")
)
# Output:
print(out)
  decade  number
0   1970       0
1   1980       0
2   1990       1
3   2000       0
4   2010       1
What about using apply?
First, create a function:
def validation(previous, latest):
    if pd.isna(latest):
        return previous
    else:
        return latest
Then you can use DataFrame.apply to compare the data in df1 with df2:
df1['number'] = df1.apply(lambda row: validation(row['number'],
                                                 df2.loc[df2['decade'] == row.decade].number.max()),
                          axis=1)
Your result:
   number decade
0       0   1970
1       0   1980
2       1   1990
3       0   2000
4       1   2010
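Another possible sketch, assuming the decade values are unique in both frames, is pandas.DataFrame.update, which overwrites df1 only where df2 has a matching decade:
out = df1.set_index('decade')
out.update(df2.set_index('decade'))              # non-NA values from df2 win
out = out.reset_index()[['number', 'decade']].astype({'number': int})
print(out)
   number decade
0       0   1970
1       0   1980
2       1   1990
3       0   2000
4       1   2010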

Reduce output time - count with groupby (python, pandas)

I have a loop that calculates the "total_count" of a group of elements over multiple periods. Is there a way to optimize the script so it runs faster? The dataframe is 33 MB and a single loop iteration takes over 300 ms. The actual script runs over 50k loops, which takes over 2 days to complete.
#sample df with similar output time
df = pd.DataFrame(np.random.randint(3, size=(400000,1)), columns=['type'])
df['class'] = np.random.randint(1, 7, df.shape[0])
df['country'] = np.random.randint(1, 12, df.shape[0])
df['period'] = np.random.randint(2010, 2018, df.shape[0])
df['season'] = np.random.randint(1, 4, df.shape[0])
%%time
#period
tr1_sta = 2011
tr1_end = 2016
h0 = 'type'
h1 = 'class'
h2 = 'country'
holder = [h0, h1, h2]
df = (df.set_index(holder)
        .assign(tr1_tc=df[df['period'].between(tr1_sta, tr1_end)]
                       .groupby(holder)['season'].count())
        .reset_index())
Kindly advise
Thank you
Your code took 75.1 ms on my computer. The code below runs in 14.6 ms:
df['tr1_tc'] = (df.assign(tr1_tc=df['period'].between(tr1_sta, tr1_end))
                  .groupby(holder)['tr1_tc'].transform('sum'))
print(df)
# Output
        type  class  country  period  season  tr1_tc
0          1      5        1    2016       2    1343
1          1      5        9    2014       1    1302
2          2      3        4    2013       2    1299
3          0      6        1    2014       1    1326
4          2      4        5    2012       3    1367
...      ...    ...      ...     ...     ...     ...
349995     2      1        3    2010       3    1332
349996     0      1        8    2015       1    1362
349997     1      3        1    2013       3    1283
349998     1      6        7    2015       3    1305
349999     0      6        9    2017       2    1250
[350000 rows x 6 columns]
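If the 50k iterations only differ in the period window (an assumption about your real script), one way to avoid re-grouping the whole frame on every pass is to count rows per group and period once, then answer each window from that table. A rough sketch:
# one-off: rows per (type, class, country) and period
counts = (df.groupby(holder + ['period'])['season'].count()
            .unstack('period', fill_value=0)
            .sort_index(axis=1))

def window_count(start, end):
    # rows per group whose period falls in [start, end]
    win = counts.loc[:, start:end].sum(axis=1).rename('win_tc')
    return df.join(win, on=holder)['win_tc']

df['tr1_tc'] = window_count(tr1_sta, tr1_end)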

How to get initial row's indexes from df.groupby?

I have a df:
print(df):
   date  value  other_columns
0  1995      5
1  1995     13
2  1995    478
and so on...
After grouping it by date with df1 = df.groupby(by='date')['value'].min(),
I wonder how to get the initial row's index. In this case I want to get the integer 0, because that row holds the lowest value for 1995. Thank you in advance.
You have to create a column with the index value before doing the groupby:
df['initialIndex'] = df.index.values
#do the groupby
I think this is what you mean: you actually want the original dataframe with only the rows that hold the minimal value within each group.
For this you can use the pandas transform method:
>>> df = pd.DataFrame({'date': [1995, 1995, 1995, 2000, 2000, 2000], 'value': [5, 13, 478, 7, 1, 8]})
>>> df
   date  value
0  1995      5
1  1995     13
2  1995    478
3  2000      7
4  2000      1
5  2000      8
>>> minimal_value = df.groupby(['date'])['value'].transform(min)
>>> minimal_value
0      5
1      5
2      5
3      1
4      1
5      1
Name: value, dtype: int64
Now you can use this to get only the relevant rows:
>>> df.loc[df['value'] == minimal_value]
   date  value
0  1995      5
4  2000      1
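If what you need is literally the original row label of each group's minimum (0 for 1995 and 4 for 2000 here), groupby's idxmin returns it directly:
>>> df.groupby('date')['value'].idxmin()
date
1995    0
2000    4
Name: value, dtype: int64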

How to duplicate entries in a dataframe

I have a dataframe of the form:
df2 = pd.DataFrame({'Date': np.array([2018,2017,2016,2015]),
                    'Rev': np.array([4000,5000,6000,7000]),
                    'Other': np.array([0,0,0,0]),
                    'High': np.array([75.11,70.93,48.63,43.59]),
                    'Low': np.array([60.42,45.74,34.15,33.12]),
                    'Mean': np.array([67.765,58.335,41.390,39.355])  # mean of high/low columns
                    })
This looks like:
I want to convert this dataframe to something that looks like:
Basically, you copy each row two more times, then stack the high, low, and mean values column-wise under a 'price' column. Then you add a new 'category' column that keeps track of which value came from high/low/mean (0 meaning high, 1 meaning low, and 2 meaning mean).
This is a simple melt (wide to long) problem:
# convert df2 from wide to long, melting the High, Low and Mean cols
df3 = df2.melt(df2.columns.difference(['High', 'Low', 'Mean']).tolist(),
               var_name='category',
               value_name='price')
# remap "category" to integer
df3['category'] = pd.factorize(df3['category'])[0]
# sort and display
df3.sort_values('Date', ascending=False)
    Date  Other   Rev  category   price
0   2018      0  4000         0  75.110
4   2018      0  4000         1  60.420
8   2018      0  4000         2  67.765
1   2017      0  5000         0  70.930
5   2017      0  5000         1  45.740
9   2017      0  5000         2  58.335
2   2016      0  6000         0  48.630
6   2016      0  6000         1  34.150
10  2016      0  6000         2  41.390
3   2015      0  7000         0  43.590
7   2015      0  7000         1  33.120
11  2015      0  7000         2  39.355
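Note that pd.factorize assigns integer codes in order of first appearance (High → 0, Low → 1, Mean → 2 here); if you want the mapping to be independent of column order, an explicit map of the melted labels (a sketch, replacing the factorize line) works too:
df3['category'] = df3['category'].map({'High': 0, 'Low': 1, 'Mean': 2})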
Instead of melt, you can use stack, which saves you the sort_values:
new_df = (df2.set_index(['Date','Rev', 'Other'])
             .stack()
             .to_frame(name='price')
             .reset_index()
          )
output:
    Date   Rev  Other level_3   price
0   2018  4000      0    High  75.110
1   2018  4000      0     Low  60.420
2   2018  4000      0    Mean  67.765
3   2017  5000      0    High  70.930
4   2017  5000      0     Low  45.740
5   2017  5000      0    Mean  58.335
6   2016  6000      0    High  48.630
7   2016  6000      0     Low  34.150
8   2016  6000      0    Mean  41.390
9   2015  7000      0    High  43.590
10  2015  7000      0     Low  33.120
11  2015  7000      0    Mean  39.355
and if you want the category column:
new_df['category'] = new_df['level_3'].map({'High': 0, 'Low': 1, 'Mean': 2})
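And to mirror the melt answer's columns exactly, a small follow-up sketch drops the helper column and reorders:
new_df = new_df.drop(columns='level_3')[['Date', 'Other', 'Rev', 'category', 'price']]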
Here's another version:
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'Date': np.array([2018,2017,2016,2015]),
                    'Rev': np.array([4000,5000,6000,7000]),
                    'Other': np.array([0,0,0,0]),
                    'High': np.array([75.11,70.93,48.63,43.59]),
                    'Low': np.array([60.42,45.74,34.15,33.12]),
                    'Mean': np.array([67.765,58.335,41.390,39.355])  # mean of high/low columns
                    })
#create one dataframe per category
df_high = df2[['Date', 'Other', 'Rev', 'High']]
df_mean = df2[['Date', 'Other', 'Rev', 'Mean']]
df_low = df2[['Date', 'Other', 'Rev', 'Low']]
#rename the category column to price
df_high = df_high.rename(index = str, columns = {'High': 'price'})
df_mean = df_mean.rename(index = str, columns = {'Mean': 'price'})
df_low = df_low.rename(index = str, columns = {'Low': 'price'})
#create new category column
df_high['category'] = 0
df_mean['category'] = 2
df_low['category'] = 1
#concatenate the dataframes together
frames = [df_high, df_mean, df_low]
df_concat = pd.concat(frames)
#sort values per example
df_concat = df_concat.sort_values(by = ['Date', 'category'], ascending = [False, True])
#print result
print(df_concat)
Result:
   Date  Other   Rev   price  category
0  2018      0  4000  75.110         0
0  2018      0  4000  60.420         1
0  2018      0  4000  67.765         2
1  2017      0  5000  70.930         0
1  2017      0  5000  45.740         1
1  2017      0  5000  58.335         2
2  2016      0  6000  48.630         0
2  2016      0  6000  34.150         1
2  2016      0  6000  41.390         2
3  2015      0  7000  43.590         0
3  2015      0  7000  33.120         1
3  2015      0  7000  39.355         2

Merge pandas Data Frames based on conditions

I have two files which show information about transactions over products:
Operations of type 1
d_op_1 = pd.DataFrame({'id': [1,1,1,2,2,2,3,3],
                       'cost': [10,20,20,20,10,20,20,20],
                       'date': [2000,2006,2012,2000,2009,2009,2002,2006]})
Operations of type 2
d_op_2 = pd.DataFrame({'id': [1,1,2,2,3,4,5,5],
                       'cost': [3000,3100,3200,4000,4200,3400,2000,2500],
                       'date': [2010,2015,2008,2010,2006,2010,1990,2000]})
I want to keep only those records where there has been an operation of type 1 between two operations of type 2.
E.g. for the product with the id "1" there was an operation of type 1 (2012) between two operations of type 2 (2010, 2015), so I want to keep that record.
The desired output could be either this:
or this:
Using pd.merge() I got this result:
How can I filter this to get the desired output?
You can use:
#concat DataFrames together
df4 = pd.concat([d_op_1.rename(columns={'cost':'cost1'}),
                 d_op_2.rename(columns={'cost':'cost2'})]).fillna(0).astype(int)
#print (df4)
#find min and max dates per group
df3 = d_op_2.groupby('id')['date'].agg(start='min', end='max')  # dict renaming in agg is no longer supported
#print (df3)
#join min and max dates to the concatenated df
df = df4.join(df3, on='id')
df = df[(df.date > df.start) & (df.date < df.end)]
#reshape df for min, max and the dates between them
df = pd.melt(df,
             id_vars=['id','cost1'],
             value_vars=['date','start','end'])
#remove helper columns and rename the melted values back to date
df = df.drop(['cost1','variable'], axis=1) \
       .drop_duplicates() \
       .rename(columns={'value':'date'})
#merge to original, sorting
df = pd.merge(df, df4, on=['id', 'date']) \
       .sort_values(['id','date']).reset_index(drop=True)
#reorder columns
df = df[['id','cost1','cost2','date']]
print (df)
   id  cost1  cost2  date
0   1      0   3000  2010
1   1     20      0  2012
2   1      0   3100  2015
3   2      0   3200  2008
4   2     10      0  2009
5   2     20      0  2009
6   2      0   4000  2010
#if need lists for duplicates
df = df.groupby(['id','cost2', 'date'])['cost1'] \
       .apply(lambda x: list(x) if len(x) > 1 else x.values[0]) \
       .reset_index()
df = df[['id','cost1','cost2','date']]
print (df)
   id     cost1  cost2  date
0   1        20      0  2012
1   1         0   3000  2010
2   1         0   3100  2015
3   2  [10, 20]      0  2009
4   2         0   3200  2008
5   2         0   4000  2010
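If you only need the type-1 rows themselves, and assuming "between two operations of type 2" means strictly between the earliest and latest type-2 date for the same id (which is how the answer above reads it), a shorter sketch is:
bounds = d_op_2.groupby('id')['date'].agg(start='min', end='max')
kept = d_op_1.join(bounds, on='id')
kept = kept[kept['date'].gt(kept['start']) & kept['date'].lt(kept['end'])]
print(kept[['id', 'cost', 'date']])
   id  cost  date
2   1    20  2012
4   2    10  2009
5   2    20  2009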
