Data augmentation with pandas - python

I'm doing some data augmentation on my data.
Basically it looks like this:
country. size. price. product
CA. 1. 3.99. 12
US. 1. 2.99. 12
BR. 1. 10.99. 13
What I want to do is this: because the size is fixed to 1, I want to add 3 more sizes per country, per product, and increase the price accordingly. So if the size is 2, then the price is the price for 1 times 2, etc.
So basically, I'm looking for this:
country. size. price. product
CA. 1. 3.99. 12
CA. 2. 7.98. 12
CA. 3. 11.97. 12
CA. 4. 15.96. 12
US. 1. 2.99. 12
US. 2. 5.98. 12
US. 3. 8.97. 12
US. 4. 11.96. 12
BR. 1. 10.99. 13
BR. 2. 21.98. 13
BR. 3. 32.97. 13
BR. 4. 43.96. 13
What is a good way to do this with pandas?
I tried doing it in a loop with iterrows(), but that wasn't a fast solution for my data. Am I missing something?

Use Index.repeat to add new rows, then compute a running total with GroupBy.cumsum, add a counter with GroupBy.cumcount, and finally reset the index to get a default unique one:
# repeat every original row 4 times; the repeated rows keep the original index
df = df.loc[df.index.repeat(4)]
# counter 1..4 within each original row becomes the new size
df['size'] = df.groupby(level=0).cumcount().add(1)
# running total of the unit price within each original row gives the scaled price
df['price'] = df.groupby(level=0)['price'].cumsum()
df = df.reset_index(drop=True)
print(df)
country size price product
0 CA 1 3.99 12
1 CA 2 7.98 12
2 CA 3 11.97 12
3 CA 4 15.96 12
4 US 1 2.99 12
5 US 2 5.98 12
6 US 3 8.97 12
7 US 4 11.96 12
8 BR 1 10.99 13
9 BR 2 21.98 13
10 BR 3 32.97 13
11 BR 4 43.96 13
Another idea without cumcount, but with numpy.tile:
import numpy as np

add = 3
# repeat every original row (add + 1) times
df1 = df.loc[df.index.repeat(add + 1)]
# tile the sizes 1..add+1 across all original rows
df1['size'] = np.tile(range(1, add + 2), len(df))
# running total of the unit price within each original row
df1['price'] = df1.groupby(level=0)['price'].cumsum()
df1 = df1.reset_index(drop=True)
print(df1)
country size price product
0 CA 1 3.99 12
1 CA 2 7.98 12
2 CA 3 11.97 12
3 CA 4 15.96 12
4 US 1 2.99 12
5 US 2 5.98 12
6 US 3 8.97 12
7 US 4 11.96 12
8 BR 1 10.99 13
9 BR 2 21.98 13
10 BR 3 32.97 13
11 BR 4 43.96 13

Construct 2 columns using assign and lambda:
import numpy as np

s = np.tile(np.arange(4), df.shape[0])
df_final = df.loc[df.index.repeat(4)].assign(size=lambda x: x['size'] + s,
                                             price=lambda x: x['price'] * (s + 1))
Out[90]:
country size price product
0 CA 1.0 3.99 12
0 CA 2.0 7.98 12
0 CA 3.0 11.97 12
0 CA 4.0 15.96 12
1 US 1.0 2.99 12
1 US 2.0 5.98 12
1 US 3.0 8.97 12
1 US 4.0 11.96 12
2 BR 1.0 10.99 13
2 BR 2.0 21.98 13
2 BR 3.0 32.97 13
2 BR 4.0 43.96 13

Since the size is always 1, you basically only need to multiply size and price by a constant factor. You can do this directly, write the result into a separate DataFrame, and then use pd.concat to join them together:
In [20]: df2 = pd.concat([df[['country.', 'product']], df[['size.', 'price.']] * 2], axis=1)
In [21]: pd.concat([df, df2])
Out[21]:
country. size. price. product
0 CA. 1.0 3.99 12
1 US. 1.0 2.99 12
2 BR. 1.0 10.99 13
0 CA. 2.0 7.98 12
1 US. 2.0 5.98 12
2 BR. 2.0 21.98 13
To augment some more, simply loop over all desired sizes:
In [22]: list_of_dfs = []
In [23]: list_of_dfs.append(df)
In [24]: for size in range(2, 5):
    ...:     list_of_dfs.append(pd.concat([df[['country.', 'product']], df[['size.', 'price.']] * size], axis=1))
    ...:
In [25]: pd.concat(list_of_dfs)
Out[25]:
country. size. price. product
0 CA. 1.0 3.99 12
1 US. 1.0 2.99 12
2 BR. 1.0 10.99 13
0 CA. 2.0 7.98 12
1 US. 2.0 5.98 12
2 BR. 2.0 21.98 13
0 CA. 3.0 11.97 12
1 US. 3.0 8.97 12
2 BR. 3.0 32.97 13
0 CA. 4.0 15.96 12
1 US. 4.0 11.96 12
2 BR. 4.0 43.96 13
This is a relatively naive approach, but should work fine in your case and makes good use of vectorization under the hood.
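For completeness, a fully vectorized alternative is a cross join against a small table of multipliers. This is only a sketch, assuming pandas 1.2+ (for how='cross') and the single-word column names shown in the first answer's output:
import pandas as pd

# helper table with the desired multipliers (assumption: sizes 1 to 4)
sizes = pd.DataFrame({'multiplier': [1, 2, 3, 4]})

# pair every original row with every multiplier, then scale size and price
out = df.merge(sizes, how='cross')
out['size'] = out['size'] * out['multiplier']
out['price'] = out['price'] * out['multiplier']
out = out.drop(columns='multiplier')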

Related

Panel Data - dealing with missing year when creating lead and lag variables

I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general look of panel data is as follows:
df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001, 2002, 2004, 2005] + [2000, 2002, 2003] + [2001, 2002, 2003, 2005],
                   'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'val2': [2, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down by 1 row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years first:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I use a normal shift. However, my data is quite large, and this approach often triples the data size, which is not very efficient here.
I am looking for a better approach for this scenario.
One solution is to create check columns that flag whether the previous/next year is consecutive within each name. Set the check column to 1.0 or NaN, then multiply it with your normal groupby shift:
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
You can compare this with the merge method used in the answer below; it is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
Don't use shift; use a merge on the year ± 1 instead:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
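If you need lags or leads at other horizons, the same merge trick generalizes. A minimal sketch with a hypothetical helper (the function name and the k parameter are illustrative, and it assumes df keeps its default RangeIndex as in the question):
def shifted_by_years(df, col, k):
    # Value of col from k years earlier within each name (use a negative k for a lead).
    # A row for year y is relabelled y + k, so it matches the query row for that later year.
    right = df[['name', 'year', col]].copy()
    right['year'] = right['year'] + k
    return df[['name', 'year']].merge(right, on=['name', 'year'], how='left')[col]

df['val1_lag2'] = shifted_by_years(df, 'val1', 2)    # two-year lag, NaN if that year is missing
df['val1_lead1'] = shifted_by_years(df, 'val1', -1)  # same as val1_lead above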

Filtering outliers before using group by

I have a dataframe with a price column (P), and it contains some undesired values like 0, 1.50, 92.80, and 0.80. Before I calculate the mean of the price by product code, I would like to remove these outliers.
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
7 100 2017 1 28 2.0 92.80
8 100 2017 2 1 0.0 0.00
9 100 2017 2 7 2.0 1.50
10 100 2017 2 8 5.0 0.80
11 100 2017 2 9 1.0 45.05
12 100 2017 2 11 1.0 1.50
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
16 100 2017 3 30 2.0 1.50
What would be a good way to filter the outliers for each product (grouped by Code)?
I tried this:
stds = 1.0  # Number of standard deviations that defines an 'outlier'.
z = df[['Code', 'P']].groupby('Code').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
df[outliers.any(axis=1)]
And then:
print(df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean())
But the outlier filter doesn't work properly.
IIUC you can use a groupby on Code, do your z-score calculation on P, and filter out rows where the z-score is greater than your threshold:
stds = 1.0
filtered_df = df[~df.groupby('Code')['P'].transform(lambda x: abs((x - x.mean()) / x.std()) > stds)]
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
11 100 2017 2 9 1.0 45.05
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
filtered_df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean()
P
Code Year Month
100 2017 1 44.821429
2 45.050000
3 46.666667
You have the right idea. Just take the Boolean opposite of your outliers['P'] series via ~ and filter your dataframe via loc:
res = df.loc[~outliers['P']]\
        .groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()
print(res)
Code Year Month P
0 100 2017 1 44.821429
1 100 2017 2 45.050000
2 100 2017 3 46.666667
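As a side note, the mask and the aggregation can also be written as one pipeline. A minimal sketch along the lines of the answers above (the variable names are just illustrative):
stds = 1.0

# True where the per-Code z-score of P is within the threshold
within = ~df.groupby('Code')['P'].transform(
    lambda x: ((x - x.mean()) / x.std()).abs() > stds)

monthly_mean = (df[within]
                .groupby(['Code', 'Year', 'Month'], as_index=False)['P']
                .mean())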

Change rolling window size as it rolls

I have a pandas data frame like this;
>df
leg speed
1 10
1 11
1 12
1 13
1 12
1 15
1 19
1 12
2 10
2 10
2 12
2 15
2 19
2 11
: :
I want to make a new column roll_speed that takes a rolling average of the speed over the last 5 positions. But I want to put more detailed conditions on it:
Group by leg (it shouldn't take into account the speed of rows in a different leg).
I want the rolling window to grow from 1 up to a maximum of 5 according to the available rows. For example, in leg == 1 the first row has only one row to calculate with, so the rolling speed should be 10/1 = 10. For the second row there are only two rows available for the calculation, so the rolling speed should be (10+11)/2 = 10.5.
leg speed roll_speed
1 10 10 # 10/1
1 11 10.5 # (10+11)/2
1 12 11 # (10+11+12)/3
1 13 11.5 # (10+11+12+13)/4
1 12 11.6 # (10+11+12+13+12)/5
1 15 12.6 # (11+12+13+12+15)/5
1 19 14.2 # (12+13+12+15+19)/5
1 12 14.2 # (13+12+15+19+12)/5
2 10 10 # 10/1
2 10 10 # (10+10)/2
2 12 10.7 # (10+10+12)/3
2 15 11.8 # (10+10+12+15)/4
2 19 13.2 # (10+10+12+15+19)/5
2 11 13.4 # (10+12+15+19+11)/5
: :
My attempt:
df['roll_speed'] = df.speed.rolling(5).mean()
But it just returns NaN for rows where fewer than five rows are available for the calculation. How should I solve this problem? Thank you for any help!
Set the parameter min_periods to 1:
df['roll_speed'] = df.groupby('leg').speed.rolling(5, min_periods=1).mean()\
                     .round(1).reset_index(drop=True)
leg speed roll_speed
0 1 10 10.0
1 1 11 10.5
2 1 12 11.0
3 1 13 11.5
4 1 12 11.6
5 1 15 12.6
6 1 19 14.2
7 1 12 14.2
8 2 10 10.0
9 2 10 10.0
10 2 12 10.7
11 2 15 11.8
12 2 19 13.2
13 2 11 13.4
Using rolling(5) will get you your results for all but the first 4 occurrences of each group. We can fill the remaining values with the expanding mean:
(df.groupby('leg').speed.rolling(5)
   .mean().fillna(df.groupby('leg').speed.expanding().mean())
).reset_index(drop=True)
0 10.000000
1 10.500000
2 11.000000
3 11.500000
4 11.600000
5 12.600000
6 14.200000
7 14.200000
8 10.000000
9 10.000000
10 10.666667
11 11.750000
12 13.200000
13 13.400000
Name: speed, dtype: float64
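Another way to get the same numbers while keeping the result aligned to the original index (so nothing depends on row order after a reset_index) is groupby plus transform. A minimal sketch, assuming the same df as above:
# transform returns a Series aligned to df's original index,
# so it can be assigned back directly
df['roll_speed'] = (df.groupby('leg')['speed']
                      .transform(lambda s: s.rolling(5, min_periods=1).mean())
                      .round(1))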

Cumulative Sum using 2 columns

I am trying to create a column that does a cumulative sum using 2 columns. Please see the example below of what I am trying to do:
index lodgement_year words sum cum_sum
0 2000 the 14 14
1 2000 australia 10 10
2 2000 word 12 12
3 2000 brand 8 8
4 2000 fresh 5 5
5 2001 the 8 22
6 2001 australia 3 13
7 2001 banana 1 1
8 2001 brand 7 15
9 2001 fresh 1 6
I have used the code below, however my computer keeps crashing, and I am unsure if it is the code or the computer. Any help would be greatly appreciated:
df_2['cumsum'] = df_2.groupby('lodgement_year')['words'].transform(pd.Series.cumsum)
Update: I have also used the code below; it worked and exited with code 0, but with some warnings.
df_2['cum_sum'] = df_2.groupby(['words'])['count'].cumsum()
You are almost there, Ian!
The cumsum() method calculates the cumulative sum of a pandas column. You are looking for that applied to the grouped words. Therefore:
In [303]: df_2['cumsum'] = df_2.groupby(['words'])['sum'].cumsum()
In [304]: df_2
Out[304]:
index lodgement_year words sum cum_sum cumsum
0 0 2000 the 14 14 14
1 1 2000 australia 10 10 10
2 2 2000 word 12 12 12
3 3 2000 brand 8 8 8
4 4 2000 fresh 5 5 5
5 5 2001 the 8 22 22
6 6 2001 australia 3 13 13
7 7 2001 banana 1 1 1
8 8 2001 brand 7 15 15
9 9 2001 fresh 1 6 6
Please comment if this fails on your bigger data set, and we'll work on a possibly more accurate version of this.
If we only need to consider the column 'words', we might loop through the unique values of the words:
for unique_words in df_2.words.unique():
    if 'cum_sum' not in df_2:
        df_2['cum_sum'] = df_2.loc[df_2['words'] == unique_words]['sum'].cumsum()
    else:
        df_2.update(pd.DataFrame({'cum_sum': df_2.loc[df_2['words'] == unique_words]['sum'].cumsum()}))
The above will result in:
>>> print(df_2)
lodgement_year sum words cum_sum
0 2000 14 the 14.0
1 2000 10 australia 10.0
2 2000 12 word 12.0
3 2000 8 brand 8.0
4 2000 5 fresh 5.0
5 2001 8 the 22.0
6 2001 3 australia 13.0
7 2001 1 banana 1.0
8 2001 7 brand 15.0
9 2001 1 fresh 6.0

Call a Nan Value and change to a number in python

I have a DataFrame, say df, which looks like this:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 NaN
Now, I need the pro column to have the same value as the property_type column, whenever the property_type1 column has a NaN value. This is how it should be:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 29
That is, in line 11, where property_type1 is NaN, the value of the pro column becomes 29, which is the value of property_type. How can I do this?
ix is deprecated, don't use it.
Option 1
I'd do this with np.where -
import numpy as np

df = df.assign(pro=np.where(df.pro.isnull(), df.property_type, df.pro))
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Option 2
If you want to perform in-place assignment, use loc -
m = df.pro.isnull()
df.loc[m, 'pro'] = df.loc[m, 'property_type']
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Compute the mask just once, and use it to index multiple times, which should be more efficient than computing it twice.
Find the rows where the property_type1 column is NaN, and for those rows assign the property_type values to the pro column. Use .loc for this, since .ix has since been removed from pandas:
df.loc[df.property_type1.isnull(), 'pro'] = df.loc[df.property_type1.isnull(), 'property_type']
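For reference, two equivalent one-liners are sketched below; both use standard pandas API and differ only in which column's NaN drives the fill (the answers above key off either pro or property_type1):
# fill pro wherever pro itself is NaN
df['pro'] = df['pro'].fillna(df['property_type'])

# or: overwrite pro wherever property_type1 is NaN
df['pro'] = df['pro'].mask(df['property_type1'].isna(), df['property_type'])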
