I am trying to generate a dataset where each day in a given year range has a fixed number of stores. In turn, each store sells a fixed number of products. The products specific to each store and day have a value for sales (£) and number of products sold.
However, running these for loops takes a while to create the dataset.
Is there any way I can improve the efficiency of my code?
# Generate one row Dataframes (for concatenation) for each product, in each store, on each date
dataframes = []
for d in datelist:
    for s in store_IDs:
        for p in product_IDs:
            products_sold = random.randint(1, 101)
            sales = random.randint(100, 1001)
            data_dict = {'Date': [d], 'Store ID': [s], 'Product ID': [p], 'Sales': [sales], 'Number of Products Sold': [products_sold]}
            dataframe = pd.DataFrame(data_dict)
            dataframes.append(dataframe)
test_dataframe = pd.concat(dataframes)
The main reason your code is so slow right now is that the dataframe construction is buried inside your triple loop, which is not necessary. You are creating a new dataframe on every iteration. It is much more efficient to build all of the data in some format that pandas can ingest and then create the dataframe once.
For the structure that you have, the easiest mod is to make a list of the data rows, append a new dictionary to that list for each row as you are constructing now, and then make a df from the list of dictionaries... pandas knows how to do that. I also removed the list brackets around the values in your dictionary; they aren't necessary.
import pandas as pd
import random
datelist = [1, 2, 4, 55]
store_IDs = ['6A', '27B', '12C']
product_IDs = ['soap', 'gum']
data = [] # I just renamed this for clarity
for d in datelist:
    for s in store_IDs:
        for p in product_IDs:
            products_sold = random.randint(1, 101)
            sales = random.randint(100, 1001)
            data_dict = {'Date': d, 'Store ID': s, 'Product ID': p, 'Sales': sales, 'Number of Products Sold': products_sold}
            data.append(data_dict)  # this is building a list of dictionaries...
print(data[:3])
df = pd.DataFrame(data)
print(df.head())
Yields:
[{'Date': 1, 'Store ID': '6A', 'Product ID': 'soap', 'Sales': 310, 'Number of Products Sold': 35}, {'Date': 1, 'Store ID': '6A', 'Product ID': 'gum', 'Sales': 149, 'Number of Products Sold': 34}, {'Date': 1, 'Store ID': '27B', 'Product ID': 'soap', 'Sales': 332, 'Number of Products Sold': 60}]
Date Store ID Product ID Sales Number of Products Sold
0 1 6A soap 310 35
1 1 6A gum 149 34
2 1 27B soap 332 60
3 1 27B gum 698 21
4 1 12C soap 658 51
[Finished in 0.6s]
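As an optional further simplification of the same idea (not part of the original answer), itertools.product from the standard library can flatten the triple loop into a single comprehension; this sketch reuses the datelist, store_IDs, and product_IDs defined above:
from itertools import product

data = [{'Date': d, 'Store ID': s, 'Product ID': p,
         'Sales': random.randint(100, 1001),
         'Number of Products Sold': random.randint(1, 101)}
        for d, s, p in product(datelist, store_IDs, product_IDs)]
df = pd.DataFrame(data)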
Do you realize how huge your dataset is?
Its size is approximately 3.5 years in days (1277) multiplied by 99 stores (= 126,423) multiplied by 8999 products = 1,137,680,577 rows.
If you need on average 16 bytes per row (which is already not much), you need at least 17 GB of memory for this!
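As a rough back-of-the-envelope check (the 16 bytes per row is an assumed figure, not a measurement):
rows = 1277 * 99 * 8999          # 1,137,680,577 rows
bytes_needed = rows * 16         # assumed ~16 bytes per row
print(bytes_needed / 2**30)      # roughly 17 GiB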
For this reason, Store IDs and Product IDs should really be just small integers, e.g. indices into a separate table of more descriptive names.
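For example (a minimal sketch, not part of the original answer), the descriptive labels can live in a small lookup table while the big frame stores only int16 positions:
import numpy as np
import pandas as pd

store_IDs = ['6A', '27B', '12C']                        # descriptive labels
store_lookup = pd.Index(store_IDs)                      # position -> label lookup table
store_idx = np.arange(len(store_IDs), dtype=np.int16)   # this is what goes into the big frame
labels = store_lookup[store_idx]                        # recover the labels only when displaying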
The way to gain efficiency is to reduce function calls! E.g. you can use numpy random number generation to generate random values in bulk.
Assuming all numbers involved can fit in 16 bits, here's one solution to your problem (still needing a lot of memory):
import pandas as pd
import numpy as np
def gen_data(datelist, store_IDs, product_IDs):
    date16 = np.arange(len(datelist), dtype=np.int16)
    store16 = np.arange(len(store_IDs), dtype=np.int16)
    product16 = np.arange(len(product_IDs), dtype=np.int16)
    A = np.array(np.meshgrid(date16, store16, product16), dtype=np.int16).reshape(3, -1)
    length = A.shape[1]
    sales = np.random.randint(100, 1001, size=(1, length), dtype=np.int16)
    sold = np.random.randint(1, 101, size=(1, length), dtype=np.int16)
    data = np.concatenate((A, sales, sold), axis=0)
    df = pd.DataFrame(data.T, columns=['Date index', 'Store ID index', 'Product ID index',
                                       'Sales', 'Number of Products Sold'], dtype=np.int16)
    return df
FWIW on my machine I obtain:
Date Store ID Product ID Sales Number of Products Sold
0 0 0 0 127 85
1 0 0 1 292 37
2 0 0 2 180 36
3 0 0 3 558 88
4 0 0 4 519 79
... ... ... ... ... ...
1137680572 1276 98 8994 932 78
1137680573 1276 98 8995 401 47
1137680574 1276 98 8996 840 77
1137680575 1276 98 8997 717 91
1137680576 1276 98 8998 632 24
[1137680577 rows x 5 columns]
real 1m16.325s
user 0m22.086s
sys 0m25.800s
(I don't have enough memory and use swap)
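If the full frame genuinely does not fit in RAM, one possible workaround (a sketch, not part of the original answer; the file path and the choice of chunking by date are assumptions) is to generate and persist one date at a time, so that only a 99 x 8999 chunk is ever held in memory:
import numpy as np
import pandas as pd

def write_in_chunks(path, n_dates=1277, n_stores=99, n_products=8999):
    for d in range(n_dates):
        stores, products = np.meshgrid(np.arange(n_stores, dtype=np.int16),
                                       np.arange(n_products, dtype=np.int16))
        chunk = pd.DataFrame({
            'Date index': np.int16(d),
            'Store ID index': stores.ravel(),
            'Product ID index': products.ravel(),
            'Sales': np.random.randint(100, 1001, stores.size, dtype=np.int16),
            'Number of Products Sold': np.random.randint(1, 101, stores.size, dtype=np.int16),
        })
        # append to one CSV; write the header only for the first chunk
        chunk.to_csv(path, mode='a', header=(d == 0), index=False)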
Related
What I'm looking to do is take a table similar to the one below, identify all rows where Type = 'Fee', and then add each of their totals to the row where the other columns match (i.e. take the Total from a Fee row, find the row where the Week, Store, and ID match, and add the Total to that row). I should note that the matching row where Type is NOT 'Fee' will be unique (only one of them), but there may be multiple fees that we want to group into it. As a single-row example, the third row in the table below has the following:
Week = 15
Store = US1
ID = T3400
Total = 13
What I would look to do is find the row that matches those criteria, and add the sum. In this case, that would be row 1.
Within this data there will be multiple Type = 'Fee' rows that I want to collapse into this one row; the thing I am struggling with is preserving the non-Fee Type unchanged.
I've given what the expected output would be below. In the expected output:
Row 1 Total = 1098 = 200 (starting) + 13 (row 3 from input) + 885 (row 8 from input)
Row 2 Total = 287 = 189 (starting) + 98 (row 5 from input)
Row 3 Total = 15 (Did not change from input as there were no Fee rows where the ID matched)
Row 4 Total = 581 = 146 (starting) + 435 (row 6 from input)
Row 5 Total = 189 (Did not change because even though the Store and ID match, it is from a different week)
As you can see, it will find the rows with Fee, match the other 3 columns, sum the totals, and afterwards there are no more 'Fee' rows left in the dataset. Obviously this is only a small snippet of the data; in total there will be about 20,000 rows to go through.
Input:
Week  Store  Type  ID     Total
15    US1    RE-G  T3400  200
15    US4    TO    T656   189
15    US1    Fee   T3400  13
16    US4    RD    T173   15
15    US4    Fee   T656   98
16    US4    Fee   T1121  435
17    US4    TO    T656   189
15    US1    Fee   T3400  885
16    US4    MX    T1121  146
Expected output:
Week  Store  Type  ID     Total
15    US1    RE-G  T3400  1098
15    US4    TO    T656   287
16    US4    RD    T173   15
16    US4    MX    T1121  581
17    US4    TO    T656   189
It looks like you want to groupby week, store, and ID, and get the sum total. You can also use first on Type after replacing Fee with nulls to get the correct type.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Week': [15, 15, 15, 16, 15, 16, 17, 15, 16],
'Store': ['US1', 'US4', 'US1', 'US4', 'US4', 'US4', 'US4', 'US1', 'US4'],
'Type': ['RE-G', 'TO', 'Fee', 'RD', 'Fee', 'Fee', 'TO', 'Fee', 'MX'],
'ID': ['T3400',
'T656',
'T3400',
'T173',
'T656',
'T1121',
'T656',
'T3400',
'T1121'],
'Total': [200, 189, 13, 15, 98, 435, 189, 885, 146]})
df['Type'].replace('Fee', np.nan, inplace=True)
df = df.groupby(['Week','Store', 'ID'], as_index=False).agg({'Type':'first', 'Total':sum})
print(df)
Output
Week Store ID Type Total
0 15 US1 T3400 RE-G 1098
1 15 US4 T656 TO 287
2 16 US4 T1121 MX 581
3 16 US4 T173 RD 15
4 17 US4 T656 TO 189
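A quick sanity check one could add (not part of the original answer): because the Fee rows are only folded into existing rows, the grand total should be unchanged by the collapse.
# 200+189+13+15+98+435+189+885+146 == 1098+287+581+15+189 == 2170
assert df['Total'].sum() == 2170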
I am trying to merge two dataframes so that each instance of an item ID in DF3 displays the pricing data associated with the matching ID from DF1.
DF3 (what I am trying to accomplish)
(the buy_price, sell_price, buy_quantity, and sell_quantity columns repeat after itemID_out and after each id_#_in; rows are comma-separated)
recipeID, itemID_out, qty_out, buy_price, sell_price, buy_quantity, sell_quantity, id_1_in, qty_id1, buy_price, sell_price, buy_quantity, sell_quantity, id_2_in, qty_id2, buy_price, sell_price, buy_quantity, sell_quantity, id_3_in, qty_id3, buy_price, sell_price, buy_quantity, sell_quantity, id_4_in, qty_id4, buy_price, sell_price, buy_quantity, sell_quantity, id_5_in, qty_id5, buy_price, sell_price, buy_quantity, sell_quantity
1, 1986, 1, 129, 167, 67267, 21637, 123, 1, 10, 15, 1500, 3000, 124, 1, 12, 14, 550, 800, 125, 1, 8, 12, 124, 254, 126, 1, 22, 25, 1251, 890, 127, 1, 64, 72, 12783, 1251515
2, 1987, 1, 1521, 1675, 654, 1245, 123, 2, 10, 15, 1500, 3000
3, 1988, 1, 128376, 131429, 47, 23, 123, 10, 10, 15, 1500, 3000, 124, 3, 12, 14, 550, 800
These are the two dataframes I am trying to merge from.
DF1: Contains 26863 rows; master list of item names, IDs, and price data. Pulled from API, new items can be added and will appear as new rows after an update request from the user.
itemID  name  buy_price  sell_price  buy_quantity  sell_quantity
1986    XYZ   129        167         67267         21637
123     ABC   10         15          1500          3000
124     DEF   12         14          550           800
DF2 (contains 12784 rows; recipes that combine items from the master list. Pulled from API; new recipes can be added and will appear as new rows after an update request from the user.)
recipeID  itemID_out  qty_out  id_1_in  qty_id1  id_2_in  qty_id2  id_3_in  qty_id3  id_4_in  qty_id4  id_5_in  qty_id5
1         1986        1        123      1        124      1        125      1        126      1        127      1
2         1987        1        123      2
3         1988        1        123      10       124      3
Recipes can contain a combination of 1 to 5 items (null values occur) that consist of IDs from DF1 and/or the itemID_out column in DF2.
The "id_#_in" columns in DF2 can contain item IDs from the "itemID_out" column, due to that recipe using the item that is being output from another recipe.
I have tried to merge it using:
pd.merge(itemlist_modified, recipelist_modified, left_on='itemID', right_on='itemID_out')
But this only ever results in a single column of IDs receiving the pricing data as intended.
I feel like I'm trying to use the wrong function for this, any help would be very much appreciated!
Thanks in advance!
Not a pretty approach, but it first melts the ingredient table into long form and then merges it with the itemlist table:
import pandas as pd
import numpy as np
itemlist_modified = pd.DataFrame({
'itemID': [1986, 123, 124],
'name': ['XYZ', 'ABC', 'DEF'],
'buy_price': [129, 10, 12],
'sell_price': [167, 15, 14],
'buy_quantity': [67267, 1500, 550],
'sell_quantity': [21637, 3000, 800],
})
recipelist_modified = pd.DataFrame({
'RecipeID': [1, 2, 3],
'itemID_out': [1986, 1987, 1988],
'qty_out': [1, 1, 1],
'id_1_in': [123, 123, 123],
'qty_id1': [1, 2, 10],
'id_2_in': [124.0, np.nan, 124.0],
'qty_id2': [1.0, np.nan, 3.0],
'id_3_in': [125.0, np.nan, np.nan],
'qty_id3': [1.0, np.nan, np.nan],
'id_4_in': [126.0, np.nan, np.nan],
'qty_id4': [1.0, np.nan, np.nan],
'id_5_in': [127.0, np.nan, np.nan],
'qty_id5': [1.0, np.nan, np.nan],
})
#columns which are not qty or input id cols
id_vars = ['RecipeID','itemID_out','qty_out']
#prepare dict to map column name to ingredient number
col_renames = {}
col_renames.update({'id_{}_in'.format(i+1):'ingr_{}'.format(i+1) for i in range(5)})
col_renames.update({'qty_id{}'.format(i+1):'ingr_{}'.format(i+1) for i in range(5)})
#melt reciplist into longform
long_recipelist = recipelist_modified.melt(
id_vars=id_vars,
var_name='ingredient',
).dropna()
#add a new column to specify whether each row is a qty or an id
long_recipelist['kind'] = np.where(long_recipelist['ingredient'].str.contains('qty'),'qty_in','id_in')
#convert ingredient names
long_recipelist['ingredient'] = long_recipelist['ingredient'].map(col_renames)
#pivot on the new ingredient column
reshape_recipe_list = long_recipelist.pivot(
index=['RecipeID','itemID_out','qty_out','ingredient'],
columns='kind',
values='value',
).reset_index()
#merge with the itemlist
priced_ingredients = pd.merge(reshape_recipe_list, itemlist_modified, left_on='id_in', right_on='itemID')
#pivot on the priced ingredients
priced_ingredients = priced_ingredients.pivot(
index = ['RecipeID','itemID_out','qty_out'],
columns = 'ingredient',
)
#flatten the hierarchical columns
priced_ingredients.columns = ["_".join(a[::-1]) for a in priced_ingredients.columns.to_flat_index()]
priced_ingredients.columns.name = ''
priced_ingredients = priced_ingredients.reset_index()
priced_ingredients partial output:
I have a dataframe of id numbers (n = 140, but it could be more or less) and I have 5 group leaders. Each group leader needs to be randomly assigned a number of these ids (for ease let's make it even, so n = 28, but I need to be able to control the amounts), and those rows need to be split out into a new df and then dropped from the original dataframe so that there is no crossover between leaders.
import pandas as pd
import numpy as np
#making the df
df = pd.DataFrame()
df['ids'] = np.random.randint(1, 140, size=140)
df['group_leader'] = ''
# list of leader names
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
I can do this for each leader with something like
df.loc[df.sample(n=28).index, 'group_leader'] = 'George'
g = df[df['group_leader'] == 'George'].copy()
df = df[df['group_leader'] != 'George']
print(df.shape[0])  # double checking that df has fewer ids in it
However, doing this individually for each group leader seems really un-pythonic (not that I'm an expert on that) and is not easy to refactor into a function.
I thought that I might be able to do it with a dict and a for loop
frames = dict.fromkeys('group_leaders', pd.DataFrame())
for i in frames.keys():  # allows me to fill the cells with the string key?
    df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
    frames[i].update(df[df['group_leader'] == str(i)].copy())  # also tried append()
    print(frames[i].head())
    df = df[df['group_leader'] != str(i)]
    print(f'df now has {df.shape[0]} ids left')  # just in case there's a remainder of ids
However, the new dataframes are still empty and I get the error:
Traceback (most recent call last):
File "C:\Users\path\to\the\file\file.py", line 38, in <module>
df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
File "C:\Users\path\to\the\file\pandas\core\generic.py", line 5356, in sample
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 909, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
This leads me to believe that I'm doing two things wrong:
Either making the dict incorrectly or updating it incorrectly.
Making the for loop run in such a way that it tries to run 1 too many times.
I have tried to be as clear as possible and present a minimally useful version of what I need, any help would be appreciated.
Note - I'm aware that 5 divides well into 140 and there may be cases where this isn't the case but I'm pretty sure I can handle that myself with if-else if it's needed.
You can use np.repeat and np.random.shuffle:
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, 28)
np.random.shuffle(leaders)
df['group_leader'] = leaders
Output:
>>> df
ids group_leader
0 138 John
1 36 Apu
2 99 John
3 91 George
4 58 Ringo
.. ... ...
135 43 Ringo
136 84 Apu
137 94 John
138 56 Ringo
139 58 Paul
[140 rows x 2 columns]
>>> df.value_counts('group_leader')
group_leader
Apu 28
George 28
John 28
Paul 28
Ringo 28
dtype: int64
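If separate per-leader DataFrames are still needed (as in the original loop-based attempt), a small follow-up sketch using groupby splits the assigned frame in one pass:
frames = {leader: sub_df for leader, sub_df in df.groupby('group_leader')}
print(frames['George'].head())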
Update
df = pd.DataFrame({'ids': np.random.randint(1, 113, size=113)})
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, int(np.ceil(len(df) / len(leaders))))
np.random.shuffle(leaders)
df['group_leader'] = leaders[:len(df)]
Output:
>>> df.value_counts('group_leader')
group_leader
Apu 23
John 23
Ringo 23
George 22
Paul 22
dtype: int64
I am trying to generate a dataframe which is grouped by country and lists top 10 varieties of wine in each country along with their average price and points.
I have successfully grouped by country and wine and generated average values of price and points.
I can generate the top 10 varieties in each country by using value_counts().nlargest(10), but I can't get rid of the remaining varieties in the initial groupby with the averages.
countryGroup = df.groupby(['country', 'variety'])['price','points'].mean().round(2).rename(columns = {'price':'AvgPrice','points':'AvgPoints'})
countryVariety = df.groupby('country')['variety']
countryVariety = countryVariety.apply(lambda x:x.value_counts().nlargest(10))
data link
The actual result is a list of the top 10 varieties in each country,
but what I need along with this is the average price and points.
Here's some sample data. For these problems, where a large quantity of data is required, it's useful to generate random test data, which can be done in a few lines:
import pandas as pd
import numpy as np
import string
np.random.seed(123)
n = 1000
df = pd.DataFrame({'country': np.random.choice(list('AB'), n),
'variety': np.random.choice(list(string.ascii_lowercase), n),
'price': np.random.normal(100, 10, n),
'points': np.random.choice(100, n)})
One way to solve this is to groupby twice. The first allows us to calculate the quantities for each country-variety group. The second keeps the top 10 per country (based on size) with .sort_values + tail
df_agg = (df.groupby(['country', 'variety']).agg({'variety': 'size', 'price': 'mean', 'points': 'mean'})
.rename(columns={'variety': 'size'}))
df_agg = df_agg.sort_values('size').groupby(level=0).tail(10).sort_index()
Output:
size price points
country variety
A c 19 98.606563 45.842105
e 19 102.264391 48.894737
l 23 96.469739 52.913043
n 27 99.532544 55.740741
p 20 98.298753 49.700000
q 21 98.660938 60.666667
u 26 101.330755 63.615385
x 20 102.540790 48.550000
y 23 99.553557 49.869565
z 27 99.968973 44.259259
B b 25 99.375984 56.360000
c 22 100.632402 56.181818
e 25 99.476491 49.520000
k 22 96.991041 40.090909
p 24 99.802004 51.333333
q 26 99.022372 53.884615
u 22 103.063360 49.090909
v 24 101.907610 53.250000
x 22 94.607472 49.227273
z 23 98.984382 44.739130
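If the asker's original column names (AvgPrice, AvgPoints) are wanted, the same two-step idea can be spelled with named aggregation (a sketch, assuming pandas 0.25 or newer):
df_agg = df.groupby(['country', 'variety']).agg(size=('variety', 'size'),
                                                AvgPrice=('price', 'mean'),
                                                AvgPoints=('points', 'mean'))
top10 = df_agg.sort_values('size').groupby(level=0).tail(10).sort_index()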
I have the following dataframe:
date, industry, symbol, roc
25-02-2015, Health, abc, 200
25-02-2015, Health, xyz, 150
25-02-2015, Mining, tyr, 45
25-02-2015, Mining, ujk, 70
26-02-2015, Health, abc, 60
26-02-2015, Health, xyz, 310
26-02-2015, Mining, tyr, 65
26-02-2015, Mining, ujk, 23
I need to determine the average 'roc', max 'roc', min 'roc' as well as how many symbols exist for each date+industry. In other words I need to groupby date and industry, and then determine various averages, max/min etc.
So far I am doing the following, which is working but seems to be very slow and inefficient:
sector_df = primary_df.groupby(['date', 'industry'], sort=True).mean()
tmp_max_df = primary_df.groupby(['date', 'industry'], sort=True).max()
tmp_min_df = primary_df.groupby(['date', 'industry'], sort=True).min()
tmp_count_df = primary_df.groupby(['date', 'industry'], sort=True).count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
The above code works, resulting in a dataframe indexed by date+industry, showing me what was the min/max 'roc' for each date+industry, as well as how many symbols existed for each date+industry.
I am basically doing a complete groupby multiple times (to determine the mean, max, min, count of the 'roc'). This is very slow because it's doing the same thing over and over.
Is there a way to just do the group by once. Then perform the mean, max etc on that object and assign the result to the sector_df?
You want to perform an aggregate using agg:
In [72]:
df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
Out[72]:
roc
mean max min count
date industry
2015-02-25 Health 175.0 200 150 2
Mining 57.5 70 45 2
2015-02-26 Health 185.0 310 60 2
Mining 44.0 65 23 2
This allows you to pass an iterable (a list in this case) of functions to perform.
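For reference, the same aggregation can usually be written with the string names of the functions, selecting the roc column explicitly so the non-numeric symbol column is left out (a minimal sketch):
roc_stats = df.groupby(['date', 'industry'])['roc'].agg(['mean', 'max', 'min', 'count'])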
EDIT
To access individual results you need to pass a tuple for each axis:
In [78]:
gp.loc[('2015-02-25','Health'),('roc','mean')]
Out[78]:
175.0
Where gp = df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
You can just save the groupby part to a variable as shown below:
primary_df = pd.DataFrame([['25-02-2015', 'Health', 'abc', 200],
['25-02-2015', 'Health', 'xyz', 150],
['25-02-2015', 'Mining', 'tyr', 45],
['25-02-2015', 'Mining', 'ujk', 70],
['26-02-2015', 'Health', 'abc', 60],
['26-02-2015', 'Health', 'xyz', 310],
['26-02-2015', 'Mining', 'tyr', 65],
['26-02-2015', 'Mining', 'ujk', 23]],
columns='date industry symbol roc'.split())
grouped = primary_df.groupby(['date', 'industry'], sort=True)
sector_df = grouped.mean()
tmp_max_df = grouped.max()
tmp_min_df = grouped.min()
tmp_count_df = grouped.count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
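Building on the saved groupby, all four statistics can also be computed in a single pass with agg (a sketch; the renames just reproduce the column names used above):
sector_df = grouped['roc'].agg(['mean', 'max', 'min', 'count']).rename(
    columns={'mean': 'roc', 'max': 'max_roc', 'min': 'min_roc'})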