making and updating multiple pandas dataframes using dicts (avoiding repetitive code) - python

I have a dataframe of id numbers (n = 140, but it could be more or less) and I have 5 group leaders. Each group leader needs to be randomly assigned a number of these ids (for ease let's make it even, so n=28 each, but I need to be able to control the amounts), and those rows need to be split out into a new df and then dropped from the original dataframe so that there is no crossover between leaders.
import pandas as pd
import numpy as np
#making the df
df = pd.DataFrame()
df['ids'] = np.random.randint(1, 140, size=140)
df['group_leader'] = ''
# list of leader names
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
I can do this for each leader with something like
df.loc[df.sample(n=28).index, 'group_leader'] = 'George'
g = df[df['group_leader']=='George'].copy()
df = df[df['group_leader'] != 'George']
print(df.shape[0])  # double-checking that df has fewer ids in it
However, doing this individually for each group leader seems really un-pythonic (not that I'm an expert on that) and is not easy to refactor into a function.
I thought that I might be able to do it with a dict and a for loop
frames = dict.fromkeys('group_leaders', pd.DataFrame())
for i in frames.keys(): #allows me to fill the cells with the string key?
    df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
    frames[i].update(df[df['group_leader'] == str(i)].copy())  # also tried append()
    print(frames[i].head())
    df = df[df['group_leader'] != str(i)]
    print(f'df now has {df.shape[0]} ids left')  # just in case there's a remainder of ids
However, the new dataframes are still empty and I get the error:
Traceback (most recent call last):
  File "C:\Users\path\to\the\file\file.py", line 38, in <module>
    df.loc[df.sample(n=28).index, 'group_leader'] = str(i)
  File "C:\Users\path\to\the\file\pandas\core\generic.py", line 5356, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 909, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
This leads me to believe that I'm doing two things wrong:
Either making the dict incorrectly or updating it incorrectly.
Making the for loop run in such a way that it tries to run one too many times.
I have tried to be as clear as possible and present a minimally useful version of what I need, any help would be appreciated.
Note - I'm aware that 5 divides evenly into 140 and that there may be cases where it doesn't, but I'm pretty sure I can handle that myself with if-else if it's needed.

You can use np.repeat and np.random.shuffle:
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, 28)
np.random.shuffle(leaders)
df['group_leader'] = leaders
Output:
>>> df
ids group_leader
0 138 John
1 36 Apu
2 99 John
3 91 George
4 58 Ringo
.. ... ...
135 43 Ringo
136 84 Apu
137 94 John
138 56 Ringo
139 58 Paul
[140 rows x 2 columns]
>>> df.value_counts('group_leader')
group_leader
Apu 28
George 28
John 28
Paul 28
Ringo 28
dtype: int64
Update
df = pd.DataFrame({'ids': np.random.randint(1, 113, size=113)})
leaders = ['John', 'Paul', 'George', 'Ringo', 'Apu']
leaders = np.repeat(leaders, np.ceil(len(df) / len(leaders)))
np.random.shuffle(leaders)
df['group_leader'] = leaders[:len(df)]
Output:
>>> df.value_counts('group_leader')
group_leader
Apu 23
John 23
Ringo 23
George 22
Paul 22
dtype: int64
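If you still need one dataframe per leader (as in the original dict idea), you can split the labelled frame afterwards with groupby; a minimal sketch building on the df above (not part of the original answer):
# One DataFrame per leader, keyed by name; groupby guarantees the groups don't overlap
frames = {leader: group.copy() for leader, group in df.groupby('group_leader')}
print(frames['George'].shape)  # e.g. (28, 2)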

Related

Creating new DF column based on average values from specific columns identified in second DF

I apologize as I prefer to ask questions where I've made an attempt at the code needed to resolve the issue. Here, despite many attempts, I haven't gotten any closer to a resolution (in part because I'm a hobbyist and self-taught). I'm attempting to use two dataframes together to calculate the average values in a specific column, then generate a new column to store that average.
I have two dataframes. The first contains the players and their stats. The second contains a list of each player's opponents during the season.
What I'm attempting to do is use the two dataframes to calculate expected values when facing a specific opponent. Stated otherwise, I'd like to be able to see if a player is performing better or worse than the expected results based on the opponent but first need to calculate the average of their opponents.
My dataframes actually have thousands of players and hundreds of matchups, so I've shortened them here to have a representative dataframe that isn't overwhelming.
The first dataframe (df) contains five columns: Name, STAT1, STAT2, STAT3, and STAT4.
The second dataframe (df_Schedule) has a Name column and then a separate column for each opponent faced. df_Schedule usually contains a different number of columns depending on the week of the season. For example, after week 1 there may be four columns; after week 26 there might be 100 columns. For simplicity's sake, I've included the Name column plus five opponent columns: ['Name', 'Opp1', 'Opp2', 'Opp3', 'Opp4', 'Opp5'].
Using these two dataframes I'm trying to create new columns in the first dataframe (df). EXP1 (for "Expected STAT1"), EXP2, EXP3, EXP4. The expected columns are simply an average of the STAT columns based on the opponents faced during the season. For example, Edgar faced Ralph three times, Marc once and David once. The formula to calculate Edgar's EXP1 is simply:
((Ralph.STAT1 * 3) + (Marc.STAT1 * 1) + (David.STAT1 * 1)) / Number_of_Contests (which is five in this example) = 100.2
import pandas as pd

data = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
        'STAT1': [100, 96, 110, 103],
        'STAT2': [116, 93, 85, 100],
        'STAT3': [56, 59, 41, 83],
        'STAT4': [55, 96, 113, 40]}
data2 = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
         'Opp1': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp2': ['Ralph', 'Edgar', 'David', 'Marc'],
         'Opp3': ['Marc', 'David', 'Edgar', 'Ralph'],
         'Opp4': ['David', 'Marc', 'Ralph', 'Edgar'],
         'Opp5': ['Ralph', 'Edgar', 'David', 'Marc']}
df = pd.DataFrame(data)
df_Schedule = pd.DataFrame(data2)
print(df)
print(df_Schedule)
I would like the result to be something like:
data_Final = {'Name': ['Edgar', 'Ralph', 'Marc', 'David'],
              'STAT1': [100, 96, 110, 103],
              'STAT2': [116, 93, 85, 100],
              'STAT3': [56, 59, 41, 83],
              'STAT4': [55, 96, 113, 40],
              'EXP1': [100.2, 102.6, 101, 105.2],
              'EXP2': [92.8, 106.6, 101.8, 92.8],
              'EXP3': [60.2, 58.4, 72.8, 47.6],
              'EXP4': [88.2, 63.6, 54.2, 98]}
df_Final = pd.DataFrame(data_Final)
print(df_Final)
Is there a way to use the scheduling dataframe to lookup the values of opponents, average them, and then create a new column based on those averages?
Try:
df = df.set_index("Name")
df_Schedule = df_Schedule.set_index("Name")
for i, c in enumerate(df.filter(like="STAT"), 1):
df[f"EXP{i}"] = df_Schedule.replace(df[c]).mean(axis=1)
print(df.reset_index())
Prints:
Name STAT1 STAT2 STAT3 STAT4 EXP1 EXP2 EXP3 EXP4
0 Edgar 100 116 56 55 100.2 92.8 60.2 88.2
1 Ralph 96 93 59 96 102.6 106.6 58.4 63.6
2 Marc 110 85 41 113 101.0 101.8 72.8 54.2
3 David 103 100 83 40 105.2 92.8 47.6 98.0
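For what it's worth, the same lookup can also be written with stack and map, which some may find more explicit than replace; a sketch under the same assumption that df and df_Schedule are already indexed by Name as above:
for i, c in enumerate(df.filter(like="STAT"), 1):
    # Long format: (Name, OppN) -> opponent name, then look up that opponent's stat
    opponent_stats = df_Schedule.stack().map(df[c])
    # Average over each player's opponents and align back on the Name index
    df[f"EXP{i}"] = opponent_stats.groupby(level=0).mean()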

Pandas DataFrame pivot (reshape?)

I can't seem to get this right... here's what I'm trying to do:
import pandas as pd

df = pd.DataFrame({
    'item_id': [1, 1, 3, 3, 3],
    'contributor_id': [1, 2, 1, 4, 5],
    'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
    'metric_1': [80, 90, 100, 92, 50],
    'metric_2': [180, 190, 200, 192, 150]
})
--->
item_id contributor_id contributor_role metric_1 metric_2
0 1 1 sing 80 180
1 1 2 laugh 90 190
2 3 1 laugh 100 200
3 3 4 sing 92 192
4 3 5 sing 50 150
And I want to reshape it into:
item_id SING_1_contributor_id SING_1_metric_1 SING_1_metric_2 SING_2_contributor_id SING_2_metric_1 SING_2_metric_2 ... LAUGH_1_contributor_id LAUGH_1_metric_1 LAUGH_1_metric_2 ... <LAUGH_2_...>
0 1 1 80 180 N/A N/A N/A ... 2 90 190 ... N/A..
1 3 4 92 192 5 50 150 ... 1 100 200 ... N/A..
Basically, for each item_id, I want to collect all relevant data into a single row. Each item could have multiple types of contributors, and there is a max for each type (e.g. max SING contributors = A per item, max LAUGH contributors = B per item). There is a set of metrics tied to each contributor (but for the same contributor, the values could be different across different items / contributor types).
I can probably achieve this through some seemingly inefficient methods (e.g. looping and matching then populating a template df), but I was wondering if there is a more efficient way to achieve this, potentially through cleverly specifying the index / values / columns in the pivot operation (or any other method..).
Thanks in advance for any suggestions!
EDIT:
Ended up adapting Ben's script below into the following:
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df.apply(lambda row: row['contributor_role'] + '_' + row['role_count'], axis=1)
df = df.set_index(['item_id','contributor_role']).unstack()
df.columns = ['_'.join(x) for x in df.columns.values]
You can create the additional key with cumcount, then do unstack:
df['newkey']=df.groupby('item_id').cumcount().add(1).astype(str)
df['contributor_id']=df['contributor_id'].astype(str)
s = df.set_index(['item_id','newkey']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('_'.join)
s
Out[38]:
contributor_id_1 contributor_role_1 ... metric_1_3 metric_2_3
item_id ...
1 1 sing ... NaN NaN
3 1 messaround ... 50.0 150.0
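For reference, a self-contained sketch of the asker's adaptation (using the toy df from the question), with the columns flipped into the SING_1_contributor_id style originally described; the column-naming line is my own addition, not from the answer:
import pandas as pd

df = pd.DataFrame({
    'item_id': [1, 1, 3, 3, 3],
    'contributor_id': [1, 2, 1, 4, 5],
    'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
    'metric_1': [80, 90, 100, 92, 50],
    'metric_2': [180, 190, 200, 192, 150]
})
# Number each contributor within its (item, role) group: sing_1, sing_2, laugh_1, ...
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df['contributor_role'] + '_' + df['role_count']
df = df.drop(columns='role_count')
# Pivot each item_id onto a single row, then flatten the MultiIndex columns
wide = df.set_index(['item_id', 'contributor_role']).unstack()
wide.columns = [f"{role.upper()}_{field}" for field, role in wide.columns.values]
print(wide.reset_index())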

Improving Code Efficiency in Dataset Generation

I am trying to generate a dataset where each day in a given year range has a fixed number of stores. In turn, each store sells a fixed number of products. The products specific to each store and day have a value for sales (£) and number of products sold.
However, running these for loops takes a while to create the dataset.
Is there any way I can improve the efficiency of my code?
# Generate one-row DataFrames (for concatenation) for each product, in each store, on each date
dataframes = []
for d in datelist:
    for s in store_IDs:
        for p in product_IDs:
            products_sold = random.randint(1, 101)
            sales = random.randint(100, 1001)
            data_dict = {'Date': [d], 'Store ID': [s], 'Product ID': [p], 'Sales': [sales], 'Number of Products Sold': [products_sold]}
            dataframe = pd.DataFrame(data_dict)
            dataframes.append(dataframe)
test_dataframe = pd.concat(dataframes)
The main reason your code is really slow now is that you have the dataframe construction buried inside of your triple loop. This is not necessary. Right now, you are creating a new dataframe inside of each loop. It is much more efficient to create all of the data in some type of format that pandas can ingest and then create the dataframe once.
For the structure that you have, the easiest mod is to make a list of the data rows, append a new dictionary to that list for each row as you are constructing now, and then make a df from the list of dictionaries once... pandas knows how to do that. I also removed the list brackets around the items you had in your dictionary; they aren't necessary.
import pandas as pd
import random

datelist = [1, 2, 4, 55]
store_IDs = ['6A', '27B', '12C']
product_IDs = ['soap', 'gum']

data = []  # I just renamed this for clarity
for d in datelist:
    for s in store_IDs:
        for p in product_IDs:
            products_sold = random.randint(1, 101)
            sales = random.randint(100, 1001)
            data_dict = {'Date': d, 'Store ID': s, 'Product ID': p, 'Sales': sales, 'Number of Products Sold': products_sold}
            data.append(data_dict)  # this is building a list of dictionaries...
print(data[:3])
df = pd.DataFrame(data)
print(df.head())
Yields:
[{'Date': 1, 'Store ID': '6A', 'Product ID': 'soap', 'Sales': 310, 'Number of Products Sold': 35}, {'Date': 1, 'Store ID': '6A', 'Product ID': 'gum', 'Sales': 149, 'Number of Products Sold': 34}, {'Date': 1, 'Store ID': '27B', 'Product ID': 'soap', 'Sales': 332, 'Number of Products Sold': 60}]
Date Store ID Product ID Sales Number of Products Sold
0 1 6A soap 310 35
1 1 6A gum 149 34
2 1 27B soap 332 60
3 1 27B gum 698 21
4 1 12C soap 658 51
[Finished in 0.6s]
Do you realize your sizes are huge?
Size is approximately 3 and a half years (in days) = 1277, multiplied by 99 stores = 126,423, multiplied by 8999 products = 1,137,680,577 rows.
If you need on average 16 bytes per row (which is already not much), you need at least 17 GB of memory for this!
For this reason, Store_IDs and Product_IDs should really be just small integers, like indices into a table of more descriptive names.
The way to gain efficiency is to reduce function calls! E.g. you can use numpy random number generation to generate random values in bulk.
Assuming all numbers involved can fit in 16 bits, here's one solution to your problem (still needing a lot of memory):
import pandas as pd
import numpy as np

def gen_data(datelist, store_IDs, product_IDs):
    date16 = np.arange(len(datelist), dtype=np.int16)
    store16 = np.arange(len(store_IDs), dtype=np.int16)
    product16 = np.arange(len(product_IDs), dtype=np.int16)
    A = np.array(np.meshgrid(date16, store16, product16), dtype=np.int16).reshape(3, -1)
    length = A.shape[1]
    sales = np.random.randint(100, 1001, size=(1, length), dtype=np.int16)
    sold = np.random.randint(1, 101, size=(1, length), dtype=np.int16)
    data = np.concatenate((A, sales, sold), axis=0)
    df = pd.DataFrame(data.T, columns=['Date index', 'Store ID index', 'Product ID index', 'Sales', 'Number of Products Sold'], dtype=np.int16)
    return df
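A quick usage sketch for the function above (these small inputs are hypothetical placeholders, not the question's full-size lists of ~1277 dates, 99 stores and 8999 products; the full-size run is what produces the timing shown below):
# Hypothetical small inputs just to exercise the function
datelist = pd.date_range('2020-01-01', periods=10)
store_IDs = ['6A', '27B', '12C']
product_IDs = ['soap', 'gum', 'towels']
df = gen_data(datelist, store_IDs, product_IDs)
print(df.shape)  # (10 * 3 * 3, 5) == (90, 5)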
FWIW on my machine I obtain:
Date Store ID Product ID Sales Number of Products Sold
0 0 0 0 127 85
1 0 0 1 292 37
2 0 0 2 180 36
3 0 0 3 558 88
4 0 0 4 519 79
... ... ... ... ... ...
1137680572 1276 98 8994 932 78
1137680573 1276 98 8995 401 47
1137680574 1276 98 8996 840 77
1137680575 1276 98 8997 717 91
1137680576 1276 98 8998 632 24
[1137680577 rows x 5 columns]
real 1m16.325s
user 0m22.086s
sys 0m25.800s
(I don't have enough memory and use swap)
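As a side note (my own sketch, not from either answer), pandas can also build the full cartesian product directly with pd.MultiIndex.from_product, which avoids the meshgrid reshaping; datelist, store_IDs and product_IDs are the question's inputs:
import numpy as np
import pandas as pd

# Still ~1.1 billion rows at the question's full sizes, so the memory caveat above applies
idx = pd.MultiIndex.from_product([datelist, store_IDs, product_IDs],
                                 names=['Date', 'Store ID', 'Product ID'])
out = pd.DataFrame(index=idx).reset_index()
out['Sales'] = np.random.randint(100, 1001, size=len(out))
out['Number of Products Sold'] = np.random.randint(1, 101, size=len(out))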

Assigning result of pandas groupby

I have the following dataframe:
date, industry, symbol, roc
25-02-2015, Health, abc, 200
25-02-2015, Health, xyz, 150
25-02-2015, Mining, tyr, 45
25-02-2015, Mining, ujk, 70
26-02-2015, Health, abc, 60
26-02-2015, Health, xyz, 310
26-02-2015, Mining, tyr, 65
26-02-2015, Mining, ujk, 23
I need to determine the average 'roc', max 'roc', min 'roc' as well as how many symbols exist for each date+industry. In other words I need to groupby date and industry, and then determine various averages, max/min etc.
So far I am doing the following, which is working but seems to be very slow and inefficient:
sector_df = primary_df.groupby(['date', 'industry'], sort=True).mean()
tmp_max_df = primary_df.groupby(['date', 'industry'], sort=True).max()
tmp_min_df = primary_df.groupby(['date', 'industry'], sort=True).min()
tmp_count_df = primary_df.groupby(['date', 'industry'], sort=True).count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
The above code works, resulting in a dataframe indexed by date+industry, showing me what was the min/max 'roc' for each date+industry, as well as how many symbols existed for each date+industry.
I am basically doing a complete groupby multiple times (to determine the mean, max, min, count of the 'roc'). This is very slow because it's doing the same thing over and over.
Is there a way to just do the groupby once, then perform the mean, max, etc. on that object and assign the results to sector_df?
You want to perform an aggregate using agg:
In [72]:
df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
Out[72]:
roc
mean max min count
date industry
2015-02-25 Health 175.0 200 150 2
Mining 57.5 70 45 2
2015-02-26 Health 185.0 310 60 2
Mining 44.0 65 23 2
This allows you to pass an iterable (a list in this case) of functions to perform.
EDIT
To access individual results you need to pass a tuple for each axis:
In [78]:
gp.loc[('2015-02-25','Health'),('roc','mean')]
Out[78]:
175.0
Where gp = df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
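On pandas 0.25 or newer, named aggregation does the same thing in a single pass and gives flat column names much like the question's sector_df; a sketch, not part of the original answer:
sector_df = primary_df.groupby(['date', 'industry'], sort=True)['roc'].agg(
    mean_roc='mean',
    max_roc='max',
    min_roc='min',
    count='count',
)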
You can just save the groupby part to a variable as shown below:
primary_df = pd.DataFrame([['25-02-2015', 'Health', 'abc', 200],
                           ['25-02-2015', 'Health', 'xyz', 150],
                           ['25-02-2015', 'Mining', 'tyr', 45],
                           ['25-02-2015', 'Mining', 'ujk', 70],
                           ['26-02-2015', 'Health', 'abc', 60],
                           ['26-02-2015', 'Health', 'xyz', 310],
                           ['26-02-2015', 'Mining', 'tyr', 65],
                           ['26-02-2015', 'Mining', 'ujk', 23]],
                          columns='date industry symbol roc'.split())
grouped = primary_df.groupby(['date', 'industry'], sort=True)
sector_df = grouped.mean()
tmp_max_df = grouped.max()
tmp_min_df = grouped.min()
tmp_count_df = grouped.count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)

Looping through pandas DataFrame and having the output switch from a DataFrame to a Series between loops causes an error

I have vehicle information that I want to evaluate over several different time periods and I'm modifying different columns in the DataFrame as I move through the information. I'm working with the current and previous time periods so I need to concat the two and work on them together.
The problem I'm having is that when I use the 'time' column as an index in pandas and loop through the data, the object that is returned is either a DataFrame or a Series depending on the number of vehicles (or rows) in the time period. This change in object type creates an error as I'm trying to use DataFrame methods on Series objects.
I created a small sample program that shows what I'm trying to do and the error that I'm receiving. Note this is a sample and not the real code. I have tried simply querying the data by time period instead of using an index, and that works, but it is too slow for what I need to do.
import pandas as pd
df = pd.DataFrame({
    'id': range(44, 51),
    'time': [99, 99, 97, 97, 96, 96, 100],
    'spd': [13, 22, 32, 41, 42, 53, 34],
})
df = df.set_index(['time'], drop=False)
st = True
for ind in df.index.unique():
    data = df.ix[ind]
    print data
    if st:
        old_data = data
        st = False
    else:
        c = pd.concat([data, old_data])
        # do some work here
OUTPUT IS:
id spd time
time
99 44 13 99
99 45 22 99
id spd time
time
97 46 32 97
97 47 41 97
id spd time
time
96 48 42 96
96 49 53 96
id 50
spd 34
time 100
Name: 100, dtype: int64
Traceback (most recent call last):
  File "C:/Users/m28050/Documents/Projects/fhwa/tca/v_2/code/pandas_ind.py", line 24, in <module>
    c = pd.concat([data, old_data])
  File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 873, in concat
    return op.get_result()
  File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 946, in get_result
    new_data = com._concat_compat([x.values for x in self.objs])
  File "C:\Python27\lib\site-packages\pandas\core\common.py", line 1737, in _concat_compat
    return np.concatenate(to_concat, axis=axis)
ValueError: all the input arrays must have same number of dimensions
If anyone has the correct way to loop through the DataFrame and update the columns or can point out a different method to use, that would be great.
Thanks for your help.
Jim
I think groupby could help here:
In [11]: spd_lt_40 = df1[df1.spd < 40]
In [12]: spd_lt_40_count = spd_lt_40.groupby('time')['id'].count()
In [13]: spd_lt_40_count
Out[13]:
time
97 1
99 2
100 1
dtype: int64
and then set this to a column in the original DataFrame:
In [14]: df1['spd_lt_40_count'] = spd_lt_40_count
In [15]: df1['spd_lt_40_count'].fillna(0, inplace=True)
In [16]: df1
Out[16]:
id spd time spd_lt_40_count
time
99 44 13 99 2
99 45 22 99 2
97 46 32 97 1
97 47 41 97 1
96 48 42 96 0
96 49 53 96 0
100 50 34 100 1
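As an aside (not part of the answer above), the original ValueError comes from single-label selection returning a Series when a time period has only one row; selecting with a list of labels always returns a DataFrame, so the concat stays two-dimensional. A minimal sketch using .loc, since .ix is long deprecated:
for ind in df.index.unique():
    data = df.loc[[ind]]  # list of labels -> always a DataFrame, even when only one row matches
    # ... pd.concat([data, old_data]) now always concatenates two DataFrames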
