I am trying to create summaries of unique buyers for each of the different products in my Sales table. My target outcome is as follows:
  CustSeg  UNIQUE_PROD1_CUST
0    High                  7
1     Low                  8
2     Mid                  4
This summary is created and assigned to variable as below:
# Count of DISTINCT PROD1 CUSTOMERS
PROD1_CUST = (
Sales_Df.loc[(Sales_Df.Prod1_Qty > 0)]
.groupby("CustSeg")["CustID"]
.count()
.reset_index(name="UNIQUE_PROD1_CUST")
)
PROD1_CUST
The Sales_Df dataframe can be replicated thus:
Sales_Qty = {
"CustID": ['C01', 'C02', 'C03', 'C04', 'C05', 'C06', 'C07', 'C08', 'C09', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', ],
"CustSeg": ['High', 'High', 'Mid', 'High', 'Low', 'Low', 'Low', 'Low', 'Low', 'Mid', 'Low', 'Low', 'Mid', 'Low', 'High', 'High', 'High', 'High', 'Mid', 'Low', ],
"Prod1_Qty": [8, 7, 12, 15, 7, 15, 7, 8, 3, 15, 0, 3, 4, 4, 7, 11, 12, 12, 6, 1, ],
"Prod2_Qty": [2, 5, 0, 1, 14, 15, 3, 1, 11, 0, 5, 11, 12, 8, 6, 15, 7, 4, 3, 10, ],
"Prod3_Qty": [13, 4, 0, 11, 3, 5, 11, 11, 10, 14, 2, 4, 3, 14, 14, 10, 5, 0, 0, 9, ],
"Prod4_Qty": [11, 15, 2, 0, 6, 2, 12, 14, 11, 15, 5, 14, 13, 0, 10, 2, 13, 11, 12, 15, ],
"Prod5_Qty": [9, 15, 5, 4, 9, 0, 13, 9, 8, 11, 10, 12, 8, 3, 14, 11, 9, 15, 8, 14, ]
}
Sales_Df = pd.DataFrame(Sales_Qty)
Sales_Df
Now, in real life, the dataframe is far larger (a shape of at least (5000000, 130)), which makes manually repeating the summary for each product tedious, so I am trying to automate the creation of the variables and the summaries. I am approaching the task with the following steps.
# Extract the proposed variable names from the dataframe column names.
all_cols = Sales_Df.columns.values.tolist()
# Remove non-product quantity columns from the list
not_prod_cols = ["CustSeg", "CustID"]
prod_cols = [x for x in all_cols if x not in not_prod_cols]
I know the next things should be:
creating the variable names from the list prod_cols and storing
those variables in a list - let's name the list prod_dfs
prod_dfs = []
Creating the dynamic formula that creates the dataframes and append
their variable names to prod_dfs using the "logic" below.
for x in prod_cols:
x[:-4] + "_CUST" = (
Sales_Df.loc[(Sales_Df.x > 0)]
.groupby("CustSeg")["CustID"]
.count()
.reset_index(name="UNIQUE" + x[:-4] + "_CUST")
)
prod_dfs.append(x)
This is where I am stuck. Kindly assist.
Thank you for sharing a reproducible example; it seems you have made good progress. If I understand correctly, you want to compute the number of unique customers per segment who have purchased a given item.
To follow your approach, you could iterate through the product columns, compute the counts, and assign this to a results dataframe:
prod_cols = [col for col in Sales_Df.columns if col.startswith('Prod')]
result = None
for prod in prod_cols:
    counts = (
        Sales_Df
        .loc[Sales_Df[prod] > 0]
        .groupby('CustSeg')
        [prod]
        .count()
    )
    if result is None:
        result = counts.to_frame()
    else:
        result[prod] = counts
CustSeg  Prod1_Qty  Prod2_Qty  Prod3_Qty  Prod4_Qty  Prod5_Qty
High             7          7          6          6          7
Low              8          9          9          8          8
Mid              4          2          2          4          4
This would help you in the second dimension, in the sense that you do not have to write this aggregation code for all your columns.
However, the resulting code is not very efficient because it does O(m) groupby operations, where m is the number of columns.
You can get your desired result with slightly different logic.
Form groups of each customer segment.
For each product, count the number of purchasers
Combine the results
This one-liner implements that logic:
Sales_Df.drop('CustID', axis=1).groupby('CustSeg').apply(lambda group: (group>0).sum(axis=0))
Note that we first drop CustID because in your example, after grouping by CustSeg, it is the only column that is not a product quantity.
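The same idea can also be written without apply, by building the boolean "bought it" mask first and summing it per segment. The following is a minimal sketch on a cut-down, made-up version of the sample data; the names toy and buyers_per_segment are illustrative and not from the question.

```python
import pandas as pd

# Cut-down stand-in for Sales_Df (values are illustrative).
toy = pd.DataFrame({
    "CustID": ["C01", "C02", "C03", "C04", "C05", "C06"],
    "CustSeg": ["High", "High", "Mid", "Low", "Low", "Mid"],
    "Prod1_Qty": [8, 0, 12, 3, 0, 1],
    "Prod2_Qty": [0, 5, 0, 0, 14, 2],
})

prod_cols = [col for col in toy.columns if col.startswith("Prod")]

# True where the customer bought the product, then count the Trues per segment.
buyers_per_segment = toy[prod_cols].gt(0).groupby(toy["CustSeg"]).sum()
print(buyers_per_segment)
```

Like the code above, this counts rows, so it matches the number of unique customers only when each CustID appears once.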
As an aside: consider reviewing the pandas indexing basics. You may find it easier to use the syntax of df['A'] rather than df.A because it allows you to use other programming constructs more effectively.
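To illustrate that point, and as one possible alternative to creating variable names dynamically, the per-product summaries can be kept in a dictionary keyed by the name each variable would have had. This is only a sketch on made-up data: toy and summaries are hypothetical names, and nunique is used on the assumption that a customer could appear in more than one row.

```python
import pandas as pd

# Made-up data; customer C04 appears twice on purpose.
toy = pd.DataFrame({
    "CustID": ["C01", "C02", "C03", "C04", "C04"],
    "CustSeg": ["High", "High", "Mid", "Low", "Low"],
    "Prod1_Qty": [8, 0, 12, 3, 2],
    "Prod2_Qty": [0, 5, 0, 0, 14],
})

prod_cols = [col for col in toy.columns if col.startswith("Prod")]

summaries = {}
for col in prod_cols:
    name = col[:-4].upper() + "_CUST"      # "Prod1_Qty" -> "PROD1_CUST"
    summaries[name] = (
        toy.loc[toy[col] > 0]              # toy[col] accepts a variable; toy.col would not
        .groupby("CustSeg")["CustID"]
        .nunique()                         # distinct customers, not rows
        .reset_index(name="UNIQUE_" + name)
    )

print(summaries["PROD1_CUST"])
```

Each summary is then available as summaries["PROD1_CUST"], summaries["PROD2_CUST"], and so on, without generating variable names at run time.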
display (df)
display(prices)
I have 2 dataframes, I want to replace the month numbers in dataframe 1 with the DA HB West value for that month. It also has to have the same cheat code as the df.
I feel like this is really easy to do but I keep getting an error.
The error reads "ValueError: Can only compare identically-labeled Series objects"
With a sample of your data:
df2 = pd.DataFrame({"Month": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"DA HB West": np.random.random(12),
"Year": [2019]*12,
"Cheat": ["2019PeakWE"]*12})
df = pd.DataFrame({"Month1": [7, 7, 7, 9, 11],
"Month2": [8, 8, 8, 10, 12],
"Month3": [9.0, 9.0, 9.0, 11.0, np.nan],
"Cheat4": ["2019PeakWE"]*5})
df.columns = df.columns.str[:-1]
Fill the na values so that there isn't an error with changing value types to integers:
df.fillna(0, inplace=True)
Map all but the last column:
d = {}
for i, j in df.groupby("Cheat"):
    mapping = df2[df2["Cheat"] == i].set_index("Month")["DA HB West"].to_dict()
    d[i] = j.copy()  # copy so the assignment below does not write into a slice of df
    d[i].iloc[:, :-1] = j.iloc[:, :-1].astype(int).apply(lambda x: x.map(mapping))
This creates a dictionary of all the different Cheats.
You can then append them all together, if you need to.
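One way to do that last step is pd.concat. The sketch below uses made-up prices and two hypothetical cheat codes, and it keeps distinct column names (Month1, Month2, Month3) for readability; these names and values are assumptions for illustration, not the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical price table: one price per month and cheat code (values made up).
prices = pd.DataFrame({
    "Month": list(range(1, 13)) * 2,
    "DA HB West": [float(m) for m in range(1, 13)] + [100.0 + m for m in range(1, 13)],
    "Cheat": ["2019PeakWE"] * 12 + ["2019OffPeak"] * 12,
})

df = pd.DataFrame({
    "Month1": [7, 9, 11],
    "Month2": [8, 10, 12],
    "Month3": [9.0, 11.0, np.nan],
    "Cheat": ["2019PeakWE", "2019OffPeak", "2019PeakWE"],
})
month_cols = ["Month1", "Month2", "Month3"]

parts = []
for cheat, group in df.groupby("Cheat"):
    # Month -> price lookup restricted to this cheat code.
    lookup = prices.loc[prices["Cheat"] == cheat].set_index("Month")["DA HB West"].to_dict()
    mapped = group.copy()
    for col in month_cols:
        # Months without a price (including NaN) become NaN.
        mapped[col] = group[col].map(lambda m: lookup.get(m, np.nan))
    parts.append(mapped)

# Stitch the per-cheat pieces back together in the original row order.
result = pd.concat(parts).sort_index()
print(result)
```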
I would like to correct the values in hyperspectral readings from a camera using the formula described over here:
the captured data is subtracted by dark reference and divided with
white reference subtracted dark reference.
In the original example, the task is rather simple, white and dark reference has the same shape as the main data so the formula is executed as:
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr),
np.subtract(white_nparr, dark_nparr))
However, the main data is much larger in my case. The shapes are as follows:
$ white_nparr.shape, dark_nparr.shape, data_nparr.shape
((100, 640, 224), (100, 640, 224), (4300, 640, 224))
that's why I repeat the reference arrays.
white_nparr_rep = white_nparr.repeat(43, axis=0)
dark_nparr_rep = dark_nparr.repeat(43, axis=0)
return np.divide(np.subtract(data_nparr, dark_nparr_rep), np.subtract(white_nparr_rep, dark_nparr_rep))
And it works almost perfectly, as can be seen in the image on the left. But this approach requires an enormous amount of memory, so I decided to traverse the large array and replace the original values with corrected ones on the go instead:
ref_scale = dark_nparr.shape[0]
data_scale = data_nparr.shape[0]
for i in range(int(data_scale / ref_scale)):
    data_nparr[i*ref_scale:(i+1)*ref_scale] = np.divide(
        np.subtract(data_nparr[i*ref_scale:(i+1)*ref_scale], dark_nparr),
        np.subtract(white_nparr, dark_nparr)
    )
But that traversal approach gives me the ugliest of results, as can be seen on the right. I'd appreciate any idea that would help me fix this.
Note: I apply 20-times co-adding (mean of 20 readings) to obtain the images below.
EDIT: The dtype of each array is as follows:
$ white_nparr.dtype, dark_nparr.dtype, data_nparr.dtype
(dtype('float32'), dtype('float32'), dtype('float32'))
Your two methods don't agree because in the first method you used
white_nparr_rep = white_nparr.repeat(43, axis=0)
but the second method corresponds to using
white_nparr_rep = np.tile(white_nparr, (43, 1, 1))
If the first method is correct, you'll have to adjust the second method to act accordingly. Perhaps
# repeat(43, axis=0) means reference frame i covers rows i*43 ... (i+1)*43 - 1
factor = data_scale // ref_scale
for i in range(ref_scale):
    rows = slice(i * factor, (i + 1) * factor)
    data_nparr[rows] = np.divide(
        np.subtract(data_nparr[rows], dark_nparr[i]),
        np.subtract(white_nparr[i], dark_nparr[i])
    )
A simple example with 2-d arrays that shows the difference between repeat and tile:
In [146]: z
Out[146]:
array([[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15]])
In [147]: np.repeat(z, 3, axis=0)
Out[147]:
array([[ 1, 2, 3, 4, 5],
[ 1, 2, 3, 4, 5],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[11, 12, 13, 14, 15],
[11, 12, 13, 14, 15]])
In [148]: np.tile(z, (3, 1))
Out[148]:
array([[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15]])
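If the repeat ordering is the one you want, the correction can also be applied without building the repeated reference arrays: reshape the data so that each reference frame lines up with its own block of rows, and let broadcasting do the rest. This is a sketch with small stand-in shapes and random values; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
ref_len, factor, tail = 4, 3, (5, 2)   # small stand-ins for 100, 43 and (640, 224)

white = rng.uniform(2.0, 3.0, size=(ref_len, *tail)).astype(np.float32)
dark = rng.uniform(0.0, 1.0, size=(ref_len, *tail)).astype(np.float32)
data = rng.uniform(0.0, 3.0, size=(ref_len * factor, *tail)).astype(np.float32)

# Memory-hungry baseline from the question: explicit repeat().
dark_rep = dark.repeat(factor, axis=0)
white_rep = white.repeat(factor, axis=0)
expected = (data - dark_rep) / (white_rep - dark_rep)

# Broadcasting version: rows i*factor ... (i+1)*factor - 1 all use reference frame i.
blocks = data.reshape(ref_len, factor, *tail)
corrected = (blocks - dark[:, None]) / (white[:, None] - dark[:, None])
corrected = corrected.reshape(data.shape)

print(np.allclose(corrected, expected))
```

For the tile ordering instead, the reshape would be (factor, ref_len, ...) and the references would be indexed as dark[None] and white[None].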
Off topic postscript: I don't know why the author of the page that you linked to writes NumPy expressions as (for example):
corrected_nparr = np.divide(
np.subtract(data_nparr, dark_nparr),
np.subtract(white_nparr, dark_nparr))
NumPy allows you to write that as
corrected_nparr = (data_nparr - dark_nparr) / (white_nparr - dark_nparr)
which looks much nicer to me.
The dataframe looks like this:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
5, 3713.110107421875, 2012-01-07T03:14:03.937Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
7, 3713.89892578125, 2012-01-07T03:14:13.968Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
10, 3714.64990234375, 2012-01-07T03:14:24.015Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
12, 3714.64990234375, 2012-01-07T03:14:29.031Z
In some rows the timestamps differ only by milliseconds. I want to drop those rows and keep only the rows whose timestamps differ in the seconds. The same value repeats across rows that differ in milliseconds as well as rows that differ in seconds (for example rows 9 to 12), so I can't use a.loc[a.shift() != a].
The desired output would be:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
Try:
df.groupby(pd.to_datetime(df[2]).dt.floor('s')).head(1)
I hope it's self-explanatory.
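In case it is not, here is a small runnable sketch of the idea: truncate each timestamp to whole seconds and keep the first row per second. It uses five of the timestamps from the question; the column names value and ts, and the numbers in value, are made up for the example.

```python
import pandas as pd

stamps = [
    "2012-01-07T03:14:03.900Z",
    "2012-01-07T03:14:03.937Z",
    "2012-01-07T03:14:13.900Z",
    "2012-01-07T03:14:13.968Z",
    "2012-01-07T03:14:19.000Z",
]
df = pd.DataFrame({"value": [1.0, 1.0, 2.0, 2.0, 2.0], "ts": pd.to_datetime(stamps)})

# Truncate each timestamp to whole seconds; rows within the same second share a key.
second = df["ts"].dt.floor("s")

# head(1) keeps the first row of every group and preserves the original row order.
first_per_second = df.groupby(second).head(1)
print(first_per_second.index.tolist())   # [0, 2, 4]
```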
You can use the script below. I couldn't tell what your dataframe's column names are, so I invented the columns ['x', 'date_time'].
df = pd.DataFrame([
(3710.968017578125, pd.to_datetime('2012-01-07T03:13:43.859Z')),
(3710.968017578125, pd.to_datetime('2012-01-07T03:13:48.890Z')),
(3712.472900390625, pd.to_datetime('2012-01-07T03:13:53.906Z')),
(3712.472900390625, pd.to_datetime('2012-01-07T03:13:58.921Z')),
(3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.900Z')),
(3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.937Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.900Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.968Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:19.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.015Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.031Z'))],
columns=['x', 'date_time'])
Create a temporary column 'time_diff' holding the difference between the datetime of the current row and that of the previous row with the same x.
Keep only the rows where that difference is either missing (None) or more than 1 second.
Drop the temporary column time_diff.
df['time_diff'] = df.groupby('x')['date_time'].diff()
df = df[(df['time_diff'].isnull()) | (df['time_diff'].map(lambda x: x.seconds > 1))]
df = df.drop(['time_diff'], axis=1)
df
I'm trying to get the length of the values of the states array into a separate array then sort them by descending order, but I'm having trouble getting all the length values of the string into the array instead of having a single value after the iteration.
states = ["Abia", "Adamawa", "Anambra", "Akwa Ibom", "Bauchi", "Bayelsa", "Benue", "Borno", "Cross River", "Delta", "Ebonyi", "Enugu", "Edo", "Ekiti", "Gombe", "Imo", "Jigawa", "Kaduna", "Kano", "Katsina", "Kebbi", "Kogi", "Kwara", "Lagos", "Nasarawa", "Niger", "Ogun", "Ondo", "Osun", "Oyo", "Plateau", "Rivers", "Sokoto", "Taraba", "Yobe", "Zamfara"]
for i in states:
    a = [len(i)]
print(a)
Since you want the lengths sorted in descending order, use sorted with reverse=True and a list comprehension:
states = ["Abia", "Adamawa", "Anambra", "Akwa Ibom", "Bauchi", "Bayelsa", "Benue", "Borno", "Cross River", "Delta", "Ebonyi", "Enugu", "Edo", "Ekiti", "Gombe", "Imo", "Jigawa", "Kaduna", "Kano", "Katsina", "Kebbi", "Kogi", "Kwara", "Lagos", "Nasarawa", "Niger", "Ogun", "Ondo", "Osun", "Oyo", "Plateau", "Rivers", "Sokoto", "Taraba", "Yobe", "Zamfara"]
a = sorted([len(i) for i in states], reverse=True)
print (a)
Output
[11, 9, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3]
To get the indices of the sorted list without resorting to NumPy arrays, there are many ways: see here. I personally prefer to directly make use of NumPy's argsort. As the name suggests, it returns an array of indices corresponding to the sorted array/list in ascending order. To get the indices for descending order, you can just reverse the array returned by argsort by using [::-1]. Following is a solution to your problem:
import numpy as np
states = ["Abia", "Adamawa", "Anambra", "Akwa Ibom", "Bauchi", "Bayelsa", "Benue", "Borno", "Cross River", "Delta", "Ebonyi", "Enugu", "Edo", "Ekiti", "Gombe", "Imo", "Jigawa", "Kaduna", "Kano", "Katsina", "Kebbi", "Kogi", "Kwara", "Lagos", "Nasarawa", "Niger", "Ogun", "Ondo", "Osun", "Oyo", "Plateau", "Rivers", "Sokoto", "Taraba", "Yobe", "Zamfara"]
a = [len(i) for i in states]
indices_sorted = np.argsort(a)[::-1] # [::-1] gives you indices for decreasing order
Output
array([ 8, 3, 24, 35, 19, 1, 2, 30, 5, 4, 10, 16, 17, 33, 32, 31, 22,
13, 6, 7, 9, 11, 14, 25, 23, 20, 21, 26, 27, 34, 28, 18, 0, 12,
15, 29])
Now as you can see, the first index in the above output is 8 which means the 9th element of states which is Cross River. Similarly you can access and verify the other elements.
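For completeness, the same indices can be obtained without NumPy by sorting the positions instead of the values. This is a sketch on a shortened list of states; order is an illustrative name.

```python
states = ["Abia", "Adamawa", "Anambra", "Akwa Ibom", "Bauchi", "Cross River", "Edo"]

# Indices of states ordered by name length, longest first.
order = sorted(range(len(states)), key=lambda i: len(states[i]), reverse=True)

print(order)                        # [5, 3, 1, 2, 4, 0, 6]
print([states[i] for i in order])
```

With sorted, names of equal length keep their original relative order, so ties may come out in a different order than with the reversed argsort result.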
You can use a list comprehension:
lengths = [len(state) for state in states]
If you need to use a for loop, create a list and append to it:
lengths = []
for i in states:
lengths.append(len(i))
You can also do this using the map function without using a for loop:
a = list(map(len,states))
Using a list comprehension:
lens = [len(a) for a in states]
I am trying to test some strategies for a game, which can be defined by 10 non-negative integers that add up to 100. There are 109 choose 9, or roughly 10^12 of these, so comparing them all is not practical. I would like to take a random sample of about 1,000,000 of these.
I have tried the methods from the answers to this question, and this one, but all still seem far too slow to work. The quickest method seems like it will take about 180 hours on my machine.
This is how I've tried to make the generator (adapted from a previous SE answer). For some reason, changing prob does not seem to impact the run time of turning it into a list.
def tuples_sum_sample(nbval, total, prob, order=True):
    """
    Generate all the tuples L of nbval non-negative integers
    such that sum(L) = total.
    The tuples may be ordered (decreasing order) or not.
    """
    # A bare return ends a generator; raise StopIteration is an error since Python 3.7.
    if nbval == 0 and total == 0:
        yield tuple()
        return
    if nbval == 1:
        yield (total,)
        return
    if total == 0:
        yield (0,) * nbval
        return
    for start in range(total, 0, -1):
        for qu in tuples_sum(nbval - 1, total - start):
            if qu[0] <= start:
                sol = (start,) + qu
                if order:
                    if random.random() < prob:
                        yield sol
                else:
                    l = set()
                    for p in permutations(sol, len(sol)):
                        if p not in l:
                            l.add(p)
                            if random.random() < prob:
                                yield p
Rejection sampling seems like it would take about 3 million years, so this is out as well.
randsample = []
while len(randsample) < 1000000:
    x = tuple(random.randint(0, 100) for _ in range(10))
    if sum(x) == 100:
        randsample.append(x)
randsample
Can anyone think of another way to do this?
Thanks
A couple of frame-challenging questions:
Is there any reason you must generate the entire population, then sample that population?
Why do you need to check if your numbers sum to 100?
You can generate a set of numbers that sum to a value. Check out the first answer here:
Random numbers that add to 100: Matlab
Then generate the number of such sets you desire (1,000,000 in this case).
import numpy as np
def set_sum(number=10, total=100):
    initial = np.random.random(number-1) * total
    sort_list = np.append(initial, [0, total]).astype(int)
    sort_list.sort()
    set_ = np.diff(sort_list)
    return set_

if __name__ == '__main__':
    import timeit
    a = set_sum()
    n = 1000000
    sample = [set_sum() for i in range(n)]
Numpy to the rescue!
Specifically, you need a multinomial distribution:
import numpy as np
desired_sum = 100
n = 10
np.random.multinomial(desired_sum, np.ones(n)/n, size=1000000)
It outputs a matrix with a million rows of 10 random integers in a few seconds. Each row sums up to 100.
Here's a smaller example:
np.random.multinomial(desired_sum, np.ones(n)/n, size=10)
which outputs:
array([[ 8, 7, 12, 11, 11, 9, 9, 10, 11, 12],
[ 7, 11, 8, 9, 9, 10, 11, 14, 11, 10],
[ 6, 10, 11, 13, 8, 10, 14, 12, 9, 7],
[ 6, 11, 6, 7, 8, 10, 8, 18, 13, 13],
[ 7, 7, 13, 11, 9, 12, 13, 8, 8, 12],
[10, 11, 13, 9, 6, 11, 7, 5, 14, 14],
[12, 5, 9, 9, 10, 8, 8, 16, 9, 14],
[14, 8, 14, 9, 11, 6, 10, 9, 11, 8],
[12, 10, 12, 9, 12, 10, 7, 10, 8, 10],
[10, 7, 10, 19, 8, 5, 11, 8, 8, 14]])
The sums appear to be correct:
sum(np.random.multinomial(desired_sum, np.ones(n)/n, size=10).T)
# array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
Python only
You could also start with a list of 10 zeroes, iterate 100 times and increment a random cell each time:
import random
desired_sum = 100
n = 10
row = [0] * n
for _ in range(desired_sum):
    row[random.randrange(n)] += 1
row
# [16, 7, 9, 7, 10, 11, 4, 19, 4, 13]
sum(row)
# 100
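Note that neither the multinomial draw nor the increment loop gives every tuple the same probability; both favour rows whose entries are close to 10. If the sample is meant to be uniform over all tuples (an assumption about the goal), a "stars and bars" draw is one option: pick 9 bar positions among 109 slots and read off the gaps. The function name random_composition is hypothetical, and the sample size here is kept small for illustration.

```python
import random

def random_composition(parts=10, total=100, rng=random):
    """Return `parts` non-negative ints summing to `total`, uniformly at random."""
    # Stars and bars: place parts - 1 bars among total + parts - 1 slots.
    slots = total + parts - 1
    bars = sorted(rng.sample(range(slots), parts - 1))
    # Sentinels at both ends; each gap between neighbours is the size of one part.
    edges = [-1] + bars + [slots]
    return tuple(b - a - 1 for a, b in zip(edges, edges[1:]))

sample = [random_composition() for _ in range(1000)]
print(sample[0], sum(sample[0]))
```

Every set of bar positions corresponds to exactly one tuple, so choosing the positions uniformly chooses the tuple uniformly.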