I have this dataframe (shortened):
+-------+------------+--------+----------+-------+------+
| index | id_product | margin | supplier | price | seen |
+-------+------------+--------+----------+-------+------+
| 0 | 100000000 | 92.00 | 14 | 0.56 | 2 |
| 1 | 100000230 | 72.21 | 27 | 8.17 | 0 |
| 2 | 100001440 | 72.07 | 15 | 16.20 | 687 |
| 3 | 100002331 | 30.55 | 13 | 41.67 | 0 |
| 7 | 100001604 | 35.17 | 27 | 18.80 | 491 |
| ... | ... | ... | ... | ... | ... |
| 9830 | 100000320 | 77.78 | 18 | 13.33 | 0 |
| 9831 | 100000321 | 77.78 | 98 | 13.33 | 0 |
| 9832 | 100000443 | 77.78 | 17 | 13.33 | 4587 |
| 9834 | 100000292 | 88.13 | 3 | 10.56 | 0 |
| 9835 | 100000236 | 72.21 | 18 | 10.56 | 0 |
+-------+------------+--------+----------+-------+------+
What I am trying to do is to randomly extract 3 rows, maybe using df.sample(3), but with these conditions:
the 3 selected rows should have 3 different ecom_id values: (14, 27, 13) is good, (14, 27, 14) is not.
rows with higher margins should be favoured. I use weights='margin' and it works fine.
rows with a lower seen count should be favoured. Is it possible to reverse the weighting with sample() to favour the lowest values?
The 3 selected rows should come from 3 different price ranges: the first selected row should have a price < 20.0, the second a price between 30 and 50, and the third a price > 80.
Is this possible?
I have tried something like:
pr_1_pd = pr_pd.loc[pr_pd['price'] < 20]
pr_2_pd = pr_pd.loc[(pr_pd['price'] > 30) & (pr_pd['price'] < 50)]
pr_3_pd = pr_pd.loc[pr_pd['price'] > 80]
pr_1_pd = pr_1_pd.sort_values(by=['margin','seen'],ascending=[False,True])
pr_2_pd = pr_2_pd.sort_values(by=['margin','seen'],ascending=[False,True])
pr_3_pd = pr_3_pd.sort_values(by=['margin','seen'],ascending=[False,True])
But I'm not sure how to combine all the filters together.
the 3 selected rows should have 3 different ecom_id values: (14, 27, 13) is good, (14, 27, 14) is not.
Setting replace=False in DataFrame.sample should achieve this, provided ecom_id is unique per row.
rows with a lower seen count should be favoured. Is it possible to reverse the weighting with sample() to favour the lowest values?
You could invert the weights, e.g. new_weight = 1 / (seen + 1), to achieve this (the + 1 guards against division by zero, since seen can be 0).
The 3 selected rows should come from 3 different price ranges: the first selected row should have a price < 20.0, the second a price between 30 and 50, and the third a price > 80.
You'll have to sample from pr_1_pd, pr_2_pd, and pr_3_pd individually and then combine the results using pd.concat to achieve this.
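A rough sketch putting those pieces together (pick_three is a hypothetical helper, not tested code; it assumes the pre-filtered slices pr_1_pd, pr_2_pd, pr_3_pd from the question, treats the ecom_id you mention as the supplier column shown in the table, and simply redraws when two picks share a supplier):

import pandas as pd

def pick_three(slices, max_tries=100):
    # favour high margin and low seen; the +1 avoids division by zero when seen == 0
    weighted = [s.assign(weight=s['margin'] / (s['seen'] + 1)) for s in slices]

    for _ in range(max_tries):
        # one weighted draw per price range, combined with pd.concat
        picks = pd.concat([s.sample(1, weights='weight') for s in weighted])
        if picks['supplier'].nunique() == 3:   # three different suppliers / ecom_id
            return picks
    raise ValueError('no draw with 3 distinct suppliers found')

# picks = pick_three([pr_1_pd, pr_2_pd, pr_3_pd])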
Related
I have a dataframe with the below specs
| ID | Name| count |
| -- |---- | ---- |
| 1 | A | 75 |
| 2 | B | 10 |
| 3 | A | 15 |
| 4 | A | 10 |
| 5 | A | 5 |
| 6 | A | 3 |
If I set the threshold for the count to 15, I want any count above the threshold to be capped and the excess spread uniformly over the other rows of the same group. So the output should be:
| ID | Name | count |
| -- |---- | ---- |
| 1 | A | 15 |
| 2 | B | 10 |
| 3 | A | 30 |
| 4 | A | 25 |
| 5 | A | 20 |
| 6 | A | 18 |
The 75 from ID 1 is capped at the threshold and the excess is redistributed within its "Name" group: 75 - 15 = 60 spread over the other four 'A' rows adds 15 to each. The cap is always the threshold value. Please advise.
IIUC you can do something like:
import numpy as np

def distribute(sr, thres=15):
    # cap the group's largest value at the threshold and spread the excess over the rest
    idxmax = sr.idxmax()
    remain = max((sr[idxmax] - thres) / max(len(sr) - 1, 1), 0)
    return np.where(sr.index == idxmax, min(sr[idxmax], thres), sr + remain)

df['newcount'] = df.groupby('Name')['count'].transform(distribute, thres=15)
Output:
| ID | Name | count | newcount |
| -- | ---- | ----- | -------- |
| 1  | A    | 75    | 15       |
| 2  | B    | 10    | 10       |
| 3  | A    | 15    | 30       |
| 4  | A    | 10    | 25       |
| 5  | A    | 5     | 20       |
| 6  | A    | 3     | 18       |
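For reference, a minimal end-to-end reproduction (the frame is rebuilt from the question's table and reuses the distribute function defined above):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Name': ['A', 'B', 'A', 'A', 'A', 'A'],
                   'count': [75, 10, 15, 10, 5, 3]})

df['newcount'] = df.groupby('Name')['count'].transform(distribute, thres=15)
print(df)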
I have a dataframe with a column with either a 1 or 0 in it.
This is the Signal column.
I want to cycle through this dataframe until I get to the first 1, then take the value in the Open column and put it into another DataFrame Total, in a column Buy.
Then, as it continues through the dataframe, when it reaches the first 0, take that value in the Open column and put it into the same DataFrame Total, in a column Sold.
I know I need a loop within a loop but I'm not getting very far!
Any pointers/help would be appreciated!
Total = DataFrame()
for i in range(len(df)):
    if i.Signal == 1:
        Total['Buy'] = i.Open
    if i.Signal == 0:
        Total['Sold'] = i.Open
I know the code is wrong!
Cheers
Example of DataFrame
df = pd.DataFrame({'Signal': [0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,1,1,0,0], 'Open': np.random.rand(20)})
>>> df
| | Signal | Open |
|---:|---------:|----------:|
| 0 | 0 | 0.959061 |
| 1 | 0 | 0.820516 |
| 2 | 1 | 0.0562783 |
| 3 | 1 | 0.612508 |
| 4 | 1 | 0.288703 |
| 5 | 1 | 0.332118 |
| 6 | 0 | 0.949236 |
| 7 | 0 | 0.20909 |
| 8 | 1 | 0.574924 |
| 9 | 1 | 0.170012 |
| 10 | 1 | 0.0255655 |
| 11 | 1 | 0.788829 |
| 12 | 0 | 0.39462 |
| 13 | 0 | 0.493338 |
| 14 | 0 | 0.347471 |
| 15 | 1 | 0.574096 |
| 16 | 1 | 0.286367 |
| 17 | 1 | 0.131327 |
| 18 | 0 | 0.38943 |
| 19 | 0 | 0.592241 |
import numpy as np

# get the position of the first 1
first_1 = (df['Signal'] == 1).idxmax()

# Create a mask with True in the position of the first 1
# and every time a different value appears (0 after a 1, or 1 after a 0)
mask = np.full(len(df), False)
mask[first_1] = True
for i in range(first_1 + 1, len(df)):
    mask[i] = df['Signal'][i] != df['Signal'][i-1]
>>> df[mask]
| | Signal | Open |
|---:|---------:|----------:|
| 2 | 1 | 0.0562783 |
| 6 | 0 | 0.949236 |
| 8 | 1 | 0.574924 |
| 12 | 0 | 0.39462 |
| 15 | 1 | 0.574096 |
| 18 | 0 | 0.38943 |
# Create the new DF: 'Buy' = every other value starting with the first
# (where Signal flips to 1), 'Sold' = the values in between (where it flips back to 0)
open_values = df[mask]['Open'].to_numpy()
total = pd.DataFrame({'Buy': open_values[::2], 'Sold': open_values[1::2]})
>>> total
| | Buy | Sold |
|---:|----------:|---------:|
| 0 | 0.0562783 | 0.949236 |
| 1 | 0.574924 | 0.39462 |
| 2 | 0.574096 | 0.38943 |
It works under the assumption that the original df table ends with 0s and not with 1s, i.e. every run of 1s must be followed by at least one 0.
The assumption makes sense since the objective is to take differences later.
If the last value is 1, it will produce ValueError: All arrays must be of the same length.
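If the frame can end on a run of 1s, a small variant (a sketch assuming the default RangeIndex, so idxmax matches the positional index) simply drops the last unmatched buy:

import pandas as pd

# mark the first 1 and every point where Signal flips, as above
change = df['Signal'].ne(df['Signal'].shift()).to_numpy()
change[:int((df['Signal'] == 1).idxmax())] = False   # ignore the leading 0s

open_values = df.loc[change, 'Open'].to_numpy()
if len(open_values) % 2:            # frame ended on a 1: drop the unmatched buy
    open_values = open_values[:-1]

total = pd.DataFrame({'Buy': open_values[::2], 'Sold': open_values[1::2]})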
So I have the following Dask dataframe, grouped by the Problem column.
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 5 | 2 | 15 | 38 |
| A | 15 | 2 | 15 | 23 |
| B | 11 | 6 | 10 | 54 |
| B | 10 | 6 | 10 | 48 |
| B | 18 | 6 | 10 | 79 |
| C | 50 | 8 | 25 | 120 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
The goal is to create a new dataframe with all rows where the Cost value is minimal within its Problem group. So we want the following result:
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 15 | 2 | 15 | 23 |
| B | 10 | 6 | 10 | 48 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
How can I achieve this result? I already tried using idxmin() as mentioned in another question on here, but then I get ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
What if you create another dataframe that is grouped by Problem and Cost.min()? Let's say the new column is called cost_min.
df1 = df.groupby('Problem')['Cost'].min().reset_index().rename(columns={'Cost': 'cost_min'})
Then, merge this new cost_min column back onto the original dataframe.
df2 = pd.merge(df, df1, how='left', on='Problem')
From there, do something like:
df_new = df2.loc[df2['Cost'] == df2['cost_min']]
Just wrote some pseudocode, but I think that all works with Dask.
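For completeness, a sketch of the same idea written against dask.dataframe (assuming df is already a Dask dataframe; the variable names are illustrative and this is untested):

import dask.dataframe as dd

# per-Problem minimum cost, renamed so it doesn't collide with 'Cost' on merge
min_cost = (df.groupby('Problem')['Cost'].min()
              .reset_index()
              .rename(columns={'Cost': 'cost_min'}))

# broadcast the minimum back onto every row, then keep the rows that reach it
merged = dd.merge(df, min_cost, how='left', on='Problem')
result = merged[merged['Cost'] == merged['cost_min']].drop('cost_min', axis=1)

print(result.compute())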
I'm stuck with a little problem with python and pandas dataframe.
I want to make a pivot table that counts some related items.
I got a dataframe with this structure.
+-----+------------+-----------+-----------+
| ID | Item_Type | Basket_ID | OwnerName |
+-----+------------+-----------+-----------+
| 3 | Basket | | |
| 336 | ChickenEgg | 3 | Henk |
| 841 | SomeEgg | 3 | Henk |
| 671 | EasterEgg | 3 | Piet |
| 9 | Basket | | |
| 336 | Orange | 9 | Piet |
| 841 | Banana | 9 | Piet |
| 671 | Strawberry | 9 | Herman |
| 888 | Apple | 9 | Herman |
| 821 | Apricots | 9 | NaN |
+-----+------------+-----------+-----------+
I want to count how many items are related to the ‘Basket’ item (Parent) and how often the ‘OwnerName’ appears with the related ‘Basket’ item.
I want my dataframe to look like the one below.
You can see the total count of items related to the parent Item_Type 'Basket', and how often each owner name appears.
You can also see the total number of owners ('Total_Owners') and the number of items without an owner.
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| ID | Item_Type | Total_Items_Count | Henk_Count | Piet_Count | Herman_Count | Total_Owners | Total_NaN_Values |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
| 3 | Basket | 3 | 2 | 1 | 0 | 3 | |
| 9 | Basket | 5 | 0 | 2 | 2 | 4 | 1 |
+----+-----------+-------------------+------------+------------+--------------+--------------+------------------+
Answering your question requires multiple steps but the core idea is that you should use pivot_table.
The table conceptually has a multilevel index: Basket_ID is the high-level index and ID is the more granular level. The first thing you have to do is remove the rows where Basket_ID is missing so that the granularity of the table is consistent.
Let's say you named your dataframe df.
# Preparation steps
df = df[~df["Basket_ID"].isna()] # Remove the rows that shouldn't be counted.
df.loc[df["OwnerName"].isna(),"OwnerName"] = "unknown" # set missing to arbitrary value
# Make a pivot table
df = df.pivot_table(index=['Basket_ID'],columns=['OwnerName'],values=['Item_Type'],aggfunc='count').fillna(0)
From there onwards you should be able to calculate your remaining columns, for example:
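As a rough continuation (a sketch, not a definitive implementation; it assumes the 'unknown' placeholder set above and that df now holds the pivoted frame), the remaining columns could look like this:

# the pivot has MultiIndex columns ('Item_Type', <owner>); keep just the owner names
df.columns = df.columns.droplevel(0)

df['Total_Items_Count'] = df.sum(axis=1)                  # all items in the basket
owner_cols = [c for c in df.columns
              if c not in ('unknown', 'Total_Items_Count')]
df['Total_Owners'] = df[owner_cols].sum(axis=1)           # items that have an owner
df['Total_NaN_Values'] = df.get('unknown', 0)             # items without an owner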
I have a large data set like the below:
+---+--------------------------------------+---------------------------------------+--------------------------+
| | dc(uid): interaction_single_message_ | dc(uid): interaction_single_message_1 | dc(uid): interaction_yes |
+---+--------------------------------------+---------------------------------------+--------------------------+
| 0 | 0 | 0 | 89884 |
| 1 | 0 | 29 | 0 |
| 2 | 0 | 239 | 239 |
| 3 | 324 | 0 | 0 |
| 4 | 12021 | 0 | 12021 |
+---+--------------------------------------+---------------------------------------+--------------------------+
How can I go row by row, pick the first number that is encountered, and add it to a new column at the same index? If there are multiple numbers in a row, they will always be the same, which is why I am asking for the first encountered number.
The ideal output would be something like this:
+---+-------+
| | nums |
+---+-------+
| 0 | 89884 |
| 1 | 29 |
| 2 | 239 |
| 3 | 324 |
| 4 | 12021 |
+---+-------+
Would I be using groupby.unique() to do this?
In your sample df, so long as all the present values in a row are the same, the following will work: we use a boolean condition to mask out the 0 values, then call mean with axis=1 to calculate the row-wise mean:
In [24]:
df[df>0].mean(axis=1)
Out[24]:
0 89884
1 29
2 239
3 324
4 12021
dtype: float64
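To store that result as the nums column from the example (a small follow-up sketch; cast to int only if you are sure no row is all zeros):

# mask out the zeros, take the row-wise mean, keep the original index
df['nums'] = df[df > 0].mean(axis=1)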