here's a pretty basic question, but my brain is giving up on me and I would really appreciate some help.
I have a dataset with 10000 rows.
I have an area name column with 100 unique area names.
I have a type column with types ranging from 1 to 10.
And I have a spend column.
I would like to group it by area name, and add a new column with an average spend per name (or even in the old spend column).
However:
I only want the average of the types from 1-7. So I want to exclude any types 8, 9 or 10 that are in that are in the area.
Except, if an area contains only types 8, 9 or 10. In that case, I want the average of that spend.
What I've played with, but haven't managed to actually do it:
Approach 1:
Create 2 datasets, one with types1-7, another where there's only types 8, 9 or 10 in an area:
main=['1.','2.', '3.','4.', '5.', '6.', '7.']
eight_to_ten=['8.', '9.', '10.']
df_main = df[df['Type'].isin(main)]
df_main['avg_sales'] = df_main.groupby(['Area Name'])['Sales'].mean()
Approach 2:
df_new['avg_sales'] = df[df['Type'].isin(main)].groupby('Area Name')['Sales'].mean()
I assume there is a really short way of doing this, most likely without having to split the dataset into 2 and then concat it back.
Is it easier to do it with a for loop?
Any help would be appreciated
I believe you need filter first rows by lists and if need new column per groups use GroupBy.transform:
m1 = df['Type'].isin(main)
m2 = df['Type'].isin(eight_to_ten)
df = df_main[m1 | m2].copy()
df['avg_sales'] = df.groupby(['Area Name', m1])['Sales'].transform('mean')
Or for new DataFrame with aggregation add new array for distinguish groups:
arr = np.where(m1, 'first','second')
df1 = df.groupby(['Area Name', arr])['Sales'].mean().reset_index()
Related
I am attempting to use pandas to create a new df based on a set of conditions that compares the rows from one another within the original df. I am new to using pandas and feel comfortable comparing two df from one another and basic column comparisons, but for some reason the row by row comparison is stumping me. My specific conditions and problem are found below:
Cosine_i_ start_time fid_ Shape_Area
0 0.820108 2022-08-31T10:48:34Z emit20220831t104834_o24307_s000 0.067763
1 0.962301 2022-08-27T12:25:06Z emit20220827t122506_o23908_s000 0.067763
2 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.404882
3 0.788322 2023-01-29T13:23:39Z emit20230129t132339_o02909_s000 0.404882
4 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.108256
^^Above is my original df that I will be working with.
Goal: I am hoping to create a new df that contains only the FIDs that meet the following conditions: If the shape area is equal, the cosi values have a difference greater than 0.1, and the start time has a difference greater than 5 days. This is going to be applied to a large dataset, the df displayed is just a small sample one I made to help write the code.
For example: Rows 2 & 3 have the same shape area, so then looking at the cosi values, they have a difference in values greater than 0.1, and lastly they have a difference in their start times that is greater than 5 days. They meet all set conditions, so I would then like to take the FID values for BOTH of these rows and append it to a new df.
So essentially I want to compare every row with the other rows and that's where I am having trouble.
I am looking for as much guidance as possible on how to set this up as I am very very new to coding and am hoping to get a tutorial of some sort!
Thanks in advance.
Group by Shape_Area and filter each pair (single items on Shape_Area are omitted) by required conditions:
fids = df.groupby('Shape_Area').filter(lambda x: x.index.size > 1
and x['Cosine_i_'].diff().values[-1] >= 0.1
and x['start_time'].diff().abs().dt.days.values[-1] > 5,
dropna=True)['fid_'].tolist()
print(fids)
Python beginner here so I reckon I'm massively overcomplicating this in some way.
I have a dataframe with about 20 columns, but I've only shown a small subset for simplicity. I want to get totals for red blue and none as percentages of the total for that month. So I thought it might be easiest to take a subset of these three columns then add the result back to the rest of the data:
data = [['2022-08', 10,'red',0,0], ['2022-04', 15,'blue',1,0], ['2022-08', 14,'none',1,1],['2022-04', 14,'blue',0,0],['2022-03', 14,'none',1,0]]
df = pd.DataFrame(data, columns=['Month', 'Balance','Type','Flag_1','Flag_2'])
df2=df[['Month','Type','Balance']].groupby(['Month','Type']).sum('Balance').unstack().fillna(0)
df2['balance_all_categories']= df2.sum(axis=1)
Now I want to add this back to my full dataframe and make my balances for red, blue, none into percentages of the total for that month. I have many more than just 2 flags and I will need to make subsets based on all flags being zero, all flags being 1 and so on. So if I groupby month and type here, to access any one column they start to have incredibly long names I think, so I don't want to do that if avoidable.
Is there an easy way to deal with this? Thanks for any suggestions! :)
I am trying to decile a sales column in my dataframe but also partition by year. So each year should have different deciles.
df = ['year','name', 'sales']
I think I can use this function but want to partition by year
df['decile']=pd.qcut(df['sales'],10,labels=False)
I suppose I can use groupby but I am not able to figure out the syntax.
Would really appreciate any help?
You can try:
df['decile'] = df.groupby('year')[['sales'']].apply(lambda g: pd.qcut(g.rank(method='first'), 10, labels=False)+1)
Explain:
g.rank(method='first'): in case there are many sales with same values. I add this one because in my experiment, I encountered many case where you have same values. If there is low chance of duplicated values, then you can replace by g which is fine.
10, labels=False)+1): you can leave option +1 if you want to label from 1 to 10. If not, it will label from 0 to 9
Problem
Hey there! I'm having some trouble trying to split one column of my dataframe in two (or even more) new columns. I think this depends on the fact that the dataframe I'm working with comes from a really big csv file, almost 10gb worth of space. Once it is loaded into a Pandas dataframe, this is represented by ~60mil of rows and 5 cols.
Example
Initially, the dataframes looks something like this:
In [1]: df
Out[1]:
category other_col
0 animal.cat 5
1 animal.dog 3
2 clothes.shirt.sports 6
3 shoes.laces 1
4 None 0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column in three new columns based on where the dot appears: one for the main category, one for the first subcategory and another one for the last subcategory (if that actually exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
other_col main_cat sub_category_1 sub_category_2
0 5 animal cat None
1 3 animal dog None
2 6 clothes shirt sports
3 1 shoes laces None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine with small datasets, but it sucks up too much memory when this is applied on a dataframe that big and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how should I proceed after that; I've obviously tried searching this kind of problem here on stack overflow, but I didn't manage to find anything useful.
Any help is appreciated!
split and join methods do the job:
results = df['category'].str.split(".", expand = True))
df_after = df.join(results)
after doing that you can freely filter resulting dataframe
I'm a newbie to Python, and have done my absolute best to exhaust all resources before posting here looking for assistance. I have spent all weekend and all day today trying to come up with what I feel ought a straightforward scenario to code for using two dataframes, but, for the life of me, I am spinning my wheels and not making any significant progress.
The situation is there is one dataframe with Sales Data:
CUSTOMER ORDER SALES_DATE SALES_ITEM_NUMBER UNIT_PRICE SALES_QTY
001871 225404 01/31/2018 03266465555 1 200
001871 225643 02/02/2018 03266465555 2 600
001871 225655 02/02/2018 03266465555 3 1000
001956 228901 05/29/2018 03266461234 2.2658 20
and a second dataframe with Purchasing Data:
PO_DATE PO_ITEM_NUMBER PO_QTY PO_PRICE
01/15/2017 03266465555 1000 1.55
01/25/2017 03266465555 500 5.55
02/01/2017 03266461234 700 4.44
02/01/2017 03266461234 700 2.22
All I'm trying to do is to figure out what the maximum PO_PRICE could be for each of the lines on the Sales Order dataframe, because I'm trying to maximize the difference between what I bought it for, and what I sold it for.
When I first looked at this, I figured a straightforward nested for loop would do the trick, and increment the counters. The issue though is that I'm not well versed enough in dataframes, and so I keep getting hung up trying to access the elements within them. The thing to keep in mind as well is that I've sold 1800 of the first item, but, only bought 1500 of them. So, as I iterate through this:
For the first Sales Order row, I sold 200. The Max_PO_PRICE = $5.55 [for 500 of them]. So, I need to deduct 200 from the PO_QTY dataframe, because I've now accounted for them.
For the second Sales Order row, I sold 600. There are still 300 I can claim that I bought for $5.55, but, then I've exhausted all of those 500, and so the best I can no do is dip into the other row, which has the Max_PO_PRICE = $1.55 (for 1,000 of them). So for this one, I'd be able to claim 300 at $5.55, and the other $300 at $1.55. I can't claim for more than I bought.
Here's the code I've come up with, and, I think I may have gone about this all wrong, but, some guidance and advice would be beyond incredibly appreciated and helpful.
I'm not asking anyone to write my code for me, but, simply to advise what approach you would have taken, and if there is a better way. I figure there has to be....
Thanks in advance for your feedback and assistance.
-Clare
for index1,row1 in sales.iterrows():
SalesQty = sales.loc[index1]["SALES_QTY"]
for index2,row2 in purchases.iterrows():
if (row1['SALES_ITEM_NUMBER']==row2['PO_ITEM_NUMBER']) and (row2['PO_QTY']>0):
# Find the Maximum PO Price in the result set
max_PO_Price = abc["PO_PRICE"].max()
xyz = purchases.loc[index2]
abc = abc.append(xyz)
if(SalesQty <= Purchase_Qty):
print("Before decrement, PO_QTY = ",??????? *<==== this is where I'm struggle busing****)
print()
+index2
#Drop the data from the xyz DataFrame
xyz=xyz.iloc[0:0]
#Drop the data from the abc DataFrame
abc=abc.iloc[0:0]
+index1
This looks like something SQL would elegantly handle through analytical functions. Fortunately Pandas comes with most (but not all) of this functionality and it's a lot faster than doing nested iterrows. I'm not a Pandas expert by any means but I'll give it a whizz. Apologies if I've misinterpreted the question.
Makes sense to group the SALES_QTY, we'll use this to track how much QTY we have:
sales_grouped = sales.groupby(["SALES_ITEM_NUMBER"], as_index = False).agg({"SALES_QTY":"sum"})
Let's group the table into one so we can iterate over one table instead of two. We can use a JOIN action on the common column "PO_ITEM_NUMBER" and "SALES_ITEM_NUMBER", or what Pandas likes to call it "merge". While we're at it let's sort the table categorised by "PO_ITEM_NUMBER" and with the most expensive "PO_PRICE" on the top, this is and the next code block is the equivalent of a FN OVER PARTITION BY ORDER BY SQL analytical function.
sorted_table = purchases.merge(sales_grouped,
how = "left",
left_on = "PO_ITEM_NUMBER",
right_on = "SALES_ITEM_NUMBER").sort_values(by = ["PO_ITEM_NUMBER", "PO_PRICE"],
ascending = False)
Let's create a column CUM_PO_QTY with the cumulative sum of the PO_QTY (partitioned/grouped by PO_ITEM_NUMBER). We'll use this to mark when we go over the max SALES_QTY.
sorted_table["CUM_PO_QTY"] = sorted_table.groupby(["PO_ITEM_NUMBER"], as_index = False)["PO_QTY"].cumsum()
This is where the custom part comes in, we can integrate custom functions to apply row-by-row (or by column even) along the dataframe using apply(). We're creating two columns TRACKED_QTY which is simply the SALES_QTY minus CUM_PO_QTY so we know when we have run into the negative, and PRICE_SUM which will eventually be the maximum value gained or spent. But for now: If the TRACKED_QTY is less than 0 we multiply by the PO_QTY else the SALES_QTY for conservation purposes.
sorted_table[["TRACKED_QTY", "PRICE_SUM"]] = sorted_table.apply(lambda x: pd.Series([x["SALES_QTY"] - x["CUM_PO_QTY"],
x["PO_QTY"] * x["PO_PRICE"]
if x["SALES_QTY"] - x["CUM_PO_QTY"] >= 0
else x["SALES_QTY"] * x["PO_PRICE"]]), axis = 1)
To handle the trailing TRACKED_QTY negatives, we can filter the positive using a conditional mask, and groupby the negative revealing only the maximum PRICE_SUM value.
Then simply append these two tables and sum them.
evaluated_table = sorted_table[sorted_table["TRACKED_QTY"] >= 0]
evaluated_table = evaluated_table.append(sorted_table[sorted_table["TRACKED_QTY"] < 0].groupby(["PO_ITEM_NUMBER"], as_index = False).max())
evaluated_table = evaluated_table.groupby(["PO_ITEM_NUMBER"], as_index = False).agg({"PRICE_SUM":"sum"})
Hope this works for you.