Using pandas to sum columns based on a criteria - python

I am trying to use pandas to group sales information based on a category and a criterion.
For example, in "Table 1" below, I want sales totals for each category, excluding rows with "Not Stated" in the Reg/Org column. My ideal output is "Table 2" below. My actual data set has 184 columns, and I am trying to capture the sales volume by category across all values excluding those that are "Not Stated".
Thank you for any help or direction that you can provide.
TABLE 1
Category  Reg/Org     Sales
Apple     Regular     10
Apple     Organic     5
Apple     Not Stated  5
Banana    Regular     15
Banana    Organic     5
TABLE 2
Category  Reg/Org
Apple     15
Banana    20
The first part was to summarize the values by column for the entire data set. I used the code below to gather that info for each of the 184 columns. Now I want to create a further summary with those column totals split by the 89 categories I have. Ideally, I am trying to create a cross tab where the categories are listed down the rows and each of the 184 columns contains the sales. (E.g. the column "Reg/Org" would no longer show "Organic" or "Regular"; it would just show the sales volume for all values that are not "Not Stated".)
att_list = att.columns.tolist()
ex_list = ['NOT STATED', 'NOT COLLECTED']
sales_list = []
for att_col in att_list:
    # Total sales ('$') over rows whose value in this column is not excluded.
    sales_list.append(att[~att[att_col].isin(ex_list)]['$'].sum())

Try
df[df["Reg/Org"] != "Not Stated"].groupby("Category").sum()
Or, dropping the unwanted rows via the index before grouping:
df.set_index("Reg/Org").drop(index="Not Stated").groupby("Category").sum()

Try using YourDataframe.loc[] with a filter inside:
import pandas as pd

data = pd.read_excel('Test_excel.xlsx')
# Keep only the rows whose Reg/Org value is not "Not Stated"
sales_volume = data.loc[data["Reg/Org"] != "Not Stated"]
print(sales_volume)
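The .loc filter above only drops the "Not Stated" rows; to reach the totals in TABLE 2 you still need to aggregate, for example (a sketch, using the column names from the question):
totals = sales_volume.groupby("Category")["Sales"].sum().reset_index()
print(totals)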

Related

Python and pandas, groupby only column in DataFrame

I would like to group some strings in the column called 'tipology' and plot them in a plotly bar chart. The problem is that from the new table created with groupby I can't extract the x and y values to pass to the graph:
tipol1 = df.groupby(['tipology']).nunique()
tipol1
The output gives me tipology as the index, with the grouping based on how many times each value repeats:
          number  data
tipology
one            2   113
two           33    33
three         12    88
four          44   888
five          11    66
In the number column (in which I have other values) it gives me the correct grouping of the tipology column. The date column also gives me values (I think it is grouping the dates, but not in the correct format).
I also found:
tipol=df.groupby(['tipology']).nunique()
tipol2 = tipol[['number']]
tipol2
to take only the number column, but no luck: I would need the tipology column (not as the index) and the column with the tipology grouping counts, so I can get the x and y axes to import into plotly!
One last try I made (making a big mess):
import plotly.graph_objects as go

tipol = df.groupby(['tipology'], as_index=False).nunique()
tipol2 = tipol[['number']]
fig = go.Figure(data=[
    go.Bar(name='test', x=df['tipology'], y=tipol2)
])
fig.update_layout(barmode='stack')
fig.show()
Any suggestions? Thanks!
UPDATE
I have too much code to give a full example; it would be difficult for me and would waste your time too. Basically, I need a groupby with an added column that shows the grouping count, e.g.:
tipology  Date
home      10/01/18
home      11/01/18
garden    12/01/18
garden    12/01/18
garden    13/01/18
bathroom  13/01/18
bedroom   14/01/18
bedroom   15/01/18
kitchen   16/01/18
kitchen   16/01/18
kitchen   17/01/18
I would like this to happen, first by dropping the date column and inserting a value column holding the count:
tipology  value
home      2
garden    3
bathroom  1
bedroom   2
kitchen   3
Then (I'm working in a Jupyter notebook), keeping the date column and adding the corresponding count to the value column based on each row's grouping:
tipology  Date      value
home      10/01/18  1
home      11/01/18  1
garden    12/01/18  2
garden    12/01/18  2
garden    13/01/18  1
bathroom  13/01/18  1
bedroom   14/01/18  1
bedroom   15/01/18  1
kitchen   16/01/18  2
kitchen   16/01/18  2
kitchen   17/01/18  1
I need the columns available to assign to the x and y axes for a graph, so none of the columns should be the index.
By default, groupby returns a DataFrame in which the fields you group on become the index. You can change this behaviour by passing as_index=False to the groupby; then tipology remains an ordinary column in the returned DataFrame:
tipol1 = df.groupby('tipology', as_index=False).nunique()
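Putting it together with the plotly part of the question, a minimal sketch (using the sample 'tipology'/'Date' data from the update; note that the desired totals count rows per tipology, so 'count' rather than 'nunique' reproduces them):
import pandas as pd
import plotly.graph_objects as go

df = pd.DataFrame({
    'tipology': ['home', 'home', 'garden', 'garden', 'garden'],
    'Date': ['10/01/18', '11/01/18', '12/01/18', '12/01/18', '13/01/18'],
})

# Per-tipology counts, with 'tipology' kept as a regular column.
counts = (df.groupby('tipology', as_index=False)['Date']
            .count()
            .rename(columns={'Date': 'value'}))

fig = go.Figure(data=[go.Bar(name='test', x=counts['tipology'], y=counts['value'])])
fig.show()

# For the second table in the update: broadcast each (tipology, Date)
# group's count back onto its rows.
df['value'] = df.groupby(['tipology', 'Date'])['Date'].transform('count')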

Pandas to create a column with specific value based on the input in another column

In my csv file I have a column 'category', and I need to set a vertical for each category, saving the value in a new additional column. I know how to read a csv and save the DataFrame to a new file, including creating the new column, in pandas. However, I need some help with the logic for my scenario.
my.csv:
id category
1 auto,auto.car_dealers
2 hotelstravel,hotelstravel.hotels
3 shopping,shopping.homeandgarden,shopping.homeandgarden.appliances
4 financialservices,financialservices.insurance
5
6 realestate
7 pets,pets.petservices,pets.petservices.petinsurance
8 homeservices,homeservices.windowsinstallation
9 professional
Rules that I need to apply:
1. If the category column has no value, set the vertical column to Other.
2. If the category column has a value and it is a single word, set the vertical based on that value: if auto, set to Automotive; if hotelstravel, set to Travel; etc.
3. If the value has more than one word, take the word before the first comma and set the vertical based on it, in the same way.
Expected output.csv:
id category vertical
1 auto,auto.car_dealers Automotive
2 hotelstravel,hotelstravel.hotels Travel
3 shopping,shopping.homeandgarden,shopping.homeandgarden.appliances Retail
4 financialservices,financialservices.insurance Financial
5 Other
6 realestate Real Estate
7 pets,pets.petservices,pets.petservices.petinsurance Pet Services
8 homeservices,homeservices.windowsinstallation Home Services
9 professional Professional Services
my code so far:
import pandas as pd
df = pd.read_csv('path/to/my.csv')
#do something here and then something like
df.loc[df['category'] == 'auto', 'vertical'] = 'Automotive'
df.to_csv('path/to/output.csv', index=False)
Any help with this will be much appreciated. Thank you in advance!
You will likely need to iterate over the category column and perform checks on each value. You can use something along the following lines:
# Dictionary mapping the leading category word to its vertical (extend as needed).
verticals = {'auto': 'Automotive', 'hotelstravel': 'Travel'}

for index, row in df.iterrows():
    if pd.notna(row['category']):
        first_word = row['category'].split(',')[0]
        df.loc[index, 'vertical'] = verticals.get(first_word, 'Other')
    else:
        df.loc[index, 'vertical'] = 'Other'
And because you want to convert values, e.g. 'hotelstravel' to 'Travel', a dictionary keyed by the category name with the vertical name as the value (as above) lets you do that conversion quickly.
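A vectorized alternative, as a sketch; the vertical names below are taken from the expected output in the question:
import pandas as pd

verticals = {
    'auto': 'Automotive',
    'hotelstravel': 'Travel',
    'shopping': 'Retail',
    'financialservices': 'Financial',
    'realestate': 'Real Estate',
    'pets': 'Pet Services',
    'homeservices': 'Home Services',
    'professional': 'Professional Services',
}

df = pd.read_csv('path/to/my.csv')
# Word before the first comma; NaN categories stay NaN and fall through to 'Other'.
first_word = df['category'].str.split(',').str[0]
df['vertical'] = first_word.map(verticals).fillna('Other')
df.to_csv('path/to/output.csv', index=False)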

Pandas - Find String and return adjacent values for the matched data

I'm struggling to write a piece of code to solve the problem below.
I have two Excel spreadsheets. As an example:
DF1 - Master Data
DF2 - Consumer Details
I need to iterate over the Description column in Consumer Details, which contains strings or substrings that appear in the Master Data sheet, and return the adjacent value. I understand it's pretty straightforward and simple, but I have been unable to succeed.
I was using INDEX/MATCH in Excel:
=INDEX('Path\[Master Sheet.xlsx]Master List'!$B$2:$B$199, MATCH(TRUE, ISNUMBER(SEARCH('Path\[Master Sheet.xlsx]Master List'!$A$2:$A$199, B3)), 0))
But I need a solution in Python/pandas.
E.g. DF1 - Master Sheet:
Store        Category
Nike         Shoes
GAP          Clothing
Addidas      Shoes
Apple        Electronics
Abercrombie  Clothing
Hollister    Clothing
Samsung      Electronics
Netflix      Movies
etc...
DF2 - Consumer Sheet:
Date      Description    Amount  Category
01/01/20  GAP Stores     1.1
01/01/20  Apple Limited  1000
01/01/20  Aber fajdfal   50
01/01/20  hollister das  20
01/01/20  NETFLIX.COM    10
01/01/20  GAP Kids       5.6
Now I need to update the Category column in the consumer sheet based on the Description column (string/substring), referring to the Store column in the master sheet.
Any inputs/suggestions are highly appreciated.
One option is to write a custom function that loops through the df1 values to match a store against a string passed as an argument. If a match is found, it returns the associated category string; if none is found, it returns None (or some other default). You can use str.lower to increase the chances of a match being found. Then use pandas.Series.apply to run this function over the column you want to match against.
import pandas as pd

df1 = pd.DataFrame(dict(
    Store=['Nike', 'GAP', 'Addidas', 'Apple', 'Abercrombie'],
    Category=['Shoes', 'Clothing', 'Shoes', 'Electronics', 'Clothing'],
))
df2 = pd.DataFrame(dict(
    Date=['01/01/20', '01/01/20', '01/01/20'],
    Description=['GAP Stores', 'Apple Limited', 'Aber fajdfal'],
    Amount=[1.1, 1000, 50],
))

def get_cat(x):
    # Return the category of the first store whose name appears in x.
    for store, cat in df1[['Store', 'Category']].values:
        if store.lower() in x.lower():
            return cat

df2['Category'] = df2['Description'].apply(get_cat)
print(df2)
Output:
       Date    Description  Amount     Category
0  01/01/20     GAP Stores     1.1     Clothing
1  01/01/20  Apple Limited  1000.0  Electronics
2  01/01/20   Aber fajdfal    50.0         None
I should note that if 'Aber fajdfal' is supposed to match 'Abercrombie', this solution will not work; you would need more complex logic in the function to match partial strings like that.
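One hedged sketch of such partial matching: treat a store as matched when any word of the description is a prefix of the store name (an assumption about how the descriptions abbreviate store names; tune the minimum word length to taste).
def get_cat_partial(x):
    # Match when a description word (3+ chars) is a prefix of the store name.
    for store, cat in df1[['Store', 'Category']].values:
        for word in x.lower().split():
            if len(word) >= 3 and store.lower().startswith(word):
                return cat

df2['Category'] = df2['Description'].apply(get_cat_partial)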

Return a python DataFrame with full rows containing the highest sales for each company

Suppose I have the following data table:
import pandas as pd
data = {'Company': ['ELCO','ELCO','ELCO','BOBCO','BOBCO','BOBCO','LAMECO','LAMECO','LAMECO'],
        'Person': ['Sam','Mikey','Amy','Vanessa','Carl','Sarah','Emily','Laura','Steve'],
        'Sales': [220,123,312,125,263,321,243,275,198]}
df = pd.DataFrame(data)
df
How would I go about extracting the data so I end up with a table showing just the highest 'Sales' for each company, while keeping the full rows for those highest sales figures? In other words, how would I get the smaller DataFrame shown at the bottom of the attached image using conditional logic?
[Image: DataFrame outputs]
You want groupby().idxmax() and loc:
df.loc[df.groupby('Company').Sales.idxmax()]
Output:
  Company  Person  Sales
5   BOBCO   Sarah    321
2    ELCO     Amy    312
7  LAMECO   Laura    275
Note: the above gives you only one salesperson per company. If you want all salespeople with the max sale in each company, you need transform:
df[df['Sales'] == df.groupby('Company').Sales.transform('max')]
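For intuition: transform('max') broadcasts each company's maximum back onto every row, so the comparison produces a boolean mask the same length as df:
df.groupby('Company').Sales.transform('max')
# 0    312
# 1    312
# 2    312
# 3    321
# 4    321
# 5    321
# 6    275
# 7    275
# 8    275
# Name: Sales, dtype: int64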

Data too big in pandas & seaborn. How to create an "other" column?

I have a series of names, each related to an ID.
In pandas I combined these names so each ID would have a single combination rather than many individual names.
Then I created a count of how many times each combination appears.
For example, I wanted people who ate apples and oranges:
Combination      Count
Apples, Oranges  2
Apples           1
Oranges          1
However, my actual data set is far too large and has many combinations with a count of 1. I am trying to merge these into an "Other" group to display in a seaborn bar chart; as it stands, all the names overlap because of the sheer volume of data. I want to merge roughly the last 500 rows of my data set into "Other" (as the combination name), with its count being the sum of all those counts.
In this example it would be like this:
Combination      Count
Apples, Oranges  2
Other            2
I have tried using groupby, but lacking experience in pandas, I am unsure how to write this syntactically. Any help would be appreciated.
Assuming you have done import numpy as np, you can use np.where() to generate a new column which uses 'Other' if the Count is 1, or the existing Combination otherwise. Then we can groupby and sum to find totals on 'New Combination'. Assuming your frame is called df:
df['New Combination'] = np.where(df['Count'] == 1, 'Other', df['Combination'])
totals = df.groupby('New Combination').agg({'Count': 'sum'})
This gives you:
                 Count
New Combination
Apples, Oranges      2
Other                2
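If you specifically want to fold the tail of the table (e.g. the last 500 rows mentioned in the question) into 'Other' rather than keying on Count == 1, here is a sketch, assuming the frame is already sorted by Count descending:
keep = len(df) - 500  # number of head rows to keep; 500 is the tail size from the question
df['New Combination'] = df['Combination']
df.iloc[keep:, df.columns.get_loc('New Combination')] = 'Other'
totals = df.groupby('New Combination', as_index=False)['Count'].sum()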
