How to take only the first 30 rows in plotly express python

I have an Excel file with columns named 'Product' and 'Quantity'. The Product column contains over 100 different items (clothes, shoes, caps, hats, etc.), while the Quantity column shows how many of each product were sold.
Product   Quantity
Shirt A   2
Shirt A   5
Shirt C   1
Shirt A   9
Shoes B   3
I want to group all the different items and sum their total quantity, but only for the 25 most-sold products. In pandas it would be like this:
df = pd.read_csv(r'directory\Sales.csv')
df_products = (df[['Product', 'Quantity']]
               .groupby('Product').sum()
               .sort_values(by='Quantity', ascending=False)
               .head(25))
But how can I do the exact same thing in a histogram made with plotly.express? I tried this:
fig_product = px.histogram(data_frame=df_products, x='Product', y='Quantity')
This shows all 100+ product names and their quantities sold, but I only want the top 25 of those to show up. How can I do that?

It's all in the dataframe preparation:
groupby().sum() to get the totals required
sort_values().head() for the number of items you want to plot; I've picked the top 10 in this example
with the data already aggregated, there is no practical difference between a histogram and a bar chart
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

# sample data: 200 random sales spread across single-letter product names
df = pd.DataFrame({"product": np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), 200),
                   "quantity": np.random.uniform(3, 5, 200)})
# total per product, then keep the 10 best sellers
df = df.groupby("product", as_index=False).sum().sort_values("quantity", ascending=False).head(10)
go.Figure(go.Bar(x=df["product"], y=df["quantity"]))
px.histogram(data_frame=df, x='product', y='quantity')
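Applied to the question's own file and columns ('directory\Sales.csv', 'Product' and 'Quantity', all taken from the post above), a rough sketch would be:
import pandas as pd
import plotly.express as px

df = pd.read_csv(r'directory\Sales.csv')               # path as given in the question
top25 = (df.groupby('Product', as_index=False)['Quantity'].sum()
           .sort_values('Quantity', ascending=False)
           .head(25))                                   # keep the 25 best sellers
px.histogram(data_frame=top25, x='Product', y='Quantity')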

Related

efficiently looping through pandas dataframe columns to make new dataframe

For each question I want to randomly select 200 titles (the correct title shall appear only once in the window) and create a new dataframe. Here I am using a list and a for loop to do that, which is taking a great deal of time since I have around 80k questions. All 80k questions are unique, while only around 8k titles are unique.
I have this following code
import random
import pandas as pd

questions = new_df['question_string'].tolist()
titles = new_df['titles'].tolist()
indexs = new_df['image_index'].tolist()
# print(len(titles))
# titles.remove(titles[0])
full_list = []
for x in range(len(questions)):
    # the matching title, labelled 1
    full_list.append([questions[x], titles[x], indexs[x], 1])
    # 199 other titles, labelled 0
    t = new_df.titles.unique().tolist()
    if t.count(titles[x]) > 0:
        t.remove(titles[x])
    for y in random.choices(t, k=199):
        full_list.append([questions[x], y, indexs[x], 0])
len(full_list)
full_list_df = pd.DataFrame(full_list)
full_list_df.columns = ['questions', 'titles', 'image_index', 'is_similar']
I need help to do this more efficiently, maybe by using the dataframe directly.
This is how my dataframe looks:
question_string titles image_index is_similar
0 In how many countries, is the net taxes in con... Net taxes on products in different countries i... 33715 1
1 In how many countries, is the gross enrolment ... Total enrollments of female students in school... 68226 1
2 In how many years, is the percentage of popula... Percentage of the population living below the ... 152731 1
3 What is the ratio of the enrollment rate in pr... Net enrolment rate in primary and secondary ed... 27823 1
4 In how many countries, is the contraceptive pr... Percentage of women of different countries who... 72232 1
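As a rough sketch of one way to cut down the Python-level looping (assuming the column names shown above, and that sampling negatives with replacement, as random.choices does, is acceptable), the unique titles can be sampled for all questions in one NumPy call:
import numpy as np
import pandas as pd

# Hedged sketch, not from the original post: positives straight from the frame,
# negatives sampled in bulk with NumPy instead of a per-question Python loop.
unique_titles = new_df['titles'].unique()

pos = new_df[['question_string', 'titles', 'image_index']].copy()
pos['is_similar'] = 1

neg = new_df.loc[new_df.index.repeat(199),
                 ['question_string', 'titles', 'image_index']].copy()
neg['titles'] = np.random.choice(unique_titles, size=len(neg))
correct = new_df['titles'].values.repeat(199)           # true title for each sampled row
neg = neg[neg['titles'].values != correct]              # drop accidental matches with the true title
neg['is_similar'] = 0

full_list_df = (pd.concat([pos, neg], ignore_index=True)
                  .rename(columns={'question_string': 'questions'}))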

Using pandas to sum columns based on a criteria

I am trying to use pandas to group sales information based on a category and a criterion.
For example in "Table 1" below, I want sales totals for each category excluding those with a "Not Stated" in the Reg/Org column. My ideal output would be in "Table 2" below. My actual data set has 184 columns, and I am trying to capture the sales volume by category across any values excluding those that are "Not Stated".
Thank you for any help or direction that you can provide.
TABLE 1

Category   Reg/Org      Sales
Apple      Regular      10
Apple      Organic      5
Apple      Not Stated   5
Banana     Regular      15
Banana     Organic      5

TABLE 2

Category   Reg/Org
Apple      15
Banana     20
The first part was to summarize the values by column for the entire data set. I utilized the code below to gather that info for each of the 184 columns. Now I want to create a further summary where I create those column totals again, but split by the 89 categories I have. Ideally, I am trying to create a cross tab, where the categories are listed down the rows, and each of the 184 columns contains the sales. (e.g. the column "Reg/Org" would no longer show "Organic" or "Regular", it would just show the sales volume for all values that are not "Not Stated".)
att_list = att.columns.tolist()
ex_list = ['NOT STATED', 'NOT COLLECTED']
sales_list = []
for att_col in att_list:
    sales_list.append(att[~att[att_col].isin(ex_list)]['$'].sum())
Try
df[df["Reg/Org"] != "Not Stated"].groupby("Category").sum()
Or
df.set_index("Reg/Org").drop(index="Not Stated").groupby("Category").sum()
try using "YourDataframe.loc[]" with a filter inside
import pandas as pd
data = pd.read_excel('Test_excel.xlsx')
sales_volum = data.loc[data["Reg/Org"] != "Not Stated"]
print(str(sales_volum))
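To extend this to the cross tab described in the question (categories down the rows, each of the 184 attribute columns holding the sales total for values that are not excluded), a rough sketch reusing the att, '$' and ex_list names from the snippet above, and assuming a 'Category' column, could be:
import pandas as pd

ex_list = ['NOT STATED', 'NOT COLLECTED']
summary = pd.DataFrame(index=att['Category'].unique())         # one row per category

for att_col in att.columns.drop(['Category', '$']):
    kept = att[~att[att_col].isin(ex_list)]                    # drop excluded codes for this column
    summary[att_col] = kept.groupby('Category')['$'].sum()     # sales per category for this column

print(summary.head())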

Python and pandas, groupby only column in DataFrame

I would like to group some strings in the column called 'type' and plot them in a plotly bar chart. The problem is that from the new table created with groupby I can't extract the x and y to use in the graph:
tipol1 = df.groupby(['tipology']).nunique()
tipol1
The output gives me tipology as the index and the grouping based on how many times the values repeat:
          number  data
typology
one            2   113
two           33    33
three         12    88
four          44   888
five          11    66
In the number column (in which I have other values) it gives me the correct grouping of the tipology column.
Also in the date column it gives me values (I think it groups the dates, but not the dates in the correct format).
I also found:
tipol=df.groupby(['tipology']).nunique()
tipol2 = tipol[['number']]
tipol2
to take only the number column,
but no luck: I would need the tipology column (not as the index) and the column with the grouping counts, to get the x and y axes to import into plotly!
One last try I made (making a big mess):
tipol=df.groupby(['tipology'],as_index=False).nunique()
tipol2 = tipol[['number']]
fig = go.Figure(data=[
    go.Bar(name='test', x=df['tipology'], y=tipol2)
])
fig.update_layout(barmode='stack')
fig.show()
Any suggestions?
Thanks!
UPDATE
I would have too much code to give a full example; it would be difficult for me and it would waste your time too. Basically I would need a groupby with the addition of a column that shows the grouping value, e.g.:
tipology   Date
home       10/01/18
home       11/01/18
garden     12/01/18
garden     12/01/18
garden     13/01/18
bathroom   13/01/18
bedroom    14/01/18
bedroom    15/01/18
kitchen    16/01/18
kitchen    16/01/18
kitchen    17/01/18
I wish this would happen: deleting the Date column and inserting a value column in the DataFrame that holds the count:
tipology   value
home       2
garden     3
bathroom   1
bedroom    2
kitchen    3
Then (I'm working in a Jupyter notebook), leaving the Date column and adding the corresponding values to the value column based on their grouping:
tipology   Date       value
home       10/01/18   1
home       11/01/18   1
garden     12/01/18   2
garden     12/01/18
garden     13/01/18   1
bathroom   13/01/18   1
bedroom    14/01/18   1
bedroom    15/01/18   1
kitchen    16/01/18   2
kitchen    16/01/18
kitchen    17/01/18   1
I would need these as columns so I can assign them to the x and y axes of a graph, so none of them should be the index.
By default the method groupby will return a dataframe where the fields you are grouping on will be in the index of the dataframe. You can adjust this behaviour by setting as_index=False in the group by. Then tipology will still be a column in the dataframe that is returned:
tipol1 = df.groupby('tipology', as_index=False).nunique()
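A minimal sketch of where that leads for the plotting part; the data below is made up just to mirror the tipology/date columns described in the question:
import pandas as pd
import plotly.express as px

# made-up data shaped like the question's frame
df = pd.DataFrame({
    "tipology": ["home", "home", "garden", "garden", "garden", "kitchen"],
    "data": ["10/01/18", "11/01/18", "12/01/18", "12/01/18", "13/01/18", "16/01/18"],
})

tipol1 = df.groupby("tipology", as_index=False).nunique()   # 'tipology' stays a regular column
fig = px.bar(tipol1, x="tipology", y="data")                # x and y can be taken directly from columns
fig.show()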

plotting stacked bar graph on column values

I have a Pandas data frame that looks like this:
ID   Management   Administrative
1    1            2
3    2            1
4    3            3
10   1            3
Essentially the 1-3 is a grade of low, medium, or high. I want a stacked bar chart that has Management and Administrative on the x-axis and the stacked composition of 1, 2, 3 of each column in percentages.
E.g. if there were only 4 entries as above, 1 would compose 50% of the height, 2 would compose 25% and 3 would compose 25% of the height of the Management bar. The y-axis would go up to 100%.
Hope this makes sense. Hard to explain but if unclear willing to clarify further!
You will need to chain several operations: first melt your dataset to turn the department into a variable; after that you can group by Department and Rating to count the number of IDs that fall into each bucket; then group by Department again to calculate the percentages. Lastly you can plot your stacked bar graph:
(df4.melt()
    .rename(columns={'variable': 'Dept', 'value': 'Rating'})
    .query('Dept != "ID"')
    .groupby(['Dept', 'Rating']).size()
    .rename('Count')
    .groupby(level=0).apply(lambda x: x / sum(x))
    .unstack()
    .plot(kind='bar', stacked=True))
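For reference, the chain above can be tried on the question's sample frame rebuilt like this:
import pandas as pd

# the example frame from the question
df4 = pd.DataFrame({
    "ID": [1, 3, 4, 10],
    "Management": [1, 2, 3, 1],
    "Administrative": [2, 1, 3, 3],
})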

add/merge items to pandas data frame

I would like to use pandas for storing some data in a structure like this:
Quantity   Date   Item   Color
15         0312   car    green
10         0312   car    red
3          0512   car    red
and be able to add an item to the structure:
3 0312 car green
and get the structure updated as a result:
Quantity   Date   Item   Color
18         0312   car    green
10         0312   car    red
3          0512   car    red
Another example add:
-3 0512 car red
Result:
Quantity   Date   Item   Color
18         0312   car    green
10         0312   car    red
If the last 3 columns have the same values, the Quantity column is updated with a new value.
What is the closest data structure and function in pandas which supports that?
When adding DataFrames with +, pandas first aligns the indexes and then adds the values, so what you describe isn't really addition.
Instead you want use the index labels to select / drop.
First one:
df.loc[3] = ['0312', 'car', 'green'] # or however you have that value stored.
Second one:
df = df.drop(3)
I'd encourage you to read the docs on indexing and selecting data
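If the goal really is the merge-by-key behaviour shown in the question (sum quantities when Date, Item and Color match, and drop rows that reach zero), a rough sketch using the column names from the post could look like this:
import pandas as pd

df = pd.DataFrame({"Quantity": [15, 10, 3],
                   "Date": ["0312", "0312", "0512"],
                   "Item": ["car", "car", "car"],
                   "Color": ["green", "red", "red"]})

def add_item(df, quantity, date, item, color):
    # append the new row, re-sum per (Date, Item, Color), drop depleted rows
    new = pd.DataFrame([{"Quantity": quantity, "Date": date, "Item": item, "Color": color}])
    out = (pd.concat([df, new], ignore_index=True)
             .groupby(["Date", "Item", "Color"], as_index=False)["Quantity"].sum())
    return out[out["Quantity"] > 0]

df = add_item(df, 3, "0312", "car", "green")   # green 0312 car becomes 18
df = add_item(df, -3, "0512", "car", "red")    # red 0512 car drops out
print(df)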
