How do I create a column chart with an aggregated column - python

I have two columns, customer_id and revenue, and I'm trying to figure out how to use matplotlib (or seaborn) to create a histogram/bar/column chart that has an aggregated column on the right. Every time I change the range, it just cuts off the values above my max range. Instead, I want a bin that holds the count of instances above that max value.
For the example chart linked below, if I define my range as 0-1558, I want there to be a column that counts all values of $1558 and above.
Example Chart

Cap the values above the limit (only in the revenue column, so customer_id is left alone):
df['revenue'] = df['revenue'].clip(upper=limit)
Now plot the histogram.

Same concept as @DYZ, but my code ended up being:
df.loc[df.revenue > limit, 'revenue'] = limit
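As an alternative to capping, the overflow column can be built by binning with an open-ended last interval and drawing the counts as bars. The following is only a sketch: the revenue figures, the bin edges, and the label strings are made up for illustration, and the Agg backend is an assumption so that it runs without a display.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up data, purely for illustration.
df = pd.DataFrame({
    "customer_id": range(1, 11),
    "revenue": [120, 300, 450, 800, 1200, 1500, 1600, 2500, 4000, 9000],
})
limit = 1558  # upper end of the range, as in the question

# Regular bins up to the limit, plus one open-ended overflow bin.
edges = [0, 400, 800, 1200, limit, np.inf]
labels = ["0-400", "400-800", "800-1200", f"1200-{limit}", f"{limit}+"]

# right=False makes each bin [low, high), so a value equal to the
# limit falls into the overflow bin.
binned = pd.cut(df["revenue"], bins=edges, right=False, labels=labels)
counts = binned.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(labels, counts.to_numpy(), edgecolor="black")
ax.set_xlabel("revenue")
ax.set_ylabel("number of customers")
# In an interactive session, call plt.show() here.
```

Because the bars are drawn from precomputed counts, the overflow bar has the same width as the others even though its interval is unbounded.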

Related

Dataframe value.counts() to barplot

I have a dataframe with multiple columns such as product name, reviews, origin, and etc.
Here, I want to create a barplot with only the data from "Origin" column.
To do this, I used the code:
origin = df['Origin'].value_counts()
With this, I was able to get a list of countries with their corresponding frequencies (or counts). Now, I want to create a barplot with each country on the X-axis and the counted frequencies on the Y-axis. Although the column of frequencies has a column label, I am unable to set the X-axis because the countries are merely saved as the index. Would there be a better way to count the column "Origin" and make it into a barplot?
Thanks in advance.
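One possible approach, sketched here with made-up country data: the Series returned by value_counts() can be plotted directly, since pandas uses the index as the x-axis labels of a bar plot. If explicit columns are preferred, the index can be turned into a regular column. The column name "count" and the Agg backend are assumptions for this sketch.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up data, purely for illustration.
df = pd.DataFrame({"Origin": ["USA", "Japan", "USA", "Germany", "Japan", "USA"]})

origin = df["Origin"].value_counts()

# The countries live in the index, which becomes the x-axis labels.
ax = origin.plot(kind="bar")
ax.set_xlabel("Origin")
ax.set_ylabel("count")

# Optional: move the index into a regular column.
origin_df = origin.rename_axis("Origin").reset_index(name="count")
```

With origin_df, the same chart could be drawn by passing the two columns explicitly, for example origin_df.plot(x="Origin", y="count", kind="bar").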

Want to display only specific values on a graph's x-axis, but it's showing repeated values from a column of the csv-file

I need to display only unique values on the x-axis, but it is showing all the values in a specific column of the csv-file. Any suggestions on how to fix this, please?
df=pd.read_csv('//media//HOTEL MANAGEMENT.csv')
df.plot('Room_Type','Charges',color='g')
plt.show()
My assumption is that you are looking to plot the result of some aggregated data, e.g. either:
The total charges per room type, or
The average charge per room type, or
The minimum/maximum charge per room type.
If so, you could do something like:
df=pd.read_csv('//media//HOTEL MANAGEMENT.csv')
# And use any of the following:
df.groupby('Room_Type')['Charges'].sum().plot(color='g')
df.groupby('Room_Type')['Charges'].mean().plot(color='g')
df.groupby('Room_Type')['Charges'].min().plot(color='g')
df.groupby('Room_Type')['Charges'].max().plot(color='g')
Seeing that the x-axis may not necessarily be sequential, a comparative bar graph could be another way to plot.
df.groupby('Room_Type')['Charges'].mean().plot.bar(color=['r','g'])
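To see why grouping removes the repeated x values, here is a small sketch on made-up data (the room types and charges are invented, not taken from the CSV in the question). After groupby, each room type appears exactly once in the index.

```python
import pandas as pd

# Made-up data standing in for the CSV, purely for illustration.
df = pd.DataFrame({
    "Room_Type": ["Single", "Double", "Single", "Suite", "Double"],
    "Charges": [100, 180, 120, 400, 200],
})

# One row per room type; the repeated values are collapsed by the aggregation.
mean_charges = df.groupby("Room_Type")["Charges"].mean()
print(mean_charges)
```

Calling mean_charges.plot.bar() on this result would then draw one bar per room type.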

Creating a line graph for a top X in a dataframe (Pandas)

I'm trying to make a line graph for my dataframe that has the names of 10 customers on the X axis and the number of purchases they made on the Y axis.
I have over 100 customers in my data frame, so I created a new data frame that is grouped by customers and which shows the sum of their orders and I wish to only display the top 10 customers on my graph.
I have tried using
TopCustomers.nlargest(10, 'Company', keep='first')
But I run into the error nlargest() got multiple values for argument 'keep' and if I don't use keep, I get told it's a required argument.
TopCustomers is composed of TopCustomers = raw.groupby(raw['Company'])['Orders'].sum()
Sorting is not required at the moment, but it'd be good to know in advance.
An additional note: the list of customer names is rather lengthy and, after playing with some dummy data, I see that the labels on the X axis are stacked on top of each other. Is there a way to make the plot bigger so that all 10 are clearly visible, and maybe to mark a dot where each X and Y value meet?
You can use sort_values and tail:
TopCustomers.sort_values().tail(10)
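The error in the question comes from TopCustomers being a Series: Series.nlargest takes only n and keep, so the extra 'Company' argument is read as keep. A sketch of the Series form, together with one way to address the label and marker questions, follows. The company names, order numbers, figure size, and rotation angle are made up for illustration, n is 3 only because the dummy data is small, and the Agg backend is an assumption so the sketch runs without a display.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up data, purely for illustration.
raw = pd.DataFrame({
    "Company": ["A", "B", "A", "C", "D", "B", "E"],
    "Orders": [5, 3, 7, 10, 1, 4, 2],
})
TopCustomers = raw.groupby("Company")["Orders"].sum()

# On a Series there is no column to name: pass only n (use 10 on real data).
top = TopCustomers.nlargest(3)

# A wider figure and rotated labels keep long names from overlapping;
# marker="o" draws a dot at every (x, y) point.
fig, ax = plt.subplots(figsize=(10, 5))
positions = range(len(top))
ax.plot(positions, top.to_numpy(), marker="o")
ax.set_xticks(list(positions))
ax.set_xticklabels(top.index, rotation=45, ha="right")
ax.set_ylabel("Orders")
fig.tight_layout()
```

nlargest returns the values in descending order, whereas sort_values().tail(10) returns the same rows in ascending order.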

Plotting number of times customer visits app

I have a dataframe which has unique customer id and date.
My dataframe looks like this
date objectId
15/07/18 "__gb5c9e15dfc004930b8ac9d5d1df1880e"
16/07/18 "__g0b2abb9da5d646eb930c1ce9bb6df5ef"
16/07/18 "__c5ff64e5448c44fabe26e88bc0e41497"
17/07/18 "__c7b0a5824a914d7198a328cdf35c95bf"
18/07/18 "__8929216e8d534569ae6fd6701c92fc4c"
19/07/18 "__gec079853a06748a79b4d101713c1e21d"
19/07/18 "__d7f24fa5909b43f4a5282877ed4eed3e"
19/07/18 "__ga523090706304454ba581d79f366816a"
19/07/18 "__d409d75e4207409b8ea030f69b70bf83"
19/07/18 "-g940dc0277b7f46c8b7d8de195a8fd975"
20/07/18 "__d7f24fa5909b43f4a5282877ed4eed3e"
20/07/18 "__ga523090706304454ba581d79f366816a"
21/07/18 "__d409d75e4207409b8ea030f69b70bf83"
21/07/18 "-g940dc0277b7f46c8b7d8de195a8fd975"
I want to plot a graph that counts how many customers visited once, twice, and so on.
y axis - number of times an object id gets repeated
x axis - count of object ids that get repeated that many times
I tried something like
date_df['objectId'].value_counts().plot(kind='bar')
This is not a good dataset to plot this way, because you will most likely only get a bar height of one for most entries, and since there are a lot of customers, you will not get a good overview at all.
Anyway, assuming the same customer gets the same ID regardless of the day, you can put all of the IDs in a list, sort the list and then plot a histogram with number of bins = number of unique entries in the list.
customer_list = sorted(date_df['objectId'].tolist())
This list is now ready to use as input for a histogram.
EDIT: your comment "It is giving different objectID's on x axis" is actually interesting, because that is the only possible output for this kind of plotted dataset: a histogram with the number of occurrences on the y-axis and the different IDs on the x-axis.
For the plotting, it would look like
import matplotlib.pyplot as plt
customer_list = sorted(date_df['objectId'].tolist())
plt.hist(customer_list, bins=len(set(customer_list)))
plt.show()
The set removes all duplicates, and len then gives you the number of unique items in the list. By setting the bins to that number, you get one histogram bar for every single customer.
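If the goal is instead the number of customers who visited once, twice, and so on, a different technique from the histogram above is to apply value_counts twice: once to count visits per customer, and once to count how many customers share each visit count. This is a sketch on made-up IDs; the variable names are hypothetical.

```python
import pandas as pd

# Made-up data, purely for illustration: "c" visits twice, "d" three times.
date_df = pd.DataFrame({"objectId": ["a", "b", "c", "c", "d", "d", "d"]})

# First pass: number of visits per customer.
visits_per_customer = date_df["objectId"].value_counts()

# Second pass: number of customers for each visit count (1 visit, 2 visits, ...).
customers_per_visit_count = visits_per_customer.value_counts().sort_index()
print(customers_per_visit_count)
```

Plotting customers_per_visit_count with .plot(kind='bar') would then give one bar per visit count rather than one bar per customer.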

producing a scatter plot from multi-level dataframe [pandas]

I have a big data frame, on which I've done a df.groupby(["event_type", "day"]).count() and gotten the following multi-indexed df:
My aim is to produce a scatter plot that shows the number of occurrences of an event per day, sorted by event_type. So a scatter plot where the x axis is "day" and the y axis would be "id" from the above table (which is a count). But I don't know how to go about making it.
background: event_type is only 3 types. day is like 2 years of dates. "id" is id of things I'm tracking, but in the above .groupby() data frame, its actually the count of ids. I'd ideally like to get 3 separate lines plotted (one per event_type) of the id counts versus day of the year. Thanks!
I hope this will help:
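import pandas as pd
import matplotlib.pyplot as plt
a['date'] = pd.to_datetime(a['date'])
# One series of points per type: count rows per (type, date), then split by type
for name, group in a.groupby(['type','date']).count().groupby('type'):
    plt.plot(group.reset_index().set_index('date')['v1'], marker='o', linestyle='', label=name)
plt.legend()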
If you want a normal line plot instead of a scatter, remove the marker and linestyle arguments.
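Using the column names from the question (event_type, day, id), a similar result can be sketched by unstacking the multi-index so that each event type becomes its own column. The event names, dates, and ids below are made up for illustration, and the Agg backend is an assumption so the sketch runs without a display.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up data, purely for illustration.
df = pd.DataFrame({
    "event_type": ["click", "click", "click", "view", "view", "view", "view"],
    "day": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-02",
        "2018-01-01", "2018-01-02", "2018-01-02", "2018-01-02",
    ]),
    "id": [1, 2, 3, 4, 5, 6, 7],
})

# Count ids per (event_type, day), then pivot event_type into columns.
counts = df.groupby(["event_type", "day"])["id"].count()
wide = counts.unstack("event_type")

fig, ax = plt.subplots()
for name in wide.columns:
    # marker only, no connecting line, gives a scatter-style plot
    ax.plot(wide.index, wide[name], marker="o", linestyle="", label=name)
ax.set_xlabel("day")
ax.set_ylabel("count of id")
ax.legend()
```

Days on which an event type never occurs show up as NaN in wide and are simply not drawn.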
My DF looks like this:
