I have a dataframe that holds a unique customer ID and a date for each visit.
My dataframe looks like this:
date objectId
15/07/18 "__gb5c9e15dfc004930b8ac9d5d1df1880e"
16/07/18 "__g0b2abb9da5d646eb930c1ce9bb6df5ef"
16/07/18 "__c5ff64e5448c44fabe26e88bc0e41497"
17/07/18 "__c7b0a5824a914d7198a328cdf35c95bf"
18/07/18 "__8929216e8d534569ae6fd6701c92fc4c"
19/07/18 "__gec079853a06748a79b4d101713c1e21d"
19/07/18 "__d7f24fa5909b43f4a5282877ed4eed3e"
19/07/18 "__ga523090706304454ba581d79f366816a"
19/07/18 "__d409d75e4207409b8ea030f69b70bf83"
19/07/18 "-g940dc0277b7f46c8b7d8de195a8fd975"
20/07/18 "__d7f24fa5909b43f4a5282877ed4eed3e"
20/07/18 "__ga523090706304454ba581d79f366816a"
21/07/18 "__d409d75e4207409b8ea030f69b70bf83"
21/07/18 "-g940dc0277b7f46c8b7d8de195a8fd975"
I want to plot a graph that shows how many customers visited once, twice, and so on. y axis: the number of times an object ID is repeated
x axis: the count of object IDs that are repeated that many times. I tried something like
date_df['objectId'].value_counts().plot(kind='bar')
This is not a good dataset to plot this way, because you will most likely get a bar height of one for most entries, and since there are a lot of customers, you will not get a good overview at all.
Anyway, assuming the same customer gets the same ID regardless of the day, you can put all of the IDs in a list, sort the list and then plot a histogram with number of bins = number of unique entries in the list.
customer_list = sorted(date_df['objectId'].tolist())
This list is now ready to use as input for a histogram.
EDIT: your comment "It is giving different objectID's on x axis" is actually interesting, because that is the only possible output for this kind of plotted dataset: a histogram with the number of occurrences on the y-axis and the different IDs on the x-axis.
For the plotting, it would look like
import matplotlib.pyplot as plt
customer_list = sorted(date_df['objectId'].tolist())
plt.hist(customer_list, bins=len(set(customer_list)))
plt.show()
The set removes all duplicates, and len then gives you the number of unique items in the list. By setting the bins to that number, you get one histogram bar for every single customer.
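If what the question is really after is "how many customers visited once, twice, and so on", one possible sketch is to apply value_counts twice. The sample data below is hypothetical (shortened IDs standing in for the real objectId values), and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical sample with the same two columns as in the question.
date_df = pd.DataFrame({
    "date": ["15/07/18", "16/07/18", "16/07/18", "19/07/18",
             "19/07/18", "20/07/18", "20/07/18", "21/07/18"],
    "objectId": ["a", "b", "c", "d", "e", "d", "e", "d"],
})

# First value_counts: number of visits per customer.
visits_per_customer = date_df["objectId"].value_counts()

# Second value_counts: number of customers that share each visit count.
customers_per_visit_count = visits_per_customer.value_counts().sort_index()
```

Calling customers_per_visit_count.plot(kind='bar') should then give one bar per visit count instead of one bar per customer.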
Related
I need to display only unique values on the x-axis, but the plot shows every value in a specific column of the CSV file. Any suggestions on how to fix this?
df=pd.read_csv('//media//HOTEL MANAGEMENT.csv')
df.plot('Room_Type','Charges',color='g')
plt.show()
My assumption is that you are looking to plot the result of some aggregated data, for example one of:
The total charges per room type, or
The average charge per room type, or
The minimum/maximum charge per room type.
If so, you could do something like:
df=pd.read_csv('//media//HOTEL MANAGEMENT.csv')
# And use any of the following:
df.groupby('Room_Type')['Charges'].sum().plot(color='g')
df.groupby('Room_Type')['Charges'].mean().plot(color='g')
df.groupby('Room_Type')['Charges'].min().plot(color='g')
df.groupby('Room_Type')['Charges'].max().plot(color='g')
Since the x-axis values are not necessarily sequential, a comparative bar graph could be another way to plot this.
df.groupby('Room_Type')['Charges'].mean().plot.bar(color=['r','g'])
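To see why the aggregation leaves only unique values on the x-axis, here is a minimal sketch on made-up data. The room types and charges are hypothetical; only the column names Room_Type and Charges come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents.
df = pd.DataFrame({
    "Room_Type": ["Single", "Double", "Single", "Suite", "Double"],
    "Charges": [100, 180, 120, 400, 200],
})

# groupby collapses repeated room types into one row each,
# so the resulting index holds every room type exactly once.
mean_charges = df.groupby("Room_Type")["Charges"].mean()
```

The index of mean_charges is what ends up on the x-axis when you call .plot() or .plot.bar() on it.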
I'm trying to make a line graph for my dataframe that has the names of 10 customers on the X axis and the number of purchases they made on the Y axis.
I have over 100 customers in my data frame, so I created a new data frame that is grouped by customers and which shows the sum of their orders and I wish to only display the top 10 customers on my graph.
I have tried using
TopCustomers.nlargest(10, 'Company', keep='first')
But I run into the error nlargest() got multiple values for argument 'keep' and if I don't use keep, I get told it's a required argument.
TopCustomers is composed of TopCustomers = raw.groupby(raw['Company'])['Orders'].sum()
Sorting is not required at the moment, but it'd be good to know in advance.
On an additional note: the list of customer names is rather lengthy and, after playing with some dummy data, I see that the labels for the X axis are stacked on top of each other. Is there a way to make the plot bigger so that all 10 are clearly visible, and maybe to mark a dot where X and Y meet?
We can use sort_values and tail:
TopCustomers.sort_values().tail(10)
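As a side note on the error: TopCustomers is a Series, and Series.nlargest takes only n and keep, so the 'Company' argument lands in the keep slot. A small sketch with hypothetical companies and order counts (and 3 instead of 10, to keep it short) shows both approaches side by side.

```python
import pandas as pd

# Hypothetical raw data; only the column names come from the question.
raw = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "A"],
    "Orders": [5, 10, 1, 4, 8, 2],
})
TopCustomers = raw.groupby("Company")["Orders"].sum()

# Series.nlargest needs no column name; result is in descending order.
top3 = TopCustomers.nlargest(3)

# Same customers via sort_values + tail; result is in ascending order.
tail3 = TopCustomers.sort_values().tail(3)
```

For the plotting part, something like top3.plot(marker='o', rot=45) may help: marker draws a dot at each point and rot rotates the x labels so they overlap less.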
I have a data frame with multiple features, including two categorical: 'race' (5 unique values) and 'income' (2 unique values: <=$50k and >$50k)
I've figured out how to do a cross-tabulation table between the two.
However, I can't figure out a short way to create a table or bar graph that shows what percentage of each of the five races falls in the <=$50k income group.
The code below gives me a table where the rows are the individual races; the counts for each of the two categories of income; and the total counts for each race. I can't figure out how to add another column on the right that simply takes the count for <=$50k, divides by the total, and then lists the proportion
ct_race_income=pd.crosstab(adult_df.race, adult_df.income, margins=True)
Here's a bunch of code where I do it the long way: calculating each proportion and then creating a new dataframe for the purposes of making a bar chart. However, I want to code all of this in many fewer lines
total_white=len(adult_df[adult_df.race=="White"])
total_black=len(adult_df[adult_df.race=="Black"])
total_hisp=len(adult_df[adult_df.race=="Hispanic"])
total_asian=len(adult_df[adult_df.race=="Asian"])
total_amer_indian=len(adult_df[adult_df.race=="Amer-Indian"])
prop_white=(len(adult_df_lowincome[adult_df_lowincome.race=="White"])/total_white)
prop_black=(len(adult_df_lowincome[adult_df_lowincome.race=="Black"])/total_black)
prop_hisp=(len(adult_df_lowincome[adult_df_lowincome.race=="Hispanic"])/total_hisp)
prop_asian=(len(adult_df_lowincome[adult_df_lowincome.race=="Asian"])/total_asian)
prop_amer_indian=(len(adult_df_lowincome[adult_df_lowincome.race=="Amer-Indian"])/total_amer_indian)
prop_lower_income=pd.DataFrame()
prop_lower_income['Race']=["White","Black","Hispanic", "Asian", "American Indian"]
prop_lower_income['Ratio']=[prop_white, prop_black, prop_hisp, prop_asian, prop_amer_indian]
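One shorter route that might work here is the normalize argument of pd.crosstab, which divides each row by its row total. This is only a sketch on hypothetical data: the income labels "<=50K" and ">50K" are assumptions and need to match whatever the real column contains.

```python
import pandas as pd

# Hypothetical miniature of adult_df; the labels are assumed.
adult_df = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "Asian", "White"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# normalize="index" turns the counts in each row into row proportions.
ct_share = pd.crosstab(adult_df.race, adult_df.income, normalize="index")

# Proportion of each race in the lower income group.
low_income_share = ct_share["<=50K"]
```

low_income_share.plot(kind='bar') would then draw the bar chart without building the intermediate dataframe by hand.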
I have a big data frame, on which I've done a df.groupby(["event_type", "day"]).count() and gotten the following multi-indexed df:
My aim is to produce a scatter plot that shows the number of occurrences of an event per day, sorted by event_type. So a scatter plot where the x axis is "day" and the y axis would be "id" from the above table (which is a count). But I don't know how to go about making it.
Background: event_type has only 3 types. day covers about 2 years of dates. "id" is the ID of the things I'm tracking, but in the above .groupby() data frame it's actually the count of IDs. I'd ideally like to get 3 separate lines plotted (one per event_type) of the ID counts versus day of the year. Thanks!
I hope this will help:
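import pandas as pd
import matplotlib.pyplot as plt
a['date'] = pd.to_datetime(a['date'])
# Count rows per (type, date), then draw one marker series per type.
for name, group in a.groupby(['type','date']).count().groupby('type'):
    plt.plot(group.reset_index().set_index('date')['v1'], marker='o', linestyle='', label=name)
plt.legend()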
If you want a normal line plot instead of a scatter plot, remove the marker and linestyle arguments.
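An alternative sketch, using the column names from the question (event_type, day, id) on hypothetical data, is to unstack the grouped counts so that each event type becomes its own column. The values below are invented for illustration.

```python
import pandas as pd

# Hypothetical events; only the column names come from the question.
events = pd.DataFrame({
    "event_type": ["a", "a", "b", "a", "b", "c"],
    "day": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-01",
                           "2018-01-02", "2018-01-02", "2018-01-02"]),
    "id": [1, 2, 3, 4, 5, 6],
})

# Count ids per (event_type, day): a Series with a two-level index.
counts = events.groupby(["event_type", "day"])["id"].count()

# Move event_type into the columns; days with no events get 0.
wide = counts.unstack("event_type", fill_value=0)
```

With the data in this shape, wide.plot(style='o') should draw one marker series per event type, and wide.plot() one line per event type.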
My DF looks like this:
I have two columns, customer_id and revenue, and I'm trying to figure out how to use matplotlib (or seaborn) to create a histogram/bar/column chart that has an aggregated column on the right. Every time I change the range, it just cuts off the values above my max range. Instead, I want a bin that holds the count of instances above that max value.
For the example chart linked below, if I define my range as 0-1558, I want there to be a column that counts all values of $1558 and above.
Example Chart
Cap the values above the limit:
df.loc[df['revenue'] > limit, 'revenue'] = limit
Now, plot the histogram.
Same concept as @DYZ's answer, but my code ended up being as follows (written with .loc, since .ix has been removed from recent pandas versions):
df.loc[df.revenue > limit, 'revenue'] = limit
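To check what the capping does to the bins, here is a small sketch on hypothetical revenue values. It uses Series.clip instead of the mask assignment (same effect on the revenue column, but it leaves the original dataframe untouched) and np.histogram to count per bin; the limit of 1558 is taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "revenue": [100, 500, 1200, 1600, 3000, 9000],
})
limit = 1558

# Cap revenue at the limit without modifying df itself.
capped = df["revenue"].clip(upper=limit)

# Four equal-width bins over 0..limit; the last bin includes its right
# edge, so every capped value is counted there.
counts, edges = np.histogram(capped, bins=4, range=(0, limit))
```

Note that the last bin then holds both the capped values and any values that genuinely fall in its range, so it is not a pure "overflow" count.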