producing a scatter plot from multi-level dataframe [pandas] - python

I have a big data frame, on which I've done a df.groupby(["event_type", "day"]).count() and gotten the following multi-indexed df:
My aim is to produce a scatter plot that shows the number of occurrences of an event per day, sorted by event_type. So a scatter plot where the x axis is "day" and the y axis would be "id" from the above table (which is a count). But I don't know how to go about making it.
Background: event_type has only 3 types. day covers about 2 years of dates. "id" is the id of the things I'm tracking, but in the above .groupby() data frame it's actually the count of ids. I'd ideally like to get 3 separate lines plotted (one per event_type) of the id counts versus day of the year. Thanks!

I hope this will help:
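import pandas as pd
import matplotlib.pyplot as plt

a['date'] = pd.to_datetime(a['date'])
# Count rows per (type, date), then draw one marker-only series per type
for name, group in a.groupby(['type','date']).count().groupby('type'):
    plt.plot(group.reset_index().set_index('date')['v1'], marker='o', linestyle='', label=name)
plt.legend()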
If you want a normal line plot instead of a scatter plot, remove the marker and linestyle arguments.
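For a self-contained variant of the same idea, here is a minimal sketch on made-up data. The column names event_type, day and id follow the question; the sample values and the helper names count_per_day and plot_counts are assumptions for illustration only. It counts ids per (event_type, day), then unstacks event_type into columns so that each event type becomes its own series.

```python
import pandas as pd


def count_per_day(df):
    """Return a table of id counts: one row per day, one column per event_type."""
    counts = df.groupby(["event_type", "day"])["id"].count()
    # Move event_type from the row index into the columns; days with no
    # events of a given type become 0 instead of NaN.
    return counts.unstack("event_type", fill_value=0)


def plot_counts(table):
    """Draw one marker-only series per event_type against day."""
    # Imported here so the counting step can be used without matplotlib.
    import matplotlib.pyplot as plt

    ax = table.plot(marker="o", linestyle="")
    ax.set_xlabel("day")
    ax.set_ylabel("number of ids")
    plt.show()


# Made-up sample data (hypothetical values, not from the question).
df = pd.DataFrame({
    "event_type": ["a", "a", "a", "b", "b", "c"],
    "day": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02",
                           "2018-01-01", "2018-01-02", "2018-01-02"]),
    "id": [1, 2, 3, 4, 5, 6],
})
table = count_per_day(df)
```

Calling plot_counts(table) should then draw the three series; dropping marker and linestyle gives connected lines instead.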
My DF looks like this:

Related

Dataframe value.counts() to barplot

I have a dataframe with multiple columns such as product name, reviews, origin, and etc.
Here, I want to create a barplot with only the data from "Origin" column.
To do this, I used the code:
origin = df['Origin'].value_counts()
With this, I was able to get a list of countries with their corresponding frequencies (counts). Now, I want to create a bar plot with each country on the X-axis and the counted frequencies on the Y-axis. Although the column of frequencies has a column label, I am unable to set the X-axis because the countries are only stored as the index. Would there be a better way to count the column "Origin" and make it into a barplot?
Thanks in advance.
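A minimal sketch of one way this could be done, assuming the column is called Origin as in the question (the sample rows and the helper names are made up). value_counts() returns a Series whose index holds the countries, and pandas uses that index as the x axis of a bar plot, so no extra column is needed.

```python
import pandas as pd


def origin_counts(df):
    """Count how often each country appears in the Origin column."""
    return df["Origin"].value_counts()


def plot_origin_counts(counts):
    """Bar plot with countries (the Series index) on x and counts on y."""
    # Imported here so the counting step can be used without matplotlib.
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="bar")
    ax.set_xlabel("Origin")
    ax.set_ylabel("count")
    plt.show()


# Made-up sample rows (hypothetical, not from the question).
df = pd.DataFrame({"Origin": ["Kenya", "Peru", "Kenya", "Japan", "Kenya", "Peru"]})
counts = origin_counts(df)
```

If the countries are needed as a regular column, counts.rename_axis("Origin").reset_index(name="count") converts the index into one.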

How do I combine dataframe (pandas) columns to plot confidence intervals in a Seaborn box plot?

My dataframe looks something like this:
I want to combine Column 1 and Column 2 by making a confidence interval (take the mean of Column 1 and Column 2, find the standard deviation, etc.) and then plot that using Seaborn box plots. Basically, I want to compare the values across the 4 rows (each row is an experimental subject) for each value (but each value has multiple measurements). In other words, I need to combine columns (replicate experiments) in the DataFrame. How would I go about doing this?
Here is what I'm hoping to get:
Thanks!
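One possible approach, sketched under assumptions: the column names subject, col1 and col2 and all values below are hypothetical, since the original table is not shown. The idea is to melt the replicate columns into long form, so that each subject has several rows of measurements that a box plot or a mean/standard deviation summary can work on.

```python
import pandas as pd


def to_long_form(df, id_col, value_cols):
    """Stack replicate columns into one value column (long form)."""
    return df.melt(id_vars=id_col, value_vars=value_cols,
                   var_name="replicate", value_name="value")


def summarize(long_df, id_col):
    """Mean and sample standard deviation of the replicates per subject."""
    return long_df.groupby(id_col)["value"].agg(["mean", "std"])


# Hypothetical data: 4 subjects, 2 replicate columns.
df = pd.DataFrame({
    "subject": ["s1", "s2", "s3", "s4"],
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [3.0, 2.0, 5.0, 8.0],
})
long_df = to_long_form(df, "subject", ["col1", "col2"])
stats = summarize(long_df, "subject")
```

With seaborn available, sns.boxplot(x="subject", y="value", data=long_df) would then draw one box per subject from the long-form frame.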

Plotting number of times customer visits app

I have a dataframe which has unique customer id and date.
My dataframe looks like this
date objectId
15/07/18 "__gb5c9e15dfc004930b8ac9d5d1df1880e"
16/07/18 "__g0b2abb9da5d646eb930c1ce9bb6df5ef"
16/07/18 "__c5ff64e5448c44fabe26e88bc0e41497"
17/07/18 "__c7b0a5824a914d7198a328cdf35c95bf"
18/07/18 "__8929216e8d534569ae6fd6701c92fc4c"
19/07/18 "__gec079853a06748a79b4d101713c1e21d"
19/07/18 "__d7f24fa5909b43f4a5282877ed4eed3e"
19/07/18 "__ga523090706304454ba581d79f366816a"
19/07/18 "__d409d75e4207409b8ea030f69b70bf83"
19/07/18 "-g940dc0277b7f46c8b7d8de195a8fd975"
20/07/18 "__d7f24fa5909b43f4a5282877ed4eed3e"
20/07/18 "__ga523090706304454ba581d79f366816a"
21/07/18 "__d409d75e4207409b8ea030f69b70bf83"
21/07/18 "-g940dc0277b7f46c8b7d8de195a8fd975"
I want to plot a graph that shows how many customers visited once, twice, and so on.
y axis - number of times object id gets repeated
x axis - count of object id that gets repeated
I tried something like
date_df['objectId'].value_counts().plot(kind='bar')
This is not a good dataset to plot, because you will most likely only get a bar height of one for most entries, and since there are a lot of customers, you will not get a good overview at all.
Anyway, assuming the same customer gets the same ID regardless of the day, you can put all of the IDs in a list, sort the list and then plot a histogram with number of bins = number of unique entries in the list.
customer_list = sorted(date_df['objectId'].tolist())
This list is now ready to use as input for a histogram.
EDIT: your comment "It is giving different objectID's on x axis" is actually interesting, because that is the only possible output for this kind of plotted dataset - a histogram with the number of occurrences on the y-axis and the different IDs on the x-axis.
For the plotting, it would look like
import matplotlib.pyplot as plt
customer_list = sorted(date_df['objectId'].tolist())
plt.hist(customer_list, bins=len(set(customer_list)))
plt.show()
The set removes all duplicates, and len then gives you the number of unique items in the list. By setting the bins to that number, you get one histogram bar for every single customer.
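If the goal is the one stated in the question (how many customers visited once, twice, and so on), a different technique from the per-customer histogram above is to count the counts. A minimal sketch, assuming the objectId column from the question and using shortened, made-up ids; the helper name visit_distribution is hypothetical.

```python
import pandas as pd


def visit_distribution(date_df):
    """Map 'number of visits' -> 'number of customers with that many visits'."""
    visits_per_customer = date_df["objectId"].value_counts()
    # Counting the counts gives the distribution; sort by number of visits.
    return visits_per_customer.value_counts().sort_index()


# Shortened, made-up ids standing in for the real objectId values.
date_df = pd.DataFrame({
    "date": ["15/07/18", "16/07/18", "16/07/18", "19/07/18", "20/07/18", "21/07/18"],
    "objectId": ["a", "b", "c", "d", "d", "d"],
})
dist = visit_distribution(date_df)
```

dist.plot(kind="bar") would then put the number of visits on the x axis and the number of customers on the y axis.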

Factorplot with multiindex dataframe

This is the dataframe I am working with:
(Only the first two years don't have data for country 69; I will fix this.) nkill is the number killed in that year, summed from the original long-form dataframe.
I am trying to do something similar to this plot:
However, I want the country code as a hue. I know there are similar posts, but none have helped me solve this. Thank you in advance.
By hue I mean the hue argument as used in seaborn syntax, as pictured in this third picture. In this example, hue creates a plot for every value of the variable in that column. So if I had two country codes in the country column, it would plot two bars side by side for every year (one for each country).
Just looking at the data it should be possible to directly use the hue argument.
But first you would need to turn the index levels of the dataframe into actual columns
df.reset_index(inplace=True)
Then something like
sns.barplot(x = "year", y="nkill", hue="country", data=df)
should give you the desired plot.
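Put together, the two steps could look like the sketch below. The numbers are made up, country code 95 and the helper names flatten_index and plot_nkill are hypothetical, and the index level names year and country are assumed from the column names used in the barplot call above.

```python
import pandas as pd


def flatten_index(df):
    """Turn the (year, country) index levels into ordinary columns."""
    return df.reset_index()


def plot_nkill(flat):
    """Grouped bars: one group per year, one bar per country."""
    # Imported here so the reshaping step can be used without seaborn.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.barplot(x="year", y="nkill", hue="country", data=flat)
    plt.show()


# Made-up numbers (hypothetical); only the column names come from the thread.
index = pd.MultiIndex.from_product([[1970, 1971], [69, 95]],
                                   names=["year", "country"])
df = pd.DataFrame({"nkill": [0, 3, 5, 2]}, index=index)
flat = flatten_index(df)
```

This version returns a new frame instead of using inplace=True, so the original multi-indexed dataframe stays available.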

How do I create a column chart with an aggregated column

I have two columns, customer_id and revenue, and I'm trying to figure out how to use matplotlib (or seaborn) to create a histogram/bar/column chart that has an aggregated column on the right. Every time I change the range, it just cuts off the values above my max range. Instead, I want there to be a bin that is the count of instances above that max value.
For the example chart linked below, if I define my range as 0-1558, I want there to be a column that counts all values of $1558 and above and displays that count as a column.
Example Chart
Cap the values above the limit:
df.loc[df['revenue'] > limit, 'revenue'] = limit
Now, plot the histogram.
Same concept as @DYZ, but my code ended up being:
df.loc[df.revenue > limit, 'revenue'] = limit
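The same capping can also be written with Series.clip, which touches only the revenue column and leaves customer_id alone. A minimal sketch on made-up revenue values; the limit of 1558 is taken from the question, and the helper name cap_revenue is hypothetical.

```python
import pandas as pd


def cap_revenue(df, limit):
    """Return a copy where revenue above limit is replaced by limit."""
    capped = df.copy()
    capped["revenue"] = capped["revenue"].clip(upper=limit)
    return capped


# Made-up revenue values (hypothetical); 1558 is the limit from the question.
df = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
                   "revenue": [100, 900, 1558, 2000, 5000]})
limit = 1558
capped = cap_revenue(df, limit)
```

Plotting with plt.hist(capped["revenue"], range=(0, limit)) should then collect every value at or above the limit in the last bin, since the last histogram bin includes its right edge.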
