How to create a multi-line chart with duplicate index categorical data? - python

I have a DataFrame (df) in the format below and want to create a multi-line chart from this data.
Name  Category  Score        Count
A     Books     12025.4375      48
A     Music     17893.25         4
A     Movie     31796.37838     37
A     Script    1560.4           5
A     Art       973.125          8
B     Music     1929            15
B     Movie     3044.229167     48
B     Script    3402.4          10
B     Art       2450.125         8
C     Books     14469.3         10
C     Music     10488.78947     57
C     Movie     1827.101695     59
C     Script    7077             2
Expected Output:
I want the unique Category values on the X-axis, Score on the Y-axis, and multiple lines, one per Name.
Count is additional data that is not needed for this graph.
I tried the syntax below, but it does not show the output in the expected format:
lines = df.line(x='Category', y=['Name', 'Score'], figsize=(20, 10))
I tried multiple options and answers available here, but it seems like nothing works for me.

First pivot the data, then plot with DataFrame.plot; line is the default kind, so it can be omitted:
import numpy as np
import matplotlib.pyplot as plt

# reshape: one column of Scores per Name, indexed by Category
df1 = df.pivot(index='Category', columns='Name', values='Score')
df1.plot(figsize=(20, 10))

# show the category values on the x axis
plt.xticks(np.arange(len(df1.index)), df1.index)
plt.show()
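For reference, with the sample data above, df1 comes out with one column per Name. Note the NaN where a Name has no row for a Category (B has no Books, C has no Art), which leaves gaps in those lines:
Name                A            B             C
Category
Art        973.125000  2450.125000           NaN
Books    12025.437500          NaN  14469.300000
Movie    31796.378380  3044.229167   1827.101695
Music    17893.250000  1929.000000  10488.789470
Script    1560.400000  3402.400000   7077.000000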

Related

.groupby function in Python

I am trying to create a pie chart in Python. I have a dataset with 2137 responses to a question, with answer choices ranging from 1 to 5. I am trying to produce a pie chart with the percentage of responses for each answer choice, but when I run my code, it produces a pie chart per respondent (so 2137 slices of the pie). I think I need to use the .groupby function, but I am not entirely sure how to do it correctly.
Here is what I have tried (PS: I am just starting to learn Python, so sorry if this is a dumb question!):
df3 = pd.DataFrame(df, columns=['Q78']).groupby(['Q78'])
df3.plot.pie(subplots=True)
One possible solution (note the question's column is Q78, so use that name consistently):
s = df.Q78
s.groupby(s).size().plot.pie(autopct='%1.1f%%');
To test my code I created a DataFrame limited to just 8 answers:
Q78
0 A
1 B
2 C
3 D
4 E
5 A
6 B
7 A
and I got a pie chart with one slice per answer: A at 37.5%, B at 25%, and C, D, E at 12.5% each.
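For what it's worth, value_counts does the same grouping in one step (a minor variant, not part of the original answer):
df['Q78'].value_counts().sort_index().plot.pie(autopct='%1.1f%%');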

How to identify a pattern using Pandas on similar row names

I am importing an Excel file with somewhat similar vendor names, using an agg function to add up spend, and then sorting by spend. Eventually this data-frame feeds a dynamic Bokeh plot.
The vendor names differ minutely because of text formatting, and my pandas data-frame does not recognize the pattern when adding up the spend. Even though it is the same vendor, I do not get a holistic view of spend; some data is missing, and ultimately the counts in the Bokeh plot are off.
Data
Vendor    Site  Spend
ABC INC   A     300
ABC,Inc   B     100
ABC,Inc.  C     50
ABC,INC.  D     10
Expected Result
All the data should add up to 460.
You could deal with punctuation, spaces, and upper versus lower case before computing the sum, but it will change the Vendor names in the output:
df.groupby([x.upper().replace(' ', '').replace(',','').replace('.','') for x in df['Vendor']])['Spend'].sum()
ABCINC 460
You could also modify the column in place before calling groupby:
# regex=False keeps '.' literal; with the old regex default, '.' would match every character
df['Vendor'] = (df['Vendor'].str.upper()
                .str.replace(' ', '', regex=False)
                .str.replace(',', '', regex=False)
                .str.replace('.', '', regex=False))
print(df.groupby('Vendor')['Spend'].sum())
The df now looks like:
Vendor Site Spend
0 ABCINC A 300
1 ABCINC B 100
2 ABCINC C 50
3 ABCINC D 10
and the output:
ABCINC 460
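A more compact variant (a sketch, assuming only spaces, commas and periods ever need stripping) folds the three replacements into one regular expression:
df['Vendor'] = df['Vendor'].str.upper().str.replace(r'[ ,.]', '', regex=True)
print(df.groupby('Vendor')['Spend'].sum())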

Pandas infrastructure data statistics plot with date per user

I am trying to display some daily infrastructure-usage statistics with Pandas, but I'm a beginner and can't figure it out after many hours of research.
Here are my data types per column:
Name              object
UserService       object
ItemSize          int64
ItemsCount        int64
ExtractionDate    datetime64[ns]
Each day I have a new extraction for each user, so I probably need to use groupby before plotting.
Data sample:
Name UserService ItemSize ItemsCount ExtractionDate
1 xyzf_s xyfz 40 1 2018-12-12
2 xyzf1 xyzf 53 5 2018-12-12
3 xyzf2 xyzf 71 4 2018-12-12
4 xyzf3 xyzf 91 3 2018-12-12
14 vo12 vo 41 5 2018-12-12
One of the graphs I am trying to display is as follows:
the x axis should be the extraction date
the y axis should be the items count (divided by 1000, so thousands of items, from 1 to 100)
each line should represent one user's evolution (to look for data spikes); I guess I would have to display only the top 10 or 50, because a graph of 1500 users would be unreadable.
I'm also interested in any other way you would exploit this data to look for growth and anomalies in data consumption.
Assuming the user is identified by the Name column and there is only one row per user per day, the following code gives the plot you are explicitly asking for:
import matplotlib.pyplot as plt

# limit to 10 users
users_to_plot = df.Name.unique()[:10]
for u in users_to_plot:
    mask = df['Name'] == u
    values = df[mask].sort_values('ExtractionDate')
    plt.plot('ExtractionDate', 'ItemsCount', data=values, label=u)
plt.legend()
plt.show()
It's important to look at the data and think about what information you are trying to extract and what that looks like. It's probably worth exploring a few individual users first to get an idea of what you are trying to identify; think about what makes it unique and whether you can make it pop on a graph.
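If the "top" users matter more than the first ten encountered, here is a minimal sketch that ranks users by total ItemsCount (an assumption about what "top" means) and pivots for plotting:
import matplotlib.pyplot as plt

# keep the ten users with the largest total item count
top = df.groupby('Name')['ItemsCount'].sum().nlargest(10).index
# one column per user, dates as the index
pivoted = df[df['Name'].isin(top)].pivot_table(index='ExtractionDate', columns='Name', values='ItemsCount')
pivoted.plot(figsize=(20, 10))  # one line per user
plt.show()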

How do you annotate a chart from a pivot-table dataframe column?

I have a dataset
a          b       c  d
10-Apr-86  Jimmy   1  this is
11-Apr-86  Minnie  2  the way
12-Apr-86  Jimmy   3  the world
13-Apr-86  Minnie  4  ends
14-Apr-86  Jimmy   5  this is the
15-Apr-86  Eliot   6  way
16-Apr-86  Jimmy   7  the world ends
17-Apr-86  Eliot   8  not with a bang
18-Apr-86  Minnie  9  but a whimper
I want to make a chart in matplotlib that looks like this: a scatter of c against the dates, one color per name, with each point annotated with the text from column d.
I've figured out how to get just the dots (no annotations) using the following code:
df = pd.read_csv('python.csv')
df_wanted = pd.pivot_table(df, index='a', columns='b', values='c')
df_wanted.index = pd.to_datetime(df_wanted.index)
plt.scatter(df_wanted.index, df_wanted['Jimmy'])
plt.scatter(df_wanted.index, df_wanted['Minnie'])
plt.scatter(df_wanted.index, df_wanted['Eliot'])
I think that to annotate, I need a list of values (as demonstrated here) in the final column of my pivot table.
My problem is: how do I get that final column 'd' of the original dataset to become the final column of my pivot table?
I tried dat1 = pd.concat([df_wanted, df['d']], axis=1), but this created a new set of rows underneath the rows of my dataframe. I realized the axes weren't aligned, so I tried to make a new pivot table with the d column as values, but got the error "No numeric types to aggregate".
I tried df_wanted2.append(df['d']), but this made a new column for every element in column d.
Any advice? Ultimately, I want the data labels to appear when one hovers over a point with the mouse.
In this specific case, it doesn't seem you need to set column d as the final column of your pivot table:
plt.scatter(df_wanted.index, df_wanted['Jimmy'])
plt.scatter(df_wanted.index, df_wanted['Minnie'])
plt.scatter(df_wanted.index, df_wanted['Eliot'])
plt.legend(loc=0)
for k, v in df.set_index('a').iterrows():
    # convert the raw date string so the label lands on the datetime x axis
    plt.text(pd.to_datetime(k), v['c'], v['d'])  # or: plt.annotate(v['d'], xy=(pd.to_datetime(k), v['c']))
plt.show()
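The hover behaviour asked about at the end is not covered by the static plt.text labels. A minimal sketch using the third-party mplcursors package (an assumption; it is not part of the original answer) draws all points as one scatter so the hovered point's index maps straight to the dataframe rows:
import matplotlib.pyplot as plt
import pandas as pd
import mplcursors

dfi = df.copy()
dfi['a'] = pd.to_datetime(dfi['a'])

# one artist for all points, so sel.index corresponds to dfi's row order
sc = plt.scatter(dfi['a'], dfi['c'])
cursor = mplcursors.cursor(sc, hover=True)

@cursor.connect("add")
def on_add(sel):
    # show the column-d text of the hovered point
    sel.annotation.set_text(dfi['d'].iloc[int(sel.index)])

plt.show()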

Tools to use for conditional density estimation in Python [closed]

I have a large data set that contains 3 attributes per row: A,B,C
Column A: can take the values 1, 2, and 0.
Column B and C: can take any values.
I'd like to perform density estimation using histograms for P(A = 2 | B,C) and plot the results using python.
I do not need the code to do it; I can figure that out on my own. I just need to know which procedures and tools I should use.
To answer your overall question, we should go through several steps, each answering a different question:
How to read a csv file (or text data)?
How to filter data?
How to plot data?
At each stage you need specific techniques and tools, and you may have several choices at each stage (you can look online for the alternatives).
1- How to read a csv file:
Python has a built-in csv module for going through the csv file where your data is stored, but most people recommend Pandas for csv files. After installing the Pandas package, you can read your csv file with the read_csv function:
import pandas as pd
df = pd.read_csv("file.csv")
As you didn't share the csv file, I will make a random dataset to explain the upcoming steps. (Note that in this example the 0/1/2 column ends up being B rather than A as in your question, but the approach is identical.)
import pandas as pd
import numpy as np

t = [1,1,1,2,0,1,1,0,0,2,1,1,2,0,0,0,0,1,1,1]
df = pd.DataFrame(np.random.randn(20, 2), columns=list('AC'))
df['B'] = t  # insert a column holding only 0, 1, 2 values into the dataframe
Note: Numpy is a Python package that helps with mathematical operations. You don't strictly need it here, but I mention it to avoid confusion.
If you print df at this point, you will get:
A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
3 -0.405150 -1.111787 2
4 0.502283 1.586743 0
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
7 2.731756 0.563161 0
8 2.096459 1.323511 0
9 1.386778 -1.774599 2
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
12 -0.264265 1.216617 2
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1
2- How to filter data:
There are different techniques for filtering data. The easiest is boolean indexing: select the column of your dataframe and apply the condition. In our case, the criterion is selecting the value 2 in column B.
l = df[df['B'] == 2]
print(l)
You can also use other tools, such as groupby or lambda functions, to go through the data frame and apply different conditions to filter the data.
for key in df.groupby('B'):
    print(key)
If you run the above-mentioned scripts you'll get:
For the first one: Only data where B==2
A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2
For the second one: Printing the results divided in groups.
(0, A C B
4 0.502283 1.586743 0
7 2.731756 0.563161 0
8 2.096459 1.323511 0
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0)
(1, A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1)
(2, A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2)
3- How to plot your data:
The simplest way to plot your data is with matplotlib. The easiest way to plot the data in column B is:
import matplotlib.pyplot as plt

# histogram of the 0/1/2 values in column B
plt.hist(df.B, bins=20, color='blue')
plt.show()
You'll get a histogram with three bars, at the values 0, 1 and 2.
If you want to plot the columns together, you should use different colors and markers to keep the plot readable:
import matplotlib.pyplot as plt

t = range(20)
plt.plot(t, df.A, 'r--', label='A')   # red dashed line
plt.plot(t, df.B, 'bs--', label='B')  # blue squares, dashed
plt.plot(t, df.C, 'g^--', label='C')  # green triangles, dashed
plt.legend()
plt.show()
You'll get a single figure with the three series overlaid.
Plotting data is driven by a specific need. You can explore the different ways to plot data by going through the examples on the official matplotlib.org website.
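Putting the steps together for the actual question: a minimal sketch of the histogram-based conditional estimate (my addition, not part of the steps above; with the example's role swap it estimates P(B = 2 | A, C) on the sample data) bins the continuous columns and takes the mean of the indicator within each bin:
import pandas as pd

# histogram bins over the continuous columns
df['A_bin'] = pd.cut(df['A'], bins=4)
df['C_bin'] = pd.cut(df['C'], bins=4)

# within each (A_bin, C_bin) cell, the share of rows with B == 2
# estimates the conditional probability P(B = 2 | A, C)
p_est = df.groupby(['A_bin', 'C_bin'], observed=True)['B'].apply(lambda s: (s == 2).mean())
print(p_est)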
If you're looking for tools that do slightly more sophisticated things than nonparametric density estimation with histograms, please check the cde package in the Python package repository, or install it directly with
pip install cde
In addition to extensive documentation, the package implements:
nonparametric methods (conditional and neighborhood kernel density estimation),
semiparametric methods (least-squares CDE), and
parametric neural-network-based methods (mixture density networks, kernel density estimation).
The package also lets you compute centered moments, statistical divergences (KL divergence, Hellinger, Jensen-Shannon), percentiles, expected shortfalls, and data-generating processes (ARMA-jump, jump-diffusion, GMMs, etc.).
Disclaimer: I am one of the package developers.
