I have dataset with 2 columns and I would like to show the variation of one feature according to the binary output value
data
id Column1 output
1 15 0
2 80 1
3 120 1
4 20 0
... ... ...
I would like to drop a plot with python where x-axis contains values of Column1 and y-axis contains the percent of getting positive values.
I know already that the form of my plot have the form of exponontial function where when column1 has smaller numbers I will get more positive output then when it have long values
exponential plot maybe need two list like this
try this
import matplotlib.pyplot as plt
# x-axis points list
xL = [5,10,15,20,25,30]
# y-axis points list
yL = [100,50,25,12,10]
plt.plot(xL, yL)
plt.axis([0, 35, 0, 200])
plt.show()
Related
I want to plot bar graph from the dataframe below.
df2 = pd.DataFrame({'URL': ['A','A','B','B','C','C'],
'X': [5,0,7,1,0,0],
'Y': [4,0,4,7,9,0],
'Z':[11,0,8,4,0,0]})
URL X Y Z
0 A 5 4 11
1 A 0 0 0
2 B 7 4 8
3 B 1 7 4
4 C 0 9 0
5 C 0 0 0
I want to plot bar graph in which I have URL counts on y-axis and X , Y, Z on x-axis with two bars for each.
One bar will show/count the number of non-zero values in each column and I did it (code at the end).
Second bar will show the count of duplicates in the URL column with the condition that While counting the duplicate values in URL column, At least one value in the corresponding column X should be non-zero like A comes two times in the URL, and in column X, we have one non-zero so it will count as one. In the case of B, both values in non-zero so we will count B as one as well but in the case of C, both values are zero in X so we will not count it as one. The same for Y and Z, and plot the result on the x-axis. I have managed to draw a bar graph for the first case (code below) but for the second bar. I'm unable to draw if anyone can help me in this case. Thank you. here's the code
df2.melt("URL").\
groupby("variable").\
agg(Keywords_count=("value", lambda x: sum(x != 0)),
dup = ("URL", "nunique")).\
plot(kind="bar")
I have a dataset with some rows containing singular answers and others having multiple answers. Like so:
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
With the singular answers I managed to create a heatmap using df.corr(), but I can't figure out what is the best approach for multiple answers rows.
I could split them and add additional columns for each answer like:
year length Animation
0 1971 121 1
1 1971 121 2
2 1971 121 3
3 1939 71 1
4 1939 71 3 ...
and then do the exact same dr.corr(), or add additional Animation_01, Animation_02 ... columns, but there must be a smarter way to work around this issue?
EDIT: Actual data snippet
You should compute a frequency table between two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. df.corr(x, y) is NOT mathematically meaningful when one of x and y is categorical, no matter it is encoded into number or not.
N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:
Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x
Option 2 is more precise in general but the statistics is beyond the scope of this question. This post will focus on the case of two categorical variables.
N.B.2 For sparse matrix output please see this post.
Sample Solution
Data & Preprocessing
import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap
df = pd.read_csv(io.StringIO("""
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
"""), sep=r"\s{2,}", engine="python")
# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional)
df["Animation"] = df["Animation"].astype(int)
Frequency Table
Note: grouping of length is ignored for simplicity
ct = pd.crosstab(df["Animation"], df["length"])
print(ct)
# Out[65]:
# length 7 70 71 121
# Animation
# 0 1 1 1 0
# 1 0 1 1 1
# 2 1 1 1 1
# 3 0 0 2 1
Visualization
ax = heatmap(ct, cmap="viridis",
yticklabels=df["Animation"].drop_duplicates().sort_values(),
xticklabels=df["length"].drop_duplicates().sort_values(),
)
ax.set_title("Title", fontsize=20)
plt.show()
Example Analysis
Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This should better describe the linkage between two categorical variables, as the categorical variables have no order.
For example,
Q: What length does Animation=3 produces?
A: 66.7% chance to give 71
33.3% chance to give 121
otherwise unobserved
You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one one-hot column for every unique string in the original column. Every column with have values of either zero or one, one corresponding to rows where that string appeared in the original column.
First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).
positions = df.Preferred_positions.str.split(',').sum().unique()
Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.
for position in positions:
df[position] = df.Preferred_positions.apply(
lambda x: 1 if position in x else 0
)
I have a dataframe like this and wish to make a frequency histogram.
1.19714
1.04872
0.188158
1
1.02339
1
1
1
0.38496
1.31858
1
1.33654
0.945736
1.00877
0.413445
0.810127
1
0.625
0.492857
0.187156
0.95847
I want to plot a frequency histogram, with x axis bins from -1 to 1. How can I do this in pandas?
pandas has a built-in histogram function.
Assuming your dataframe is called df:
import numpy as np
df.hist(bins=np.arange(-1,1,0.1))
I have this series:
data:
0 17
1 25
2 10
3 60
4 0
5 20
6 300
7 50
8 10
9 80
10 100
11 65
12 125
13 50
14 100
15 150
Name: 1, dtype: int64
I wanted to plot an histogram with variable bin size, so I made this:
filter_values = [0,25,50,60,75,100,150,200,250,300,350]
out = pd.cut(data."1", bins = filter_values)
counts = pd.value_counts(out)
print(counts)
My problem is that when I use counts.plot(kind="hist"), i have not the good label for x axis. I only get them by using a bargraph instead counts.plot(kind="bar"), but I can't get the right order then.
I tried to use xticks=counts.index.values[0] but it makes an error, and xticks=filter_values give an odd figure shape as the numbers go far beyond what the plot understand the bins to be.
I also tried counts.hist(), data.hist(), and counts.plot.hist without success.
I don't know how to plot correctly the categorical data from counts (it includes as index a pandas categorical index) so, I don't know which process I should apply, if there is a way to plot variable bins directly in data.hist() or data.plot(kind="hist") or data.plot.hist(), or if I am right to build counts, but then how to represent this correctly (with good labels on the xaxis and the right order, not a descending one as in the bar graph.
I've created a dummy dataframe which is similar to the one I'm using.
The dataframe consists of Fare prices, Cabin-type, and Survival (1 is alive, 0 = dead).
The first plot creates many graphs via factorplot, with each graph representing the Cabin type. The x-axis is represented by the Fare price and Y-axis is just a count of the number of occurrences at that Fare price.
What I then did was created another series, via groupby of [Cabin, Fare] and then proceeded to take the mean of the survival to get the survival rate at each Cabin and Fare price.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(dict(
Fare=[20, 10, 30, 40, 40, 10, 20, 30, 40 ,30, 20, 30, 30],
Cabin=list('AAABCDBDCDDDC'),
Survived=[1, 0, 0, 0 ,0 ,1 ,1 ,0 ,1 ,1 , 0, 1, 1]
))
g =sns.factorplot(x='Fare', col='Cabin', kind='count', data=df,
col_wrap=3, size=3, aspect=1.3, palette='muted')
plt.show()
x =df.groupby(['Cabin','Fare']).Survived.mean()
What I would like to do is, plot an lineplot on the count graph above, (so the x-axis is the same, and each graph is still represented by a Cabin-type), but I would like the y-axis to be the survival mean we calculated with the groupby series x in the code above, which when outputted would be the third column below.
Cabin Fare
A 10 0.000000
20 1.000000
30 0.000000
B 20 1.000000
40 0.000000
C 30 1.000000
40 0.500000
D 10 1.000000
20 0.000000
30 0.666667
The y-axis for the line plot should be on the right side, and the range I would like is [0, .20, .40, .60, .80, 1.0, 1.2]
I looked through the seaborn docs for a while, but I couldn't figure out how to properly do this.
My desired output looks something like this image. I'm sorry my writing looks horrible, I don't know how to use paint well. So the ticks and numbers are on the right side of each graph. The line plot will be connected via dots at each x,y point. So for Cabin A, the first x,y point is (10,0) with 0 corresponding to the right y-axis. The second point is (20,1) and so on.
Data operations:
Compute frequency counts:
df_counts = pd.crosstab(df['Fare'], df['Cabin'])
Compute means across the group and unstack it back to obtain a DF. The Nan's are left as they are and not replaced by zero's to show the break in the line plot or else they would be continuous which wouldn't make much sense here.
df_means = df.groupby(['Cabin','Fare']).Survived.mean().unstack().T
Prepare the x-axis labels as strings:
df_counts.index = df_counts.index.astype(str)
df_means.index = df_means.index.astype(str)
Plotting:
fig, ax = plt.subplots(1, 4, figsize=(10,4))
df_counts.plot.bar(ax=ax, ylim=(0,5), cmap=plt.cm.Spectral, subplots=True,
legend=None, rot=0)
# Use secondary y-axis(right side)
df_means.plot(ax=ax, secondary_y=True, marker='o', color='r', subplots=True,
legend=None, xlim=(0,4))
# Adjust spacing between subplots
plt.subplots_adjust(wspace=0.5, hspace=0.5)
plt.show()