I'm doing stock market research and want to create a crosstab to run a chi-squared test on. I have stock market price change data as a data frame, and I want to create a crosstab of counts by percentile for two of the columns. Ideally it'd look something like this:
        0.25   0.5   0.75   1.0
0.25      12    45     13    12
0.5        2    27      9    15
0.75      14    11     89    23
1.0       10    52     11     7
Where, for example, the (0.75, 0.5) entry is the count of data points that lie between the 0.5 and 0.75 percentiles for the first variable and between the 0.25 and 0.5 percentiles for the second variable. Obviously those numbers probably aren't actually possible, but you get the point.
All I can think of so far is just doing it by brute force where you get each percentile for each variable individually and then get the counts for each and add them in manually to a table. Is there any shorter way of doing this?
Preparing a sample dataset
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,2), columns=['A', 'B'])
The percentile bins can be computed using pd.qcut. The 4 is the number of equal-sized quantile bins you want to split each variable into.
df['A_binned'] = pd.qcut(df['A'], 4)
df['B_binned'] = pd.qcut(df['B'], 4)
Count the number of records in each combination of bins
dff = df.groupby(by=['A_binned', 'B_binned']).count().reset_index()
Finally you can pivot the dataframe
dff.pivot_table(index='A_binned', columns = 'B_binned', values='A')
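As a possible shortcut, pd.crosstab can build the count table straight from the two binned series, without the intermediate groupby and pivot. The sketch below is only an illustration: the random seed and the bin labels 0.25 to 1.0 (chosen to mirror the layout in the question) are my own assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data, seeded so the result is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=['A', 'B'])

# Label each quartile bin by its upper percentile, as in the question's table
labels = [0.25, 0.5, 0.75, 1.0]
a_binned = pd.qcut(df['A'], 4, labels=labels)
b_binned = pd.qcut(df['B'], 4, labels=labels)

# Count how many rows fall into each (A bin, B bin) combination
table = pd.crosstab(a_binned, b_binned)
print(table)
```

Because qcut splits each column into equal-sized bins, every row and every column of the table sums to a quarter of the data; only the distribution inside the table carries information for the chi-squared test.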
I have the following DataFrame (this table is just an example; the real data has more types and sizes):
df = pd.DataFrame({
'type':['A','A','B','B','C','C','D','D'],
'size':['a','b','c','d','e','f','g','h'],
'Nx':[4.3,2.4,2.5,4.4,3.5,1.8,4.5,2.8],
'min':[0.5,2.5,0.7,3.2,0.51,2,0.3,3],
'max':[1.5,3.4,1.7,4.3,1.51,3,1.2,4]})
print(df)
ax=df.plot.bar(x='type',y='max',stacked=True,bottom=df['min'])
ax.plt(x='type',y='Nx')
This is the result:
type size Nx min max
0 A a 4.3 0.50 1.50
1 A b 2.4 2.50 3.40
2 B c 2.5 0.70 1.70
3 B d 4.4 3.20 4.30
4 C e 3.5 0.51 1.51
5 C f 1.8 2.00 3.00
6 D g 4.5 0.30 1.20
7 D h 2.8 3.00 4.00
How can I plot this data with just one bar column per type (A, B, C, ...), and then overlay a scatter plot of Nx for each type?
You can add a new column called height equal to max - min since the plt.bar method takes a height parameter, then reindex the DataFrame by ['type','size']. Then loop through the levels of this multiindex DataFrame and plot a bar with a different color for each unique type and size combination.
This also requires you to define your own color palette. I chose a discrete color palette from plt.cm and mapped integer values to each color. As you are looping through each unique type and size, you can have a counter for the inner most loop to ensure that each bar within the same type has a different color.
NOTE: this does make the assumption that there aren't multiple rows with the same type and size.
To show this is generalizable, I added another bar of type 'D' and size 'i' and it appears as a distinct bar in the plot.
import pandas as pd
import matplotlib.pyplot as plt
## added a third size to type D
df = pd.DataFrame({
'type':['A','A','B','B','C','C','D','D','D'],
'size':['a','b','c','d','e','f','g','h','i'],
'Nx':[4.3,2.4,2.5,4.4,3.5,1.8,4.5,2.8,5.6],
'min':[0.5,2.5,0.7,3.2,0.51,2,0.3,3,4.8],
'max':[1.5,3.4,1.7,4.3,1.51,3,1.2,4,5.3]})
## create a height column for convenience
df['height'] = df['max'] - df['min']
df_grouped = df.set_index(['type','size'])
## create a list of as many colors as there are categories
cmap = plt.get_cmap('Accent', 10)  # plt.cm.get_cmap was removed in Matplotlib 3.9
## loop through the levels of the grouped DataFrame
for each_type, df_type in df_grouped.groupby(level=0):
    color_idx=0
    for each_size, df_type_size in df_type.groupby(level=1):
        color_idx += 1
        plt.bar(x=[each_type]*len(df_type_size), height=df_type_size['height'], bottom=df_type_size['min'], width=0.4,
                edgecolor='grey', color=cmap(color_idx))
        plt.scatter(x=[each_type]*len(df_type_size), y=df_type_size['Nx'], color=cmap(color_idx))
plt.ylim([0, 7])
plt.show()
I have a dataset with two columns, and I would like to show how one feature varies with the binary output value.
data
id Column1 output
1 15 0
2 80 1
3 120 1
4 20 0
... ... ...
I would like to draw a plot in Python where the x-axis contains the values of Column1 and the y-axis contains the percentage of positive outputs.
I already know that the plot should have the shape of an exponential function: when Column1 has smaller values I get more positive outputs than when it has larger values.
An exponential-looking plot needs two lists of the same length, one for each axis.
Try this:
import matplotlib.pyplot as plt
# x-axis points list (must have the same length as yL, otherwise plt.plot raises a ValueError)
xL = [5,10,15,20,25]
# y-axis points list
yL = [100,50,25,12,10]
plt.plot(xL, yL)
plt.axis([0, 35, 0, 200])
plt.show()
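To get the y values from the actual data rather than typing them by hand, one option is to bin Column1 and take the mean of the 0/1 output in each bin, which is the fraction of positives. This is a rough sketch under assumptions: the four rows are the ones shown in the question, and the bin edges 0, 50, 100, 150 are hypothetical and should be adapted to the real range of Column1.

```python
import pandas as pd

# The four example rows from the question
data = pd.DataFrame({
    'Column1': [15, 80, 120, 20],
    'output': [0, 1, 1, 0],
})

# Hypothetical bin edges; choose ones that suit the real data
bins = [0, 50, 100, 150]
data['bin'] = pd.cut(data['Column1'], bins=bins)

# The mean of a 0/1 column is the share of positives; scale it to percent
pct_positive = data.groupby('bin', observed=False)['output'].mean() * 100
print(pct_positive)
```

The resulting series can then be drawn with pct_positive.plot(kind='bar'), or passed to plt.plot using the bin midpoints as x values.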
I have a dataframe like this and wish to make a frequency histogram.
1.19714
1.04872
0.188158
1
1.02339
1
1
1
0.38496
1.31858
1
1.33654
0.945736
1.00877
0.413445
0.810127
1
0.625
0.492857
0.187156
0.95847
I want to plot a frequency histogram, with x axis bins from -1 to 1. How can I do this in pandas?
pandas has a built-in histogram function.
Assuming your dataframe is called df:
import numpy as np
df.hist(bins=np.linspace(-1, 1, 21))  # 20 bins of width 0.1 covering [-1, 1]; values outside that range are ignored
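One thing worth checking with explicit bin edges is whether the upper limit is really covered: np.arange excludes its stop value, so np.arange(-1, 1, 0.1) ends near 0.9 and the many values equal to 1 would fall outside the last bin. The sketch below uses np.histogram (the function matplotlib's hist relies on) with the sample values from the question to compare the two sets of edges; the variable names are my own.

```python
import numpy as np

# Values copied from the sample column in the question
values = np.array([
    1.19714, 1.04872, 0.188158, 1, 1.02339, 1, 1, 1, 0.38496, 1.31858, 1,
    1.33654, 0.945736, 1.00877, 0.413445, 0.810127, 1, 0.625, 0.492857,
    0.187156, 0.95847,
])

edges_arange = np.arange(-1, 1, 0.1)      # stop value excluded: last edge is about 0.9
edges_linspace = np.linspace(-1, 1, 21)   # 21 edges give 20 bins; last edge is exactly 1.0

counts_arange, _ = np.histogram(values, bins=edges_arange)
counts_linspace, _ = np.histogram(values, bins=edges_linspace)

# Number of values that actually land in a bin for each choice of edges
print(counts_arange.sum(), counts_linspace.sum())
```

Values above 1 are dropped in both cases, which matches the requested axis range of -1 to 1.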
I have a pandas dataframe of 434300 rows with the following structure:
x y p1 p2
1 8.0 1.23e-6 10 12
2 7.9 4.93e-6 10 12
3 7.8 7.10e-6 10 12
...
.
...
4576 8.0 8.85e-6 5 16
4577 7.9 2.95e-6 5 16
4778 7.8 3.66e-6 5 16
...
...
...
434300 ...
with the key point being that for every block of varying x,y data there are p1 and p2 that do not vary. Note that these blocks of constant p1,p2 are of varying length so it is not simply a matter of slicing the data every n rows.
I would like to plot the values p1 vs p2 in a graph, but would only like to plot the unique points.
If I plot p1 vs p2 using:
In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: ax.plot(df['p1'],df['p2'])
In [4]: len(ax.lines[0].get_xdata())
Out[4]: 434300
I see that matplotlib is plotting each individual line of data which is to be expected.
What is the neatest way to plot only the unique points from columns p1 and p2?
Here is a csv of a small example dataset that has all of the important features of my dataset.
Just drop the duplicates and plot:
df.drop_duplicates(subset=['p1', 'p2']).plot(x='p1', y='p2')
You can slice the p1 and p2 columns from the data frame and then drop duplicates before plotting.
sub_df = df[['p1','p2']].drop_duplicates()
fig, ax = plt.subplots(1,1)
ax.plot(sub_df['p1'],sub_df['p2'])
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('exampleData.csv')
d = data[['p1', 'p2']].drop_duplicates()
plt.plot(d['p1'], d['p2'], 'o')
plt.show()
After looking at this answer to a similar question in R (which pandas DataFrames are modeled on), I found the pandas function pandas.DataFrame.drop_duplicates. If we modify my example code as follows:
In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: df.drop_duplicates(subset=['p1','p2'],inplace=True)
In [4]: ax.plot(df['p1'],df['p2'])
In [5]: len(ax.lines[0].get_xdata())
Out[5]: 15
We see that this restricts df to only the unique points to be plotted. An important point is that you must pass a subset to drop_duplicates so that it only uses those columns to determine duplicate rows.
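The effect of the subset argument can be seen on a tiny frame. This is a small sketch using six of the rows shown in the question; without subset, every row counts as unique because x and y differ.

```python
import pandas as pd

# Six rows taken from the table in the question: two blocks of constant p1, p2
df = pd.DataFrame({
    'x': [8.0, 7.9, 7.8, 8.0, 7.9, 7.8],
    'y': [1.23e-6, 4.93e-6, 7.10e-6, 8.85e-6, 2.95e-6, 3.66e-6],
    'p1': [10, 10, 10, 5, 5, 5],
    'p2': [12, 12, 12, 16, 16, 16],
})

# All columns are compared, so no row is considered a duplicate
no_subset = df.drop_duplicates()

# Only p1 and p2 are compared, so one row per unique (p1, p2) pair remains
with_subset = df.drop_duplicates(subset=['p1', 'p2'])

print(len(no_subset))    # 6
print(len(with_subset))  # 2
```

Note that this version returns a new frame instead of using inplace=True, so the original data stays available for other plots.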