How can I plot a bubble chart from a dataframe that has been created from a pandas crosstab of another dataframe?
Imports:
import plotly as py
import plotly.graph_objects as go
from plotly.subplots import make_subplots
The crosstab was created using:
df = pd.crosstab(raw_data['Speed'], raw_data['Height'].fillna('n/a'))
The df contains mostly zeros; wherever a non-zero number appears I want a point whose size is controlled by the value. I want to set the index values as the x axis and the column names as the y axis.
The df would look something like:
        10  20  30  40  50
1000     0   0   0   0   5
1100     0   0   0   7   0
1200     1   0   3   0   0
1300     0   0   0   0   0
1400     5   0   0   0   0
I’ve tried using scatter & Scatter like this:
fig.add_trace(go.Scatter(x=df.index.values, y=df.columns.values, size=df.values,
                         mode='lines'),
              row=1, col=3)
This returned a TypeError: 'Module' object not callable.
Any help is really appreciated. Thanks
UPDATE
The answers below are close to what I ended up with, the main difference being that I reference 'Speed' in the melt line:
df = df.reset_index()
df = df.melt(id_vars="Speed")
df = df.rename(columns={"index": "Engine Speed",
                        "variable": "Height",
                        "value": "Count"})
df = df[df != 0].dropna()
scale = 1000
fig.add_trace(go.Scatter(x=df["Speed"], y=df["Height"], mode='markers', marker_size=df["Count"]/scale),
              row=1, col=3)
This works; however, my main problem now is that the dataset is huge and Plotly is really struggling to deal with it.
Update 2
Using Scattergl allows Plotly to deal with the large dataset very well!
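For reference, the only change needed in the trace from the update above is the trace type; this is just a minimal sketch of the Scattergl version, reusing the same columns and scale as in the update:
import plotly.graph_objects as go

# go.Scattergl renders with WebGL, which copes much better with large numbers of points
fig.add_trace(
    go.Scattergl(
        x=df["Speed"],
        y=df["Height"],
        mode='markers',
        marker_size=df["Count"] / scale,
    ),
    row=1, col=3,
)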
In this case you can use plotly.express; this is very similar to @Erik's answer but shouldn't return errors.
import pandas as pd
import plotly.express as px
from io import StringIO
txt = """
10 20 30 40 50
1000 0 0 0 0 5
1100 0 0 0 7 0
1200 1 0 3 0 0
1300 0 0 0 0 0
1400 5 0 0 0 0
"""
df = pd.read_csv(StringIO(txt), delim_whitespace=True)
df = df.reset_index()\
       .melt(id_vars="index")\
       .rename(columns={"index": "Speed",
                        "variable": "Height",
                        "value": "Count"})
fig = px.scatter(df, x="Speed", y="Height",size="Count")
fig.show()
UPDATE
In case you get an error, please check your pandas version with pd.__version__ and try running this line by line:
df = pd.read_csv(StringIO(txt), delim_whitespace=True)
df = df.reset_index()
df = df.melt(id_vars="index")
df = df.rename(columns={"index":"Speed",
"variable":"Height",
"value":"Count"})
and report in which line it breaks.
I recommend using a tidy format to represent your data. We say a dataframe is tidy if and only if:
Each row is an observation
Each column is a variable
Each value must have its own cell
To create a tidier dataframe you can do:
df = pd.crosstab(raw_data["Speed"], raw_data["Height"])
df.reset_index(level=0, inplace=True)
df = df.melt(id_vars="Speed", var_name="Height", value_name="Counts")
   Speed Height  Counts
0   1000     10       2
1   1100     20       1
2   1200     10       1
3   1200     30       1
4   1300     40       1
5   1400     50       1
The next step is to do the actual plotting.
# when scale is increased bubbles will become larger
scale = 10

# create the scatter plot
scatter = go.Scatter(
    x=df.Speed,
    y=df.Height,
    marker_size=df["Counts"] * scale,
    mode='markers')
fig = go.Figure(scatter)
fig.show()
This will create a plot as shown below.
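One small follow-up, since the crosstab is mostly zeros: you may want to drop the zero-count combinations before building the trace so they don't show up as zero-size markers. A minimal sketch, reusing the tidy frame and scale from above:
# keep only the (Speed, Height) combinations that actually occurred
nonzero = df[df["Counts"] > 0]

scatter = go.Scatter(
    x=nonzero.Speed,
    y=nonzero.Height,
    marker_size=nonzero["Counts"] * scale,
    mode='markers')
fig = go.Figure(scatter)
fig.show()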
I prefer to use matrix multiplication for coding because it's so much more efficient than iterating, but I'm curious how to do this when the dimensions are different.
I have two different dataframes
A:
Orig_vintage
Q12018    185
Q22018    200
and B:
default_month   1   2   3
orig_vintage
Q12018          0  25  35
Q22018          0  15  45
Q32018          0  35  65
and I'm trying to divide the columns of B by A, so that the B dataframe becomes (note I've rounded made-up percentages):
default_month    1     2     3
orig_vintage
Q12018           0   .03   .04
Q22018           0   .04   .05
Q32018           0   .06   .07
But the bottom line is that I want to divide the monthly defaults by the total origination figure to get a monthly default %.
The first step is to get the data side by side with a right join().
Then divide all columns by the required value (see Divide multiple columns by another column in pandas).
The required value, as I understand it, is the sum if the join did not give a value.
import pandas as pd
import io
df1 = pd.read_csv(
    io.StringIO("""Orig_vintage,Unnamed: 1\nQ12018,185\nQ22018,200\n"""), sep=","
)
df2 = pd.read_csv(
    io.StringIO(
        """default_month,1,2,3\nQ12018,0.0,25.0,35.0\nQ22018,0.0,15.0,45.0\nQ32018,0.0,35.0,65.0\n"""
    ),
    sep=",",
)
df1.set_index("Orig_vintage").join(df2.set_index("default_month"), how="right").pipe(
    lambda d: d.div(d["Unnamed: 1"].fillna(d["Unnamed: 1"].sum()), axis=0)
)
default_month  Unnamed: 1  1          2          3
Q12018                  1  0   0.135135   0.189189
Q22018                  1  0   0.075      0.225
Q32018                nan  0   0.0909091  0.168831
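A small variant, in case the auto-generated "Unnamed: 1" header is confusing: give the totals column an explicit name when reading and drop it after the division. This is only a sketch over the same toy CSV snippets, and "total" is a made-up column name:
import io
import pandas as pd

# name the totals column explicitly instead of relying on "Unnamed: 1"
df1 = pd.read_csv(io.StringIO("""Orig_vintage,total\nQ12018,185\nQ22018,200\n"""), sep=",")
df2 = pd.read_csv(
    io.StringIO("""default_month,1,2,3\nQ12018,0.0,25.0,35.0\nQ22018,0.0,15.0,45.0\nQ32018,0.0,35.0,65.0\n"""),
    sep=",",
)

result = (
    df1.set_index("Orig_vintage")
       .join(df2.set_index("default_month"), how="right")
       .pipe(lambda d: d.div(d["total"].fillna(d["total"].sum()), axis=0))
       .drop(columns="total")   # keep only the default-rate columns
)
print(result)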
This is a piece of my code, but I don't know how to get this array as a new column Height in the original CSV file, in the following format:
Date        Level  Height
01-01-2021     45       0
02-01-2021     43       0
03-01-2021     47       1
04-01-2021     46       0
.....
import pandas as pd
from scipy.signal import find_peaks
import matplotlib.pyplot as plt
bestand = pd.read_csv('file.csv', skiprows=1, usecols=[0, 1] , names=['date', 'level'])
bestand = bestand['level']
indices = find_peaks(bestand, height= 37, threshold=None, distance=None)
height = indices[1]['peak_heights']
print(height)
I think you want to assign a column named height that takes the value 1 when level is a peak according to find_peaks(). If so:
# Declare column full of zeros
bestand['height'] = 0
# Get row number of observations that are peaks
idx = find_peaks(x=bestand['level'], height=37)[0]
# Select rows in `idx` and replace their `height` with 1
bestand.iloc[idx, 2] = 1
Which returns this:
date level height
0 01-01-2021 45 0
1 02-01-2021 43 0
2 03-01-2021 47 1
3 04-01-2021 46 0
I'm not sure I understood your question.
You just want to save the results?
bestand['Height'] = 0
bestand.loc[indices[0], 'Height'] = 1   # indices[0] holds the row positions of the peaks
bestand.to_csv('file.csv')
I have grouped the dataset by month and day and added a third column to count the data on each day.
Dataframe before:
      month  day
0         1    1
1         1    1
2         1    1
..
3000     12   31
3001     12   31
3002     12   31
Dataframe now:
     month  day  count
0        1    1    300
1        1    2    500
2        1    3    350
..
363     12   28    700
364     12   29   1300
365     12   30   1000
How do I create a subplot for each month, where x is the day and y is the count?
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df= pd.read_csv('/home/rand/Downloads/Flights.csv')
by_month= df.groupby(['month','day']).day.agg('count').to_frame('count').reset_index()
I'm a beginner in the data science field.
Try this:
fig, ax = plt.subplots()
ax.set_xticks(df['day'].unique())
df.groupby(["day", "month"]).mean()['count'].unstack().plot(ax=ax)
The above code will give you 12 lines representing the months in one plot. If you want 12 individual subplots for those months, try this:
fig = plt.figure()
for i in range(1, 13):
    df_monthly = df[df['month'] == i]           # select dataframe with month = i
    ax = fig.add_subplot(12, 1, i)              # add subplot in the i-th position on a 12x1 grid
    ax.plot(df_monthly['day'], df_monthly['count'])
    ax.set_xticks(df_monthly['day'].unique())   # set x axis ticks
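A small follow-up, assuming you are running this as a plain script rather than in a notebook: with 12 stacked axes the tick labels tend to collide, so it helps to enlarge the figure and tighten the layout before showing it.
fig.set_size_inches(8, 20)   # arbitrary size; tall enough that 12 stacked subplots stay readable
fig.tight_layout()
plt.show()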
I think you could use pandas.DataFrame.pivot to change the shape of your table to make it more convenient for the plot. So in your code you could do something like this:
plot_data= df.pivot(index='day', columns='month', values='count')
plot_data.plot()
plt.show()
This assumes you have an equal number of days in every month, since in the sample you included, month 12 only has 30 days. More on pivot.
Try this:
import numpy as np

df = pd.DataFrame({
    'month': list(range(1, 13))*3,
    'days': np.random.randint(1, 11, 12*3),
    'count': np.random.randint(10, 20, 12*3)})
df.set_index(['month', 'days'], inplace=True)
df = df.sort_index()
df = df.groupby(level=[0, 1]).sum()
Code to plot it:
df.reset_index(inplace=True)
df.pivot(index='days', columns='month', values='count').fillna(0).plot()
I have a data frame similar to this:
import pandas as pd
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age', 'data']
print(df) #printing dataframe.
I performed the groupby function on it to get the required output.
df['COUNTER'] =1 #initially, set that counter to 1.
group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
print(group_data)
Now I want to plot the output using matplotlib. Please help me with it; I am not able to figure out how to start and what to do.
I want to plot using the counter value, something similar to a bar graph.
Try:
group_data = group_data.reset_index()
in order to get rid of the MultiIndex that the groupby() has created for you.
Your print(group_data) will give you this:
In [24]: group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
In [25]: print(group_data)
age  data
1    ONE      3
     THREE    1
     TWO      1
2    ONE      2
     TWO      1
3    ONE      1
     TWO      1
Name: COUNTER, dtype: int64
Whereas resetting will 'simplify' the new index:
In [26]: group_data = group_data.reset_index()
In [27]: group_data
Out[27]:
   age   data  COUNTER
0    1    ONE        3
1    1  THREE        1
2    1    TWO        1
3    2    ONE        2
4    2    TWO        1
5    3    ONE        1
6    3    TWO        1
Then depending on what it is exactly that you want to plot, you might want to take a look at the Matplotlib docs
EDIT
I now read more carefully that you want to create a 'bar' chart.
If that is the case then I would take a step back and not use reset_index() on the groupby result. Instead, try this:
In [46]: fig = group_data.plot.bar()
In [47]: fig.figure.show()
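If grouped bars (one cluster per age, one bar per data value within the cluster) are closer to what you want, an unstack-based variant works as well; a minimal sketch using the same group_data Series:
# one cluster of bars per age, one bar per 'data' value within the cluster
ax = group_data.unstack(fill_value=0).plot.bar(rot=0)
ax.set_xlabel('age')
ax.set_ylabel('count')
ax.figure.show()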
I hope this helps
Try with this:
# This is a great tool to add plots to jupyter notebook
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Params get plot bigger
plt.rcParams["axes.labelsize"] = 16
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["ytick.labelsize"] = 14
plt.rcParams["legend.fontsize"] = 12
plt.rcParams["figure.figsize"] = [15, 7]
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age', 'data']
df['COUNTER'] = 1
group_data = df.groupby(['age','data']).sum()[['COUNTER']].plot.bar(rot = 90) # If you want to rotate labels from x axis
_ = group_data.set(xlabel = 'xlabel', ylabel = 'ylabel'), group_data.legend(['Legend']) # you can add labels and legend
I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value it will move from X1 to X2, which are several millimeters apart. Then it will move to the next line and start over again.
I am able to process the data in Python with pandas/numpy, but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however, a line or a few will be missing, making it impossible to process and plot. Currently I have not been able to fix this automatically and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line?) For instance, by having a 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter = ',', skiprows=0)
rows = [-1] + np.where(df['X']=='New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]

maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].iloc[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions; however, I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and others using fillna
Q2: Passing separated StringIO to read_csv is easier (change if you use Python 3)
# Python 2 imports; on Python 3 use "from io import StringIO" and drop the unicode() call
import numpy as np
import pandas as pd
from StringIO import StringIO

# read the file and split the input on the 'New' separator
f = open('temp.csv', 'r')
chunks = f.read().split('New')

# read each chunk as a separate dataframe, using the first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y backward / forward, assuming y has a single value per chunk
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the remaining columns with zeros
    df = df.fillna(0)
    # revert the index back to a column
    return df.reset_index(drop=False)
dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
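To connect this back to the plotting step from the question, a minimal sketch (assuming matplotlib is imported as plt, and that column 0 is X and column 2 is property 1 after the reset_index above):
import matplotlib.pyplot as plt

# one line per Y scan, using the padded chunks
for d in dfs:
    plt.plot(d[0], d[2], label='Y = {}'.format(int(d[1].iloc[0])))
plt.xlabel('X')
plt.ylabel('property 1')
plt.legend()
plt.show()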
First Question
I've included print statements inside the function to explain how it works.
In [89]:
def replace_missing(df, Ids):
    # check what the missing values are
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44,48)
final_df = df1.groupby(df1[1] , as_index = False).apply(replace_missing , Ids).reset_index(drop = True)
final_df
Out[91]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
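And if the goal is still the list of per-line arrays used for plotting (data2 in the original code), a minimal sketch over these separated groups would be:
# column 2 holds property 1 for each separated scan line (column 3 would be property 2)
data2 = [g[2].tolist() for g in separate]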