I would like to create a graph that plots multiple lines onto one graph.
Here is an example dataframe (my actual dataframe is much larger):
df = pd.DataFrame({'first': [1, 2, 3], 'second': [1, 1, 5], 'Third' : [8,7,9], 'Person' : ['Ally', 'Bob', 'Jim']})
The lines I want plotted are rowwise i.e. a line for Ally, a line for Jim and a line for Bob
You can use the built-in plotting functions as long as the DataFrame has the right shape. The right shape in this case means the Person names as columns and the former columns as the index. So all you have to do is set Person as the index and transpose:
ax = df.set_index("Person").T.plot()
ax.set_xlabel("My label")
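For reference, this is the reshaped frame that gets plotted, with one column per person and the former first/second/Third columns as the index:
print(df.set_index("Person").T)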
First set Person as the index, then retrieve the values for each row:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'first': [1, 2, 3], 'second': [1, 1, 5], 'Third' : [8,7,9], 'Person' : ['Ally', 'Bob', 'Jim']})
df = df.set_index('Person')
for person in df.index:
    val = df.loc[person].values
    plt.plot(val, label=person)
plt.legend()
plt.show()
As for how you want to handle the first/second/Third labels, I'll let you judge for yourself; one option is sketched below.
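For example, assuming you want the original column names as x-tick labels, add this before plt.show():
plt.xticks(range(len(df.columns)), df.columns)  # 'first', 'second', 'Third'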
In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non-missing columns in the ith row?
For example, in the 0th row I'd like to get back columns col1 and col3:
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing it's something along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
Edit:
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan
# set row number
row_number = 0
# select only the non-NaN columns in that row
df.loc[row_number, ~np.isnan(df.values)[row_number]]
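Alternatively, staying within pandas (a minor variant of the same idea, not necessarily faster), you can index the row with its own notna() mask:
df.loc[row_number, df.loc[row_number].notna()]  # keep only the non-missing columns of that row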
I have a dataframe that has four columns of data. The first is the frame number of a video, the second and third are the x and y positions of particles in the image, and the fourth is the number that has been assigned to each particle to keep track of them separately (example below):
from pandas import DataFrame
Data = {'frame': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
        'x': ['First value', 'Second value', ...],
        'y': ['First value', 'Second value', ...],
        'particle': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
        }
df = DataFrame(Data, columns=['frame', 'x', 'y', 'particle'])
My dataframe also contains an index which has the same values as the frame column.
My goal is to first group by particle, then sort by frame number within the particle groups. There are many examples of this online, but the index seems to be causing some issues. I have tried this:
grouped_tracks1 = tracks1.sort_values('frame').groupby('particle').reset_index()
but that returns the following error:
'frame' is both an index level and a column label, which is ambiguous.
I have also tried sorting by index,
grouped_tracks1 = tracks1.sort_index().groupby('particle').reset_index()
however, this also gives an error and I would prefer just to sort by frame and completely ignore the index. At the end of the sorting and grouping, I would like to have a dataframe that is grouped by particle and sorted within those groups.
Any suggestions on how to fix this? Thanks!
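One way to sidestep the ambiguity (a sketch, assuming the index really is just a duplicate of the frame column and can be discarded) is to drop the index first and then sort by particle and frame together, which orders the rows by frame within each particle:
tracks1 = tracks1.reset_index(drop=True)                      # discard the duplicate index
grouped_tracks1 = tracks1.sort_values(['particle', 'frame'])  # sorted by frame within each particle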
The goal is to create a new column in a pandas DataFrame that stores the value of a KS D-statistic, df['ks']. The KS statistic is computed between two groups of columns in that DataFrame, grp1 and grp2:
# sample dataframe
import pandas as pd
import numpy as np
from scipy import stats
dic = {'gene': ['x', 'y', 'z', 'n'],
       'cell_a': [1, 5, 8, 9],
       'cell_b': [8, 5, 4, 9],
       'cell_c': [8, 6, 1, 1],
       'cell_d': [1, 2, 7, 1],
       'cell_e': [5, 7, 9, 1],
       }
df = pd.DataFrame(dic)
df.set_index('gene', inplace=True)
df['ks'] = np.nan
# sample groups
grp1 = ['cell_a','cell_b']
grp2 = ['cell_d','cell_e']
So the D-statistic for gene x would be stats.ks_2samp([1, 8], [1, 5])[0], for gene y it would be stats.ks_2samp([5, 5], [2, 7])[0], etc. My attempt is below:
# attempt 1 to fill in KS stat
for idx, row in df.iterrows():
    df.ix[idx, 'ks'] = stats.ks_2samp(df[grp1], df[grp2])[0]
However, when I attempt to fill the ks series, I get the following error:
ValueError: object too deep for desired array
My question has two parts: 1) What does it mean for an object to be "too deep for an array", and 2) how can I accomplish the same thing without iteration?
The KS calculation in the loop was getting a "too deep" error because I needed to pass it a 1-D array for each distribution to test:
for idx, row in df.iterrows():
    df.loc[idx, 'ks'] = stats.ks_2samp(df.loc[idx, grp1], df.loc[idx, grp2])[0]
My previous attempt passed 2-D arrays (whole groups of columns) instead; that is what was causing them to be "too deep".
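If you want to avoid writing the loop yourself, a row-wise apply does the same thing (apply still iterates under the hood, so this is mainly a readability win):
df['ks'] = df.apply(
    lambda row: stats.ks_2samp(row[grp1].values, row[grp2].values)[0],
    axis=1)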
Using pandas, Python 3, working in Jupyter.
I've made the graph below using the following code:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind = 'bar', stacked = True, color = ['red', 'blue'], grid = False)
print(temp3)
And then tried to do the same, but with divisions for Gender. I wanted to make this:
So I wrote this code:
And made this monstrosity. I'm unfamiliar with pivot tables in pandas, and after reading the documentation I am still confused. I'm assuming that aggfunc affects the values shown, but not the indices. How can I separate the loan status so that it reads as different colors for 'Y' and 'N'?
Trying an approach similar to the one used for temp3 simply yields a KeyError:
temp3x = pd.crosstab(df['Credit_History'], df['Loan_Status', 'Gender'])
temp3x.plot(kind = 'bar', stacked = True, color = ['red', 'blue'], grid = False)
print(temp3)
How can I make the 'Y' and 'N' appear separately as they are in the first graph, but for all 4 bars instead of using just 2 bars?
You need to make a new column, e.g. loan_status_word, and then pivot.
df['loan_status_word'] = df['Loan_Status'].map({0: 'No', 1: 'Yes'})
df.pivot_table(values='Loan_Status',
               index=['Credit_History', 'Gender'],
               columns='loan_status_word',
               aggfunc='size')
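The pivoted table can then be plotted the same way as temp3, assuming you want the same stacked-bar styling:
pivoted = df.pivot_table(values='Loan_Status',
                         index=['Credit_History', 'Gender'],
                         columns='loan_status_word',
                         aggfunc='size')
pivoted.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)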
Try to format your data such that each item you want in your legend is in a single column.
df = pd.DataFrame(
    [
        [3, 1],
        [4, 1],
        [1, 4],
        [1, 3]
    ],
    pd.MultiIndex.from_product([(1, 0), list('MF')], names=['Credit', 'Gender']),
    pd.Index(['Yes', 'No'], name='Loan Status')
)
df
Then you can plot
df.plot.bar(stacked=True)
Below is the code to achieve the desired result:
temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
temp4.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
I have a MultiIndexed DataFrame, df, containing the explanatory variables, and a DataFrame, df_Y, containing the response variables:
import numpy as np
import pandas as pd
from sklearn import linear_model

# Create DataFrame for explanatory variables
arrays = [['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
          [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['X1', 'X2'])
# Create DataFrame for response variables
df_Y = pd.DataFrame([1, 2, 3], columns=['Y'])
I am able to perform regression on just the single level DataFrame with index foo
df_X = df.loc['foo'] # using only 'foo'
reg = linear_model.Ridge().fit(df_X, df_Y)
reg.coef_
Problem: However, since the Y variables are the same for both levels foo and bar, we could have twice as many regression samples if we also include bar.
What is the best way to reshape/collapse/unstack the multilevel DataFrame so we can use all of the data for the regression? Other levels may have fewer rows than df_Y.
Sorry for the confusing wording; I am unsure of the correct terms/phrasing.
The first index level can be dropped, and then a join will work:
df.index = df.index.droplevel()
df = df.join(df_Y)
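A minimal end-to-end sketch, assuming df_Y is indexed by the same labels as the remaining index level (1, 2, 3 in this example) so that the join lines up:
df_Y.index = [1, 2, 3]             # align df_Y with the remaining index level
df.index = df.index.droplevel()    # drop the 'foo'/'bar' level
joined = df.join(df_Y)             # 6 rows of X1, X2 and Y
reg = linear_model.Ridge().fit(joined[['X1', 'X2']], joined['Y'])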