Plotting a CDF from a multiclass pandas dataframe - python

I understand that the empiricaldist package provides a Cdf class, as per the documentation.
However, I find it tricky to plot my dataframe when a column has multiple values.
df.head()
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
| | trip_id | seconds_start | seconds_end | duration | distance | speed | acceleration | lat_start | lon_start | lat_end | lon_end | travelmode |
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
| 0 | 318410 | 1461743310 | 1461745298 | 1988 | 5121.49 | 2.58 | 0.00130 | 41.162687 | -8.615425 | 41.177888 | -8.597549 | car |
| 1 | 318411 | 1461749359 | 1461750290 | 931 | 1520.71 | 1.63 | 0.00175 | 41.177949 | -8.597074 | 41.177839 | -8.597574 | bus |
| 2 | 318421 | 1461806871 | 1461806941 | 70 | 508.15 | 7.26 | 0.10370 | 37.091240 | -8.211239 | 37.092322 | -8.206681 | foot |
| 3 | 318422 | 1461837354 | 1461838024 | 670 | 1207.39 | 1.80 | 0.00269 | 37.092082 | -8.205060 | 37.091659 | -8.206462 | car |
| 4 | 318425 | 1461852790 | 1461853845 | 1055 | 1470.49 | 1.39 | 0.00132 | 37.091628 | -8.202143 | 37.092095 | -8.205070 | foot |
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
I would like to plot a CDF for each travel mode in the column travelmode.
groups = df.groupby('travelmode')
However, I don't really understand how this could be done from the documentation.

You can plot them in a loop, filtering the rows for each travel mode before building the CDF:
import matplotlib.pyplot as plt
from empiricaldist import Cdf

def decorate_plot(title):
    ''' Adds labels to plot '''
    plt.xlabel('Outcome')
    plt.ylabel('CDF')
    plt.title(title)

for tm in df['travelmode'].unique():
    for col in df.columns:
        if col != 'travelmode':
            # Create a new figure for each plot
            fig, ax = plt.subplots()
            # Use only the rows that belong to this travel mode
            d4 = Cdf.from_seq(df.loc[df['travelmode'] == tm, col])
            d4.plot()
            decorate_plot(f"{tm} - {col}")
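If you would rather have all travel modes on one set of axes, a rough sketch using groupby could look like the following. It computes the empirical CDF by hand with NumPy instead of empiricaldist, and the helper names ecdf_by_group and plot_ecdf_by_group are made up for this illustration. The toy frame copies travelmode and duration from the five rows shown in the question.

```python
import numpy as np
import pandas as pd


def ecdf_by_group(df, group_col, value_col):
    """Return {group: (sorted values, cumulative probabilities)}."""
    curves = {}
    for name, values in df.groupby(group_col)[value_col]:
        xs = np.sort(values.dropna().to_numpy())
        # The i-th smallest value has cumulative probability i / n
        ps = np.arange(1, len(xs) + 1) / len(xs)
        curves[name] = (xs, ps)
    return curves


def plot_ecdf_by_group(df, group_col, value_col):
    """Draw one step curve per group on a shared set of axes."""
    # Imported here so that ecdf_by_group only needs numpy and pandas
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, (xs, ps) in ecdf_by_group(df, group_col, value_col).items():
        ax.step(xs, ps, where='post', label=name)
    ax.set_xlabel(value_col)
    ax.set_ylabel('CDF')
    ax.legend(title=group_col)
    return ax


# Two columns from the df.head() output in the question
trips = pd.DataFrame({
    'travelmode': ['car', 'bus', 'foot', 'car', 'foot'],
    'duration': [1988, 931, 70, 670, 1055],
})
curves = ecdf_by_group(trips, 'travelmode', 'duration')
```

Calling plot_ecdf_by_group(trips, 'travelmode', 'duration') followed by plt.show() would then draw one curve per travel mode.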

Related

Plot multiple bar plots for multiple columns

I have a dataset that looks roughly like the table below.
I need to create a barplot for each column TS1 to TS5 that counts the number of each item in that column. The items are one of the following: NOT_SEEN NOT_ABLE HIGH_BAR and numerical values between 110 and 140 separated by 2 (so 110, 112, 114 etc).
I have found a way to do this that works fine, but what I am asking is whether there is a way to use a loop or something similar so I don't have to copy and paste the same code 5 times (once for each of the 5 columns).
This is what I have tried, and it works:
num_range = list(range(110,140, 2))
OUTCOMES = ['NOT_SEEN', 'NOT_ABLE', 'HIGH_BAR']
OUTCOMES.extend([str(num) for num in num_range])
OUTCOMES = CategoricalDtype(OUTCOMES, ordered = True)
fig, ax =plt.subplots(2, 3, sharey=True)
fig.tight_layout(pad=3)
The block below is what I copy 5 times, changing only the title (Testing 1, Testing 2, etc.) and the column name (TS1, TS2, ...) in the first line.
df["outcomes"] = df["TS1"].astype(OUTCOMES)
bpt=sns.countplot(x= "outcomes", data=df, palette='GnBu', ax=ax[0,0])
plt.setp(bpt.get_xticklabels(), rotation=60, size=6, ha='right')
bpt.set(xlabel='')
bpt.set_title('Testing 1')
The following code then comes after the 5 copies of the above.
ax[1,2].set_visible(False)
plt.show()
I am sure there is a way to do this that is much better but I'm new to all this.
Also, I need to make sure the bars of the barplot are ordered going left to right as: NOT_SEEN NOT_ABLE HIGH_BAR and 110, 112, 114 etc
Using python 2.7 (not my choice unfortunately) and pandas 0.24.2.
+----+------+------+----------+----------+----------+----------+----------+
| ID | VIEW | YEAR | TS1 | TS2 | TS3 | TS4 | TS5 |
+----+------+------+----------+----------+----------+----------+----------+
| AA | NO | 2005 | | 134 | | HIGH_BAR | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | YES | 2015 | | | NOT_SEEN | | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | YES | 2010 | 118 | | | | NOT_ABLE |
+----+------+------+----------+----------+----------+----------+----------+
| BB | NO | 2020 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | YES | 2020 | | | | NOT_SEEN | |
+----+------+------+----------+----------+----------+----------+----------+
| AA | NO | 2010 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | NO | 2015 | | | | | 132 |
+----+------+------+----------+----------+----------+----------+----------+
| BB | YES | 2010 | | HIGH_BAR | | 140 | NOT_ABLE |
+----+------+------+----------+----------+----------+----------+----------+
| AA | YES | 2020 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | NO | 2010 | | | | 112 | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | YES | 2015 | | | NOT_ABLE | | HIGH_BAR |
+----+------+------+----------+----------+----------+----------+----------+
| BB | NO | 2020 | | | | 145 | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | NO | 2015 | | 110 | | | |
+----+------+------+----------+----------+----------+----------+----------+
| AA | YES | 2010 | HIGH_BAR | | | NOT_SEEN | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | YES | 2015 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| AA | NO | 2020 | | | | 118 | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | YES | 2015 | | 180 | NOT_ABLE | | |
+----+------+------+----------+----------+----------+----------+----------+
| BB | YES | 2020 | | NOT_SEEN | | | 126 |
+----+------+------+----------+----------+----------+----------+----------+
You can put plotting lines in a function and call it in a for loop automatically changing column, title and axis in each iteration:
fig, axes = plt.subplots(2, 3, sharey=True)
fig.tight_layout(pad=3)

def plotting(column, title, ax):
    df["outcomes"] = df[column].astype(OUTCOMES)
    bpt = sns.countplot(x="outcomes", data=df, palette='GnBu', ax=ax)
    plt.setp(bpt.get_xticklabels(), rotation=60, size=6, ha='right')
    bpt.set(xlabel='')
    bpt.set_title(title)

columns = ['TS1', 'TS2', 'TS3', 'TS4', 'TS5']
titles = ['Testing 1', 'Testing 2', 'Testing 3', 'Testing 4', 'Testing 5']
for column, title, ax in zip(columns, titles, axes.flatten()):
    plotting(column, title, ax)
axes[1,2].set_visible(False)
plt.show()
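On the bar ordering: as far as I can tell, an ordered CategoricalDtype is what keeps the bars in the NOT_SEEN, NOT_ABLE, HIGH_BAR, 110, 112, ... order. The small sketch below checks this without any plotting. It assumes the non-empty TS4 values from the table in the question, and it uses range(110, 142, 2) so that 140 is included, which differs from the range(110, 140, 2) in the question. Values that are not among the categories (145 here) become NaN and are not counted.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Category order = desired left-to-right order of the bars
labels = ['NOT_SEEN', 'NOT_ABLE', 'HIGH_BAR'] + [str(n) for n in range(110, 142, 2)]
outcome_dtype = CategoricalDtype(labels, ordered=True)

# Non-empty TS4 entries from the question's table, kept as strings
ts4 = pd.Series(['HIGH_BAR', 'NOT_SEEN', '140', '112', '145', 'NOT_SEEN', '118'])
as_cat = ts4.astype(outcome_dtype)

# sort=False keeps the category order instead of sorting by frequency
counts = as_cat.value_counts(sort=False)
```

The counts come out in category order, with unused categories included as zeros, which is the same order the bars are drawn in.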

Plotting for next row after the slice

I am plotting the values of columns X and FT according to the value of column CN with the following code:
import matplotlib.pyplot as plt
plt.plot(X[CN==1], FT[CN==1])
plt.plot(X[CN==36], FT[CN==36])
and the data is given as
+-------+-----+----+-------+-------+
| X | N | CN | Vdiff | FT |
+-------+-----+----+-------+-------+
| 524 | 2 | 1 | 0.0 | 0.12 |
| 534 | 2 | 1 | 0.0 | 0.134 |
| 525 | 2 | 1 | 0.0 | 0.154 |
| . | | | | . |
| . | | | | . |
| 5976 | 15 | 14 | 0.0 | 3.54 |
| 5913 | 15 | 14 | 0.1 | 3.98 |
| 5923 | 0 | 15 | 0.0 | 3.87 |
| . | | | | . |
| . | | | | . |
| 33001 | 7 | 36 | 0.0 | 7.36 |
| 33029 | 7 | 36 | 0.0 | 8.99 |
| 33023 | 7 | 36 | 0.1 | 12.45 |
| 33114 | 0 | 37 | 0.0 | 14.33 |
+-------+-----+----+-------+-------+
My graphs come out incomplete, so I need to include the next row in each plot. For example, for the graph of CN==36, plotted as plt.plot(X[CN==36],FT[CN==36]), I also want to use the first row of CN==37. Note that CN values are repetitive.
I have to plot multiple graphs in this way, so general code that works for all of them would be appreciated.
Addition on request in a comment: at the end of each circular shape the curve does not touch its starting edge, so the cycle is incomplete (see, for example, the aqua and green cycles). I want complete cycles, so I need 1 or 2 additional rows of data in each plot.
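One possible approach, sketched under the assumption that the data sits in a pandas DataFrame with columns X, CN and FT: build the boolean mask for the wanted CN value, then shift it down so that the row(s) directly after each matching run are selected too. The helper name rows_with_next is made up for this sketch, and the last row of the toy frame is invented so that there is a row beyond the one being pulled in.

```python
import pandas as pd


def rows_with_next(df, cn_value, extra=1):
    """Select rows where CN == cn_value plus `extra` rows after each such run."""
    mask = df['CN'] == cn_value
    extended = mask.copy()
    for k in range(1, extra + 1):
        # A row is pulled in if the row k positions above it matched
        extended |= mask.shift(k, fill_value=False)
    return df[extended]


# Tail of the question's table; the final row is made up for illustration
data = pd.DataFrame({
    'X': [33001, 33029, 33023, 33114, 33120],
    'CN': [36, 36, 36, 37, 37],
    'FT': [7.36, 8.99, 12.45, 14.33, 15.00],
})
segment = rows_with_next(data, 36)
```

plt.plot(segment['X'], segment['FT']) would then draw the CN==36 rows plus the first CN==37 row, and extra=2 would add one more row. Because the mask is shifted rather than looked up by value, this also works when a CN value occurs in several separate runs.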

Calculate distance between coordinates in two different dataframes

I have two dataframes as follow:
df_customers.head()
| | customer_id | zip_code_prefix | coords |
|----+----------------------------------+-------------------+-------------------------------------------|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 14409 | (-20.509897499999997, -47.3978655) |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 9790 | (-23.72685273154166, -46.54574582941039) |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 1151 | (-23.527788191788307, -46.66030962184773) |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 8775 | (-23.49693002789165, -46.185351975305366) |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 13056 | (-22.98722237101393, -47.151072819246686) |
+----+----------------------------------+-------------------+-------------------------------------------+
df_sellers.head()
| | seller_id | zip_code_prefix | coords |
|----+----------------------------------+-------------------+--------------------------------------------|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | (-22.898536428530225, -47.063125168330544) |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | (-22.382941116125448, -46.94664125419024) |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | (-22.91064096725142, -43.17650983181368) |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | (-23.657250175378767, -46.61075944811122) |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | (-22.971647510075705, -46.53361841170685) |
+----+----------------------------------+-------------------+--------------------------------------------+
I would like to calculate the distance between those coords columns with the haversine library without having to merge the dataframes beforehand (there is a many-to-many relationship between them).
So what I am looking for is a way to merge both dataframes on the fly on the column zip_code_prefix while using the haversine library to calculate the distance between the coordinates in km.
Is that possible?
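It should be possible. Below is a minimal sketch that does perform a merge on zip_code_prefix, but only as a temporary frame inside a function. To keep the example self-contained it uses a hand-written vectorised haversine formula in place of the haversine library, with a mean Earth radius of 6371 km. The function names and the toy coordinates are assumptions made for illustration, not data from the question.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def distances_by_zip(customers, sellers):
    """Pair customers and sellers sharing a zip_code_prefix and add distance_km."""
    pairs = customers.merge(sellers, on='zip_code_prefix',
                            suffixes=('_customer', '_seller'))
    # Unpack the (lat, lon) tuples into two-column float arrays
    cust = np.array(pairs['coords_customer'].tolist(), dtype=float).reshape(-1, 2)
    sell = np.array(pairs['coords_seller'].tolist(), dtype=float).reshape(-1, 2)
    pairs['distance_km'] = haversine_km(cust[:, 0], cust[:, 1],
                                        sell[:, 0], sell[:, 1])
    return pairs


# Toy frames with the same column layout as in the question
customers = pd.DataFrame({
    'customer_id': ['c1', 'c2'],
    'zip_code_prefix': [14409, 9790],
    'coords': [(0.0, 0.0), (-23.5, -46.6)],
})
sellers = pd.DataFrame({
    'seller_id': ['s1', 's2'],
    'zip_code_prefix': [14409, 13023],
    'coords': [(0.0, 1.0), (-22.9, -47.1)],
})
result = distances_by_zip(customers, sellers)
```

With a many-to-many relationship the merged frame has one row per customer/seller pair within each zip code prefix, so it can become large; processing one prefix at a time would be a way to limit memory use.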

Graph python similar to R

I have a table like this one: (Ignore the columns "Index" and "D")
+-------+-----------------------+----------+----------+----------+
| Index | Type | Male | Female | D |
+-------+-----------------------+----------+----------+----------+
| 44 | Life struggles | 2.097324 | 3.681356 | 1.584032 |
| 2 | Writing notes | 2.677262 | 3.354730 | 0.677468 |
| 18 | Empathy | 3.528117 | 4.083051 | 0.554933 |
| 12 | Criminal damage | 2.926650 | 2.374150 | 0.552501 |
| 20 | Giving | 2.650367 | 3.196944 | 0.546577 |
| 21 | Compassion to animals | 3.666667 | 4.178268 | 0.511602 |
| 33 | Mood swings | 2.965937 | 3.451613 | 0.485676 |
| 10 | Funniness | 3.574572 | 3.104907 | 0.469665 |
| 38 | Children | 3.354523 | 3.805415 | 0.450891 |
| 47 | Small - big dogs | 3.221951 | 2.801695 | 0.420256 |
+-------+-----------------------+----------+----------+----------+
and I am trying to produce a similar graph:
I know how to do it in R but not in Python.
I tried this:
sns.stripplot(data=df,y="Male",color="Blue")
sns.stripplot(data=df,y="Female",color="red")
But I don't know how to continue. Does someone have an idea?
This is easily done with matplotlib; it is simply a scatter plot with the categories as y-values.
import matplotlib.pyplot as plt

plt.style.use('ggplot')
fig, ax = plt.subplots()
ax.plot(df['Male'], df['Type'], 'o', color='xkcd:reddish', ms=10, label='Male')
ax.plot(df['Female'], df['Type'], 'o', color='xkcd:teal', ms=10, label='Female')
ax.axvline(3, ls='-', color='k')
ax.set_xlim(1, 5)
ax.set_xlabel('avg response')
ax.set_ylabel('Variable')
ax.legend(bbox_to_anchor=(0.5, 1.02), loc='lower center',
          ncol=2, title='group')
fig.tight_layout()

calculate difference of column values for sets of row indices which are not successive in pandas

Say I have the following table:
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.72694 | 1.4742 | 0.32396 | 0.98535 | 1 | 0.83592 | 0.0046566 | 0.0039465 | 0.04779 | 0.12795 | 0.016108 | 0.0052323 | 0.00027477 | 1.1756 | 1 |
| 2 | 0.74173 | 1.5257 | 0.36116 | 0.98152 | 0.99825 | 0.79867 | 0.0052423 | 0.0050016 | 0.02416 | 0.090476 | 0.0081195 | 0.002708 | 7.48E-05 | 0.69659 | 1 |
| 3 | 0.76722 | 1.5725 | 0.38998 | 0.97755 | 1 | 0.80812 | 0.0074573 | 0.010121 | 0.011897 | 0.057445 | 0.0032891 | 0.00092068 | 3.79E-05 | 0.44348 | 1 |
| 4 | 0.73797 | 1.4597 | 0.35376 | 0.97566 | 1 | 0.81697 | 0.0068768 | 0.0086068 | 0.01595 | 0.065491 | 0.0042707 | 0.0011544 | 6.63E-05 | 0.58785 | 1 |
| 5 | 0.82301 | 1.7707 | 0.44462 | 0.97698 | 1 | 0.75493 | 0.007428 | 0.010042 | 0.0079379 | 0.045339 | 0.0020514 | 0.00055986 | 2.35E-05 | 0.34214 | 1 |
| 7 | 0.82063 | 1.7529 | 0.44458 | 0.97964 | 0.99649 | 0.7677 | 0.0059279 | 0.0063954 | 0.018375 | 0.080587 | 0.0064523 | 0.0022713 | 4.15E-05 | 0.53904 | 1 |
| 8 | 0.77982 | 1.6215 | 0.39222 | 0.98512 | 0.99825 | 0.80816 | 0.0050987 | 0.0047314 | 0.024875 | 0.089686 | 0.0079794 | 0.0024664 | 0.00014676 | 0.66975 | 1 |
| 9 | 0.83089 | 1.8199 | 0.45693 | 0.9824 | 1 | 0.77106 | 0.0060055 | 0.006564 | 0.0072447 | 0.040616 | 0.0016469 | 0.00038812 | 3.29E-05 | 0.33696 | 1 |
| 11 | 0.7459 | 1.4927 | 0.34116 | 0.98296 | 1 | 0.83088 | 0.0055665 | 0.0056395 | 0.0057679 | 0.036511 | 0.0013313 | 0.00030872 | 3.18E-05 | 0.25026 | 1 |
| 12 | 0.79606 | 1.6934 | 0.43387 | 0.98181 | 1 | 0.76985 | 0.0077992 | 0.011071 | 0.013677 | 0.057832 | 0.0033334 | 0.00081648 | 0.00013855 | 0.49751 | 1 |
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
I have two sets of row indices :
set1 = [1,3,5,8,9]
set2 = [2,4,7,10,10]
Note: here, I have indicated the first row with index value 1. The lengths of both sets will always be the same.
What I am looking for is a fast and pythonic way to get the difference of column values for corresponding row indices, that is: the differences 1-2, 3-4, 5-7, 8-10 and 9-10.
For this example, my resultant dataframe is the following:
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.01479 | 0.0515 | 0.0372 | 0.00383 | 0.00175 | 0.03725 | 0.0005857 | 0.0010551 | 0.02363 | 0.037474 | 0.0079885 | 0.0025243 | 0.00019997 | 0.47901 | 0 |
| 1 | 0.02925 | 0.1128 | 0.03622 | 0.00189 | 0 | 0.00885 | 0.0005805 | 0.0015142 | 0.004053 | 0.008046 | 0.0009816 | 0.00023372 | 0.0000284 | 0.14437 | 0 |
| 3 | 0.04319 | 0.1492 | 0.0524 | 0.00814 | 0.00175 | 0.05323 | 0.0023293 | 0.0053106 | 0.0169371 | 0.044347 | 0.005928 | 0.00190654 | 0.00012326 | 0.32761 | 0 |
| 3 | 0.03483 | 0.1265 | 0.02306 | 0.00059 | 0 | 0.00121 | 0.0017937 | 0.004507 | 0.0064323 | 0.017216 | 0.0016865 | 0.00042836 | 0.00010565 | 0.16055 | 0 |
| 1 | 0.05016 | 0.2007 | 0.09271 | 0.00115 | 0 | 0.06103 | 0.0022327 | 0.0054315 | 0.0079091 | 0.021321 | 0.0020021 | 0.00050776 | 0.00010675 | 0.24725 | 0 |
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
My resultant difference values are absolute here.
I can't apply diff(), since the row indices may not be consecutive.
I am currently achieving my aim via looping through sets.
Is there a pandas trick to do this?
Use loc based indexing -
df.loc[set1].values - df.loc[set2].values
Ensure that len(set1) is equal to len(set2). Also, keep in mind setX is a counter-intuitive name for list objects.
You need to select by data reindexing and then subtract:
df = df.reindex(set1) - df.reindex(set2).values
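Selecting with loc is unsafe when any label is missing: older pandas versions emit a FutureWarning, and pandas 1.0 and later raise a KeyError when a list-like passed to .loc or [] contains a missing label. With reindex, missing labels produce rows of NaN instead.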
In short, try the following:
df.iloc[::2].values - df.iloc[1::2].values
PS:
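Or alternatively, if (as in your question) the indices follow no simple rule:
df.iloc[set1].values - df.iloc[set2].values
Note that iloc uses zero-based positions, so 1-based row numbers like those in the question need 1 subtracted from each entry first.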
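Reading the expected output in the question, the row numbers appear to be 1-based positions rather than index labels (row 10 is the tenth row, whose first column holds 12), and the differences are absolute. A small sketch under that reading follows. The function name paired_abs_diff and the column names c0 and c1 are made up, and only the first two columns of the question's table are reproduced.

```python
import numpy as np
import pandas as pd


def paired_abs_diff(df, rows_a, rows_b):
    """Absolute difference between rows given as 1-based positions."""
    pos_a = np.asarray(rows_a) - 1  # convert to the 0-based positions iloc expects
    pos_b = np.asarray(rows_b) - 1
    diff = df.iloc[pos_a].to_numpy() - df.iloc[pos_b].to_numpy()
    return pd.DataFrame(np.abs(diff), columns=df.columns)


# First two columns of the table in the question
table = pd.DataFrame({
    'c0': [1, 2, 3, 4, 5, 7, 8, 9, 11, 12],
    'c1': [0.72694, 0.74173, 0.76722, 0.73797, 0.82301,
           0.82063, 0.77982, 0.83089, 0.74590, 0.79606],
})
set1 = [1, 3, 5, 8, 9]
set2 = [2, 4, 7, 10, 10]
result = paired_abs_diff(table, set1, set2)
```

For these two columns the result matches the first two columns of the expected table in the question (1, 1, 3, 3, 1 and 0.01479, 0.02925, 0.04319, 0.03483, 0.05016).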
