I am plotting column X against column FT, split by the value of column CN, with the following code:
import matplotlib.pyplot as plt
plt.plot(X[CN == 1], FT[CN == 1])
plt.plot(X[CN == 36], FT[CN == 36])
and the data is given as
+-------+----+----+-------+-------+
|   X   |  N | CN | Vdiff |  FT   |
+-------+----+----+-------+-------+
|   524 |  2 |  1 |  0.0  |  0.12 |
|   534 |  2 |  1 |  0.0  | 0.134 |
|   525 |  2 |  1 |  0.0  | 0.154 |
|  ...  |    |    |       |  ...  |
|  5976 | 15 | 14 |  0.0  |  3.54 |
|  5913 | 15 | 14 |  0.1  |  3.98 |
|  5923 |  0 | 15 |  0.0  |  3.87 |
|  ...  |    |    |       |  ...  |
| 33001 |  7 | 36 |  0.0  |  7.36 |
| 33029 |  7 | 36 |  0.0  |  8.99 |
| 33023 |  7 | 36 |  0.1  | 12.45 |
| 33114 |  0 | 37 |  0.0  | 14.33 |
+-------+----+----+-------+-------+
The curves come out incomplete, so I need to include one extra row in each plot. For example, for the plot of CN == 36, plt.plot(X[CN==36],FT[CN==36]), I also want to include the first row of CN == 37. Note that CN values repeat in blocks.
I have to plot many graphs this way, so general code that handles all of them would be appreciated.
Addition on request in a comment: at the end of each circular shape the edges do not touch, so the circle is incomplete (for example the aqua and green cycles). I want complete cycles, which is why I need one or two additional rows of data in each plot.
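One general approach is a sketch along these lines (assuming X, FT and CN live in one DataFrame df; plot_cycle is a hypothetical helper, and the toy data below stands in for the real file): extend each boolean mask by one row with shift, so the first row of the next CN block is included and the curve closes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# toy stand-in for the real data: three CN blocks
df = pd.DataFrame({
    "X":  [524, 534, 525, 5976, 5913, 5923],
    "CN": [1, 1, 1, 2, 2, 3],
    "FT": [0.12, 0.134, 0.154, 3.54, 3.98, 3.87],
})

def plot_cycle(df, cn):
    """Plot one CN block, extended by the first row that follows it."""
    mask = df["CN"].eq(cn)
    # shift(fill_value=False) marks the row right after each block of this
    # CN value, so the curve continues into the next block's first row
    extended = mask | mask.shift(fill_value=False)
    plt.plot(df.loc[extended, "X"], df.loc[extended, "FT"], label=f"CN={cn}")

for cn in df["CN"].unique():
    plot_cycle(df, cn)
plt.legend()
```

Because the mask is shifted rather than sliced by position, this also handles CN values that repeat in non-adjacent blocks: every block gets one extra trailing row.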
I am using the code below to produce the following result in Python, and I want the equivalent of this code in R.
Here N is a column of the dataframe data. The CN column is calculated from the values of column N with a specific pattern, and it gives me the following result in Python:
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+
A short overview of my code is:
import numpy as np
import pandas as pd

data = pd.read_table(filename, skiprows=15, decimal=',', sep='\t', header=None, names=["Date ","Heure ","temps (s) ","X","Z"," LVDT V(mm) " ,"Force normale (N) ","FT","FN(N) ","TS"," NS(kPa) ","V (mm/min)","Vitesse normale (mm/min)","e (kPa)","k (kPa/mm) " ,"N " ,"Nb cycles normal" ,"Cycles " ,"Etat normal" ,"k imposÈ (kPa/mm)"])
data.columns = [col.strip() for col in data.columns.tolist()]
N = data[data.keys()[15]]
N = np.array(N)
data["CN"] = (data.N.shift().bfill() != data.N).astype(int).cumsum()
An example of data.head() is:
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| Index | Date | Heure | temps (s) | X | Z(mm) | LVDT V(mm) | Force normale (N) | FT | FN(N) | FT (kPa) | NS(kPa) | V (mm/min) | Vitesse normale (mm/min) | e (kPa) | k (kPa/mm) | N | Nb cycles normal | Cycles | Etat normal | k imposÈ (kPa/mm) | CN |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| 184 | 01/02/2022 | 12:36:52 | 402.163 | 6.910243 | 1.204797 | 0.001101 | 299.783665 | 31.494351 | 1428.988908 | 11.188704 | 505.825016 | 0.1 | 2.0 | 512.438828 | 50.918786 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 185 | 01/02/2022 | 12:36:54 | 404.288 | 6.907822 | 1.205647 | 4.9e-05 | 296.072718 | 31.162313 | 1404.195316 | 11.028167 | 494.97955 | 0.1 | -2.0 | 500.084986 | 49.685639 | 0.0 | 0.0 | Sort | Descend | 0.0 | 0 |
| 186 | 01/02/2022 | 12:36:56 | 406.536 | 6.907906 | 1.204194 | -0.000214 | 300.231424 | 31.586401 | 1429.123486 | 11.21895 | 505.750815 | 0.1 | 2.0 | 512.370164 | 50.914002 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 187 | 01/02/2022 | 12:36:58 | 408.627 | 6.910751 | 1.204293 | -0.000608 | 300.188686 | 31.754064 | 1428.979519 | 11.244542 | 505.624564 | 0.1 | 2.0 | 512.309254 | 50.906544 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 188 | 01/02/2022 | 12:37:00 | 410.679 | 6.907805 | 1.205854 | -0.000181 | 296.358074 | 31.563389 | 1415.224427 | 11.129375 | 502.464948 | 0.1 | 2.0 | 510.702313 | 50.742104 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
A one-line cumsum trick solves it:
cumsum(c(0L, diff(df1$N) != 0))
#> [1] 0 1 1 2 2 3 3 4 4 4 5 5 6 7 8 9 10
all.equal(
cumsum(c(0L, diff(df1$N) != 0)),
df1$CN
)
#> [1] TRUE
Created on 2022-02-14 by the reprex package (v2.0.1)
Data
x <- "
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+"
df1 <- read.table(textConnection(x), header = TRUE, sep = "|", comment.char = "+")[2:3]
I want to add a column that follows N until N reaches 31 and then continues counting with values not present in N, so that I can plot it like:
plt.plot(X[N == 1], FT[N == 1])
plt.plot(X[new_col == 63], FT[new_col == 63])
The data is as follows:
+-------+----+----+-------+-------+
|   X   |  N | CN | Vdiff |  FT   |
+-------+----+----+-------+-------+
|   524 |  2 |  1 |  0.0  |  0.12 |
|   534 |  2 |  1 |  0.0  | 0.134 |
|   525 |  2 |  1 |  0.0  | 0.154 |
|  ...  |    |    |       |  ...  |
|  5976 | 31 | 14 |  0.0  |  3.54 |
|  5913 | 31 | 29 |  0.1  |  3.98 |
|  5923 |  0 | 29 |  0.0  |  3.87 |
|  ...  |    |    |       |  ...  |
| 33001 |  7 | 36 |  0.0  |  7.36 |
| 33029 |  7 | 36 |  0.0  |  8.99 |
| 33023 |  7 | 43 |  0.1  | 12.45 |
| 33114 |  0 | 43 |  0.0  | 14.33 |
+-------+----+----+-------+-------+
The solution I want is
+-------+----+----+---------+--------+
|   X   |  N | CN | new_col |   FT   |
+-------+----+----+---------+--------+
|   524 |  2 |  1 |       2 |   0.12 |
|   534 |  2 |  1 |       2 |  0.134 |
|   525 |  2 |  1 |       2 |  0.154 |
|  ...  |    |    |         |  ...   |
|  5976 | 31 | 14 |      31 |   3.54 |
|  5913 | 31 | 29 |      31 |   3.98 |
|  5923 |  0 | 29 |      32 |   3.87 |
|  ...  |    |    |         |  ...   |
| 33001 |  7 | 36 |      45 |   7.36 |
| 33029 |  7 | 36 |      45 |   8.99 |
| 33023 |  7 | 43 |      45 |  12.45 |
| 33114 |  0 | 43 |      46 |  14.33 |
+-------+----+----+---------+--------+
Note that the values in new_col should repeat in blocks like the values in N, and should not change on every new row.
Is this the output you need? We cannot simply group by N, since it has repetitive, non-adjacent values and we need to preserve row order. Instead we count the rows where N changes compared to its own previous value.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(
"""X|N|CN|Vdiff|FT
524|2|1|0.0|0.12
534|2|1|0.0|0.134
525|2|1|0.0|0.154
5976|31|14|0.0|3.54
5913|31|29|0.1|3.98
5923|0|29|0.0|3.87
33001|7|36|0.0|7.36
33029|7|36|0.0|8.99
33023|7|43|0.1|12.45
33114|0|43|0.0|14.33"""), sep="|")
# works in pandas >= 1.2
# df["new_val"] = df.eval("C = N.shift().bfill() != N")["C"].astype(int).cumsum()
# works in older pandas as well
df["new_val"] = (df.N.shift().bfill() != df.N).astype(int).cumsum()
print(df)
X N CN Vdiff FT new_val
0 524 2 1 0.0 0.120 0
1 534 2 1 0.0 0.134 0
2 525 2 1 0.0 0.154 0
3 5976 31 14 0.0 3.540 1
4 5913 31 29 0.1 3.980 1
5 5923 0 29 0.0 3.870 2
6 33001 7 36 0.0 7.360 3
7 33029 7 36 0.0 8.990 3
8 33023 7 43 0.1 12.450 3
9 33114 0 43 0.0 14.330 4
So I have the following dask dataframe, grouped by the Problem column:
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 5 | 2 | 15 | 38 |
| A | 15 | 2 | 15 | 23 |
| B | 11 | 6 | 10 | 54 |
| B | 10 | 6 | 10 | 48 |
| B | 18 | 6 | 10 | 79 |
| C | 50 | 8 | 25 | 120 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
The goal is to create a new dataframe with all rows whose Cost value is minimal within their Problem group. So we want the following result:
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 15 | 2 | 15 | 23 |
| B | 10 | 6 | 10 | 48 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
How can I achieve this result? I already tried using idxmin() as mentioned in another question on here, but then I get ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
What if you create another dataframe that holds the per-Problem Cost.min()? Let's call the new column cost_min.
df1 = df.groupby('Problem')['Cost'].min().reset_index(name='cost_min')
Then merge this new cost_min column back into the dataframe.
df2 = pd.merge(df, df1, how='left', on='Problem')
From there, do something like:
df_new = df2.loc[df2['Cost'] == df2['cost_min']]
I just wrote pseudocode, but I think it all works with Dask.
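Assembled and run against the sample rows above, the idea looks like this (a pandas sketch; dask.dataframe exposes the same groupby/merge API, with a final .compute() to materialize the result):

```python
import pandas as pd

# sample rows from the question
df = pd.DataFrame({
    "Problem": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Items":   [7, 5, 15, 11, 10, 18, 50, 50, 48],
    "Cost":    [23, 38, 23, 54, 48, 79, 120, 68, 68],
})

# per-group minimum cost as a helper column
mins = df.groupby("Problem")["Cost"].min().reset_index(name="cost_min")
merged = df.merge(mins, on="Problem", how="left")

# keep only rows that match their group's minimum, then drop the helper
result = merged.loc[merged["Cost"] == merged["cost_min"]].drop(columns="cost_min")
print(result)
```

The merge avoids idxmin() entirely, which sidesteps the "Not all divisions are known" alignment error, since no index-based alignment between partitions is needed.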
I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is in biostatistics so I am not comfortable with replacing values using means/medians/modes. I would like to apply the "Hot Deck Imputation" method.
I cannot find any Python functions or packages online that take the column of a dataframe and fill missing values with the "Hot Deck Imputation" method.
I did, however, see this GitHub project and did not find it useful.
The following is an example of some of my data (assume this is a pandas dataframe):
| age | sex | bmi | anesthesia score | pain level |
|-----|-----|------|------------------|------------|
| 78 | 1 | 40.7 | 3 | 0 |
| 55 | 1 | 25.3 | 3 | 0 |
| 52 | 0 | 25.4 | 3 | 0 |
| 77 | 1 | 44.9 | 3 | 3 |
| 71 | 1 | 26.3 | 3 | 0 |
| 39 | 0 | 28.2 | 2 | 0 |
| 82 | 1 | 27 | 2 | 1 |
| 70 | 1 | 37.9 | 3 | 0 |
| 71 | 1 | NA | 3 | 1 |
| 53 | 0 | 24.5 | 2 | NA |
| 68 | 0 | 34.7 | 3 | 0 |
| 57 | 0 | 30.7 | 2 | 0 |
| 40 | 1 | 22.4 | 2 | 0 |
| 73 | 1 | 34.2 | 2 | 0 |
| 66 | 1 | NA | 3 | 1 |
| 55 | 1 | 42.6 | NA | NA |
| 53 | 0 | 37.5 | 3 | 3 |
| 65 | 0 | 31.6 | 2 | 2 |
| 36 | 0 | 29.6 | 1 | 0 |
| 60 | 0 | 25.7 | 2 | NA |
| 70 | 1 | 30 | NA | NA |
| 66 | 1 | 28.3 | 2 | 0 |
| 63 | 1 | 29.4 | 3 | 2 |
| 70 | 1 | 36 | 3 | 2 |
I would like to apply a Python function that would allow me to input a column as a parameter and return the column with the missing values replaced with imputed values using the "Hot Deck Imputation" method.
I am using this for the purpose of statistical modeling with models such as linear and logistic regression using Statsmodels.api. I am not using this for Machine Learning.
Any help would be much appreciated!
You can use ffill, which applies last observation carried forward (LOCF) Hot Deck Imputation.
#...
df = df.ffill()  # equivalent to fillna(method='ffill') in older pandas
Scikit-learn's impute module offers KNN, mean, median and other imputation methods. (https://scikit-learn.org/stable/modules/impute.html)
# sklearn '>=0.22.x'
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
DF['imputed_x'] = imputer.fit_transform(DF[['bmi']])
print(DF['imputed_x'])
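If you want a classical random hot deck rather than LOCF, one minimal sketch (random_hot_deck is a hypothetical helper, not a library function) fills each missing value by sampling from the observed donors in the same column:

```python
import numpy as np
import pandas as pd

def random_hot_deck(series, seed=None):
    """Fill NaNs by sampling, with replacement, from the column's
    observed (non-missing) values -- a simple random hot deck."""
    rng = np.random.default_rng(seed)
    out = series.copy()
    donors = out.dropna().to_numpy()   # pool of observed donor values
    missing = out.isna()
    out[missing] = rng.choice(donors, size=int(missing.sum()))
    return out

df = pd.DataFrame({"bmi": [40.7, 25.3, np.nan, 44.9, np.nan, 26.3]})
df["bmi"] = random_hot_deck(df["bmi"], seed=0)
```

For regression modelling you may prefer a sequential hot deck: sort by covariates such as age and sex first and then ffill, so donors stay demographically similar to recipients.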
I understand that the package empiricaldist provides a Cdf class, as per the documentation.
However, I find it tricky to plot CDFs from my dataframe when the grouping column has multiple values.
df.head()
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
| | trip_id | seconds_start | seconds_end | duration | distance | speed | acceleration | lat_start | lon_start | lat_end | lon_end | travelmode |
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
| 0 | 318410 | 1461743310 | 1461745298 | 1988 | 5121.49 | 2.58 | 0.00130 | 41.162687 | -8.615425 | 41.177888 | -8.597549 | car |
| 1 | 318411 | 1461749359 | 1461750290 | 931 | 1520.71 | 1.63 | 0.00175 | 41.177949 | -8.597074 | 41.177839 | -8.597574 | bus |
| 2 | 318421 | 1461806871 | 1461806941 | 70 | 508.15 | 7.26 | 0.10370 | 37.091240 | -8.211239 | 37.092322 | -8.206681 | foot |
| 3 | 318422 | 1461837354 | 1461838024 | 670 | 1207.39 | 1.80 | 0.00269 | 37.092082 | -8.205060 | 37.091659 | -8.206462 | car |
| 4 | 318425 | 1461852790 | 1461853845 | 1055 | 1470.49 | 1.39 | 0.00132 | 37.091628 | -8.202143 | 37.092095 | -8.205070 | foot |
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
I would like to plot a CDF for each travel mode in the travelmode column.
groups = df.groupby('travelmode')
However, from the documentation I don't really understand how this could be done.
You can plot them in a loop like
from empiricaldist import Cdf
import matplotlib.pyplot as plt

def decorate_plot(title):
    ''' Adds labels to plot '''
    plt.xlabel('Outcome')
    plt.ylabel('CDF')
    plt.title(title)

for tm in df['travelmode'].unique():
    for col in df.columns:
        if col != 'travelmode':
            # Create a new figure for each plot
            fig, ax = plt.subplots()
            # restrict to rows of this travel mode before building the CDF
            d4 = Cdf.from_seq(df.loc[df['travelmode'] == tm, col])
            d4.plot()
            decorate_plot(f"{tm} - {col}")
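For reference, the empirical CDF that Cdf.from_seq builds can also be sketched in plain numpy, in case you want to avoid the extra dependency (ecdf is a hypothetical helper: sorted values against their cumulative fractions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

def ecdf(values):
    """Return sorted values and their cumulative fractions (empirical CDF)."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# duration values from the sample rows, for one travel mode
x, y = ecdf([1988, 931, 70, 670, 1055])
plt.step(x, y, where="post")
plt.xlabel("duration (s)")
plt.ylabel("CDF")
```

Plotting with a post-step keeps the staircase shape of the empirical distribution rather than interpolating between points.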