So I have the values given below, where Index is the frame number and A is the value associated with that frame.
Index A
15 21.0
21 0.0
28 18.0
35 0.0
43 21.0
52 0.0
55 nan
60 nan
63 nan
64 nan
69 16.0
70 nan
72 15.0
79 nan
82 nan
91 0.0
94 8.0
99 0.0
100 0.0
106 15.0
113 0.0
119 nan
123 0.0
133 22.0
141 nan
142 10.0
148 0.0
152 8.0
154 nan
158 16.0
Using the above values, I am trying to plot a graph that gives me the maxima and minima. Some maxima and minima values are very close together, and I want to combine them into a single point. I have attached an image of the graph.
Edit 1:- My expected output is shown in the 4th plot in the figure below: I want a continuous graph, with very close minima and maxima values combined into one value.
Image of the Graph
Edit 2:-
My maxima values are as follows. I want to combine Index 69 and 72 into one single point, as they are very close (a sketch of one possible approach is shown after the table).
Index A
15 21.0
28 18.0
43 21.0
55 nan
63 nan
69 16.0
72 15.0
82 nan
94 8.0
106 15.0
119 nan
133 22.0
142 10.0
152 8.0
158 16.0
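A minimal sketch of one possible way to do this merging, assuming the maxima are in a pandas DataFrame with the Index and A columns shown above. The threshold gap=5 frames, the choice to average the merged points, and the helper name merge_close are all assumptions, not anything prescribed by the question; the NaN rows are simply dropped first:

import pandas as pd

# Sketch: group extrema whose frame numbers are within `gap` of each other
# and replace each group with a single averaged point (gap=5 is an assumption).
def merge_close(extrema, gap=5):
    ext = extrema.dropna().sort_values('Index').reset_index(drop=True)
    # start a new group whenever the jump from the previous frame exceeds gap
    group = ext['Index'].diff().gt(gap).cumsum()
    return ext.groupby(group.values).mean().reset_index(drop=True)

maxima = pd.DataFrame({'Index': [15, 28, 43, 69, 72, 94, 106, 133, 142, 152, 158],
                       'A': [21.0, 18.0, 21.0, 16.0, 15.0, 8.0, 15.0, 22.0, 10.0, 8.0, 16.0]})
print(merge_close(maxima))  # Index 69 and 72 collapse to one row (70.5, 15.5); the rest stay

Keeping the higher of the two peaks instead of averaging would be an equally valid choice; only the grouping step matters.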
I have a couple of data frames given this way:
38 47 7 20 35
45 76 63 96 24
98 53 2 87 80
83 86 92 48 1
73 60 26 94 6

80 50 29 53 92
66 90 79 98 46
40 21 58 38 60
35 13 72 28 6
48 76 51 96 12

79 80 24 37 51
86 70 1 22 71
52 69 10 83 13
12 40 3 0 30
46 50 48 76 5
Could you please tell me how it is possible to add them to a list of dataframes?
Thanks a lot!
First read the values into one DataFrame, with separator rows of missing values (converted from the blank lines):
import pandas as pd

df = pd.read_csv(file, header=None, skip_blank_lines=False)
print (df)
0 1 2 3 4
0 38.0 47.0 7.0 20.0 35.0
1 45.0 76.0 63.0 96.0 24.0
2 98.0 53.0 2.0 87.0 80.0
3 83.0 86.0 92.0 48.0 1.0
4 73.0 60.0 26.0 94.0 6.0
5 NaN NaN NaN NaN NaN
6 80.0 50.0 29.0 53.0 92.0
7 66.0 90.0 79.0 98.0 46.0
8 40.0 21.0 58.0 38.0 60.0
9 35.0 13.0 72.0 28.0 6.0
10 48.0 76.0 51.0 96.0 12.0
11 NaN NaN NaN NaN NaN
12 79.0 80.0 24.0 37.0 51.0
13 86.0 70.0 1.0 22.0 71.0
14 52.0 69.0 10.0 83.0 13.0
15 12.0 40.0 3.0 0.0 30.0
16 46.0 50.0 48.0 76.0 5.0
And then use a list comprehension to create the smaller DataFrames in a list:
dfs = [g.dropna().astype(int).reset_index(drop=True)
       for _, g in df.groupby(df[0].isna().cumsum())]
print (dfs[1])
0 1 2 3 4
0 80 50 29 53 92
1 66 90 79 98 46
2 40 21 58 38 60
3 35 13 72 28 6
4 48 76 51 96 12
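The grouping key is worth a small sketch of its own: df[0].isna().cumsum() increases by one at every all-NaN separator row, so each block of consecutive data rows gets its own group label, and the separator rows themselves are removed by dropna inside the comprehension. The variable name labels below is just for illustration:

labels = df[0].isna().cumsum()
print(labels.tolist())
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
# rows 0-4 -> group 0, rows 5-10 -> group 1 (row 5 is a separator), rows 11-16 -> group 2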
I have a randomly generated 10x10 dataset and I need to randomly replace 10% of the dataset with NaN.
import pandas as pd
import numpy as np
Dataset = pd.DataFrame(np.random.randint(0, 100, size=(10, 10)))
Try the following method. I had used this when I was setting up a hackathon and needed to inject missing data for the competition.
You can use np.random.choice to create a mask of the same shape as the dataframe. Just make sure to set the probabilities p for the True and False values, where True marks the values that will be replaced by NaNs.
Then simply apply the mask using df.mask:
import pandas as pd
import numpy as np
p = 0.1  # fraction of missing data required
df = pd.DataFrame(np.random.randint(0,100,size=(10,10)))
mask = np.random.choice([True, False], size=df.shape, p=[p,1-p])
new_df = df.mask(mask)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 50.0 87 NaN 14 78.0 44.0 19.0 94 28 28.0
1 NaN 58 3.0 75 90.0 NaN 29.0 11 47 NaN
2 91.0 30 98.0 77 3.0 72.0 74.0 42 69 75.0
3 68.0 92 90.0 90 NaN 60.0 74.0 72 58 NaN
4 39.0 51 NaN 81 67.0 43.0 33.0 37 13 40.0
5 73.0 0 59.0 77 NaN NaN 21.0 74 55 98.0
6 33.0 64 0.0 59 27.0 32.0 17.0 3 31 43.0
7 75.0 56 21.0 9 81.0 92.0 89.0 82 89 NaN
8 53.0 44 49.0 31 76.0 64.0 NaN 23 37 NaN
9 65.0 15 31.0 21 84.0 7.0 24.0 3 76 34.0
EDIT:
Updated my answer for the exact 10% of values that you are looking for. It uses itertools.product and random.sample to get a set of indexes to mask, and then sets them to NaN. This should be exactly what you expected.
from itertools import product
from random import sample
p = 0.1
n = int(df.shape[0]*df.shape[1]*p) #Calculate count of nans
#Sample exactly n indexes
ids = sample(list(product(range(df.shape[0]), range(df.shape[1]))), n)
idx, idy = list(zip(*ids))
data = df.to_numpy().astype(float) #Get data as numpy
data[idx, idy] = np.nan #Set the sampled positions to np.nan
#Assign to new dataframe
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 52.0 50.0 24.0 81.0 10.0 NaN NaN 75.0 14.0 81.0
1 45.0 3.0 61.0 67.0 93.0 NaN 90.0 34.0 39.0 4.0
2 1.0 NaN NaN 71.0 57.0 88.0 8.0 9.0 62.0 20.0
3 78.0 3.0 82.0 1.0 75.0 50.0 33.0 66.0 52.0 8.0
4 11.0 46.0 58.0 23.0 NaN 64.0 47.0 27.0 NaN 21.0
5 70.0 35.0 54.0 NaN 70.0 82.0 69.0 94.0 20.0 NaN
6 54.0 84.0 16.0 76.0 77.0 50.0 82.0 31.0 NaN 31.0
7 71.0 79.0 93.0 11.0 46.0 27.0 19.0 84.0 67.0 30.0
8 91.0 85.0 63.0 1.0 91.0 79.0 80.0 14.0 75.0 1.0
9 50.0 34.0 8.0 8.0 10.0 56.0 49.0 45.0 39.0 13.0
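A quick sanity check (not part of the original answer): the np.random.choice mask only gives roughly 10% NaNs, while the sampling approach should inject exactly n of them:

print(new_df.isna().sum().sum())  # exactly n (10 for a 10x10 frame with p = 0.1)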
I am trying to sort out a dataframe where some rows are all NaN. I want to fill these using ffill. I'm currently trying this, although I feel like it's a mishmash of a few commands:
df.loc[df['A'].isna(), :] = df.fillna(method='ffill')
This gives an error:
AttributeError: 'NoneType' object has no attribute 'fillna'
but I want to filter which NaNs I fill with ffill, based on whether one particular column is NaN, i.e.
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 NaN NaN NaN NaN NaN
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 NaN NaN NaN NaN NaN
So I would only like to fill a row iff the value of A is NaN, whilst leaving the NaNs in columns C and D of row 0 as they are, giving the dataframe below:
A B C D E
0 45 88 NaN NaN 3
1 62 34 2 86 NaN
2 85 65 11 31 5
3 85 65 11 31 5
4 90 38 34 93 8
5 0 94 45 10 10
6 58 NaN 23 60 11
7 10 32 5 15 11
8 10 32 5 15 11
So just to clarify, the ONLY rows that get replaced with ffill are 3 and 8, because the value of column A in rows 3 and 8 is NaN.
Thanks
---Update---
When I'm debugging and evaluate the expression df.loc[df['A'].isna(), :]
I get
3 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
So I assume what's happening here is that I then attempt ffill on this new dataframe containing only rows 3 and 8, and obviously I can't ffill NaNs with NaNs.
Change values only in those rows that start with NaN:
df.loc[df['A'].isna(), :] = df.ffill().loc[df['A'].isna(), :]
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
Try using a mask to identify the relevant rows where column A is null. Then take those same rows from the forward-filled dataframe.
mask = df['A'].isnull()
df.loc[mask, :] = df.ffill().loc[mask, :]
>>> df
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
You just want to fill (DataFrame.ffill) where (DataFrame.where) df['A'] is NaN, and leave the rest as it was before (df):
df = df.ffill().where(df['A'].isna(), df)
print(df)
A B C D E
0 45.0 88.0 NaN NaN 3.0
1 62.0 34.0 2.0 86.0 NaN
2 85.0 65.0 11.0 31.0 5.0
3 85.0 65.0 11.0 31.0 5.0
4 90.0 38.0 34.0 93.0 8.0
5 0.0 94.0 45.0 10.0 10.0
6 58.0 NaN 23.0 60.0 11.0
7 10.0 32.0 5.0 15.0 11.0
8 10.0 32.0 5.0 15.0 11.0
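A note on this last approach (my addition, not part of the answer): the condition passed to DataFrame.where is a boolean Series, df['A'].isna(), so it is aligned on the index and broadcast across all columns. An entire row is therefore taken from df.ffill() whenever A is NaN, and from the original df otherwise, which is why the pre-existing NaNs in rows 0, 1 and 6 stay untouched.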
I have a data frame that has 5 columns named '0', '1', '2', '3', '4':
small_pd
Out[53]:
0 1 2 3 4
0 93.0 94.0 93.0 33.0 0.0
1 92.0 94.0 92.0 33.0 0.0
2 92.0 93.0 92.0 33.0 0.0
3 92.0 94.0 20.0 33.0 76.0
I want to feed the input above row-wise into a function that does the following. I give the first and second rows as an example:
first row:
takeValue[0,0]-takeValue[0,1]+takeValue[0,2]-takeValue[0,3]+takeValue[0,4]
second row:
takeValue[1,0]-takeValue[1,1]+takeValue[1,2]-takeValue[1,3]+takeValue[1,4]
and so on from the third row onwards, and then assign all those results as an extra column:
small_pd['extracolumn']
Is there a way to avoid a typical for loop in python and do it in a much better way?
Can you please advise me?
Thanks a lot
Alex
You can use DataFrame.apply:
df = pd.DataFrame(data={"0": [93, 92, 92, 92],
                        "1": [94, 94, 93, 94],
                        "2": [93, 92, 92, 20],
                        "3": [33, 33, 33, 33],
                        "4": [0, 0, 0, 76]})

def calculation(row):
    return row["0"] - row["1"] + row["2"] - row["3"] + row["4"]

df['extracolumn'] = df.apply(calculation, axis=1)
print(df)
0 1 2 3 4 extracolumn
0 93 94 93 33 0 59
1 92 94 92 33 0 57
2 92 93 92 33 0 58
3 92 94 20 33 76 61
Don't use apply, because it loops under the hood, so it is slow.
Instead, get the even- and odd-positioned columns by indexing with DataFrame.iloc, sum each set, and then subtract them for a vectorized, fast solution:
small_pd['extracolumn'] = small_pd.iloc[:, ::2].sum(1) - small_pd.iloc[:, 1::2].sum(1)
print (small_pd)
0 1 2 3 4 extracolumn
0 93.0 94.0 93.0 33.0 0.0 59.0
1 92.0 94.0 92.0 33.0 0.0 57.0
2 92.0 93.0 92.0 33.0 0.0 58.0
3 92.0 94.0 20.0 33.0 76.0 61.0
Verify:
a = (small_pd.iloc[0,0] - small_pd.iloc[0,1] + small_pd.iloc[0,2]
     - small_pd.iloc[0,3] + small_pd.iloc[0,4])
b = (small_pd.iloc[1,0] - small_pd.iloc[1,1] + small_pd.iloc[1,2]
     - small_pd.iloc[1,3] + small_pd.iloc[1,4])
print (a, b)
59.0 57.0
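An equivalent vectorized sketch using a sign vector and a row-wise dot product (my alternative, not part of the answers above; it assumes the original five columns are numeric):

import numpy as np

signs = np.array([1, -1, 1, -1, 1])     # alternating signs for columns 0..4
vals = small_pd.iloc[:, :5].to_numpy()  # only the original five columns, in case extracolumn already exists
small_pd['extracolumn'] = vals @ signs  # same result: 59, 57, 58, 61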
I have a few csv files which contain a pair of bearings for many locations. I am trying to expand the values to include every number between the bearing pair for each location, and export the variable-length results as a csv in the same format.
Example:
df = pd.read_csv('bearing.csv')
Data structure:
A B C D E
0 0 94 70 67 84
1 120 132 109 152 150
The ideal result is a variable-length multidimensional array:
A B C D E
0 0 94 70 67 84
1 1 95 71 68 85
2 2 96 72 69 86
...
n 120 132 109 152 150
I am looping through each column and getting the range of the pair of values, but I am struggling when trying to overwrite the old column with the new range of values.
for col in bear:
    min_val = min(bear[col])
    max_val = max(bear[col])
    range_vals = range(min(bear[col]), max(bear[col]) + 1)
    bear[col] = range_vals
I am getting the following error:
ValueError: Length of values does not match length of index
You can use a dict comprehension with min and max in the DataFrame constructor, but you get a lot of NaNs at the end of the shorter columns:
df = pd.DataFrame({col: pd.Series(range(df[col].min(),
df[col].max() + 1)) for col in df.columns })
print (df)
A B C D E
0 0 94.0 70.0 67.0 84.0
1 1 95.0 71.0 68.0 85.0
2 2 96.0 72.0 69.0 86.0
3 3 97.0 73.0 70.0 87.0
4 4 98.0 74.0 71.0 88.0
5 5 99.0 75.0 72.0 89.0
6 6 100.0 76.0 73.0 90.0
7 7 101.0 77.0 74.0 91.0
8 8 102.0 78.0 75.0 92.0
9 9 103.0 79.0 76.0 93.0
10 10 104.0 80.0 77.0 94.0
11 11 105.0 81.0 78.0 95.0
12 12 106.0 82.0 79.0 96.0
13 13 107.0 83.0 80.0 97.0
14 14 108.0 84.0 81.0 98.0
15 15 109.0 85.0 82.0 99.0
16 16 110.0 86.0 83.0 100.0
17 17 111.0 87.0 84.0 101.0
18 18 112.0 88.0 85.0 102.0
19 19 113.0 89.0 86.0 103.0
20 20 114.0 90.0 87.0 104.0
21 21 115.0 91.0 88.0 105.0
22 22 116.0 92.0 89.0 106.0
23 23 117.0 93.0 90.0 107.0
24 24 118.0 94.0 91.0 108.0
25 25 119.0 95.0 92.0 109.0
26 26 120.0 96.0 93.0 110.0
27 27 121.0 97.0 94.0 111.0
28 28 122.0 98.0 95.0 112.0
29 29 123.0 99.0 96.0 113.0
.. ... ... ... ... ...
91 91 NaN NaN NaN NaN
92 92 NaN NaN NaN NaN
93 93 NaN NaN NaN NaN
94 94 NaN NaN NaN NaN
95 95 NaN NaN NaN NaN
96 96 NaN NaN NaN NaN
97 97 NaN NaN NaN NaN
98 98 NaN NaN NaN NaN
99 99 NaN NaN NaN NaN
100 100 NaN NaN NaN NaN
101 101 NaN NaN NaN NaN
102 102 NaN NaN NaN NaN
103 103 NaN NaN NaN NaN
104 104 NaN NaN NaN NaN
105 105 NaN NaN NaN NaN
106 106 NaN NaN NaN NaN
107 107 NaN NaN NaN NaN
108 108 NaN NaN NaN NaN
109 109 NaN NaN NaN NaN
110 110 NaN NaN NaN NaN
111 111 NaN NaN NaN NaN
112 112 NaN NaN NaN NaN
113 113 NaN NaN NaN NaN
114 114 NaN NaN NaN NaN
115 115 NaN NaN NaN NaN
116 116 NaN NaN NaN NaN
117 117 NaN NaN NaN NaN
118 118 NaN NaN NaN NaN
119 119 NaN NaN NaN NaN
120 120 NaN NaN NaN NaN
If you have only a few columns, it is possible to use:
df = pd.DataFrame({'A': pd.Series(range(df.A.min(), df.A.max() + 1)),
'B': pd.Series(range(df.B.min(), df.B.max() + 1))})
EDIT:
If the min value is in the first row and the max in the last, you can use iloc:
df = pd.DataFrame({col: pd.Series(range(df[col].iloc[0],
df[col].iloc[-1] + 1)) for col in df.columns })
Timings:
In [3]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].iloc[0], df[col].iloc[-1] + 1)) for col in df.columns }) )
1000 loops, best of 3: 1.75 ms per loop
In [4]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].min(), df[col].max() + 1)) for col in df.columns }) )
The slowest run took 5.50 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.18 ms per loop
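To write the expanded frame back out as a csv in the same layout (a small usage sketch; the filename is just an example, and the trailing NaNs in the shorter columns are written as empty cells by default):

df.to_csv('bearings_expanded.csv', index=False)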