I have a data frame like:
A B datetime
10 NaN 12-03-2020 04:43:11
NaN 20 13-03-2020 04:43:11
NaN NaN 14-03-2020 04:43:11
NaN NaN 15-03-2020 04:43:11
NaN NaN 16-03-2020 04:43:11
NaN 50 17-03-2020 04:43:11
20 NaN 18-03-2020 04:43:11
NaN 30 19-03-2020 04:43:11
NaN NaN 20-03-2020 04:43:11
30 30 21-03-2020 04:43:11
40 NaN 22-03-2020 04:43:11
NaN 10 23-03-2020 04:43:11
The code I'm using is:
df['timegap_in_min'] = np.where( ((df['A'].notna()) &(df[['B','c']].shift(-1).notna())),df['Datetime'].shift(-1) - df['timestamp'], np.nan)
df['timegap_in_min'] = df['timegap_in_min'].astype('timedelta64[h]')
The required output is:
A B datetime prev_timegap next_timegap
10 NaN 12-03-2020 04:43:11 NaN 24
NaN 20 13-03-2020 04:43:11 NaN NaN
NaN NaN 14-03-2020 04:43:11 NaN NaN
NaN NaN 15-03-2020 04:43:11 NaN NaN
NaN NaN 16-03-2020 04:43:11 NaN NaN
NaN 50 17-03-2020 04:43:11 NaN NaN
20 NaN 18-03-2020 04:43:11 24 24
NaN 30 19-03-2020 04:43:11 NaN NaN
NaN NaN 20-03-2020 04:43:11 NaN NaN
30 30 21-03-2020 04:43:11 24 24
40 NaN 22-03-2020 04:43:11 24 24
NaN 10 23-03-2020 04:43:11 NaN NaN
Can someone help me correct my logic?
Just slightly adjust your code as follows:
df['prev_timegap'] = np.where( ((df['A'].notna()) & (df['B'].shift(1).notna())), abs(df['datetime'].shift(1) - df['datetime']), np.nan)
df['prev_timegap'] = df['prev_timegap'].astype('timedelta64[h]')
df['next_timegap'] = np.where( ((df['A'].notna()) & (df['B'].shift(-1).notna())), abs(df['datetime'].shift(-1) - df['datetime']), np.nan)
df['next_timegap'] = df['next_timegap'].astype('timedelta64[h]')
This should give you the desired result based on the requirement described in the question title. However, the test result differs slightly from the tabulated required output at the following row:
A B datetime prev_timegap next_timegap
...
30 30 21-03-2020 04:43:11 24 24
...
...
Instead, the result is:
A B datetime prev_timegap next_timegap
...
30 30 21-03-2020 04:43:11 NaN NaN
...
...
This result is based on your mention of a "previous time gap and next time gap" (assuming "previous" and "next" refer to different timeframes, i.e. different 'datetime' values).
Note that in your sample df, for the row where column A has the value 30, the corresponding values in B on the previous date and the next date are both NaN. Hence, shouldn't we show NaN there instead?
In case your requirement includes showing values for the two time gaps whenever there are non-NaN values in both columns A and B at the "current" datetime, we may need to further enhance the code above.
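For reference, here is a self-contained sketch of the same logic on a small slice of the sample data. It uses `dt.total_seconds() / 3600` to get hours, since newer pandas versions no longer accept the `astype('timedelta64[h]')` cast:

```python
import numpy as np
import pandas as pd

# A small slice of the sample data (rows 17-03 to 19-03)
df = pd.DataFrame({
    'A': [np.nan, 20, np.nan],
    'B': [50, np.nan, 30],
    'datetime': pd.to_datetime(
        ['17-03-2020 04:43:11', '18-03-2020 04:43:11', '19-03-2020 04:43:11'],
        dayfirst=True),
})

# Gap to the previous row, only where A is present now and B was present before
prev_gap = (df['datetime'] - df['datetime'].shift(1)).abs()
df['prev_timegap'] = np.where(
    df['A'].notna() & df['B'].shift(1).notna(),
    prev_gap.dt.total_seconds() / 3600, np.nan)

# Gap to the next row, only where A is present now and B is present next
next_gap = (df['datetime'].shift(-1) - df['datetime']).abs()
df['next_timegap'] = np.where(
    df['A'].notna() & df['B'].shift(-1).notna(),
    next_gap.dt.total_seconds() / 3600, np.nan)
```

On this slice, the middle row (A = 20) gets 24.0 for both gaps, and the other rows stay NaN, matching the required output.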
I'm having trouble writing code to fill in the in-between values in a pandas dataframe.
the dataframe:
value
30
NaN
NaN
25
NaN
20
NaN
NaN
NaN
NaN
15
...
the formula is like this:
value before NaN - ((value before NaN - value after NaN) / number of NaNs in between the values)
example of expected value should be like this:
30 - (30-25)/2 = 27.5
27.5 - (27.5-25)/1 = 25
so the expected dataframe will look like this:
value   expected value
30      30
NaN     27.5
NaN     25
25      25
NaN     20
20      20
NaN     18.75
NaN     17.5
NaN     16.25
NaN     15
15      15
...     ...
IIUC, you can generalize your formula into two parts:
Any nan right before a non-nan is just the same as that number
{value-before-nan} - ({value-before-nan} - {value-after-nan})/1 = {value-after-nan}
Rest of nan are linear interpolation.
So you can use bfill with interpolate:
df.bfill(limit=1).interpolate()
Output:
value
0 30.00
1 27.50
2 25.00
3 25.00
4 20.00
5 20.00
6 18.75
7 17.50
8 16.25
9 15.00
10 15.00
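As a quick check, the one-liner reproduces the expected column when run on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [30, np.nan, np.nan, 25, np.nan,
                             20, np.nan, np.nan, np.nan, np.nan, 15]})

# bfill(limit=1) copies each number onto the single NaN directly above it,
# which handles rule 1; interpolate() then fills the remaining NaNs linearly.
out = df.bfill(limit=1).interpolate()
```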
I have a csv file that I can read and print
reference radius diameter length sfcefin pltol mitol sfcetrement
0 jnl1 15 30.0 35 Rz2 0.0 -0.03 Stellite Spray
1 jnl2 28 56.0 50 NaN NaN NaN NaN
2 jnl3 10 20.0 25 NaN NaN NaN NaN
3 jnlfce1 15 NaN 15 NaN NaN NaN NaN
4 jnlfce2 28 NaN 13 NaN NaN NaN NaN
5 jnlfce3 28 NaN 18 NaN NaN NaN NaN
6 jnlfce4 10 NaN 10 NaN NaN NaN NaN
I have managed to isolate and print a specific row using
df1 = df[df['reference'].str.contains(feature)]
reference radius diameter length sfcefin pltol mitol sfcetrement
1 jnl2 28 56.0 50 NaN NaN NaN NaN
I now want to select the radius column and put the value into a variable.
I have tried a similar technique on the output df1, but with no success:
value = df1[df1['radius']]
print(value)
Has anyone any more suggestions?
You can use .loc and simply do:
value = df.loc[df['reference'].str.contains(feature), 'radius']
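Note that `.loc` with a boolean mask returns a Series, not a scalar; if a single number is wanted, `.iloc[0]` pulls out the first match. A minimal sketch on made-up data (the `feature` value here is an assumption):

```python
import pandas as pd

# Toy stand-in for the csv data
df = pd.DataFrame({
    'reference': ['jnl1', 'jnl2', 'jnl3'],
    'radius': [15, 28, 10],
})

feature = 'jnl2'  # assumed search term
matches = df.loc[df['reference'].str.contains(feature), 'radius']
value = matches.iloc[0]  # first matching radius as a plain number
```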
Given a dataframe with row and column multiindex, how would you copy a row index "object" and manipulate a specific index value on a chosen level? Ultimately I would like to add a new row to the dataframe with this manipulated index.
Taking this dataframe df as an example:
col_index = pd.MultiIndex.from_product([['A','B'], [1,2,3,4]], names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays([['2010','2011','2009'],['a','r','t'],[45,34,35]], names=["rInd1", "rInd2", 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)
df
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
I would like to take the index of the first row, manipulate the "rInd2" value and use this index to insert another row.
Pseudo code would be something like this:
#Get Index
idx = df.index[0]
#Manipulate Value
idx[1] = "L" #or idx["rInd2"]
#Make new row with new index
df.loc[idx, slice(None)] = None
The desired output would look like this:
cInd1 A B
cInd2 1 2 3 4 1 2 3 4
rInd1 rInd2 rInd3
2010 a 45 NaN NaN NaN NaN NaN NaN NaN NaN
2011 r 34 NaN NaN NaN NaN NaN NaN NaN NaN
2009 t 35 NaN NaN NaN NaN NaN NaN NaN NaN
2010 L 45 NaN NaN NaN NaN NaN NaN NaN NaN
What would be the most efficient way to achieve this?
Is there a way to do the same procedure with column index?
Thanks
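One possible approach (not from the original post): a MultiIndex entry is a plain tuple, so you can take a mutable copy of it, change one level, and concatenate a one-row frame with the new index. A sketch:

```python
import pandas as pd

col_index = pd.MultiIndex.from_product([['A', 'B'], [1, 2, 3, 4]],
                                       names=['cInd1', 'cInd2'])
row_index = pd.MultiIndex.from_arrays(
    [['2010', '2011', '2009'], ['a', 'r', 't'], [45, 34, 35]],
    names=['rInd1', 'rInd2', 'rInd3'])
df = pd.DataFrame(data=None, index=row_index, columns=col_index)

# Index entries are tuples, so edit a mutable copy of the first one
idx = list(df.index[0])   # ['2010', 'a', 45]
idx[1] = 'L'              # change the rInd2 level

# Build a one-row frame carrying the modified index and append it
new_index = pd.MultiIndex.from_tuples([tuple(idx)], names=df.index.names)
new_row = pd.DataFrame(data=None, index=new_index, columns=df.columns)
df = pd.concat([df, new_row])
```

The same trick works for the column index: copy `df.columns[i]` as a list, change a level, and rebuild with `pd.MultiIndex.from_tuples`.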
I have a dataset that looks like below:
Zn Pb Ag Cu Mo Cr Ni Co Ba
87 7 0.02 42 2 57 38 14 393
70 6 0.02 56 2 27 29 20 404
75 5 0.02 69 2 44 23 17 417
70 6 0.02 54 1 20 19 12 377
I want to create a pandas dataframe out of this dataset. I have written the function below:
def correlation_iterated(raw_data, element_concentration):
    columns = element_concentration.split()
    df1 = pd.DataFrame(columns=columns)
    data1 = []
    selected_columns = raw_data.loc[:, element_concentration.split()].columns
    for i in selected_columns:
        for j in selected_columns:
            # another function that takes 'i' and 'j' and returns 'a'
            zipped1 = zip([i], a)
            data1.append(dict(zipped1))
    df1 = df1.append(data1, True)
    print(df1)
This function is supposed to do the calculations for each element and create a 9 by 9 pandas dataframe and store each calculation in each cell. But I get the following:
Zn Pb Ag Cu Mo Cr Ni Co Ba
0 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN
1 0.460611 NaN NaN NaN NaN NaN NaN NaN NaN
2 0.127904 NaN NaN NaN NaN NaN NaN NaN NaN
3 0.276086 NaN NaN NaN NaN NaN NaN NaN NaN
4 -0.164873 NaN NaN NaN NaN NaN NaN NaN NaN
.. ... .. .. .. .. .. .. .. ...
76 NaN NaN NaN NaN NaN NaN NaN NaN 0.113172
77 NaN NaN NaN NaN NaN NaN NaN NaN 0.027251
78 NaN NaN NaN NaN NaN NaN NaN NaN -0.036409
79 NaN NaN NaN NaN NaN NaN NaN NaN 0.041396
80 NaN NaN NaN NaN NaN NaN NaN NaN 1.000000
[81 rows x 9 columns]
which is basically calculating the results for the first column and storing them there, then appending new rows for each subsequent column. How can I program the code so that it moves on to the next column once it has finished one column? I want something like this:
Zn Pb Ag Cu Mo Cr Ni Co Ba
0 1.000000 0.460611 ...
1 0.460611 1.000000 ...
2 0.127904 0.111559 ...
3 0.276086 0.303925 ...
4 -0.164873 -0.190886 ...
5 0.402046 0.338073 ...
6 0.174774 0.096724 ...
7 0.165760 -0.005301 ...
8 -0.043695 0.174193 ...
[9 rows x 9 columns]
Could you not just do something like this:
def correlation_iterated(raw_data, element_concentration):
    columns = element_concentration.split()
    data = {}
    selected_columns = raw_data.loc[:, columns].columns
    for i in selected_columns:
        temp = []
        for j in selected_columns:
            # another function that takes 'i' and 'j' and returns 'a'
            temp.append(a)
        data[i] = temp
    df = pd.DataFrame(data)
    print(df)
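As an aside: if `a` is in fact a pairwise correlation between columns `i` and `j`, pandas can build the whole square matrix in one call with `DataFrame.corr()`, no loops needed. A sketch on made-up numbers:

```python
import pandas as pd

# Toy stand-in for raw_data with three of the element columns
raw_data = pd.DataFrame({
    'Zn': [87, 70, 75, 70],
    'Pb': [7, 6, 5, 6],
    'Cu': [42, 56, 69, 54],
})

# Pearson correlation of every column pair in one call
corr = raw_data[['Zn', 'Pb', 'Cu']].corr()
```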
Used code and file: https://github.com/CaioEuzebio/Python-DataScience-MachineLearning/tree/master/SalesLogistics
I am working on an analysis using pandas. Basically, I need to group the orders by the quantity of products they contain and by the products themselves.
Example: I have order 1 and order 2, and both contain product A and product B. Using the product list and product quantity as a key, I will create a pivot that indexes each combination of products and returns the orders that contain the same products.
The general objective of the analysis is to obtain a dataframe as follows:
dfFinal
listProds Ordens NumProds
[prod1,prod2,prod3] 1 3
2
3
[prod1,prod3,prod5] 7 3
15
25
[prod5] 8 1
3
So far the code looks like this.
Setting the 'Ordem' column as the index so that the first pivot can be made:
df1.index=df1['Ordem']
df3 = df1.assign(col=df1.groupby(level=0).Produto.cumcount()).pivot(columns='col', values='Produto')
With this pivot I get the dataframe below.
df3 =
col 0 1 2 3 4 5 6 7 8 9 ... 54 55 56 57 58 59 60 61 62 63
Ordem
10911KD YIZ12FF-A YIZ12FF-A YIIE2FF-A YIR72FF-A YIR72FF-A YIR72FF-A NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
124636 HYY32ZY-A HYY32ZY-A NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1719KD5 YI742FF-A YI742FF-A YI742FF-A YI742FF-A NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22215KD YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI6E2FF-A YI6E2FF-A YI6E2FF-A NaN ... NaN NaN NaN NaN NaN
When I finish running the code, NaN values appear, and I need to remove them from the rows so that they don't influence the analysis I'm doing.
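Not part of the original post, but one way to get rid of the NaNs is to collapse each pivoted row into a list containing only its non-NaN products, e.g. with a row-wise `dropna()`. A sketch on toy data standing in for df1:

```python
import pandas as pd

# Toy stand-in for df1 (order / product pairs)
df1 = pd.DataFrame({'Ordem': [1, 1, 2, 2, 2],
                    'Produto': ['A', 'B', 'A', 'B', 'C']})
df1.index = df1['Ordem']

# Same pivot as in the question
df3 = (df1.assign(col=df1.groupby(level=0).Produto.cumcount())
          .pivot(columns='col', values='Produto'))

# Collapse each row to a NaN-free list of products
prod_lists = df3.apply(lambda row: row.dropna().tolist(), axis=1)
```

The resulting Series (one product list per order) can then be used directly as the grouping key for the rest of the analysis.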