Pandas: Iterate over two columns per iteration - python

Does anyone know how to iterate over a pandas DataFrame two columns at a time?
Say I have
a b c d
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
So something like
for x, y in ...:
    correlation of x and y
So output will be
corr_ab corr_bc corr_cd
0.1 0.3 -0.4

You can zip the columns with the same columns shifted by one to get adjacent pairs, build a dictionary of one-element lists with Series.corr (using f-strings for the column names), and pass it to the DataFrame constructor:
L = {f'corr_{col1}{col2}': [df[col1].corr(df[col2])]
     for col1, col2 in zip(df.columns, df.columns[1:])}
df = pd.DataFrame(L)
print(df)
corr_ab corr_bc corr_cd
0 0.860108 0.61333 0.888523
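For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same zip approach, assuming the sample values from the question. The names `pairs` and `result` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "a": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "b": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
    "c": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7],
    "d": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4],
})

# Pair each column with its right-hand neighbour: (a, b), (b, c), (c, d)
pairs = list(zip(df.columns, df.columns[1:]))

# One-row frame holding one Pearson correlation per adjacent pair
result = pd.DataFrame(
    {f"corr_{left}{right}": [df[left].corr(df[right])] for left, right in pairs}
)
print(result)
```

Keeping the result in a separate variable avoids overwriting the original `df`, which the answer's snippet does.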

You can use df.corr to get the correlation matrix of the DataFrame, then use mask to hide the repeated correlations (the diagonal and the lower triangle), and finally stack the result to make it more readable. Assuming you have data like this:
0 1 2 3 4
0 11 6 17 2 3
1 3 12 16 17 5
2 13 2 11 10 0
3 8 12 13 18 3
4 4 3 1 0 18
Finding the correlation,
dataCorr = data.corr(method='pearson')
We get,
0 1 2 3 4
0 1.000000 -0.446023 0.304108 -0.136610 -0.674082
1 -0.446023 1.000000 0.563112 0.773013 -0.258801
2 0.304108 0.563112 1.000000 0.494512 -0.823883
3 -0.136610 0.773013 0.494512 1.000000 -0.545530
4 -0.674082 -0.258801 -0.823883 -0.545530 1.000000
Masking out repeated correlations,
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
We get
0 1 2 3 4
0 NaN -0.446023 0.304108 -0.136610 -0.674082
1 NaN NaN 0.563112 0.773013 -0.258801
2 NaN NaN NaN 0.494512 -0.823883
3 NaN NaN NaN NaN -0.545530
4 NaN NaN NaN NaN NaN
Stacking the correlated data
dataCorr = dataCorr.stack().reset_index()
The stacked data will look as shown
level_0 level_1 0
0 0 1 -0.446023
1 0 2 0.304108
2 0 3 -0.136610
3 0 4 -0.674082
4 1 2 0.563112
5 1 3 0.773013
6 1 4 -0.258801
7 2 3 0.494512
8 2 4 -0.823883
9 3 4 -0.545530
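The same idea can be sketched in one chain that also gives the stacked columns readable names. This is only a variant, not the answerer's exact code: it keeps the strict upper triangle with `np.triu` and `where` instead of masking the lower triangle, and the names `col_x`, `col_y` and `corr` are my own choice. The explicit `dropna()` is there as a precaution in case `stack` keeps the NaN entries.

```python
import numpy as np
import pandas as pd

# Sample data from the answer, one list per column
data = pd.DataFrame({
    0: [11, 3, 13, 8, 4],
    1: [6, 12, 2, 12, 3],
    2: [17, 16, 11, 13, 1],
    3: [2, 17, 10, 18, 0],
    4: [3, 5, 0, 3, 18],
})

corr = data.corr(method="pearson")

# Boolean matrix that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = (
    corr.where(upper)                  # keep each pair once, NaN elsewhere
    .stack()                           # long format indexed by (row, column)
    .dropna()                          # drop any masked entries that remain
    .rename_axis(["col_x", "col_y"])   # illustrative names for the two levels
    .reset_index(name="corr")
)
print(pairs)
```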

Related

Closest non equal row in a column in Pandas dataframe

I have this df
d={}
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
I would like to create a column that holds the next non-equal value of column qty. That is, if qty is 5 and the next row is also 5, I skip it and keep looking until I find the next value not equal to 5; in my case that is 6. All of this should be grouped by id.
Here is the desired dataframe.
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2']=[6,6,6,6,6,5,'NAN','NAN',2,2,3,3,3,5,8,'NAN']
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
output:
id qty qty2
0 1 5 6.0
1 1 5 6.0
2 1 5 6.0
3 1 5 6.0
4 1 5 6.0
5 1 6 5.0
6 1 5 NaN
7 1 5 NaN
8 2 1 2.0
9 2 1 2.0
10 2 2 3.0
11 2 2 3.0
12 2 2 3.0
13 2 3 5.0
14 2 5 8.0
15 2 8 NaN
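Since the question only builds a dict, here is a small runnable sketch of the same shift / where / bfill steps starting from a DataFrame. The intermediate names `nxt` and `changed` are just for illustration.

```python
import pandas as pd

# Data from the question, built directly as a DataFrame
df = pd.DataFrame({
    "id": ["1"] * 8 + ["2"] * 8,
    "qty": [5, 5, 5, 5, 5, 6, 5, 5, 1, 1, 2, 2, 2, 3, 5, 8],
})

# Value of the next row within each id (NaN on the last row of a group)
nxt = df.groupby("id")["qty"].shift(-1)

# Keep the next value only where it differs from the current one
changed = nxt.where(df["qty"].ne(nxt))

# Propagate each kept value backwards over the run of equal values, per id
df["qty2"] = changed.groupby(df["id"]).bfill()
print(df)
```

Grouping both the shift and the bfill by id is what stops a value from group 2 leaking into the trailing rows of group 1.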

Pandas countif based on multiple conditions, result in new column

How can I add a field that returns 1/0 if the value in any specified column is not NaN?
Example:
df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
'val1': [2,2,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2],
'val2': [7,0.2,5,8,np.nan,1,0,np.nan,1,1],
})
display(df)
mycols = ['val1', 'val2']
# if entry in mycols != np.nan, then df[row, 'countif'] =1; else 0
Desired output dataframe:
We do not need countif logic in pandas; try notna + any:
df['out'] = df[['val1','val2']].notna().any(axis=1).astype(int)
df
Out[381]:
id val1 val2 out
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
Use the iloc accessor to select the last two columns, check whether the count of non-NaN values in each row is greater than zero, and convert the resulting Booleans to integers.
df['countif']=df.iloc[:,1:].notna().sum(axis=1).gt(0).astype(int)
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
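Both answers compute the same flag. A minimal sketch, assuming the `mycols` list from the question is what selects the columns to check (rather than hard-coded names or positions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "val1": [2, 2, np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2],
    "val2": [7, 0.2, 5, 8, np.nan, 1, 0, np.nan, 1, 1],
})
mycols = ["val1", "val2"]

# 1 if at least one of the selected columns is not NaN in that row, else 0
df["countif"] = df[mycols].notna().any(axis=1).astype(int)

# Equivalent formulation: count the non-NaN values and test for more than zero
alt = df[mycols].notna().sum(axis=1).gt(0).astype(int)
print(df)
```

Selecting by name keeps the result stable if more columns are added later, which positional slicing with iloc would not.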

How to transform weekly data to daily for specific columns using Python

I am a newbie at python and programming in general. I hope the following question is well explained.
I have a big dataset with 80+ columns, and some of these columns only have data on a weekly basis. I would like to transform these columns to have values on a daily basis by simply dividing the weekly value by 7 and assigning the result to that row and to the 6 other days of that week.
This is what my input dataset looks like:
date col1 col2 col3
02-09-2019 14 NaN 1
09-09-2019 NaN NaN 2
16-09-2019 NaN 7 3
23-09-2019 NaN NaN 4
30-09-2019 NaN NaN 5
07-10-2019 NaN NaN 6
14-10-2019 NaN NaN 7
21-10-2019 21 NaN 8
28-10-2019 NaN NaN 9
04-11-2019 NaN 14 10
11-11-2019 NaN NaN 11
..
This is what the output should look like:
date col1 col2 col3
02-09-2019 2 NaN 1
09-09-2019 2 NaN 2
16-09-2019 2 1 3
23-09-2019 2 1 4
30-09-2019 2 1 5
07-10-2019 2 1 6
14-10-2019 2 1 7
21-10-2019 3 1 8
28-10-2019 3 1 9
04-11-2019 3 2 10
11-11-2019 3 2 11
..
I can't come up with a solution, but here is what I thought might work:
def convert_to_daily(df):
    for column in df.columns.tolist():
        if df[column].isna().any():  # column has missing values
            for line in range(len(df[column])):
                # check if value is not empty and succeeded by 6 empty values,
                # or some better logic
                # I don't know how to do that.
                pass
I believe you need to select the columns that contain at least one missing value, forward fill the missing values, and divide by 7:
m = df.isna().any()
df.loc[:, m] = df.loc[:, m].ffill(limit=7).div(7)
print (df)
date col1 col2 col3
0 02-09-2019 2.0 NaN 1
1 09-09-2019 2.0 NaN 2
2 16-09-2019 2.0 1.0 3
3 23-09-2019 2.0 1.0 4
4 30-09-2019 2.0 1.0 5
5 07-10-2019 2.0 1.0 6
6 14-10-2019 2.0 1.0 7
7 21-10-2019 3.0 1.0 8
8 28-10-2019 3.0 1.0 9
9 04-11-2019 3.0 2.0 10
10 11-11-2019 3.0 2.0 11
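Here is a self-contained sketch of that approach on the sample rows from the question. It assumes, as the answer does, that any column containing a NaN is a weekly column, and it keeps the answer's `limit=7`; the variable name `weekly_cols` is mine.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "date": ["02-09-2019", "09-09-2019", "16-09-2019", "23-09-2019",
             "30-09-2019", "07-10-2019", "14-10-2019", "21-10-2019",
             "28-10-2019", "04-11-2019", "11-11-2019"],
    "col1": [14] + [nan] * 6 + [21] + [nan] * 3,
    "col2": [nan, nan, 7] + [nan] * 6 + [14, nan],
    "col3": list(range(1, 12)),
})

# Assumption: a column with at least one NaN holds weekly totals
weekly_cols = df.columns[df.isna().any()]

# Carry each weekly total forward over the following rows, then split it by 7
df[weekly_cols] = df[weekly_cols].ffill(limit=7).div(7)
print(df)
```

Leading NaNs (before the first weekly value in a column) stay NaN, because a forward fill has nothing to carry into them.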

pandas filling nan with previous row value multiplied with another column

I have a dataframe in which I want to fill the NaN values with the value from the previous row multiplied by the pct_change column
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 nan 0.5
4 nan 1.3
5 nan 2
6 5 3
So for the row at index 3, 10*0.5 = 5, and that filled value is then used to fill the next row if it is also NaN.
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 5 0.5
4 6.5 1.3
5 13 2
6 5 3
I have used this
while df['col_to_fill'].isna().sum() > 0:
df.loc[df['col_to_fill'].isna(), 'col_to_fill'] = df['col_to_fill'].shift(1) * df['pct_change']
but it's taking too much time, as each pass of the loop only fills the rows whose previous row is non-NaN.
Try with cumprod after ffill:
s = df.col_to_fill.ffill()*df.loc[df.col_to_fill.isna(),'pct_change'].cumprod()
df['col_to_fill'] = df['col_to_fill'].fillna(s)
df
Out[90]:
col_to_fill pct_change
0 1.0 NaN
1 2.0 1.0
2 10.0 0.5
3 5.0 0.5
4 6.5 1.3
5 13.0 2.0
6 5.0 3.0
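One thing to be aware of: the cumulative product above runs over all missing rows at once, so if the column has more than one gap, the factors from an earlier gap carry into a later one. A possible variant that restarts the product for every gap is sketched below. The helper name `fill_compounded` and the `block` labels are my own, not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_compounded(values: pd.Series, factors: pd.Series) -> pd.Series:
    """Fill NaNs with the last known value compounded by the factors in that gap."""
    missing = values.isna()
    # Every non-missing value starts a new block; the NaNs after it share its label
    block = values.notna().cumsum()
    # Cumulative product of the factors, restarted in each block, NaN rows only
    growth = factors[missing].groupby(block[missing]).cumprod()
    # Rows outside the gaps are NaN in the product and are left untouched by fillna
    return values.fillna(values.ffill() * growth)

df = pd.DataFrame({
    "col_to_fill": [1, 2, 10, np.nan, np.nan, np.nan, 5],
    "pct_change": [np.nan, 1.0, 0.5, 0.5, 1.3, 2, 3],
})
df["col_to_fill"] = fill_compounded(df["col_to_fill"], df["pct_change"])
print(df)
```

With a single gap, as in the question, this gives the same result as the global cumprod.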

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in both position and length...
set_index and reset_index are your friends.
from numpy import arange
from pandas import DataFrame, Index
df = DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index, here the missing data is filled in with nans. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
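Because the gaps vary between datasets, it may be handy to wrap the three steps in a function that works out the grid by itself. The sketch below is one way to do that, under the assumption that the smallest difference between consecutive values of the column is the regular step. The function name `fill_gaps` is hypothetical.

```python
import numpy as np
import pandas as pd

def fill_gaps(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Insert NaN rows so that `col` runs on a regular grid from min to max."""
    # Assumption: the smallest spacing seen in the data is the grid step
    step = df[col].diff().min()
    start, stop = df[col].min(), df[col].max()
    n_points = int(round((stop - start) / step)) + 1
    # Naming the index lets reset_index turn it back into the column
    grid = pd.Index(start + step * np.arange(n_points), name=col)
    return df.set_index(col).reindex(grid).reset_index()

df = pd.DataFrame({
    "A": [0, 0.5, 1.0, 3.5, 4.0, 4.5],
    "B": [1, 4, 6, 2, 4, 3],
    "C": [3, 2, 1, 0, 5, 3],
})
filled = fill_gaps(df, "A")
print(filled)
```

Note that reindex matches the float labels exactly. That is fine for a step like 0.5, but for steps such as 0.1 you may need to round the column and the grid to the same number of decimals first.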
Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df\
      .merge(how='right', on=field,
             right = pd.DataFrame({field:np.arange(range_from, range_to, range_step)}))\
      .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage (range_to is exclusive, as in np.arange, so pass 5.0 to keep the row with A = 4.5):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I generate a new DataFrame holding the complete range of A values, merge it with your original df, and then re-sort it:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes a start value, an end value, and a step. Note that I added 0.5 to the end because the range is half-open: the end value itself is excluded.
A more general method could be like this:
In [197]:
step = 0.5
new_index = pd.Index(np.arange(df['A'].iloc[0], df['A'].iloc[-1] + step, step), name='A')
df = df.set_index('A').reindex(new_index).reset_index()
df
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A and then reindex the df using the arange function; naming the new index A lets reset_index restore it as a regular column.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df.loc[i, j] = np.nan
will do the trick.
