Assuming you have a conventional pandas DataFrame
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
I would like to calculate the mean of each pair of consecutive rows.
The above pandas DataFrame looks like the following:
>>>df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Is there an elegant solution to calculate the mean of each pair of consecutive rows and obtain the following output?
>>>df2_mean
a b c
0 1 2 3
1 2.5 3.5 4.5
2 4 5 6
3 5.5 6.5 7.5
4 7 8 9
Use DataFrame.rolling with mean and concatenate the result with the original:
df = pd.concat([df2.rolling(2).mean().dropna(how='all'), df2]).sort_index(ignore_index=True)
print (df)
a b c
0 1.0 2.0 3.0
1 2.5 3.5 4.5
2 4.0 5.0 6.0
3 5.5 6.5 7.5
4 7.0 8.0 9.0
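To see why dropna(how='all') is needed: the first row has no preceding row, so its rolling mean is all NaN and gets dropped before the concat. A quick check, assuming the df2 defined above:
print (df2.rolling(2).mean())
     a    b    c
0  NaN  NaN  NaN
1  2.5  3.5  4.5
2  5.5  6.5  7.5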
I am trying to add rows to a DataFrame interpolating values in a column by group, and fill with missing all other columns. My data looks something like this:
import pandas as pd
import random
random.seed(42)
data = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
        'value': [1, 2, 5, 3, 4, 5, 7, 4, 7, 9],
        'other': random.sample(range(1, 100), 10)}
df = pd.DataFrame(data)
print(df)
group value other
0 a 1 82
1 a 2 15
2 a 5 4
3 b 3 95
4 b 4 36
5 b 5 32
6 b 7 29
7 c 4 18
8 c 7 14
9 c 9 87
What I am trying to achieve is something like this:
group value other
a 1 82
a 2 15
a 3 NaN
a 4 NaN
a 5 NaN
b 3 95
b 4 36
b 5 32
b 6 NaN
b 7 29
c 4 18
c 5 NaN
c 6 NaN
c 7 14
c 8 NaN
c 9 87
For example, group a has a range from 1 to 5, b from 3 to 7, and c from 4 to 9.
The issue I'm having is that each group has a different range. I found something that works if a single range is assumed for all groups: use the global min and max and then drop the extra rows in each group. But since my data is fairly large, adding that many rows per group quickly becomes infeasible.
>>> df.groupby('group').apply(lambda x: x.set_index('value').reindex(np.arange(x['value'].min(), x['value'].max() + 1))).drop(columns='group').reset_index()
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0
We group on the group column and then re-index each group with the range from the min to the max of the value column.
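For readability, the same logic could be spelled out step by step. This is only a sketch equivalent to the one-liner above; the helper name expand_group is chosen here for illustration:
import numpy as np

def expand_group(g):
    # build the full integer range for this group's 'value' column
    full_range = np.arange(g['value'].min(), g['value'].max() + 1)
    return g.set_index('value').reindex(full_range)

out = (df.groupby('group')
         .apply(expand_group)
         .drop(columns='group')   # drop the duplicated group column before restoring the index
         .reset_index())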
One option is with the complete function from pyjanitor, which can be helpful in exposing explicitly missing rows (and can be helpful as well in abstracting the reshaping process):
# pip install pyjanitor
import pandas as pd
import janitor
new_value = {'value' : lambda df: range(df.min(), df.max()+1)}
# expose the missing values per group via the `by` parameter
df.complete(new_value, by='group', sort = True)
group value other
0 a 1 82.0
1 a 2 15.0
2 a 3 NaN
3 a 4 NaN
4 a 5 4.0
5 b 3 95.0
6 b 4 36.0
7 b 5 32.0
8 b 6 NaN
9 b 7 29.0
10 c 4 18.0
11 c 5 NaN
12 c 6 NaN
13 c 7 14.0
14 c 8 NaN
15 c 9 87.0
I am relatively new to Python and I am wondering how I can merge these two tables while preserving the values from both.
Consider these two tables:
df = pd.DataFrame([[1, 3], [2, 4],[2.5,1],[5,6],[7,8]], columns=['A', 'B'])
A B
1 3
2 4
2.5 1
5 6
7 8
df2 = pd.DataFrame([[1],[2],[3],[4],[5],[6],[7],[8]], columns=['A'])
A
1
2
...
8
I want to obtain the following result:
A B
1 3
2 4
2.5 1
3 NaN
4 NaN
5 6
6 NaN
7 8
8 NaN
You can see that column A includes all values from both the first and second dataframe in an ordered manner.
I have attempted:
pd.merge(df,df2,how='outer')
pd.merge(df,df2,how='right')
But the former does not result in an ordered dataframe and the latter does not include rows that are unique to df.
Let us do concat, then drop_duplicates:
out = pd.concat([df2,df]).drop_duplicates('A',keep='last').sort_values('A')
Out[96]:
A B
0 1.0 3.0
1 2.0 4.0
2 2.5 1.0
2 3.0 NaN
3 4.0 NaN
3 5.0 6.0
5 6.0 NaN
4 7.0 8.0
7 8.0 NaN
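As a side note, the outer merge attempted in the question appears to be only a sort away from the desired result. A minimal sketch, assuming the df and df2 defined above:
out = (pd.merge(df, df2, how='outer')
         .sort_values('A')
         .reset_index(drop=True))   # optional: clean 0..n index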
I have 3 dataframes like this,
df = pd.DataFrame([[1, 3], [2, 4], [3,6], [4,12], [5,18]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [2, 6], [3,9]], columns=['A', 'C'])
df3 = pd.DataFrame([[4, 15, "hello"], [5, 19, "yes"]], columns=['A', 'C', 'D'])
They look like this,
df
A B
0 1 3
1 2 4
2 3 6
3 4 12
4 5 18
df2
A C
0 1 5
1 2 6
2 3 9
df3
A C D
0 4 15 hello
1 5 19 yes
My merges. First merge:
f_merge = pd.merge(df, df2, on='A',how='left')
Second merge (f_merge with df3):
s_merge = pd.merge(f_merge, df3, on='A', how='left')
I get the output like this,
A B C_x C_y D
0 1 3 5.0 NaN NaN
1 2 4 6.0 NaN NaN
2 3 6 9.0 NaN NaN
3 4 12 NaN 15.0 hello
4 5 18 NaN 19.0 yes
I need it like this:
A B C D
0 1 3 5.0 NaN
1 2 4 6.0 NaN
2 3 6 9.0 NaN
3 4 12 15.0 hello
4 5 18 19.0 yes
How can I achieve this output? Any suggestion would be great.
Concat df2 and df3 before merging.
new_df = pd.merge(df, pd.concat([df2, df3], ignore_index=True), on='A')
new_df
Out:
A B C D
0 1 3 5 NaN
1 2 4 6 NaN
2 3 6 9 NaN
3 4 12 15 hello
4 5 18 19 yes
We can use combine_first:
df.set_index('A',inplace=True)
df2.set_index('A').combine_first(df).combine_first(df3.set_index('A'))
B C D
A
1 3.0 5.0 NaN
2 4.0 6.0 NaN
3 6.0 9.0 NaN
4 12.0 15.0 hello
5 18.0 19.0 yes
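Another option, given here only as a sketch built on the asker's original two left merges, is to coalesce the suffixed C columns afterwards:
s_merge['C'] = s_merge['C_x'].fillna(s_merge['C_y'])   # take C_x where present, else C_y
result = s_merge.drop(columns=['C_x', 'C_y'])[['A', 'B', 'C', 'D']]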
I have a pandas dataframe like
a b c
0 0.5 10 7
1 1.0 6 6
2 2.0 1 7
3 2.5 6 -5
4 3.5 9 7
and I would like to fill in the missing rows with respect to column 'a' on the basis of a certain step. In this case, given a step of 0.5, I would like to fill column 'a' with the missing values, that is 1.5 and 3.0, and set the other columns to NaN, in order to obtain the following result.
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0
What is the cleanest way to do this with pandas or other libraries like numpy or scipy?
Thanks!
Create the new index values with numpy.arange, move 'a' into the index with set_index, then reindex and finally reset_index:
step = 0.5
idx = np.arange(df['a'].min(), df['a'].max() + step, step)
df = df.set_index('a').reindex(idx).reset_index()
print (df)
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0
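One caveat worth adding: reindex needs exact matches on the float index. A step of 0.5 is exactly representable in binary, but for a step like 0.1 the generated values can drift, so rounding the index may be safer. A defensive sketch, not required for this example:
idx = np.round(np.arange(df['a'].min(), df['a'].max() + step, step), 6)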
One simple way to achieve this is to first create the index you want and then merge the rest of the information onto it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.5, 1, 2, 2.5, 3.5],
                   'b': [10, 6, 1, 6, 9],
                   'c': [7, 6, 7, -5, 7]})
# extend the stop value by one step so that df.a.max() (3.5) is included
ls = np.arange(df.a.min(), df.a.max() + 0.5, 0.5)
new_df = pd.DataFrame({'a': ls})
new_df = new_df.merge(df, on='a', how='left')
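With the upper bound extended past df.a.max() as above, new_df should match the desired output shown in the question.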
I have a Pandas dataframe that looks like:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Dummy_Var": [1]*12,
                   "B": [6, 143.3, 143.3, 143.3, 3, 4, 93.9, 93.9, 93.9, 2, 2, 7],
                   "C": [4.1, 23.2, 23.2, 23.2, 4.3, 2.5, 7.8, 7.8, 2, 7, 7, 7]})
B C Dummy_Var
0 6.0 4.1 1
1 143.3 23.2 1
2 143.3 23.2 1
3 143.3 23.2 1
4 3.0 4.3 1
5 4.0 2.5 1
6 93.9 7.8 1
7 93.9 7.8 1
8 93.9 2.0 1
9 2.0 7.0 1
10 2.0 7.0 1
11 7.0 7.0 1
Whenever the same number shows up three or more times in a row, that data should be replaced with NaN. So the result should be:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1
I have written a function that does that:
def non_sense_remover(df, examined_columns, allowed_repeating):
    def count_each_group(grp, column):
        grp['Count'] = grp[column].count()
        return grp
    for col in examined_columns:
        sel = df.groupby((df[col] != df[col].shift(1)).cumsum()).apply(count_each_group, column=col)["Count"] > allowed_repeating
        df.loc[sel, col] = np.nan
    return df
df = non_sense_remover(df, ["B", "C"], 2)
However, my real dataframe has 2M rows and 18 columns! Running this function over 2M rows is very, very slow. Is there a more efficient way to do this? Am I missing something? Thanks in advance.
Constructing a boolean mask in this situation will be far more efficient than a solution based on apply(), particularly for large datasets. Here is an approach:
cols = df[['B', 'C']]
mask = (cols.shift(-1) == cols) & (cols.shift(1) == cols)
df[mask | mask.shift(1, fill_value=False) | mask.shift(-1, fill_value=False)] = np.nan
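The mask flags any value that equals both of its neighbours, i.e. the interior of a run of three or more equal values, and the final line widens that flag by one position in each direction so the whole run is replaced.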
Edit:
For a more general approach, replacing sequences of length N with NaN, you could do something like this:
from functools import reduce
from operator import or_, and_
def replace_sequential_duplicates_with_nan(df, N):
    # columns to examine (hardcoded to B and C to match the snippet above)
    cols = df[['B', 'C']]
    # True at the last position of any run of at least N equal values
    mask = reduce(and_, [cols.shift(i) == cols.shift(i + 1)
                         for i in range(N - 1)])
    # propagate the flag backwards so every position in the run is marked;
    # fill_value=False keeps the mask boolean after shifting
    full_mask = reduce(or_, [mask.shift(-i, fill_value=False)
                             for i in range(N)])
    df[full_mask] = np.nan
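A possible usage sketch; the function modifies df in place and the columns to examine are hardcoded inside it:
replace_sequential_duplicates_with_nan(df, 3)   # blank out runs of 3 or more
print(df)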
We are using groupby + mask:
m=df[['B','C']]
df[['B','C']]=m.mask(m.apply(lambda x : x.groupby(x.diff().ne(0).cumsum()).transform('count'))>2)
df
Out[1245]:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1
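Roughly what the one-liner does, spelled out for a single column as a sketch:
run_id = df['B'].diff().ne(0).cumsum()                  # label runs of consecutive equal values
run_len = df['B'].groupby(run_id).transform('count')    # length of the run each row belongs to
df['B'] = df['B'].mask(run_len > 2)                     # NaN out rows in runs longer than 2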
From this link, it appears that using apply/transform (in your case, apply) is causing the biggest bottleneck here. The link I referenced goes into much more detail about why this is and how to solve it.