I'm pretty new to Pandas and programming in general but I've always been able to find the answer to any problem through google until now. Sorry about the not terribly descriptive question, hopefully someone can come up with something clearer.
I'm trying to group data together, perform functions on that data, update a column and then use the data from that column on the next group of data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(9),columns=['A'])
df['B'] = [1,1,1,2,2,3,3,3,3]
df['C'] = np.nan
df['D'] = np.nan
df.loc[0:2,'C'] = 500
Giving me
A B C D
0 0.825828 1 500.0 NaN
1 0.218618 1 500.0 NaN
2 0.902476 1 500.0 NaN
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
The 500 in column C is the initial condition. I want to group the data by column B and perform the following function on the first group
def function1(row):
    return row['A'] * row['C'] / 6
giving me
A B C D
0 0.825828 1 500.0 68.818971
1 0.218618 1 500.0 18.218145
2 0.902476 1 500.0 75.206313
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then want to sum the first three values in D, add that sum to the last value of C in group 1, and use the result as the C value for group 2
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 NaN
4 0.513505 2 662.243429 NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then perform function1 on group 2 and repeat until I end up with this
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 49.946896
4 0.513505 2 662.243429 56.677505
5 0.089975 3 768.867830 11.529874
6 0.282479 3 768.867830 36.198113
7 0.774286 3 768.867830 99.220591
8 0.408501 3 768.867830 52.347246
The dataframe will consist of hundreds of rows. I've been trying various groupby, apply combinations but I'm completely stumped.
Thanks
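Here is a solution:
df['D'] = df['A'] * df['C'] / 6
for i in df['B'].unique()[1:]:
    mask = df['B'] == i
    # C for this group = initial C plus the sum of every D computed so far
    df.loc[mask, 'C'] = df['C'].iloc[0] + df['D'].sum()
    df.loc[mask, 'D'] = df['A'] * df['C'] / 6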
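As a cross-check, the same recurrence can be written without a Python loop. Each group multiplies C by (1 + sum of A in that group / 6), so C for a group is the initial value times the cumulative product of the earlier groups' factors. The sketch below is only one way to express that; the names initial_c, growth and c_per_group are made up for illustration, and the A values are copied (rounded) from the question's printed output.

```python
import pandas as pd

# A values copied from the question's sample output (rounded to 6 decimals)
df = pd.DataFrame({
    "A": [0.825828, 0.218618, 0.902476, 0.452525, 0.513505,
          0.089975, 0.282479, 0.774286, 0.408501],
    "B": [1, 1, 1, 2, 2, 3, 3, 3, 3],
})
initial_c = 500.0  # initial condition for the first group (hypothetical name)

# Each group scales C by (1 + sum(A in group) / 6)
growth = 1 + df.groupby("B", sort=False)["A"].sum() / 6

# C of a group = initial value times the growth of all earlier groups
c_per_group = initial_c * growth.cumprod().shift(fill_value=1.0)

df["C"] = df["B"].map(c_per_group)
df["D"] = df["A"] * df["C"] / 6
print(df)
```

This assumes the groups in B appear in the order in which they should be processed.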
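You can use numpy.unique() for the selection. In your code this might look something like this (a sketch that assumes B is sorted, so the rows of each group are contiguous):
import numpy as np
unique, indices, counts = np.unique(df['B'], return_index=True, return_counts=True)
prev_rows = None
for start, count in zip(indices, counts):
    rows = df.index[start:start + count]
    if prev_rows is not None and df.loc[rows, 'C'].isna().all():
        # carry over the previous group's C plus the sum of its D values
        df.loc[rows, 'C'] = df.loc[prev_rows[-1], 'C'] + df.loc[prev_rows, 'D'].sum()
    # then call your function on every row of the group
    df.loc[rows, 'D'] = df.loc[rows].apply(function1, axis=1)
    prev_rows = rows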
Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
   0       1
   a   b   a   b
0  0 NaN   1 NaN
1  2 NaN   3 NaN
2  4 NaN   5 NaN
3  6 NaN   7 NaN
4  8 NaN   9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do this with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)]  # keep the entries that pair each column with the first element of extra_level
x = x.reindex(columns=new_cols)
print(x)
   0       1
   a   b   a   b
0  0 NaN   1 NaN
1  2 NaN   3 NaN
2  4 NaN   5 NaN
3  6 NaN   7 NaN
4  8 NaN   9 NaN
Very much like @Ben.T, I am using MultiIndex.from_product:
(x.assign(l='a')
  .set_index('l', append=True)
  .unstack()
  .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a', 'b']]), axis=1))
Output:
   0       1
   a   b   a   b
0  0 NaN   1 NaN
1  2 NaN   3 NaN
2  4 NaN   5 NaN
3  6 NaN   7 NaN
4  8 NaN   9 NaN
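To see that the approach does not depend on the frame having exactly two columns, here is a small sketch that wraps the from_product idea in a function and runs it on a three-column frame. The helper name add_sublevel is hypothetical and not part of pandas; it is only meant to show the column bookkeeping.

```python
import numpy as np
import pandas as pd

def add_sublevel(frame, labels):
    """Hypothetical helper: keep the data under labels[0], add NaN columns for the other labels."""
    full = pd.MultiIndex.from_product([frame.columns, labels])
    out = frame.copy()
    # Relabel the existing columns as (original name, first label)
    out.columns = pd.MultiIndex.from_product([frame.columns, labels[:1]])
    # Reindexing against the full product creates the missing columns as NaN
    return out.reindex(columns=full)

x = pd.DataFrame(np.arange(15).reshape(5, 3))
wide = add_sublevel(x, ["a", "b"])
print(wide)
```

Because the function works on a copy, the original x is left unchanged.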
Without using apply (because the dataframe is too big), how can I get the previous non-NaN value of a specific column to use in a calculation?
For example, this dataframe:
df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])
A B C
0 A 1 100.0
1 B 2 NaN
2 C 3 NaN
3 D 4 182.0
4 E 5 NaN
I need to calculate the difference between the value in column 'C' on line 3 and the value on line 0.
The number of NaN values between the valid values varies, so I don't think .shift() is applicable here.
I need something like: df['D'] = df.C - df.C[previous_not_nan] (on line 3 the result would be 82).
Use dropna + diff:
df['D'] = df['C'].dropna().diff()
A B C D
0 A 1 100.0 NaN
1 B 2 NaN NaN
2 C 3 NaN NaN
3 D 4 182.0 82.0
4 E 5 NaN NaN
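The one-liner works because assigning a Series to a column aligns on the index, so the rows that dropna removed simply stay NaN. The sketch below spells out the intermediate steps and also shows an alternative that avoids dropping rows, assuming the goal is always "current value minus the most recent earlier non-NaN value". The names valid, steps, previous_valid and D_alt are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([['A', 1, 100], ['B', 2, None], ['C', 3, None],
                   ['D', 4, 182], ['E', 5, None]], columns=['A', 'B', 'C'])

valid = df['C'].dropna()   # keeps only index labels 0 and 3
steps = valid.diff()       # difference between consecutive valid values
df['D'] = steps            # assignment aligns on the index; other rows get NaN

# Alternative: forward-fill, shift by one row, then subtract
previous_valid = df['C'].ffill().shift()
df['D_alt'] = df['C'] - previous_valid
print(df)
```

Both columns agree on this sample; rows where C itself is NaN stay NaN in either version.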
I have a multi-columned dataframe which holds several numerical values that are the same. It looks like the following:
A B C D
0 1 1 10 1
1 1 1 20 2
2 1 5 30 3
3 2 2 40 4
4 2 3 50 5
This is great, however, I need to make A the index and B the column. The problem is that the column is aggregated and is averaged for every identical value of B.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 5, 2, 3],
                   'C': [10, 20, 30, 40, 50],
                   'D': [1, 2, 3, 4, 5]})
# pivot_table aggregates duplicate (A, B) pairs with the mean by default
transposed_df = df.pivot_table(index=['A'], columns=['B'])
Instead of keeping 10 and 20 across B1, it averages the two to 15.
      C                    D
B     1     2     3     5    1    2    3    5
A
1  15.0   NaN   NaN  30.0  1.5  NaN  NaN  3.0
2   NaN  40.0  50.0   NaN  NaN  4.0  5.0  NaN
Is there any way I can Keep column B the same and display every value of C and D using Pandas, or am I better off writing my own function to do this? Also, it is very important that the index and column stay the same because only one of each number can exist.
EDIT: This is the desired output. I understand that this exact layout probably isn't possible, but it shows that 10 and 20 need to both be in column 1 and index 1.
           C                           D
B          1     2     3     5         1    2    3    5
A
1  10.0,20.0   NaN   NaN  30.0   1.0,2.0  NaN  NaN  3.0
2        NaN  40.0  50.0   NaN       NaN  4.0  5.0  NaN
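One possible way to get close to that layout, assuming it is acceptable for a cell to hold a Python list, is to collect the values per (A, B) pair with groupby and then unstack B into the columns. This is only a sketch; the names collected and wide are illustrative, and list-valued cells are awkward for later numeric work.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 5, 2, 3],
                   'C': [10, 20, 30, 40, 50],
                   'D': [1, 2, 3, 4, 5]})

# Collect every value of each (A, B) pair into a list instead of averaging
collected = df.groupby(['A', 'B'])[['C', 'D']].agg(list)

# Move B from the row index into the columns; missing pairs become NaN
wide = collected.unstack('B')
print(wide)
```

Here A stays unique in the index and each B value appears once per top-level column, which matches the constraint stated in the question.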
I have a question similar to a previous post. I want to replace missing values in A with B if B is not missing. I've used a toy dataset.
#Create sample dataset
import pandas as pd
import numpy as np
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
df[df < 0] = 'NaN'
print(df)
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
#Replace NaN in A with B if B is not NaN
df['A'] = np.where(pd.isnull(df['A']) & pd.notnull(df['B']) == 0, df['B']*1, df['A'])
print(df)
obs A B
0 0.478943 0.478943
1 NaN NaN
2 1.39341 1.39341
3 0.281746 0.281746
4 1.24643 1.24643
5 NaN NaN
6 0.228913 0.228913
7 0.886429 0.886429
8 NaN NaN
9 NaN NaN
This code does the job. But why do I need pd.notnull(df['B']) == 0? If I write:
pd.notnull(df['B'])
instead, the code does not work correctly. The output from that is:
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
I'm trying to understand the flaw in my logic. Any other simple intuitive code will be appreciated.
I basically need to do this simple operation for a very large dataset (100m obs+) so looking for a fast way (in terms of computer processing time) to do it. Thanks in advance.
Replace the string 'NaN' with np.nan, then apply fillna on column A using column B:
df = df.replace('NaN', np.nan)
df['A'] = df['A'].fillna(df['B'])
Output:
A B
0 0.478943 0.478943
1 NaN NaN
2 1.965781 1.393406
3 0.092908 0.281746
4 0.769023 1.246435
5 1.007189 NaN
6 0.274992 0.228913
7 1.352917 0.886429
8 NaN NaN
9 1.669025 NaN
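To spell out why the original condition misbehaved: the string 'NaN' is not a missing value, so pd.isnull never flags it, and because `&` binds tighter than `==` the expression is evaluated as (isnull(A) & notnull(B)) == 0, which is True in every row. That is why A was overwritten with B everywhere in the question's output. The sketch below reproduces this on four of the sample rows; the names raw, clean, string_is_null and cond are illustrative, and pd.to_numeric is used here as one way to get real floats.

```python
import pandas as pd

raw = pd.DataFrame({
    "A": ["NaN", "NaN", 1.96578, 1.00719],
    "B": [0.478943, "NaN", 1.39341, "NaN"],
})

# The string 'NaN' is an ordinary object, so isnull() does not flag it
string_is_null = pd.isnull(raw["A"])

# `&` binds tighter than `==`, so this is (isnull & notnull) == 0
cond = pd.isnull(raw["A"]) & pd.notnull(raw["B"]) == 0

# Convert to real floats; the text 'NaN' becomes a true missing value
clean = raw.apply(pd.to_numeric, errors="coerce")
clean["A"] = clean["A"].fillna(clean["B"])
print(clean)
```

With true NaN values in place, fillna only touches the rows where A is missing and B is not.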
Here is a dataframe
a b c d
nan nan 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
nan nan 2 3
I want to replace the values in columns 'a' and 'b' with 0 only where both of them are NaN. The first and last rows have NaN in both 'a' and 'b', so only those rows should get 0s in those two columns.
so my output must be
a b c d
0 0 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
0 0 2 3
There might be an easier built-in function in Pandas, but this one should work.
mask = np.isnan(df.a) & np.isnan(df.b)
df.loc[mask, ['a', 'b']] = df.loc[mask, ['a', 'b']].fillna(0)
Actually, the solution from @Psidom is much easier to read.
You can create a boolean series based on the conditions on columns a/b, and then use loc to modify corresponding columns and rows:
df.loc[df[['a','b']].isnull().all(axis=1), ['a','b']] = 0
df
# a b c d
#0 0.0 0.0 3 5
#1 NaN 1.0 2 3
#2 1.0 NaN 4 5
#3 2.0 3.0 7 9
#4 0.0 0.0 2 3
Or:
df.loc[df.a.isnull() & df.b.isnull(), ['a','b']] = 0
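For anyone who wants to try this end to end, here is a self-contained sketch that rebuilds the sample frame from the question and applies the row mask. The variable name both_missing is illustrative; the rest follows the loc-based approach described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, np.nan, 1, 2, np.nan],
    "b": [np.nan, 1, np.nan, 3, np.nan],
    "c": [3, 2, 4, 7, 2],
    "d": [5, 3, 5, 9, 3],
})

# True only for rows where 'a' and 'b' are both missing
both_missing = df[["a", "b"]].isna().all(axis=1)

# Overwrite just those rows, and just those two columns
df.loc[both_missing, ["a", "b"]] = 0
print(df)
```

Rows where only one of the two columns is NaN are left untouched.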