I have a data source (a CSV file) in this shape. Sample raw data is as follows:
id stage D1 D2 D3 D4 D5 D6
1 base A
1 s1 2 2 4 5
1 s2 3 3 6 7
2 base AA
2 s1 5 3 4 3
2 s2 3 3 2 4
2 s3 2 2 3 6
3 base B
3 s1 4 4 4 5
4 base BC
The first column is an ID, and all rows with the same ID are related to the same experiment.
I need to flatten it, changing its shape when I read it into Pandas, to this:
id stage D1 D2 D3_s1 D4_s1 D5_s1 D6_s1 D3_s2 D4_s2 D5_s2 D6_s2 D3_s3 D4_s3 D5_s3 D6_s3
1 base A 2 2 4 5 3 3 6 7
2 base AA 5 3 4 3 3 3 2 4 2 2 3 6
3 base B 4 4 4 5
4 base BC
As a C/C++ programmer, I started with nested loops to walk over each cell and build a new dataframe with the required shape (still without success). I believe there should be a better way than iterating over all rows and columns.
My questions:
What is the best way to do this in Python?
How can I detect that D2 is entirely blank so I can drop it?
Assuming you already read the data into a DataFrame:
Split it into 2 dataframes: base (containing rows with stage = base) and other
Unstack the second dataframe and change the column names
Recombine the two
The code:
is_base = df['stage'] == 'base'
base = df.loc[is_base, 'id':'D2'].set_index('id')
other = df.loc[~is_base, ['id','stage','D3','D4','D5','D6']].set_index(['id', 'stage'])
other = other.unstack()
# Flatten the MultiIndex columns to D3_s1, D3_s2, ...
other.columns = other.columns.get_level_values(0) + '_' + other.columns.get_level_values(1)
# how='left' keeps ids that have no non-base rows (e.g. id 4);
# the default inner merge would drop them
final = pd.merge(base, other, left_index=True, right_index=True, how='left')
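Putting those steps together on the sample data (reconstructed by hand here, with the per-stage measurements in D3–D6 as in the desired output); note that how='left' is what lets id 4, which has no non-base rows, survive the merge:

```python
import pandas as pd

# Hand-built copy of the sample data from the question
df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    'stage': ['base', 's1', 's2', 'base', 's1', 's2', 's3', 'base', 's1', 'base'],
    'D1':    ['A', None, None, 'AA', None, None, None, 'B', None, 'BC'],
    'D2':    [None] * 10,
    'D3':    [None, 2, 3, None, 5, 3, 2, None, 4, None],
    'D4':    [None, 2, 3, None, 3, 3, 2, None, 4, None],
    'D5':    [None, 4, 6, None, 4, 2, 3, None, 4, None],
    'D6':    [None, 5, 7, None, 3, 4, 6, None, 5, None],
})

is_base = df['stage'] == 'base'
base = df.loc[is_base, 'id':'D2'].set_index('id')
other = df.loc[~is_base, ['id', 'stage', 'D3', 'D4', 'D5', 'D6']].set_index(['id', 'stage'])
other = other.unstack()
other.columns = other.columns.get_level_values(0) + '_' + other.columns.get_level_values(1)
# how='left' keeps id 4 even though it has no s1/s2/s3 rows
final = pd.merge(base, other, left_index=True, right_index=True, how='left')
```

The columns come out grouped by measurement (D3_s1, D3_s2, D3_s3, D4_s1, ...) rather than by stage; reorder them with a plain column selection if the exact order matters.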
As you're a C++ programmer, you'll be happy to know that much of the pandas core is implemented in C and Cython for performance. We can use two filters and a MultiIndex produced by unstacking. (Note: in this answer the per-stage measurements sit in columns D1–D4 of df1, as the printed output below shows.)
s = df1[df1['stage'].ne('base')]
s1 = s.set_index(['id','stage']).stack().unstack([-1,-2])
# To match your output we flatten the MultiIndex columns.
s1.columns = [f'{x}_{y}' for x, y in s1.columns]
print(s1)
D1_s1 D2_s1 D3_s1 D4_s1 D1_s2 D2_s2 D3_s2 D4_s2 D1_s3 D2_s3 D3_s3 D4_s3
id
1 2 2 4 5 3 3 6 7 NaN NaN NaN NaN
2 5 3 4 3 3 3 2 4 2 2 3 6
3 4 4 4 5 NaN NaN NaN NaN NaN NaN NaN NaN
Then we filter on the base rows and join on the id column.
df2 = df1.loc[df1['stage'].eq('base'), ['id','stage','D1','D2']].set_index('id').join(s1)
As for dropping D2 if it's blank, a simple if will do:
if df2['D2'].isna().all():
    df2 = df2.drop(columns='D2')
print(df2)
stage D1 D1_s1 D2_s1 D3_s1 D4_s1 D1_s2 D2_s2 D3_s2 D4_s2 D1_s3 D2_s3 \
id
1 base A 2 2 4 5 3 3 6 7 NaN NaN
2 base AA 5 3 4 3 3 3 2 4 2 2
3 base B 4 4 4 5 NaN NaN NaN NaN NaN NaN
4 base BC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
D3_s3 D4_s3
id
1 NaN NaN
2 3 6
3 NaN NaN
4 NaN NaN
You could turn it into a NumPy array, then flatten and reshape it, like this:
data = pd.read_csv('your_file.csv').values  # your CSV file name here
data = data.flatten()
data = data.reshape(new_shape)  # pass your new shape here
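As a toy illustration of that idea (with a small stand-in frame instead of a real file; note this discards the column labels entirely, so by itself it does not produce the per-stage columns asked for):

```python
import pandas as pd

# A small stand-in frame instead of reading a real CSV
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

data = df.values           # 2x2 ndarray
data = data.flatten()      # row-major order: [1, 3, 2, 4]
data = data.reshape(4, 1)  # any shape with the same number of elements
```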
Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do this with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
# Take every len(extra_level)-th entry, i.e. only the 'a' label of each column
x.columns = new_cols[::len(extra_level)]
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Very much like @Ben.T, I am using MultiIndex.from_product:
(x.assign(l='a')
  .set_index('l', append=True)
  .unstack()
  .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a','b']]), axis=1))
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
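For what it's worth, the same hierarchical frame can also be built with pd.concat and keys followed by a swaplevel; this is a sketch of an alternative, not code from either answer above:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(10).reshape(5, 2))

# Tag every existing column with an 'a' level, then put it below the original level
y = pd.concat([x], keys=['a'], axis=1).swaplevel(0, 1, axis=1)
# Reindex to add the empty 'b' columns
y = y.reindex(columns=pd.MultiIndex.from_product([x.columns, ['a', 'b']]))
```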
I have a dataset like this.
A B C A2
1 2 3 4
5 6 7 8
and I want to combine A and A2.
A B C
1 2 3
5 6 7
4
8
How can I combine the two columns?
Hoping for help. Thank you.
I don't think it is possible directly. But you can do it with a few lines of code:
df = pd.DataFrame({'A':[1,5],'B':[2,6],'C':[3,7],'A2':[4,8]})
df_A2 = df[['A2']]
df_A2.columns = ['A']
df = pd.concat([df.drop(['A2'],axis=1),df_A2])
You will get this if you print df:
A B C
0 1 2.0 3.0
1 5 6.0 7.0
0 4 NaN NaN
1 8 NaN NaN
You could append the last column after renaming it (note that DataFrame.append was removed in pandas 2.0, so this needs an older pandas or pd.concat):
df.append(df[['A2']].set_axis(['A'], axis=1)).drop(columns='A2')
it gives as expected:
A B C
0 1 2.0 3.0
1 5 6.0 7.0
0 4 NaN NaN
1 8 NaN NaN
If the index is not important to you:
import pandas as pd
pd.concat([df[['A','B','C']], df[['A2']].rename(columns={'A2': 'A'})]).reset_index(drop=True)
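If more than one extra column ever needs folding in (say A2 and A3), the same concat idea generalizes; fold_into here is a hypothetical helper written for this sketch, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5], 'B': [2, 6], 'C': [3, 7], 'A2': [4, 8]})

def fold_into(df, target, extras):
    """Stack each column in `extras` underneath `target` and drop the extras."""
    parts = [df.drop(columns=extras)]
    parts += [df[[col]].rename(columns={col: target}) for col in extras]
    return pd.concat(parts, ignore_index=True)

out = fold_into(df, 'A', ['A2'])
```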
I have several DataFrames (all with the same index and column structure). The problem is that there are NaN values in these dataframes.
I want to replace each NaN value by the mean of the other DataFrames' corresponding values.
For example, let's look at 3 dataframes.
DataFrame1, with a NaN at [1, 'M2']:
M1 M2 M3
0 1 1 2
1 8 NaN 9
2 4 2 7
3 9 6 3
DataFrame2, with a NaN at [0, 'M3']:
M1 M2 M3
0 2 3 NaN
1 1 1 6
2 1 2 9
3 4 6 2
DataFrame3:
M1 M2 M3
0 1 4 2
1 2 9 1
2 1 6 5
3 1 NaN 4
So we replace the NaN in the first DataFrame by 5, i.e. (9+1)/2. The second NaN should be replaced by 2, because (2+2)/2, the third by 6, and so on.
Is there a good and elegant way to do it?
This is one way, using numpy.nanmean:
avg = np.nanmean([df1.values, df2.values, df3.values], axis=0)
for df in (df1, df2, df3):
    df[df.isnull()] = avg
# np.nan is a float, so convert back to int explicitly; note this has to
# rebind each frame (reassigning the loop variable would have no effect):
df1, df2, df3 = (df.astype(int) for df in (df1, df2, df3))
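Sanity-checking this on the three sample frames from the question (reconstructed by hand):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'M1': [1, 8, 4, 9], 'M2': [1, np.nan, 2, 6], 'M3': [2, 9, 7, 3]})
df2 = pd.DataFrame({'M1': [2, 1, 1, 4], 'M2': [3, 1, 2, 6], 'M3': [np.nan, 6, 9, 2]})
df3 = pd.DataFrame({'M1': [1, 2, 1, 1], 'M2': [4, 9, 6, np.nan], 'M3': [2, 1, 5, 4]})

# Element-wise mean across the three frames, ignoring NaNs
avg = np.nanmean([df1.values, df2.values, df3.values], axis=0)
for df in (df1, df2, df3):
    df[df.isnull()] = avg  # fill only the NaN positions
```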
We can concat, then fillna within a groupby; after splitting back, we get what we need:
s=pd.concat([df1,df2,df3],keys=[1,2,3])
s=s.groupby(level=1).apply(lambda x : x.fillna(x.mean()))
df1,df2,df3=[x.reset_index(level=0,drop=True) for _,x in s.groupby(level=0)]
df1
Out[1737]:
M1 M2 M3
0 1 1.0 2.0
1 8 5.0 9.0
2 4 2.0 7.0
3 9 6.0 3.0
Here is a dataframe
a b c d
nan nan 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
nan nan 2 3
I want to replace the values in columns 'a' and 'b' with 0s only in the rows where both of them are NaN. Rows 1 and 5 have NaN in both 'a' and 'b', so only those rows should get 0s in those two columns.
so my output must be
a b c d
0 0 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
0 0 2 3
There might be an easier builtin function in Pandas, but this one should work:
mask = np.isnan(df.a) & np.isnan(df.b)
df.loc[mask, ['a', 'b']] = df.loc[mask, ['a', 'b']].fillna(0)
Actually the solution from @Psidom is much easier to read.
You can create a boolean series based on the conditions on columns a/b, and then use loc to modify corresponding columns and rows:
df.loc[df[['a','b']].isnull().all(1), ['a','b']] = 0
df
# a b c d
#0 0.0 0.0 3 5
#1 NaN 1.0 2 3
#2 1.0 NaN 4 5
#3 2.0 3.0 7 9
#4 0.0 0.0 2 3
Or:
df.loc[df.a.isnull() & df.b.isnull(), ['a','b']] = 0
I'm pretty new to Pandas and programming in general, but until now I've always been able to find the answer to any problem through Google. Sorry about the not terribly descriptive question; hopefully someone can come up with something clearer.
I'm trying to group data, perform functions on each group, update a column, and then use the data from that column in the next group.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(9),columns=['A'])
df['B'] = [1,1,1,2,2,3,3,3,3]
df['C'] = np.nan
df['D'] = np.nan
df.loc[0:2,'C'] = 500
Giving me
A B C D
0 0.825828 1 500.0 NaN
1 0.218618 1 500.0 NaN
2 0.902476 1 500.0 NaN
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
The 500 in column C is the initial condition. I want to group the data by column B and apply the following function to the first group:
def function1(row):
    return row['A'] * row['C'] / 6
giving me
A B C D
0 0.825828 1 500.0 68.818971
1 0.218618 1 500.0 18.218145
2 0.902476 1 500.0 75.206313
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then want to sum the first three values in D, add that to the last value in C, and make the result the group 2 value of C:
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 NaN
4 0.513505 2 662.243429 NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then perform function1 on group 2, and repeat until I end up with this:
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 49.946896
4 0.513505 2 662.243429 56.677505
5 0.089975 3 768.867830 11.529874
6 0.282479 3 768.867830 36.198113
7 0.774286 3 768.867830 99.220591
8 0.408501 3 768.867830 52.347246
The dataframe will consist of hundreds of rows. I've been trying various groupby, apply combinations but I'm completely stumped.
Thanks
Here is a solution:
df['D'] = df['A'] * df['C'] / 6
for i in df['B'].unique()[1:]:
    # New C = initial C plus the sum of every D computed so far
    df.loc[df['B'] == i, 'C'] = df['C'].iloc[0] + df['D'].sum()
    df.loc[df['B'] == i, 'D'] = df['A'] * df['C'] / 6
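Since within each group D = A*C/6, each group's C equals the previous group's C times (1 + sum(A)/6), so the whole C column can also be built without an explicit loop via a cumulative product of per-group factors. A sketch, assuming the frame from the question (reconstructed here with the printed A values):

```python
import pandas as pd

# Reconstruct the question's frame with the A values shown in its output
df = pd.DataFrame({
    'A': [0.825828, 0.218618, 0.902476, 0.452525, 0.513505,
          0.089975, 0.282479, 0.774286, 0.408501],
    'B': [1, 1, 1, 2, 2, 3, 3, 3, 3],
})

# Per-group growth factor: C_next = C * (1 + sum(A)/6)
factors = 1 + df.groupby('B')['A'].sum() / 6
# Shift so each group sees the product of all *previous* groups' factors
group_C = 500 * factors.shift(fill_value=1.0).cumprod()

df['C'] = df['B'].map(group_C)
df['D'] = df['A'] * df['C'] / 6
```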
You can use numpy.unique() to find where each group starts. In your code this might look somehow like this:
import math
import numpy as np

unique, indices, counts = np.unique(df['B'], return_index=True, return_counts=True)
for i in range(len(indices)):
    rows = df.index[indices[i]:indices[i] + counts[i]]
    if math.isnan(df.loc[rows[0], 'C']):
        # Carry the previous group's C forward, plus the sum of its D values
        prev = df.index[indices[i-1]:indices[i]]
        df.loc[rows, 'C'] = df.loc[prev[0], 'C'] + df.loc[prev, 'D'].sum()
    # Then call your function on the group
    df.loc[rows, 'D'] = df.loc[rows].apply(function1, axis=1)