Stacking columns in Pandas - python

I'm trying to create a new column on my DataFrame by combining two existing columns
import pandas as pd
import numpy as np
DATA = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
DATA['index'] = np.arange(5)
DATA.set_index('index', inplace=True)
The output is something like this:
               A         B
index
0      -0.003635 -0.644897
1      -0.617104 -0.343998
2       1.270503 -0.514588
3      -0.053097 -0.404073
4      -0.056717  1.870671
I would like to have an extra column 'C' that has an np.array with the elements of 'A' and 'B' for the corresponding row. In the real case, 'A' and 'B' are already 1D np.arrays, but of different lengths. I would like to make a longer array with all the elements stacked or concatenated.
Thanks

If columns a and b contain NumPy arrays, you could apply np.hstack across rows:
import pandas as pd
import numpy as np
num_rows = 10
max_arr_size = 3
df = pd.DataFrame({
    "a": [np.random.rand(max_arr_size) for _ in range(num_rows)],
    "b": [np.random.rand(max_arr_size) for _ in range(num_rows)],
})
df["c"] = df.apply(np.hstack, axis=1)
assert all(row.a.size + row.b.size == row.c.size for _, row in df.iterrows())

DATA['C'] = DATA.apply(lambda x: np.array([x.A, x.B]), axis=1)
pandas requires every column in a DataFrame to have the same length, so the problem of uneven Series shouldn't arise within the frame itself.
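For the real case mentioned in the question, where A and B already hold 1D arrays of different lengths, here is a minimal sketch (with made-up toy arrays) that concatenates the two arrays per row:

import pandas as pd
import numpy as np

# Hypothetical toy data: each cell holds a 1D array, and lengths differ between columns
DATA2 = pd.DataFrame({
    'A': [np.arange(2), np.arange(3)],
    'B': [np.arange(4), np.arange(1)],
})
# Concatenate each row's two arrays into one longer array
DATA2['C'] = [np.concatenate([a, b]) for a, b in zip(DATA2['A'], DATA2['B'])]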

Related

How to create a new column in pandas dataframe based on a condition?

I have a data frame with the following columns:
d = {'find_no': [1, 2, 3], 'zip_code': [32351, 19207, 8723]}
df = pd.DataFrame(data=d)
When there are 5 digits in the zip_code column, I want to return True. When there are not 5 digits, I want to return the "find_no".
Sample output would have the results in a column added to the dataframe, each value corresponding to the row it references.
You could try np.where:
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, True, df['find_no'])
The only downside with this approach is that NumPy will convert your True values to 1s, which could be confusing. An approach that keeps the values you want is to do
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, 'True', df['find_no'].astype(str))
The downside here is that you lose the meaning of those values by casting them to strings. I guess it all depends on what you're hoping to accomplish.
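A hedged alternative sketch, using pandas' Series.mask to keep genuine booleans by working in object dtype (whether mixed dtypes suit your downstream use is up to you):

is_five = df['zip_code'].astype(str).str.len() == 5
# Keep find_no where the zip code is not 5 digits; put a real boolean True elsewhere
df['result'] = df['find_no'].astype(object).mask(is_five, True)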

Assign value based on conditional of two multiindex columns in Pandas

The objective is to create a new multiindex column (stat) based on the condition of the column (A and B)
Condition for A:
CONDITION_A = 'n' if A < 0 else 'p'
and condition for B:
CONDITION_B = 'l' if B < 0 else 'g'
Currently, the idea is to analyse conditions A and B separately, combine the analyses to obtain the column stat as below, and finally append it back to the main dataframe.
However, I wonder whether there is a way to minimise the lines of code needed to achieve a similar objective.
The expected output
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]), ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_1 = df.loc[:, idx[:, 'A']] < 0
appenddf = mask_1.replace({True: 'N', False: 'P'}).rename(columns={'A': 'iii'}, level=1)
mask_2 = df.loc[:, idx[:, 'B']] < 0
appenddf_2 = mask_2.replace({True: 'l', False: 'g'}).rename(columns={'B': 'iv'}, level=1)
# combine the two labels row by row (selecting one top-level block at a time)
stat_comparison = [''.join(i) for i in zip(appenddf[('One', 'iii')], appenddf_2[('One', 'iv')])]
You can try concatenating both DataFrames:
s = pd.concat([appenddf, appenddf_2], axis=1)
cols = pd.MultiIndex.from_product([s.columns.get_level_values(0).unique(), ['stat']])
out = pd.concat([s.loc[:, (x, slice(None))].agg('_'.join, axis=1)
                 for x in s.columns.get_level_values(0).unique()],
                axis=1, keys=cols)
Output of out:
   One  Two
  stat stat
0  P_g  P_l
1  N_l  N_l
2  N_l  N_g
3  P_g  P_l
4  N_l  P_l
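As a hedged alternative sketch, the same stat table can be built more directly with np.where and np.char.add, skipping the intermediate replace/rename frames (the 'One'/'Two' labels are taken from the example above):

a_letters = np.where(df.loc[:, idx[:, 'A']].to_numpy() < 0, 'N', 'P')
b_letters = np.where(df.loc[:, idx[:, 'B']].to_numpy() < 0, 'l', 'g')
# Element-wise string concatenation: letter for A, underscore, letter for B
stat = pd.DataFrame(np.char.add(np.char.add(a_letters, '_'), b_letters),
                    columns=pd.MultiIndex.from_product([['One', 'Two'], ['stat']]))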

Adding a new row to a pandas data frame when columns have different data type?

I have a 2-column pandas data frame, initialized with df = pd.DataFrame([], columns = ["A", "B"]). Column A needs to be of type float, and column B is of type datetime.datetime. I need to add my first values to it (i.e. new rows), but I can't seem to figure out how to do it. I can't do new_row = [x, y] then append it since x and y are not of the same type. How should I go about adding these rows? Thank you.
import pandas as pd
from numpy.random import rand
Option 1 - build the new row as a one-row DataFrame and concatenate it onto the previous one:
df = pd.DataFrame([], columns=["A", "B"])
T = pd.Timestamp(2000, 1, 1)
df2 = pd.DataFrame(columns=["A", "B"], data=[[rand(), T]])
df = pd.concat([df, df2], ignore_index=True)
Or, Option 2 - create an empty DataFrame of known size and then fill it by position:
df = pd.DataFrame(index=range(5), columns=["A", "B"])
T = pd.Timestamp(2000, 1, 1)
df.iloc[0, :] = [rand(), T]
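A hedged third option, if rows arrive in a loop: collect them as dicts and build the frame once at the end, letting pandas infer the per-column dtypes (a sketch):

import pandas as pd
from numpy.random import rand
rows = []
for _ in range(5):
    rows.append({"A": rand(), "B": pd.Timestamp(2000, 1, 1)})
df3 = pd.DataFrame(rows)  # dtypes inferred: A float64, B datetime64[ns]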

Average the first half of each group of a column in a dataframe

Below I defined two dataframes, input and output.
df_input:
Column A which is to be grouped.
Column B is a sort of index within groups of A, an enumeration.
Column C contains elements to be summed up.
df_output:
Column D with the calculated average
Within each group of A, take the average over the elements of C, but only the first half of them (ordered by B). If the number of elements is odd, round the half up (ceil).
This is just a simplified version of a problem on a huge dataset. This is how I've solved it so far.
import pandas as pd
import numpy as np
df_input = pd.DataFrame({"A": [2, 2, 3, 3, 3, 4, 4, 4, 4],
                         "B": [2, 1, 1, 3, 2, 4, 2, 1, 3],
                         "C": [1, 1, 2, 2, 2, 4, 4, 4, 4]})
df_output = pd.DataFrame({"A": [2, 2, 3, 3, 3, 4, 4, 4, 4],
                          "B": [2, 1, 1, 3, 2, 4, 2, 1, 3],
                          "C": [1, 1, 2, 2, 2, 4, 4, 4, 4],
                          "D": [1, 1, 2, 2, 2, 4, 4, 4, 4]})
df = df_input.copy()
df.sort_values(by=['A', 'B'], inplace=True)
df['E'] = np.ceil(df['A'] / 2)  # number of elements to average within each group of 'A'
df['G'] = df['C'] / df['E']     # each element's contribution to the average
df['H'] = np.where(df['E'] >= df['B'], df['G'], 0)  # keep only the first half (by 'B')
df['D'] = df.groupby('A')['H'].transform('sum')
I'm hoping to get this done in a neater, one-liner kind of way... please :)
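For what it's worth, a hedged sketch that condenses the same logic into a single groupby/transform (assuming, as above, that rows are ordered by B within each group of A):

df = df_input.sort_values(['A', 'B'])
# Average the first ceil(n/2) elements of C within each group of A;
# transform broadcasts the scalar result back to every row of the group
df['D'] = df.groupby('A')['C'].transform(lambda s: s.head(int(np.ceil(len(s) / 2))).mean())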

pandas sample based on criteria

I would like to use the pandas sample function, but with a criterion, without grouping or filtering the data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df.sample(n=100))
This will sample 100 rows, but what if I want to sample 50 rows containing 0 and 50 rows containing 1 in df['a']?
You can use the == operator to make a list* of boolean values, and when that list is put into the getter ([]) it filters the rows. You can then use n=50 to draw a sample of 50 rows.
New code
df[df['a']==1].sample(n=50)
Full code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df[df['a']==1].sample(n=50))
*List isn't literally a list in this context, but it is a handy word for explaining how this works. Technically it's a boolean Series that maps each row to True or False.
More obscure DataFrame sampling
If you want to sample all 50 where a is 1 or 0:
print(df[(df['a']==1) | (df['a']==0)].sample(n=50))
And if you want to sample 50 of each:
df1 = df[df['a']==1].sample(n=50)
df0 = df[df['a']==0].sample(n=50)
print(pd.concat([df1,df0]))
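As a hedged aside, recent pandas (1.1+) also has GroupBy.sample, which draws the per-group samples in one call:

# Sample 50 rows for each value of 'a' in {0, 1}
print(df[df['a'].isin([0, 1])].groupby('a').sample(n=50))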
