How to prepend a zero vector to a Pandas DataFrame? - python

I have a pandas variable X which has a shape of (14931, 381).
That's 14,931 examples, with each example having 381 features. I want to add 483 features (each with a value of zero) to each example, except I want them to come before the 381 existing ones.
How can this be done?

Create a DataFrame of zeros and call pd.concat.
v = pd.DataFrame(0, index=df.index, columns=range(483))
df = pd.concat([v, df], axis=1)
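Applied to the variable X from the question, a quick sanity check (a sketch; the new columns here are simply labeled 0 through 482):
v = pd.DataFrame(0, index=X.index, columns=range(483))
X = pd.concat([v, X], axis=1)
X.shape   # (14931, 864): 483 zero columns followed by the 381 original features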

For demonstration purposes, let's set up a smaller DataFrame
(7 rows and 2 columns, with feature (column) names f1 and f2):
df = pd.DataFrame(data={'f1': [1, 4, 6, 5, 7, 2, 3],
                        'f2': [4, 6, 5, 0, 2, 3, 2]})
Then, let's create a DataFrame filled with zeroes, to be
prepended to df (3 columns instead of your 483):
import numpy as np

zz = pd.DataFrame(data=np.zeros((df.shape[0], 3), dtype=int),
                  columns=['p' + str(n + 1) for n in range(3)],
                  index=df.index)
As you can see:
- I named the "new" columns p1, p2 and so on,
- the index is a copy of the index in df (it will be important at the next stage).
And the last step is to join these two DataFrames and assign the result back to df:
df = zz.join(df)
The only thing left for you is to change the number of added columns to the proper value (483).
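For reference, the joined demo frame comes out like this:
   p1  p2  p3  f1  f2
0   0   0   0   1   4
1   0   0   0   4   6
2   0   0   0   6   5
3   0   0   0   5   0
4   0   0   0   7   2
5   0   0   0   2   3
6   0   0   0   3   2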

Related

Loop Over every Nth item in Dictionary

Can anyone advise how to loop over every Nth item in a dictionary?
Essentially I have a dictionary of dataframes and I want to be able to create a new dictionary based on every 3rd dataframe item (including the first), based on the index positioning of the original. Once I have this I would like to concatenate the dataframes together.
So, for example, if I have 12 dataframes, I would like the new dataframe to contain the first, fourth, seventh, tenth, etc.
Thanks in advance!
If you need to keep the data in a dict, you may use a tuple of the dict keys:
custom_dict = {
    'first': 1,
    'second': 2,
    'third': 3,
    'fourth': 4,
    'fifth': 5,
    'sixth': 6,
    'seventh': 7,
    'eighth': 8,
    'nineth': 9,
    'tenth': 10,
    'eleventh': 11,
    'twelveth': 12,
}

for key in tuple(custom_dict)[::3]:
    print(custom_dict[key])
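On the example dict this prints every third value, starting with the first:
1
4
7
10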
Then, you may call pandas.concat:
df = pd.concat(
    [
        custom_dict[key]
        for key in tuple(custom_dict)[::3]
    ],
    # =========================================================================
    # axis=0  # To Append One DataFrame to Another Vertically
    # =========================================================================
    axis=1  # To Append One DataFrame to Another Horizontally
)
This assumes custom_dict[key] returns a pandas.DataFrame, not an int as in my code above.
What you ask is a bit strange. Anyway, you have two main options.
Convert your dictionary values to a list and slice that:
out = pd.concat(list(dfs.values())[::3])
output:
a b c
0 x x x
0 x x x
0 x x x
0 x x x
Or slice your dictionary keys and generate a sub-dictionary:
out = pd.concat({k: dfs[k] for k in list(dfs)[::3]})
output:
a b c
df1 0 x x x
df4 0 x x x
df7 0 x x x
df10 0 x x x
Used input:
dfs = {f'df{i+1}': pd.DataFrame([['x']*3], columns=['a', 'b', 'c']) for i in range(12)}
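As a side note, the same selection can be done lazily with itertools.islice instead of materializing the full list (a sketch under the same assumptions):
from itertools import islice

out = pd.concat(islice(dfs.values(), 0, None, 3))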

Pandas Correlation One Column to Many Columns Group by range of the column

Assuming I have a data frame similar to the one below (the actual data frame has a million observations), how would I get the correlation between the signal column and a list of return columns, then group by the Signal_Up column?
I tried the pandas corrwith function, but it does not give me the correlations grouped by the Signal_Up column:
df[['Net_return_at_t_plus1', 'Net_return_at_t_plus5',
    'Net_return_at_t_plus10']].corrwith(df['Signal_Up'])
I am trying to find the correlation between the signal column and the other net-return columns, grouped by the values of the Signal_Up column.
Data and desired result are given below.
[Desired Result and Data tables were shown as images in the original post.]
Using the simple DataFrame below:
df = pd.DataFrame({'v1': [1, 3, 2, 1, 6, 7],
                   'v2': [2, 2, 4, 2, 4, 4],
                   'v3': [3, 3, 2, 9, 2, 5],
                   'v4': [4, 5, 1, 4, 2, 5]})
(1st interpretation) One way to get the correlations of one variable with the other columns is:
correlations = df.corr().unstack().sort_values(ascending=False) # Build correlation matrix
correlations = pd.DataFrame(correlations).reset_index() # Convert to dataframe
correlations.columns = ['col1', 'col2', 'correlation'] # Label it
correlations.query("col1 == 'v2' & col2 != 'v2'") # Filter by variable
# the output of this code gives the correlation of column v2 with all the other columns
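As an aside, a shorter way to get the same set of correlations (a sketch; unlike the code above, it is not sorted):
df.corr()['v2'].drop('v2')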
(2nd interpretation) One way to get the correlations of column v1 with columns v3 and v4, after grouping by column v2, is this one-liner:
df.groupby('v2')[['v1', 'v3', 'v4']].corr().unstack()['v1']
In your case, v2 is 'Signal_Up', v1 is 'signal', and the v3, v4 columns stand in for the 'Net_return_at_t_plusX' columns.
I am able to get the correlations for each individual category of the Signal_Up column by using the groupby function. However, I am not able to apply the corr function to more than two columns at a time, so I had to use the concat function to combine all of them.
a = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus1']].corr().unstack().iloc[:,1]
b = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus5']].corr().unstack().iloc[:,1]
c = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus10']].corr().unstack().iloc[:,1]
dfCorr = pd.concat([a, b, c], axis=1)
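For what it's worth, a more compact equivalent of the three separate calls above might look like this sketch (assuming the column names from the question):
cols = ['Net_return_at_t_plus1', 'Net_return_at_t_plus5', 'Net_return_at_t_plus10']
dfCorr = df.groupby('Signal_Up')[['signal'] + cols].corr().unstack()['signal'][cols]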

concatenate in place in sub function with pandas concat function?

I'm trying to write a function that takes a pandas DataFrame as an argument and at some point concatenates this DataFrame with another.
For example:
def concat(df):
    df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify the input df in place, but I can't figure out how to achieve this. When I do
...
print(df)
concat(df)
print(df)
The DataFrame df is identical before and after the function call.
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many columns will be added to df. So I want to use pd.concat(), if possible...
This will edit the original DataFrame in place and give the desired output, as long as the new data contains the same number of rows as the original and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it will work for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter like some other pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
A B C D
0 1 4 10 40
1 2 5 20 50
2 3 6 30 60
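Wrapped in a function like the one in the question, this might look like the following sketch (new_data here is a hypothetical argument for whatever columns you want to append):
def concat(df, new_data):
    # column assignment mutates the caller's DataFrame, unlike pd.concat
    df[new_data.columns] = new_data

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
concat(df, pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]}))
print(df)   # df now also contains columns C and D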

Using Array or Series to select from multiple columns

I have a counter column which contains an integer. Based on that integer, I would like to pick one of several consecutive columns in my dataframe.
I tried using .apply(lambda x: ..., axis=1), but my solution there requires an extra if for each column I want to pick from.
df2 = pd.DataFrame(np.array([[1, 2, 3, 0], [4, 5, 6, 2], [7, 8, 9, 1]]), columns=['a', 'b', 'c', 'd'])
df2['e'] = df2.iloc[:, df2['d']]
This code doesn't work because iloc only wants one item in that position, not 3 (df2['d'] = [0, 2, 1]).
What I would like it to do is give me the 0th item in the first row, the 2nd item in the second row, and the 1st item in the third row, so
df2['e'] = [1, 6, 8]
You are asking for something similar to fancy indexing in numpy. In pandas, it is lookup. Try this:
df2.lookup(df2.index, df2.columns[df2['d']])
Out[86]: array([1, 6, 8])
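Note that DataFrame.lookup has since been deprecated (and removed in pandas 2.0); an equivalent using NumPy fancy indexing might look like this sketch:
import numpy as np

cols = ['a', 'b', 'c']
df2['e'] = df2[cols].to_numpy()[np.arange(len(df2)), df2['d'].to_numpy()]
# df2['e'] is now [1, 6, 8]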

python 1:1 stratified sampling per each group

How can a 1:1 stratified sampling be performed in python?
Assume the pandas DataFrame df is heavily imbalanced. It contains a binary group column and multiple columns of categorical subgroups.
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1],
                   'value': [1, 2, 3, 1, 2]})
display(df)
display(df[df.group == 1])
display(df[df.group == 0])
df.group.value_counts()
For each member of the main group == 1, I need to find a single matching member of group == 0.
A StratifiedShuffleSplit from scikit-learn will only return a random portion of data, not a 1:1 match.
If I understood correctly, you could use np.random.permutation:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})
# create new column with an identifier for a combination of categories
columns = ['sub_category_1', 'sub_category_2']
labels = df.loc[:, columns].apply(lambda x: ''.join(map(str, x.values)), axis=1)
values, keys = pd.factorize(labels)
df['label'] = labels.map(dict(zip(keys, values)))
# build distribution of sub-categories combinations
distribution = df[df.group == 1].label.value_counts().to_dict()
# select from group 0 only those rows that are in the same sub-categories combinations
mask = (df.group == 0) & (df.label.isin(distribution))
# do random sampling
selected = np.ravel([np.random.permutation(group.index)[:distribution[name]]
                     for name, group in df.loc[mask].groupby(['label'])])
# display result
result = df.drop('label', axis=1).iloc[selected]
print(result)
Output
group id sub_category_1 sub_category_2 value
4 0 5 1 1 2
2 0 3 2 2 3
Note that this solution assumes that, for each possible sub_category combination, its size in group 1 does not exceed the size of the corresponding subgroup in group 0. A more robust version involves using np.random.choice with replacement:
selected = np.ravel([np.random.choice(group.index, distribution[name], replace=True)
                     for name, group in df.loc[mask].groupby(['label'])])
The version with choice does not have the same assumption as the one with permutation, although it requires at least one element for each sub-category combination.
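If what you ultimately need is the final 1:1 matched dataset, a sketch of assembling it from the pieces above (an assumption about the desired output; it simply stacks the group 1 rows with their sampled group 0 matches):
matched = pd.concat([df[df.group == 1].drop('label', axis=1), result])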
