Add column to pandas without headers

Add column to pandas without headers - python

How does one append a column of constant values to a pandas dataframe without headers? I want to append the column at the end.
With headers I can do it this way:
df['new'] = pd.Series([0 for x in range(len(df.index))], index=df.index)

Each not empty DataFrame has columns, index and some values.
You can add default column value and create new column filled by scalar:
df[len(df.columns)] = 0
Sample:
df = pd.DataFrame({0:[1,2,3],
1:[4,5,6]})
print (df)
0 1
0 1 4
1 2 5
2 3 6
df[len(df.columns)] = 0
print (df)
0 1 2
0 1 4 0
1 2 5 0
2 3 6 0
Also for creating new column with name the simpliest is:
df['new'] = 1

Related

python pandas group by and aggregate columns

I am using panda version 0.23.0. I want to use data frame group by function to generate new aggregated columns using [lambda] functions..
My data frame looks like
ID Flag Amount User
1 1 100 123345
1 1 55 123346
2 0 20 123346
2 0 30 123347
3 0 50 123348
I want to generate a table which looks like
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM Flag0_User_Count Flag1_User_Count
1 2 2 0 155 0 2
2 2 0 50 0 2 0
3 1 0 50 0 1 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is SUNM of amount when Flag = 0
Flag1_Amount_SUM is SUNM of amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new a new data frame. This means I will have to this for all columns and them merge them together into a new data frame.
Is there an easier way to accomplish this?

Use DataFrameGroupBy.agg by dictionary by column names with aggregate function, then reshape by unstack, flatten MultiIndex of columns, rename columns and last reset_index:
df = (df.groupby(["ID", "Flag"])
.agg({'Flag':'size', 'Amount':'sum', 'User':'nunique'})
.unstack(fill_value=0))
#python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
#python below
#df.columns = [f'{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0':'Flag0_Count',
'Flag1':'Flag1_Count',
'Amount0':'Flag0_Amount_SUM',
'Amount1':'Flag1_Amount_SUM',
'User0':'Flag0_User_Count',
'User1':'Flag1_User_Count',
}
df = df.rename(columns=d).reset_index()
print (df)
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM \
0 1 0 2 0 155
1 2 2 0 50 0
2 3 1 0 50 0
Flag0_User_Count Flag1_User_Count
0 0 2
1 2 0
2 1 0

How to create pandas dummies based on column values

I would like to create dummies based on column values...
This is what the df looks like
I want to create this
This is so far my approach
import pandas as pd
df =pd.read_csv('test.csv')
v =df.Values
v_set=set()
for line in v:
line=line.split(',')
for x in line:
if x!="":
v_set.add(x)
else:
continue
for val in v_set:
df[val]=''
By the above code I am able to create columns in my df like this
How do I go about updating the row values to create dummies?
This is where I am having problems.
Thanks in advance.

You could use pandas.Series.str.get_dummies. This will alllow you to split the column directly with a delimiter.
df = pd.concat([df.ID, df.Values.str.get_dummies(sep=",")], axis=1)
ID 1 2 3 4
0 1 1 1 0 0
1 2 0 0 1 1
df.Values.str.get_dummies(sep=",") will generate
1 2 3 4
0 1 1 0 0
1 0 0 1 1
Then, we do a pd.concat to glue to df together.

Adding rows to a Dataframe to unify the length of groups

I would like to add element to specific groups in a Pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
I would like to have the same number of elements per group (where grouping is by the key column). The group 2 has the most elements: three elements. However, the group 1 has only two elements so a zeros should be added as follows:
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
5 1 0
Note that the index does not matter.

You can create new level of MultiIndex by cumcount and then add missing values by unstack/stack or reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
.unstack(fill_value=0)
.stack()
.reset_index(level=1, drop=True)
.reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df)
key value
0 1 1
1 1 3
2 1 0
3 2 2
4 2 4
5 2 5
If is important order of values:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names = df1.index.names)
#get appended values
miss = mux.difference(df1.index).get_level_values(0)
#create helper df and add 0 to all columns of original df
df2 = pd.DataFrame({'key':miss}).reindex(columns=df.columns, fill_value=0)
#append to original df
df = pd.concat([df, df2], ignore_index=True)
print (df)
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
5 1 0

How to create a new pandas dataframe from old dataframe using a list of column names

I have a pandas dataframe with several columns. Bulk of the column names can be looped. So I have made an array of the column names like this:
ycols = ['{}_{}d pred'.format(ticker, i) for i in range(hm_days)]
Now I want to make a new pandas dataframe with only these columns having the index of the parent dataframe. How to do this?

Ok, So you want to create a new dataframe with new column names, with the existing index of the original dataframe.
For some dataframe:
old_df = pd.DataFrame({'x':[0,1,2,3],'y':[10,9,8,7]})
>>>
x y
0 0 10
1 1 9
2 2 8
3 3 7
columns = list(old_df)
>>>
['x', 'y']
You can specify your new columns by doing:
y_cols = ['x_pred','y_pred']
>>> ['x_pred','y_pred']
Here, y_cols is the list of your new column names. In your code, you would replace this step with ycols = ['{}_{}d pred'.format(ticker, i) for i in range(hm_days)].
To get the new columns, you create new columns with a placeholder variable (in this case 0, as it looks like you are using numeric data), with the same index as your old dataframe:
# Iterate over all columns names in y_cols
for i in y_cols:
old_df[i]=0
>>> old_df:
x y x_pred y_pred
0 0 10 0 0
1 1 9 0 0
2 2 8 0 0
3 3 7 0 0
Finally, slice your dataframe to get your new dataframe with new column names, maintaining the index of the old dataframe.
df_new = old_df[y_cols]
>>>
x_pred y_pred
0 0 0
1 0 0
2 0 0
3 0 0
This works even if you have a named index:
x y x_pred y_pred
Date
0 0 10 0 0
1 1 9 0 0
2 2 8 0 0
3 3 7 0 0
df_new = old_df[y_cols]
x_pred y_pred
Date
0 0 0
1 0 0
2 0 0
3 0 0

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value for each row, and append that info as a new column. I would also like to know the name of the column where the maximum value is located. And I would like to add another column to the existing dataframe containing the name of the column where the max value can be found.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the add another column to the existing dataframe containing the name of the column part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.

You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask to mask the columns and join them:
In [183]:
df['maxcol'] = df.ix[:,:'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x==x.max()]),axis=1)
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Add column to pandas without headers - python

How does one append a column of constant values to a pandas dataframe without headers? I want to append the column at the end. With headers I can do it this way: df['new'] = pd.Series([0 for x in range(len(df.index))], index=df.index)

Related

python pandas group by and aggregate columns

How to create pandas dummies based on column values

Adding rows to a Dataframe to unify the length of groups

How to create a new pandas dataframe from old dataframe using a list of column names

Add columns to pandas dataframe containing max of each row, AND corresponding column name

Categories

Resources