Adding rows to a DataFrame to unify the length of groups - python

I would like to add elements to specific groups in a Pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
I would like to have the same number of elements per group (where grouping is by the key column). Group 2 has the most elements: three. Group 1 has only two, so a zero row should be added as follows:
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
Note that the index does not matter.

You can create a new MultiIndex level with cumcount and then add the missing values with unstack/stack or reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
        .unstack(fill_value=0)
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df)
   key  value
0    1      1
1    1      3
2    1      0
3    2      2
4    2      4
5    2      5
If the order of the values is important:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names = df1.index.names)
# get the keys that need appended rows
miss = mux.difference(df1.index).get_level_values(0)
# create a helper df with 0 in all columns of the original df
df2 = pd.DataFrame({'key':miss}).reindex(columns=df.columns, fill_value=0)
# append to the original df
df = pd.concat([df, df2], ignore_index=True)
print (df)
   key  value
0    1      1
1    2      2
2    1      3
3    2      4
4    2      5
5    1      0
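A shorter sketch of the same idea (my addition, not from the answers above; it starts from the original df): compute how many rows each group is missing and concat a zero-filled filler frame:
sizes = df.groupby('key').size()
pad = sizes.max() - sizes                      # rows missing per key
filler = pd.DataFrame({'key': pad.index.repeat(pad), 'value': 0})
df = pd.concat([df, filler], ignore_index=True)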

Related

pandas, compute a value per group?

I'm trying to go from df to df2.
I'm grouping by review_meta_id and age_bin, then calculating ctr as sum(click_count) / sum(impression_count).
In [69]: df
Out[69]:
   review_meta_id  age_month  impression_count  click_count  age_bin
0               3          4                10            3        1
1               3         10                 5            2        2
2               3         20                 5            3        3
3               3          8                 9            2        2
4               4          9                 9            5        2
In [70]: df2
Out[70]:
   review_meta_id       ctr  age_bin
0               3  0.300000        1
1               3  0.285714        2
2               3  0.600000        3
3               4  0.555556        2
import pandas as pd
bins = [0, 5, 15, 30]
labels = [1,2,3]
l = [dict(review_meta_id=3, age_month=4, impression_count=10, click_count=3),
     dict(review_meta_id=3, age_month=10, impression_count=5, click_count=2),
     dict(review_meta_id=3, age_month=20, impression_count=5, click_count=3),
     dict(review_meta_id=3, age_month=8, impression_count=9, click_count=2),
     dict(review_meta_id=4, age_month=9, impression_count=9, click_count=5)]
df = pd.DataFrame(l)
df['age_bin'] = pd.cut(df['age_month'], bins=bins, labels=labels)
grouped = df.groupby(['review_meta_id', 'age_bin'])
Is there an elegant way of doing the following?
data = []
for name, group in grouped:
    ctr = group['click_count'].sum() / group['impression_count'].sum()
    review_meta_id, age_bin = name
    data.append(dict(review_meta_id=review_meta_id, ctr=ctr, age_bin=age_bin))
df2 = pd.DataFrame(data)
You can first aggregate both columns by sum, then divide the columns with DataFrame.pop (which uses and removes them in one step), and last convert the MultiIndex to columns and remove rows with missing values by DataFrame.dropna:
df2 = df.groupby(['review_meta_id', 'age_bin'])[['click_count','impression_count']].sum()
df2['ctr'] = df2.pop('click_count') / df2.pop('impression_count')
df2 = df2.reset_index().dropna()
print (df2)
   review_meta_id age_bin       ctr
0               3       1  0.300000
1               3       2  0.285714
2               3       3  0.600000
4               4       2  0.555556
You can use an apply function after grouping the dataframe by 'review_meta_id' and 'age_bin' in order to calculate 'ctr'. The result is a pandas Series; to convert it to a dataframe, use reset_index() with name='ctr', the name of the column corresponding to the Series values.
def divide_two_cols(df_sub):
    return df_sub['click_count'].sum() / float(df_sub['impression_count'].sum())

df2 = df.groupby(['review_meta_id', 'age_bin']).apply(divide_two_cols).reset_index(name='ctr')
df2
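One caveat worth flagging (my note, not from the answers): age_bin comes from pd.cut, so it is categorical, and groupby emits every category combination by default, including empty ones. Passing observed=True keeps only the combinations that actually occur, which avoids the dropna step used above:
df2 = (df.groupby(['review_meta_id', 'age_bin'], observed=True)
         .apply(divide_two_cols)
         .reset_index(name='ctr'))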

How do I sort a Pandas dataframe Excel import?

I have imported the following Excel file and would like to sort it by Frequency descending, but with 'Other', 'No data' and 'All' (the total) at the bottom in that order. Is this possible?
table1 = pd.read_excel("table1.xlsx")
table1
Use:
df = pd.DataFrame({
    'generalenq': list('abcdef'),
    'percentage': [1,3,5,7,1,0],
    'frequency': [5,3,6,9,2,4],
})
df.loc[0, 'generalenq'] = 'All'
df.loc[2, 'generalenq'] = 'No data'
df.loc[3, 'generalenq'] = 'Other'
print (df)
  generalenq  percentage  frequency
0        All           1          5
1          b           3          3
2    No data           5          6
3      Other           7          9
4          e           1          2
5          f           0          4
First create a dictionary that maps the special labels to integers for ordering. Then create a mask by membership with Series.isin, and sort the non-matched rows, selected with ~ to invert the mask, using boolean indexing:
d = {'Other':0,'No data':1,'All':2}
mask = df['generalenq'].isin(list(d.keys()))
df1 = df[~mask].sort_values('frequency', ascending=False)
print (df1)
  generalenq  percentage  frequency
5          f           0          4
1          b           3          3
4          e           1          2
Then filter the matched rows by the mask and create a helper column for sorting by the mapped dictionary:
df2 = (df[mask].assign(new = lambda x: x['generalenq'].map(d))
               .sort_values('new')
               .drop(columns='new'))
print (df2)
  generalenq  percentage  frequency
3      Other           7          9
2    No data           5          6
0        All           1          5
And last join them together with concat:
df = pd.concat([df1, df2], ignore_index=True)
print (df)
  generalenq  percentage  frequency
0          f           0          4
1          b           3          3
2          e           1          2
3      Other           7          9
4    No data           5          6
5        All           1          5
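A single-pass alternative (a sketch of mine, reusing the same d mapping; the helper name _k is arbitrary): give every ordinary row a key that sorts before the special labels, then sort once on the key and frequency:
key = df['generalenq'].map(d).fillna(-1)
df = (df.assign(_k=key)
        .sort_values(['_k', 'frequency'], ascending=[True, False])
        .drop(columns='_k')
        .reset_index(drop=True))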

Append data with one column to existing dataframe

I want to append a list of data to a dataframe such that the list will appear in a column, i.e.:
# Existing dataframe:
[A, 20150901, 20150902
 1         4         5
 4         2         7]
# list of data to append to column A:
data = [8,9,4]
# Required dataframe:
[A, 20150901, 20150902
 1         4         5
 4         2         7
 8         0         0
 9         0         0
 4         0         0]
I am using the following:
df_new = df.copy(deep=True)
# I am copying and deleting data as the column names are of type Timestamp and it is easier to reuse them
df_new.drop(df_new.index, inplace=True)
for item in data_list:
    df_new = df_new.append([{'A':item}], ignore_index=True)
df_new.fillna(0, inplace=True)
df = pd.concat([df, df_new], axis=0, ignore_index=True)
But doing this in a loop is inefficient plus I get this warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Any ideas on how to overcome this warning and append the two dataframes in one go?
I think you need to concat a new DataFrame with column A, then reindex if you want the same order of columns, and last replace missing values by fillna:
data = [8,9,4]
df_new = pd.DataFrame({'A':data})
df = (pd.concat([df, df_new], ignore_index=True)
        .reindex(columns=df.columns)
        .fillna(0, downcast='infer'))
print (df)
   A  20150901  20150902
0  1         4         5
1  4         2         7
2  8         0         0
3  9         0         0
4  4         0         0
I think you could do something like this:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame({'A':[8,9,4]})
df.append(df2).fillna(0)
   A    B
0  1  2.0
1  3  4.0
0  8  0.0
1  9  0.0
2  4  0.0
Maybe you can do it in this way:
import numpy as np

new = pd.DataFrame(np.zeros((3, 3)))  # create a new all-zero dataframe
new[0] = [8,9,4]                      # add the values to the first column
# note: for a clean result the new frame's columns should match the existing ones
existed_dataframe = existed_dataframe.append(new)  # append returns a new dataframe, so assign the result
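A note on both answers above (mine, not from the original thread): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same step is written with pd.concat:
pd.concat([df, df2]).fillna(0)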

python pandas changing several columns in dataframe based on one condition

I am new to Python and Pandas; I previously worked with SAS. In SAS I can use an IF statement with "Do; End;" to update the values of several columns based on one condition.
I tried np.where(), but it updates only one column. apply(function, ...) also updates only one column, and positioning an extra update statement inside the function body didn't help.
Suggestions?
You can select which columns you want to alter, then use .apply():
df = pd.DataFrame({'a': [1,2,3],
                   'b': [4,5,6]})
   a  b
0  1  4
1  2  5
2  3  6
df[['a','b']] = df[['a','b']].apply(lambda x: x+1)
print (df)
   a  b
0  2  5
1  3  6
2  4  7
You could use:
for col in df:
    df[col] = np.where(df[col] == your_condition, value_if, value_else)
e.g.:
   a  b
0  0  2
1  2  0
2  1  1
3  2  0
for col in df:
    df[col] = np.where(df[col]==0, 12, df[col])
Output:
    a   b
0  12   2
1   2  12
2   1   1
3   2  12
Or if you want to apply the condition only to some columns, select them in the for loop:
for col in ['a','b']:
or just in this way:
df[['a','b']] = np.where(df[['a','b']]==0,12, df[['a','b']])
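A pandas-native alternative (a sketch of mine, using the same toy data): DataFrame.mask replaces values wherever a boolean condition is True, across all selected columns at once:
cols = ['a', 'b']
df[cols] = df[cols].mask(df[cols] == 0, 12)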

Add column to pandas without headers

How does one append a column of constant values to a pandas dataframe without headers? I want to append the column at the end.
With headers I can do it this way:
df['new'] = pd.Series([0 for x in range(len(df.index))], index=df.index)
Every non-empty DataFrame has columns, an index and some values.
You can use the current number of columns as the new column name and create the new column filled by a scalar:
df[len(df.columns)] = 0
Sample:
df = pd.DataFrame({0:[1,2,3],
                   1:[4,5,6]})
print (df)
   0  1
0  1  4
1  2  5
2  3  6
df[len(df.columns)] = 0
print (df)
   0  1  2
0  1  4  0
1  2  5  0
2  3  6  0
Also, for creating a new column with a name, the simplest is:
df['new'] = 1
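If you prefer an explicit position, DataFrame.insert is an equivalent sketch (0 here is the scalar fill value, and the integer column name follows the headerless pattern above):
df.insert(len(df.columns), len(df.columns), 0)  # integer-named zero column at the end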
