aggregating across a dictionary containing dataframes - python

i have the following dictionary which contains dataframes as values, each always having the same number of columns (1) with the same title
test = {'A': pd.DataFrame(np.random.randn(10), index=range(10),columns=['values']),
'B': pd.DataFrame(np.random.randn(6), index=range(6),columns=['values']),
'C': pd.DataFrame(np.random.randn(11), index=range(11),columns=['values'])}
from this, i would like to create a single dataframe where the index values are the key values of the dictionary (so A,B,C) and the columns are the union of the current index values across all dictionaries (so in this case 0,1,2,3...10). the values of this dataframe would be the corresponding 'values' from the dataframe corresponding to each row, and where blank, NaN
is there a handy way to do this?

IIUC, use pd.concat, keys, and unstack:
pd.concat([test[i] for i in test], keys=test.keys()).unstack(1)['values']
Better yet,
pd.concat(test).unstack(1)['values']
Output:
0 1 2 3 4 5 6 \
A -0.029027 -0.530398 -0.866021 1.331116 0.090178 1.044801 -1.586620
C 1.320105 1.244250 -0.162734 0.942929 -0.309025 -0.853728 1.606805
B -1.683822 1.015894 -0.178339 -0.958557 -0.910549 -1.612449 NaN
7 8 9 10
A -1.072210 1.654565 -1.188060 NaN
C 1.642461 -0.137037 -1.416697 -0.349107
B NaN NaN NaN NaN

dont over complicate things:
just use concat and transpose
pd.concat(test, axis=1).T
0 1 2 3 4 5 \
A values -0.592711 0.266518 -0.774702 0.826701 -2.642054 -0.366401
B values -0.709410 -0.463603 0.058129 -0.054475 -1.060643 0.081655
C values 1.384366 0.662186 -1.467564 0.449142 -1.368751 1.629717
6 7 8 9 10
A values 0.431069 0.761245 -1.125767 0.614622 NaN
B values NaN NaN NaN NaN NaN
C values 0.988287 -1.508384 0.214971 -0.062339 -0.011547
if you were dealing with series instead of 1 column DataFrame it would make more sense to begin with...
test = {'A': pd.Series(np.random.randn(10), index=range(10)),
'B': pd.Series(np.random.randn(6), index=range(6)),
'C': pd.Series(np.random.randn(11), index=range(11))}
pd.concat(test,axis=1).T
0 1 2 3 4 5 6 \
A -0.174565 -2.015950 0.051496 -0.433199 0.073010 -0.287708 -1.236115
B 0.935434 0.228623 0.205645 -0.602561 1.860035 -0.921963 NaN
C 0.944508 -1.296606 -0.079339 0.629038 0.314611 -0.429055 -0.911775
7 8 9 10
A -0.704886 -0.369263 -0.390684 NaN
B NaN NaN NaN NaN
C 0.815078 0.061458 1.726053 -0.503471

Related

Pandas: How to use (df.groupby) in a lambda formula

The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(zip(list1, list2), columns= ['Item','Value'])
df
gives:
required: GroupFirstValue column as shown below.
The idea is to use a lambda formula to get the 'first' value for each group..for example "a"'s first value is 0, "b"'s first value is 3, "c"'s first value is 7. That's why those numbers appear in the GroupFirstValue column.
Note: I know that I can do this on 2 steps...one is the original df and the second is a grouped by df and then merge them together. The idea is to see if this can be done more efficiently in a single step. Many thanks in advance!
groupby and use first
df.groupby('Item')['Value'].first()
or you can use transform and assign to a new column in your frame
df['new_col'] = df.groupby('Item')['Value'].transform('first')
Use mask and duplicated
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Out[109]:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN

Pandas: Sum multiple columns, but write NaN if any column in that row is NaN or 0

I am trying to create a new column in a pandas dataframe that sums the total of other columns. However, if any of the source columns are blank (NaN or 0), I need the new column to also be written as blank (NaN)
a b c d sum
3 5 7 4 19
2 6 0 2 NaN (note the 0 in column c)
4 NaN 3 7 NaN
I am currently using the pd.sum function, formatted like this
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs, but does not write NaN to the sum column.
Thanks in advance for any advice
replace your 0 to np.nan then pass skipna = False
df.replace(0,np.nan).sum(1,skipna=False)
0 19.0
1 NaN
2 NaN
dtype: float64
df['sum'] = df.replace(0,np.nan).sum(1,skipna=False)

Python: Create New Column Equal to the Sum of all Columns Starting from Column Number 9

I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried but it didn't work --> gives me back all NaN values in 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9. I believe you want .sum(axis=1). See an example below using column 2 instead of 9 for readability.
df = DataFrame(npr.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan # throw in some na values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1); print(df)
0 1 2 3 4 test
0 NaN NaN NaN NaN 0.73046 0.73046
1 NaN NaN NaN NaN 0.79060 0.79060
2 NaN NaN NaN NaN 0.53859 0.53859
3 0.97469 0.60224 0.90022 0.45015 0.52246 1.87283
4 0.84111 0.52958 0.71513 0.17180 0.34494 1.23187
5 0.21991 0.10479 0.60755 0.79287 0.11051 1.51094
6 0.64966 0.53332 0.76289 0.38522 0.92313 2.07124
7 0.40139 0.41158 0.30072 0.09303 0.37026 0.76401
8 0.59258 0.06255 0.43663 0.52148 0.62933 1.58744
9 0.12762 0.01651 0.09622 0.30517 0.78018 1.18156

How do I combine two columns within a dataframe in Pandas?

Say I have two columns, A and B, in my dataframe:
A B
1 NaN
2 5
3 NaN
4 6
I want to get a new column, C, which fills in NaN cells in column B using values from column A:
A B C
1 NaN 1
2 5 5
3 NaN 3
4 6 6
How do I do this?
I'm sure this is a very basic question, but as I am new to Pandas, any help will be appreciated!
You can use combine_first:
df['c'] = df['b'].combine_first(df['a'])
Docs: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.combine_first.html
You can use where which is a vectorized if/else:
df['C'] = df['A'].where(df['B'].isnull(), df['B'])
A B C
0 1 NaN 1
1 2 5 5
2 3 NaN 3
3 4 6 6
df['c'] = df['b'].fillna(df['a'])
So what .fillna will do is it will fill all the Nan values in the data frame
We can pass any value to it
Here we pass the value df['a']
So this method will put the corresponding values of 'a' into the Nan values of 'b'
And the final answer will be in 'c'

Merge a lot of DataFrames together, without loop and not using concat

I have >1000 DataFrames, each have >20K rows and several columns, need to be merge by a certain common column, the idea can be illustrated by this:
data1=pd.DataFrame({'name':['a','c','e'], 'value':[1,3,4]})
data2=pd.DataFrame({'name':['a','d','e'], 'value':[3,3,4]})
data3=pd.DataFrame({'name':['d','e','f'], 'value':[1,3,5]})
data4=pd.DataFrame({'name':['d','f','g'], 'value':[0,3,4]})
#some or them may have more or less columns that the others:
#data5=pd.DataFrame({'name':['d','f','g'], 'value':[0,3,4], 'score':[1,3,4]})
final_data=data1
for i, v in enumerate([data2, data3, data4]):
if i==0:
final_data=pd.merge(final_data, v, how='outer', left_on='name',
right_on='name', suffixes=('_0', '_%s'%(i+1)))
#in real case right_on may be = columns other than 'name'
#dependents on the dataframe, but this requirement can be
#ignored in this minimal example.
else:
final_data=pd.merge(final_data, v, how='outer', left_on='name',
right_on='name', suffixes=('', '_%s'%(i+1)))
Result:
name value_0 value_1 value value_3
0 a 1 3 NaN NaN
1 c 3 NaN NaN NaN
2 e 4 4 3 NaN
3 d NaN 3 1 0
4 f NaN NaN 5 3
5 g NaN NaN NaN 4
[6 rows x 5 columns]
It works, but anyway this can be done without a loop?
Also, why the column name of the second to last column is not value_2?
P.S.
I know that in this minimal example, the result can also be achieved by:
pd.concat([item.set_index('name') for item in [data1, data2, data3, data4]], axis=1)
But In the real case due to the way how the dataframes were constructed and the information stored in the index columns, this is not an ideal solution without additional tricks. So, let's not consider this route.
Does it even make sense to merge it, then? What's wrong with a panel?
> data = [data1, data2, data3, data4]
> p = pd.Panel(dict(zip(map(str, range(len(data))), data)))
> p.to_frame().T
major 0 1 2
minor name value name value name value
0 a 1 c 3 e 4
1 a 3 d 3 e 4
2 d 1 e 3 f 5
3 d 0 f 3 g 4
# and just for kicks
> p.transpose(2, 0, 1).to_frame().reset_index().pivot_table(values='value', rows='name', cols='major')
major 0 1 2 3
name
a 1 3 NaN NaN
c 3 NaN NaN NaN
d NaN 3 1 0
e 4 4 3 NaN
f NaN NaN 5 3
g NaN NaN NaN 4

Categories

Resources