I have a list of pandas dataframes in which I want to add together the rows that share the same index. For example, say we have two dataframes whose indexes are unordered:
Column1 Column2
Item1 1 4
Item3 2 5
Item2 3 6
Column1 Column2
Item1 1 3
Item2 2 4
Is there a way to add these two dataframes together by index to get the following result, with Item3 included? A simple df1 + df2 returns the first two rows correctly, but Item3 ends up as NaN. Having the results become floats is fine.
# What I want to calculate
Column1 Column2
Item1 2 7
Item2 5 10
Item3 2 5
# What df1 + df2 actually calculates
Column1 Column2
Item1 2.0 7.0
Item2 5.0 10.0
Item3 NaN NaN
You can use DataFrame.add with fill_value=0, which treats index labels missing from one frame as 0 instead of producing NaN:
df_final = df1.add(df2, fill_value=0)
# floats back to int (safe here because every value is whole)
df_final = df_final.astype(int)
print(df_final)
output
       Column1  Column2
Item1        2        7
Item2        5       10
Item3        2        5
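A self-contained sketch of this approach, rebuilding the two frames from the question (they already carry the items as the index, so no set_index is needed):

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [4, 5, 6]},
                   index=['Item1', 'Item3', 'Item2'])
df2 = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4]},
                   index=['Item1', 'Item2'])

# fill_value=0 treats labels missing from one frame as 0 instead of NaN
result = df1.add(df2, fill_value=0).sort_index()
print(result)
#        Column1  Column2
# Item1      2.0      7.0
# Item2      5.0     10.0
# Item3      2.0      5.0
```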
I have a dataframe as such:
Col1 Col2 Col3.... Col64 Col1 Volume Col2 Volume....Col64 Volume.... Col1 Value Col2 Value...Col 64 Value
2 3 4 5 5 7 9 3 5
3 4 5 11 8 6 5 6 5
5 3 4 6 10 11 5 3 4
I want to multiply Col1 with Col1 Volume, divide by Col1 Value, and place the result in a new column called 'Col1 Result'; similarly, multiply Col2 with Col2 Volume, divide by Col2 Value, and place the result in a new column called 'Col2 Result'.
I wish to do this for every row of those columns.
Output should be as such and these columns should be appended to the existing dataframe.
Col1 Result Col2 Result
3.33 4.2
6 4.8
16.6 8.25
...
How can I perform this operation? It also has to be 1-to-1: the first row of Col1 should be multiplied by the first row of Col1 Volume and divided by the first row of Col1 Value, and so on for each row.
Doing it manually would take a lot of time.
Use DataFrame.filter with a regex anchored by $ (end of string) to get all the Volume and Value columns, strip those suffixes so the column names line up, multiply and divide the matching columns, add a ' Result' suffix with DataFrame.add_suffix, replace missing values with 0, and join the result back to the original DataFrame:
# columns ending in 'Volume' / 'Value', renamed back to the base column names
df1 = df.filter(regex='Volume$').rename(columns=lambda x: x.replace(' Volume',''))
df2 = df.filter(regex='Value$').rename(columns=lambda x: x.replace(' Value',''))
# Col * Volume / Value, suffixed ' Result', appended to the original frame
df = df.join(df[df1.columns].mul(df1).div(df2).add_suffix(' Result').fillna(0))
print(df)
Col1 Col2 Col3 Col64 Col1 Volume Col2 Volume Col64 Volume \
0 2 3 4 5 5 7 9
1 3 4 5 11 8 6 5
Col1 Value Col2 Value Col64 Value Col1 Result Col2 Result Col64 Result
0 3 5 7 3.333333 4.2 6.428571
1 6 5 7 4.000000 4.8 7.857143
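A runnable sketch of the same technique with just two base columns; the numbers are taken from the question's table where legible, and the remaining Volume/Value entries are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [2, 3, 5], 'Col2': [3, 4, 3],
    'Col1 Volume': [5, 8, 10], 'Col2 Volume': [7, 6, 11],
    'Col1 Value': [3, 6, 3], 'Col2 Value': [5, 5, 5],
})

# strip the suffixes so column labels align, then multiply/divide column-wise
vol = df.filter(regex='Volume$').rename(columns=lambda x: x.replace(' Volume', ''))
val = df.filter(regex='Value$').rename(columns=lambda x: x.replace(' Value', ''))
df = df.join(df[vol.columns].mul(vol).div(val).add_suffix(' Result'))
print(df[['Col1 Result', 'Col2 Result']])
```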
I have two dataframes, first one is:
col1 col2 col3
1 14 2 6
2 12 3 3
3 9 4 2
Second one is:
col4 col5 col6
2 14 2 6
3 12 3 3
I want to concatenate them, taking the index values from the second one and the row values from the first one.
The result will be like this:
col1 col2 col3
2 12 3 3
3 9 4 2
My solution was pd.concat([df2, df1], axis=1).drop(df2.columns, axis=1), but I believe there is a more efficient way to do this.
You can use the index of df2 with loc on df1:
df1.loc[df2.index]
Output:
col1 col2 col3
2 12 3 3
3 9 4 2
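As a self-contained sketch (frames rebuilt from the question; note that reindex would be the safer choice if df2 could ever contain labels missing from df1, since loc raises a KeyError in that case):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [14, 12, 9], 'col2': [2, 3, 4], 'col3': [6, 3, 2]},
                   index=[1, 2, 3])
df2 = pd.DataFrame({'col4': [14, 12], 'col5': [2, 3], 'col6': [6, 3]},
                   index=[2, 3])

# select df1's rows at df2's index labels
result = df1.loc[df2.index]
print(result)
#    col1  col2  col3
# 2    12     3     3
# 3     9     4     2
```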
I have a text file that looks like
item1 value1 0
item1 value2 0
item1 value3 0
item2 value1 0
item1 value2 0
item1 value3 0
I'd like to get a pandas dataframe where I have each value as a column, and each item as a row.
E.g.
item | value1 | value2 | value3 | value4...
item1 | 0 | 0 | 0 | NaN
I know how to do it by iterating over the dataframe, but I thought there might be a way to avoid iteration (as that's an anti-pattern), perhaps through groupby?
What you are looking for is called pivoting in pandas nomenclature; see the pandas pivot documentation.
You simply have to do this:
df.pivot(index="item", columns="value", values="zero_col")
Change the names according to your dataframe column names.
Edit
I tested it locally and, at least in the general case, it seems to work. Although, as @tdy suggested, some cleaning up after the pivot operation might be necessary to fit your use case.
A snippet:
import numpy as np
import pandas as pd

c = {"items": np.arange(5), "values": np.arange(5), "zero_cols": np.zeros(5)}
df = pd.DataFrame(c, columns=["items", "values", "zero_cols"])
df.pivot(index="items", columns="values", values="zero_cols")
Here is the result:
values 0 1 2 3 4
items
0 0.0 NaN NaN NaN NaN
1 NaN 0.0 NaN NaN NaN
2 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 NaN
4 NaN NaN NaN NaN 0.0
It seems @kalgoritmi's answer works fine for you, but it breaks on my end with your sample data. I'm not sure if this is a versioning issue (I'm on pandas 1.2.3). In any case, this might be useful for others.
If there are duplicate pairs, pivoting immediately will throw a duplicate index ValueError:
>>> df = pd.DataFrame({'item': ['item1','item1','item1','item2','item1','item1'], 'value': ['value1','value2','value3']*2, 'number': 0})
item value number
0 item1 value1 0
1 item1 value2 0
2 item1 value3 0
3 item2 value1 0
4 item1 value2 0
5 item1 value3 0
>>> df.pivot(index='item', columns='value', values='number')
ValueError: Index contains duplicate entries, cannot reshape
One workaround is to aggregate the duplicate pairs before pivoting, e.g. with mean():
>>> df = df.groupby(['item', 'value'], as_index=False).mean()
item value number
0 item1 value1 0.0
1 item1 value2 0.0
2 item1 value3 0.0
3 item2 value1 0.0
>>> df.pivot(index='item', columns='value', values='number')
value value1 value2 value3
item
item1 0.0 0.0 0.0
item2 0.0 NaN NaN
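The two steps above (aggregate, then pivot) can also be collapsed into one call with pivot_table, which aggregates duplicate (item, value) pairs itself, using mean by default; a minimal sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['item1', 'item1', 'item1', 'item2', 'item1', 'item1'],
    'value': ['value1', 'value2', 'value3'] * 2,
    'number': 0,
})

# pivot_table tolerates duplicate index/column pairs by aggregating them
out = df.pivot_table(index='item', columns='value', values='number', aggfunc='mean')
print(out)
```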
So I have a dataframe and I would like to be able to compare each value with other values in its row and column at the same time. For example, I have something like this
Col1 Col2 Col3 NumCol
Row1 1 4 7 16
Row2 2 5 8 13
Row3 3 6 9 30
NumRow 28 14 10
For each value that isn't in the NumRow row or NumCol column, I would like to compare the NumCol and NumRow values in the same row/column as it.
For each row, I would like to return the value at the first column (left to right) where that row's NumCol is larger than the column's NumRow.
So the result would be this:
Row1 4
Row2 8
Row3 3
I have no clue on how to even begin this, but is there a way to do this elegantly without using for loops to loop through the whole dataframe to find these values?
First we flatten the dataframe (here df is your original dataframe):
df2 = (df.fillna('NumRow')
         .set_index('NumCol')
         .transpose()
         .set_index('NumRow')
         .stack()
         .reset_index(name='value')
      )
df2
output
NumRow NumCol value
0 28 16.0 1
1 28 13.0 2
2 28 30.0 3
3 14 16.0 4
4 14 13.0 5
5 14 30.0 6
6 10 16.0 7
7 10 13.0 8
8 10 30.0 9
Now, for each row of the new dataframe df2, we have the corresponding number from NumRow, the corresponding number from NumCol, and the number from the 'body' of the original dataframe df.
Next we apply the condition, group by NumCol, and within each group find the first row where the condition is satisfied, reporting the corresponding value:
df3 = (df2.assign(cond = df2['NumCol'] > df2['NumRow'])
          .groupby('NumCol')
          .apply(lambda d: d[d['cond']].iloc[0])['value']
      )
df3.index = df3.index.map(dict(zip(df['NumCol'],df.index)))
df3.sort_index()
Output
NumCol
Row1 4
Row2 8
Row3 3
Name: value, dtype: int64
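An alternative, broadcasting-based sketch: assuming the NumRow/NumCol totals can be held separately from the body (the data below is reconstructed from the example), a boolean comparison plus idxmax finds each row's first qualifying column without flattening:

```python
import pandas as pd

body = pd.DataFrame([[1, 4, 7], [2, 5, 8], [3, 6, 9]],
                    index=['Row1', 'Row2', 'Row3'],
                    columns=['Col1', 'Col2', 'Col3'])
num_col = pd.Series([16, 13, 30], index=body.index)    # NumCol, one total per row
num_row = pd.Series([28, 14, 10], index=body.columns)  # NumRow, one total per column

# mask[i, j] is True where row i's NumCol exceeds column j's NumRow
mask = num_col.values[:, None] > num_row.values[None, :]
# idxmax over each row returns the label of the first True column
first_col = pd.DataFrame(mask, index=body.index, columns=body.columns).idxmax(axis=1)
result = pd.Series([body.at[r, c] for r, c in first_col.items()], index=body.index)
print(result)
# Row1    4
# Row2    8
# Row3    3
```

Note that idxmax assumes every row has at least one True; rows with no qualifying column would silently return the first column.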
I have a dataframe like this:
df1
col1 col2 col3 col4
1 2 A S
3 4 A P
5 6 B R
7 8 B B
I have another data frame:
df2
col5 col6 col3
9 10 A
11 12 R
I want to join these two data frames: a row of df1 should join with a row of df2 if either its col3 or its col4 value matches a col3 value of df2.
the final data frame will look like:
df3
col1 col2 col3 col5 col6
1 2 A 9 10
3 4 A 9 10
5 6 R 11 12
If a row's col3 value is present in df2's col3, it should join via col3; otherwise it should join via col4, if that value is present in df2's col3.
How can I do this in the most efficient way using pandas/python?
Use a double merge with the default inner join; for the second merge, filter out the df2 rows already matched in df3, and finally concat the pieces together:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2[~df2['col3'].isin(df1['col3'])], on='col3'))
df = pd.concat([df3, df4],ignore_index=True)
print(df)
col1 col2 col3 col5 col6
0 1 2 A 9 10
1 3 4 A 9 10
2 5 6 R 11 12
EDIT: Use left joins and finally combine_first:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3', how='left')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2, on='col3', how='left'))
df = df3.combine_first(df4)
print(df)
col1 col2 col3 col5 col6
0 1 2 A 9.0 10.0
1 3 4 A 9.0 10.0
2 5 6 B 11.0 12.0
3 7 8 B NaN NaN
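A runnable sketch of the inner-join version, with the frames rebuilt from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 3, 5, 7], 'col2': [2, 4, 6, 8],
                    'col3': ['A', 'A', 'B', 'B'], 'col4': ['S', 'P', 'R', 'B']})
df2 = pd.DataFrame({'col5': [9, 11], 'col6': [10, 12], 'col3': ['A', 'R']})

# first pass: match on col3
df3 = df1.drop('col4', axis=1).merge(df2, on='col3')
# second pass: match on col4, but only against df2 rows not already matched via col3
df4 = (df1.drop('col3', axis=1).rename(columns={'col4': 'col3'})
          .merge(df2[~df2['col3'].isin(df1['col3'])], on='col3'))
out = pd.concat([df3, df4], ignore_index=True)
print(out)
#    col1  col2 col3  col5  col6
# 0     1     2    A     9    10
# 1     3     4    A     9    10
# 2     5     6    R    11    12
```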