Pandas pivot dataframe where values are a function of two columns - python

Suppose we have a dataframe
df = pd.DataFrame({'A': [1, 2], 'A_values': [3, 4], 'B': [5, 6], 'B_values': [7, 8]})
I want to sort of pivot this dataframe so that we have columns A and B suffixed with their values as index and column names in a new dataframe (i.e. index = ['A_1', 'A_2'], columns = ['B_5', 'B_6']), and the values of this new dataframe would be a result of a function on columns A_values and B_values. Suppose the function is a simple sum. For A = 1 we have A_values = 3, for B = 5 we have B_values = 7, therefore in the new dataframe at the intersection of A_1 and B_5 we would have 3+7=10. Complete resulting dataframe below:
df_pivoted = pd.DataFrame([[3+7, 3+8], [4+7, 4+8]], index = ['A_1', 'A_2'], columns = ['B_5', 'B_6'])
After searching for a while I did not find any functionality in .pivot_table() that would allow passing a function of multiple columns as values. Maybe there is a more suitable method for this situation? Any help is appreciated.

We can filter the columns by prefix, then merge on a dummy key (also known as a cross join in SQL) and pivot. Here I am using groupby with unstack, which is equivalent to pivot_table:
A = df.filter(like='A').assign(key=1)  # columns A, A_values plus a dummy key
B = df.filter(like='B').assign(key=1)  # columns B, B_values plus a dummy key
s = A.merge(B)  # merges on the shared 'key' column, i.e. a cross join
s = s.assign(value=s.A_values + s.B_values).groupby(['A', 'B'])['value'].sum().unstack()
s
B    5   6
A
1   10  11
2   11  12
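As an aside, on pandas >= 1.2 the dummy key can be dropped in favour of merge(how='cross'), and the suffixed labels from the question can be built directly. A minimal sketch, assuming the df defined in the question:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'A_values': [3, 4],
                   'B': [5, 6], 'B_values': [7, 8]})

A = df.filter(like='A')
B = df.filter(like='B')
s = A.merge(B, how='cross')  # cross join, pandas >= 1.2

# build the 'A_1' / 'B_5' style labels the question asks for
s['row'] = 'A_' + s['A'].astype(str)
s['col'] = 'B_' + s['B'].astype(str)
s['value'] = s['A_values'] + s['B_values']
df_pivoted = s.pivot(index='row', columns='col', values='value')
#      B_5  B_6
# A_1   10   11
# A_2   11   12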

Related

Fastest way to filter index values based on list values from multiple columns in a Pandas DataFrame?

I have a data frame like below. It has two columns, col1 and col2, and from these two columns I want to filter a few values (combinations of two lists) and get the index. Though I wrote the logic for it, it will be too slow when filtering a larger data frame. Is there any faster way to filter the data and get the list of indexes?
Data frame:
import pandas as pd
d = {'col1': [11, 20,90,80,30], 'col2': [30, 40,50,60,90]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2
0    11    30
1    20    40
2    90    50
3    80    60
4    30    90
l1 = [11, 90, 30]
l2 = [30, 50, 90]
final_result = []
for i, j in zip(l1, l2):
    res = df[(df['col1'] == i) & (df['col2'] == j)]
    final_result.append(res.index[0])
print(final_result)
[0, 2, 4]
You can just use the underlying numpy array and create the boolean mask via broadcasting:
import numpy as np

mask = (df[['col1', 'col2']].values[:, None] == np.vstack([l1, l2]).T).all(-1).any(1)
# mask
# array([ True, False,  True, False,  True])
df.index[mask]
# Int64Index([0, 2, 4], dtype='int64')
You can use string containment, though with two caveats: substring matches can produce false positives (e.g. 11 would also match a value of 110), and the two conditions are checked independently, so any combination of an l1 value with an l2 value passes, not just the zipped pairs (a stricter variant is sketched below):
condition_1 = df['col1'].astype(str).str.contains('|'.join(map(str, l1)))
condition_2 = df['col2'].astype(str).str.contains('|'.join(map(str, l2)))
final_result = df.loc[condition_1 & condition_2].index.to_list()
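A stricter pair-matching variant, sketched with plain Python sets; it matches only the exact zipped (col1, col2) pairs:
pairs = set(zip(l1, l2))
mask = [pair in pairs for pair in zip(df['col1'], df['col2'])]
final_result = df.index[mask].to_list()
# final_result -> [0, 2, 4]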
Here is one way to do it: merge the two DFs and keep the rows whose values exist in both.
# create a DF of the pairs you'd like to match
df2 = pd.DataFrame({'col1': l1, 'col2': l2})

# left-merge so the original row order (and index) is preserved
df3 = df.merge(df2, how='left', on=['col1', 'col2'], indicator='foundIn')

# filter out rows that are in both
out = df3[df3['foundIn'].eq('both')].index.to_list()
out
[0, 2, 4]

How to render a column name as a single cell in the midst of multilevel columns in pandas?

I'm working with multilevel indexes in columns. I have to send these tables, and for sending them I'm using df.to_html(). foo is the index, which I've converted to a column.
While converting it to a column, I want its header label to occupy both header rows as a single cell so the table looks nice.
The code looks like this.
df = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]], index=['M1', 'M2', 'M3'])
df.columns = pd.MultiIndex.from_product([['x', 'y'], ['a', 'b']])
ind = df.index
df.reset_index(drop=True, inplace=True)
df.insert(0, 'foo', ind)
With the code you provide, foo is not set as the index of the dataframe.
Anyway, you could add this after your current code in order to correct the header before converting it to html (the rename swaps 'foo' from the top header level to the second one, leaving the cell above it blank):
df = df.rename(axis=1, level=0, mapper={"foo": ""}).rename(
    axis=1, level=1, mapper={"": "foo"}
)
df.to_html(index=False)
This way, the html version of your dataframe renders the desired way:
       x       y
foo    a   b   a   b
M1     1   2   3   4
M2     1   2   3   4
M3     1   2   3   4

Python - count number of elements that are equal between two columns of two dataframes

I have two dataframes, df1 and df2, which contain the columns col1 and col2 respectively. I would like to calculate the number of elements in column col1 of df1 that are equal to values in col2 of df2. How can I do that?
You can use Series.isin and sum the resulting booleans, df1.col1.isin(df2.col2).sum():
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})
nb_common_elements = df1.col1.isin(df2.col2).sum()
assert nb_common_elements == 3
Be cautious depending on your use case because:
df1 = pd.DataFrame({'col1': [1, 1, 1, 2, 7]})
df1.col1.isin(df2.col2).sum()
Would return 4 and not 2, because every 1 from df1.col1 is present in df2.col2. If that's not the expected behaviour, you could drop duplicates from df1.col1 before testing the intersection size:
df1.col1.drop_duplicates().isin(df2.col2).sum()
Which in this example would return 2.
To better understand why this is happening you can have a look at what .isin is returning:
df1['isin df2.col2'] = df1.col1.isin(df2.col2)
Which gives:
   col1  isin df2.col2
0     1           True
1     1           True
2     1           True
3     2          False
4     7           True
Now .sum() adds up the booleans from column isin df2.col2 (a total of 4 True).
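As an aside, if only distinct common values matter, a plain set intersection is a compact alternative (a sketch equivalent to the drop_duplicates approach above):
nb_unique_common = len(set(df1.col1) & set(df2.col2))
# with df1.col1 = [1, 1, 1, 2, 7] and df2.col2 = [1, 3, 5, 7] this gives 2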
I assume you're using pandas.
One way is to simply use pd.merge, joining df1.col1 against df2.col2, and return the length of the result. Note that the column names differ, so you need left_on/right_on, and that repeated matches are duplicated in the output (the same caveat as with isin above):
nb_common_elements = len(pd.merge(df1, df2, left_on='col1', right_on='col2'))
Pandas does an inner merge by default.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

How to add column labels to a Pandas DataFrame

I can't understand how to add column names to a pandas dataframe; an easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
Now if I type df then I get
   a  b  c
0  4  4  5
1  1  2  7
2  3  1  9
3  1  4  1
Say now that I generate another dataframe just by summing up the columns of the previous one:
a = df.sum()
If I type a then I get
a     9
b    11
c    22
That looks like a dataframe with an index but without a name for the only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, giving me no error messages. But still, if I type a I can't see the column name anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. (That's also why a.columns = [...] raised no error: it just set an arbitrary attribute on the Series.) If you want to create a DataFrame out of your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
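For instance, with the dic example above:
a = pd.DataFrame(df.sum(), columns=['column'])
#    column
# a       9
# b      11
# c      22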

Pandas: Creating DataFrame from Series

My current code is shown below - I'm importing a MAT file and trying to create a DataFrame from variables within it:
from scipy.io import loadmat  # assuming scipy is used to read the MAT file

mat = loadmat(file_path)  # load mat-file
Variables = mat.keys()  # identify variable names
df = pd.DataFrame  # Initialise DataFrame
for name in Variables:
    B = mat[name]
    s = pd.Series(B[:, 1])
So within the loop, I can create a Series for each variable (they're arrays with two columns, so the values I need are in the second column).
My question is how do I append the series to the dataframe? I've looked through the documentation and none of the examples seem to fit what I'm trying to do.
Here is how to create a DataFrame where each series is a row.
For a single Series (resulting in a single-row DataFrame):
series = pd.Series([1,2], index=['a','b'])
df = pd.DataFrame([series])
For multiple series with identical indices:
cols = ['a','b']
list_of_series = [pd.Series([1,2],index=cols), pd.Series([3,4],index=cols)]
df = pd.DataFrame(list_of_series, columns=cols)
For multiple series with possibly different indices:
list_of_series = [pd.Series([1,2],index=['a','b']), pd.Series([3,4],index=['a','c'])]
df = pd.concat(list_of_series, axis=1).transpose()
To create a DataFrame where each series is a column, see the answers by others. Alternatively, one can create a DataFrame where each series is a row, as above, and then use df.transpose(). However, the latter approach is inefficient if the columns have different data types.
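A quick illustration of that dtype cost (a sketch with made-up series): building row-wise forces each resulting column to a common dtype.
import pandas as pd
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series(['x', 'y'], index=['a', 'b'])
pd.DataFrame([s1, s2]).dtypes  # both columns become object, not int64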
No need to initialize an empty DataFrame (you weren't even doing that; you'd need pd.DataFrame() with the parens).
Instead, to create a DataFrame where each series is a column,
make a list of Series, series, and
concatenate them horizontally with df = pd.concat(series, axis=1)
Something like:
series = [pd.Series(mat[name][:, 1]) for name in Variables]
df = pd.concat(series, axis=1)
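If you'd rather have the columns named after the MAT variables instead of 0..N-1, pd.concat accepts keys; a sketch under the same assumptions as above:
series = [pd.Series(mat[name][:, 1]) for name in Variables]
df = pd.concat(series, axis=1, keys=list(Variables))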
Nowadays there is a pandas.Series.to_frame method:
Series.to_frame(name=NoDefault.no_default)
Convert Series to DataFrame.
Parameters
name : object, optional
    The passed name should substitute for the series name (if it has one).
Returns
DataFrame
    DataFrame representation of Series.
Examples
s = pd.Series(["a", "b", "c"], name="vals")
s.to_frame()
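which returns:
  vals
0    a
1    b
2    c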
I guess another way, possibly faster, to achieve this is:
1) Use a dict comprehension to get the desired dict (i.e., taking the 2nd column of each array).
2) Then use pd.DataFrame to create an instance directly from the dict, without looping over each column and concatenating.
Assuming your mat looks like this (you can ignore this since your mat is loaded from file):
In [135]: mat = {'a': np.random.randint(5, size=(4,2)),
   .....:        'b': np.random.randint(5, size=(4,2))}

In [136]: mat
Out[136]:
{'a': array([[2, 0],
        [3, 4],
        [0, 1],
        [4, 2]]),
 'b': array([[1, 0],
        [1, 1],
        [1, 0],
        [2, 1]])}
Then you can do:
In [137]: df = pd.DataFrame({name: mat[name][:, 1] for name in mat})

In [138]: df
Out[138]:
   a  b
0  0  0
1  4  1
2  1  0
3  2  1

[4 rows x 2 columns]
