Using condition of a dataframe in pandas.where of another dataframe [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes: df1 has data and df2 is kind of like a map for the data. (They are both the same size and are 2D).
I would like to use pandas.where (or any method that isn't too convoluted) to replace the values of df1 based on the condition of the same cell in df2.
For instance, if df2 is equal to 0, I want to set the same cell in df1 also to 0. How do I do this?
When I try the following I get an error:
df3 = df1.where(df2 == 0, other = 0)

import pandas as pd
df = pd.DataFrame()
df_1 = pd.DataFrame()
df['a'] = [1,2,3,4,5]
df_1['b'] = [5,6,7,8,0]
This gives two sample dataframes, df and df_1.
Now loop over the rows, using range(len(df.index)):
for i in range(len(df.index)):
    # .loc writes to df directly; chained indexing (df['a'][i] = ...) may not
    if df_1.loc[i, 'b'] == 0:
        df.loc[i, 'a'] = 0
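The row loop can usually be avoided. Here is a minimal sketch of a vectorized version, assuming both frames have identical index and column labels (the sample frames and the names masked and kept are made up for illustration). Note that DataFrame.where keeps values where the condition is True, so the condition has to be inverted compared to the attempt in the question; DataFrame.mask replaces where the condition is True.

```python
import pandas as pd

# Illustrative data: df1 holds values, df2 is the "map" of the same shape.
df1 = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
df2 = pd.DataFrame({"x": [0, 1, 1], "y": [1, 0, 1]})

# mask: replace cells of df1 with 0 wherever df2 == 0 is True.
masked = df1.mask(df2 == 0, 0)

# where: keep cells of df1 wherever df2 != 0 is True, otherwise use 0.
kept = df1.where(df2 != 0, 0)

print(masked)
```

Both calls align on labels, so frames with different indexes or column names would need to be aligned first.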

Generally you shouldn't need to handle multiple dataframes separately like this; if df1, df2 have the same shape and either the same index or some common column they can be joined/merged on (e.g. say it's named 'id'), then merge them:
df = pd.merge(df1, df2, on='id')
See Pandas Merging 101

Related

Modifying one dataframe appears to change another [duplicate]

This question already has answers here:
Why can pandas DataFrames change each other?
(3 answers)
How do I clone a list so that it doesn't change unexpectedly after assignment?
(24 answers)
Closed 1 year ago.
I am new to loops in Python and just came across a weird problem. I was doing some calculations on multiple dataframes, and to simplify the question, here is an illustration.
Suppose I have 3 dataframes filled with NaN:
import numpy as np
import pandas as pd
# generate NaN entries
data = np.empty((15, 10))
data[:] = np.nan
# create dataframe
dfnan = pd.DataFrame(data)
df1 = dfnan
df2 = dfnan
df3 = dfnan
After this step, all the three dataframes give me NaN as expected.
But then, if I add two for loops in one block like below:
for i in range(0, 15, 1):
    df1.iloc[i] = 0
for j in range(0, 15, 1):
    df2.iloc[j] = df1.iloc[j].transform(lambda x: x+1)
Then all of df1, df2, and df3 give me 1 entries. But shouldn't it be that:
df1 filled with 0, df2 filled with 1 and df3 filled with NaN (since I didn't make any change to it)?
Why is that, and how can I change it to get the wanted result?
Assignment never copies in Python. df1, df2, df3 and dfnan are all references to the same object (pd.DataFrame(data)). This means that a change made through one name is visible through all the others, as they all point to the same object.
This is a great read: https://nedbatchelder.com/text/names.html.
To create independent copies, use the copy method:
dfnan = pd.DataFrame(data)
df1 = dfnan.copy()
df2 = dfnan.copy()
df3 = dfnan.copy()
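The difference can be checked directly with the is operator. This is a small sketch with a shortened frame; the names alias and independent are made up for illustration.

```python
import numpy as np
import pandas as pd

dfnan = pd.DataFrame(np.full((3, 2), np.nan))

alias = dfnan               # second name for the same object
independent = dfnan.copy()  # new object with its own data

alias.iloc[0] = 0           # also visible through dfnan

print(alias is dfnan)        # True: same object
print(independent is dfnan)  # False: separate object
print(dfnan)
print(independent)
```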

How to rename the first column of a pandas dataframe?

I have come across this question many times on the internet, but there are few answers apart from the likes of the following:
Cannot rename the first column in pandas DataFrame
I approached it as follows:
df = df.rename(columns={df.columns[0]: 'Column1'})
Is there a better or cleaner way of doing the rename of the first column of a pandas dataframe? Or any specific column number?
You're already using a cleaner way in pandas.
Unfortunately, this:
df.columns[0] = 'Column1'
is impossible because Index objects do not support mutable operations. It would raise a TypeError.
You still could do iterable unpacking:
df.columns = ['Column1', *df.columns[1:]]
Or:
df = df.set_axis(['Column1', *df.columns[1:]], axis=1)
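For comparison, here is a short sketch that runs the options above on a made-up three-column frame (the column names a, b, c and the variable names are illustrative only).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Option from the question: rename by looking up the first label.
via_rename = df.rename(columns={df.columns[0]: "Column1"})

# Positional replacement of the whole axis.
via_set_axis = df.set_axis(["Column1", *df.columns[1:]], axis=1)

# Direct item assignment on the Index is rejected.
try:
    df.columns[0] = "Column1"
except TypeError as exc:
    error_message = str(exc)

print(list(via_rename.columns))
print(list(via_set_axis.columns))
print(error_message)
```

One difference worth knowing: rename works by label, so if the first label is duplicated elsewhere in the columns, every column with that label is renamed, whereas the set_axis form only touches position 0.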
Not sure if it is cleaner, but another possibility is to convert the columns to a list and set the new value by index:
df = pd.DataFrame(columns=[4,7,0,2])
arr = df.columns.tolist()
arr[0] = 'Column1'
df.columns = arr
print (df)
Empty DataFrame
Columns: [Column1, 7, 0, 2]
Index: []

Reindex dataframe inside loop [duplicate]

This question already has answers here:
How to change variables fed into a for loop in list form
(4 answers)
Closed 5 months ago.
I'm trying to reindex the columns in a set of dataframes inside a loop. This only seems to work outside the loop. See sample code below
import pandas as pd
data1 = [[1,2,3],[4,5,6],[7,8,9]]
data2 = [[10,11,12],[13,14,15],[16,17,18]]
data3 = [[19,20,21],[22,23,24],[25,26,27]]
index = ['a','b','c']
columns = ['d','e','f']
df1 = pd.DataFrame(data=data1,index=index,columns=columns)
df2 = pd.DataFrame(data=data2,index=index,columns=columns)
df3 = pd.DataFrame(data=data3,index=index,columns=columns)
columns2 = ['f','e','d']
for i in [df1,df2,df3]:
    i = i.reindex(columns=columns2)
print(df1)
df2 = df2.reindex(columns=columns2)
print(df2)
df1 is not reindexed as desired, however if I reindex df2 outside of the loop it works. Why is that?
Thanks
Andrew
That happens for the same reason this happens:
a = 5
b = 6
for i in [a, b]:
    i = 4
>>> a
5
Why? See this accepted answer.
Concerning your problem, one way to go about it is to create a list of reindexed dataframes like so:
reindexed_dfs = [df.reindex(columns=columns2) for df in [df1, df2, df3]]
and then reassign df1, df2 and df3. But it's better to just keep using your newly created list anyways.
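A compact sketch of both points, using one-row frames for brevity (the loop variable frame and the name unchanged are illustrative): the loop only rebinds its own variable, while unpacking the list comprehension rebinds the outer names.

```python
import pandas as pd

columns = ["d", "e", "f"]
columns2 = ["f", "e", "d"]
df1 = pd.DataFrame([[1, 2, 3]], columns=columns)
df2 = pd.DataFrame([[4, 5, 6]], columns=columns)

# reindex returns a new frame; assigning it to the loop variable
# does not touch the names df1 and df2.
for frame in [df1, df2]:
    frame = frame.reindex(columns=columns2)

unchanged = list(df1.columns)

# Rebind the original names from a list of reindexed frames.
df1, df2 = [frame.reindex(columns=columns2) for frame in [df1, df2]]

print(unchanged)
print(list(df1.columns))
```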

What is the difference between 'pd.concat([df1, df2], join='outer')', 'df1.combine_first(df2)', 'pd.merge(df1, df2)' and 'df1.join(df2, how='outer')'? [duplicate]

This question already has answers here:
Difference(s) between merge() and concat() in pandas
(7 answers)
Closed 2 years ago.
Say I have the following 2 pandas dataframes:
import pandas as pd
A = [174,-155,-931,301]
B = [943,847,510,16]
C = [325,914,501,884]
D = [-956,318,319,-83]
E = [767,814,43,-116]
F = [110,-784,-726,37]
G = [-41,964,-67,-207]
H = [-555,787,764,-788]
df1 = pd.DataFrame({"A": A, "B": B, "C": C, "D": D})
df2 = pd.DataFrame({"E": E, "B": F, "C": G, "D": H})
If I do concat with join=outer, I get the following resulting dataframe:
pd.concat([df1, df2], join='outer')
If I do df1.combine_first(df2), I get the following:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
If I do pd.merge(df1, df2), I get the following which is identical to the result produced by concat:
pd.merge(df1, df2, on=['B','C','D'], how='outer')
And finally, if I do df1.join(df2, how='outer'), I get the following:
df1.join(df2, how='outer', on='B', lsuffix='_left', rsuffix='_right')
I don't fully understand how and why each produces different results.
concat: append one dataframe to another along the given axis (default axis=0, meaning concat along the index, i.e. put the other dataframe below the given dataframe). Data are aligned on the other axis (i.e. with the default setting, columns are aligned). This is why we get NaNs in the non-matching columns 'A' and 'E'.
combine_first: replace NaNs in dataframe by existing values in other dataframe, where rows and columns are pooled (union of rows and cols from both dataframes). In your example, there are no missing values from the beginning but they emerge due to the union operation as your indices have no common entries. The order of the rows results from the sorted combined index (df1.B and df2.B).
So if there are no missing values in your dataframe you wouldn't normally use combine_first.
merge is a database-style combination of two dataframes that offers more options on how to merge (left, right, specific columns) than concat. In your example, the data of the result are identical, but there's a difference in the index between concat and merge: when merging on columns, the dataframe indices will be ignored and a new index will be created.
join merges df1 and df2 on the given column of df1 (in the example 'B') and the index of df2, since the on argument of join names a column in the calling dataframe to match against the index of the other. In your example this is the same as pd.merge(df1, df2, left_on='B', right_index=True, how='outer', suffixes=('_left', '_right')). As there's no match between column 'B' of df1 and the index of df2, there will be a lot of NaNs due to the outer join.
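The differences are easier to see on tiny frames. This is a sketch with made-up two-row data and no shared key values (the names stacked, merged, combined and joined are illustrative); the join call sets 'B' as the index of the right frame first, which is the usual way to use join with on.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
df2 = pd.DataFrame({"E": [3, 4], "B": [30, 40]})

# concat: stack rows, align columns, keep the original index labels.
stacked = pd.concat([df1, df2], join="outer")

# merge: database-style join on column B, builds a fresh index.
merged = pd.merge(df1, df2, on="B", how="outer")

# combine_first: union of rows and columns, index is the sorted B values.
combined = df1.set_index("B").combine_first(df2.set_index("B"))

# join: column B of df1 is matched against the index of the other frame.
joined = df1.join(df2.set_index("B"), on="B", how="outer")

print(stacked)
print(merged)
print(combined)
print(joined)
```

All four results contain the same four rows of data here; what differs is the resulting index and how the key column is handled.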

Python/Pandas - Query a MultiIndex Column [duplicate]

This question already has answers here:
Select columns using pandas dataframe.query()
(5 answers)
Closed 4 years ago.
I'm trying to use query on a MultiIndex column. It works on a MultiIndex row, but not the column. Is there a reason for this? The documentation shows examples like the first one below, but it doesn't indicate that it won't work for a MultiIndex column.
I know there are other ways to do this, but I'm specifically trying to do it with the query function
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4,4)))
df.index = pd.MultiIndex.from_product([[1,2],['A','B']])
df.index.names = ['RowInd1', 'RowInd2']
# This works
print(df.query('RowInd2 in ["A"]'))
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
df.columns.names = ['ColInd1', 'ColInd2']
# query on index works, but not on the multiindexed column
print(df.query('index < 2'))
print(df.query('ColInd2 in ["A"]'))
To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()
You can use query on the (unnamed) row index, and IndexSlice for the columns:
df.query('ilevel_0>2')
Out[327]:
ColInd1         1                   2
ColInd2         A         B         A         B
3        0.652576  0.639522   0.52087  0.446931
df.loc[:,pd.IndexSlice[:,'A']]
Out[328]:
ColInd1         1         2
ColInd2         A         A
0        0.092394  0.427668
1        0.326748  0.383632
2        0.717328  0.354294
3        0.652576  0.520870
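A self-contained sketch of column selection on a MultiIndex without query, using deterministic values instead of random ones so the result is easy to check (the names by_slice, by_level and by_xs are illustrative). It shows the IndexSlice form above next to two alternatives: a boolean mask built from get_level_values, and DataFrame.xs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))
df.columns = pd.MultiIndex.from_product(
    [[1, 2], ["A", "B"]], names=["ColInd1", "ColInd2"]
)

# Slice every value of the first level, only "A" on the second level.
by_slice = df.loc[:, pd.IndexSlice[:, "A"]]

# Boolean mask over the columns built from one level's labels.
by_level = df.loc[:, df.columns.get_level_values("ColInd2") == "A"]

# Cross-section on the column axis; drop_level=False keeps both levels.
by_xs = df.xs("A", axis=1, level="ColInd2", drop_level=False)

print(by_slice)
```

The IndexSlice form expects the column MultiIndex to be sorted, which from_product gives here; the boolean mask has no such requirement.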
