Creating a new column taking a single value from a column of another dataframe - Python

I have two dataframes. The first one is df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]}), i.e.
A B
0 5 2
1 0 4
The other one is df2 = pd.DataFrame({'C': [1, 1], 'D': [3, 3]}), i.e.
C D
0 1 3
1 1 3
I want to grab only the 4 from df1 and make a new column in df2. I have tried df2['E'] = df1['B'][df1['B'] == 4] and got
C D E
0 1 3 NaN
1 1 3 4.0
I want both rows of df2['E'] to be 4. How can I achieve this? Any help would be greatly appreciated.

If the value 4 appears as the last value in your column (like in your example), you could do:
df2['E'] = df2['E'].fillna(method='backfill')
For other methods, have a look here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
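Side note (not part of the original answer): in recent pandas versions the method= argument of fillna is deprecated in favor of the dedicated bfill()/ffill() methods, so the equivalent would be roughly:
# assumes pandas >= 2.1 and that df2['E'] already holds [NaN, 4.0] as above
df2['E'] = df2['E'].bfill()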

It is not actually clear what you want to accomplish here, but I assume you would like to check whether there is any 4 in df1 (column B) and then fill all rows in df2 (column E) with 4. Then you could do:
import numpy as np
df2['E'] = np.where(df1['B'].isin([4]).any(), 4, np.nan)
Output:
C D E
0 1 3 4.0
1 1 3 4.0
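If instead you want to take whatever single value matched in df1 and broadcast it to every row of df2, one option (my sketch, assuming exactly one row of df1['B'] equals 4) is to pull it out as a scalar first:
import pandas as pd

df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
df2 = pd.DataFrame({'C': [1, 1], 'D': [3, 3]})

# Extract the single matching value as a scalar, then let pandas broadcast it.
value = df1.loc[df1['B'] == 4, 'B'].iloc[0]  # raises IndexError if nothing matches
df2['E'] = value
print(df2)
#    C  D  E
# 0  1  3  4
# 1  1  3  4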

Related

Pandas: How to collapse a DataFrame to a single-row DataFrame instead of a Series on aggregation?

Title says it all. I have some logic that works on rows and I would like to also use it on aggregations.
Say this is the data frame:
df = pd.DataFrame(np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3]]),
                  columns=['a', 'b', 'c'])
a b c
0 1 2 3
1 1 2 3
2 1 2 3
What I want is the most native way to get
a b c
0 3 6 9
I came across several solutions using Series.to_frame(), .transform(), initializing a new data frame with the Series' index as columns, etc. But is there some simple way I am missing?
In pure NumPy I usually use x.sum(axis=0)[np.newaxis, :] for that.
I think you need to transpose the DataFrame created from the Series:
print (df.sum().to_frame().T)
a b c
0 3 6 9
Or use DataFrame constructor:
print (pd.DataFrame([df.sum()]))
a b c
0 3 6 9
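Another option not mentioned in the answer is DataFrame.agg with a list of functions, which also keeps the result as a one-row DataFrame (a small sketch; the reset_index call is only there to get the 0 index shown above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3]]),
                  columns=['a', 'b', 'c'])

# Passing a list to agg() returns a DataFrame (indexed by 'sum') instead of a Series.
print(df.agg(['sum']).reset_index(drop=True))
#    a  b  c
# 0  3  6  9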

pandas most efficient way to execute arithmetic operations on multiple dataframe columns

my first post!
I'm running Python 3.8.5 and pandas 1.1.0 in Jupyter notebooks.
I want to divide several columns by the corresponding elements in another column of the same dataframe.
For example:
import pandas as pd
df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c':[6, 9, 12]})
df
a b c
0 2 4 6
1 3 6 9
2 4 8 12
I'd like to divide columns 'b' & 'c' by the corresponding values in 'a' and substitute the values in 'b' and 'c' with the result of this division. So the above dataframe becomes:
a b c
0 2 2 3
1 3 2 3
2 4 2 3
I tried
df.iloc[:, 1:] = df.iloc[:, 1:] / df['a']
but this gives:
a b c
0 2 NaN NaN
1 3 NaN NaN
2 4 NaN NaN
I got it working by doing:
for colname in df.columns[1:]:
    df[colname] = (df[colname] / df['a'])
Is there a faster way of doing the above by avoiding the for loop?
thanks,
mk
Almost there, use div with axis=0:
df.iloc[:,1:] = df.iloc[:,1:].div(df.a, axis=0)
df.b = df.b / df.a
df.c = df.c / df.a
or
df[['b', 'c']] = df.apply(lambda x: x[['b', 'c']] / x.a, axis=1)
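For context (my note, not part of the original answers): the plain / operator in the question aligns df.iloc[:, 1:] with df['a'] on the column labels, so 'b' and 'c' never match 'a' and every result is NaN; .div(..., axis=0) aligns on the row index instead. A minimal runnable sketch:
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c': [6, 9, 12]})

# Divide every column except 'a' by that row's value of 'a', aligning on the row index.
df.iloc[:, 1:] = df.iloc[:, 1:].div(df['a'], axis=0)
print(df)  # 'b' becomes 2 and 'c' becomes 3 in every row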

Removing certain Rows from subset of df

I have a pandas dataframe. All the columns to the right of column #2 may only contain the value 0 or 1. If they contain a value that is NOT 0 or 1, I want to remove that entire row from the dataframe.
So I created a subset of the dataframe containing only the columns to the right of column #2.
Then I found the indices of the rows that had values other than 0 or 1 and deleted them from the original dataframe.
Please see the code below:
#reading data file:
data=pd.read_csv('MyData.csv')
#all the columns right of column#2 may only contain the value 0 or 1. So "prod" is a subset of the data df containing these columns:
prod = data.iloc[:,2:]
index_prod = prod[(prod != 0) & (prod != 1)].dropna().index
data = data.drop(index_prod)
However, when I run this, the index_prod vector is empty and so nothing gets dropped at all.
Update: my friend just told me that the data was not numeric, and he fixed it by making it numeric. Can anyone advise how I could have found that out myself? All the columns looked numeric to me - all numbers.
You can check the dtypes with DataFrame.dtypes:
print (data.dtypes)
Or, to list only the non-numeric columns (np here is numpy):
print (data.columns.difference(data.select_dtypes(np.number).columns))
Then convert all columns except the first two to numeric:
data.iloc[:,2:] = data.iloc[:,2:].apply(lambda x: pd.to_numeric(x, errors='coerce'))
Or all columns:
data = data.apply(lambda x: pd.to_numeric(x, errors='coerce'))
Finally, apply the filtering solution:
subset = data.iloc[:,2:]
data1 = data[subset.isin([0,1]).all(axis=1)]
Let's say you have this dataframe:
data = {'A': [1, 2, 3, 4, 5], 'B': [0, 1, 4, 3, 1], 'C': [2, 1, 0, 3, 4]}
df = pd.DataFrame(data)
A B C
0 1 0 2
1 2 1 1
2 3 4 0
3 4 3 3
4 5 1 4
And you want to delete the rows whose value in column B is not 0 or 1. We could accomplish that by:
subset = df[['B']]
index = subset[(subset != 0) & (subset != 1)].dropna().index
df.drop(index)
A B C
0 1 0 2
1 2 1 1
4 5 1 4
df.drop(index).reset_index(drop=True)
A B C
0 1 0 2
1 2 1 1
2 5 1 4
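As a side note (my addition, not part of either answer): the dropna trick only works cleanly when you check a single column; if a row should be dropped as soon as any of the checked columns holds something other than 0 or 1, the isin approach from the first answer generalizes better. A small sketch on the same example data:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [0, 1, 4, 3, 1],
                   'C': [2, 1, 0, 3, 4]})

# Keep only the rows where every value in columns B and C is 0 or 1.
subset = df.iloc[:, 1:]
print(df[subset.isin([0, 1]).all(axis=1)])
#    A  B  C
# 1  2  1  1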

sum values in different rows and columns dataframe python

My Data Frame
A B C D
2 3 4 5
1 4 5 6
5 6 7 8
How do I add values from different rows and different columns? For example:
Column A row 2 with Column B row 1,
Column A row 3 with Column B row 2,
and similarly for all rows.
If you only need to do this with two columns (and I understand your question correctly), I think you can use the shift function.
Your data frame (pandas?) is something like:
d = {'A': [2, 1, 5], 'B': [3, 4, 6], 'C': [4, 5, 7], 'D':[5, 6, 8]}
df = pd.DataFrame(data=d)
So, it's possible to create a new Series with column B shifted:
df2 = df['B'].shift(1)
which gives:
0 NaN
1 3.0
2 4.0
Name: B, dtype: float64
and then, merge this new data with the previous df and, for example, sum the values:
df = df.join(df2, rsuffix='shift')
df['out'] = df['A'] + df['Bshift']
The final output is in out column:
A B C D Bshift out
0 2 3 4 5 NaN NaN
1 1 4 5 6 3.0 4.0
2 5 6 7 8 4.0 9.0
But this is only my intuition; I'm not sure I fully understood your question!
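If you don't need the intermediate Bshift column, the same result can be computed directly (my shorthand, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 5], 'B': [3, 4, 6], 'C': [4, 5, 7], 'D': [5, 6, 8]})

# Add each row's A to the previous row's B; the first row has no previous B, hence NaN.
df['out'] = df['A'] + df['B'].shift(1)
print(df)
#    A  B  C  D  out
# 0  2  3  4  5  NaN
# 1  1  4  5  6  4.0
# 2  5  6  7  8  9.0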

Using pandas fillna() on multiple columns

I'm a new pandas user (as of yesterday), and have found it at times both convenient and frustrating.
My current frustration is in trying to use df.fillna() on multiple columns of a dataframe. For example, I've got two sets of data (a newer set and an older set) which partially overlap. For the cases where we have new data, I just use that, but I also want to use the older data if there isn't anything newer. It seems I should be able to use fillna() to fill the newer columns with the older ones, but I'm having trouble getting that to work.
Attempt at a specific example:
df.ix[:,['newcolumn1','newcolumn2']].fillna(df.ix[:,['oldcolumn1','oldcolumn2']], inplace=True)
But this doesn't work as expected - numbers show up in the new columns that had been NaNs, but not the ones that were in the old columns (in fact, looking through the data, I have no idea where the numbers it picked came from, as they don't exist in either the new or old data anywhere).
Is there a way to fill in NaNs of specific columns in a DataFrame with values from other specific columns of the DataFrame?
fillna is generally for carrying an observation forward or backward. Instead, I'd use np.where... if I understand what you're asking.
import numpy as np
np.where(np.isnan(df['newcolumn1']), df['oldcolumn1'], df['newcolumn1'])
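To actually keep the result you would assign the array back to the column; a minimal sketch using made-up data with the column names from the question:
import numpy as np
import pandas as pd

# Hypothetical data, just to make the snippet runnable.
df = pd.DataFrame({'newcolumn1': [1.0, np.nan, 3.0],
                   'oldcolumn1': [9.0, 8.0, 7.0]})

# Where newcolumn1 is NaN, take the value from oldcolumn1; otherwise keep it.
df['newcolumn1'] = np.where(np.isnan(df['newcolumn1']), df['oldcolumn1'], df['newcolumn1'])
print(df)
#    newcolumn1  oldcolumn1
# 0         1.0         9.0
# 1         8.0         8.0
# 2         3.0         7.0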
To answer your question: yes. Look at using the value argument of fillna, along with the to_dict() method on the other dataframe.
But to really solve your problem, have a look at the update() method of the DataFrame. Assuming your two dataframes are similarly indexed, I think it's exactly what you want.
In [36]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [37]: df
Out[37]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [38]: df2 = pd.DataFrame({'A': [0, np.nan, 2, 3, 4, 5], 'B': [1, 0, 1, 1, 0, 0]})
In [40]: df2
Out[40]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 0
In [52]: df.update(df2, overwrite=False)
In [53]: df
Out[53]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 1
Notice that all the NaNs in df were replaced except for (1, A), since that was also NaN in df2. Also, some of the values, like (5, B), differed between df and df2; by using overwrite=False, the value from df is kept.
EDIT: Based on the comments, it seems like you're looking for a solution where the column names don't match across the two DataFrames (it'd be helpful if you posted sample data). Let's try that, replacing column A with C and B with D.
In [33]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [34]: df2 = pd.DataFrame({'C': [0, np.nan, 2, 3, 4, 5], 'D': [1, 0, 1, 1, 0, 0]})
In [35]: df
Out[35]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [36]: df2
Out[36]:
C D
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 0
In [37]: d = {'A': df2.C, 'B': df2.D}  # pass these values to fillna
In [38]: df
Out[38]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [40]: df.fillna(value=d)
Out[40]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 1
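If the old/new column pairs follow a known mapping, the dict can also be built programmatically; a sketch with a hypothetical mapping, using the same df and df2 as in the EDIT above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
df2 = pd.DataFrame({'C': [0, np.nan, 2, 3, 4, 5], 'D': [1, 0, 1, 1, 0, 0]})

# Hypothetical mapping: column to fill in df -> source column in df2.
mapping = {'A': 'C', 'B': 'D'}
fill_values = {new_col: df2[old_col] for new_col, old_col in mapping.items()}
print(df.fillna(value=fill_values))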
I think if you invest the time to learn pandas you'll hit fewer moments of frustration. It's a massive library though, so it takes time.
