Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult in Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.ix[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)
Attention: in the latest version of pandas, neither KSD's nor EdChum's answer works anymore:
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if values in column 'Name' are unique and are sorted in both data frames.
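The NaN values in that result come from index alignment: when a DataFrame is assigned through .loc, pandas matches rows by index label, not by Name. A minimal sketch of what gets aligned, assuming the toy frames from the question (the names target_rows and aligned are only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

# Row labels of df1 selected by the boolean mask: 1, 2 and 3.
target_rows = df1.index[df1["Name"].isin(df2["Name"])]

# Re-align df2 to those labels, which mirrors what the assignment does.
# df2 only has labels 0 and 1, so labels 2 and 3 come back as NaN,
# and label 1 receives df2's second row regardless of its Name.
aligned = df2[["Nonprofit", "Education"]].reindex(target_rows)
print(aligned)
```

This is why the assignment only behaves when both frames happen to share the same row labels in the same order.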
Here is my answer:
Way 1:
df1 = df1.merge(df2,on='Name',how="left")
# df2 only shares 'Nonprofit' and 'Education' with df1, so only those get _x/_y suffixes
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(["Education_x","Nonprofit_x"],inplace=True,axis=1)
df1.rename(columns={'Education_y':'Education','Nonprofit_y':'Nonprofit'},inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
A few more notes on update: the index columns of the two data frames do not need to share a name before calling update; you could use 'Name1' and 'Name2'. It also works if df2 contains extra rows that match nothing in df1; those rows simply don't update df1. In other words, df2 doesn't need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
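If the goal is strictly to replace zeros (and to keep integer columns), another option is to look the values up by name. This is only a sketch under the assumption that the names in df2 are unique; lookup, replacement and overwrite are illustrative names, not part of either answer:

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

# Index df2 by name so each column can serve as a name -> value lookup.
lookup = df2.set_index("Name")

for col in ["Nonprofit", "Education"]:
    # Names that are missing from df2 map to NaN.
    replacement = df1["Name"].map(lookup[col])
    # Overwrite only where df1 holds a zero and df2 supplies a value.
    overwrite = (df1[col] == 0) & replacement.notna()
    df1[col] = df1[col].mask(overwrite, replacement).astype(int)

print(df1)
```

Unlike update, this leaves non-zero values in df1 untouched even when df2 disagrees with them.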
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
In [27]:
This is the correct one.
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exist in df. In other words, df should be a superset of df1.
If df1 has some rows that do not match df (that is, df is not a superset of df1), use the following instead:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name),['Nonprofit', 'Education']].values
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
I have data in the following format: Table 1
This data is loaded into a pandas dataframe. The date column is the index for this dataframe. How would I have it so the names become the column headings (must be unique) and the values correspond to the right dates.
So it would look something like this:
Table 2
Consider the following toy DataFrame:
>>> df = pd.DataFrame({'x': [1,2,3,4], 'y':['0 a','2 a','3 b','0 b']})
>>> df
x y
0 1 0 a
1 2 2 a
2 3 3 b
3 4 0 b
Start by processing each row into a Series:
>>> new_columns = df['y'].apply(lambda x: pd.Series(dict([reversed(x.split())])))
>>> new_columns
a b
0 0 NaN
1 2 NaN
2 NaN 3
3 NaN 0
Alternatively, new columns can be generated using pivot (the effect is the same):
>>> new_columns = df['y'].str.split(n=1, expand=True).pivot(columns=1, values=0)
Finally, concatenate the original and the new DataFrame objects:
>>> df = pd.concat([df, new_columns], axis=1)
>>> df
x y a b
0 1 0 a 0 NaN
1 2 2 a 2 NaN
2 3 3 b NaN 3
3 4 0 b NaN 0
Drop any columns that you don't require:
>>> df.drop(['y'], axis=1)
x a b
0 1 0 NaN
1 2 2 NaN
2 3 NaN 3
3 4 NaN 0
You will need to split out the column’s values, then rename your dataframe’s columns, and then you can pivot() the dataframe. I have added the steps below:
df = df[0].str.split(' ', expand=True) # assumes you only have the one column; keep the result
df.columns = ['col_name','values'] # use whatever naming convention you like
df.pivot(columns = 'col_name',values = 'values')
Please let me know if this helps.
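The split, rename and pivot steps can be sketched end to end as follows. The input frame raw is made up for illustration (a single column 0 holding "name value" strings), since the original table is not shown:

```python
import pandas as pd

# Hypothetical input: one column (labelled 0) of "name value" strings.
raw = pd.DataFrame({0: ["a 0", "a 2", "b 3", "b 0"]})

# Split each string into two columns and give them readable names.
parts = raw[0].str.split(" ", expand=True)
parts.columns = ["col_name", "values"]

# Pivot so every distinct name becomes its own column;
# the existing row index is kept because no index argument is passed.
wide = parts.pivot(columns="col_name", values="values")
print(wide)
```

Note that the split values are still strings here; pd.to_numeric can convert them if numbers are needed.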
I have two dataframes
df
x
0 1
1 1
2 1
3 1
4 1
df1
y
1 1
3 1
And I want to merge them on the index, but still keep the indexes that aren't present in df1. This is my desired output
x y
0 1 0
1 1 1
2 1 0
3 1 1
4 1 0
I have tried merging on index, like this
pd.merge(df, df1, left_index=True, right_index=True)
But that gets rid of the index values not in df1. For example:
x y
1 1 1
3 1 1
This is not what I want. I have tried both outer and inner join, to no avail. I have also tried reading through other pandas merge questions, but can't seem to figure out my specific case here. Apologies if the merge questions are redundant, but again, I cannot figure out how to merge the way I would like in this certain scenario. Thanks!
Try concatenating along the columns (axis=1) and filling the NaNs with 0:
pd.concat([df,df1], axis=1).fillna(0)
x y
0 1 0.0
1 1 1.0
2 1 0.0
3 1 1.0
4 1 0.0
No need for any complicated merging, you can just copy the column over directly, fill the NaNs, and set the dtype. You can either do this directly, or with pd.concat():
pd.concat([df, df1], axis=1).fillna(0).astype(int)
x y
0 1 0
1 1 1
2 1 0
3 1 1
4 1 0
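For comparison, a left join on the index gives the same table. This is a small sketch that rebuilds the two frames from the question; joined and result are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 1, 1, 1]})
df1 = pd.DataFrame({"y": [1, 1]}, index=[1, 3])

# A left join keeps every index value of df; rows absent from df1 get NaN.
joined = df.join(df1, how="left")

# Fill the gaps with 0 and restore the integer dtype.
result = joined.fillna(0).astype(int)
print(result)
```

The same effect should be available from pd.merge by passing how="left" together with left_index=True and right_index=True.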
I have the df below:
Cost,Reve
0,3
4,0
0,0
10,10
4,8
len(df['Cost']) = 300
len(df['Reve']) = 300
I need to divide df['Cost'] / df['Reve']
Below is my code
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
I got the error ValueError: Columns must be same length as key
df['C/R'] = df[['Cost']].div(df['Reve'].values, axis=0)
I got the error ValueError: Wrong number of items passed 2, placement implies 1
The problem is duplicated column names. To verify:
#generate duplicates
df = pd.concat([df, df], axis=1)
print (df)
Cost Reve Cost Reve
0 0 3 0 3
1 4 0 4 0
2 0 0 0 0
3 10 10 10 10
4 4 8 4 8
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
print (df)
# ValueError: Columns must be same length as key
You can find these column names with:
print (df.columns[df.columns.duplicated(keep=False)])
Index(['Cost', 'Reve', 'Cost', 'Reve'], dtype='object')
If the duplicated columns hold the same values, you can remove the duplicates with:
df = df.loc[:, ~df.columns.duplicated()]
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
#simplify division
df['C/R'] = df['Cost'].div(df['Reve'])
print (df)
Cost Reve C/R
0 0 3 0.0
1 4 0 inf
2 0 0 NaN
3 10 10 1.0
4 4 8 0.5
The issue lies in the size of data that you are trying to assign to the columns. I had an issue with this:
df[['X1','X2', 'X3']] = pd.DataFrame(df.X.tolist(), index= df.index)
I was trying to assign values of X to 3 columns X1,X2,X3, assuming that X has 3 values, but, X had 4 values.
So the revised code in my case was
df[['X1','X2', 'X3','X4']] = pd.DataFrame(df.X.tolist(), index= df.index)
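One way to avoid counting the target columns by hand is to derive their names from the expanded data. A minimal sketch, assuming a column X that holds equal-length lists (the sample values are made up):

```python
import pandas as pd

# Hypothetical frame: each cell of X is a list of four values.
df = pd.DataFrame({"X": [[1, 2, 3, 4], [5, 6, 7, 8]]})

# Expand the lists into one column per element, keeping df's index.
expanded = pd.DataFrame(df["X"].tolist(), index=df.index)

# Name the new columns X1..Xn from the actual width of the data.
expanded.columns = [f"X{i + 1}" for i in range(expanded.shape[1])]

df = df.join(expanded)
print(df)
```

Because the number of names always matches the number of expanded columns, the length mismatch behind the ValueError cannot occur in this form.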
I had the same error, but it did not come from the above two issues. In my case the columns had the same length. What helped me was transforming my Series to a DataFrame with pd.DataFrame() and then assigning its values to a new column of my existing df.
df:
0 1 2
0 0.0481948 0.1054251 0.1153076
1 0.0407258 0.0890868 0.0974378
2 0.0172071 0.0376403 0.0411687
etc.
I would like to remove all values in which the x and y titles/values of the dataframe are equal, therefore, my expected output would be something like:
0 1 2
0 NaN 0.1054251 0.1153076
1 0.0407258 NaN 0.0974378
2 0.0172071 0.0376403 NaN
etc.
As shown, the values of (0,0), (1,1), (2,2) and so on, have been removed/replaced.
I thought of looping through the index as follows:
for (idx, row) in df.iterrows():
if (row.index) == ???
But don't know where to carry on or whether it's even the right approach
You can set the diagonal:
In [11]: df.iloc[[np.arange(len(df))] * 2] = np.nan
In [12]: df
Out[12]:
0 1 2
0 NaN 0.105425 0.115308
1 0.040726 NaN 0.097438
2 0.017207 0.037640 NaN
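Passing a list of index arrays to iloc may not behave the same way in every pandas version, so here is a hedged alternative that builds a boolean mask for the diagonal instead. It assumes a square frame and uses the three rows shown in the question; diagonal and result are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [0.0481948, 0.1054251, 0.1153076],
        [0.0407258, 0.0890868, 0.0974378],
        [0.0172071, 0.0376403, 0.0411687],
    ]
)

# Boolean identity matrix: True exactly on the diagonal.
diagonal = np.eye(len(df), dtype=bool)

# mask() replaces the True positions with NaN and returns a new frame,
# leaving df itself unchanged.
result = df.mask(diagonal)
print(result)
```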
@AndyHayden's answer is really cool and taught me something. However, it depends on iloc, on the array being square, and on everything being in the same order.
I generalized the concept here
Consider the data frame df
df = pd.DataFrame(1, list('abcd'), list('xcya'))
df
x c y a
a 1 1 1 1
b 1 1 1 1
c 1 1 1 1
d 1 1 1 1
Then we use numpy broadcasting and np.where to perform the same fancy index assignment:
ij = np.where(df.index.values[:, None] == df.columns.values)
df.iloc[list(map(list, ij))] = 0
df
x c y a
a 1 1 1 0
b 1 1 1 1
c 1 0 1 1
d 1 1 1 1
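The same label comparison can also be applied as a boolean mask, which avoids positional assignment through iloc altogether. This is a sketch of that variant rather than the code above; same_label and result are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(1, index=list("abcd"), columns=list("xcya"))

# Broadcast comparison: True where the row label equals the column label.
same_label = df.index.to_numpy()[:, None] == df.columns.to_numpy()

# Replace the matching cells with 0; every other cell keeps its value.
result = df.mask(same_label, 0)
print(result)
```

Because the mask is built from labels, it works for non-square frames and for rows and columns in any order.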
n is number of rows/columns
df.values[(np.arange(n),) * 2] = np.nan  # a tuple of index arrays selects the (i, i) positions
or
np.fill_diagonal(df.values, np.nan)
see https://stackoverflow.com/a/24475214/
I got lost in Pandas doc and features trying to figure out a way to groupby a DataFrame by the values of the sum of the columns.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped since they all have their sum equal to 1. The resulting DataFrame would have column labels equal to the sum of the columns it combined. Like this:
1 9
0 2 2
1 1 3
2 0 4
Any idea to put me in the good direction ? Thanks in advance !
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1), and take the sum of each group.
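Recent pandas releases deprecate the axis=1 argument of groupby, so the same grouper can be applied to the transposed frame instead. A minimal sketch using the data from the question (column_sums and grouped are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0], "c": [1, 0, 0], "d": [2, 3, 4]})

# Column sums, indexed by column label: a=1, b=1, c=1, d=9.
column_sums = df.sum()

# Transpose so the columns become rows, group those rows by their sum,
# add up each group, and transpose back.
grouped = df.T.groupby(column_sums).sum().T
print(grouped)
```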
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
#totals 1 9
#0 2 2
#1 1 3
#2 0 4