How to merge two pandas dataframes on index but fill missing values - python

I have two dataframes
df
x
0 1
1 1
2 1
3 1
4 1
df1
y
1 1
3 1
And I want to merge them on the index, but still keep the indexes that aren't present in df1. This is my desired output
x y
0 1 0
1 1 1
2 1 0
3 1 1
4 1 0
I have tried merging on index, like this
pd.merge(df, df1, left_index=True, right_index=True)
But that gets rid of the index values not in df1. For example:
x y
1 1 1
3 1 1
This is not what I want. I have tried both outer and inner join, to no avail. I have also tried reading through other pandas merge questions, but can't seem to figure out my specific case here. Apologies if the merge questions are redundant, but again, I cannot figure out how to merge the way I would like in this certain scenario. Thanks!

Try concatenating along columns (axis=1) and filling the NaNs with 0:
pd.concat([df,df1], axis=1).fillna(0)
x y
0 1 0.0
1 1 1.0
2 1 0.0
3 1 1.0
4 1 0.0

No need for any complicated merging here: you can just copy the column over directly (df['y'] = df1['y']), fill the NaNs, and set the dtype. Or do the same in one chain with pd.concat():
pd.concat([df, df1], axis=1).fillna(0).astype(int)
x y
0 1 0
1 1 1
2 1 0
3 1 1
4 1 0
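If you do want to stick with merge, a left join on the index keeps every row of df; the rows with no match in df1 get NaN in y, which you can then fill and cast back to int. A quick sketch reproducing the toy frames above:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 1, 1, 1]})
df1 = pd.DataFrame({'y': [1, 1]}, index=[1, 3])

# how='left' keeps all of df's index labels; unmatched rows get NaN
out = df.merge(df1, left_index=True, right_index=True, how='left')
out['y'] = out['y'].fillna(0).astype(int)
print(out)
```

The default how='inner' is what dropped the unmatched index values in the attempt above; how='outer' would also work here since df's index is a superset of df1's.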

Related

Mesh / divide / explode values in a column of a DataFrame according to a number of meshes for each value

Given a DataFrame
df1 :
value mesh
0 10 2
1 12 3
2 5 2
obtain a new DataFrame df2 in which for each value of df1 there are mesh values, each one obtained by dividing the corresponding value of df1 by its mesh:
df2 :
value/mesh
0 5
1 5
2 4
3 4
4 4
5 2.5
6 2.5
More general:
df1 :
value mesh_value other_value
0 10 2 0
1 12 3 1
2 5 2 2
obtain:
df2 :
value/mesh_value other_value
0 5 0
1 5 0
2 4 1
3 4 1
4 4 1
5 2.5 2
6 2.5 2
You can use map:
df2['new'] = df2['value/mesh'].map(dict(zip(df1.eval('value/mesh'),df1.index)))
Out[243]:
0 0
1 0
2 1
3 1
4 1
5 2
6 2
Name: value/mesh, dtype: int64
Try as follows:
Use Series.div for value / mesh_value, and apply Series.reindex using np.repeat with df.mesh_value as the input array for the repeats parameter.
Next, use pd.concat to combine the result with df.other_value along axis=1.
Finally, rename the column with result of value / mesh_value (its default name will be 0) using df.rename, and chain df.reset_index to reset to a standard index.
df2 = (pd.concat([df.value.div(df.mesh_value)
                    .reindex(np.repeat(df.index, df.mesh_value)),
                  df.other_value], axis=1)
         .rename(columns={0: 'value_mesh_value'})
         .reset_index(drop=True))
print(df2)
value_mesh_value other_value
0 5.0 0
1 5.0 0
2 4.0 1
3 4.0 1
4 4.0 1
5 2.5 2
6 2.5 2
Or slightly different:
Use df.assign to add a column with the result of df.value.div(df.mesh_value), and reindex / rename in same way as above.
Use df.drop to get rid of columns that you don't want (value, mesh_value) and use df.iloc to change the column order (e.g. we want ['value_mesh_value','other_value'] instead of other way around (hence: [1,0]). And again, reset index.
We put all of this between brackets and assign it to df2.
df2 = (df.assign(tmp=df.value.div(df.mesh_value))
         .reindex(np.repeat(df.index, df.mesh_value))
         .rename(columns={'tmp': 'value_mesh_value'})
         .drop(columns=['value', 'mesh_value'])
         .iloc[:, [1, 0]]
         .reset_index(drop=True))
# same result
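For what it's worth, Index.repeat gives a slightly more compact equivalent of the repeat-and-select steps above (a sketch on the same toy data, with the same assumed column names):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 5],
                   'mesh_value': [2, 3, 2],
                   'other_value': [0, 1, 2]})

# Compute the ratio, repeat each row mesh_value times via .loc on the
# repeated index labels, then keep only the two wanted columns
df2 = (df.assign(value_mesh_value=df['value'] / df['mesh_value'])
         .loc[df.index.repeat(df['mesh_value']),
              ['value_mesh_value', 'other_value']]
         .reset_index(drop=True))
print(df2)
```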

Updated all values in a pandas dataframe based on all instances of a specific value in another dataframe

My apologies beforehand! I have done this a few times before, but I am having some brain fog. I have two dataframes, df1 and df2. I would like to update all values in df2 that match a specific value in df1, while leaving the other values in df2 unchanged. I can do this pretty easily with np.where on the columns of a single dataframe, but I am blanking on how I did this with two dataframes!
Goal: Set values in Df2 to 0 if they are 0 in DF1 - otherwise keep the DF2 value
Example
df1
   A  B  C
0  4  0  1
1  0  2  0
2  1  4  0
df2
   A  B  C
0  1  8  1
1  9  2  7
2  1  4  6
Expected df2 after our element swap
   A  B  C
0  1  0  1
1  0  2  0
2  1  4  0
brain fog is bad! thank you for the assistance!
Using fillna
>>> df2[df1 != 0].fillna(0)
You can try
df2[df1.eq(0)] = 0
print(df2)
A B C
0 1 0 1
1 0 2 0
2 1 4 0
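Since you mentioned np.where: it works across two dataframes the same way it does on columns, and DataFrame.mask is the labeled equivalent. A sketch on the example data above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [4, 0, 1], 'B': [0, 2, 4], 'C': [1, 0, 0]})
df2 = pd.DataFrame({'A': [1, 9, 1], 'B': [8, 2, 4], 'C': [1, 7, 6]})

# mask replaces values wherever the condition is True
result = df2.mask(df1.eq(0), 0)

# equivalent with np.where, rebuilding the frame around the raw array
result_np = pd.DataFrame(np.where(df1.eq(0), 0, df2),
                         index=df2.index, columns=df2.columns)
print(result)
```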

Pandas rank values in rows of DataFrame

Learning Python. I have a dataframe like this
cand1 cand2 cand3
0 40.0900 39.6700 36.3700
1 44.2800 44.2800 35.4200
2 43.0900 51.2200 46.3500
3 35.7200 55.2700 36.4700
and I want to rank each row according to the value of the columns, so that I get
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 3 1 2
3 3 1 2
I have now
for index, row in df.iterrows():
    df.loc['Rank'] = df.loc[index].rank(ascending=False).astype(int)
    print(df)
However, this keeps on repeating the whole dataframe. Note also the special case in row 2, where two values are the same.
Suggestion appreciated
Use df.rank with axis=1 instead of ranking each row's Series:
df_rank = df.rank(axis=1, ascending=False, method='min').astype(int)
Out[165]:
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 3 1 2
3 3 1 2
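Note that the method parameter controls how ties are handled. For the tied row above (44.28, 44.28, 35.42), the common choices give (a quick sketch):

```python
import pandas as pd

row = pd.Series({'cand1': 44.28, 'cand2': 44.28, 'cand3': 35.42})

ranks_min = row.rank(ascending=False, method='min')      # ties share lowest rank: 1, 1, 3
ranks_dense = row.rank(ascending=False, method='dense')  # no gap after ties:      1, 1, 2
ranks_avg = row.rank(ascending=False, method='average')  # default, ties averaged: 1.5, 1.5, 3
```

method='min' is what reproduces the "1 1 3" row requested in the question.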

pandas groupby apply returning a dataframe

Consider the following code:
>>> df = pd.DataFrame(np.random.randint(0, 4, 16).reshape(4, 4), columns=list('ABCD'))
... df
...
A B C D
0 2 1 0 2
1 3 0 2 2
2 0 2 0 2
3 2 1 2 0
>>> def grouper(frame):
... return frame
...
... df.groupby('A').apply(grouper)
...
A B C D
0 2 1 0 2
1 3 0 2 2
2 0 2 0 2
3 2 1 2 0
As you can see, the results are identical.
Here is the documentation of apply:
The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series. apply is therefore a highly flexible grouping method.
groupby will divide the dataframe into small dataframes like this:
A B C D
2 0 2 0 2
A B C D
0 2 1 0 2
3 2 1 2 0
A B C D
1 3 0 2 2
apply documentation says that it combines the dataframes back into a single dataframe. I am curious how it combined them in a way that the final result is the same as the original dataframe. If it had used concat, the final dataframe would have been equal to:
A B C D
2 0 2 0 2
0 2 1 0 2
3 2 1 2 0
1 3 0 2 2
I am curious how this concatenation has been done.
If you look at the source code you will see that there is a parameter not_indexed_same that checks whether the index changed during groupby. If the index is unchanged, groupby reindexes the combined result back to the original row order before returning it. I do not know why this was implemented.
The change was made on Aug 21, 2011 and Wes made no comments on the change: https://github.com/pandas-dev/pandas/commit/00c8da0208553c37ca6df0197da431515df813b7#diff-720d374f1a709d0075a1f0a02445cd65
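You can see the two code paths directly: when the applied function keeps each group's index intact, the result comes back in the original row order; when the function changes the index, apply falls back to plain concatenation in group order. A small sketch (grouping by an external Series to sidestep newer pandas deprecation warnings about grouping columns inside apply):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 0, 2], 'B': [1, 0, 2, 1]})

# Identity function: indexes are unchanged, so apply reindexes
# the combined result back to the original row order
same = df.groupby(df['A'], group_keys=False).apply(lambda g: g)
print(same.index.tolist())

# Resetting each group's index disables that reindexing, so the
# rows come back concatenated in group order (A=0, A=2, A=3)
changed = df.groupby(df['A'], group_keys=False).apply(lambda g: g.reset_index(drop=True))
print(changed['B'].tolist())
```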

Pandas dataframe issue: `reset_index` does not remove hierarchical index

I am trying to flatten a Pandas Dataframe MultiIndex so that there is only a single level index. The usual solution based on any number of SE posts is to use the df.reset_index command, but that is just not fixing the problem.
I started out with an Xarray DataArray and converted it to a dataframe. The original dataframe looked like this.
         results
simdata  a_ss_yr  attr  attr1  attr2  attr3
run year
0    0         0     0      0      0      0
     1         1     6      2      0      4
     2         2     4      2      2      0
     3         3     1      0      0      1
     4         4     2      0      2      0
To flatten the index I used
df.reset_index(drop=True)
This only accomplished this:
        run  year  results
simdata            a_ss_yr  attr  attr1  attr2
0         0     0        0     0      0      0
1         0     1        1     6      2      0
2         0     2        2     4      2      2
3         0     3        3     1      0      0
4         0     4        4     2      0      2
I tried doing the df.reset_index() option more than once, but this is still not flattening the index, and I want to get this to only a single level index.
More specifically I need the "run" and "year" variables to go to the level 0 set of column names, and I need to remove the "result" heading entirely.
I have been reading the Pandas documentation, but it seems like doing this kind of surgery on the index is not really described. Does anyone have a sense of how to do this?
First use droplevel to remove the top level of the column MultiIndex, then reset_index:
df.columns = df.columns.droplevel(0)
df = df.reset_index()
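A runnable sketch of those two steps on a toy frame rebuilt with the question's structure (the values here are invented; only the index/column shape matters):

```python
import numpy as np
import pandas as pd

# Rows: MultiIndex (run, year); columns: 'results' level over the attr names
idx = pd.MultiIndex.from_product([[0], range(5)], names=['run', 'year'])
cols = pd.MultiIndex.from_product([['results'],
                                   ['a_ss_yr', 'attr', 'attr1', 'attr2', 'attr3']])
df = pd.DataFrame(np.zeros((5, 5), dtype=int), index=idx, columns=cols)

df.columns = df.columns.droplevel(0)  # remove the 'results' level from the columns
df = df.reset_index()                 # move run/year out of the row index
print(df.columns.tolist())
```

reset_index alone never touches the columns, which is why repeating it could not flatten the 'results' heading.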
