I'm having a hard time merging and updating Pandas dataframes right now.
I have a bunch of CSV files that I'm parsing with pandas (which is not a
problem). In a very few cases I have multiple files that contain some of the
same columns.
So, for example, let's say I have:
import pandas as pd
a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0,1,2,3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41,51,61,71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0,1,3])
What I want is this dataframe:
A B
0 110 4
1 111 5
2 2 6
3 113 7
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
Pandas has this nice guide: Merge, join, concatenate and compare. But I
fail to find a solution to what I want to achieve.
For example a.join(b, how="outer") raises ValueError: columns overlap but no suffix specified: Index(['A'], dtype='object'). Passing rsuffix="R"
is not an option, because the end result is:
A B AR
0 0.0 4.0 NaN
1 1.0 5.0 NaN
2 2.0 6.0 NaN
3 3.0 7.0 NaN
41 NaN NaN 11.0
51 NaN NaN 12.0
61 NaN NaN 13.0
71 NaN NaN 14.0
Not quite what I want.
pd.merge(a, b, how="outer") looks promising, but it is not quite right either,
because the indices are ignored:
A B
0 0 4.0
1 1 5.0
2 2 6.0
3 3 7.0
4 11 NaN
5 12 NaN
6 13 NaN
7 14 NaN
Passing left_index=True and right_index=True yields a dataframe similar to
.join(..., lsuffix="_x", rsuffix="_y"), so not what I want.
Using update is almost what I want; a.update(c) would modify a in place to
A B
0 110.0 4
1 111.0 5
2 2.0 6
3 113.0 7
but a.update(b) does nothing (I assume because the indices of a and b are
disjoint).
So, is what I want even possible with a single line of code?
EDIT
I came up with this one:
>>> pd.concat([a, b, c]).sort_index().drop_duplicates().groupby(a.index).last()
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
This is what I want; the question is: is this correct, or is it just a coincidence that
this yields the same result I wanted?
How are you determining which 'A' column has priority?
In the order I'm reading the files. The files are generated by a device (which
is kind of a "black box" to me) that puts a date in the file names. So I
do:
tasks = [parse_csv_file(fn) for fn in sorted(glob.glob("*.csv"))]
results = await asyncio.gather(*tasks)
And I would like to do (no error checking, as this is an example):
results = iter(results)
merged_df = next(results)
for df in results:
    merged_df = the_magic_function_Im_looking_for(merged_df, df)
Reducing with combine_first:
from functools import reduce

to_merge = [c, b, a]
result = reduce(pd.DataFrame.combine_first, to_merge)
which successively applies combine_first to the entries of the list, ending up with the all-combined, i.e., reduced, dataframe at the end
(we can pass reversed(to_merge) to reduce if to_merge comes in the reverse order),
to get
>>> result
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
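Applied back to the original workflow, where results holds the dataframes in file-read order and later files should take priority, a minimal sketch (assuming the results list from the question) could be:
from functools import reduce
import pandas as pd

# reversed() makes the most recently read frame the start of the reduction;
# combine_first keeps the caller's values, so later files win over earlier ones
merged_df = reduce(pd.DataFrame.combine_first, reversed(results))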
Try concat + groupby last on axis=1 to merge the dataframes, then get the "last" valid value per column group:
df = pd.concat([a, b, c], axis=1).groupby(level=0, axis=1).last()
df:
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
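Note that groupby with axis=1 is deprecated in recent pandas releases; assuming such a version, a transpose-based equivalent of the same idea would be:
# group the duplicated column labels on the transposed frame,
# keep the last valid value per group, then transpose back
df = pd.concat([a, b, c], axis=1).T.groupby(level=0).last().T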
Or, concatenating long and getting the last valid row per index value, thanks to @anky:
df = pd.concat([a, b, c]).groupby(level=0).last()
df:
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
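This long-concat version also collapses the original loop into the single line the question asked for (a sketch, assuming results is the list of dataframes in file-read order):
# groupby(...).last() keeps the last valid value per index label,
# so dataframes later in the list take priority
merged_df = pd.concat(results).groupby(level=0).last()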
Related
I have the following dataframe:
id a_1_1 a_1_2 a_1_3 a_1_4 b_1_1 b_1_2 b_1_3 c_1_1 c_1_2 c_1_3
1  10    20    30    40    90    80    70    NaN   NaN   NaN
2  33    34    35    36    NaN   NaN   NaN   11    12    13
and I want my result to be as follows:
id col_name 1  2  3
1  a        10 20 30
1  b        90 80 70
2  a        33 34 35
2  c        11 12 13
I am trying to use the pd.melt function, but it is not yielding the correct result.
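For reference, a minimal reconstruction of the input (values transcribed from the table above) that the answers below can be run against:
import pandas as pd
import numpy as np

# reconstruction of the question's dataframe
df = pd.DataFrame({
    'id':    [1, 2],
    'a_1_1': [10, 33], 'a_1_2': [20, 34], 'a_1_3': [30, 35], 'a_1_4': [40, 36],
    'b_1_1': [90, np.nan], 'b_1_2': [80, np.nan], 'b_1_3': [70, np.nan],
    'c_1_1': [np.nan, 11], 'c_1_2': [np.nan, 12], 'c_1_3': [np.nan, 13],
})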
IIUC, you can reshape using an intermediate MultiIndex after extracting the letter and last digit from the original column names:
(df.set_index('id')
   .pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
       d.columns.str.extract(r'^([^_]+).*(\d+)'),
       names=['col_name', None]
   ), axis=1))
   .stack('col_name')
   .dropna(axis=1)  # assuming you don't want columns with NaNs
   .reset_index()
)
Variant using janitor's pivot_longer:
# pip install janitor
import janitor

(df
 .pivot_longer(index='id', names_to=('col_name', '.value'),
               names_pattern=r'([^_]+).*(\d+)')
 .pipe(lambda d: d.dropna(thresh=d.shape[1]-2))
 .dropna(axis=1)
)
output:
id col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
Code:
# df1 is the question's input dataframe
df = df1.melt(id_vars=["id"],
              var_name="Col_name",
              value_name="Value").dropna()
df['Num'] = df['Col_name'].apply(lambda x: x[-1])
df['Col_name'] = df['Col_name'].apply(lambda x: x[0])
df = df.pivot(index=['id', 'Col_name'], columns='Num', values='Value').reset_index().dropna(axis=1)
df
Output:
Num id Col_name 1 2 3
0 1 a 10.0 20.0 30.0
1 1 b 90.0 80.0 70.0
2 2 a 33.0 34.0 35.0
3 2 c 11.0 12.0 13.0
I have these dataframes:
rec = pd.DataFrame({'batch': ["001", "002", "003"],
                    'A': [1, 2, 3],
                    'B': [4, 5, 6]})

ing1 = pd.DataFrame({'batch': ["002", "003", "004"],
                     'C': [12, 13, 14],
                     'D': [15, 16, 17],
                     'E': [18, 19, 10]})

ing2 = pd.DataFrame({'batch': ["001", "011", "012"],
                     'C': [20, 21, 22],
                     'D': [23, 24, 25],
                     'F': [26, 27, 28]})
What I want is the following merged dataset, where columns with the same label are overwritten by the later-merged dataset, and new columns are created for non-existing labels.
batch A B C D E F
0 001 1 4 20 23 NaN 26.0
1 002 2 5 12 15 18.0 NaN
2 003 3 6 13 16 19.0 NaN
I have tried to merge rec with ing1 first:
final = pd.merge(rec, ing1, how ='left', on='batch', sort=False)
Intermediate result:
batch A B C D E
0 001 1 4 NaN NaN NaN
1 002 2 5 12.0 15.0 18.0
2 003 3 6 13.0 16.0 19.0
Then I merge a second time with ing2, to obtain the missing information in columns C, D and E.
final = pd.merge(final, ing2, how ='left', on='batch', sort=False)
Result (not as expected):
batch A B C_x D_x E C_y D_y F
0 001 1 4 NaN NaN NaN 20.0 23.0 26.0
1 002 2 5 12.0 15.0 18.0 NaN NaN NaN
2 003 3 6 13.0 16.0 19.0 NaN NaN NaN
I have also tried merge, concat, and combine_first; however, these all seem to append the data from the second table onto the primary table. The only approach I can think of is to split the dataframe into rows that need to pull data from ing1 and rows that need ing2, then append them to each other for the final dataset.
How about just applying np.where() after merging? If the right column (with suffix "_y") is not NA, take the right; otherwise take the left.
import numpy as np

final = rec.merge(ing1, how='left', on='batch')\
           .merge(ing2, how='left', on='batch')
final[["C", "D"]] = np.where(~final[["C_y", "D_y"]].isna(),
                             final[["C_y", "D_y"]],
                             final[["C_x", "D_x"]])
Output
print(final[["A","B","C","D","E","F"]])
A B C D E F
0 1 4 20.0 23.0 NaN 26.0
1 2 5 12.0 15.0 18.0 NaN
2 3 6 13.0 16.0 19.0 NaN
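The same idea generalizes to all overlapping columns without listing them by hand; a sketch, assuming overlaps always appear with pandas' default _x/_y suffixes:
# coalesce each suffixed pair (right wins), then drop the intermediates
overlap = [c[:-2] for c in final.columns if c.endswith('_y')]
for col in overlap:
    final[col] = final[f"{col}_y"].fillna(final[f"{col}_x"])
final = final.drop(columns=[f"{c}_{s}" for c in overlap for s in ('x', 'y')])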
Actually, df.update() may be the function conceptually closest to what you're asking for. However, you have to set the index and preallocate the output dataframe in advance. This may or may not cause more trouble than .merge().
Code:
# set index
rec.set_index("batch", inplace=True)
ing1.set_index("batch", inplace=True)
ing2.set_index("batch", inplace=True)
# preallocate
final = pd.DataFrame(columns=["A","B","C","D","E","F"], index=rec.index)
# update in order
final.update(rec)
final.update(ing1)
final.update(ing2)
Result:
print(final)
A B C D E F
batch
001 1 4 20 23 NaN 26
002 2 5 12 15 18 NaN
003 3 6 13 16 19 NaN
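One caveat (my addition, not part of the original answer): the preallocated frame starts out with object dtype, so a final infer_objects() call can restore proper numeric dtypes where possible:
# soft-convert object columns back to numeric dtypes
final = final.infer_objects()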
I am looking to perform a forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the previous filled value.
In my case, I would like to perform a forward fill with the difference that I don't want to fill NaN but rather a specific value (say "*").
Here's an example
import pandas as pd
import numpy as np
d = [{"a": 1, "b": 10},
     {"a": 2, "b": "*"},
     {"a": 3, "b": "*"},
     {"a": 4, "b": "*"},
     {"a": np.nan, "b": 50},
     {"a": 6, "b": 60},
     {"a": 7, "b": 70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
If I replace "*" with np.nan and then ffill, that would also apply ffill to column a, filling the NaN there, which I don't want.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each contains "*", then replacing and forward filling.
You can use df.mask with df.isin, together with df.replace; only the positions that held "*" are overwritten, so the genuine NaN in column a is left untouched:
df.mask(df.isin(['*']), df.replace('*', np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, applying ffill, and then putting the original NaN values back.
df = df.replace(np.nan, "<special>").replace("*", np.nan).ffill().replace("<special>", np.nan)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.nan).ffill()
df[original_nan] = np.nan
I have two dataframes:
Df_1:
A B C D
1 10 nan 20 30
2 20 30 20 10
Df_2:
A B
1 10 40
2 30 70
I want to merge them and have this final dataframe.
A B C D
1 10 40 20 30
2 20 30 20 10
3 30 70 nan nan
How do I do that?
Looking at the expected result, I think the index in the second row of Df_2
should be 3 (instead of 2).
Run Df_1.combine_first(Df_2).
The result is:
A B C D
1 10.0 40.0 20.0 30.0
2 20.0 30.0 20.0 10.0
3 30.0 70.0 NaN NaN
i.e. due to possible NaN values, the type of columns is coerced to float.
But if you want, you can revert this where possible, by applying to_numeric:
Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')
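Alternatively, on pandas versions that support nullable dtypes (an assumption here), convert_dtypes() restores integer columns while keeping the missing entries as <NA>:
# nullable Int64 columns hold integers and missing values side by side
Df_1.combine_first(Df_2).convert_dtypes()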
I am working with a fairly messy data set that arrives as individual csv files with slightly different column names. It would be too onerous to rename the columns in the csv files, partly because I am still discovering all the variations. So, for a set of columns in a given row, I am looking to determine which field is not NaN and carry that value forward to a new column. Is there a way to do that?
Case in point. Let's say I have a data frame that looks like this:
Index A B
1 15 NaN
2 NaN 11
3 NaN 99
4 NaN NaN
5 12 14
Let's say my desired output from this is to create a new column C such that my data frame will look like the following:
Index A B C
1 15 NaN 15
2 NaN 11 11
3 NaN 99 99
4 NaN NaN NaN
5 12 14 12 (so giving priority to A over B)
How can I accomplish this?
For a dataframe with an arbitrary number of columns, you can back fill the rows (.bfill(axis=1)) and take the first column (.iloc[:, 0]):
df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})
df['D'] = df.bfill(axis=1).iloc[:, 0]
>>> df
A B C D
0 15 NaN 10 15
1 NaN 11 NaN 11
2 NaN 99 10 99
3 NaN NaN 10 10
4 12 14 10 12
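If a different precedence is wanted, the same trick works after reordering the columns (a small variation, not from the original answer):
# give B priority over A by selecting the columns in that order
df['D'] = df[['B', 'A', 'C']].bfill(axis=1).iloc[:, 0]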
If you just have 2 columns, the cleanest way would be to use where; the syntax is a.where(condition, other), which keeps the values of a where the condition is true and takes them from other where it is false (for some reason it took me a while to wrap my head around this).
In [2]: df.A.where(df.A.notnull(),df.B)
Out[2]:
0 15.0
1 11.0
2 99.0
3 NaN
4 12.0
Name: A, dtype: float64
If you have more than two columns, it might be simpler to use max or min; these ignore the null values, however you'll lose the column precedence you want:
In [3]: df.max(axis=1)
Out[3]:
0 15.0
1 11.0
2 99.0
3 NaN
4 14.0
dtype: float64
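To keep the column precedence across many columns, one option (a sketch, not from the original answers) is to chain fillna over the columns in priority order:
from functools import reduce

# coalesce left to right: the first non-null value in priority order wins
cols = ['A', 'B']  # priority order; extend with more columns as needed
df['C'] = reduce(lambda acc, c: acc.fillna(df[c]), cols[1:], df[cols[0]])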
pandas.DataFrame.update:
df['updated'] = np.nan
for col in df.columns:
    # non-null values of each column overwrite in turn, so the
    # right-most non-null value per row ends up in 'updated'
    df['updated'].update(df[col])
Try this (this method allows the flexibility of giving preference to columns without relying on the order of the columns), using @Alexander's setup:
df["D"] = df["B"]
df["D"] = df['D'].fillna(df['A'].fillna(df['B'].fillna(df['C'])))
A B C D
0 15.0 NaN 10.0 15.0
1 NaN 11.0 NaN 11.0
2 NaN 99.0 10.0 99.0
3 NaN NaN 10.0 10.0
4 12.0 14.0 10.0 14.0
Or you could use df.apply to give priority to column A:
import pandas as pd

def func1(row):
    # note: comparing with float('nan') never matches, because NaN != NaN;
    # use pd.isna() to test for missing values instead
    A = row['A']
    B = row['B']
    if pd.isna(A):
        # fall back to B; stays NaN if B is missing too
        return B
    return A

df['C'] = df.apply(func1, axis=1)