I am working with a fairly messy data set that has been split across individual csv files with slightly different column names. It would be too onerous to rename the columns in the csv files, partly because I am still discovering all the variations, so for a given row I am looking to determine, within a set of columns, which field is not NaN and carry that value forward to a new column. Is there a way to do that?
Case in point. Let's say I have a data frame that looks like this:
Index A B
1 15 NaN
2 NaN 11
3 NaN 99
4 NaN NaN
5 12 14
Let's say my desired output from this is to create a new column C such that my data frame will look like the following:
Index A B C
1 15 NaN 15
2 NaN 11 11
3 NaN 99 99
4 NaN NaN NaN
5 12 14 12 (so giving priority to A over B)
How can I accomplish this?
For a dataframe with an arbitrary number of columns, you can backfill along the columns within each row (.bfill(axis=1)) and take the first column (.iloc[:, 0]):
import pandas as pd

df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})
df['D'] = df.bfill(axis=1).iloc[:, 0]
>>> df
A B C D
0 15 NaN 10 15
1 NaN 11 NaN 11
2 NaN 99 10 99
3 NaN NaN 10 10
4 12 14 10 12
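Note that bfill(axis=1) scans left to right, so column precedence simply follows the column order. If you want a different precedence, a minimal sketch (using the same df as above) is to reorder the columns before backfilling:
# give B priority over A, then fall back to C
df['E'] = df[['B', 'A', 'C']].bfill(axis=1).iloc[:, 0]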
If you just have 2 columns, the cleanest way would be to use where. For a Series, the syntax is s.where(condition, other): it keeps the values of s where the condition is True and takes other where it is False (for some reason it took me a while to wrap my head around this).
In [2]: df.A.where(df.A.notnull(),df.B)
Out[2]:
0 15.0
1 11.0
2 99.0
3 NaN
4 12.0
Name: A, dtype: float64
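For the two-column case, Series.combine_first is arguably even more direct: it keeps the calling Series' values and fills its NaNs from the argument. A minimal sketch on the question's A/B frame:
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14]})

# A wins wherever it is not NaN; B fills the gaps; rows where both are NaN stay NaN
df['C'] = df['A'].combine_first(df['B'])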
If you have more than two columns, it might be simpler to use max or min; these ignore the null values, but you'll lose the column precedence you want:
In [3]: df.max(axis=1)
Out[3]:
0 15.0
1 11.0
2 99.0
3 NaN
4 14.0
dtype: float64
pandas.DataFrame.update:
import numpy as np

# update() overwrites with non-NaN values, so later columns take precedence
df['updated'] = np.nan
for col in df.columns:
    df['updated'].update(df[col])
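Because later columns win, the loop above gives the rightmost column precedence. If you want the leftmost column (A) to win instead, one way (a small sketch rather than the answer's exact code) is to build the series separately and apply the columns in reverse priority order:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14]})

updated = pd.Series(np.nan, index=df.index)
# apply the lowest-priority column first so higher-priority values overwrite it
for col in ['B', 'A']:
    updated.update(df[col])
df['updated'] = updated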
Try this. (This method allows the flexibility of giving preference to columns without relying on column order.)
Using @Alexander's setup:
df["D"] = df["B"]
df["D"] = df['D'].fillna(df['A'].fillna(df['B'].fillna(df['C'])))
A B C D
0 15.0 NaN 10.0 15.0
1 NaN 11.0 NaN 11.0
2 NaN 99.0 10.0 99.0
3 NaN NaN 10.0 10.0
4 12.0 14.0 10.0 14.0
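If the precedence you actually want is A, then B, then C (as in the original question), a shorter fillna chain on the same setup does the job without the extra assignment; a minimal sketch:
# left-to-right precedence: take A, fall back to B, then to C
df['D'] = df['A'].fillna(df['B']).fillna(df['C'])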
Or you could use df.apply to give priority to column A:
import pandas as pd

def func1(row):
    A = row['A']
    B = row['B']
    # NaN never compares equal to itself, so test with pd.isna() instead of == float('nan')
    if pd.isna(A):
        if pd.isna(B):
            return float('nan')
        return B
    return A

df['C'] = df.apply(func1, axis=1)
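A vectorized equivalent of the row-wise function (a sketch using np.where, assuming the A/B frame from the question) is typically much faster than apply on large frames:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14]})

# take A where it is present, otherwise fall back to B; rows where both are NaN stay NaN
df['C'] = np.where(df['A'].notna(), df['A'], df['B'])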
Related
I'm having a hard time merging and updating Pandas dataframes right now.
I have a bunch of CSV files that I'm parsing with pandas (which is not a problem). In a few cases I have multiple files that share some columns.
So, for example, let's say I have:
import pandas as pd
a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0,1,2,3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41,51,61,71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0,1,3])
What I want is this dataframe:
A B
0 110 4
1 111 5
2 2 6
3 113 7
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
Pandas has this nice guide: Merge, join, concatenate and compare. But I
fail to find a solution to what I want to achieve.
For example a.join(b, how="outer") raises ValueError: columns overlap but no suffix specified: Index(['A'], dtype='object'). Passing rsuffix="R"
is not an option, because the end result is:
A B AR
0 0.0 4.0 NaN
1 1.0 5.0 NaN
2 2.0 6.0 NaN
3 3.0 7.0 NaN
41 NaN NaN 11.0
51 NaN NaN 12.0
61 NaN NaN 13.0
71 NaN NaN 14.0
Not quite what I want.
pd.merge(a, b, how="outer") looks promising, but it is not quite right either,
because the indices are ignored:
A B
0 0 4.0
1 1 5.0
2 2 6.0
3 3 7.0
4 11 NaN
5 12 NaN
6 13 NaN
7 14 NaN
Passing left_index=True and right_index=True yields a dataframe similar to
.join(..., rsuffix="_x", lsuffix="_y"), so not what I want.
Using update is almost what I want: a.update(c) would modify a to
A B
0 110.0 4
1 111.0 5
2 2.0 6
3 113.0 7
but a.update(b) does nothing (I assume because the indices of a and b are disjoint).
So, is what I want even possible with a single line of code?
EDIT
I came up with this one:
> lll = pd.concat([a, b, c]).sort_index()
> pd.concat([a, b, c]).sort_index().drop_duplicates().groupby(lll.index).last()
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
This is what I want; the question is: is this correct, or is it just a coincidence that it yields the result I wanted?
How are you determining which 'A' column has priority?
In the order I'm reading the files. The files are generated by a device (which is kind of a "black box" to me) and have a date in their names. So I do:
tasks = [parse_csv_file(fn) for fn in sorted(glob.glob("*.csv"))]
results = await asyncio.gather(*tasks)
And I would like to do (no error checking as this is an example):
results = iter(results)
merged_df = next(results)
for df in results:
    merged_df = the_magic_function_Im_looking_for(merged_df, df)
Reducing with combine_first:
from functools import reduce
to_merge = [c, b, a]
result = reduce(pd.DataFrame.combine_first, to_merge)
which successively applies combine_first to the entries of the list, ending up with a single, fully combined (i.e., reduced) dataframe at the end. (We can pass reversed(to_merge) to reduce if to_merge comes in the reverse order.) The result:
>>> result
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
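Plugged into the loop from the question (a sketch reusing the question's results list), the newly read frame goes on the left of combine_first so its non-NaN values take precedence over what has been merged so far:
# results comes from the question's asyncio.gather call, read in sorted filename order
results = iter(results)
merged_df = next(results)
for df in results:
    # the newer frame (df) overrides the accumulated merge wherever it has values
    merged_df = df.combine_first(merged_df)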
Try concat on axis=1 to merge the dataframes, then groupby the columns (level=0, axis=1) and take the last valid value per column group:
df = pd.concat([a, b, c], axis=1).groupby(level=0, axis=1).last()
df:
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
Or, concatenating long and getting the last valid value per row index (thanks to @anky):
df = pd.concat([a, b, c]).groupby(level=0).last()
df:
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
I have these dataframes:
rec = pd.DataFrame({'batch': ["001", "002", "003"],
                    'A': [1, 2, 3],
                    'B': [4, 5, 6]})
ing1 = pd.DataFrame({'batch': ["002", "003", "004"],
                     'C': [12, 13, 14],
                     'D': [15, 16, 17],
                     'E': [18, 19, 10]})
ing2 = pd.DataFrame({'batch': ["001", "011", "012"],
                     'C': [20, 21, 22],
                     'D': [23, 24, 25],
                     'F': [26, 27, 28]})
What I want is the following merged dataset, where columns with the same label are overwritten by the later-merged dataset, and new columns are created for non-existing labels.
batch A B C D E F
0 001 1 4 20 23 NaN 26.0
1 002 2 5 12 15 18.0 NaN
2 003 3 6 13 16 19.0 NaN
I have tried to merge rec with ing1 first:
final = pd.merge(rec, ing1, how ='left', on='batch', sort=False)
Intermediate result:
batch A B C D E
0 001 1 4 NaN NaN NaN
1 002 2 5 12.0 15.0 18.0
2 003 3 6 13.0 16.0 19.0
Then I merge a second time with ing2, to obtain the missing information in columns C, D and E.
final = pd.merge(final, ing2, how ='left', on='batch', sort=False)
Result (not as expected):
batch A B C_x D_x E C_y D_y F
0 001 1 4 NaN NaN NaN 20.0 23.0 26.0
1 002 2 5 12.0 15.0 18.0 NaN NaN NaN
2 003 3 6 13.0 16.0 19.0 NaN NaN NaN
I have also tried merge, concat, and combine_first; however, these seem to just append the data from the second table onto the primary table. The only approach I can think of is to split the dataframe into rows that need to pull data from ing1 and rows that need ing2, then append them to each other for the final dataset.
How about just applying np.where() after merging? If the right column (with suffix "_y") is not NA then take the right, else take the left.
import numpy as np

final = rec.merge(ing1, how='left', on='batch')\
           .merge(ing2, how='left', on='batch')
final[["C", "D"]] = np.where(~final[["C_y", "D_y"]].isna(),
                             final[["C_y", "D_y"]],
                             final[["C_x", "D_x"]])
Output
print(final[["A","B","C","D","E","F"]])
A B C D E F
0 1 4 20.0 23.0 NaN 26.0
1 2 5 12.0 15.0 18.0 NaN
2 3 6 13.0 16.0 19.0 NaN
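After that, the suffixed intermediate columns are no longer needed; a small optional cleanup (assuming you also want batch and the original column order) could be:
# drop the merge suffixes and restore the intended column order
final = final.drop(columns=["C_x", "C_y", "D_x", "D_y"])
final = final[["batch", "A", "B", "C", "D", "E", "F"]]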
Actually, df.update() may be the conceptually closest function to what you're asking for. However, you have to set the index and pre-allocate the output dataframe in advance. This may or may not cause more trouble than .merge().
Code:
# set index
rec.set_index("batch", inplace=True)
ing1.set_index("batch", inplace=True)
ing2.set_index("batch", inplace=True)
# preallocate
final = pd.DataFrame(columns=["A","B","C","D","E","F"], index=rec.index)
# update in order
final.update(rec)
final.update(ing1)
final.update(ing2)
Result:
print(final)
A B C D E F
batch
001 1 4 20 23 NaN 26
002 2 5 12 15 18 NaN
003 3 6 13 16 19 NaN
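One caveat with the preallocate-then-update approach: the empty frame's columns start out as object dtype and update() keeps them that way. A hedged follow-up, if you want numeric dtypes back and batch as a regular column again:
# soft-convert the object columns back to numeric dtypes, then restore batch as a column
final = final.infer_objects().reset_index()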
I have a DataFrame df that looks something like this:
df
a b c
0 0.557894 -0.196294 -0.020490
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
5 -0.337374 NaN -0.771888
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
9 -2.345448 2.443669 -1.409422
I want to select the rows that have a value over some value, which I would normally do using:
new_df = df[df['c'] >= .5]
but that will return:
a b c
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
8 0.737413 NaN 0.679575
I want to get those rows, but also keep the rows that have NaN values in column 'c'. I haven't been able to find a question asking the same thing; they usually ask for one or the other, but not both. I can hard-code the rows that I want to drop since I know the specific values, but I was wondering if there is a better solution. The end result should look something like this:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
Only rows 0, 5 and 9 are dropped, since their values in column 'c' are less than .5.
You should use the | (or) operator.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.557894, 1.138774, np.nan, -0.069319, 1.040089, -0.337374, -1.813278, np.nan, 0.737413, -2.345448],
                   'b': [-0.196294, -0.699224, 2.384483, np.nan, -0.271777, np.nan, -1.564666, np.nan, np.nan, 2.443669],
                   'c': [-0.020490, np.nan, 0.554292, 1.162941, np.nan, -0.771888, np.nan, np.nan, 0.679575, -1.409422]})
df = df[(df['c'] >= .5) | (df['c'].isnull())]
print(df)
Output:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
You should be able to do this with:
new_df = df[(df['c'] >= .5) | (df['c'].isna())]
(Note that or does not work element-wise on a Series, and comparing against the string 'NaN' will never match missing values, which is why the | operator and isna() are needed.)
I have a pandas data frame as shown below. One column has values with intervening NaN cells. The values are to be shifted ahead by one so that they replace the next value that follows, with the last one being lost; the intervening NaN cells have to remain. I tried using .shift(), but since I never know how many intervening NaN rows there are, that would mean a separate calculation for each shift. Is there a better approach?
IIUC, you can just group by the null/non-null mask and shift within each group.
df['y'] = df.y.groupby(pd.isnull(df.y)).shift()
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN
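To see why this works: the grouping key is just the boolean mask pd.isnull(df.y), so the rows split into a "has a value" group and a "NaN" group, and shift() moves each value down to the next value-carrying row while the NaN group shifts NaN onto NaN. A commented sketch of the same idea (rebuilding the example input shown further down):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': list('AAABBBBCCCC'),
                   'y': [5, np.nan, np.nan, 10, np.nan, np.nan, np.nan,
                         20, np.nan, np.nan, np.nan]})

# True for NaN rows, False for rows that carry a value
mask = pd.isnull(df['y'])

# shifting within the False group pushes each value down to the next value-carrying row;
# the True group stays all NaN
df['y'] = df['y'].groupby(mask).shift()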
Another way:
s = df['y'].notnull()
df.loc[s,'y'] = df.loc[s,'y'].shift()
It would be easier to test if you paste your text data instead of the picture.
Input:
df = pd.DataFrame({'x': list('AAABBBBCCCC'),
                   'y': [5, np.nan, np.nan, 10, np.nan, np.nan, np.nan,
                         20, np.nan, np.nan, np.nan]})
output:
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN
I have a dataframe with some columns like this:
A B C
0
4
5
6
7
7
6
5
The possible range of values in A is only from 0 to 7.
Also, I have a list of 8 elements like this:
List = [2, 5, 6, 8, 12, 16, 26, 32]  # there are only 8 elements in this list
If the element in column A is n, I need to insert the n-th element from the List in a new column, say 'D'.
How can I do this in one go without looping over the whole dataframe?
The resulting dataframe would look like this:
A B C D
0 2
4 12
5 16
6 26
7 32
7 32
6 26
5 16
Note: The dataframe is huge and iteration is the last option. But I can also arrange the elements in 'List' in any other data structure, like a dict, if necessary.
Just assign the list directly:
df['new_col'] = mylist
Alternative
Convert the list to a series or array and then assign:
se = pd.Series(mylist)
df['new_col'] = se.values
or
df['new_col'] = np.array(mylist)
IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.
>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([ 0, 40, 50, 60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
A B C D
0 0 NaN NaN 0
1 4 NaN NaN 40
2 5 NaN NaN 50
3 6 NaN NaN 60
4 15 NaN NaN 150
5 15 NaN NaN 150
6 14 NaN NaN 140
7 13 NaN NaN 130
Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.
Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead-- in the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
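Applied to the actual List from the question (a sketch assuming the original 0-to-7 values in column A), the values in A act as positions into the array:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
List = [2, 5, 6, 8, 12, 16, 26, 32]

# index the array with A's values; .to_numpy() sidesteps the old pandas/numpy alignment quirks
df['D'] = np.asarray(List)[df['A'].to_numpy()]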
A solution improving on the great one from @sparrow.
Let df be your dataset, and mylist the list with the values you want to add to the dataframe.
Let's suppose you want to call your new column simply new_column.
First make the list into a Series:
column_values = pd.Series(mylist)
Then use the insert function to add the column. This function has the advantage of letting you choose the position at which to place the column.
In the following example we will put the new column in the first position from the left (by setting loc=0):
df.insert(loc=0, column='new_column', value=column_values)
First let's create the dataframe you had; I'll ignore columns B and C as they are not relevant.
df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6,5]})
And the mapping that you desire:
mapping = dict(enumerate([2,5,6,8,12,16,26,32]))
df['D'] = df['A'].map(mapping)
Done!
print(df)
Output:
A D
0 0 2
1 4 12
2 5 16
3 6 26
4 7 32
5 7 32
6 6 26
7 5 16
Old question, but I always try to use the fastest code!
I had a huge list with 69 million uint64 values; np.array() was fastest for me.
df['hashes'] = hashes
Time spent: 17.034842014312744
df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673
df['key'] = np.array(hashes)
Time spent: 10.724546194076538
You can also use df.assign:
In [1559]: df
Out[1559]:
A B C
0 0 NaN NaN
1 4 NaN NaN
2 5 NaN NaN
3 6 NaN NaN
4 7 NaN NaN
5 7 NaN NaN
6 6 NaN NaN
7 5 NaN NaN
In [1560]: mylist = [2,5,6,8,12,16,26,32]
In [1567]: df = df.assign(D=mylist)
In [1568]: df
Out[1568]:
A B C D
0 0 NaN NaN 2
1 4 NaN NaN 5
2 5 NaN NaN 6
3 6 NaN NaN 8
4 7 NaN NaN 12
5 7 NaN NaN 16
6 6 NaN NaN 26
7 5 NaN NaN 32