combine 2 columns of dataframe based on a condition - python

I have created a data frame
import pandas as pd

data = [['Nan', 10], [4, 'Nan'], ['Nan', 12], ['Nan', 13], [5, 'Nan'], [6, 'Nan'], [7, 'Nan'], ['Nan', 8]]
df = pd.DataFrame(data, columns=['min', 'max'])
print(df)
My dataset looks like:
min  max
Nan  10
4    Nan
Nan  12
Nan  13
5    Nan
6    Nan
7    Nan
Nan  8
I want to create a new column that alternates: take one value from min, then one value from max, and so on. If there are two consecutive values in min or max (for example, 12 and 13 are two consecutive max values), I should keep only the first one (keep only 12 and then move on to pick a min).
In short, the new column should have one min-value row, then one max-value row, and so on.
OUTPUT should be
combined
10
4
12
5
8

You can use .where() to set a min or max value to NaN whenever the previous row in the same column is not NaN (i.e. keep only the first of consecutive values). Then drop the rows where both min and max are NaN, and finally fill the NaN values in min with the value of max in each row using .combine_first():
import numpy as np

df = df.replace('Nan', np.nan)
df['min'] = df['min'].where(df['min'].shift().isna())
df['max'] = df['max'].where(df['max'].shift().isna())
df = df.dropna(how='all')
df['combined'] = df['min'].combine_first(df['max'])
Result:
print(df)
min max combined
0 NaN 10.0 10.0
1 4.0 NaN 4.0
2 NaN 12.0 12.0
4 5.0 NaN 5.0
7 NaN 8.0 8.0

Stack the dataframe to reshape it into a MultiIndex series, then reset the level-1 index. Finally, use boolean indexing to keep only the rows where a min value is followed by a max value, or vice versa:
s = df[df != 'Nan'].stack().reset_index(name='combined', level=1)
m = s['level_1'] != s['level_1'].shift()
s[m].drop(columns='level_1')
combined
0 10.0
1 4.0
2 12.0
4 5.0
7 8.0

What you can do is define the key for the first value that you want to include, for example 'max', then iterate through the DataFrame and append the values to your data structure while alternating the key. At the same time, you have to skip the 'Nan' entries, since there are a lot of those:
combined = []
key = 'max'
for index, row in df.iterrows():
    # the missing values in this data are the string 'Nan', not np.nan
    if row[key] != 'Nan':
        combined.append(row[key])
        if key == 'max':
            key = 'min'
        else:
            key = 'max'
Here I have just hardcoded the first key, but if you do not want to do that, you can check which column in the first row has an actual value that is not 'Nan' and make that the starting key (see the sketch below).
Note: I have added the data to a list, because I am not sure how you plan to include this as a column when the lengths will be different.
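A minimal sketch of picking the starting key dynamically, assuming the same 'Nan' string convention as the question's data:
# start with whichever column holds a real value in the first row
key = 'min' if df.iloc[0]['min'] != 'Nan' else 'max'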

If my assumptions are correct, then this should work. The assumptions:
The missing value is the string 'Nan', not np.nan.
If the min column has a 'Nan' value then the max column has a number, and vice versa; no row has two numbers.
import numpy as np
import pandas as pd
data = [['Nan', 10], [4, 'Nan'], ['Nan', 12], ['Nan', 13], [5, 'Nan'], [6, 'Nan'], [7, 'Nan'], ['Nan', 8]]
df = pd.DataFrame(data, columns = ['min', 'max'])
df['combined'] = np.where(df['min']!='Nan', df['min'], df['max'])
This is the output I get
min max combined
0 Nan 10 10
1 4 Nan 4
2 Nan 12 12
3 Nan 13 13
4 5 Nan 5
5 6 Nan 6
6 7 Nan 7
7 Nan 8 8


How to merge two dataframes without generating extra rows in result?

I am doing the following with two dataframes, but it generates duplicate rows and the result does not keep the order of the first dataframe.
import pandas as pd
dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
result = pd.merge(df1, df2, on="time", how="left")
This generates a result with 8 rows! I am stripping the fractional part (the last 4 characters) from the time column in df1 so it matches the time values in df2.
time value counts growth
0 15:09 10 fg 1.0
1 15:09 10 mn 3.0
2 15:09 20 fg 1.0
3 15:09 20 mn 3.0
4 15:10 30 gl 6.0
5 15:11 40 NaN NaN
6 15:12 50 NaN NaN
7 15:12 60 NaN NaN
There are duplicated rows due to the join.
Is it possible to join the dataframes on the time column of df1 so that the events keep their original, finer-grained time order? Is there a way to partially match the time column values of the two dataframes and merge? The ideal result would look like the following:
time value counts growth
0 15:09.123 10 fg 1.0
1 15:09.234 20 mn 3.0
2 15:10.123 30 gl 6.0
3 15:11.123 40 NaN NaN
4 15:12.123 50 NaN NaN
5 15:12.987 60 NaN NaN
Here is one way to do it.
Assumption: for any given truncated time value, df1 and df2 have the same number of rows.
# create time without seconds
df1['time2']=df1['time'].str[:-4]
# add a sequence when there are multiple rows for any time
df1['seq']=df1.groupby('time2')['time2'].cumcount()
# add a sequence when there are multiple rows for any time
df2['seq']=df2.groupby('time').cumcount()
# do a merge on time (stripped) in df1 and sequence
pd.merge(df1,
         df2,
         left_on=['time2', 'seq'],
         right_on=['time', 'seq'],
         how='left',
         suffixes=(None, '_y')).drop(columns=['time2', 'seq'])
time value time_y counts growth
0 15:09.123 10 15:09 fg 1.0
1 15:09.234 20 15:09 mn 3.0
2 15:10.123 30 15:10 gl 6.0
3 15:11.123 40 NaN NaN NaN
4 15:12.123 50 NaN NaN NaN
5 15:12.987 60 NaN NaN NaN
Merge on column 'time' with preserved order
Assumption: Data from df1 and df2 are in order of occurrence
import pandas as pd
dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
    "time": ["15:09", "15:09", "15:11"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
df1_keys = df1["time"].unique()
df_list = list()
for key in df1_keys:
    tmp_df1 = df1[df1["time"] == key]
    tmp_df1 = tmp_df1.reset_index(drop=True)
    tmp_df2 = df2[df2["time"] == key]
    tmp_df2 = tmp_df2.reset_index(drop=True)
    df_list.append(pd.merge(tmp_df1, tmp_df2, left_index=True, right_index=True, how="left"))
print(pd.concat(df_list, axis=0))
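Since each per-key chunk is re-indexed from 0, the concatenated result has repeated index labels; if you want a clean 0..n-1 index you could additionally reset it, for example:
result = pd.concat(df_list, axis=0).reset_index(drop=True)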

align two pandas dataframes on values in one column, otherwise insert NA to match row number

I have two pandas DataFrames (df1, df2) with a different number of rows and columns and some matching values in a specific column in each df, with caveats (1) there are some unique values in each df, and (2) there are different numbers of matching values across the DataFrames.
Baby example:
import pandas as pd

df1 = pd.DataFrame({'id1': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({'id2': [1, 1, 2, 2, 2, 2, 3, 4, 5],
                    'var1': ['B', 'B', 'W', 'W', 'W', 'W', 'H', 'B', 'A']})
What I am seeking to do is create df3 where df2['id2'] is aligned/indexed to df1['id1'], such that:
NaN is added to df3[id2] when df2[id2] has fewer (or missing) matches to df1[id1]
NaN is added to df3[id2] & df3[var1] if df1[id1] exists but has no match to df2[id2]
'var1' is filled in for all cases of df3[var1] where df1[id1] and df2[id2] match
rows are dropped when df2[id2] has more matching values than df1[id1] (or no matches at all)
The resulting DataFrame (df3) should look as follows (Notice id2 = 5 and var1 = A are gone):
id1  id2  var1
1    1    B
1    1    B
1    NaN  B
2    2    W
2    2    W
3    3    H
3    NaN  H
3    NaN  H
3    NaN  H
4    4    B
6    NaN  NaN
6    NaN  NaN
I cannot find a combination of merge/join/concatenate/align that correctly solves this problem. Currently, everything I have tried stacks the rows in sequence without adding NaN in the proper cells/rows and instead adds all the NaN values at the bottom of df3 (so id1 and id2 never align). Any help is greatly appreciated!
You can first assign a helper column to both frames based on groupby.cumcount (a running count within each id1/id2 group), then merge on the id plus that helper. Finally, fill in var1 per id1 group; the code below does this with a mapping from id2 to var1 (a grouped ffill also works, see the note after the output).
def helper(data, col): return data.groupby(col).cumcount()

out = df1.assign(k=helper(df1, ['id1'])).merge(df2.assign(k=helper(df2, ['id2'])),
                                               left_on=['id1', 'k'], right_on=['id2', 'k'],
                                               how='left').drop(columns='k')
out['var1'] = out['id1'].map(dict(df2[['id2', 'var1']].drop_duplicates().to_numpy()))
Or, similarly but without assign, as HenryEcker suggests:
out = df1.merge(df2, left_on=['id1', helper(df1, ['id1'])],
                right_on=['id2', helper(df2, ['id2'])], how='left').drop(columns='key_1')
out['var1'] = out['id1'].map(dict(df2[['id2','var1']].drop_duplicates().to_numpy()))
print(out)
id1 id2 var1
0 1 1.0 B
1 1 1.0 B
2 1 NaN B
3 2 2.0 W
4 2 2.0 W
5 3 3.0 H
6 3 NaN H
7 3 NaN H
8 3 NaN H
9 4 4.0 B
10 6 NaN NaN
11 6 NaN NaN
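As an alternative to the final map step, the grouped ffill mentioned in the explanation above should also work here, because within each id1 group the matched rows come first:
out['var1'] = out.groupby('id1')['var1'].ffill()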

Pandas: move row (index and values) from last to first [duplicate]

This question already has answers here:
add a row at top in pandas dataframe [duplicate]
(6 answers)
Closed 4 years ago.
I would like to move an entire row (index and values) from the last position to the first position of a DataFrame. Every other example I can find either relies on an ordered numerical row index (to be specific, my row index is not a numerical sequence, so I cannot simply insert at -1 and then reindex with +1) or moves the values while keeping the original index order. My DataFrame uses descriptions as the index, and the values belong to their index description.
I'm adding a row and then would like to move that into row 1. Here is the setup:
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'F', 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')
#output
In [7]: df
Out[7]:
col2 col3
col1
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
I then add a new row as follows:
df.loc["Deferred Description"] = pd.Series([''])
In [9]: df
Out[9]:
col2 col3
col1
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
Deferred Description NaN NaN
I would like the resulting output to be:
In [9]: df
Out[9]:
col2 col3
col1
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
I've tried using df.shift() but only the values shift. I've also tried df.sort_index() but that requires the index to be ordered (there are several SO examples using df.loc[-1] = ... and then reindexing with df.index = df.index + 1). In my case I need the Deferred Description row to be the first row.
Your problem is not one of cyclic shifting, but a simpler one: insertion (which is why I've chosen to mark this question as a duplicate).
Construct an empty DataFrame and then concatenate the two using pd.concat.
pd.concat([pd.DataFrame(columns=df.columns, index=['Deferred Description']), df])
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
If this were columns, it'd have been easier. Funnily enough, pandas has a DataFrame.insert function that works for columns, but not rows.
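For comparison, a minimal sketch of the column case (the column name 'flag' is just a made-up example):
# DataFrame.insert adds a column at a given position, in place; there is no row counterpart
df.insert(0, 'flag', ['x'] * len(df))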
Generalized Cyclic Shifting
If you were curious to know how you'd cyclically shift a DataFrame, you can use np.roll.
# apply this fix to your existing DataFrame
pd.DataFrame(np.roll(df.values, 1, axis=0),
             index=np.roll(df.index, 1), columns=df.columns)
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
This, thankfully, also works when you have duplicate index values. If the index or columns aren't important, then pd.DataFrame(np.roll(df.values, 1, axis=0)) works well enough.
You can use append:
pd.DataFrame({'col2':[np.nan],'col3':[np.nan]},index=["Deferred Description"]).append(df)
Out[294]:
col2 col3
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
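Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea is written with pd.concat, as in the first answer:
pd.concat([pd.DataFrame({'col2': [np.nan], 'col3': [np.nan]}, index=["Deferred Description"]), df])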

Can't create jagged dataframe in pandas?

I have a simple dataframe with 2 columns and 2 rows.
I also have a list of 4 numbers.
I want to concatenate this list to the FIRST column of the dataframe, and only the first. So the dataframe would have 6 rows in the first column, and 2 in the second.
I wrote this code:
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
numbers = [5, 6, 7, 8]
for i in range(0, 4):
    df1['A'].loc[i + 2] = numbers[i]
print(df1)
Oddly enough, it prints the original dataframe. But when I debug and evaluate the expression df1['A'], it does show the new numbers. What's going on here?
It's not just that it's printing the original df, it also writes the original df to csv when I use to_csv method.
A DataFrame cannot be jagged: every column must have the same number of rows. The chained assignment df1['A'].loc[i + 2] = ... enlarges only the Series returned by df1['A'], not the DataFrame itself, which is likely why the debugger shows the new numbers on df1['A'] while print(df1) and to_csv still show the original frame. It seems you need:
for i in range(0, 4):
    df1.loc[0, i] = numbers[i]
print(df1)
   A  B    0    1    2    3
0  1  2  5.0  6.0  7.0  8.0
1  3  4  NaN  NaN  NaN  NaN
Or, using concat:
df1 = pd.concat([df1, pd.DataFrame([numbers], index=[0])], axis=1)
print(df1)
   A  B    0    1    2    3
0  1  2  5.0  6.0  7.0  8.0
1  3  4  NaN  NaN  NaN  NaN

Pandas fillna: Output still has NaN values

I am having a strange problem in Pandas. I have a Dataframe with several NaN values. I thought I could fill those NaN values using column means (that is, fill every NaN value with its column mean) but when I try the following
col_means = mydf.apply(np.mean, 0)
mydf = mydf.fillna(value=col_means)
I still see some NaN values. Why?
Is it because I have more NaN values in my original dataframe than entries in col_means? And what exactly is the difference between fill-by-column vs fill-by-row?
You can just fillna with the df.mean() Series (which is dict-like):
In [11]: df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6]])
In [12]: df
Out[12]:
0 1
0 1 NaN
1 NaN 4
2 5 6
In [13]: df.fillna(df.mean())
Out[13]:
0 1
0 1 5
1 3 4
2 5 6
Note: df.mean() is the column-wise mean (one value per column, computed down the rows), which gives the fill values:
In [14]: df.mean()
Out[14]:
0 3
1 5
dtype: float64
Note: if df.mean() itself contains NaN values (for example, for a column that is entirely NaN), those NaNs are used in the DataFrame's fillna and the corresponding cells stay NaN; in that case you may want to run fillna on this Series first, i.e.
df.mean().fillna(0)
df.fillna(df.mean().fillna(0))
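For example, a toy frame with one all-NaN column (the column names 'a' and 'b' here are made up) shows why some NaN values survive: that column's mean is itself NaN, so fillna leaves it untouched.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 5], 'b': [np.nan, np.nan, np.nan]})
print(df.fillna(df.mean()))             # column 'b' stays NaN: its mean is NaN
print(df.fillna(df.mean().fillna(0)))   # column 'b' is filled with 0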
