Merge pandas rows based on values and NaNs - python

My dataframe looks like this :
ID  VALUE1  VALUE2   VALUE3
1   NaN     [ab,c]   Good
1   google  [ab,c]   Good
2   NaN     [ab,c1]  NaN
2   First   [ab,c1]  Good1
2   First   [ab,c1]
3   NaN     [ab,c]   Good
The requirement is:
ID is the key. I have 3 rows for ID 2, so I need to merge those rows into one row such that I have valid values (excluding nulls and spaces) for all the columns.
My expected output is:
ID  VALUE1  VALUE2   VALUE3
1   google  [ab,c]   Good
2   First   [ab,c1]  Good1
3   NaN     [ab,c]   Good
Is there a pandas function to achieve this, or do I have to separate the data into two or more dataframes and then merge based on the NaNs/spaces?
Thanks for your help
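For reference (not one of the answers posted below), a minimal sketch of a built-in route: once blanks/spaces are converted to NaN, GroupBy.first() takes the first non-null value of each column within each ID group, which should reproduce the expected output for this data.
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3],
                   "VALUE1": [np.nan, 'google', np.nan, 'First', 'First', np.nan],
                   "VALUE2": [['ab','c'], ['ab','c'], ['ab','c1'], ['ab','c1'], ['ab','c1'], ['ab','c']],
                   "VALUE3": ['Good', 'Good', np.nan, 'Good1', np.nan, 'Good']})

# first() returns the first non-null value of each column within each ID group,
# so the three rows for ID 2 collapse into one row with VALUE1='First' and VALUE3='Good1'.
out = df.groupby('ID', as_index=False).first()
print(out)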

Micheal G has a more elegant solution above.
Here is my more time-consuming and amateur approach:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3],
                   "V1": [np.nan, 'google', np.nan, 'First', 'First', np.nan],
                   "V2": [['ab','c'], ['ab','c'], ['ab','c1'], ['ab','c1'], ['ab','c1'], ['ab','c']],
                   "V3": ['Good', 'Good', np.nan, np.nan, 'Good1', 'Good']
                   })

uniq = df.ID.unique()      # Get the unique values in ID
df = df.set_index(['ID'])  # Since we are trying to find the rows with the fewest NaNs,
                           # setting the index to ID makes the statements below faster and easier.
newDf = pd.DataFrame()

# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat instead.
for i in uniq:  # Run the loop once per unique value in column ID
    temp = df.loc[i]
    if isinstance(temp, pd.Series):  # If there is only one row for this ID, add that row to our new DataFrame
        newDf = newDf.append(temp)
    else:
        NonNanCountSeries = temp.apply(lambda x: x.count(), axis=1)
        # Number of non-NaN values in each row, as a Series.
        NonNanCountList = NonNanCountSeries.tolist()
        newDf = newDf.append(temp.iloc[NonNanCountList.index(max(NonNanCountList))])
        # Let's break this down:
        # Find the max in NonNanCountList: max(NonNanCountList)
        # Find the index of that max, i.e. the row number with the
        # most non-NaN values: NonNanCountList.index(max(NonNanCountList))
        # Get that row by passing the index into temp.iloc
        # Append the row to newDf and update newDf
print(newDf)
Which should return:
V1 V2 V3
1 google [ab, c] Good
2 First [ab, c1] Good1
3 NaN [ab, c] Good

Note, I capitalised Google.
import pandas as pd
import numpy as np

data = {'ID': [1, 1, 2, 2, 2, 3],
        'VALUE1': ['NaN', 'Google', 'NaN', 'First', 'First', 'NaN'],
        'VALUE2': ['abc', 'abc', 'abc1', 'abc1', 'abc1', 'abc'],
        'VALUE3': ['Good', 'Good', 'NaN', 'Good1', '0', 'Good']}
df = pd.DataFrame(data)

df_ = df.replace('NaN', np.nan).fillna('zero', inplace=False)
df2 = df_.sort_values(['VALUE1', 'ID'])
mask = df2.ID.duplicated()
print(df_[~mask])
Output
ID VALUE1 VALUE2 VALUE3
1 1 Google abc Good
3 2 First abc1 Good1
5 3 zero abc Good
Finally, just be aware that the tilde character (~) in the mask is essential.
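If the ~ is unfamiliar, here is a tiny standalone illustration (generic data, not the frame above) of what it does to a duplicated() mask:
import pandas as pd

s = pd.Series([1, 1, 2])
dup = s.duplicated()  # [False, True, False] -- True marks repeats after the first occurrence
print(s[~dup])        # ~ inverts the mask, so this keeps the first occurrence of each value: 1, 2
print(s[dup])         # without ~ you would instead keep only the repeated rows: 1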

Related

create dataframe with outliers and then replace with nan

I am trying to write a function that spots the columns with "100" in the header and replaces the values in those columns with NaN depending on multiple criteria.
I also want the function to keep the value of the column "first_column" corresponding to each outlier.
For instance, let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values:
I start with this dataframe:
import pandas as pd

data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_100': ['25', '1568200', '5'],
        }
df = pd.DataFrame(data)
print(df)
expected output:
IIUC, you can use filter and boolean indexing:
# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))
# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# get rows with at least one match
out2 = df.loc[mask.any(axis=1), ['first_column'] + list(df.filter(like='100'))]
output 1:
first_column second_column third_100 fourth_100
0 product_name first_value 89 25
1 product_name2 second_value 9 NaN
2 product_name3 third_value NaN 5
output 2:
first_column third_100 fourth_100
1 product_name2 9 1568200
2 product_name3 589 5
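If it needs to live in a function as the question asks, here is a hedged wrapper around the same idea (the function name and the low/high bounds are just placeholders):
import pandas as pd

def mask_100_columns(df, low=0, high=100):
    # Mask out-of-range values in the '100' columns, and also return the offending rows
    # together with their first_column value.
    cols = df.filter(like='100').columns
    vals = df[cols].astype(int)
    mask = vals.lt(low) | vals.gt(high)
    masked = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
    outliers = df.loc[mask.any(axis=1), ['first_column'] + list(cols)]
    return masked, outliers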

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
   First
a      1
b      3
and
>>> df2
   Second
c       2
d       4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up, but thanks to the suggestions from jsmart and topsail, you can sidestep the index alignment by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
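The reason the plain assignment produced NaN is that assigning a Series aligns on index labels, and df1's labels (a/b) never match df2's (c/d); .values strips the index, so the assignment happens purely by position. A small sketch of the difference, using toy frames shaped like the ones above:
import pandas as pd

df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])

df1['aligned'] = df2['Second']             # aligns on index labels a/b vs c/d -> no matches -> NaN, NaN
df1['by_position'] = df2['Second'].values  # plain array, assigned purely by position -> 2, 4
print(df1)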
This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have index values in there such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
import pandas as pd

# -----------
# sample data
# -----------
df1 = pd.DataFrame({
    'x': ['a', 'b'],
    'First': [1, 3],
})
df1.set_index("x", drop=True, inplace=True)

df2 = pd.DataFrame({
    'x': ['c', 'd'],
    'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)

# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index controls whether the output keeps the original index labels along the concatenation axis. If it is True, the original labels are discarded and replaced with 0 to n-1, which is why the column headers 0 and 1 appear in your result.
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4

How to replace multiple values with other conditional strings or int values in a Pandas Dataframe/Series

I'm dealing with a huge Excel sheet, about 176k rows of raw data, and I need to replace multiple values in one column conditionally on another column, something like: "if column A contains 'word' in any row, then apply replace(str1, str2) to column B of the same row".
I've done other replacements with the replace function using lists:
list1 = ["..", "...", "..."]
list2 = ["..", "...", "..."]
df['column'] = df['column'].replace(list1, list2)
This has worked perfectly... but now I need to do multiple replacements:
for example, current df:
a b
0 value1 value5
1 value2 value5
2 value3 value5
Expected output:
a b
0 value1 x-value
1 value2 y-value
2 value3 z-value
In the example I need to replace column "b" with another value, conditional on the objects (str or int) in column "a" at rows 0, 1, 2, so the condition would be "== (value1, value2, value3)". The resulting value in "b" could be anything.
I've also tried this for loop but it replaces other values I don't want:
for i in df['column1']:
    if i in df['column1'] == 'value1':
        df['column2'] = df['column2'].replace("value2", "value3")
I have tried with subsetting also but didn't work:
new_df = df['column1'] == "value1"
new = df[new_df]
df['column2'] = new['column2'].replace('value2', 'value3')
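For illustration only (a sketch of what the question seems to ask for, not one of the posted answers): build a mapping from the values of column "a" to the new values for column "b", or use boolean indexing for a single condition. The column names and x/y/z values are taken from the example above.
import pandas as pd

df = pd.DataFrame({'a': ['value1', 'value2', 'value3'],
                   'b': ['value5', 'value5', 'value5']})

# map each value of "a" to the replacement for "b"; rows not covered by the mapping keep their old "b"
mapping = {'value1': 'x-value', 'value2': 'y-value', 'value3': 'z-value'}
df['b'] = df['a'].map(mapping).fillna(df['b'])

# single-condition alternative (boolean indexing): only touch "b" where "a" equals a given value
# df.loc[df['a'] == 'value1', 'b'] = 'x-value'

print(df)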
From what I understood about your problem, you want to change the values of a redundant column based on three conditions: copy the values of another column but change a few things to fit the attributes of this revised column. In that case try this...
import pandas as pd
df = pd.DataFrame({'Col1': [1, 2, 3, 4, 'Five'], 'Col2': ['differ', '', '', '', '']})
print(df)
def replacefunc(data, char1, char2):
    for a, b in zip(data['Col1'], data['Col2']):
        if a != b:
            data['Col2'] = data['Col1'].replace(char1, char2)
    return data
df = replacefunc(df, 4, 'changed')
print(df)
Output:
previous_df
Col1 Col2
0 1 differ
1 2
2 3
3 4
4 Five
new_df
Col1 Col2
0 1 1
1 2 2
2 3 3
3 4 changed
4 Five Five

Pandas merge creates unwanted duplicate entries

I'm new to Pandas and I want to merge two datasets that have similar columns. The columns are going to each have some unique values compared to the other column, in addition to many identical values. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.
import pandas as pd
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
print(pd.merge(df1,df2))
output:
A
0 2
1 2
2 2
3 2
4 3
5 4
6 5
desired/expected output:
A
0 2
1 2
2 3
3 4
4 5
Please let me know how/if I can achieve the desired output using merge, thank you!
EDIT
To clarify why I'm confused by this behavior: if I simply add another column, it doesn't make four 2's, rather there are only two 2's, so I would expect my first example to also have only two 2's. Why does the behavior seem to change? What is pandas doing?
import pandas as pd
df1 = df2 = pd.DataFrame(
{'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)
print(pd.merge(df1,df2))
output:
A B
0 2 red
1 2 orange
2 3 yellow
3 4 green
4 5 blue
However, based on the first example I would expect:
A B
0 2 red
1 2 orange
2 2 red
3 2 orange
4 3 yellow
5 4 green
6 5 blue
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()
df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)
print(df)
Output:
A
0 2
1 2
2 3
3 4
4 5
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df = df1.merge(df2).drop(columns='index')  # merge on both 'A' and 'index', then drop the helper column
The idea is to merge based on the matching indices as well as matching 'A' column values.
Previously, since the way merge works depends on matches, what happened is that the first 2 in df1 was matched to both the first and second 2 in df2, and the second 2 in df1 was matched to both the first and second 2 in df2 as well.
If you try this, you will see what I am talking about.
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df1.merge(df2, on = 'A')
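Running that last snippet should give something like the following; each 2 in df1 pairs with both 2s in df2, which is exactly the duplication the question is about:
   A  index_x  index_y
0  2        0        0
1  2        0        1
2  2        1        0
3  2        1        1
4  3        2        2
5  4        3        3
6  5        4        4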
did you try df.drop_duplicates() ?
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df=pd.merge(df1,df2)
df_new=df.drop_duplicates()
print(df)
print(df_new)
It seems to give the results that you want.
The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
join_cols = ['A']
merged = pd.merge(df1, df2[~df2.duplicated(subset=join_cols, keep='first')], on=join_cols)
Note that we defined join_cols so that the columns we join on and the columns we remove duplicates on are the same.
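With the question's df1/df2, merged should then come back with one row per row of df1, i.e. the desired output:
   A
0  2
1  2
2  3
3  4
4  5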
I have unfortunately stumbled upon a similar problem which I see is now old.
I solved it by using this function in a different way, applying it to the two original tables, even though there were no duplicates in these. This is an example (I apologize, I am not a professional programmer):
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1=df1.drop_duplicates()
df2 = pd.DataFrame(dict2)
df2=df2.drop_duplicates()
df=pd.merge(df1,df2)
print('df1:')
print( df1 )
print('df2:')
print( df2 )
print('df:')
print( df )

Python find out records in dataframe by column values greater than or equal to their median in each subgroup

suppose I have a dataframe which could be initiated by:
df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]
                   })
df = df.set_index(['group1', 'group2'])
I want to subset df to the rows whose value2 is greater than or equal to the median of its sub-group, where the sub-groups are defined by the group2 index level. In this example, the rows with group1 in ['2','4','5','6'] should stay in the result. Can anyone help?
This should work:
import numpy as np

df['value2'] = df['value2'].groupby(level='group2').transform(lambda x: np.where(x >= np.median(x), x, np.nan))
df = df.dropna()
What this does is take the value2 column and split it into groups by group2. For each group, it finds the median, then replaces any value below the median with NaN. It then puts this back into the value2 column and gets rid of all the rows with NaN values.
As an alternative, here is a slightly less clear one-liner:
df = df.groupby(level='group2').transform(lambda x: x if x.name != 'value2' else np.where(x >= np.median(x), x, np.nan)).dropna()
This does roughly the same thing, except it runs on both columns but leaves the value1 column untouched.
Note that in the second approach you could instead store to a second variable, like df2, without altering the original df if you prefer. You could do that with the first approach, but that would require yet another line to make a copy. This version is much simpler for that case.
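As an aside, a more compact sketch of the same idea (not one of the original answers) that compares each row against its group median via transform, without overwriting value2:
import pandas as pd

df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]}).set_index(['group1', 'group2'])

# per-row median of that row's group2 sub-group, aligned back to the original index
group_median = df.groupby(level='group2')['value2'].transform('median')
result = df[df['value2'] >= group_median]
print(result)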
I think you need to do a groupby and comparison before setting the index:
df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]
                   })
gb = df.groupby('group2').value2.median()
df.join(gb, on='group2', rsuffix='_median')  # preview: df with each row's group median alongside
df_filtered = df[df.value2 >= df.join(gb, on='group2', rsuffix='_median').value2_median]
df_filtered.set_index(['group1', 'group2'], inplace=True)
>>> df_filtered
               value1  value2
group1 group2
2      c            2       8
4      d            4      10
5      d            5      11
6      e            6      12
