Remove duplicate values in a pandas column, but ignore one value - python

I'm sure there is an elegant solution for this, but I cannot find one. In a pandas dataframe, how do I remove all duplicate values in a column while ignoring one value?
repost_of_post_id title
0 7139471603 Man with an RV needs a place to park for a week
1 6688293563 Land for lease
2 None 2B/1.5B, Dishwasher, In Lancaster
3 None Looking For Convenience? Check Out Cordova Par...
4 None 2/bd 2/ba, Three Sparkling Swimming Pools, Sit...
5 None 1 bedroom w/Closet is bathrooms in Select Unit...
6 None Controlled Access/Gated, Availability 24 Hours...
7 None Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent
8 7143099582 Need Help Getting Approved?
9 None *MOVE IN READY APT* REQUEST TOUR TODAY!
What I want is to keep all None values in repost_of_post_id, but omit any duplicates of the numerical values, for example if there are duplicates of 7139471603 in the dataframe.
[UPDATE]
I got the desired outcome using this script, but I would like to accomplish this in a one-liner, if possible.
# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned
ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")
ca_housing_unique = ca_housing_repost_none.append(ca_housing_repost_not_none_unique)
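For reference, one possible one-liner sketch (untested against the real data): keep every "None" row plus the first occurrence of each numeric id, preserving the original row order instead of appending:
ca_housing_unique = ca_housing[
    (ca_housing['repost_of_post_id'] == "None")
    | ~ca_housing['repost_of_post_id'].duplicated()
]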

You could try dropping the None values, then detecting duplicates, then filtering them out of the original dataframe.
In [1]: import pandas as pd
...: from string import ascii_lowercase
...:
...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5]
...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])})
...: print(df)
...:
...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
6 2.0 g
7 3.0 h
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l

You could use drop_duplicates and merge with the NaNs as follows:
df_cleaned = df.drop_duplicates('post_id', keep='first').merge(df[df.post_id.isnull()], how='outer')
This will keep the first occurrence of each duplicated id and all NaN rows.
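A minimal sketch of that approach on a made-up frame (note it assumes the missing ids are actual NaN, as this answer does, rather than the string "None" used in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'post_id': [7139471603, 6688293563, np.nan, np.nan, 7139471603],
    'title': ['RV parking', 'Land for lease', 'Apt A', 'Apt B', 'RV parking repost'],
})
df_cleaned = df.drop_duplicates('post_id', keep='first').merge(df[df.post_id.isnull()], how='outer')
print(df_cleaned)  # both NaN rows are kept; the second 7139471603 row is dropped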

Related

Dynamically Fill NaN Values in Dataframe

I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill as the rule is dynamic, taking the value from the previous row and dividing by the number of consecutive NaN + 1. For example, rows 3 and 4 should be replaced with 12 as 24/2, rows 6, 7 and 8 should be replaced with 5. All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for the rows whose own value is not NaN but whose next or previous value is NaN. Each such row starts a new group.
So m in the code above looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows shaped like [True, <all False>], because those are the groups I want to average over. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now, for each group, you can take the mean of the group if the group contains any NaN value. This is checked with x.isna().any().
If the group has any NaN value, assign the mean after filling NaN with 0; otherwise keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? It has a method= parameter that would probably fit your needs.
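A minimal sketch of what that could look like, using linear interpolation (note this gives different numbers than the divide-by-count rule described in the question):
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 24, np.nan, 15, np.nan, np.nan])
# fill each NaN by interpolating between the surrounding valid values
print(s.interpolate(method='linear'))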
However, if you really want to do as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job)
import pandas as pd
import numpy as np

df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue

df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0

Pandas: How to use (df.groupby) in a lambda formula

The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(zip(list1, list2), columns= ['Item','Value'])
df
gives:
Item Value
0 a 0
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
6 b 6
7 c 7
8 c 8
9 c 9
required: GroupFirstValue column as shown below.
The idea is to use a lambda formula to get the 'first' value for each group. For example, "a"'s first value is 0, "b"'s first value is 3, and "c"'s first value is 7. That's why those numbers appear in the GroupFirstValue column.
Note: I know that I can do this in 2 steps: build a grouped-by df from the original df and then merge the two together. The idea is to see if this can be done more efficiently in a single step. Many thanks in advance!
groupby and use first
df.groupby('Item')['Value'].first()
or you can use transform and assign to a new column in your frame
df['new_col'] = df.groupby('Item')['Value'].transform('first')
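For the sample frame above, the transform line should fill new_col with each group's first value on every row, roughly:
print(df['new_col'].tolist())
# [0, 0, 0, 3, 3, 3, 3, 7, 7, 7]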
Use mask and duplicated
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Out[109]:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN

Loop that counts unique values in a pandas df

I am trying to create a loop or a more efficient process that can count the number of current values in a pandas df. At the moment I'm selecting the value I want to perform the function on.
So for the df below, I'm trying to determine two counts.
1) ['u'] returns the count of the same remaining values left in ['Code', 'Area'], i.e. how many more times the same value occurs after that row.
2) ['On'] returns the number of values that are currently occurring in ['Area']. It achieves this by parsing through the df to see if those values occur again, essentially looking into the future to see whether those values occur again.
import pandas as pd
d = ({
    'Code': ['A','A','A','A','B','A','B','A','A','A'],
    'Area': ['Home','Work','Shops','Park','Cafe','Home','Cafe','Work','Home','Park'],
})
df = pd.DataFrame(data=d)
#Select value
df1 = df[df.Code == 'A'].copy()
df1['u'] = df1[::-1].groupby('Area').Area.cumcount()
ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0
df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is I want to run this script on all values in Code instead of selecting one at a time. For instance, if I want to do the same thing on Code 'B', I would have to change it to df2 = df1[df1.Code == 'B'].copy() and then run the script again.
If I have numerous values in Code this becomes very inefficient. I need a loop that finds all unique values in 'Code'. Ideally, the script would look like:
df1 = df[df.Code == 'All unique values'].copy()
Intended Output:
Code Area u On
0 A Home 2.0 1.0
1 A Work 1.0 2.0
2 A Shops 0.0 3.0
3 A Park 1.0 3.0
4 B Cafe 1.0 1.0
5 A Home 1.0 3.0
6 B Cafe 0.0 1.0
7 A Work 0.0 3.0
8 A Home 0.0 2.0
9 A Park 0.0 1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
(df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0)
which gives me
In [212]: df
Out[212]:
Code Area u nunique On
0 A Home 2 1 1.0
1 A Work 1 2 2.0
2 A Shops 0 3 3.0
3 A Park 1 4 3.0
4 B Cafe 1 1 1.0
5 A Home 1 4 3.0
6 B Cafe 0 1 1.0
7 A Work 0 4 3.0
8 A Home 0 4 2.0
9 A Park 0 4 1.0
In this, u is the number of matching (Code, Area) pairs after that row. nunique is the number of unique Area values seen so far in that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area -- once it's not used any more -- we start subtracting it from nunique.
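If it helps, the subtraction term can be inspected on its own. Computed from the frame above it should look roughly like this (retired is just a name made up here for illustration):
# per Code: how many Areas have already had their last occurrence strictly before this row
retired = (df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0)
print(retired.tolist())
# [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 2.0, 3.0]
# so df["On"] == df["nunique"] - retired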
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
Code Area u
0 A Home 2
1 A Home 1
2 B Shops 1
3 A Park 1
4 B Cafe 1
5 B Shops 0
6 A Home 0
7 B Cafe 0
8 A Work 0
9 A Park 0

Averaging values between files, but keeping non-matching values

I have two files:
File 1:
key.1 10 6
key.2 5 6
key.3. 5 8
key.4. 5 10
key.5 4 12
File 2:
key.1 10 6
key.2 6 6
key.4 5 10
key.5 2 8
I have a rather complicated issue. I want to average between the two files for each loc. ID. But if an ID is unique to either of the files, I simply want to keep that value in the output file. So the output file would look like this:
key.1 10 6
key.2 5.5 6
key.3. 5 8
key.4. 5 10
key.5 3 10
This is an example. In reality I have 100s of columns that I would like to average.
The following solution uses Pandas, and assumes that your data is stored in plain text files 'file1.txt' and 'file2.txt'. Let me know if this assumption is incorrect - it is likely a minimal edit to alter for different file types. If I have misunderstood your meaning of the word 'file' and your data is already in DataFrames, you can ignore the first step.
First read in the data to DataFrames:
import pandas as pd
df1 = pd.read_table('file1.txt', sep=r'\s+', header=None)
df2 = pd.read_table('file2.txt', sep=r'\s+', header=None)
Giving us:
In [9]: df1
Out[9]:
0 1 2
0 key.1 10 6
1 key.2 5 6
2 key.3 5 8
3 key.4 5 10
4 key.5 4 12
In [10]: df2
Out[10]:
0 1 2
0 key.1 10 6
1 key.2 6 6
2 key.4 5 10
3 key.5 2 8
Then join these datasets on column 0:
combined = pd.merge(df1, df2, 'outer', on=0)
Giving:
0 1_x 2_x 1_y 2_y
0 key.1 10 6 10.0 6.0
1 key.2 5 6 6.0 6.0
2 key.3 5 8 NaN NaN
3 key.4 5 10 5.0 10.0
4 key.5 4 12 2.0 8.0
Which is a bit of a mess, but we can select only the columns we want after doing the calculations:
combined[1] = combined[['1_x', '1_y']].mean(axis=1)
combined[2] = combined[['2_x', '2_y']].mean(axis=1)
Selecting only useful columns:
results = combined[[0, 1, 2]]
Which gives us:
0 1 2
0 key.1 10.0 6.0
1 key.2 5.5 6.0
2 key.3 5.0 8.0
3 key.4 5.0 10.0
4 key.5 3.0 10.0
Which is what you were looking for I believe.
You didn't state which file format you wanted the output to be in, but the following will give you a tab-separated text file. Let me know if something different is preferred and I can edit.
results.to_csv('output.txt', sep='\t', header=None, index=False)
I should add that it would be better to give your columns relevant labels rather than using numbers as I have in this example - I just used the default integer values here since I don't know anything about your dataset.
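For instance (the column names below are made up; adjust them to your data):
df1 = pd.read_table('file1.txt', sep=r'\s+', header=None,
                    names=['key', 'value_a', 'value_b'])
df2 = pd.read_table('file2.txt', sep=r'\s+', header=None,
                    names=['key', 'value_a', 'value_b'])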
This is one solution via pandas. The idea is to define indices for each dataframe and use symmetric_difference (the set-theoretic ^ operation) to find your unique indices.
Treat each case separately via 2 pd.concat calls, perform a groupby.mean, and append your isolated indices at the end.
import pandas as pd

# read files into dataframes (whitespace-separated text, no header row)
df1 = pd.read_csv('file1.txt', sep=r'\s+', header=None)
df2 = pd.read_csv('file2.txt', sep=r'\s+', header=None)
# set first column as index
df1 = df1.set_index(0)
df2 = df2.set_index(0)
# calculate symmetric difference of indices
x = df1.index.symmetric_difference(df2.index)
# Index(['key.3'], dtype='object', name=0)
# aggregate common and unique indices
df_common = pd.concat((df1[~df1.index.isin(x)], df2[~df2.index.isin(x)]))
df_unique = pd.concat((df1[df1.index.isin(x)], df2[df2.index.isin(x)]))
# calculate mean on common indices; append unique indices
mean = df_common.groupby(df_common.index)\
                .mean()\
                .append(df_unique)\
                .sort_index()\
                .reset_index()
# output to csv
mean.to_csv('out.csv', index=False)
Result
0 1 2
0 key.1 10.0 6.0
1 key.2 5.5 6.0
2 key.3 5.0 8.0
3 key.4 5.0 10.0
4 key.5 3.0 10.0
You can use itertools.groupby:
import itertools
import re
file_1 = [[re.sub(r'\.$', '', a), *list(map(int, filter(None, b)))] for a, *b in [re.split(r'\s+', i.strip('\n')) for i in open('filename.txt')]]
file_2 = [[re.sub(r'\.$', '', a), *list(map(int, filter(None, b)))] for a, *b in [re.split(r'\s+', i.strip('\n')) for i in open('filename1.txt')]]
special_keys = {a for a, *_ in [re.split(r'\s+', i.strip('\n')) for i in open('filename.txt')] + [re.split(r'\s+', i.strip('\n')) for i in open('filename1.txt')] if a.endswith('.')}
new_results = [[a, [c for _, *c in b]] for a, b in itertools.groupby(sorted(file_1 + file_2, key=lambda x: x[0]), key=lambda x: x[0])]
last_results = [(" " * 4).join(["{}"] * 3).format(a + '.' if a + '.' in special_keys else a, *[sum(i) / float(len(i)) for i in zip(*b)]) for a, b in new_results]
Output:
['key.1 10.0 6.0', 'key.2 5.5 6.0', 'key.3. 5.0 8.0', 'key.4. 5.0 10.0', 'key.5 3.0 10.0']
One possible solution is to read the two files into dictionaries (key being the key variable, and the value being a list with the two elements after). You can then get the keys of each dictionary, see which keys are duplicated (and if so, average the results), and which keys are unique (and if so, just output the key). This might not be the most efficient, but if you only have hundreds of columns that should be the simplest way to do it.
Look up set intersection and set difference, as they will help you find the common items and unique items.
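A rough sketch of that dictionary-based approach, assuming whitespace-separated files named file1.txt and file2.txt with consistently spelled keys (read_rows is just a helper defined for the sketch):
def read_rows(path):
    # map key -> list of numeric values on that line
    data = {}
    with open(path) as fh:
        for line in fh:
            key, *values = line.split()
            data[key] = [float(v) for v in values]
    return data

d1 = read_rows('file1.txt')
d2 = read_rows('file2.txt')

result = {}
for key in d1.keys() | d2.keys():      # union of all keys
    if key in d1 and key in d2:        # common key: average column-wise
        result[key] = [(a + b) / 2 for a, b in zip(d1[key], d2[key])]
    else:                              # unique key: keep the values as-is
        result[key] = d1.get(key, d2.get(key))

for key in sorted(result):
    print(key, *result[key])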

Default value for pandas lookup when lookup value doesn't exist or is null

I have a dataframe that looks like this:
parent region
estid
1 NaN A
2 NaN B
3 1.0 A
4 1.0 B
5 2.0 C
6 2.0 C
7 8.0 A
What I want is to create an extra column containing the region of the parent, defaulting to None if the parent is not found in the data, e.g.:
parent region parent_region
estid
1 NaN A None
2 NaN B None
3 1.0 A A
4 1.0 B A
5 2.0 C B
6 2.0 C B
7 8.0 A None
The following returns the correct result:
df["parent_region"] = df.apply(lambda x : df.loc[x["parent"]]["region"] if not math.isnan(x["parent"]) and x["parent"] in df.index else None, axis = 1)
But I'm very scared of inefficiencies, given that my dataframe has 168 million rows. Is there a better way to do it? I looked at lookup and get but I can't quite figure out how to work with IDs that can be NaN or not present in the dataframe.
For instance, I thought this could work: df.lookup(df["region"], df["parent"]), but it doesn't handle null keys well. df.get("region") does not return the parent's region but the column itself, so it doesn't do what I want.
You can use the Series.map method, which behaves like a dictionary lookup: here the region column (indexed by estid) supplies the keys and values, and each entry of the parent column is looked up against it.
Additionally, na_action='ignore' can be used to speed up the mapping, since all NaNs in the parent column are skipped and simply propagated.
Lastly, the missing values are replaced with None using the Series.replace method.
df["parent_region"] = df.parent.map(df.region, na_action='ignore').replace({np.NaN:None})
Out[121]:
estid
1 None
2 None
3 A
4 A
5 B
6 B
7 None
Name: parent_region, dtype: object
We could also use a merge for this, joining on itself to match parents to estid:
z = pd.merge(x, x[['estid', 'region']],
             left_on='parent',
             right_on='estid',
             how='left',
             suffixes=('', '_parent'))  # left join
del z['estid_parent']  # remove the unneeded duplicate key column
z['region_parent'] = z['region_parent'].replace({np.NaN:None}) #remove nans, same as other answer
z
estid parent region region_parent
0 1 NaN A None
1 2 NaN B None
2 3 1.0 A A
3 4 1.0 B A
4 5 2.0 C B
5 6 2.0 C B
6 7 8.0 A None
