Is it possible to drop duplicate rows where the strings in a column contain the same parts in a different order?
Example: dl3_hr_rank.r0 and hr_dl3_rank.r0
Code for df before the drop:
import pandas as pd

data = {'item': ['dl3_hr_rank.r0', 'hr_dl3_rank.r0', 'hr_kl3_rank.r0',
                 'kl3_hr_rank.r0', 'hcrfr_hr_rank.r0', 'hr_hcrfr_rank.r0',
                 'hcfr_hkfr_rank.r0_wp', 'hkfr_hcfr_rank.r0_wp',
                 'hr_krl2_rank.r0_wp', 'krl2_hr_rank.r0_wp'],
        'result': [1.17, 1.17, 1.17, 1.17, 1.13, 1.13, 1, 1, 1, 1]}
df = pd.DataFrame(data)
df
Code for df after the drop:
data = {'item': ['dl3_hr_rank.r0', 'hr_kl3_rank.r0', 'hcrfr_hr_rank.r0',
                 'hcfr_hkfr_rank.r0_wp', 'hr_krl2_rank.r0_wp'],
        'result': [1.17, 1.17, 1.13, 1, 1]}
df = pd.DataFrame(data)
df
P.S. I'm having trouble inserting tables with the command...
Many thanks, regards.
Try:
df[~df.item.str.split('_').apply(frozenset).duplicated(keep='first')]
Result df:
                   item  result
0        dl3_hr_rank.r0    1.17
2        hr_kl3_rank.r0    1.17
4      hcrfr_hr_rank.r0    1.13
6  hcfr_hkfr_rank.r0_wp    1.00
8    hr_krl2_rank.r0_wp    1.00
Use pandas.Series.str.split to split each item on '_'.
Use apply(frozenset) to turn each list of parts into a hashable, order-insensitive set, so that duplicated can be used.
Use pandas.Series.duplicated with keep='first' to flag all but the first occurrence of each set, then negate with ~ to keep only those first occurrences.
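To make the intermediate steps concrete, here is a minimal, self-contained sketch on a three-row subset of the data above (variable names are mine):
import pandas as pd

df = pd.DataFrame({'item': ['dl3_hr_rank.r0', 'hr_dl3_rank.r0', 'hr_kl3_rank.r0'],
                   'result': [1.17, 1.17, 1.17]})

# 'dl3_hr_rank.r0' and 'hr_dl3_rank.r0' both split into the same frozenset
# {'dl3', 'hr', 'rank.r0'}, so the second row is flagged as a duplicate.
sets = df['item'].str.split('_').apply(frozenset)
mask = sets.duplicated(keep='first')   # [False, True, False]
deduped = df[~mask]                    # keeps rows 0 and 2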
Hi, I want to concatenate multiple columns together using pipes as the connector in pandas, and skip any column whose value is blank.
I tried the following code, but it does not skip empty values: it still adds a '|' for them, whereas I want to pass over the empty fields completely.
For example, currently it gives me 'N|911|WALLACE|AVE||||MT|031|000600'
while I want 'N|911|WALLACE|AVE|MT|031|000600'
df['key'] = df[['fl_predir','fl_prim_range','fl_prim_name','fl_addr_suffix','fl_postdir','fl_unit_desig','fl_sec_range','fl_st','fl_fips_county','blk']].agg('|'.join, axis=1)
Can anybody help me with this?
cols = ['fl_predir','fl_prim_range','fl_prim_name','fl_addr_suffix','fl_postdir','fl_unit_desig','fl_sec_range','fl_st','fl_fips_county','blk']
df['key'] = df[cols].apply(lambda row: '|'.join(x for x in row if x), axis=1, raw=True)
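As a quick check of the idea, here is a minimal sketch with made-up values for a few of the columns; the 'if x' test drops empty strings before joining:
import pandas as pd

sample = pd.DataFrame({'fl_predir': ['N'], 'fl_prim_range': ['911'],
                       'fl_prim_name': ['WALLACE'], 'fl_postdir': ['']})
cols = ['fl_predir', 'fl_prim_range', 'fl_prim_name', 'fl_postdir']
sample['key'] = sample[cols].apply(lambda row: '|'.join(x for x in row if x),
                                   axis=1, raw=True)
# sample['key'][0] == 'N|911|WALLACE'  -- the empty fl_postdir is skipped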
You can use melt to flatten your dataframe, drop the null values, then group by index and finally concatenate the values:
cols = ['fl_predir', 'fl_prim_range', 'fl_prim_name', 'fl_addr_suffix' ,
'fl_postdir', 'fl_unit_desig', 'fl_sec_range', 'fl_st',
'fl_fips_county', 'blk']
df['key'] = (df[cols].melt(ignore_index=False)['value'].dropna()
.astype(str).groupby(level=0).agg('|'.join))
Output:
>>> df['key']
0 N|911|WALLACE|AVE|MT|31|600
Name: key, dtype: object
Alternative (Pandas < 1.1.0)
df['keys'] = (df[cols].unstack().dropna().astype(str)
.groupby(level=1).agg('|'.join))
I have a problem: I want to exclude from a column, and drop from my DF, all rows ending with "99".
I tried to create a list:
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the concerned rows, but how do I apply it to my DF to drop those rows?
I tried a few things but nothing worked.
Lately I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
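A minimal, self-contained example (column name and values invented for illustration):
import pandas as pd

df = pd.DataFrame({'XX': ['AB1299', 'CD3401', 'EF9999', 'GH1200']})
df = df[~df['XX'].str.endswith('99')]   # drops 'AB1299' and 'EF9999'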
I have a pandas df which has 7000 rows * 7 columns, and I have a list (row_list) that contains the values I want to filter on.
What I want to do is to keep the rows of df that contain a corresponding value from the list.
This is what I got when I tried it (my code is below the output):
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names = 'A')
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
and let us know the result. If it doesn't work, show a sample of df and df1.A.
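The reason the original attempt returns an empty frame is that row_list is a list of one-element lists, which never compare equal to the scalar strings stored in the column. A minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'D': ['a', 'b', 'c']})
df1 = pd.DataFrame({'A': ['a', 'c']})

# Built as in the question, row_list would be [['a'], ['c']] -- one-element lists
# that never match the scalar values in df.D. Passing the column itself works:
boolean_series = df.D.isin(df1.A)   # [True, False, True]
filtered_df = df[boolean_series]    # the rows where D is 'a' or 'c'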
A few other approaches you could try:
(1) generate separate dfs for each condition, concat them, then dedup (slow);
(2) write a custom function that annotates the frame with a bool column (default False, set True where the condition is fulfilled), then filter on that column (a rough sketch follows below);
(3) keep a list of the indices of all rows containing your row_list values, then filter with iloc based on that index list.
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
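A rough sketch of option (2), assuming a column 'D' and a row_list of plain values:
import pandas as pd

df = pd.DataFrame({'D': ['a', 'b', 'c', 'd']})
row_list = ['b', 'd']

df['keep'] = False                                 # default: not selected
df.loc[df['D'].isin(row_list), 'keep'] = True      # mark rows that match
filtered_df = df[df['keep']].drop(columns='keep')  # filter, then drop the helper column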
I have a pandas data frame with only two column names (a single row, which can also be considered the headers). I want to make a dictionary out of this, with the first column being the key and the second column being the value. I already tried the
to_dict() method, but it's not working as it's an empty dataframe.
Example:
df = |Land|Norway| should become {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution :
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list and the list to a dictionary.
import pandas as pd

def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: the very manual version above will raise an IndexError if your DataFrame does not have at least two columns; the zip version simply produces fewer pairs, dropping any unpaired last column.
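For instance (column names invented for illustration), four columns collapse to two key/value pairs:
import pandas as pd

df = pd.DataFrame(columns=['Land', 'Norway', 'City', 'Oslo'])
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
# df now has a single row: Land == 'Norway', City == 'Oslo'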
I have a specific problem with pandas: I need to select rows in a dataframe which start with specific letters.
Details: I've imported my data into a dataframe and selected the columns that I need. I've also narrowed it down to the row index I need. Now I also need to select the rows in another column whose values START with the letters 'pl'.
Is there any way to select rows based only on their first two characters?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advice appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter that you then apply to the dataframe. As the filter, you can use the method Series.str.startswith:
df_pl = df[df['Code'].str.startswith('pl')]