Erase columns where duplicated rows exist, in groups. Pandas - python

I need to blank out columns that contain only duplicated values within each Name group.
I cannot simply remove/drop a column, because a column that is all duplicates for one group could still be useful for another group.
So whenever a column contains only duplicates within a group, I need to make that column empty for the group (replace the values with np.nan, for example).
my df:
Name,B,C,D
Adam,20,dog,cat
Adam,20,cat,elephant
Katie,21,cat,cat
Katie,21,cat,dog
Brody,22,dog,dog
Brody,21,cat,dog
expected output:
# grouping by Name; each Name always appears exactly twice, no more, no less
Name,B,C,D
Adam,np.nan,dog,cat
Adam,np.nan,cat,elephant
Katie,np.nan,np.nan,cat
Katie,np.nan,np.nan,dog
Brody,22,dog,np.nan
Brody,21,cat,np.nan
I know I should use the groupby() and duplicated() functions,
but I don't know what this approach should look like:
output=df[df.duplicated(keep=False)].groupby('Name')
output=output.replace({True:'np.nan'},regex=True)

Use GroupBy.transform with a lambda function and DataFrame.mask for the replacement:
df = df.set_index('Name')
# True where a value is duplicated within its Name group; mask replaces those values with NaN
output = df.mask(df.groupby('Name').transform(lambda x: x.duplicated(keep=False))).reset_index()
print(output)
Name B C D
0 Adam NaN dog cat
1 Adam NaN cat elephant
2 Katie NaN NaN cat
3 Katie NaN NaN dog
4 Brody 22.0 dog NaN
5 Brody 21.0 cat NaN
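For a quick end-to-end check, here is a self-contained version of the same mask logic that rebuilds the sample frame from the question (all values copied from the example above):
import pandas as pd

df = pd.DataFrame({'Name': ['Adam', 'Adam', 'Katie', 'Katie', 'Brody', 'Brody'],
                   'B': [20, 20, 21, 21, 22, 21],
                   'C': ['dog', 'cat', 'cat', 'cat', 'dog', 'cat'],
                   'D': ['cat', 'elephant', 'cat', 'dog', 'dog', 'dog']})

df = df.set_index('Name')
# mark values duplicated within their Name group, then blank them out
output = df.mask(df.groupby('Name').transform(lambda x: x.duplicated(keep=False))).reset_index()
print(output)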

Related

Dataframe add column with list of values from groupby

I am trying to generate a list of values grouped by 'Melder' and add that list as a column to my dataframe, but apply(list) doesn't work in conjunction with new_df.insert():
This works, but generates a new DataFrame with only the groupby values
new_df2 = new_df.groupby('Melder')['SAG-Nummer'].apply(list)
This adds a column to my current dataframe, but the values are all NaN
Example:
my_df.insert(1,'Liste',my_df.groupby('Melder')['SAG-Nummer'].apply(list))
print(my_df)
SAG-Nummer Liste Melder
0 SAG-2001-0389 NaN Meyer
1 SAG-2001-0388 NaN Meyer
2 SAG-2001-1833 NaN Schmidt
3 SAG-2001-1836 NaN Berg
new_df2 = new_df.groupby('Melder')['SAG-Nummer'].apply(list)
print(new_df2)
Melder
Berg [SAG-2001-1836]
Meyer [SAG-2001-0389, SAG-2001-0388]
Schmidt [SAG-2001-1833]
Expected Result:
SAG-Nummer Liste Melder
0 SAG-2001-0389 [SAG-2001-0389, SAG-2001-0388] Meyer
1 SAG-2001-0388 [SAG-2001-0389, SAG-2001-0388] Meyer
2 SAG-2001-1833 [SAG-2001-1833] Schmidt
3 SAG-2001-1836 [SAG-2001-1836] Berg
Use the following transformation to expand the result of each group row-wise:
my_df.assign(Liste=my_df.groupby('Melder')['SAG-Nummer'].transform(lambda x: [x.values] * len(x)))
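The insert() attempt in the question produces NaNs because the grouped result is indexed by Melder while my_df has a default integer index, so nothing aligns. A small self-contained sketch of an alternative, mapping the per-group lists back onto each row with .map() (data copied from the example above):
import pandas as pd

my_df = pd.DataFrame({'SAG-Nummer': ['SAG-2001-0389', 'SAG-2001-0388', 'SAG-2001-1833', 'SAG-2001-1836'],
                      'Melder': ['Meyer', 'Meyer', 'Schmidt', 'Berg']})

lists = my_df.groupby('Melder')['SAG-Nummer'].apply(list)   # one list per Melder
my_df['Liste'] = my_df['Melder'].map(lists)                 # look the list up for every row
print(my_df)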

Merge dataframes based on substrings

I want to merge/join two large dataframes, where the values in the right dataframe's 'matching' column are assumed to be substrings of the left dataframe's 'id' column.
For illustration purposes:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'id':['abc','adcfek','acefeasdq'],'numbers':[1,2,np.nan],'add_info':[3123,np.nan,312441]})
df2=pd.DataFrame({'matching':['adc','fek','acefeasdq','abcef','acce','dcf'],'needed_info':[1,2,3,4,5,6],'other_info':[22,33,11,44,55,66]})
This is df1:
id numbers add_info
0 abc 1.0 3123.0
1 adcfek 2.0 NaN
2 acefeasdq NaN 312441.0
And this is df2:
matching needed_info other_info
0 adc 1 22
1 fek 2 33
2 acefeasdq 3 11
3 abcef 4 44
4 acce 5 55
5 dcf 6 66
And this is the desired output:
id numbers add_info needed_info other_info
0 abc 1.0 3123.0 NaN NaN
1 adcfek 2.0 NaN 2.0 33.0
2 adcfek 2.0 NaN 6.0 66.0
3 acefeasdq NaN 312441.0 3.0 11.0
So, as described, I only want to merge in the additional columns when the 'matching' value is a substring of the 'id' value. If it is the other way around, e.g. the 'id' 'abc' being a substring of the 'matching' value 'abcef', nothing should happen.
In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where an 'id' contains multiple 'matching' values. For the moment it is okay-ish to ignore these cases, but I'd like to learn how I can tackle this issue. Additionally, is it possible to mark the rows that were merged based on substrings versus the rows that were merged exactly?
You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows, and then filter that dataframe using a boolean Series:
df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
Docs:
pd.merge
Indexing and selecting data
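One detail worth noting: the filter drops df1 rows that have no match at all (the 'abc' row), whereas the desired output keeps them with NaNs. A small follow-up sketch, reusing df1 and the filtered frame from the snippet above, restores them with a left merge:
# left-join the matched extra columns back onto df1 so unmatched ids survive with NaNs
result = df1.merge(filtered[['id', 'needed_info', 'other_info']], on='id', how='left')
print(result)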
If your processing can handle CROSS JOIN (problematic with large datasets), then you could cross join and then delete/filter only those you want.
cross = df1.merge(df2, how='cross')                                        # all row combinations
mask = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1)   # boolean map: is 'matching' a substring of 'id'?
final = cross[mask]                                                        # keep only rows where the condition was met
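Neither answer covers the last sub-question, flagging which rows were matched exactly and which only by substring. A minimal sketch building on the filtered frame from the first snippet (the match_type column name is my own invention; the same idea works with final from the second snippet):
import numpy as np

# label each surviving row by how it matched
filtered = filtered.assign(match_type=np.where(filtered['matching'] == filtered['id'], 'exact', 'substring'))
print(filtered[['id', 'matching', 'match_type']])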

Access PostgreSQL hstore keys and values in Python and create new dataframe column for each key

I manage a PostgreSQL database and am working on a tool for users to access a subset of the database. The database has a number of columns, and in addition we use a huge number of hstore keys to store additional information specific to certain rows in the database. Basic example below
A B C hstore
"foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"
"bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar" 2 8
"baz" 3 1 "Fruit"=>"apple", "Name"=>"David"
The data is routinely exported to a CSV file like this:
COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;
I read this into a Pandas dataframe in Python like this:
import pandas as pd
df = pd.read_csv('/filepath/file.csv')
I then access a subset of the data. This subset should have a common set of hstore keys in most, but not necessarily all rows.
I would like to create a separate column for each of the hstore keys. Where a key does not exist for a row, the cell should be left empty, or filled with a NULL or NaN value, whatever is easiest. What is the most effective way to do this?
You can use .str.extractall() to extract the keys and values from the hstore column, then use .pivot() to turn the keys into column labels. Aggregate the entries for each row of the original dataframe with .groupby() and .agg(), set NaN for empty entries with .replace(), and finally join the result back to the original dataframe with .join():
import numpy as np  # np.nan is used below; only pandas was imported above

df.join(df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
Result:
A B C hstore Country Fruit Name Pet
0 "foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" Norway apple NaN dog
1 "bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname" Suriname NaN NaN cat
2 "foobar" 2 8 None NaN NaN NaN NaN
3 "baz" 3 1 "Fruit"=>"apple", "Name"=>"David" NaN apple David NaN
If you want to get a new dataframe for the extraction instead of joining back to the original dataframe, you can remove the .join() step and do a .reindex(), as follows:
df_out = (df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
df_out = df_out.reindex(df.index)
Result:
print(df_out)
Country Fruit Name Pet
0 Norway apple NaN dog
1 Suriname NaN NaN cat
2 NaN NaN NaN NaN
3 NaN apple David NaN
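For comparison, here is a self-contained sketch of a simpler route: parse each hstore string into a dict with a regular expression and let the DataFrame constructor expand the keys into columns. The small frame below is a hand-built stand-in for the CSV export described in the question:
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foobar', 'baz'],
    'B': [1, 4, 2, 3],
    'C': [4, 6, 8, 1],
    'hstore': ['"Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"',
               '"Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"',
               np.nan,
               '"Fruit"=>"apple", "Name"=>"David"'],
})

def parse_hstore(s):
    # one dict of key/value pairs per row; empty dict where hstore is missing
    if pd.isna(s):
        return {}
    return dict(re.findall(r'"(.+?)"=>"(.+?)"', s))

expanded = pd.DataFrame([parse_hstore(s) for s in df['hstore']], index=df.index)
print(df.join(expanded))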

How would I pivot this basic table using pandas?

What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_7 ... atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a groupby and then a pivot, but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge(), but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be the "Y" or "labels" column of the dataset I am trying to predict.
Thank you!
First use cumcount to number the values within each group; these counts become the column labels created by pivot. Then add any missing columns with reindex_axis, change the column names with add_prefix, and finally reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
        .reindex_axis(range(1, 8), 1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
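The answer above targets an older pandas release; reindex_axis has since been removed. A sketch of the same idea against the current API, with sample input reconstructed from the desired output (the visit_id/atc_code column names are taken from the answer's code):
import pandas as pd

# sample input reconstructed from the desired output above
df = pd.DataFrame({
    'visit_id': [48944282] * 5 + [48944305] * 4,
    'atc_code': ['A02AG', 'J01CA04', 'J095AX02', 'N02BE01', 'R05X',
                 'A02AG', 'A03AX13', 'N02BE01', 'R05X'],
})

g = df.groupby('visit_id').cumcount() + 1        # position of each code within its visit
out = (df.assign(col=g)
         .pivot(index='visit_id', columns='col', values='atc_code')
         .reindex(columns=range(1, 8))           # pad to a fixed number of columns
         .add_prefix('atc_')
         .reset_index())
print(out)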

Merging two Excel files by ID and combining columns with same name (python, pandas)

I am new to Stack Overflow and to pandas for Python. I found part of my answer in the post Looking to merge two Excel files by ID into one Excel file using Python 2.7.
However, I also want to merge or combine columns with the same name from the two Excel files. I thought the following post would have my answer, but I guess it's not titled correctly: Merging Pandas DataFrames with the same column name
Right now I have the code:
import pandas as pd
file1 = pd.read_excel("file1.xlsx")
file2 = pd.read_excel("file2.xlsx")
file3 = file1.merge(file2, on="ID", how="outer")
file3.to_excel("merged.xlsx")
file1.xlsx
ID,JanSales,FebSales,test
1,100,200,cars
2,200,500,
3,300,400,boats
file2.xlsx
ID,CreditScore,EMMAScore,test
2,good,Watson,planes
3,okay,Thompson,
4,not-so-good,NA,
what I get is merged.xlsx
ID,JanSales,FebSales,test_x,CreditScore,EMMAScore,test_y
1,100,200,cars,NaN,NaN,
2,200,500,,good,Watson,planes
3,300,400,boats,okay,Thompson,
4,NaN,NaN,,not-so-good,NaN,
what I want is merged.xlsx
ID,JanSales,FebSales,CreditScore,EMMAScore,test
1,100,200,NaN,NaN,cars
2,200,500,good,Watson,planes
3,300,400,okay,Thompson,boats
4,NaN,NaN,not-so-good,NaN,NaN
In my real data, there are 200+ columns that correspond to the "test" column in my example. I want the program to find these columns with the same names in both file1.xlsx and file2.xlsx and combine them in the merged file.
OK, here is a more dynamic way. After merging, we assume that name clashes will occur and result in 'column_name_x' / 'column_name_y' suffixes (below, df1 and df2 are your two input frames and df3 is the merged result, as in your file1/file2/file3 code).
So first figure out the common column names and remove 'ID' from this list:
In [51]:
common_columns = list(set(list(df1.columns)) & set(list(df2.columns)))
common_columns.remove('ID')
common_columns
Out[51]:
['test']
Now we can iterate over this list to create the new column, using where to conditionally assign the value depending on which of the two is not null.
In [59]:
for col in common_columns:
    df3[col] = df3[col+'_x'].where(df3[col+'_x'].notnull(), df3[col+'_y'])
df3
Out[59]:
ID JanSales FebSales test_x CreditScore EMMAScore test_y test
0 1 100 200 cars NaN NaN NaN cars
1 2 200 500 NaN good Watson planes planes
2 3 300 400 boats okay Thompson NaN boats
3 4 NaN NaN NaN not-so-good NaN NaN NaN
[4 rows x 8 columns]
Then just to finish off drop all the extra columns:
In [68]:
clash_names = [elt+suffix for elt in common_columns for suffix in ('_x','_y') ]
clash_names
df3.drop(labels=clash_names, axis=1,inplace=True)
df3
Out[68]:
ID JanSales FebSales CreditScore EMMAScore test
0 1 100 200 NaN NaN cars
1 2 200 500 good Watson planes
2 3 300 400 okay Thompson boats
3 4 NaN NaN not-so-good NaN NaN
[4 rows x 6 columns]
The snippet above is from this: Prepend prefix to list elements with list comprehension
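For reference, a compact end-to-end sketch of the same approach, using in-memory frames built from the CSV samples above in place of the Excel files (so it runs without file1.xlsx/file2.xlsx):
import numpy as np
import pandas as pd

file1 = pd.DataFrame({'ID': [1, 2, 3],
                      'JanSales': [100, 200, 300],
                      'FebSales': [200, 500, 400],
                      'test': ['cars', np.nan, 'boats']})
file2 = pd.DataFrame({'ID': [2, 3, 4],
                      'CreditScore': ['good', 'okay', 'not-so-good'],
                      'EMMAScore': ['Watson', 'Thompson', np.nan],
                      'test': ['planes', np.nan, np.nan]})

merged = file1.merge(file2, on='ID', how='outer')
common = [c for c in file1.columns if c in file2.columns and c != 'ID']
for col in common:
    # prefer the value from file1, fall back to file2, then drop the suffixed clones
    merged[col] = merged[col + '_x'].where(merged[col + '_x'].notnull(), merged[col + '_y'])
    merged = merged.drop(columns=[col + '_x', col + '_y'])
print(merged)
# merged.to_excel('merged.xlsx') would write the result back out, as in the question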
