pandas dataframe - select rows that are similar - python

Is there a way to select rows that are 'similar' (NOT DUPLICATES!) in a pandas dataframe?
I have a dataframe that has columns including 'school_name' and 'district'.
I want to see if there are any schools that have similar names in different districts.
All I can think of is to select a random school name and manually check if similar names exist in the dataframe, by running something like:
df[df['school_name'].str.contains('english grammar')]
Is there a more efficient way to do this?
Edit: I am ultimately going to string-match this particular dataframe with another dataframe on the school_name column, while blocking on the district column. The issue is that some district names in the two dataframes don't exactly match, because district names have changed over time - df1 is from the year 2000 and has districts in 'city-i', 'city-ii', ... format, whereas df2, which is from the year 2020, has districts in 'city-south', 'city-north', ... format. The names have changed over time, and the districts have been reshuffled, merged and separated.
So I am trying to remove the 'i', 'ii', 'south', etc. to just block on 'city'. Before I do that, I want to make sure that there are no similar-sounding names in 'city-i' and 'city-south' (because I wouldn't want to match similar-sounding names together if the schools are actually in completely different districts).
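A minimal sketch of the kind of check described above, assuming df has columns literally named school_name and district and using difflib from the standard library (the 0.9 similarity cutoff is an arbitrary choice, and the pairwise comparison is quadratic, so it only suits moderately sized frames):
import difflib
import itertools

import pandas as pd

# pairwise-compare every distinct (school_name, district) row; this is quadratic,
# so it is only practical for a moderate number of schools
schools = df[['school_name', 'district']].drop_duplicates()

similar = [
    (a.school_name, a.district, b.school_name, b.district)
    for a, b in itertools.combinations(schools.itertuples(index=False), 2)
    if a.district != b.district
    and difflib.SequenceMatcher(None, a.school_name.lower(), b.school_name.lower()).ratio() > 0.9
]

similar_df = pd.DataFrame(similar, columns=['school_1', 'district_1', 'school_2', 'district_2'])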

Let's say this is your DataFrame.
df = pd.DataFrame({'school': ['High School',
                              'High School',
                              'Middle School',
                              'Middle School'],
                   'district': ['Main District',
                                'Unknown District',
                                'Unknown District',
                                'Some other District']
                   })
Then you can use a combination of pandas.DataFrame.duplicated() and pandas.DataFrame.groupby().
(df[df.duplicated('school', keep=False)]
   .groupby('school')
   .apply(lambda x: ', '.join(x['district']))
   .to_frame('duplicated names by district'))
This returns a DataFrame which looks like this:
                                duplicated names by district
school
High School                  Main District, Unknown District
Middle School          Unknown District, Some other District
To get all duplicated rows only, based on a column school_name:
df[df['school_name'].duplicated(keep=False)]
If you want to get rows for a single school_name value:
# If we are searching for school let's say 'S1'
name = "S1"
df.loc[df['school_name'] == name]
To select rows whose school_name is in a given list, use isin:
dupl_list = ['S1', 'S2', 'S3']
df.loc[df['school_name'].isin(dupl_list)]

You can build two sets: one holds each school name seen so far, and the other collects a name once it is encountered again while traversing the column.
school = set()
duplicate = set()
for i in df1.schl_name:
    if i not in school:
        school.add(i)
    else:
        duplicate.add(i)
df1[df1['schl_name'].isin(duplicate)]
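For comparison, pandas' built-in duplicate detection should select the same rows in one line:
df1[df1['schl_name'].duplicated(keep=False)]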

Related

How can I compare two pandas dataframes when the rows of one dataframe are a substring of the rows of another dataframe?

I have two dataframes whose rows correspond to each other, with one dataframe's values being substrings of the other's. An example:
Name       Mile Time
Donald     7:15
Bobby      6:43
Christian  5:05

Name       Mile Time
Don        7:13
Bob        6:46
Chris      5:00
Essentially, Don is a substring of Donald, Bob of Bobby, etc., and these rows correspond to each other. I want to figure out a way to compare the corresponding rows (like how 7:15 for Donald is different from 7:13 for Don). Ideally, I would be able to have a new dataframe that puts each string and substring next to each other, followed by the corresponding differences (it would look like Donald, Don, 7:15, 7:13). To be clear, Donald/Don, Bobby/Bob, Christian/Chris are all rows and not columns; the columns would be Name, Mile Time, and each row is a name and a time. I am currently thinking of iterating through the dataframe with the longer names by doing
for index, rows in df2.iterrows():
    substring = rows['Name']
    data = df1.loc[df1['Name'].str.contains(substring, case=False)]
    if data is not None:
        name1 = data['Name']
        time1 = data['Mile Time']
        name2 = rows['Name']
        time2 = rows['Mile Time']
But I'm not entirely sure how many different columns will be in each row, because there might be multiple attributes like age, sex, etc., so I'm not sure how to go about collecting that data and putting it into something meaningful to compare the two visually. Ideally it goes into a new dataframe and is converted to CSV so that I can see the differences side by side.
EDIT: Posting with DataFrames and Desired Output
full_name_dict = {'Name': ['Donald', 'Bobby', 'Christian'], 'Mile Time': ['7:15', '6:43', '5:05']}
substring_dict = {'Name': ['Don', 'Bob', 'Chris'], 'Mile Time': ['7:13', '6:46', '5:00']}
df1 = pd.DataFrame(full_name_dict)
df2 = pd.DataFrame(substring_dict)
Desired Output...
output_wanted = {'Name1': ['Donald', 'Bobby', 'Christian'], 'Name2': ['Don', 'Bob', 'Chris'],
                 'Time1': ['7:15', '6:43', '5:05'], 'Time2': ['7:13', '6:46', '5:00']}
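A minimal sketch of one way to get the desired output, assuming each short name occurs in exactly one full name: do a cross join (pandas >= 1.2) and keep the rows where the short name is a substring of the full name. Any extra columns (age, sex, ...) would be carried through automatically with the same 1/2 suffixes.
import pandas as pd

df1 = pd.DataFrame({'Name': ['Donald', 'Bobby', 'Christian'],
                    'Mile Time': ['7:15', '6:43', '5:05']})
df2 = pd.DataFrame({'Name': ['Don', 'Bob', 'Chris'],
                    'Mile Time': ['7:13', '6:46', '5:00']})

# pair every full name with every short name, then keep only the pairs
# where the short name actually occurs inside the full name
pairs = df1.merge(df2, how='cross', suffixes=('1', '2'))
mask = [row['Name2'].lower() in row['Name1'].lower() for _, row in pairs.iterrows()]
result = pairs[mask].rename(columns={'Mile Time1': 'Time1', 'Mile Time2': 'Time2'})
print(result)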

Vlookup based on multiple columns in Python DataFrame

I have two dataframes. I am trying to Vlookup the 'Hobby' column from the 2nd dataframe and update the 'Interests' column of the 1st dataframe. Please note that the columns Key, Employee and Industry should match exactly between the two dataframes, but for City it is acceptable if only the first part of the city matches. Though it is straightforward in Excel, it looks a bit complicated to implement in Python. Any cue on how to proceed will be really helpful. (Please see the screenshot below for the expected output.)
data1=[['AC32456','NYC-URBAN','JON','BANKING','SINGING'],['AD45678','WDC-RURAL','XING','FINANCE','DANCING'],
['DE43216', 'LONDON-URBAN','EDWARDS','IT','READING'],['RT45327','SINGAPORE-URBAN','WOLF','SPORTS','WALKING'],
['Rs454457','MUMBAI-RURAL','NEMBIAR','IT','ZUDO']]
data2=[['AH56245','NYC','MIKE','BANKING','BIKING'],['AD45678','WDC','XING','FINANCE','TREKKING'],
['DE43216', 'LONDON-URBAN','EDWARDS','FINANCE','SLEEPING'],['RT45327','SINGAPORE','WOLF','SPORTS','DANCING'],
['RS454457','MUMBAI','NEMBIAR','IT','ZUDO']]
List1=['Key','City','Employee', 'Industry', 'Interests']
List2=['Key','City','Employee', 'Industry', 'Hobby']
df1=pd.DataFrame(data1, columns=List1)
df2=pd.DataFrame(data2,columns=List2)
Set the index of df1 to the key columns (you can set the index to whatever you want to match on) and then use update:
# get the first part of the city
df1['City_key'] = df1['City'].str.split('-', expand=True)[0]
df2['City_key'] = df2['City'].str.split('-', expand=True)[0]
# set index
df1 = df1.set_index(['Key', 'Employee', 'Industry', 'City_key'])
# update
df1['Interests'].update(df2.set_index(['Key', 'Employee', 'Industry', 'City_key'])['Hobby'])
# reset index and drop the City_key column
new_df = df1.reset_index().drop(columns=['City_key'])
Key Employee Industry City Interests
0 AC32456 JON BANKING NYC-URBAN SINGING
1 AD45678 XING FINANCE WDC-RURAL TREKKING
2 DE43216 EDWARDS IT LONDON-URBAN READING
3 RT45327 WOLF SPORTS SINGAPORE-URBAN DANCING
4 Rs454457 NEMBIAR IT MUMBAI-RURAL ZUDO
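An alternative sketch of the same lookup using a left merge instead of update, assuming df1 and df2 still have City_key as ordinary columns (i.e. before the set_index step above):
# look up Hobby on the same key columns with a left merge
merged = df1.merge(
    df2[['Key', 'Employee', 'Industry', 'City_key', 'Hobby']],
    on=['Key', 'Employee', 'Industry', 'City_key'],
    how='left',
)
# where a match was found, Hobby overwrites Interests; otherwise keep the original
merged['Interests'] = merged['Hobby'].fillna(merged['Interests'])
new_df = merged.drop(columns=['Hobby', 'City_key'])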

What is a better way to implement this idea on a Pandas dataframe?

I have a pandas dataframe called df which contains 26,000 rows. This dataframe includes 10 columns called "first price", "second price", ..., "tenth price".
I want to add a new column called "y" to this dataframe, defined like this:
For example, the value of "y" in row 26 is the name of the column whose value in row 26 is closest to the "first price" value of row 27 (i.e. row 26+1).
I implemented this with the code below, but it runs very slowly even for a sample of 1,000 rows, let alone 26,000!
y = []
for i in range(1000):
    y.append((abs(df[df.index == i] - df["first price"][i + 1])).idxmin(axis=1)[i])
for i in range(1000, len(df)):
    y.append(0)
df["y"] = y
Do you know a better way?
You want to reshape the data to make it tidy. It's not good to have a bunch of columns all with the same value type (first price, second price, etc.). Better to have the type in its own column and the price beside it. Since you are comparing everything to the first price, you can leave it in its own index column and melt the rest of the columns into pairs of 'price number' and 'price' before finding the minimum of each 'item' (or what you had as rows in your example):
import numpy as np
import pandas as pd

# example data:
np.random.seed(11)
df = (pd.DataFrame(np.random.choice(range(100), (6, 4)),
                   columns=['first', 'second', 'third', 'fourth'])
        .rename_axis('item_id')
        .reset_index())
# reshape data to make it easier to work with
df = df.melt(id_vars=['item_id', 'first'], var_name='price_number', value_name='price')
# calculate price differences
df['price_diff'] = (df.price - df['first']).abs()
# find the minimum price difference for each item
df_closest = df.loc[df.groupby('item_id')['price_diff'].idxmin()]
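If you do want the exact behaviour described in the question (each row compared to the next row's "first price"), here is a minimal vectorized sketch. It assumes the question's original wide dataframe with columns literally named "first price" through "tenth price", not the melted example above:
# compare every price column of row i to the "first price" of row i+1
price_cols = [c for c in df.columns if c.endswith("price")]
next_first = df["first price"].shift(-1)              # "first price" of the following row
diffs = df[price_cols].sub(next_first, axis=0).abs()  # |price - next row's first price|
y = diffs.iloc[:-1].idxmin(axis=1)                    # name of the closest column, per row
df["y"] = y.reindex(df.index, fill_value=0)           # last row has no successor; keep the 0 used above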

How to take specific columns in pandas dataframe only if they exist (different CSVs)

I downloaded a bunch of football data from the internet in order to analyze it (around 30 CSV files).
Each season's game data is saved as a CSV file with different data columns.
Some data columns are common to all files, e.g. Home team, Away team, Full time result, ref name, etc.
Earlier years' CSV data columns (picture) - these columns are common to all CSVs.
However, in more recent years the data became richer and has some new data columns, e.g. Corners for home team, Corners for away team, yellow cards for each team, shots on goal for each side, etc.
Recent years' CSV data columns (picture) - contains the common columns as well as additional ones.
I made a generic function that takes each season's CSV gameweek data and turns it into a full table (how it looked at the end of the season) with different stats.
Now when I try to build the "final-day" table of each season from the common data columns alone, everything works out fine. However, when I try to throw in the uncommon columns (corners for example) I get an error.
This is no surprise to me and I know how to check whether a CSV includes a certain column, but I'd like to know if there is a clever way to tell the dataset to take a certain column if it exists (say 'Corners') and just skip it if it does not.
Below is the part of the function that raises the error; the last line is the problematic one. When I leave only the common columns in (i.e. delete every column after FTR), the function works fine.
The code in general gets one season at a time and builds the table.
# create a pandas dataframe of a specific season before the season started
# returns a pandas dataframe with the year of the season and the teams involved with initialized stats
# path is the full path of the file retained by glob function, and raw_data is a pandas dataframe read directly from the CSV
def create_initial_table(path, raw_data):
    # extract the season's year
    season_number = path[path.index("/") + 1:path.index(".")]
    # reduce the information to the relevant columns
    raw_data = raw_data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']]
With this setup I'd like to continue, i.e. when a column name does not exist, just skip to the next one, so the columns that do exist remain and the missing ones don't raise an error.
In later functions I also update the values of these columns (corners, shots on goal, etc...), so the same skip functionality is needed there too.
Thanks for the advice :>
You can use DataFrame.filter(items=...), see this example:
import numpy as np
import pandas as pd

all_columns = ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']
df = pd.DataFrame(np.random.rand(5, 5), columns=['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'BAD COLUMN'])
print(df)
HomeTeam AwayTeam FTHG FTAG BAD COLUMN
0 0.265389 0.523248 0.093941 0.575946 0.929296
1 0.318569 0.667410 0.131798 0.716327 0.289406
2 0.183191 0.586513 0.020108 0.828940 0.004695
3 0.677817 0.270008 0.735194 0.962189 0.248753
4 0.576157 0.592042 0.572252 0.223082 0.952749
Even though I feed it column names that don't exist in the dataframe, it will only pull out the columns that do exist:
new_df = df.filter(items=all_columns)
print(new_df)
HomeTeam AwayTeam FTHG FTAG
0 0.265389 0.523248 0.093941 0.575946
1 0.318569 0.667410 0.131798 0.716327
2 0.183191 0.586513 0.020108 0.828940
3 0.677817 0.270008 0.735194 0.962189
4 0.576157 0.592042 0.572252 0.223082
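For the update step mentioned in the question ("the same skip functionality is needed there too"), a small sketch of the same idea using Index.intersection; stats_to_update and the fillna placeholder are hypothetical, just to illustrate touching only the columns a given CSV actually has:
# hypothetical list of optional stat columns we might want to update
stats_to_update = ['HC', 'AC', 'HST', 'AST', 'HS', 'AS']
# keep only the ones that actually exist in this season's file
present = raw_data.columns.intersection(stats_to_update)
for col in present:
    # placeholder update logic; replace with the real per-team accumulation
    raw_data[col] = raw_data[col].fillna(0)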

Python Pandas extract unique values from a column and another column

I am studying pandas, bokeh, etc. to get started with data visualization. Right now I am practising with a giant table containing different birds. There are plenty of columns; two of them are "SCIENTIFIC NAME" and "OBSERVATION COUNT".
I want to extract those two columns.
I did
df2 = df[["SCIENTIFIC NAME" , "OBSERVATION COUNT"]]
but the problem is that every entry ends up in the table (sometimes there are multiple rows for the same SCIENTIFIC NAME due to other columns, while the OBSERVATION COUNT is always the same for that scientific name).
How can I get those two columns but with unique values, so every scientific name appears once, with the corresponding observation count?
EDIT: I just realized that sometimes the same scientific names have different observation counts due to another column. Is there a way to extract every first unique item from a column?
IIUC, you can use drop_duplicates:
df2 = df[["SCIENTIFIC NAME" , "OBSERVATION COUNT"]].drop_duplicates()
To get counts:
df2 = df.groupby(["SCIENTIFIC NAME" , "OBSERVATION COUNT"])["SCIENTIFIC NAME"].count()
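For the edit (the same scientific name can carry different observation counts), a minimal sketch that keeps the first count seen per name, assuming the same column names:
# keep the first OBSERVATION COUNT encountered for each SCIENTIFIC NAME
df2 = (df[["SCIENTIFIC NAME", "OBSERVATION COUNT"]]
       .drop_duplicates(subset="SCIENTIFIC NAME", keep="first"))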
