Changing DataFrame columns and data based on logic - python

So I have a pandas DataFrame consisting of data from boxing matches and their odds of winning. Its columns are:
[Red_fighter, Blue_fighter, Red_odds, Blue_odds, winner]
I want to change it so that if, for example, Blue's odds are less than Red's, then Blue gets added to a 'Favourite' column and Red to an 'Underdog' column, both of which replace 'Red_fighter' and 'Blue_fighter':
[favourite, underdog, favourite_odds, underdog_odds, winner]
So if I have:
{'Red_fighter' : 'Tom Jones', 'Blue_fighter' : 'Jack Jackson', 'Red_odds' : 200, 'Blue_odds' : -200 , 'Winner' : 'Blue'}
It becomes:
{'Underdog' : 'Tom Jones', 'Favourite' : 'Jack Jackson', 'Red_odds' : 200, 'Blue_odds' : -200 , 'Winner' : 'Favourite'}
I appreciate any help you can give, I'm a newbie to pandas and data analytics in general, thanks!

You can achieve this using the pd.Series.where method (remembering that the favourite is the fighter with the lower odds):
df['Underdog'] = df.Red_fighter.where(df.Red_odds > df.Blue_odds, df.Blue_fighter)
df['Favourite'] = df.Red_fighter.where(df.Red_odds < df.Blue_odds, df.Blue_fighter)
df['Underdog_odds'] = df.Red_odds.where(df.Red_odds > df.Blue_odds, df.Blue_odds)
df['Favourite_odds'] = df.Red_odds.where(df.Red_odds < df.Blue_odds, df.Blue_odds)
This method works by replacing the values where a condition is NOT satisfied with values from another series. The remaining values which satisfy the condition are left untouched.
So for example if we have df.A.where(cond, df.B), all rows where cond is True will have values from A and all rows where cond is False will have values from B. There is more information in the documentation.
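With the favourite taken as the fighter with the lower odds, the sample row from the question works out as follows (a self-contained check):

```python
import pandas as pd

df = pd.DataFrame({'Red_fighter': ['Tom Jones'], 'Blue_fighter': ['Jack Jackson'],
                   'Red_odds': [200], 'Blue_odds': [-200], 'Winner': ['Blue']})

# keep the Red value where the condition holds, otherwise fall back to Blue
df['Favourite'] = df.Red_fighter.where(df.Red_odds < df.Blue_odds, df.Blue_fighter)
df['Underdog'] = df.Red_fighter.where(df.Red_odds > df.Blue_odds, df.Blue_fighter)

print(df[['Favourite', 'Underdog']])  # Jack Jackson is the favourite at -200
```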

Related

How to create a new column conditional on two other columns in python?

I want to create a new column conditional on two other columns in python.
Below is the dataframe:
name    address
apple   hello1234
banana  happy111
apple   str3333
pie     diary5144
I want to create a new column "want", conditional on columns "name" and "address".
The rules are as follows:
(1) If the value in "name" is apple, then the value in "want" should be the first five letters in column "address".
(2) If the value in "name" is banana, then the value in "want" should be the first four letters in column "address".
(3) If the value in "name" is pie, then the value in "want" should be the first three letters in column "address".
The dataframe I want looks like this:
name    address    want
apple   hello1234  hello
banana  happy111   happ
apple   str3333    str33
pie     diary5144  dia
How do I address such a problem? Thanks!
I hope you are well,
import pandas as pd

# Initialize data of lists.
data = {'Name': ['Apple', 'Banana', 'Apple', 'Pie'],
        'Address': ['hello1234', 'happy111', 'str3333', 'diary5144']}

# Create DataFrame
df = pd.DataFrame(data)

# Add an empty column
df['Want'] = ''

for i in range(len(df)):
    # assign via .loc to avoid chained-assignment warnings
    if df['Name'].iloc[i] == "Apple":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:5]
    if df['Name'].iloc[i] == "Banana":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:4]
    if df['Name'].iloc[i] == "Pie":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:3]

# Print the DataFrame
print(df)
I hope it helps,
Have a lovely day
I think a broader way of doing this is by creating a conditional map dict and applying it with lambda functions on your dataset.
Creating the dataset:
import pandas as pd
data = {
    'name': ['apple', 'banana', 'apple', 'pie'],
    'address': ['hello1234', 'happy111', 'str3333', 'diary5144']
}
df = pd.DataFrame(data)
Defining the conditional dict:
conditionalMap = {
    'apple': lambda s: s[:5],
    'banana': lambda s: s[:4],
    'pie': lambda s: s[:3]
}
Applying the map:
df.loc[:, 'want'] = df.apply(lambda row: conditionalMap[row['name']](row['address']), axis=1)
With the resulting df:
     name    address   want
0   apple  hello1234  hello
1  banana   happy111   happ
2   apple    str3333  str33
3     pie  diary5144    dia
You could do the following:
for string, length in {"apple": 5, "banana": 4, "pie": 3}.items():
    mask = df["name"].eq(string)
    df.loc[mask, "want"] = df.loc[mask, "address"].str[:length]
Iterate over the 3 conditions: string is the string on which the length requirement depends, and the length requirement is stored in length.
Build a mask via df["name"].eq(string) which selects the rows with value string in column name.
Then set column want at those rows to the adequately clipped column address values.
Result for the sample dataframe:
name address want
0 apple hello1234 hello
1 banana happy111 happ
2 apple str3333 str33
3 pie diary5144 dia

Finding the Corresponding Max Value in a Data Frame

I have the following code
import pandas as pd
df = {'sport' : ['football', 'hockey', 'baseball', 'basketball', 'nan'], 'league': ['NFL', 'NHL', 'MLB', 'NBA', 'NaN'], 'number': [1,2,3,4,'']}
df = pd.DataFrame(df)
df
I'd like to print out the sport with the highest number
I've tried the following code:
highestnumberedsport = df[df['number'] == df['number'].max()]
However, this gives me the entire row. I'd like to only print out the value in the sport column.
Update: I've tried the following based on suggestion, but this still does not output the desired string of 'basketball' in this instance:
df.loc[df['number'] == df['number'].max(), 'sport']
how do I only print out the sport that has the highest value?
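For what it's worth, one sketch of a way to get just the string: because the empty string makes the 'number' column object-typed, coerce it to numeric first, then use idxmax to find the row label of the maximum and .loc to pick the single cell.

```python
import pandas as pd

df = pd.DataFrame({'sport': ['football', 'hockey', 'baseball', 'basketball', 'nan'],
                   'league': ['NFL', 'NHL', 'MLB', 'NBA', 'NaN'],
                   'number': [1, 2, 3, 4, '']})

# the '' entry makes 'number' an object column; coerce it to numeric
# (the '' becomes NaN, which idxmax skips by default)
df['number'] = pd.to_numeric(df['number'], errors='coerce')

# idxmax gives the row label of the maximum; .loc then picks one cell
top_sport = df.loc[df['number'].idxmax(), 'sport']
print(top_sport)  # basketball
```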

Is there a way to index a list using a series without using a loop?

Result = pd.DataFrame({
    'File': filenames_,
    'Actual Classes': Actual_classes,
    'Predicted Classes': Predicted_classes
})
Result.sample(frac=0.02)
Actual Classes and Predicted Classes are integer values ranging from 1 to 8. I want to create a new column in the dataframe using this list of 9 strings:
['Black Sea Sprat', 'Gilt-Head Bream', 'Hourse Mackerel', 'Red Mullet',
'Red Sea Bream', 'Sea Bass', 'Shrimp', 'Striped Red Mullet', 'Trout']
By indexing the values in df to the list without using a loop, rather by using the inbuilt pandas function.
I actually want a new column added to the dataframe using the list with indices corresponding to the row.
How about using apply?
classes = ['Black Sea Sprat', 'Gilt-Head Bream', 'Hourse Mackerel', 'Red Mullet',
'Red Sea Bream', 'Sea Bass', 'Shrimp', 'Striped Red Mullet', 'Trout']
Result['class name'] = Result['Predicted Classes'].apply(lambda x: classes[x])
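An alternative that avoids a Python-level lambda is Series.map with a code-to-name dict; the predicted values below are made-up stand-ins for the real column:

```python
import pandas as pd

classes = ['Black Sea Sprat', 'Gilt-Head Bream', 'Hourse Mackerel', 'Red Mullet',
           'Red Sea Bream', 'Sea Bass', 'Shrimp', 'Striped Red Mullet', 'Trout']

# hypothetical predicted class codes standing in for the real column
result = pd.DataFrame({'Predicted Classes': [0, 3, 8]})

# map integer codes to class names via a lookup dict
result['class name'] = result['Predicted Classes'].map(dict(enumerate(classes)))
```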

How to remove duplicates based on partial match

I don't even know how to approach this, as it feels too complex for my level.
Imagine courier tracking numbers: I am receiving some duplicated updates from an upstream system in the following format.
See the attached image, or this small piece of code that creates such a table:
import pandas as pd
incoming_df = pd.DataFrame({
    'Tracking ID': ['4845', '24345', '8436474', '457453', '24345-S2'],
    'Previous': ['Paris', 'Lille', 'Paris', 'Marseille', 'Dijon'],
    'Current': ['Nantes', 'Dijon', 'Dijon', 'Marseille', 'Lyon'],
    'Next': ['Lyone', 'Lyon', 'Lyon', 'Rennes', 'NICE']
})
incoming_df
incoming_df
Obviously, tracking ID 24345-S2 (green arrow) is a duplication of 24345 (red arrow), however, it is not fully duplicated but a newer, updated location information (with history) for the parcel. How do I delete old line 24345 and keep new line 24345-S2 in the data set?
The length of tracking ID can be from 4 to 20 chars but '-S2' is always helpfully appended.
Thank you!
Edit: New solution:
# extract the base IDs of entries carrying the '-S2' suffix
duplicates = incoming_df['Tracking ID'].str.extract(r'(.+)-S2').dropna()
# remove the older entries, if present
incoming_df = incoming_df[~incoming_df['Tracking ID'].isin(duplicates[0].unique())]
If the 1234-S2 entry always appears lower in the DataFrame than the 1234 entry, you could do something like:
# remove the suffix from all entries
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].apply(lambda x: x.split('-')[0])
# keep only the last entry of the duplicates
incoming_df = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
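Putting those two steps together on a trimmed version of the sample frame (only two columns kept for brevity):

```python
import pandas as pd

incoming_df = pd.DataFrame({
    'Tracking ID': ['4845', '24345', '8436474', '457453', '24345-S2'],
    'Current': ['Nantes', 'Dijon', 'Dijon', 'Marseille', 'Lyon'],
})

# strip any '-S2' style suffix, then keep only the last occurrence of each ID
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].str.split('-').str[0]
deduped = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
```

The old 24345 row is dropped and the newer one (Current = Lyon) survives, with the suffix removed.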

Check whether all unique values of column B are mapped to all unique values of column A

I need a little help. I know it's probably easy; I tried but didn't reach the goal.
# Import pandas library
import pandas as pd
data1 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600]]
df1 = pd.DataFrame(data1, columns = ['Country', 'Bottle_Weight'])
data2 = [['India', 350], ['India', 600],['India', 200], ['Bangladesh', 350],['Bangladesh', 600]]
df2 = pd.DataFrame(data2, columns = ['Country', 'Bottle_Weight'])
data3 = [['India', 350], ['India', 600], ['Bangladesh', 350],['Bangladesh', 600],['Bangladesh', 200]]
df3 = pd.DataFrame(data3, columns = ['Country', 'Bottle_Weight'])
So basically I want to create a function which will check the mapping by comparing every other country's unique bottle weights with the first country's.
For the 1st dataframe, it should return text like: all unique values of 'Bottle_Weight' are mapped with all unique countries.
For the 2nd dataframe, it should return text like: 'Country_name' not mapped with 'column name' 'value'.
In this case: 'Bangladesh' not mapped with 'Bottle_Weight' 200.
For the 3rd dataframe, it should return text like: all unique values of 'Bottle_Weight' are mapped with all unique countries, and on a new line: 'Country_name' mapped with new value '200'.
It is not a particularly efficient algorithm, but I think this should get you the results you are looking for.
import numpy as np

def check_weights(df):
    success = True
    countries = df['Country'].unique()
    first_weights = df.loc[df['Country'] == countries[0], 'Bottle_Weight'].unique()
    for country in countries[1:]:
        weights = df.loc[df['Country'] == country, 'Bottle_Weight'].unique()
        for weight in first_weights:
            if not np.any(weights == weight):
                success = False
                print(f"{country} does not have bottle weight {weight}")
    if success:
        print("All bottle weights are shared with another country")
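A set-based sketch of the same check (the function name and return shape here are my own): collect each country's unique weights with groupby and diff them against the first country's set.

```python
import pandas as pd

def missing_weights(df):
    # unique weights per country, in order of first appearance
    groups = df.groupby('Country', sort=False)['Bottle_Weight'].agg(set)
    reference = groups.iloc[0]
    # weights the first country has that a later country lacks
    return {country: sorted(reference - weights)
            for country, weights in groups.iloc[1:].items()
            if reference - weights}

df2 = pd.DataFrame([['India', 350], ['India', 600], ['India', 200],
                    ['Bangladesh', 350], ['Bangladesh', 600]],
                   columns=['Country', 'Bottle_Weight'])
print(missing_weights(df2))  # {'Bangladesh': [200]}
```

An empty dict means every country carries all of the first country's weights, as in the 1st dataframe.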
