Replace all column values based on another dataframe's index - python

I am trying to map the values in one dataframe's columns to values in another dataframe.
First df contains football match results:
Date       | HomeTeam | AwayTeam
2009-08-15 |        0 |        2
2009-08-15 |       18 |       15
2009-08-15 |       20 |       10
Second df contains teams and has only one column:
TeamName
Arsenal
Bournemouth
Chelsea
The end result should be the first df with the matches, but with team names instead of numbers in "HomeTeam" and "AwayTeam". The numbers in the first df are row indexes into the second one.
I've tried ".replace":
for item in matches.HomeTeam:
    matches = matches.replace(to_replace=matches.HomeTeam[item], value=teams.TeamName[item])
It did replace the values for some items (~80% of them) but ignored the others. I could not find a way to replace the remaining values.
Please let me know what I did wrong and how this can be fixed. Thanks!

Maybe try using applymap:
df[['HomeTeam', 'AwayTeam']] = df[['HomeTeam', 'AwayTeam']].applymap(lambda x: teams['TeamName'].tolist()[x])
And now:
print(df)
shows the team names in place of the numbers.
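For instance, a minimal runnable sketch with trimmed-down stand-ins for the two frames (only the three team names from the sample; note that on pandas 2.1+ applymap is deprecated in favor of the equivalent DataFrame.map):

import pandas as pd

teams = pd.DataFrame({'TeamName': ['Arsenal', 'Bournemouth', 'Chelsea']})
matches = pd.DataFrame({'Date': ['2009-08-15', '2009-08-15'],
                        'HomeTeam': [0, 2],
                        'AwayTeam': [2, 1]})

# Look up each integer in the list of team names
names = teams['TeamName'].tolist()
matches[['HomeTeam', 'AwayTeam']] = matches[['HomeTeam', 'AwayTeam']].applymap(lambda x: names[x])
print(matches)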

I assume that teams is also a DataFrame, something like:
teams = pd.DataFrame(data=[['Team_0'], ['Team_1'], ['Team_2'], ['Team_3'],
                           ['Team_4'], ['Team_5'], ['Team_6'], ['Team_7'],
                           ['Team_8'], ['Team_9']],
                     columns=['TeamName'])
but you failed to include the index in the provided sample (actually, in
both samples).
Then my proposition is:
matches.set_index('Date')\
    .applymap(lambda id: teams.loc[id, 'TeamName'])\
    .reset_index()
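A vectorized alternative, sketched under the same assumption that the integers are positions in teams' default RangeIndex: map each column through the TeamName series (Series.map aligns on the teams index, so no Python-level loop over rows is needed).

for col in ['HomeTeam', 'AwayTeam']:
    matches[col] = matches[col].map(teams['TeamName'])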

Related

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion, and many others. I want to find all the rows in df_everything whose name value matches one of the name values from df_selection_names, so that I can print the values for each name and suggestion pair, e.g. "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value in the 'name' column of row i using the at() function
    sel = df_selection_names.at[i, 'name']
    # this line finds the rows of df_everything whose 'name' equals sel
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be
way off, but I'm feeling pretty confused about everything at this point!
Series.isin does exactly this (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin):
df_everything[df_everything['name'].isin(df_selection_names["name"])]
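A minimal sketch of the whole flow, using made-up stand-ins for the question's frames (the suggestion values here are hypothetical):

import pandas as pd

df_selection_names = pd.DataFrame({'name': ['fatty red meat, like prime rib', 'grilled']})
df_everything = pd.DataFrame({'name': ['grilled', 'steamed', 'fatty red meat, like prime rib'],
                              'suggestion': ['charcoal', 'a bamboo basket', 'horseradish']})

# Keep only the rows whose name appears in the selection
matched = df_everything[df_everything['name'].isin(df_selection_names['name'])]
for _, row in matched.iterrows():
    print(f"{row['suggestion']} is suggested for {row['name']}")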

Python: Matching a pattern and relocating values to another column

I am trying to select a range of numbers from one column, 'Description', and move this pattern to a new column called 'Seating'; however, the new column does not return any values and is just populated with None. I have used a for loop to iterate through the rows to locate any rows with this pattern, but as I said, it returns None everywhere. Maybe I have defined the pattern incorrectly.
import re
import pandas as pd
# Defined the indexes
data = pd.read_csv('Inspections.csv').set_index('ACTIVITY DATE')
# Created a new column for seating which will be populated with pattern
data['SEATING'] = None
# Defining indexes for desired columns
index_description = data.columns.get_loc('PE DESCRIPTION')
index_seating = data.columns.get_loc('SEATING')
# Creating a pattern to be extracted
seating_pattern = r' \d([1-1] {1} [999-999] {3}\/[61-61] {2} [150-150] {3})'
# For loop to iterate through rows to find and extract pattern to 'Seating' column
for row in range(0, len(data)):
    score = re.search(seating_pattern, data.iat[row, index_description])
    data.iat[row, index_seating] = score
data
(Screenshot: the output table, with the SEATING column populated.)
I have tried .group(), and it returns the error AttributeError: 'NoneType' object has no attribute 'group'.
What am I doing wrong, in that the column shows <re.Match object; span=(11, 17), match='(0-30)'> instead of the text matched by the pattern?
It's not completely clear to me what you want to extract with your pattern. But here's a suggestion that might help. With this small sample frame
df = pd.DataFrame({'Col1': ['RESTAURANT (0-30) SEATS MODERATE RISK',
                            'RESTAURANT (31-60) SEATS HIGH RISK']})
Col1
0 RESTAURANT (0-30) SEATS MODERATE RISK
1 RESTAURANT (31-60) SEATS HIGH RISK
this
df['Col2'] = df['Col1'].str.extract(r'\((\d+-\d+)\)')
gives you
Col1 Col2
0 RESTAURANT (0-30) SEATS MODERATE RISK 0-30
1 RESTAURANT (31-60) SEATS HIGH RISK 31-60
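If the 'PE DESCRIPTION' strings in your Inspections.csv follow the same "(lo-hi)" shape, the same idea should carry over; a hedged one-liner, untested against the real file (expand=False makes a single capture group come back as a Series):

# Pull the "lo-hi" range out of each description into SEATING
data['SEATING'] = data['PE DESCRIPTION'].str.extract(r'\((\d+-\d+)\)', expand=False)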
Selecting columns in pandas can be much easier than this.
First take a copy of the dataframe so you can apply the changes safely, and then select values as follows:
data_copied = data.copy()
data_copied['SEATING'] = data_copied[(data_copied['Description'] >= start_range_value) & (data_copied['Description'] <= end_range_value)]
This link is helpful for building a column by selecting rows based on another column without changing values: https://www.geeksforgeeks.org/how-to-select-rows-from-a-dataframe-based-on-column-values/
This question dives into the same topic with more customization; it will help you solve similar, more complex issues:
pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

Find which columns contain a certain value for each row in a dataframe

I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can send the words whose value equals 1 to a list (the transposed column is labelled 34972, the story's index):
[*df_34972[df_34972[34972] == 1].index]
If you are trying to do this for all stories, the technique is slightly different. From the link that SammyWemmy provided, you can melt() the dataframe and filter for 1 values for each story. From there you can .groupby('story_column'), which is 'index' (after using reset_index()) in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]
df.groupby('index')['variable'].apply(list)
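A minimal sketch of the melt approach, using a hypothetical 0/1 story-by-word matrix:

import pandas as pd

df = pd.DataFrame({'wolf': [1, 0], 'grandma': [1, 1], 'ship': [0, 1]},
                  index=[34972, 34973])

# Long format: one row per (story, word) pair, keep only the 1s
long_df = df.reset_index().melt(id_vars='index')
long_df = long_df[long_df['value'] == 1]
print(long_df.groupby('index')['variable'].apply(list))
# 34972    [wolf, grandma]
# 34973    [grandma, ship]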

Pandas groupby: shifting and counting at the same time

Basically, I am trying to take the previous row for the combination ['dealer','State','city']. If I have multiple rows for this combination, I will get the shifted value of the combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am taking this ShiftBY_D_S_C column again and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The table below shows what I am trying to do, and it works well. But when all the rows in the ShiftBY_D_S_C column are null, this does not work, since the column contains only null values. Any suggestions?
I am trying to get NewColumn values like the example below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else case:
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(...)
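Spelled out with the grouping taken from the question's own transform, that sketch would look like:

# Guard the all-NaN case; otherwise run the original count
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C']
                         .transform('count') + 1)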

Get nth row after applying lambda on groupby in python

So I need to group a dataframe by its SessionId, then sort each group by the created time, and afterwards retrieve only the nth row of each group.
But I found that after applying the lambda it becomes a dataframe instead of a groupby object, hence I cannot use the .nth property:
grouped = df.groupby(['SessionId'])
sorted = grouped.apply(lambda x: x.sort_values(["Created"], ascending = True))
sorted.nth ---> error
Changing the order in which you are approaching the problem in this case will help. If you first sort and then use groupby, you will get the desired output and you can use the groupby.nth function.
Here is a code snippet to demonstrate the idea:
df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'var1': [3, 2, 1, 8, 7, 6],
                   'var2': ['g', 'h', 'i', 'j', 'k', 'l']})

n = 2  # replace with the required row from each group
df.sort_values(['id', 'var1']).groupby('id').nth(n).reset_index()
Assuming id is your sessionid and var1 is the timestamp, this sorts your dataframe by id and then var1. Then picks up the nth row from each of these sorted groups. The reset_index() is there just to avoid the resulting multi-index.
If you want to get the last n rows of each group, you can use .tail(n) instead of .nth(n).
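For the .tail(n) variant, a quick sketch on the same frame:

# After sorting, take the last 2 rows of each group
# (original row order within each group is preserved)
df.sort_values(['id', 'var1']).groupby('id').tail(2)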
I have created a small dataset -
n = 2
grouped = df.groupby('SessionId')
pd.concat([grouped.get_group(x).sort_values(by='SortVar').reset_index().loc[[n]]
           for x in grouped.groups], axis=0)
This will return -
Please note that in Python indexes start from zero, so for n=2 it will give you the 3rd row of the sorted data.
