I am trying to extract a range of numbers from one column, 'Description', and move the matched pattern to a new column called 'Seating'; however, the new column is not returning any values and is just populated with None. I have used a for loop to iterate through the rows to locate any entries with this pattern, but as I said, this returns None. Maybe I have defined the pattern incorrectly.
import re
import pandas as pd
# Defined the indexes
data = pd.read_csv('Inspections.csv').set_index('ACTIVITY DATE')
# Created a new column for seating which will be populated with pattern
data['SEATING'] = None
# Defining indexes for desired columns
index_description = data.columns.get_loc('PE DESCRIPTION')
index_seating = data.columns.get_loc('SEATING')
# Creating a pattern to be extracted
seating_pattern = r' \d([1-1] {1} [999-999] {3}\/[61-61] {2} [150-150] {3})'
# For loop to iterate through rows to find and extract pattern to 'Seating' column
for row in range(0, len(data)):
    score = re.search(seating_pattern, data.iat[row, index_description])
    data.iat[row, index_seating] = score
data
(Screenshots showing the output table and the populated SEATING column omitted.)
I have tried .group(), but it raises the following error: AttributeError: 'NoneType' object has no attribute 'group'.
What am I doing wrong, such that the column shows <re.Match object; span=(11, 17), match='(0-30)'> instead of the text matched by the pattern?
It's not completely clear to me what you want to extract with your pattern. But here's a suggestion that might help. With this small sample frame
df = pd.DataFrame({'Col1': ['RESTAURANT (0-30) SEATS MODERATE RISK',
'RESTAURANT (31-60) SEATS HIGH RISK']})
Col1
0 RESTAURANT (0-30) SEATS MODERATE RISK
1 RESTAURANT (31-60) SEATS HIGH RISK
this
df['Col2'] = df['Col1'].str.extract(r'\((\d+-\d+)\)')
gives you
Col1 Col2
0 RESTAURANT (0-30) SEATS MODERATE RISK 0-30
1 RESTAURANT (31-60) SEATS HIGH RISK 31-60
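If you want the two endpoints as separate columns, the same str.extract call works with two capture groups — a small sketch on the sample frame above (the column names SeatMin/SeatMax are made up here):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['RESTAURANT (0-30) SEATS MODERATE RISK',
                            'RESTAURANT (31-60) SEATS HIGH RISK']})

# Two capture groups -> two extracted columns; rows with no match get NaN
df[['SeatMin', 'SeatMax']] = df['Col1'].str.extract(r'\((\d+)-(\d+)\)')
print(df[['SeatMin', 'SeatMax']])
```

Note that str.extract returns strings, so cast with .astype(int) if you need numbers.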
Selecting rows in pandas can be much easier than this. First take a copy of the dataframe so you can apply the changes safely, then select values as follows:
data_copied = data.copy()
mask = (data_copied['Description'] >= start_range_value) & (data_copied['Description'] <= end_range_value)
data_copied.loc[mask, 'SEATING'] = data_copied.loc[mask, 'Description']
This link is helpful for building a column by selecting rows based on another column's values without changing them: https://www.geeksforgeeks.org/how-to-select-rows-from-a-dataframe-based-on-column-values/
This question digs into the same topic with more customization and will help you solve similar, more complex issues:
pandas create new column based on values from other columns / apply a function of multiple columns, row-wise
Related
I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value using the at() function
    # in the 'name' column, row i
    sel = df_selection_names.at[i, 'name']
    # this line finds the row matching 'sel' in df_everything
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be
way off, but I'm feeling pretty confused about everything at this point!
You can use Series.isin (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin):
df_everything[df_everything['name'].isin(df_selection_names["name"])]
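A runnable sketch of that isin filter, with made-up suggestion values, that also prints the name/suggestion pairs the question asks for:

```python
import pandas as pd

# Hypothetical frames mirroring the question's setup
df_selection_names = pd.DataFrame({'name': ['grilled',
                                            'fatty red meat, like prime rib']})
df_everything = pd.DataFrame({
    'name': ['grilled', 'steamed', 'fatty red meat, like prime rib'],
    'suggestion': ['chardonnay', 'riesling', 'cabernet'],
})

# Keep only rows whose name appears in the selection frame
matches = df_everything[df_everything['name'].isin(df_selection_names['name'])]
for _, row in matches.iterrows():
    print(f"{row['suggestion']} is suggested for {row['name']}")
```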
I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can collect the words whose value equals 1 into a list:
df_34972[df_34972[34972] == 1].index.tolist()
If you are trying to do this for all stories, then you can do this, but it will be a slightly different technique. From the link that SammyWemmy provided, you can melt() the dataframe and filter for 1 values for each story. From there you could .groupby('story_column') which is 'index' (after using reset_index()) in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]
df.groupby('index')['variable'].apply(list)
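Here is that melt-and-groupby recipe end to end on a tiny incidence matrix (the story ids and words are invented for illustration):

```python
import pandas as pd

# Rows are stories, columns are words; 1 means the word appears in the story
df = pd.DataFrame({'cat': [1, 0], 'dog': [1, 1], 'fish': [0, 1]},
                  index=[34972, 34973])

# Long format: one row per (story, word) pair, then keep only present words
melted = df.reset_index().melt(id_vars='index')
melted = melted[melted['value'] == 1]
words = melted.groupby('index')['variable'].apply(list)
print(words)
```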
I am trying to map the values of one df column with values in another one.
First df contains football match results:
Date|HomeTeam|AwayTeam
2009-08-15|0|2
2009-08-15|18|15
2009-08-15|20|10
Second df contains teams and has only one column:
TeamName
Arsenal
Bournemouth
Chelsea
The end result is the first df with matches but with team names instead of numbers in "HomeTeam" and "AwayTeam". The numbers in the first df mean indexes of the second one.
I've tried ".replace":
for item in matches.HomeTeam:
    matches = matches.replace(to_replace=matches.HomeTeam[item], value=teams.TeamName[item])
It did replace the values for some items (~80% of them), but ignored the other ones. I could not find a way to replace the other values.
Please let me know what I did wrong and how this can be fixed. Thanks!
Maybe try using applymap:
df[['HomeTeam', 'AwayTeam']] = df[['HomeTeam', 'AwayTeam']].applymap(lambda x: teams['TeamName'].tolist()[x])
And now print(df) shows the team names in place of the numbers, as expected.
I assume that teams is also a DataFrame, something like:
teams = pd.DataFrame(data=[['Team_0'], ['Team_1'], ['Team_2'], ['Team_3'],
['Team_4'], ['Team_5'], ['Team_6'], ['Team_7'], ['Team_8'],
['Team_9']], columns=['TeamName'])
but you failed to include the index in the provided sample (actually, in
both samples).
Then my proposition is:
matches.set_index('Date')\
.applymap(lambda id: teams.loc[id, 'TeamName'])\
.reset_index()
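For completeness, here is that applymap lookup run on small made-up frames (note that applymap was renamed to DataFrame.map in pandas 2.1, so prefer .map on newer versions):

```python
import pandas as pd

matches = pd.DataFrame({'Date': ['2009-08-15', '2009-08-15'],
                        'HomeTeam': [0, 2], 'AwayTeam': [1, 0]})
teams = pd.DataFrame({'TeamName': ['Arsenal', 'Bournemouth', 'Chelsea']})

# Look up each numeric id in the teams frame's index
out = (matches.set_index('Date')
              .applymap(lambda i: teams.loc[i, 'TeamName'])
              .reset_index())
print(out)
```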
I noticed that when using .loc in pandas dataframe, it not only finds the row of data I am looking for but also includes the header column names of the dataframe I am searching within.
So when I try to append the .loc row of data, it includes the data + column headers - I don't want any column headers!
##1st dataframe
df_futures.head(1)
date max min
19990101 2000 1900
##2nd dataframe
df_cash.head(1)
date$ max$ min$
19990101 50 40
##if date is found in dataframe 2, I will collect the row of data
data_to_track = []
for ii in range(len(df_futures['date'])):
    ## date I will try to find in df2
    date_to_find = df_futures['date'][ii]
    ## append the row of data to my list
    data_to_track.append(df_cash.loc[df_cash['Date$'] == date_to_find])
I want the for loop to return just 19990101 50 40.
It currently returns 0 19990101 50 40 along with the headers date$, max$, min$.
I agree with other comments regarding the clarity of the question. However, if what you want to get is just a string that contains a particular row's data, then you could use to_string() method of Pandas.
In your case,
Instead of this:
df_cash.loc[df_cash['Date$'] == date_to_find]
You could get a string that includes only the row data:
df_cash[df_cash['Date$'] == date_to_find].to_string(header=None)
Also notice that I dropped the .loc part, which outputs the same result.
If your dataframe has multiple columns and you don't want them joined into a single string (which may bring data-type issues and is potentially problematic if you want to separate the values later), you could use list() instead:
list(df_cash[df_cash['Date$'] == date_to_find].iloc[0])
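A quick sketch contrasting the two options on a one-row made-up frame (adding index=False to to_string so the row label is dropped too, which is what the question asks for):

```python
import pandas as pd

df_cash = pd.DataFrame({'Date$': [19990101], 'max$': [50], 'min$': [40]})
date_to_find = 19990101

row = df_cash[df_cash['Date$'] == date_to_find]
as_string = row.to_string(header=None, index=False)  # just the values, one string
as_list = list(row.iloc[0])                          # values kept as separate items
print(as_string)
print(as_list)
```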
I have a Pandas df (See below), I want to sum the values based on the index column. My index column contains string values. See the example below, here I am trying to add Moving, Playing and Using Phone together as "Active Time" and sum their corresponding values, while keep the other index values as these are already are. Any suggestions, that how can I work with this type of scenario?
Activity      AverageTime
Moving 0.000804367
Playing 0.001191772
Stationary 0.320701558
Using Phone 0.594305473
Unknown 0.060697612
Idle 0.022299218
I am sure that there must be a simpler way of doing this, but here is one possible solution.
# Filters for active and inactive rows
active_row_names = ['Moving','Playing','Using Phone']
active_filter = [row in active_row_names for row in df.index]
inactive_filter = [not row for row in active_filter]
active = df.loc[active_filter].sum() # Sum of 'active' rows as a Series
active = pd.DataFrame(active).transpose() # as a dataframe, and fix orientation
active.index=["active"] # Assign new index name
# Keep the inactive rows as they are, and replace the active rows with the
# newly defined row that is the sum of the previous active rows.
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead.)
df = pd.concat([df.loc[inactive_filter], active])
OUTPUT
Activity AverageTime
Stationary 0.320702
Unknown 0.060698
Idle 0.022299
active 0.596302
This will work even when only a subset of the active row names are present in the dataframe.
I would add a new boolean column called "active" and then groupby that column:
df['active'] = False
df.loc[['Moving', 'Playing', 'Using Phone'], 'active'] = True
df.groupby('active').AverageTime.sum()
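Putting the groupby idea together end to end; here index.isin sets the flag in one step, which avoids the chained-assignment pitfall entirely:

```python
import pandas as pd

df = pd.DataFrame({'AverageTime': [0.000804367, 0.001191772, 0.320701558,
                                   0.594305473, 0.060697612, 0.022299218]},
                  index=['Moving', 'Playing', 'Stationary',
                         'Using Phone', 'Unknown', 'Idle'])

active_rows = ['Moving', 'Playing', 'Using Phone']
df['active'] = df.index.isin(active_rows)  # boolean flag per activity
totals = df.groupby('active')['AverageTime'].sum()
print(totals)
```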