So I have a dataframe (df9) with several columns, one of which is "ASSET_CLASS", and I also have a variable called "terms". ASSET_CLASS contains different names, whereas "terms" is numeric. I want to create a new column in the dataframe that outputs a different number per row based on the corresponding asset class and number of terms. For example, if ASSET_CLASS is 'A' in a row, and terms is between 30 and 60 for that row, I want my new column to output 5 for that row. If ASSET_CLASS is 'A' and terms is between 0 and 30, the new column should output 3. Or if ASSET_CLASS is 'B' and terms is between 30 and 60, then the output in the new column for that row is 8. Anyone have a good idea of how to do this? I was thinking maybe if/else statements, but I'm not sure.
Use numpy.select. To match your first two examples, the following code adds a new column called 'newcol' to your dataframe, covering your first two cases and putting a value of -1 everywhere not covered by an explicitly defined case.
import numpy as np

ac = df9.ASSET_CLASS
t = df9.terms
# one boolean condition per case, checked in order
condlist = [(ac == 'A') & (t >= 30) & (t < 60), (ac == 'A') & (t >= 0) & (t < 30)]
# the value to use where the corresponding condition is True
choicelist = [5, 3]
df9['newcol'] = np.select(condlist, choicelist, default=-1)
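To also cover your third example ('B' with terms between 30 and 60 mapping to 8), just append a condition and a choice; I'm assuming the same half-open interval convention as above:
condlist = [
    (ac == 'A') & (t >= 30) & (t < 60),
    (ac == 'A') & (t >= 0) & (t < 30),
    (ac == 'B') & (t >= 30) & (t < 60),
]
choicelist = [5, 3, 8]
df9['newcol'] = np.select(condlist, choicelist, default=-1)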
I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion, and a lot of other columns. I want to find all the rows in df_everything whose name value matches one of the name values from df_selection_names, so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value using the at() function
    # in the 'name' column and row i
    sel = df_selection_names.at[i, 'name']
    # this line finds the rows in df_everything whose 'name' equals sel
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be way off, but I'm feeling pretty confused about everything at this point!
You can use pandas.Series.isin: https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin
df_everything[df_everything['name'].isin(df_selection_names["name"])]
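From there, printing each name and suggestion pair could look something like this (assuming the suggestion column is literally named 'suggestion'):
matches = df_everything[df_everything['name'].isin(df_selection_names['name'])]
for _, row in matches.iterrows():
    # one "X is suggested for Y" line per matching row
    print(f"{row['suggestion']} is suggested for {row['name']}")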
I am trying to get the first non-null value of the list inside each row of the Emails column and write it to Emails_final1, then get the next value of the list inside each row of Emails, if there is one, and write it to Emails_final2; otherwise Emails_final2 should get the Emails 2 value, if that is not blank and doesn't equal 'Emails', and be left blank if not. Lastly, if a value from Emails 2 was written to Emails_final1, then Emails_final2 should be None. I have tried many different ways to achieve this to no avail; here is what I have so far, including pseudo-code:
My Current Code:
import pandas as pd

df = pd.DataFrame({'Emails': [['jjj@gmail.com', 'jp@gmail.com', 'jc@gmail.com'], [None, 'www@gmail.com'], [None, None, None]],
                   'Emails 2': ['sss@gmail.com', 'zzz@gmail.com', 'ccc@gmail.com'],
                   'num_specimen_seen': [10, 2, 3]},
                  index=['falcon', 'dog', 'cat'])
df['Emails_final1'] = df['Emails'].explode().groupby(level=0).first()
#pseudo code
df['Emails_final2'] = df['Emails'].explode().groupby(level=0).next()  # I know next() doesn't exist, but I want it to try to get the next value of 'Emails' before trying to get the 'Emails 2' values.
Desired Output:
        Emails_final1  Emails_final2
falcon  jjj@gmail.com  jp@gmail.com
dog     www@gmail.com  zzz@gmail.com
cat     ccc@gmail.com  None
Any direction of how to approach a problem like this would be appreciated.
It looks a bit messy, but it works. Basically, we keep a boolean mask from the first step of filling "Emails_final1" and use it in the second step when filling "Emails_final2".
To fill the second column, the idea is to use groupby + nth to get the second element of each list: if it exists and doesn't match the previously selected email, keep it (as in the first row); otherwise fall back to the "Emails 2" column, unless that value was already selected in the first step (as in the 3rd row):
exp_g = df['Emails'].explode().groupby(level=0)
# first non-null element of each list
df['Emails_final1'] = exp_g.first()
# remember which rows were filled from 'Emails' before falling back to 'Emails 2'
msk = df['Emails_final1'].notna()
df['Emails_final1'] = df['Emails_final1'].fillna(df['Emails 2'])
# second element of each list, where present
df['Emails_final2'] = exp_g.nth(1)
# fall back to 'Emails 2' where the second element is missing or duplicates
# Emails_final1, but only for rows that didn't already use 'Emails 2'
df['Emails_final2'] = df['Emails_final2'].mask(lambda x: ((x == df['Emails_final1']) | x.isna()) & msk, df['Emails 2'])
The relevant columns are:
        Emails_final1  Emails_final2
falcon  jjj@gmail.com  jp@gmail.com
dog     www@gmail.com  zzz@gmail.com
cat     ccc@gmail.com  None
I am trying to select a range of numbers from one column, 'Description', and then move this pattern to a new column called 'Seating'; however, the new column is not returning any values and is just populated with None. I have used a for loop to iterate through the rows to locate any rows with this pattern, but as I said, this returns values equal to None. Maybe I have defined the pattern incorrectly.
import re
import pandas as pd
# Defined the indexes
data = pd.read_csv('Inspections.csv').set_index('ACTIVITY DATE')
# Created a new column for seating which will be populated with pattern
data['SEATING'] = None
# Defining indexes for desired columns
index_description = data.columns.get_loc('PE DESCRIPTION')
index_seating = data.columns.get_loc('SEATING')
# Creating a pattern to be extracted
seating_pattern = r' \d([1-1] {1} [999-999] {3}\/[61-61] {2} [150-150] {3})'
# For loop to iterate through rows to find and extract pattern to 'Seating' column
for row in range(0, len(data)):
    score = re.search(seating_pattern, data.iat[row, index_description])
    data.iat[row, index_seating] = score
data
[Screenshot: output table showing the SEATING column populated]
I have tried .group(), and it returns the following error: AttributeError: 'NoneType' object has no attribute 'group'.
What am I doing wrong, in that it shows <re.Match object; span=(11, 17), match='(0-30)'> instead of the result from the pattern?
It's not completely clear to me what you want to extract with your pattern. But here's a suggestion that might help. With this small sample frame
df = pd.DataFrame({'Col1': ['RESTAURANT (0-30) SEATS MODERATE RISK',
                            'RESTAURANT (31-60) SEATS HIGH RISK']})
Col1
0 RESTAURANT (0-30) SEATS MODERATE RISK
1 RESTAURANT (31-60) SEATS HIGH RISK
this
df['Col2'] = df['Col1'].str.extract(r'\((\d+-\d+)\)')
gives you
Col1 Col2
0 RESTAURANT (0-30) SEATS MODERATE RISK 0-30
1 RESTAURANT (31-60) SEATS HIGH RISK 31-60
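If that range is what you're after, you can skip the loop entirely and apply the same idea to your frame; here I'm using expand=False so a Series comes back, with the column names taken from your question:
data['SEATING'] = data['PE DESCRIPTION'].str.extract(r'\((\d+-\d+)\)', expand=False)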
Selecting columns in pandas can be much easier than this.
First take a copy of the dataframe so you can apply the changes safely, and then select values as follows (start_range_value and end_range_value are placeholders for the bounds you want):
data_copied = data.copy()
# keep the description only where it falls inside the desired range
in_range = (data_copied['Description'] >= start_range_value) & (data_copied['Description'] <= end_range_value)
data_copied['SEATING'] = data_copied['Description'].where(in_range)
This link is helpful for building a column by selecting based on rows of another column without changing values: https://www.geeksforgeeks.org/how-to-select-rows-from-a-dataframe-based-on-column-values/
This question dives into the same topic with more customization; it will help you solve similar, more complex issues:
pandas create new column based on values from other columns / apply a function of multiple columns, row-wise
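As a minimal sketch of that row-wise idea (the column names and values here are invented for illustration):
import pandas as pd

df = pd.DataFrame({'low': [0, 31], 'high': [30, 60]})
# build a new column from a function of several other columns, row by row
df['range'] = df.apply(lambda r: f"{r['low']}-{r['high']}", axis=1)
print(df)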
I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can send the words whose value equals 1 to a list (after the transpose, the single column is labeled 34972):
[*df_34972[df_34972[34972] == 1].index]
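A slightly more direct route to the same list, if I'm reading the goal right, is to mask the columns of that single row:
present_words = df.columns[df.loc[34972] == 1].tolist()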
If you are trying to do this for all stories, the technique is slightly different. From the link that SammyWemmy provided, you can melt() the dataframe and filter for the 1 values for each story. From there you can .groupby('index'), where 'index' is the story column produced by reset_index() in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]
df.groupby('index')['variable'].apply(list)
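For example, with a toy frame (story ids and words invented for illustration) this yields a list of present words per story:
import pandas as pd

df = pd.DataFrame({'wolf': [1, 0], 'moon': [1, 1]}, index=[34972, 34973])
out = df.reset_index().melt(id_vars='index')
out = out[out['value'] == 1]
# 34972    [wolf, moon]
# 34973          [moon]
print(out.groupby('index')['variable'].apply(list))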
"Python for data analysis" (ch5) uses a double selection:
data.iloc[:,:3][data.three>5]
There is no explanation of the logic behind this statement. How should it be understood?
Is it a selection over a previous selection? I.e., data.iloc[:,:3] first selects all rows and the first three columns, then [data.three>5] reduces this selection to all rows for which the value in column 'three' is greater than 5?
I saw also the following expression:
df[['CoCode','Doc_Type','Doc_Nr','Amount_LC']][df['Amount_LC']>1000000000]
I am a bit lost. It looks like loc and iloc can be used with double selection, i.e. df.loc[][]. What is the logic of the second []? What goes in the first one, and in the second?
Two separate selections are being applied here to dataframe data:
1) data.iloc[:,:3] is selecting all rows, and all columns up to (but not including) column index 3, thus column indices 0, 1 and 2
2) The dataframe data is being limited to all rows where column three contains values greater than 5
The output of these two selections is independent of ordering, therefore:
data.iloc[:,:3][data.three>5] == data[data.three>5].iloc[:,:3] will return a dataframe populated with True
Note that you are not using double selection here (as you call it), but rather you are querying specific rows and columns in your first selection, while your second selection is merely a filter applied to the dataframe returned by your first selection.
Effectively, you are using .iloc to select specific index positions (or slices) from the dataframe, while .loc lets you select specific locations based on column and row labels.
Finally, when filtering your dataframe with something like data[data.three>5], you can read this as "Return rows in dataframe data where the column three of that row has a value greater than 5".
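You can check that equivalence yourself on a small frame along the lines of the book's example (reconstructed here, so treat the exact values as illustrative):
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    columns=['one', 'two', 'three', 'four'])
# both orderings select the same rows and columns
print(data.iloc[:, :3][data.three > 5].equals(data[data.three > 5].iloc[:, :3]))  # True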
iloc and loc take 2 parameters, rows and columns:
data.iloc[<row selection> , <column selection>]
Hope this helped.
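For instance, with a frame like the one sketched in the previous answer (column names taken from the book's example), the chained selection collapses into a single call that selects rows and columns in one go:
data.loc[data.three > 5, ['one', 'two', 'three']]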
Is it a selection over a previous selection? I.e., data.iloc[:,:3] first selects all rows and the first three columns, then [data.three>5] reduces this selection to all rows for which the value in column 'three' is greater than 5?
Yes, @rahlf23 has a great explanation.
It looks like loc and iloc can be used with double selection, i.e. df.loc[][]. What is the logic of the second []? What goes in the first one, and in the second?
You can even make a triple or deeper selection of rows.
Example:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 10], 'c': [11, 12, 13, 14, 15]})
# It will give you the first 3 rows of columns a and b
df.iloc[:, :2][:4][:3]
# It will give you {'a': [2, 3], 'b': [7, 8]}
df.iloc[:, :2][df.a*7 > df.c][:2]
# It will give you an error; you can't keep slicing with a (row, column) tuple
df.iloc[:, :2][:3, :1]