How to select multiple columns and rows from dataframe under condition? [duplicate] - python

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I would like to select rows and columns based on a condition. For example:
0 is camera (photo), 1 is video.
When the column equals 1, return the video data;
otherwise, return the photo data.
The purpose is to get separate data for videos and photos.
My code is shown below. I suspect the problem is in .loc[i, :], because when I change i to 0 it grabs the first row successfully, but I don't know why it doesn't work with i.
for i in range(len(dataset)):
    if dataset['status_type_num'][i] == 1:
        video_data = dataset[['num_reactions', 'num_comments', 'num_shares', 'num_likes', 'num_loves']].loc[i, :]
        print(video_data)
I expect the output to be the video data from the 5 columns ('num_reactions', 'num_comments', 'num_shares', 'num_likes', 'num_loves').
Thank you.

Subset the dataset with a boolean condition instead of looping over the rows.
Example:
Df_Camera = dataset[dataset['status_type_num'] == 0]
Df_Video = dataset[dataset['status_type_num'] == 1]
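To also restrict the result to the five columns from the question, the row condition and the column list can be passed to .loc together. This is a minimal sketch; the sample values are made up, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
dataset = pd.DataFrame({
    "status_type_num": [1, 0, 1, 0],
    "num_reactions": [10, 20, 30, 40],
    "num_comments": [1, 2, 3, 4],
    "num_shares": [0, 1, 0, 1],
    "num_likes": [9, 18, 27, 36],
    "num_loves": [1, 2, 3, 4],
})

cols = ["num_reactions", "num_comments", "num_shares", "num_likes", "num_loves"]

# Boolean mask: True where the row is a video (status_type_num == 1).
is_video = dataset["status_type_num"] == 1

# .loc takes the row mask and the list of column labels in one step.
video_data = dataset.loc[is_video, cols]
photo_data = dataset.loc[~is_video, cols]  # every row that is not a video

print(video_data)
print(photo_data)
```

Because the mask is built once for the whole column, no explicit loop over row positions is needed.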

Related

How to create a new column that is a calculation of other columns [duplicate]

This question already has answers here:
Adding a column in pandas df using a function
(2 answers)
Closed 1 year ago.
I would like to create a column that is calculated as (A + B) / C * 100, in order to get a column that is a percentage, yet when I run the code:
# Create new column that displays the % of the population that has a long-term health issue.
for i, row in health_issues.iterrows():
    health_issues.loc[i, 'PC_LTHP'] = (row['LTHP_littl'] + row['LTHP_lot']) / row['residents'] * 100
print(health_issues.columns.values)
No new column is created, and I am not sure what the issue is. Does the code fail because the new column needs to exist beforehand?
Would appreciate any help!
You don't have to iterate through rows to do this:
health_issues["PC_LTHP"] = (health_issues["LTHP_littl"] + health_issues["LTHP_lot"]) / health_issues["residents"] * 100
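As a rough, self-contained illustration of the column-wise calculation, the sketch below builds a small frame with invented numbers (only the column names are taken from the question) and checks that the new column appears.

```python
import pandas as pd

# Hypothetical values; only the column names come from the question.
health_issues = pd.DataFrame({
    "LTHP_littl": [10, 30],
    "LTHP_lot": [5, 20],
    "residents": [100, 200],
})

# Arithmetic on whole columns is applied row by row automatically,
# and assigning to a new label creates the column.
health_issues["PC_LTHP"] = (
    (health_issues["LTHP_littl"] + health_issues["LTHP_lot"])
    / health_issues["residents"] * 100
)

print(health_issues.columns.values)
print(health_issues)
```

The column does not need to exist beforehand; assignment with a new label adds it.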

Equivalent R and Python with a DataFrame [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I'm stuck with an equivalence of code between R and Python.
Code in R
library(datasets)
data <- airquality
data2 <- data[data$Ozone < 63,]
I downloaded the airquality file and used the pd.read_csv() function to load the .csv file into Python, but I don't know how to write the equivalent of the line data[data$Ozone < 63,].
data2 = data.loc[data["Ozone"] < 63,:]
This should do the trick.
data["Ozone"] < 63 returns a boolean Series that is True for the rows where the condition holds
data.loc[mask, :] returns the rows of data selected by that boolean mask, with all columns (:)
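The sketch below shows the mask and the selection on a tiny stand-in frame (the values are invented, not the real airquality data). One difference worth checking in your own data: a comparison against NaN is False in pandas, so rows with a missing Ozone value are dropped, whereas the R expression returns all-NA rows for them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the airquality CSV.
data = pd.DataFrame({
    "Ozone": [41.0, 36.0, np.nan, 97.0, 12.0],
    "Temp": [67, 72, 74, 81, 66],
})

# Boolean Series: NaN < 63 evaluates to False.
mask = data["Ozone"] < 63

data2 = data.loc[mask, :]   # explicit form: rows by mask, all columns
data2_short = data[mask]    # shorthand that selects the same rows

print(data2)
```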

how to get one number from pandas sum / is in function [duplicate]

This question already has an answer here:
Count occurrences of certain string in entire pandas dataframe
(1 answer)
Closed 2 years ago.
Suppose I want to find the number of occurrences of something in a pandas dataframe as one number.
If I do df.isin(["ABC"]).sum() it gives me a table of all occurrences of "ABC" under each column.
What do I do if I want just one number which is the number of "ABC" entries under column 1?
Moreover, is there code to find the entries that have both "ABC" under, say, column 1 and "DEF" under column 2? This should also be just a single number: the count of rows that have both.
You can check with groupby + size
out = df.groupby(['col1', 'col2']).size()
print(out.loc[('ABC','DEF')])
Q1: I'm sure there are more sophisticated ways of doing this, but you can do something like:
num_occurrences = data[data['column_name'] == 'ABC']
len(num_occurrences.index)
Q2: To add the 'DEF' search, you can try
num_occurrences = data[(data['column_name'] == 'ABC') & (data['column_2_name'] == 'DEF')]
len(num_occurrences.index)
I know this works for numeric values; you'll need to check that it behaves the same with string values.
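Since True counts as 1 when summed, a single number can also be obtained by summing the boolean mask directly. A small sketch with made-up data and assumed column names col1 and col2:

```python
import pandas as pd

# Hypothetical data; col1 and col2 are assumed column names.
df = pd.DataFrame({
    "col1": ["ABC", "ABC", "XYZ", "ABC"],
    "col2": ["DEF", "GHI", "DEF", "DEF"],
})

# Number of "ABC" entries in col1.
n_abc = int((df["col1"] == "ABC").sum())

# Number of rows with "ABC" in col1 and "DEF" in col2.
# Each comparison is parenthesized because & binds tighter than ==.
n_both = int(((df["col1"] == "ABC") & (df["col2"] == "DEF")).sum())

print(n_abc, n_both)
```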

How to store results of a for loop to a dataframe given a specific column/row?

I have a dataframe with different signals and returns. I want to do the following:
Subset a specific signal
Calculate the annualized return
Store result to a dataframe
My dataframe looks like this:
[image not included]
My code looks like this:
years = range(1990, 2019, 1)
returns = pd.DataFrame(columns=signals)
for i in signals:
    signal_i = portbase[portbase['signalname'] == i]  # Select single signal from dataframe
    for j in years:
        signal_i_j = signal_i[signal_i['year'] == j]  # Subset single year from signal
        return_j = (((signal_i_j['return'] / 100) + 1).prod() - 1) * 100  # Calculate annualized return for signal i in year j
        returns.loc[j, i]  # Add result to dataframe in column i and year j
Everything works except for the last part, where I want to save my results.
I want my dataframe to look like this:
[image not included]
Signals as columns and Years as rows
Edit:
Using the following code works:
df = portbase.groupby(['signalname','year'])['return'].apply(lambda x: (np.prod(1+x/100)-1) * 100).reset_index().T
But my output is still not correct:
[image not included]
I tried to convert my output to a dataframe and reset the index; now I somehow need to turn my signal column into the column headers.
Try this code:
df = portbase.groupby(['signalname','year'])['return'].apply(lambda x: (np.prod(1+x/100)-1) * 100).unstack().T
It's also possible to use pivot_table for this.
# Compound the percentage returns within each (year, signal) group
agg_func = lambda x: (np.prod(1 + x / 100) - 1) * 100
result = portbase.pivot_table(index='year', columns='signalname', values='return', aggfunc=agg_func)
First, it seems you never save the result of your calculation: the line returns.loc[j, i] reads a cell but never assigns return_j to it. Here i is the signal and j is the year.
.loc is a label-based indexer for accessing rows and columns by name, and it can be used for assignment as well as for reading.
So writing returns.loc[j, i] = return_j should store the value in row j (year) and column i (signal).
Alternatively, you could put your results in lists and then make a data frame out of them.
I hope my answer has helped somewhat.
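To make the two approaches concrete, here is a small sketch on invented data that fills the result frame inside the loop by assigning through .loc, then checks it against the groupby/unstack version. The helper name compound and the sample values are assumptions for illustration; the column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names follow the question.
portbase = pd.DataFrame({
    "signalname": ["a", "a", "a", "b", "b", "b"],
    "year": [1990, 1990, 1991, 1990, 1991, 1991],
    "return": [10.0, 10.0, 5.0, -50.0, 100.0, 0.0],
})

def compound(r):
    """Compound a Series of percentage returns into one percentage."""
    return ((1 + r / 100).prod() - 1) * 100

signals = portbase["signalname"].unique()
years = sorted(portbase["year"].unique())

# Loop version: years as rows, signals as columns, filled by assignment.
returns = pd.DataFrame(index=years, columns=signals, dtype=float)
for i in signals:
    signal_i = portbase[portbase["signalname"] == i]
    for j in years:
        signal_i_j = signal_i[signal_i["year"] == j]
        returns.loc[j, i] = compound(signal_i_j["return"])  # the assignment

# Vectorized version: group, aggregate, then move signalname to the columns.
wide = portbase.groupby(["year", "signalname"])["return"].apply(compound).unstack()

print(returns)
print(wide)
```

Both frames should hold the same numbers; the grouped version avoids the nested loops.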

if-else for multiple conditions dataframe [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I don't know how to properly write the following idea:
I have a dataframe that has two columns and many rows.
I want to create a new column based on the data in these two columns, such that if there is a 1 in either of them the value will be 1, otherwise 0.
Something like this:
if (df['col1']==1 | df['col2']==1):
    df['newCol']=1
else:
    df['newCol']=0
I tried to use .loc in different ways but I get different errors, so either I'm not using it correctly, or this is not the right solution...
Would appreciate your help. Thanks!
Simply use np.where or np.select. Each comparison needs its own parentheses, because | binds more tightly than ==.
df['newCol'] = np.where((df['col1'] == 1) | (df['col2'] == 1), 1, 0)
OR
df['newCol'] = np.select([cond1, cond2, cond3], [choice1, choice2, choice3], default=def_value)
With np.select, each element takes the choice that corresponds to the first condition that is true, or the default if none is.
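A minimal runnable sketch of both calls on made-up data follows. The names newCol_where and newCol_select are illustrative only, and the np.select version simply maps each of the two conditions to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the column names from the question.
df = pd.DataFrame({"col1": [1, 0, 0, 2], "col2": [0, 1, 0, 1]})

# np.where: one combined condition, one value for True, one for False.
df["newCol_where"] = np.where((df["col1"] == 1) | (df["col2"] == 1), 1, 0)

# np.select: a list of conditions with one choice per condition;
# the first condition that is True wins, otherwise the default is used.
conditions = [df["col1"] == 1, df["col2"] == 1]
choices = [1, 1]
df["newCol_select"] = np.select(conditions, choices, default=0)

print(df)
```

np.select becomes more useful than np.where when the conditions map to different values.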
One way to solve this using .loc:
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = 1
df['newCol'] = df['newCol'].fillna(0)
In case you want newCol as a string, use:
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = '1'
df['newCol'] = df['newCol'].fillna('0')
or convert the numeric version (a float column after fillna, so cast to int first to avoid '1.0'):
df['newCol'] = df['newCol'].astype(int).astype(str)
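A shorter variant, offered as an alternative and not as what the answer above does, is to cast the boolean mask itself to integers, which avoids the intermediate NaN values and the fillna step. The sample data is invented. Note that without the parentheses around each comparison, the expression becomes a chained comparison on whole Series and pandas raises a ValueError about an ambiguous truth value.

```python
import pandas as pd

# Hypothetical data with the column names from the question.
df = pd.DataFrame({"col1": [1, 0, 0, 2], "col2": [0, 1, 0, 1]})

# Parenthesize each comparison before combining them with |.
mask = (df["col1"] == 1) | (df["col2"] == 1)

# True/False become 1/0 directly.
df["newCol"] = mask.astype(int)

# String version, if the flag should be stored as text.
df["newCol_str"] = df["newCol"].astype(str)

print(df)
```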
