How to get one number from pandas sum / isin function [duplicate] - python

This question already has an answer here:
Count occurrences of certain string in entire pandas dataframe
(1 answer)
Closed 2 years ago.
Suppose I want to find the number of occurrences of something in a pandas dataframe as one number.
If I do df.isin(["ABC"]).sum() it gives me a table of all occurrences of "ABC" under each column.
What do I do if I want just one number which is the number of "ABC" entries under column 1?
Moreover, is there code to find entries that have both "ABC" under, say, column 1 and "DEF" under column 2? Even this should just be a single number: the count of rows that have both of these.

You can do this with groupby + size:
out = df.groupby(['col1', 'col2']).size()
print(out.loc[('ABC','DEF')])
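For the first question specifically, a boolean comparison summed directly gives one number; a minimal sketch, assuming the same col1/col2 names as the answer above:
n_abc = (df['col1'] == 'ABC').sum()  # comparing yields a boolean Series; True sums as 1
n_both = ((df['col1'] == 'ABC') & (df['col2'] == 'DEF')).sum()  # rows matching both conditions, as one number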

Q1: I'm sure there are more sophisticated ways of doing this, but you can do something like:
num_occurrences = data[data['column_name'] == 'ABC']
len(num_occurrences.index)
Q2: To add in the 'DEF' search, you can try:
num_occurrences = data[(data['column_name'] == 'ABC') & (data['column_2_name'] == 'DEF')]
len(num_occurrences.index)
I know this works for quantitative values; you'll need to check whether it behaves the same with qualitative ones.

Related

How do you sum a dataframe based off a grouping in Python pandas?

I have a for loop with the intent of checking for values greater than zero.
The problem is, I only want each iteration to check the sum within a group of IDs.
The grouping would be a match on the first 8 characters of the ID string.
I have that grouping taking place before the loop, but the loop still appears to search the entire df instead of each group.
LeftGroup = newDF.groupby('ID_Left_8')
for g in LeftGroup.groups:
    if sum(newDF['Hours_Calc'] > 0):
        print(g)
Is there a way to filter that sum to each grouping of leftmost 8 characters?
I was expecting the .groups function to accomplish this, but it still seems to search every single ID.
Thank you.
def filter_and_sum(group):
    # Sum Hours_Calc over the rows of this group where it is positive.
    return sum(group[group['Hours_Calc'] > 0]['Hours_Calc'])
LeftGroup = newDF.groupby('ID_Left_8')
results = LeftGroup.apply(filter_and_sum)
print(results)
This will compute the sum of the Hours_Calc column for each group, filtered by the condition Hours_Calc > 0. The resulting series will have the leftmost 8 characters as the index, and the sum of the Hours_Calc column as the value.
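The same per-group sums can usually be computed without apply, by filtering first and then grouping; a sketch under the same column names:
# Vectorized form: drop the non-positive rows once, then let groupby
# sum per ID prefix. Note that groups whose hours are all non-positive
# disappear from the result instead of showing 0.
results = newDF[newDF['Hours_Calc'] > 0].groupby('ID_Left_8')['Hours_Calc'].sum()
print(results)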

Setting numbers outside of range as null [duplicate]

This question already has answers here:
Python Pandas replace values if not in value range
(4 answers)
Closed last year.
I am working with a pandas df and I am trying to set all the numbers that are outside of a range to null, but I'm having trouble:
df['Numbers'] = df['Numbers'].mask((df['Numbers'] < -10) & (df['Numbers'] > 10), inplace=True)
So I want to keep the numbers between -10 and 10; if a number is outside of those two bounds, it should be set as null.
What am I doing wrong here?
One thing that immediately jumps out at me is that you're using & with your two conditions, so you're basically trying to select all numbers that are both less than -10 and greater than 10...which isn't gonna work ;)
I'd rewrite your code like this:
df.loc[df['Numbers'].lt(-10) | df['Numbers'].gt(10), 'Numbers'] = np.nan
I would do it like this:
df['Numbers'] = df['Numbers'].where((df['Numbers']>-10) & (df['Numbers']<10))
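If you'd rather stay close to the original mask attempt, the fix is to flip & to | and to assign the result instead of combining assignment with inplace=True; a minimal sketch:
# mask() replaces values with NaN where the condition is True by default.
df['Numbers'] = df['Numbers'].mask((df['Numbers'] < -10) | (df['Numbers'] > 10))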

create a new column which is a value_counts of another column in python [duplicate]

This question already has answers here:
pandas add column to groupby dataframe
(3 answers)
Closed 2 years ago.
I have a pandas dataframe df that contains a column, say x, and I would like to create another column out of x which is the value count of each item in x.
Here is my approach
x_counts = []
for item in df['x']:
    item_count = len(df[df['x'] == item])
    x_counts.append(item_count)
df['x_count'] = x_counts
This works, but it is far from efficient. I am looking for a more efficient way to handle this. Your approach and recommendations are highly appreciated.
It sounds like you are looking for the groupby function, since you are trying to get the count of items in x.
There are many other function-driven methods, but they may differ between versions.
I suppose that you are looking to join the same elements and find their sum:
df.loc[:, 'x_count'] = 1  # add a new x_count column with value 1 in each row
aggregate_functions = {"x_count": "sum"}
# as_index=False and sort=False keep x as a regular column; otherwise
# x would become the index.
df = df.groupby(["x"], as_index=False, sort=False).aggregate(aggregate_functions)
Hope it helps.
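Note that the groupby above collapses the frame to one row per distinct x; if the goal is a count column aligned with every original row, as in the question, transform keeps the original shape. A minimal sketch:
# transform('count') broadcasts each group's size back onto the
# original rows, so df keeps one row per original record.
df['x_count'] = df.groupby('x')['x'].transform('count')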

if-else for multiple conditions dataframe [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I don't know how to write the following idea properly:
I have a dataframe that has two columns, and many many rows.
I want to create a new column based on the data in these two columns, such that if there's a 1 in either of them the value will be 1, otherwise 0.
Something like that:
if (df['col1']==1 | df['col2']==1):
df['newCol']=1
else:
df['newCol']=0
I tried to use .loc function in different ways but i get different errors, so either I'm not using it correctly, or this is not the right solution...
Would appreciate your help. Thanks!
Simply use np.where or np.select:
df['newCol'] = np.where((df['col1'] == 1) | (df['col2'] == 1), 1, 0)
OR
df['newCol'] = np.select([cond1, cond2, cond3], [choice1, choice2, choice3], default=def_value)
With np.select, when a particular condition is true, the row gets the corresponding choice. Note the parentheses around each comparison: | binds tighter than ==, so leaving them out raises an error.
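A concrete, runnable sketch of the np.select pattern for this exact question (the toy frame and its values are illustrative, not from the original post):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0], 'col2': [0, 1, 0]})
# Each condition maps to the choice at the same position;
# rows matching no condition get the default.
conditions = [(df['col1'] == 1) | (df['col2'] == 1)]
choices = [1]
df['newCol'] = np.select(conditions, choices, default=0)
# df['newCol'] -> [1, 1, 0]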
One way to solve this using .loc:
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = 1
df['newCol'].fillna(0, inplace=True)
In case you want newCol as a string, use:
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = '1'
df['newCol'].fillna('0', inplace=True)
or
df['newCol'] = df['newCol'].astype(str)

Performance of Pandas string contains for column [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I have a DataFrame of 83k rows and a column "Text" of text that I have to search for ~200 masks. Is there a way to pass a column to .str.contains()?
I'm able to do it like this:
start = time.time()
[a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(time.time() - start)
But it's taking 34.013s. Is there any faster way?
Edit:
b["mask"] looks like:
'PR347856|P5478'
'BS7623|B5763'
and I want the count of occurrences for each mask, so I can't join them into one pattern.
Edit:
a["text"] contains strings of the size of ~ 3 sentences
Maybe you can vectorize the containment operation:
text_contains = a['Text'].str.contains
b['mask'].map(lambda m: text_contains(m).sum())
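One usage note on the answer above, assuming the masks are intended as regex alternations (they contain '|', so 'PR347856|P5478' matches rows containing either code):
# str.contains treats each mask as a regex by default (regex=True),
# so '|' acts as OR. The map returns a Series of counts aligned with b.
counts = b['mask'].map(lambda m: a['Text'].str.contains(m).sum())
counts.index = b['mask'].values  # optional: label each count by its mask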
