Performance of Pandas string contains for column [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I have a DataFrame of 83k rows with a column "Text" that I have to search for ~200 masks. Is there a way to pass a column to .str.contains()?
I'm able to do it like this:
import time

start = time.time()
[a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(time.time() - start)
But it's taking 34.013s. Is there any faster way?
Edit:
b["mask"] looks like:
'PR347856|P5478'
'BS7623|B5763'
and I want the count of occurrences for each mask, so I can't join them.
Edit:
a["text"] contains strings of the size of ~ 3 sentences

Maybe you can vectorize the containment operation.
text_contains = a['Text'].str.contains
b['mask'].map(lambda m: text_contains(m).sum())
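If this is still too slow, one alternative worth timing (a sketch only, with no guaranteed speedup; it assumes a["Text"] holds plain strings and b["mask"] holds regex patterns like 'PR347856|P5478') is to compile each pattern once and scan a plain list:
import re
texts = a["Text"].tolist()  # pull the column out of pandas once
counts = [
    sum(1 for t in texts if rx.search(t))  # rows containing the pattern
    for rx in (re.compile(m) for m in b["mask"].values)
]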

create a new column which is a value_counts of another column in python [duplicate]

This question already has answers here:
pandas add column to groupby dataframe
(3 answers)
Closed 2 years ago.
I have a pandas DataFrame df that contains a column, say x, and I would like to create another column out of x which is the value count of each item in x.
Here is my approach
x_counts = []
for item in df['x']:
    item_count = len(df[df['x'] == item])
    x_counts.append(item_count)
df['x_count'] = x_counts
This works, but it is very inefficient. I am looking for a more efficient way to handle this. Your approach and recommendations are highly appreciated.
It sounds like you are looking for the groupby function, since you want the count of items in x.
There are other method-based approaches, but they may differ between pandas versions.
I assume you want to group identical elements and find their total count.
df.loc[:, 'x_count'] = 1  # add an x_count column with value 1 in every row
aggregate_functions = {"x_count": "sum"}
# as_index=False and sort=False keep x as a regular column rather than the index
df = df.groupby(["x"], as_index=False, sort=False).aggregate(aggregate_functions)
Hope it helps.
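For completeness, a shorter route that keeps one row per original row (a sketch in the spirit of the linked duplicate) is groupby with transform:
# count occurrences of each x value and align them back to the original rows
df['x_count'] = df.groupby('x')['x'].transform('count')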

how to get one number from pandas sum / is in function [duplicate]

This question already has an answer here:
Count occurrences of certain string in entire pandas dataframe
(1 answer)
Closed 2 years ago.
Suppose I want to find the number of occurrences of something in a pandas dataframe as one number.
If I do df.isin(["ABC"]).sum() it gives me a table of all occurrences of "ABC" under each column.
What do I do if I want just one number which is the number of "ABC" entries under column 1?
Moreover, is there code to find entries that have both "ABC" under, say, column 1 and "DEF" under column 2? Even this should just be a single number of entries/rows that have both of these.
You can check with groupby + size:
out = df.groupby(['col1', 'col2']).size()
print(out.loc[('ABC','DEF')])
Q1: I'm sure there are more sophisticated ways of doing this, but you can do something like:
num_occurrences = data[data['column_name'] == 'ABC']
len(num_occurrences.index)
Q2: To add in 'DEF' search, you can try
num_occurrences = data[(data['column_name'] == 'ABC') & (data['column_2_name'] == 'DEF')]
len(num_occurrences.index)
I know this works for quantitative values; you'll need to check whether it also holds for qualitative ones.
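A more compact variant, assuming the same placeholder column names as above, sums boolean masks directly:
# single number: rows where column 1 equals 'ABC'
(data['column_name'] == 'ABC').sum()
# single number: rows matching both conditions
((data['column_name'] == 'ABC') & (data['column_2_name'] == 'DEF')).sum()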

Pandas - How to get sum of column with positive and negative values? [duplicate]

This question already has answers here:
converting currency with $ to numbers in Python pandas
(5 answers)
Closed 3 years ago.
I am summing a column of data in pandas that includes positive and negative values.
I first clean the data by removing the $ sign and parentheses, then convert it to float.
How can I sum the whole column with the negative numbers subtracted?
Example:
$1000
($200)
$300
$1250
($100)
I want the answer to be 2250 (parenthesized values treated as negative), not 2850 (the plain sum of all values).
Thanks in advance!
You want to identify the values and the signs:
import numpy as np

# s is the column of strings, e.g.
# s = pd.Series(['$1000', '($200)', '$300', '$1250', '($100)'])

# positive and negative: parenthesized entries are negative
signs = np.where(s.str.startswith('('), -1, 1)
# extract the numeric value after the $
vals = s.str.extract(r'\$([\d.]+)')[0].astype(float)
# calculate the signed sum
vals.mul(signs).sum()
# 2250.0
A pandas DataFrame has a .sum method that takes axis as a parameter:
my_dataframe['name_of_column_you_want'].sum(axis=0)  # axis=0 sums down the rows
I don't understand your example.
import re

def clean(value):
    # pull out the numeric part, e.g. '1000' from '$1000' or '200' from '($200)'
    number = float(re.search(r'[\d.]+', value).group(0))
    # parenthesized values are negative
    return -number if '(' in value else number

my_dataframe['column_name'].apply(clean).sum()
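Applied to the example values from the question (the DataFrame and column names here are placeholders), this yields 2250:
import pandas as pd
my_dataframe = pd.DataFrame(
    {'column_name': ['$1000', '($200)', '$300', '$1250', '($100)']})
print(my_dataframe['column_name'].apply(clean).sum())  # 2250.0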

How to get the second largest value in Pandas Python [duplicate]

This question already has answers here:
Get first and second highest values in pandas columns
(7 answers)
Closed 4 years ago.
This is my code:
maxData = all_data.groupby(['Id'])[features].agg('max')
all_data = pd.merge(all_data, maxData.reset_index(), suffixes=["", "_max"], how='left', on=['Id'])
Now, instead of getting the max value, how can I fetch the second-largest value in the above code (grouped by Id)?
Try using nlargest:
maxData = all_data.groupby(['Id'])[features].apply(lambda x: x.nlargest(2).iloc[-1]).reset_index(drop=True)
You can use the nth method just after sorting the values:
maxData = all_data.sort_values(features, ascending=False).groupby(['Id']).nth(1)
Prefer this over the apply method, as apply tends to be slower.
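As a quick check on toy data (made-up values, a single feature column):
import pandas as pd
df = pd.DataFrame({'Id': [1, 1, 1, 2, 2],
                   'value': [10, 30, 20, 5, 7]})
# second-largest 'value' per Id
print(df.sort_values('value', ascending=False).groupby('Id').nth(1))
# value 20 for Id 1, value 5 for Id 2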

Python: len() of unknown numpy array column length [duplicate]

This question already has answers here:
Counting the number of non-NaN elements in a numpy ndarray in Python
(5 answers)
Closed 4 years ago.
I'm currently trying to learn Python and Numpy. The task is to determine the length of individual columns of an imported CSV file.
So far I have:
import numpy as np
data = np.loadtxt("assignment5_data.csv", delimiter = ',')
print(data.shape)
Which returns:
(62, 2)
Is there a way to go through each column and count the values that are not NaN?
If I understand correctly, and you are trying to get the length of non-nan values in each column, use:
np.sum(~np.isnan(data), axis=0)
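For instance, on a small made-up array:
import numpy as np
data = np.array([[1.0, np.nan],
                 [2.0, 3.0],
                 [np.nan, 4.0]])
print(np.sum(~np.isnan(data), axis=0))  # [2 2]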
