Count values in a column conditional on another column - python

I have a dataset in which I am trying to count the number of 1s in a column, grouped by another column, and return this as a value (to use within a class).
Example data
import pandas as pd
Current = {'Item': ['Chocolate', 'Chocolate', 'Sweets', 'Chocolate', 'Sweets', 'Pop'],
           'Order': [0, 1, 1, 1, 1, 0],
           }
Current = pd.DataFrame(Current, columns=['Item', 'Order'])
I then want to count the number of 1s for each item (the real table has 25 columns) and return this value.
I have managed to do that when there are values using this code:
choc = Current[Current["Item"] == "Chocolate"]
print(choc["Order"].value_counts()[1])
returns: 2
(in reality I would use the bit inside the print to return it in my Class, not just print it)
This works if there is a count, such as for chocolate, but if there is no count, it returns an error.
pop = Current[Current["Item"] == "Pop"]
print(pop["Order"].value_counts()[1])
Returns: KeyError: 1.0
My questions are:
Is there a better way to do this?
If not, how do I get the value to return 0 if there isn't a count, e.g. in the case of pop?

If you want to check the items individually, you can do something like this:
Current[Current.Item=='Pop'].Order.sum()
This will return 0 for items with no 1s.
If you expect summary as your end result, you can do:
Current.groupby('Item').agg({'Order':sum}).reset_index()
It will return a dataframe with the summed Order (i.e. the count of 1s) for each item.
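For reference, a minimal sketch combining both ideas with the sample data above (the column names match the example and are assumptions to adapt to the real 25-column table):
import pandas as pd

Current = pd.DataFrame({'Item': ['Chocolate', 'Chocolate', 'Sweets', 'Chocolate', 'Sweets', 'Pop'],
                        'Order': [0, 1, 1, 1, 1, 0]})

# Per-item check: summing a 0/1 column counts the 1s, and an empty
# selection sums to 0 instead of raising a KeyError.
print(Current.loc[Current['Item'] == 'Pop', 'Order'].sum())        # 0
print(Current.loc[Current['Item'] == 'Chocolate', 'Order'].sum())  # 2

# Summary for all items at once.
print(Current.groupby('Item')['Order'].sum().reset_index())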

Related

I want to save the mean (by row) of different sets of dataframe columns and store them in a new dataframe

For doing so, I have a list of lists (which are my clusters), for example:
asset_clusts=[[0,1],[3,5],[2,4, 12],...]
and the original dataframe (in my code I call it 'x') consists of return time series of S&P 500 companies.
I want to choose columns [0, 1] of the original dataframe, compute their mean (by row) and store it in a new dataframe, then compute the mean of columns [3, 5] and add it to the new dataframe, and so on ...
mu = pd.DataFrame()
for j in range(get_number_of_elements(asset_clusts)):
    mu = x.iloc[:, asset_clusts[j]].mean(axis=1)
but it gives me only one column, and I checked: this column is the mean of the last cluster's columns.
In case of ambiguity, the get_number_of_elements function is:
def get_number_of_elements(clist):
    count = 0
    for element in clist:
        count += 1
    return count
I solved it, and in case it is helpful for others, here is the final function:
def clustered_series(x, org_asset_clust):
    """
    x: return data
    org_asset_clust: list of clusters
    ----> mean of each cluster returned by row
    """
    def get_number_of_elements(org_asset_clust):
        count = 0
        for element in org_asset_clust:
            count += 1
        return count

    mu = []
    for j in range(get_number_of_elements(org_asset_clust)):
        mu.append(x.iloc[:, org_asset_clust[j]].mean(axis=1))
    cluster_mean = pd.concat(mu, axis=1)
    return cluster_mean
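As an aside, a shorter equivalent sketch (assuming x is a DataFrame and each cluster is a list of column positions); the element-counting helper can be dropped by iterating over the clusters directly:
import pandas as pd

def clustered_series(x, org_asset_clust):
    # Mean (by row) of each cluster of columns, one output column per cluster.
    return pd.concat([x.iloc[:, cluster].mean(axis=1) for cluster in org_asset_clust], axis=1)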

find number of values in pandas dataframe

I have an Excel file that has a file name and a value, for example:
file     count
001.txt  1
002.txt  2
003.txt  2
004.txt  3
005.txt  1
006.txt  2
I'm using the following code to find how many 2s are in the value column, but somehow the result is 0
df = pd.read_excel('report.xlsx')
df.columns = ['file', 'count']
count = df['count'].tolist().count('2')
print(count)
>>0
Did I do something wrong in the code?
First, check whether the 'count' column has a numeric dtype (such as 'int64', 'int32', 'float64', etc.) or 'object'.
df['count'] # check the result
If the data type is numeric, then you can use the code you wrote, just corrected like this:
df['count'].to_list().count(2)
I guess that when you call .to_list(), the elements are all numbers. count('2') counts how many times the string '2' appears, but there are no string elements like '2' in the list. That's why you got a zero result.
Here is a simple example.
lis = [0, 1, 2, 2, 3]
lis.count('2') # it returns 0
lis.count(2) # it returns 2
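If the column turns out to hold strings rather than numbers, a minimal sketch (assuming the file and column names from the question) is to convert it first and count directly on the Series:
import pandas as pd

df = pd.read_excel('report.xlsx')
df.columns = ['file', 'count']

# Coerce to numeric in case the column was read as strings.
df['count'] = pd.to_numeric(df['count'])

# Count the 2s directly on the Series; no list conversion needed.
print((df['count'] == 2).sum())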

How to efficiently count the number of smaller elements for every element in another column?

I have the following df
name created_utc
0 t1_cqug90j 1430438400
1 t1_cqug90k 1430438400
2 t1_cqug90z 1430438400
3 t1_cqug91c 1430438401
4 t1_cqug91e 1430438401
... ... ...
in which column name contains only unique values. I would like to create a dictionary whose keys are the elements of column name. The value for each such key is the number of elements in column created_utc strictly smaller than that of the key. My expected result is something like
{'t1_cqug90j': 6, 't1_cqug90k': 0, 't1_cqug90z': 3, ...}
In this case, there are 6 elements in column created_utc strictly smaller than 1430438400, which is the corresponding value of t1_cqug90j. I can write a loop to generate such a dictionary. However, the loop is not efficient in my case with more than 3 million rows.
Could you please elaborate on a more efficient way?
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df1.csv', header = 0)[['name', 'created_utc']]
df
Update: I posted the question How to efficiently count the number of larger elements for every elements in another column? and received a great answer there. However, I'm not able to modify the code into this case. It would be great if there is an efficient code that can be adapted for both cases, i.e. "strictly larger" and "strictly smaller".
I think you need sort_index with descending sorting, adapting your previous answer:
count_utc = df.groupby('created_utc').size().sort_index(ascending=False)
print (count_utc)
created_utc
1430438401 2
1430438400 3
dtype: int64
cumulative_counts = count_utc.shift(fill_value=0).cumsum()
output = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))
print (output)
{'t1_cqug90j': 2, 't1_cqug90k': 2, 't1_cqug90z': 2, 't1_cqug91c': 0, 't1_cqug91e': 0}
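To cover both cases from the update, note that with this shift/cumsum idea the sort direction decides which comparison you get: sorting ascending counts elements strictly smaller, sorting descending counts elements strictly larger. A minimal sketch, reusing df from the question:
def count_compared(df, ascending=True):
    # ascending=True  -> for each name, the number of rows with a strictly smaller created_utc
    # ascending=False -> for each name, the number of rows with a strictly larger created_utc
    counts = df.groupby('created_utc').size().sort_index(ascending=ascending)
    cumulative = counts.shift(fill_value=0).cumsum()
    return dict(zip(df['name'], df['created_utc'].map(cumulative)))

smaller = count_compared(df, ascending=True)
larger = count_compared(df, ascending=False)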

How to find a row in a given table that maximizes return and minimizes variance

Suppose I have a DataFrame that consists of three columns (index, return, volatility) and five rows.
I would like to receive the index of the row that maximizes the value of return AND minimizes the value of volatility; however, the second condition is less important than the first.
How can I achieve that? I tried applying the .rank() method to both columns and then sorting by them, but the effect was poor.
You could first select a subset of your dataframe containing the rows for which the Return is maximum and then use idxmin() to return the index of the first occurrence of the minimum Volatility.
Here is a dataframe example:
import pandas as pd
mydict = {'Return': [99999, 1, 1, 99999, 809],
          'Volatility': [1, 2, 3, 4, 7]
          }
df = pd.DataFrame(mydict, columns=['Return', 'Volatility'])
For this example, df[df.Return==df.Return.max()] yields:
Return Volatility
0 99999 1
3 99999 4
And df[df.Return==df.Return.max()].Volatility.idxmin() returns:
0
Which is the index associated with the Volatility of 1.
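As an alternative sketch for the "second condition is less important" requirement, you could sort by Return descending and break ties by Volatility ascending, then take the first index (this is only an assumed interpretation of the tie-breaking, not part of the answer above):
best_idx = df.sort_values(['Return', 'Volatility'], ascending=[False, True]).index[0]
print(best_idx)  # 0 for the example data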

Replacing column values in a pandas DataFrame

I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But I receive the exact same copy of the previous results.
I would ideally like to get some output which resembles the following loop element-wise.
if w['female'] == 'female':
    w['female'] = '1'
else:
    w['female'] = '0'
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [1, 0], inplace=True)
This should also work:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with .get, i.e.
w['female'] = w['female'].apply({'male': 0, 'female': 1}.get)
w = pd.DataFrame({'female':['female','male','female']})
print(w)
Dataframe w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary's .get should be used when all the possible values of the column are defined in the dictionary; otherwise, it will produce None for values not defined in the dictionary.
Using Series.map with Series.fillna
If your column contains other strings besides just female and male, Series.map will fall short here, since it will return NaN for those other values.
That's why we have to chain it with fillna:
Example why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'],drop_first = True)
This gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables as are needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column; instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first=True)], axis=1)
w.drop('female', axis=1, inplace=True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
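A minimal sketch of that approach (note that factorize assigns codes in order of appearance, so which label becomes 0 depends on the data):
codes, uniques = pd.factorize(w['female'])
w['female'] = codes
# uniques records which label each code stands for, e.g. Index(['female', 'male'], dtype='object')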
w.female = np.where(w.female=='female', 1, 0)
Use this if you are looking for a numpy solution. It is useful for replacing values based on a condition; both the if and else branches are built into np.where(). The solutions that use df.replace() may not be feasible if the column includes many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession, because neither of them implements an else condition on its own.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace takes a dictionary as its argument, in which you can map whatever replacements you want or need.
I think the answers should point out which type of object you get back with the methods suggested above: a Series or a DataFrame.
Selecting a column with w.female or w['female'] returns a Series, while w[['female']] (double brackets) returns a single-column DataFrame.
Both Series and DataFrame have a .replace method; methods such as .map and .apply shown above are used on a Series.
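A quick sketch to check the returned types:
type(w['female'])      # pandas.core.series.Series
type(w[['female']])    # pandas.core.frame.DataFrame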
To answer the question more generically so it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that feed each other and can be used whether you know the exact replacements or not.
import numpy as np
import pandas as pd
class Utility:

    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct names in a column. If no dictionary is provided for the exact name changes, it will default
        to <column_name>_count. Ex. female_1, female_2, etc.
        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
            Ex. {1234: "User A"} This would change all occurrences of 1234 to the string "User A" and leave the other values as they were.
            By default, this is an empty dictionary.
        :return: The same column with the replaced values
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column

    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the new item to replace it.
        The returned dictionary can then be passed to the pandas rename function to rename all the distinct values in a
        column.
        Ex. column ["statement"]["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}
        If you would like a value to remain the same, enter the values you would like to keep in except_values.
        Ex. except_values = ["I", "am"]
        column ["statement"]["I", "am", "old"] would return
        {"old": "statement_3"}
        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
                count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes = {"female": 0, "male":1}
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This changes my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes, you can use the equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64
