I have an Excel file with a file name and a value in each row, for example:
file count
001.txt 1
002.txt 2
003.txt 2
004.txt 3
005.txt 1
006.txt 2
I'm using the following code to find how many 2s are in the value column, but somehow the result is 0
df = pd.read_excel('report.xlsx')
df.columns = ['file', 'count']
count = df['count'].tolist().count('2')
print(count)
0
Did I do something wrong in the code?
First, check whether the 'count' column is numeric (a dtype such as int64, int32, or float64) or object:
df['count'].dtype  # inspect the column's dtype
If the dtype is numeric, the code you wrote will work once you correct it like this:
df['count'].to_list().count(2)
When you call .to_list(), the elements are all numbers, while count('2') counts how many times the string '2' appears. Since there are no string elements like '2', you got a result of zero.
Here is a simple example.
lis = [0, 1, 2, 2, 3]
lis.count('2') # it returns 0
lis.count(2) # it returns 2
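If the dtype turns out to be object (strings), a minimal sketch, reusing the report.xlsx file from the question: coerce the column to numbers first, then compare against the integer 2:
import pandas as pd

df = pd.read_excel('report.xlsx')
df.columns = ['file', 'count']
# coerce strings like '2' to real numbers; anything unparseable becomes NaN
df['count'] = pd.to_numeric(df['count'], errors='coerce')
print((df['count'] == 2).sum())  # counts the 2s without going through a list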
I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
   ID          NETWORK
0   1               []
1   2  [OPE, GSR, REP]
2   3            [MER]
Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""] which took me a long time to figure out.
So whatever standard method I try to use to calculate the number of elements within each list, I get a wrong value of 1 for the rows that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
   ID          NETWORK  COUNT
0   1               []      1
1   2  [OPE, GSR, REP]      3
2   3            [MER]      1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the empty string '':
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively, explode the lists, filter out the empty strings, then group by the index and count (the output below assumes ID has been set as the index):
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
            NETWORK  count
ID
1                []    NaN
2   [OPE, GSR, REP]    3.0
3             [MER]    1.0
Fill NA with zeroes if needed.
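For example:
df['count'] = df['count'].fillna(0).astype(int)  # NaN -> 0, back to integer counts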
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and in the lambda function return the length of the array
I have the following df
         name  created_utc
0  t1_cqug90j   1430438400
1  t1_cqug90k   1430438400
2  t1_cqug90z   1430438400
3  t1_cqug91c   1430438401
4  t1_cqug91e   1430438401
...       ...          ...
in which column name contains only unique values. I would like to create a dictionary whose keys are the elements of column name. The value for each key is the number of elements in column created_utc strictly smaller than that of the key. My expected result is something like
{'t1_cqug90j': 6, 't1_cqug90k': 0, 't1_cqug90z': 3, ...}
In this case, there are 6 elements in column created_utc strictly smaller than 1430438400, which is the corresponding value of t1_cqug90j. I can write a loop to generate such a dictionary, but the loop is not efficient in my case, with more than 3 million rows.
Could you please elaborate on a more efficient way?
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df1.csv', header = 0)[['name', 'created_utc']]
df
Update: I posted the question "How to efficiently count the number of larger elements for every element in another column?" and received a great answer there. However, I'm not able to adapt the code to this case. It would be great if there were efficient code that could handle both cases, i.e. "strictly larger" and "strictly smaller".
The sort direction in sort_index is what separates the two cases: sorting the per-timestamp counts ascending and taking a shifted cumulative sum gives "strictly smaller", while sorting descending gives "strictly larger". For this question:
count_utc = df.groupby('created_utc').size().sort_index()
print (count_utc)
created_utc
1430438400    3
1430438401    2
dtype: int64
cumulative_counts = count_utc.shift(fill_value=0).cumsum()
output = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))
print (output)
{'t1_cqug90j': 0, 't1_cqug90k': 0, 't1_cqug90z': 0, 't1_cqug91c': 3, 't1_cqug91e': 3}
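For the "strictly larger" case from your earlier question, the same idea works with a descending sort; on the sample above this yields the counts of larger elements:
count_utc = df.groupby('created_utc').size().sort_index(ascending=False)
cumulative_counts = count_utc.shift(fill_value=0).cumsum()
larger = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))
# {'t1_cqug90j': 2, 't1_cqug90k': 2, 't1_cqug90z': 2, 't1_cqug91c': 0, 't1_cqug91e': 0}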
I have a dataset in which I am trying to count the number of 1s in a column, grouped by another column, and return the result as a value (to use within a class).
Example data
import pandas as pd
Current = {'Item': ['Chocolate', 'Chocolate', 'Sweets', 'Chocolate', 'Sweets', 'Pop'],
           'Order': [0, 1, 1, 1, 1, 0]}
Current = pd.DataFrame(Current, columns=['Item', 'Order'])
I then want to count the number of 1s for each item (the real table has 25 columns) and return this value.
I have managed to do that when there are values using this code:
choc = Current[Current["Item"] == "Chocolate"]
print(choc["Order"].value_counts()[1])
returns: 2
(in reality I would use the bit inside the print to return it in my Class, not just print it)
This works if there is a count, such as for chocolate, but if there is no count, it returns an error.
pop = Current[Current["Item"] == "Pop"]
print(pop["Order"].value_counts()[1])
Returns: KeyError: 1.0
My questions are:
Is there a better way to do this?
If not, how do I get the value to return 0 if there isn't a count, e.g. in the case of pop?
If you want to check the items individually, you can do something like this:
Current[Current.Item=='Pop'].Order.sum()
This will return 0 for items with no 1s.
If you expect summary as your end result, you can do:
Current.groupby('Item').agg({'Order':sum}).reset_index()
It will return a dataframe with the count of 1s for each item.
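If you would rather keep your original value_counts approach, Series.get can fall back to a default instead of raising; a small sketch using the Pop example from the question:
pop = Current[Current["Item"] == "Pop"]
print(pop["Order"].value_counts().get(1, 0))  # prints 0 instead of raising KeyError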
Let's say I have an UNORDERED DataFrame:
df = pandas.DataFrame({'A': [6, 2, 3, 5]})
I have an input:
input = 3
I want to find the rank of my input in the column. Here:
expected_rank_in_df(input) = 2
# Because 2 < 3 < 5 < 6
Assumption : The input is always included in the dataframe. So for example, I will not find the position of "4" in this df.
My first idea was to use df.rank(), as in Pandas rank by column value:
df.rank()
But it seems overkill to me as I don't need to rank the whole column. Maybe it's not ?
If you know for sure that the input is in the column, the rank is the number of elements less than or equal to it:
(df['A'] <= input).sum()
Does that make sense? If you intend to call this multiple times, it may be worth sorting the column once up front. But this is probably faster if you only care about a few inputs.
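A hedged sketch of that sort-once idea with numpy.searchsorted (assuming, per the question, the value is always present in the column):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, 2, 3, 5]})
sorted_a = np.sort(df['A'].to_numpy())  # sort once, reuse for many queries

def rank_of(value):
    # 1-based ascending rank = number of elements <= value
    return int(np.searchsorted(sorted_a, value, side='right'))

print(rank_of(3))  # 2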
You can get the first position of the matched value with numpy.where, using a boolean mask and taking the first True:
import numpy as np

a = 3
print (np.where(np.sort(df['A']) == a)[0][0] + 1)
2
If you sort the values and reset the index, the position of the first match gives the rank:
a = 3
print (df['A'].sort_values().reset_index(drop=True).eq(a).idxmax() + 1)
2
Another idea is to count True values with sum, since the ascending rank equals the number of elements less than or equal to a:
print (df['A'].le(a).sum())
2
I use the following Python code to read a CSV with 50K rows. Every row has a 4-digit code, for example '1234'.
import csv
import pandas as pd
import re
df = pd.read_csv('Parkingtickets.csv', sep=';', encoding='ISO-8859-1')
df['Parking tickets']
I would like to count the codes and get the counts of the 5 most frequent ones.
from collections import Counter

codes = df['Parking tickets']
Counter(codes).most_common(5)
With this I got kind of what I'm looking for, but it doesn't restrict the counting to the 4-digit codes, and some rows may contain two codes. How can I use "re.findall(r'\d{4}')"? I know I need to use it, but I don't understand how to implement it.
Perhaps look at pandas.Series.value_counts() (http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.Series.value_counts.html). This returns a series containing the counts of the unique values in the original series. Here is some trivial example code:
import pandas as pd

list1 = [1, 1, 1, 2, 2, 3]
df = pd.DataFrame(data={'number': list1})
df['number'].value_counts()
This returns
1    3
2    2
3    1
indicating that the number 1 occurred 3 times, the number 2 occurred 2 times, and the number 3 occurred 1 time. You could always do:
top5 = df['number'].value_counts().head(5)
Or convert it to a dictionary with .to_dict(), etc.
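To handle the original question's extra wrinkle (rows that mix text with one or more 4-digit codes), here is a hedged sketch using Series.str.findall, assuming df was loaded as in the question and a pandas recent enough to have Series.explode (0.25+):
all_codes = df['Parking tickets'].astype(str).str.findall(r'\d{4}')
# one code per row after explode, then count and take the top 5
print(all_codes.explode().value_counts().head(5))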