Dividing two pandas values - python

I have the following data, produced by this command:
df.ab.value_counts()
Out[154]:
0 31196
1 18804
dtype: int64
I want to find the fraction of 1s out of the total count, so basically 18804/50000.
I do the following:
(df.ab.value_counts()[1])/(df.ab.count())
Out[146]:
0
As you can see, it gives me zero, while it should be 0.37608 (18804/50000).
Any idea why?
Edit II:
Any idea how to plot a bar graph of these two values, 0 and 1, and their counts?

I think you need to add a cast to float (in Python 2, / between two integers performs integer division):
print (df.ab.value_counts()[1])/float(df.ab.count())
0.37608
print 18804/50000
0
print 18804/float(50000)
0.37608
print float(18804)/50000
0.37608

You can do it more simply:
df.ab.value_counts(normalize=True)

Alternatively, enable true division:
from __future__ import division
(df.ab.value_counts()[1])/(df.ab.count())  # should give the float value 0.37608
I have not tested it, but I hope it helps.
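As for the bar-graph edit: value_counts returns a Series, which plots directly; a minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

# Bar chart of the 0/1 counts: the Series index (0 and 1) becomes the
# x-axis and the counts become the bar heights.
df.ab.value_counts().plot(kind='bar')
plt.show()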

Related

Python - How to split a Pandas value and only get the value between the slashes

Example:
In my DataFrame, df['column'] has a bunch of values similar to F/4500/O or G/2/P.
The length of the digits ranges from 1 to 4, as in the examples given above.
How can I transform that column to keep only the number between the slashes (e.g. 4500) as an integer?
I tried the split method but I can't get it right.
Thank you!
You could extract the value and convert to_numeric:
df['number'] = pd.to_numeric(df['column'].str.extract(r'/(\d+)/', expand=False))
Example:
column number
0 F/4500/O 4500
1 G/2/P 2
How about:
df['column'].map(lambda x: int(x.split('/')[1]))
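For reference, a self-contained version of the extract approach on the sample values; unlike the map/split one-liner, it leaves missing values as NaN instead of raising:
import pandas as pd

df = pd.DataFrame({'column': ['F/4500/O', 'G/2/P']})
# pull out the digits between the slashes, then convert to a numeric dtype
df['number'] = pd.to_numeric(df['column'].str.extract(r'/(\d+)/', expand=False))
print(df)
#      column  number
# 0  F/4500/O    4500
# 1     G/2/P       2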

Why does the groupby command in Pandas produce non-existent ids?

I use the pandas groupby command on my dataframe as:
df.groupby('courier_id').type_of_vehicle.size()
but this code produces some courier_id values that are not in my dataframe:
courier_id
00aecd42-472f-11ec-94e0-77812be296a5 4
011da6a6-eb0b-11ec-97e1-179dc13cdf87 1
0140f63c-02e0-11ed-b314-9b2e7e4f7e5c 1
0188d572-7228-11ec-ab3b-07d470cb404d 7
01cef7ba-e32e-11ec-bb21-67c7079055d4 0
..
c98fc418-7b51-11ec-a81c-77139d6dd889 0
d98a4b9a-d056-11ec-9e3c-0b80c11ec04b 1
dae54c80-d1f8-11ec-bbb0-b71d7b2c4e1a 1
f7925664-0ac1-11ed-ab40-df16023f78cb 0
f857cb84-371c-11ec-9af6-ffeaeea4b0f1 4
Name: type_of_vehicle, Length: 268, dtype: int64
I checked it with '01cef7ba-e32e-11ec-bb21-67c7079055d4' in df.courier_id.values and the result was False.
I used df.groupby('courier_id').get_group('01cef7ba-e32e-11ec-bb21-67c7079055d4') and it raised a KeyError, but when I iterated over the groups in a for loop, it returned an empty DataFrame.
Note: when I slice my dataframe as new_df = df[['courier_id', 'type_of_vehicle']] the result becomes correct!
If you provide some reproducible code/data it would be appreciated. That way we can provide you the best possible answer.
However, I think the problem is due to the following:
When you use groupby(), the original courier_id becomes the new index of the transformed DataFrame. Try using .reset_index() and your problem should be solved.
df.groupby('courier_id').type_of_vehicle.size().reset_index()
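Another possibility, assuming courier_id is a categorical column (its dtype isn't shown in the question): groupby keeps every category by default, even ones with no remaining rows, which would explain both the zero-sized groups and the KeyError from get_group. Passing observed=True restricts the result to categories actually present:
# only group on categories that actually occur in the data
df.groupby('courier_id', observed=True).type_of_vehicle.size()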

Iterating over dataframe and using replace method based on conditions

I am attempting to iterate over a specific column in my dataframe.
The column is:
df['column'] = ['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill']
I am trying to clean this column and eventually get it all to integers to do more work with. I am stuck on the step to clean out "million". I would like to replace the "million" with five zeros when there is a decimal (i.e. 1.4million becomes 1.400000) and with six zeros when there is no decimal (i.e. 100million becomes 100000000).
To simplify, the first step I'm trying is to just focus on filtering out the values with a decimal and replacing those with 5 zeros. I have attempted to use np.where for this, however I cannot use the replace method with numpy.
I also attempted to use pd.DataFrame.where, but am getting an error:
for i, row in df.iterrows():
    df.at[i,'column'] = pd.DataFrame.where('.' in df.at[i,'column'], df.at[i,'column'].replace('million',''), df.at[i,'column'])
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I'm sure there is something I'm missing here. (I'm also sure that I'll be told that I don't need to use iterrows here, so I am open to suggestions on that as well.)
Given your sample data, it looks like you can strip out commas and then take all digits (and . characters) until the string mill or end of string, and split those out, e.g.:
x = df['column'].str.replace(',', '').str.extract(r'(.*?)(mill.*)?$')
This'll give you:
0 1
0 1.4 million
1 1235000 NaN
2 100 million
3 NaN NaN
4 14 million
5 2.5 mill
Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, e.g.:
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
That'll give you:
0 1400000.0
1 1235000.0
2 100000000.0
3 NaN
4 14000000.0
5 2500000.0
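Putting the two steps together as a self-contained sketch (using the question's sample data, with np.nan standing in for the missing value):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': ['1.4million', '1,235,000', '100million',
                              np.nan, '14million', '2.5mill']})
# strip thousands separators, then split the number from any "mill..." suffix
x = df['column'].str.replace(',', '').str.extract(r'(.*?)(mill.*)?$')
# scale by 1,000,000 only where a "mill..." suffix was present
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)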
Try this:
df['column'].apply(lambda x: x.replace('million', '00000'))
Make sure your dtype is string before applying this
For the given data:
df['column'].apply(lambda x: float(str(x).split('m')[0]) * 10**6
                   if 'million' in str(x) or 'mill' in str(x) else x)
If there may be many forms of "million" in the column, use a regex search instead.
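A hypothetical sketch of that regex route (the pattern is an assumption about which suffixes can occur):
import re

def parse_amount(x):
    # match a number followed by any suffix starting with "mill"
    m = re.match(r'([\d.,]+)\s*mill', str(x))
    if m:
        return float(m.group(1).replace(',', '')) * 1_000_000
    return x

df['column'].apply(parse_amount)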

How to subtract two columns when floating point precision matters

I am trying to subtract two columns in the dataframe, but it is giving me the same result for all the rows.
Here is my data:
a b
0 0.35805 -0.01315
1 0.35809 -0.01311
2 0.35820 -0.01300
3 0.35852 -0.01268
I tried the approach suggested here, but it repeats the same result for me in all the rows.
This looks more like a precision issue; I always use Decimal:
from decimal import *
df.a.map(Decimal)-df.b.map(Decimal)
Out[189]:
0 0.3711999999999999796246319406
1 0.3712000000000000195232718880
2 0.3712000000000000177885484121
3 0.3712000000000000056454840802
dtype: object
I think this will work fine. Note that the identical results are expected: for the sample data, a - b really is exactly 0.3712 in every row.
df['a-b'] = df['a']-df['b']
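If the trailing float-noise digits bother you, a minimal alternative to Decimal is to round back to the precision of the inputs (5 decimal places here):
# plain float subtraction, rounded to the inputs' precision
df['a-b'] = (df['a'] - df['b']).round(5)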

how to determine if a cell has multiple values and count the number of occurrences

I have a table as below, where I need to count the number of times the type column has more than one value in it.
My logic at the moment is to go through each row and check whether the type cell has more than one value in it and increment a counter, but I am not sure how to code this correctly in Python.
I tried the method below, but I don't think it helps in my case, considering that it is also hierarchical:
from collections import Counter
Counter(pd.DataFrame(data['Country'].str.split(',', expand=True)).values.ravel())
You can do:
## df is your data (gives pandas series)
df['type'].apply(lambda x: len(str(x).split(','))).value_counts()
## or convert it to dict
df['type'].apply(lambda x: len(str(x).split(','))).value_counts().to_dict()
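For example, on the sample frame used in the next answer, this counts how many cells hold 2 values versus 1:
df = pd.DataFrame({'type': ['big,green', 'big', 'small,red']})
df['type'].apply(lambda x: len(str(x).split(','))).value_counts()
# 2    2
# 1    1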
Using get_dummies with sum
df=pd.DataFrame({'type':['big,green','big','small,red']})
df.type.str.get_dummies(sep=',').sum(1)
Out[382]:
0 2
1 1
2 2
dtype: int64
Maybe you should try this one:
df=pd.DataFrame({'type':['big,green','big','small,red']})
for i in df['type']: print(len(i.split(',')))
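And if the goal is just the total number of cells with more than one value, a one-line sketch:
# count the cells containing a comma, i.e. holding more than one value
(df['type'].str.count(',') > 0).sum()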
