Pandas, millions and billions - python

I have a dataframe with this kind of data
1 400.00M
2 1.94B
3 2.72B
4 -400.00M
5 13.94B
I would like to convert the data to billions so that the output would be something like this
1 0.40
2 1.94
3 2.72
4 -0.40
5 13.94
Note that the column's dtype is object.

Use replace with a dictionary and map pd.eval:
Sample df:
Out[1629]:
val
1 400.00M
2 1.94B
3 2.72B
4 -400.00M
5 13.94B
d = {'M': '*0.001', 'B': ''}
s_convert = df.val.replace(d, regex=True).map(pd.eval)
Out[1633]:
1 0.40
2 1.94
3 2.72
4 -0.40
5 13.94
Name: val, dtype: float64
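If you'd rather avoid pd.eval, a minimal eval-free sketch (assuming every value ends in 'M' or 'B') splits the suffix off and multiplies by a factor expressed in billions:
import pandas as pd

# Map each suffix to a multiplier in billions (the 'factors' name is illustrative).
factors = {'M': 1e-3, 'B': 1.0}
s_convert = df['val'].str[:-1].astype(float) * df['val'].str[-1].map(factors)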

You can use a lambda expression if you know for a fact that you either have only millions or billions:
import pandas as pd

amount = ["400.00M", "1.94B", "2.72B", "-400.00M", "13.94B"]
df = pd.DataFrame(amount, columns=["amount"])
df.amount.apply(lambda x: float(x[:-1]) if x[-1] == "B" else float(x[:-1]) / 1000)

Or a list comprehension...
data = {'value': ['400.00M', '1.94B', '2.72B', '-400.00M', '13.94B']}
df = pd.DataFrame(data, index = [1, 2, 3, 4, 5])
df['value'] = [float(n[:-1])/1000 if n[-1:] == 'M' else float(n[:-1]) for n in df['value']]
...though @Andy's answer is more concise.


Find where word is present in string with where statement [duplicate]

I am having an issue while trying to replace a string with a value from another column.
I want to replace 'Length' with df['Length'].
df["Length"]= df["Length"].replace('Length', df['Length'], regex = True)
Below is my data
Input:
Formula     Length
Length      5
Length+1.5  6
Length-2.5  5
Length      4
5           5
Expected Output:
Formula  Length
5        5
6+1.5    6
5-2.5    5
4        4
5        5
However, with the code above, it replaces my entire cell instead of 'Length' only.
I found this happens because df['Length'] (a column) is used as the replacement; if I use any plain string instead, the trailing offset (e.g. -1.5) is not wiped out.
I am getting the output below:
Formula  Length
5        5
6        6
5        5
4        4
5        5
Is there a replace method that can use values from another column?
Thank you.
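For reference, a minimal reproducible setup for the data above (column dtypes assumed):
import pandas as pd

df = pd.DataFrame({'Formula': ['Length', 'Length+1.5', 'Length-2.5', 'Length', '5'],
                   'Length': [5, 6, 5, 4, 5]})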
If you want to replace using values from another column, it is necessary to use DataFrame.apply:
df["Formula"]= df.apply(lambda x: x['Formula'].replace('Length', str(x['Length'])), axis=1)
print (df)
Formula Length
0 5 5
1 6+1.5 6
2 5-2.5 5
3 4 4
4 5 5
Or list comprehension:
df["Formula"]= [x.replace('Length', str(y)) for x, y in df[['Formula','Length']].to_numpy()]
Just wanted to add that the list comprehension is, of course, much faster:
df = pd.DataFrame({'a': ['aba'] * 1000000, 'c': ['c'] * 1000000})
%timeit df.apply(lambda x: x['a'].replace('b', x['c']), axis=1)
# 1 loop, best of 5: 11.8 s per loop
%timeit [x.replace('b', str(y)) for x, y in df[['a', 'c']].to_numpy()]
# 1 loop, best of 5: 1.3 s per loop
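Back on the original Formula/Length frame: if 'Length' only ever appears as a prefix of the formula (true for the sample data, but an assumption in general), a fully vectorized sketch avoids the per-row Python call entirely:
# Replace the 'Length' prefix with the row's Length value, leaving other rows alone.
mask = df['Formula'].str.startswith('Length')
df.loc[mask, 'Formula'] = (df.loc[mask, 'Length'].astype(str)
                           + df.loc[mask, 'Formula'].str[len('Length'):])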

Python/Pandas: use one column's value to be the suffix of the column name from which I want a value

I have a pandas dataframe. From multiple columns therein, I need to select the value from only one into a single new column, according to the ID (bar in this example) of that row.
I need the fastest way to do this.
Dataframe for application is like this:
foo bar ID_A ID_B ID_C ID_D ID_E ...
1 B 1.5 2.3 4.1 0.5 6.6 ...
2 E 3 4 5 6 7 ...
3 A 9 6 3 8 1 ...
4 C 13 5 88 9 0 ...
5 B 6 4 6 9 4 ...
...
An example of a way to do it (my fastest at present) is thus - however, it is too slow for my purposes.
df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G
Result will be like this (after dropping used columns):
foo baz
1 2.3
2 7
3 9
4 88
5 4
...
I have tried with .apply() and it was very slow.
I tried with np.where() which was still much slower than the example shown above (which was 1000% faster than np.where()).
Would appreciate recommendations!
Many thanks
EDIT: after the first few answers, I think I need to add this:
"whilst I would appreciate runtime estimate relative to the example, I know it's a small example so may be tricky.
My actual data has 280000 rows and an extra 50 columns (which I need to keep along with foo and baz). I have to reduce 13 columns to the single column per the example.
The speed is the only reason for asking, & no mention of speed thus far in first few responses. Thanks again!"
You can use a variant of the indexing lookup:
idx, cols = pd.factorize('ID_'+df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
foo baz
0 1 2.3
1 2 7.0
2 3 9.0
3 4 88.0
4 5 4.0
testing speed
Setting up a test dataset (280k rows, 52 ID columns):
from string import ascii_uppercase, ascii_lowercase
letters = list(ascii_lowercase+ascii_uppercase)
N = 280_000
np.random.seed(0)
df = (pd.DataFrame({'foo': np.arange(1, N+1),
                    'bar': np.random.choice(letters, size=N)})
      .join(pd.DataFrame(np.random.random(size=(N, len(letters))),
                         columns=[f'ID_{l}' for l in letters]))
      )
speed testing:
%%timeit
idx, cols = pd.factorize('ID_'+df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
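Since the real data has around 50 extra columns to keep, the same lookup can (as a sketch) be assigned straight back onto the original frame instead of building a new one, and the ID_* columns dropped afterwards:
idx, cols = pd.factorize('ID_' + df['bar'])
df['baz'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
df = df.drop(columns=[c for c in df.columns if c.startswith('ID_')])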
You can try this. It should generalize to an arbitrary number of columns.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 'B', 1.5, 2.3, 4.1, 0.5, 6.6],
                   [2, 'E', 3, 4, 5, 6, 7],
                   [3, 'A', 9, 6, 3, 8, 1],
                   [4, 'C', 13, 5, 88, 9, 0],
                   [5, 'B', 6, 4, 6, 9, 4]])
df.columns = ['foo', 'bar', 'ID_A', 'ID_B', 'ID_C', 'ID_D', 'ID_E']

for val in np.unique(df['bar'].values):
    df.loc[df.bar == val, 'baz'] = df[f'ID_{val}']
To show an alternative approach, you can perform a combination of melting your data and reindexing. In this case I used wide_to_long (instead of melt/stack) because of the patterned nature of your column names:
out = (
    pd.wide_to_long(
        df, stubnames=['ID'], i=['foo', 'bar'], j='', sep='_', suffix=r'\w+'
    )
    .loc[lambda d:
         d.index.get_level_values('bar') == d.index.get_level_values(level=-1),
         'ID'
    ]
    .droplevel(-1)
    .rename('baz')
    .reset_index()
)
print(out)
foo bar baz
0 1 B 2.3
1 2 E 7.0
2 3 A 9.0
3 4 C 88.0
4 5 B 4.0
An alternative to the above leverages .melt and .query to shorten the code.
out = (
    df.melt(id_vars=['foo', 'bar'], var_name='id', value_name='baz')
      .assign(id=lambda d: d['id'].str.get(-1))
      .query('bar == id')
)
print(out)
foo bar id baz
2 3 A A 9.0
5 1 B B 2.3
9 5 B B 4.0
13 4 C C 88.0
21 2 E E 7.0
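If only foo and baz are needed afterwards, a small follow-up sketch (assuming sorting by foo restores the desired order) drops the helper columns:
out = out.drop(columns=['bar', 'id']).sort_values('foo').reset_index(drop=True)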

Populate column in dataframe based on iat values

lookup={'Tier':[1,2,3,4],'Terr.1':[0.88,0.83,1.04,1.33],'Terr.2':[0.78,0.82,0.91,1.15],'Terr.3':[0.92,0.98,1.09,1.33],'Terr.4':[1.39,1.49,1.66,1.96],'Terr.5':[1.17,1.24,1.39,1.68]}
df={'Tier':[1,1,2,2,3,2,4,4,4,1],'Territory':[1,3,4,5,4,4,2,1,1,2]}
df=pd.DataFrame(df)
lookup=pd.DataFrame(lookup)
lookup contains the lookup values, and df contains the data being fed into iat.
I get the correct values when I print(lookup.iat[tier,terr]). However, when I try to set those values in a new column, it endlessly runs, or in this simple test case just copies 1 value 10 times.
for i in df["Tier"]:
    tier = i - 1
    for j in df["Territory"]:
        terr = j
        #print(lookup.iat[tier,terr])
        df["Rate"] = lookup.iat[tier, terr]
Any thoughts on a possible better solution?
You can use apply() after some modification to your lookup dataframe:
lookup = lookup.rename(columns={i: i.split('.')[-1] for i in lookup.columns}).set_index('Tier')
lookup.columns = lookup.columns.astype(int)
df['Rate'] = df.apply(lambda x: lookup.loc[x['Tier'],x['Territory']], axis=1)
Returns:
Tier Territory Rate
0 1 1 0.88
1 1 3 0.92
2 2 4 1.49
3 2 5 1.24
4 3 4 1.66
5 2 4 1.49
6 4 2 1.15
7 4 1 1.33
8 4 1 1.33
9 1 2 0.78
Once lookup is modified a bit in the same way as @rahlf23's answer, plus using stack, you can merge both dataframes as follows:
df['Rate'] = df.merge(lookup.rename(columns={i: int(i.split('.')[-1])
                                             for i in lookup.columns if 'Terr' in i})
                            .set_index('Tier').stack()
                            .reset_index().rename(columns={'level_1': 'Territory'}),
                      how='left')[0]
If you have a big dataframe df, this should be faster than using apply and loc.
Also, if any (Tier, Territory) pair in df does not exist in lookup, this method won't throw an error.
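For completeness, a fully vectorized sketch (assuming Tier and Territory are 1-based and the columns Terr.1 through Terr.5 all exist) turns lookup into a plain NumPy matrix and indexes it directly with the (Tier, Territory) pairs:
import numpy as np

# Rate matrix indexed by (Tier - 1, Territory - 1).
rates = lookup.set_index('Tier')[[f'Terr.{i}' for i in range(1, 6)]].to_numpy()
df['Rate'] = rates[df['Tier'].to_numpy() - 1, df['Territory'].to_numpy() - 1]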

split string column by "_", drop the preceding text, recombine str by "_" in pandas

>e = {0: pd.Series(['NHL_toronto_maple-leafs_Canada', 'NHL_boston_bruins_US', 'NHL_detroit_red-wings', 'NHL_montreal'])}
>df = pd.DataFrame(e)
>df
0
0 NHL_toronto_maple-leafs_Canada
1 NHL_boston_bruins_US
2 NHL_detroit_red-wings
3 NHL_montreal
I want to:
1) split the above dataframe (Series) by '_'
2) drop the 'NHL' string
3) recombine the remaining text by '_'
4) attach the result in #3 to the original dataframe as the second column
To do this I tried the following:
>df2 = df.icol(0).str.split('_').apply(pd.Series).iloc[:,1:]
>df2
1 2 3
0 toronto maple-leafs Canada
1 boston bruins US
2 detroit red-wings NaN
3 montreal NaN NaN
I tried to follow the suggestion in combine columns in Pandas by doing something like:
>df2['4'] = df2.iloc[:,0] + "_" + df2.iloc[:,1] + "_" + df2.iloc[:,2]
>df2
1 2 3 4
0 toronto maple-leafs Canada toronto_maple-leafs_Canada
1 boston bruins US boston_bruins_US
2 detroit red-wings NaN NaN
3 montreal NaN NaN NaN
However, you can see that in situations where a combine involves a cell that is NaN the end result is NaN as well. This is not what I want.
Column 4 should look like:
toronto_maple-leafs_Canada
boston_bruins_US
detroit_red-wings
montreal
Also, is there an efficient way to do this type of operation, as my real data set is quite large?
If you're just looking to remove the leading 'NHL_' substring, you could just do
In [84]: df[0].str[4:]
Out[84]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
However, if you need to split and join, you could use a string method like-
In [85]: df[0].str.split('_').str[1:].str.join('_')
Out[85]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
Alternatively, you could also use apply
In [86]: df[0].apply(lambda x: '_'.join(x.split('_')[1:])) # Also, x.split('_', 1)[1]
Out[86]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
And, as @DSM pointed out, "split accepts an argument for the maximum number of splits":
In [87]: df[0].str.split("_", 1).str[1]
Out[87]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
Depending on the size of your data, you could benchmark these methods and use appropriate one.
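To attach the cleaned strings to the original dataframe as a second column (step 4 in the question), any of the above can simply be assigned back, for example (column name chosen arbitrarily here):
df['team'] = df[0].str.split('_').str[1:].str.join('_')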
You could use apply like this:
In [1782]: df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
Out[1782]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
In [1783]: df[0] = df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
Surprisingly, the str methods seem to take longer than apply:
In [1811]: %timeit df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
10000 loops, best of 3: 127 µs per loop
In [1810]: %timeit df[0].str[4:]
1000 loops, best of 3: 179 µs per loop
In [1812]: %timeit df[0].str.split('_').str[1:].str.join('_')
1000 loops, best of 3: 553 µs per loop
In [1813]: %timeit df[0].str.split("_", 1).str[1]
1000 loops, best of 3: 374 µs per loop

Pandas dataframe total row

I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series which, although I can convert it back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated. You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe, e.g.
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
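Note that DataFrame.append was removed in pandas 2.0; a rough pd.concat equivalent of this display-only total (numeric_only=True assumed, so string columns show NaN in the total row) is:
pd.concat([df, df.sum(numeric_only=True).rename('Total').to_frame().T])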
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
               margins=True,
               margins_name='total',  # defaults to 'All'
               aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
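If you want foo back as a regular column rather than as the index, a small variation (passing aggfunc as the string 'sum', which is equivalent here) is:
out = df.pivot_table(index='foo', margins=True, margins_name='total',
                     aggfunc='sum').reset_index()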
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that renders in Jupyter with the sum row appended.
Styling
With a little longer code, you can even make the last row look different:
df.style.concat(
    df.agg(['sum']).style
      .set_properties(**{'background-color': 'yellow'})
)
to get the same table with the last row highlighted in yellow.
See other ways to style (such as bold fonts or table lines) in the docs.
The following helped me add a column total and a row total to a dataframe.
Assume dft1 is your original dataframe; now add a column total and a row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
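On pandas 2.0 and later, where DataFrame.append no longer exists, the last step can be written with pd.concat instead (a sketch, producing the same result):
dft1 = pd.concat([dft1, dft1_sum])  # append the row_total row to dft1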
Actually, all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate subsequent computations, which is easy to overlook and could lead to false results.
This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields

   0
0  1
1  5
2  6
3  8
4  9

             0
count        5
mean       5.8
std    3.11448
min          1
25%          5
50%          6
75%          8
max          9
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
         0
0        1
1        5
2        6
3        8
4        9
Totals  29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:
             0
count        6
mean   9.66667
std    9.87252
min          1
25%       5.25
50%          7
75%       8.75
max         29
So: Watch out! and apply this only after doing all other analyses of the data or work on a copy of the DataFrame!
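One way to follow that advice is a small sketch that keeps the analysis frame intact and only adds the total to a throwaway display copy:
display_df = df.copy()
display_df.loc['Totals'] = df.sum(numeric_only=True, axis=0)
print(display_df)       # the total row exists only in the copy
print(df.describe())    # statistics are still computed on the original data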
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end, so as to avoid breaking the integrity of the dataframe (right before printing), I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False
                      ) -> pd.DataFrame:
    ret = df.copy()
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)
