Formatting numbers in multiindex array Pandas - python

I have a dataframe that looks like this:
Admin ... Unnamed: 14
Job Family Name Values ...
Dentist McDentistFace, Dentist UDS Encounters 0.000000 ... 1.000000
Actual FTE 0.000000 ... 1.000000
UDS Encounters2 NaN ... 1475.000000
Actual FTE2 NaN ... 7.589426
Where the Job Family, Name, and Values are all dimensions of a multiindex.
I'm trying to format the float values in the file, but can't seem to get it to work. I have been able to highlight certain rows with this line:
for i in flagged_providers:
    ind = flagged_providers.index(i) * 4
    for q in i.results.keys():
        style.apply(highlight_col, axis=0, subset=(style.index[ind: ind + 4], q))
        # style.apply(format_numbers, axis=0, subset=(style.index[ind: ind + 2], q))
where format_numbers is:
def format_numbers(s):
    return f'{s:,.2f}'
and I have also tried this:
for i in flagged_providers:
    format_dict[(i.jfam, i.name)] = '{:.2f}'
    format_dict[(i.jfam, i.name)] = '{:.2f}'
style.format(formatter=format_dict)
But I can't get either approach to work. Any ideas? I want to format the first two rows as percentages, then export to Excel using the to_excel function.

I finally figured it out. There is probably a better way to do this, but what worked was:
style.applymap(lambda x: 'number-format:0.00%;', subset=(style.index[ind: ind + 2], locations))

Related

Pandas .transform() results in NaN values after update to newer version

I have some code that used to function ~3-4 years ago. I've upgraded to newer versions of pandas, numpy, python since then and it has broken. I've isolated what I believe is the issue, but don't quite understand why it occurs.
def function_name(S):
    L = df2.reindex(S.index.droplevel(['column1','column2']))*len(S)
    return (-L/np.expm1(-L) - 1)
gb = df.groupby(level=['name1', 'name2'])
dc = gb.transform(function_name)
Problem: the last line "dc" is a pandas.Series with only NaN values. It should have no NaN values.
Relevant information -- the gb object is correct and has no NaN or null values. Also, when I print out the "L" in the function, or the "return" in the function, I get the correct values. However, it's lost somewhere in the "dc" line. When I swap 'transform' to 'apply' I get the correct values out of 'dc' but the object has duplicate column labels that make it unusable.
Thanks!
EDIT:
Below is some minimal code I spun up to produce the error.
import pandas as pd
import numpy as np
df1_arrays = [
    np.array(["CAT","CAT","CAT","CAT","CAT","CAT","CAT","CAT"]),
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]
df2_arrays = [
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]
df1 = pd.Series(np.abs(np.random.randn(8))*100, index=df1_arrays)
df2 = pd.Series(np.abs(np.random.randn(8)), index=df2_arrays)
df1.index.set_names(["mouse", "target", "barcode"], inplace=True)
df2.index.set_names(["target", "barcode"], inplace=True)
def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)
    return (-lambdas/np.expm1(-lambdas) - 1)
gb = df1.groupby(level=['mouse','target'])
d_collisions = gb.transform(function_name)
print(d_collisions)
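mouse  target  barcode
CAT    A       AAAT      NaN
               AAAG      NaN
               AAAC      NaN
               AAAD      NaN
       B       AAAZ      NaN
               AAAX      NaN
               AAAW      NaN
               AAAM      NaN
The NaNs appear because your function returns a Series whose index (target, barcode) differs from the index of the group it was given (mouse, target, barcode). transform aligns the result back to the group's index, finds no matching labels, and fills every row with NaN.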
You can return a numpy array in your function:
def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)
    return (-lambdas/np.expm1(-lambdas) - 1).to_numpy() # convert to array here
gb = df1.groupby(level=['mouse','target'])
d_collisions = gb.transform(function_name)
output:
mouse  target  barcode
CAT    A       AAAT        6.338965
               AAAG        2.815679
               AAAC        0.547306
               AAAD        1.811785
       B       AAAZ        1.881744
               AAAX       10.986611
               AAAW        5.124226
               AAAM        0.250513
dtype: float64
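As an alternative to returning a NumPy array, the result can be given the group's own index before it is returned, so that transform has matching labels to align on. The sketch below uses small made-up stand-ins for df1 and df2, and the function name collisions is hypothetical. It assumes the reindexed values come back in the same row order as the group.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for df1 and df2 from the question
idx1 = pd.MultiIndex.from_tuples(
    [("CAT", "A", "AAAT"), ("CAT", "A", "AAAG"),
     ("CAT", "B", "AAAZ"), ("CAT", "B", "AAAX")],
    names=["mouse", "target", "barcode"],
)
df1 = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx1)
df2 = pd.Series([0.5, 1.0, 1.5, 2.0], index=idx1.droplevel("mouse"))

def collisions(S):
    # Look up the per-barcode rate without the 'mouse' level, scaled by group size
    lambdas = df2.reindex(S.index.droplevel("mouse")) * len(S)
    out = -lambdas / np.expm1(-lambdas) - 1
    # Re-attach the group's full index so the labels match what transform expects
    return pd.Series(out.to_numpy(), index=S.index)

d_collisions = df1.groupby(level=["mouse", "target"]).transform(collisions)
print(d_collisions)
```

Either variant avoids the label mismatch. Returning an array is shorter, while returning a Series with S.index makes the intended alignment explicit.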

Pandas sum of count per percentile of rows

Here is a link to a working example on Google Colaboratory.
I have a dataset that represents the reviews (between 0.0 and 10.0) that users have left on various books. It looks like this:
         user      sum  count      mean
0           2      0.0      1  0.000000
60223  159665      8.0      1  8.000000
60222  159662      8.0      1  8.000000
60221  159655      8.0      1  8.000000
60220  159651      5.0      1  5.000000
...       ...      ...    ...       ...
13576   35859   6294.0   5850  1.075897
37356   98391  51418.0   5891  8.728230
58113  153662  17025.0   6109  2.786872
74815  198711    123.0   7550  0.016291
4213    11676  62092.0  13602  4.564917
The first rows have 1 review while the last ones have thousands. I want to see the distribution of the reviews across the user population. I researched percentile or binning data with Pandas and found pd.qcut and pd.cut but using those, I was unable to get the format in the way I want it.
This is what I'm looking to get.
# users: reviews
# top 10%: 65K rev
# 10%-20%: 23K rev
# etc...
I could not figure out a "Pandas" way to do it so I wrote a loop to generate the data in that format myself and graph it.
SLICE_NUMBERS = 5
step_size = int(user_count/SLICE_NUMBERS)
labels = ['100-80', '80-60', '60-40', '40-20', '0-20']
count_per_percentile = []
for chunk_i in range(SLICE_NUMBERS):
    start_index = step_size * chunk_i
    end_index = start_index + step_size
    slice_sum = most_active_list.iloc[start_index:end_index]['count'].sum()
    count_per_percentile.append(slice_sum)
print(labels)
print(count_per_percentile)  # [21056, 21056, 25058, 62447, 992902]
How can I achieve the same outcome more directly with the library?
I think you can use qcut to create the slices and pass them to a groupby followed by sum. Here it is on the sample data given, slightly modified to avoid duplicated bin edges on this small sample (I replaced all the ones in count with 1, 2, 3, 4, 5):
count_per_percentile = (
    df['count']
    .groupby(pd.qcut(df['count'], q=[0,0.2,0.4,0.6,0.8,1])).sum()
    .tolist()
)
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152]
which is the same result as with your method.
If your real data has too many 1s, which leads to duplicated bin edges in qcut, you could also use np.array_split:
count_per_percentile = [_s.sum() for _s in np.array_split(df['count'].sort_values(),5)]
print(count_per_percentile)
# [3, 7, 5855, 12000, 21152] #same result
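Another way to sidestep duplicated bin edges, sketched here with made-up counts, is to bin on ranks instead of the raw values: ranking with method="first" gives every row a distinct rank, so qcut can always cut equal-sized slices. The labels list and the sample data are assumptions for illustration only.

```python
import pandas as pd

# Made-up review counts with many ties at 1, which would give qcut duplicate edges
counts = pd.Series([1, 1, 1, 1, 1, 1, 5850, 5891, 7550, 13602], name="count")
labels = ["0-20", "20-40", "40-60", "60-80", "80-100"]

# Ties are broken by order of appearance, so every rank is unique
ranks = counts.rank(method="first")

# Five equal-sized slices of the user population, least active first
bins = pd.qcut(ranks, q=5, labels=labels)

# Total number of reviews contributed by each slice of users
count_per_percentile = counts.groupby(bins, observed=False).sum()
print(count_per_percentile)
```

The result stays a labelled Series, which can be plotted directly or converted with tolist() as in the answer above.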

How to use pandas' df.get function for a dataframe column so that each row in the column maintains its own value?

To summarize as concisely as I can: I have a data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function together with pandas' df.map function to obtain the properties of each compound (there is one compound per row), using the "CID" number as the identifier. The parameters of pubchempy.get_properties are an identifier (the "CID" number in this case) and the property of the chemical that you want to obtain from the PubChem website (molecular weight in this case).
This is the code that I have written currently:
import pandas as pd
import pubchempy
import numpy as np
df = pd.read_csv("Data.tsv.txt", sep="\t")
from pubchempy import get_properties
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))
df = df.drop(df[df.CID=='nan'].index)
df = df.drop( df.index.to_list()[5:] ,axis = 0 )
df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)
df = df.rename(columns={'CID.': 'MolecularWeight'})
print(df)
This is the output that I was initially getting for that column (only including a few rows, in reality, dataset is very big):
MolecularWeight
[{'CID': 5339, 'MolecularWeight': '398.4'}]
[{'CID': 3889, 'MolecularWeight': '520.5'}]
[{'CID': 2788, 'MolecularWeight': '305.50'}]
[{'CID': 1422517, 'MolecularWeight': '440.5'}]
.
.
.
Now, the code was somewhat working in that it is providing me with the molecular weight of the compound (398.4) but I didn't want all that extra bit of writing nor did I want the quote marks around the molecular weight number (both of these get in the way of the next bit of code that I plan to write).
So I then added this bit of code:
df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')
This is the output that I am now getting:
MolecularWeight
398.4
398.4
398.4
398.4
.
.
.
What I want is almost the same, except that instead of taking the molecular weight from the first row of the MolecularWeight column and copying it onto all the other rows, I want each row in that column to show its own molecular weight value.
What I was hoping to get is something like this:
MolecularWeight
398.4
520.5
305.50
440.5
.
.
.
Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!
Few lines of text file:
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments
1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25) R2|R2|R25|R46| A
2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32) R25| A
3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H R18|R26|R27| A
Once each row holds only its own weight string (for example '398.4'), casting the column to float removes the quote marks: df['MolecularWeight'] = df['MolecularWeight'].astype(float).
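To give each row its own value rather than broadcasting the first row's, one option is to map over the column and pull the weight out of each cell individually. This is a minimal sketch on a made-up column shaped like the output shown in the question. The helper name extract_weight is hypothetical, and it assumes that rows without a valid result hold something other than a list (such as pd.NA).

```python
import pandas as pd

# Made-up column mimicking the list-of-dicts cells shown in the question
df = pd.DataFrame({"MolecularWeight": [
    [{"CID": 5339, "MolecularWeight": "398.4"}],
    [{"CID": 3889, "MolecularWeight": "520.5"}],
    [{"CID": 2788, "MolecularWeight": "305.50"}],
    pd.NA,  # a row for which no lookup was made
]})

def extract_weight(cell):
    # Each valid cell is a one-element list holding a dict of properties
    if isinstance(cell, list) and cell:
        return float(cell[0]["MolecularWeight"])
    return float("nan")  # keep the column numeric for missing rows

df["MolecularWeight"] = df["MolecularWeight"].map(extract_weight)
print(df)
```

The original line indexed [0][0] on the whole column, which selects a single scalar that pandas then assigns to every row. Mapping applies the same lookup once per cell.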
It appears that you may want multiple properties from each CID:
props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))
print(df2)
Output:
    CID MolecularWeight  HBondDonorCount  RotatableBondCount
0  5339           398.4                3                   6
1  3889           520.5                4                   9
2  2788          305.50                1                   0
You can then merge this information onto the original dataframe:
df = df.merge(df2) # df = df.merge(pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props)))
print(df)
...
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments MolecularWeight HBondDonorCount RotatableBondCount
0 1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]... O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18... NaN R2|R2|R25|R46| A NaN 398.4 3 6
1 2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]... COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)... 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8... NaN R25| A NaN 520.5 4 9
2 3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1... NaN R18|R26|R27| A NaN 305.50 1 0

Summarize a column in pandas data frame based on other columns

I have a small data frame tbl:
         CatAreaSqKm  CatMean  CatPctFull  CatCount       CatSum
COMID
1861888       0.2439   0.0000    0.000000         0     0.000000
1862004       0.4050  27.9765   18.222222        82  2294.072964
1862014       0.0720  27.9765   28.750000        23   643.459490
         UpCatAreaSqKm  UpCatMean  UpCatPctFull  UpCatCount      UpCatSum
COMID
1861888    105360.5349  29.177349     97.901832   114610993  3.344045e+09
1862004    105445.4517  29.174944     97.902537   114704191  3.346488e+09
1862014    105360.2127  29.177349     97.902093   114610948  3.344044e+09
I want to do the following operation:
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount))
However, if I get a zero for CatCount + UpCatCount I will be dividing by zero, so for that particular row I want to set 'WsMean' to zero but for the others I would like it to be computed for the value calculated by the statement above. How can I do this? I can only think of a statement like:
tbl['WsMean'] = 0
but that would blanket all records in the table with 0.
Any ideas? Thanks
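In pandas, 0/0 results in a NaN value, while a nonzero value divided by zero results in inf. If the sums are also zero whenever the counts are zero, you can use fillna(0) to replace the NaNs with zeros: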
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount)).fillna(0)
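If a row could have a zero count but a nonzero sum, the division yields inf rather than NaN, and fillna(0) alone would leave it in place. One way to cover both cases, sketched below on made-up rows, is to mask the zero denominators before dividing. The third row is a hypothetical edge case added for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; the third has a zero count with a nonzero sum
tbl = pd.DataFrame({
    "CatSum": [0.0, 2294.072964, 5.0],
    "UpCatSum": [0.0, 3.346488e9, 0.0],
    "CatCount": [0, 82, 0],
    "UpCatCount": [0, 114704191, 0],
})

total = tbl.CatSum + tbl.UpCatSum
count = tbl.CatCount + tbl.UpCatCount

# Zero counts become NaN, so the division gives NaN there instead of inf
tbl["WsMean"] = (total / count.where(count != 0)).fillna(0)
print(tbl)
```

Rows with a nonzero count are computed exactly as in the original statement. Only the masked rows are set to zero.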

Output different precision by column with pandas.DataFrame.to_csv()?

Question
Is it possible to specify a float precision specifically for each column to be printed by the Python pandas package method pandas.DataFrame.to_csv?
Background
If I have a pandas dataframe that is arranged like this:
In [53]: df_data[:5]
Out[53]:
year month day lats lons vals
0 2012 6 16 81.862745 -29.834254 0.0
1 2012 6 16 81.862745 -29.502762 0.1
2 2012 6 16 81.862745 -29.171271 0.0
3 2012 6 16 81.862745 -28.839779 0.2
4 2012 6 16 81.862745 -28.508287 0.0
There is the float_format option that can be used to specify a precision, but this applies that precision to all columns of the dataframe when printed.
When I use that like so:
df_data.to_csv(outfile, index=False,
               header=False, float_format='%11.6f')
I get the following, where vals is given a misleading precision:
2012,6,16, 81.862745, -29.834254, 0.000000
2012,6,16, 81.862745, -29.502762, 0.100000
2012,6,16, 81.862745, -29.171270, 0.000000
2012,6,16, 81.862745, -28.839779, 0.200000
2012,6,16, 81.862745, -28.508287, 0.000000
Change the type of column "vals" to a formatted string prior to exporting the data frame to a CSV file:
df_data['vals'] = df_data['vals'].map(lambda x: '%2.1f' % x)
df_data.to_csv(outfile, index=False, header=False, float_format='%11.6f')
The more current version of hknust's first line would be:
df_data['vals'] = df_data['vals'].map(lambda x: '{0:.1}'.format(x))
To print without scientific notation:
df_data['vals'] = df_data['vals'].map(lambda x: '{0:.1f}'.format(x))
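The difference between the two format specs can be seen on a few made-up values: without the trailing f, the .1 precision means one significant digit, so larger numbers switch to scientific notation, while .1f always means one digit after the decimal point.

```python
# Made-up sample values for illustration
values = [0.2, 12.3, 1234.5]

# '.1' with no type code: one significant digit, general format
general = ['{0:.1}'.format(v) for v in values]

# '.1f': fixed-point with one digit after the decimal point
fixed = ['{0:.1f}'.format(v) for v in values]

print(general)  # ['0.2', '1e+01', '1e+03']
print(fixed)    # ['0.2', '12.3', '1234.5']
```

For values like those in the vals column (0.0, 0.1, 0.2) both specs agree, which is why the problem only shows up once larger numbers appear.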
This question is a bit old, but I'd like to contribute what I think is a better answer:
formats = {'lats': '{:10.5f}', 'lons': '{:.3E}', 'vals': '{:2.1f}'}
for col, f in formats.items():
    df_data[col] = df_data[col].map(lambda x: f.format(x))
I tried the solution here, but it didn't work for me, so I decided to combine the previous solutions given here with the one from the link above.
You can use the round method on the dataframe before saving it to the file.
df_data = df_data.round(6)
df_data.to_csv('myfile.dat')
You can do this with to_string. It has a formatters argument where you can provide a dict mapping column names to formatters. Then you can use a regexp to replace the default column separators with your delimiter of choice.
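A rough sketch of the to_string idea on a made-up two-row frame follows. The column formats are chosen only for illustration, and the whitespace-to-comma substitution assumes that no formatted value contains a space.

```python
import re
import pandas as pd

# Made-up rows shaped like df_data from the question
df_data = pd.DataFrame({
    "year": [2012, 2012],
    "lats": [81.862745, 81.862745],
    "lons": [-29.834254, -29.502762],
    "vals": [0.0, 0.1],
})

# One formatter per column; columns not listed keep their default rendering
formatters = {
    "lats": "{:.6f}".format,
    "lons": "{:.6f}".format,
    "vals": "{:.1f}".format,
}

text = df_data.to_string(formatters=formatters, index=False, header=False)

# Collapse the padding between columns into a single comma per line
csv_lines = [re.sub(r"\s+", ",", line.strip()) for line in text.splitlines()]
print("\n".join(csv_lines))
```

The dataframe itself is left unchanged, since the formatting happens only in the rendered text.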
The to_string approach suggested by @mattexx looks better to me, since it doesn't modify the dataframe.
It also generalizes well when using jupyter notebooks to get pretty HTML output, via the to_html method. Here we set a new default precision of 4, and override it to get 5 digits for a particular column named wider:
from IPython.display import HTML
from IPython.display import display
pd.set_option('display.precision', 4)
display(HTML(df.to_html(formatters={'wider': '{:,.5f}'.format})))
