Assuming that I have a pandas dataframe and I want to add thousand separators to all the numbers (integer and float), what is an easy and quick way to do it?
When formatting a number with a comma as the thousands separator, you can just use '{:,}'.format:
n = 10000
print('{:,}'.format(n))   # 10,000
n = 1000.1
print('{:,}'.format(n))   # 1,000.1
In pandas, you can use the formatters parameter to to_html as discussed here.
import numpy as np

num_format = lambda x: '{:,}'.format(x)

def build_formatters(df, fmt):
    # Apply the formatter to every int64/float64 column.
    return {
        column: fmt
        for column, dtype in df.dtypes.items()
        if dtype in [np.dtype('int64'), np.dtype('float64')]
    }

formatters = build_formatters(data_frame, num_format)
data_frame.to_html(formatters=formatters)
Adding the thousands separator has actually been discussed quite a bit on Stack Overflow. You can read here or here.
Use Series.map or Series.apply with one of these solutions:
df['col'] = df['col'].map('{:,}'.format)
df['col'] = df['col'].map(lambda x: f'{x:,}')
df['col'] = df['col'].apply('{:,}'.format)
df['col'] = df['col'].apply(lambda x: f'{x:,}')
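As a quick sketch with a made-up column, all four variants produce the same result:

```python
import pandas as pd

# Any of the map/apply variants above turns the numbers into
# strings with thousands separators.
df = pd.DataFrame({'col': [1234567, 8901]})
df['col'] = df['col'].map('{:,}'.format)
print(df['col'].tolist())  # ['1,234,567', '8,901']
```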
Assuming you just want to display (or render to HTML) the floats/integers with a thousands separator, you can use styling, which was added in version 0.17.1:
import pandas as pd
df = pd.DataFrame({'int': [1200, 320], 'flt': [5300.57, 12000000.23]})
df.style.format('{:,}')
To render this output to HTML you use the render method on the Styler (in newer pandas versions, Styler.to_html).
Steps
Use df.applymap() to apply a function to every cell in your dataframe.
Check whether the cell value is of type int or float.
Format numbers using f'{x:,d}' for integers and f'{x:,f}' for floats.
Here is a simple example for integers only:
df = df.applymap(lambda x: f'{x:,d}' if isinstance(x, int) else x)
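One caveat: values pulled out of a numeric DataFrame arrive as NumPy scalars, for which isinstance(x, int) may be False, so checking np.integer as well (and converting with int() before formatting) is safer. A small sketch with invented column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': [1200, 45000], 'label': ['a', 'b']})
# Check both Python int and NumPy integer scalars before formatting;
# non-integer cells pass through unchanged.
formatted = df.applymap(
    lambda x: f'{int(x):,d}' if isinstance(x, (int, np.integer)) else x
)
print(formatted['n'].tolist())  # ['1,200', '45,000']
```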
If you want "." as the thousands separator and "," as the decimal separator, this will work (note regex=False, since "." would otherwise be interpreted as a regular expression):
Data = pd.read_excel(path)
Data[my_numbers] = Data[my_numbers].map('{:,.2f}'.format).str.replace(",", "~", regex=False).str.replace(".", ",", regex=False).str.replace("~", ".", regex=False)
If you want three decimals instead of two, change "2f" to "3f":
Data[my_numbers] = Data[my_numbers].map('{:,.3f}'.format).str.replace(",", "~", regex=False).str.replace(".", ",", regex=False).str.replace("~", ".", regex=False)
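As a sanity check, the replace chain can be tried on a small made-up Series; "~" serves as a temporary placeholder so the two separators can be swapped without colliding:

```python
import pandas as pd

s = pd.Series([1234.5, 9876543.21])
swapped = (s.map('{:,.2f}'.format)               # '1,234.50'
            .str.replace(',', '~', regex=False)  # ',' -> placeholder
            .str.replace('.', ',', regex=False)  # '.' -> ','
            .str.replace('~', '.', regex=False)) # placeholder -> '.'
print(swapped.tolist())  # ['1.234,50', '9.876.543,21']
```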
The formatters parameter in to_html takes a dictionary mapping column names to formatter functions.
Click the example link for reference
I have a pandas dataframe containing a lot of variables:
df.columns
Out[0]:
Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V6_count_nr_lesion_PRATZE',
'COUNADU_SOIL_P_SAUDPC_150_DA_B_V6_lesion_saudpc_PRATZE',
'CONTRO_SOIL_P_pUNCK_150_DA_B_V6_lesion_p_control_PRATZE',
'COUNJUV_SOIL_P_p_0_100_16_DA_B_V6_lesion_incidence_PRATZE',
'COUNADU_SOIL_P_p_0_100_50_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_p_0_100_128_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_V6_count_nr_spiral_HELYSP',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V10_count_nr_spiral_HELYSP', # and so on
I would like to keep only the number followed by DA, so the first column is 16_DA. I have been using the pandas function findall():
df.columns.str.findall(r'[0-9]*\_DA')
Out[595]:
Index([ ['16_DA'], ['50_DA'], ['128_DA'], ['150_DA'], ['150_DA'],
['16_DA'], ['50_DA'], ['128_DA'], ['50_DA'], ['128_DA'], ['150_DA'],
['150_DA'], ['50_DA'], ['128_DA'],
But this returns a list for each column, which I would like to avoid, so that I end up with a column index looking like this:
df.columns
Out[595]:
Index(['16_DA', '50_DA', '128_DA', '150_DA', '150_DA',
'16_DA', '50_DA', '128_DA', '50_DA', '128_DA', '150_DA',
Is there a smoother way to do this?
You can use .str.join(", ") to join all found matches with a comma and space:
df.columns.str.findall(r'\d+_DA').str.join(", ")
Or, just use str.extract to get the first match:
df.columns.str.extract(r'(\d+_DA)', expand=False)
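A quick check on a couple of the column names from the question:

```python
import pandas as pd

cols = pd.Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
                 'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE'])
# str.extract with expand=False returns the first match per element
# as a plain string rather than a list.
result = cols.str.extract(r'(\d+_DA)', expand=False)
print(list(result))  # ['16_DA', '50_DA']
```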
If you instead want every match, you can flatten the per-column lists and join them:
from typing import List

pattern = r'\d+_DA'
# findall returns a list of matches per column name; flatten into one list.
flattened: List[str] = sum(df.columns.str.findall(pattern), [])
output: str = ",".join(flattened)
I am having a recurring problem with saving large numbers in Python to csv. The numbers are millisecond epoch time stamps, which I cannot convert or truncate and have to save in this format. As the columns with the millisecond timestamps also contain some NaN values, pandas casts them automatically to float (see the documentation in the Gotchas under "Support for integer NA").
I cannot seem to avoid this behaviour, so my question is, how can I save these numbers as an integer value when using df.to_csv, i.e. with no decimal point or trailing zeros? I have columns with numbers of different floating precision in the same dataframe and I do not want to lose the information there. Using the float_format parameter in to_csv seems to apply the same format for ALL float columns in my dataframe.
An example:
>>> df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
>>> df['b'].dtype
Out[1]: dtype('int64')
>>> df.loc[2] = np.NaN
>>> df
Out[1]:
a b
0 1.25 1.424380e+12
1 2.54 1.425511e+12
2 NaN NaN
>>> df['b'].dtype
dtype('float64')
>>> df.to_csv('test.csv')
>>> with open ('test.csv') as f:
... for line in f:
... print(line)
,a,b
0,1.25,1.42438044944e+12
1,2.54,1.42551073119e+12
2,,
As you can see, I lost the precision of the last two digits of my epoch time stamp.
While DataFrame.to_csv does not have a parameter to change the format of individual columns, DataFrame.to_string does. It is a little cumbersome and might be a problem for very large DataFrames, but you can use it to produce a properly formatted string and then write that string to a file (as suggested in this answer to a similar question). to_string's formatters parameter takes, for example, a dictionary of functions to format individual columns. In your case, you could write your own custom formatter for the "b" column, leaving the defaults for the other column(s). This formatter might look somewhat like this:
def printInt(b):
    if pd.isnull(b):
        return "NaN"
    else:
        return "{:d}".format(int(b))
Now you can use this to produce your string:
df.to_string(formatters={"b": printInt}, na_rep="NaN")
which gives:
' a b\n0 1.25 1424380449437\n1 2.54 1425510731187\n2 NaN NaN'
You can see that there is still the problem that this is not comma separated and to_string actually has no parameter to set a custom delimiter, but this can easily be fixed by a regex:
import re
re.sub("[ \t]+(NaN)?", ",",
       df.to_string(formatters={"b": printInt}, na_rep="NaN"))
gives:
',a,b\n0,1.25,1424380449437\n1,2.54,1425510731187\n2,,'
This can now be written into the file:
with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+(NaN)?", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)
which results in what you wanted:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,,
If you want to keep the NaN's in the csv-file, you can just change the regex:
with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)
will give:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
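The whole round trip can be sketched without touching the filesystem, since to_string returns the formatted text directly:

```python
import re
import numpy as np
import pandas as pd

def printInt(b):
    # Render the float-typed timestamp column as a full integer, NaN-safe.
    if pd.isnull(b):
        return "NaN"
    return "{:d}".format(int(b))

df = pd.DataFrame({'a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
df.loc[2] = np.nan  # adding a NaN row casts 'b' to float64
# Collapse the whitespace padding from to_string into commas.
csv_text = re.sub("[ \t]+", ",",
                  df.to_string(formatters={"b": printInt}, na_rep="NaN"))
print(csv_text)
```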
If your DataFrame contained strings with whitespace before, a robust solution is not as easy. You could insert another character in front of every value that indicates the start of the next entry. If all your strings contain only single whitespaces, you could use an additional whitespace, for example. This changes the code to the following:
import pandas as pd
import numpy as np
import re
df = pd.DataFrame({'a a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
df.loc[2] = np.NaN

def printInt(b):
    if pd.isnull(b):
        return " NaN"
    else:
        return " {:d}".format(int(b))

def printFloat(a):
    if pd.isnull(a):
        return " NaN"
    else:
        return " {}".format(a)

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t][ \t]+", ",",
                 df.to_string(formatters={"a a": printFloat, "b": printInt},
                              na_rep="NaN", col_space=2)),
          file=f)
which would give:
,a a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
Maybe this could work:
import numpy as np
import pandas as pd

pd.set_option('display.precision', 15)
df = pd.DataFrame({'a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
fg = df.applymap(str)   # stringify every cell while 'b' is still int64
fg.loc[2] = np.NaN
fg.to_csv('test.csv', na_rep='NaN')
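Because to_csv returns a string when no path is given, the stringify-first idea can be checked inline (this is only a sketch of the approach above; astype(str) is equivalent to applymap(str) here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
as_str = df.astype(str)   # stringify while 'b' is still int64
as_str.loc[2] = np.nan
csv_text = as_str.to_csv(na_rep='NaN')
print(csv_text)
```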
Your output should then contain the timestamps at full precision.
I had the same problem with large numbers; for Excel files, prefixing every value with a tab character forces Excel to read the column as text:
df = "\t" + df.astype(str)
I have a .csv file with a date column, and the date looks like this.
date
2016年 4月 1日 <-- there are whitespaces in this row
...
2016年10月10日
The date format is Japanese date format. I'm trying to convert this column to 'YYYY-MM-DD', and the python code I'm using is below.
data['date'] = [datetime.datetime.strptime(d, '%Y年%m月%d日').date() for d in data['date']]
There is one problem, the date column in the .csv may contain whitespace when the month/day is a single digit. And my code doesn't work well when there is a whitespace.
Any solutions?
In pandas it is best to avoid list comprehensions when a vectorized solution exists, both for performance and because list comprehensions do not handle NaNs.
I think you need to replace \s+ (one or more whitespace characters) with an empty string, use pandas.to_datetime for converting to datetimes, and finally add .dt.date for dates:
data['date'] = (pd.to_datetime(data['date'].str.replace(r'\s+', '', regex=True), format='%Y年%m月%d日')
                .dt.date)
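A quick check with the two sample dates from the question:

```python
import pandas as pd

s = pd.Series(['2016年 4月 1日', '2016年10月10日'])
# Strip all whitespace first, then parse with the Japanese format string.
dates = pd.to_datetime(s.str.replace(r'\s+', '', regex=True),
                       format='%Y年%m月%d日').dt.date
print(dates.astype(str).tolist())  # ['2016-04-01', '2016-10-10']
```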
Performance:
The plot was created with perfplot:
import datetime
import pandas as pd
import perfplot

def list_compr(df):
    df['date1'] = [datetime.datetime.strptime(d.replace(" ", ""), '%Y年%m月%d日').date()
                   for d in df['date']]
    return df

def vector(df):
    df['date2'] = pd.to_datetime(df['date'].str.replace(r'\s+', '', regex=True),
                                 format='%Y年%m月%d日').dt.date
    return df

def make_df(n):
    df = pd.DataFrame({'date': ['2016年 4月 1日', '2016年10月10日']})
    df = pd.concat([df] * n, ignore_index=True)
    return df

perfplot.show(
    setup=make_df,
    kernels=[list_compr, vector],
    n_range=[2**k for k in range(2, 13)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')
I don't know Python actually, but wouldn't something like replacing d in strptime with d.replace(" ", "") do the trick?
I have a pandas dataframe with columns and rows. Now I want to create another column which will be a concatenation of two strings and a column from the dataframe.
So the way it would work is: string one (see the dictionary below) + colx (from the dataframe) + string two.
stringList = {
    'one': """ AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+""",
    'two': """ FRQ:4 NOT:1 PX:C PXND:1E-6:DOWN RDTE:MAT RP:1 SET:0WW XD:NO """
}
I tried to create a function, but I don't think it is working as I want. I want this to be a function so I can call it from another function.
def fun(final):
    for i in dm:
        c = stringList['one'] + str(dm[i]) + stringList['two']
        final.append(c)
Please help with this as I am stuck with this problem for now.
Required Output:
str1 |QM |str2 |output
AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+ |0.0125 | RQ:4 NOT:1 PX:C PXND:1E-6:DOWN RDTE:MAT RP:1 SET:0WW XD:NO| AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+0.0125RQ:4 NOT:1 PX:C PXND:1E-6:DOWN RDTE:MAT RP:1 SET:0WW XD:NO
AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+ 0.016 RQ:4 NOT:1 PX:C PXND:1E-
Hope this helps explain. I know it is not a very good representation, but I have this problem, which is critical for me to solve.
Thanks
After looking at your output, I realized that you want to concatenate the three columns str1, QM, and str2. I am assuming here that str1 and str2 have dtype str and QM has dtype float. You can use the following code to build the output column:
df["output"] = df["str1"] + df["QM"].astype(str) + df["str2"]
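For example, with shortened sample values in the three columns from the table above:

```python
import pandas as pd

df = pd.DataFrame({'str1': ["AC:A000 AMI:NO "],
                   'QM': [0.0125],
                   'str2': [" FRQ:4 NOT:1"]})
# Cast the float column to str so the three parts concatenate cleanly.
df['output'] = df['str1'] + df['QM'].astype(str) + df['str2']
print(df['output'].iloc[0])  # AC:A000 AMI:NO 0.0125 FRQ:4 NOT:1
```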