How can I copy a DataFrame with to_clipboard and paste it into Excel with commas as the decimal separator?
In R this is simple.
write.table(obj, 'clipboard', dec = ',')
But I cannot figure out how to do this with pandas' to_clipboard.
I unsuccessfully tried:
import locale
locale.setlocale(locale.LC_ALL, '')
(which returns 'Spanish_Argentina.1252'), and
df.to_clipboard(float_format = '%,%')
Since Pandas 0.16 you can use
df.to_clipboard(decimal=',')
to_clipboard() passes extra kwargs to to_csv(), which has other useful options.
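For pasting into Excel, it helps to combine decimal with a tab separator so each value lands in its own cell; a minimal sketch (the frame is made up, and a system clipboard must be available):
import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.25], "B": [3.125, 4.0]})

# decimal and sep are forwarded to to_csv; tab-separated text pastes
# into Excel one value per cell. Requires a clipboard backend
# (e.g. xclip or xsel on Linux).
df.to_clipboard(decimal=",", sep="\t", index=False)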
There are a few different ways to achieve this. First, it is possible with float_format and your locale, although the usage is not so straightforward (but simple once you know it): the float_format argument should be a callable:
df.to_clipboard(float_format='{:n}'.format)
A small illustration:
In [97]: df = pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
In [98]: df
Out[98]:
          A         B
0  1.125438 -1.015477
1  0.900816  1.283971
2  0.874250  1.058217
3 -0.013020  0.758841
4 -0.030534 -0.395631
In [99]: df.to_clipboard(float_format='{:n}'.format)
gives:
A B
0 1,12544 -1,01548
1 0,900816 1,28397
2 0,87425 1,05822
3 -0,0130202 0,758841
4 -0,0305337 -0,395631
If you don't want to rely on the locale setting but still have comma decimal output, you can do this:
class CommaFloatFormatter:
    def __mod__(self, x):
        # pandas applies float_format via the % operator, hence __mod__:
        # this swaps the decimal point for a comma on every float.
        return str(x).replace('.', ',')

df.to_clipboard(float_format=CommaFloatFormatter())
or simply do the conversion before writing the data to clipboard:
df.applymap(lambda x: str(x).replace('.',',')).to_clipboard()
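Note that applymap was deprecated in pandas 2.1 in favour of DataFrame.map; on recent versions the same one-liner becomes:
# DataFrame.map (pandas >= 2.1) replaces the deprecated applymap.
df.map(lambda x: str(x).replace('.', ',')).to_clipboard()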
Related
I am having a recurring problem saving large numbers in Python to csv. The numbers are millisecond epoch timestamps, which I cannot convert or truncate and have to save in this format. Since the columns with the millisecond timestamps also contain some NaN values, pandas casts them automatically to float (see the documentation in the Gotchas under "Support for integer NA").
I cannot seem to avoid this behaviour, so my question is: how can I save these numbers as integer values when using df.to_csv, i.e. with no decimal point or trailing zeros? I have columns with numbers of different floating precision in the same dataframe and I do not want to lose that information. Using the float_format parameter in to_csv seems to apply the same format to ALL float columns in my dataframe.
An example:
>>> df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
>>> df['b'].dtype
Out[1]: dtype('int64')
>>> df.loc[2] = np.NaN
>>> df
Out[1]:
      a             b
0  1.25  1.424380e+12
1  2.54  1.425511e+12
2   NaN           NaN
>>> df['b'].dtype
dtype('float64')
>>> df.to_csv('test.csv')
>>> with open('test.csv') as f:
...     for line in f:
...         print(line)
,a,b
0,1.25,1.42438044944e+12
1,2.54,1.42551073119e+12
2,,
As you can see, I lost the precision of the last two digits of my epoch time stamp.
While DataFrame.to_csv does not have a parameter to change the format of individual columns, DataFrame.to_string does. It is a little cumbersome and might be a problem for very large DataFrames, but you can use it to produce a properly formatted string and then write that string to a file (as suggested in this answer to a similar question). to_string's formatters parameter takes, for example, a dictionary of functions to format individual columns. In your case, you could write a custom formatter for the "b" column and leave the defaults for the other column(s). The formatter might look like this:
def printInt(b):
    if pd.isnull(b):
        return "NaN"
    else:
        return "{:d}".format(int(b))
Now you can use this to produce your string:
df.to_string(formatters={"b": printInt}, na_rep="NaN")
which gives:
' a b\n0 1.25 1424380449437\n1 2.54 1425510731187\n2 NaN NaN'
You can see that there is still the problem that this is not comma separated and to_string actually has no parameter to set a custom delimiter, but this can easily be fixed by a regex:
import re

re.sub("[ \t]+(NaN)?", ",",
       df.to_string(formatters={"b": printInt}, na_rep="NaN"))
gives:
',a,b\n0,1.25,1424380449437\n1,2.54,1425510731187\n2,,'
This can now be written into the file:
with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+(NaN)?", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)
which results in what you wanted:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,,
If you want to keep the NaN's in the csv-file, you can just change the regex:
with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)
will give:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
If your DataFrame already contained strings with whitespace, a robust solution is not as easy. You could insert another character in front of every value to mark the start of the next entry; if all your strings contain only single whitespaces, an additional whitespace works, for example. This changes the code to the following:
import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'a a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
df.loc[2] = np.NaN

def printInt(b):
    if pd.isnull(b):
        return " NaN"
    else:
        return " {:d}".format(int(b))

def printFloat(a):
    if pd.isnull(a):
        return " NaN"
    else:
        return " {}".format(a)

with open("/tmp/test.csv", "w") as f:
    # note: the formatter key must match the column name 'a a'
    print(re.sub("[ \t][ \t]+", ",",
                 df.to_string(formatters={"a a": printFloat, "b": printInt},
                              na_rep="NaN", col_space=2)),
          file=f)
which would give:
,a a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
Maybe this could work:
pd.set_option('display.precision', 15)
df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
fg = df.applymap(lambda x: str(x))
fg.loc[2] = np.NaN
fg.to_csv('test.csv', na_rep='NaN')
Your output should then be something like this (I'm on a mac):
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
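As a side note not in the original answers: on modern pandas (>= 0.24), the nullable integer dtype avoids the float cast entirely; a sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
df.loc[2] = np.nan

# 'Int64' (capital I) is the nullable integer dtype: it represents
# missing values as <NA> without casting the column to float.
df['b'] = df['b'].astype('Int64')
df.to_csv('test.csv', na_rep='')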
I had the same problem with large numbers; prepending a tab is the correct way to go for Excel files:
df = "\t" + df
Using pandas.read_csv with the parse_dates option and a custom date parser, I find that Pandas has a mind of its own about the data type it reads.
Sample csv:
"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
The actual datecleaner is here, but what I do boils down to this:
import pandas as pd

def dateclean(date):
    return str(int(date))  # Note: we return A STRING

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)

print(df.birth_date)
Output:
0 NaN
1 1625.0
2 1533.0
Name: birth_date, dtype: float64
I get type float64, even though I specified str. Also, if I take out the first line of the CSV (the one with the empty birth_date), I get type int. The workaround is easy:
return '"{}"'.format(int(date))
Is there a better way?
In data analysis, I can imagine it's useful that Pandas will say 'Hey dude, you thought you were reading strings, but in fact they're numbers'. But what's the rationale for overruling me when I tell it not to?
Using parse_dates / date_parser looks a bit complicated to me, unless you want to generalise your import across many date columns. I think you have more control with the converters parameter, where you can plug in your dateclean() function. You can also experiment with the dtype parameter.
The problem with the original dateclean() function is that it fails on the "" value, because int("") raises a ValueError. Pandas seems to fall back to its standard import behaviour when it encounters this problem, but with converters it will fail explicitly.
Below is the code to demonstrate a fix:
import pandas as pd
from pathlib import Path

doc = """"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"
"""
Path('my.csv').write_text(doc)

def dateclean(date):
    try:
        return str(int(date))
    except ValueError:
        return ''

df = pd.read_csv(
    'my.csv',
    parse_dates=['birth_date'],
    date_parser=dateclean,
    engine='python'
)

df2 = pd.read_csv(
    'my.csv',
    converters={'birth_date': dateclean}
)

print(df2.birth_date)
Hope it helps.
The problem is date_parser is designed specifically for conversion to datetime:
date_parser : function, default None. Function to use for converting a sequence of string columns to an array of datetime instances.
There is no reason you should expect this parameter to work for other types. Instead, you can use the converters parameter. Here we use toolz.compose to apply int and then str. Alternatively, you can use lambda x: str(int(x)).
from io import StringIO
import pandas as pd
from toolz import compose

mystr = StringIO('''"birth_date", "name"
"","Dr. Who"
"1625", "Rembrandt"
"1533", "Michel"''')

df = pd.read_csv(mystr,
                 converters={'birth_date': compose(str, int)},
                 engine='python')
print(df.birth_date)
0 NaN
1 1625
2 1533
Name: birth_date, dtype: object
If you need to replace NaN with empty strings, you can post-process with fillna:
print(df.birth_date.fillna(''))
0
1 1625
2 1533
Name: birth_date, dtype: object
I have a column Column1 in a pandas dataframe which is of type str, with values in the following form:
import pandas as pd
df = pd.read_table("filename.dat")
type(df["Column1"].ix[0]) #outputs 'str'
print(df["Column1"].ix[0])
which outputs '1/350'. So, this is currently a string. I would like to convert it into a float.
I tried this:
df["Column1"] = df["Column1"].astype('float64', raise_on_error = False)
But this didn't change the values into floats.
This also failed:
df["Column1"] = df["Column1"].convert_objects(convert_numeric=True)
And this failed:
df["Column1"] = df["Column1"].apply(pd.to_numeric, args=('coerce',))
How do I convert all the values of column "Column1" into floats? Could I somehow use regex to remove the parentheses?
EDIT:
The line
df["Meth"] = df["Meth"].apply(eval)
works, but only if I use it twice, i.e.
df["Meth"] = df["Meth"].apply(eval)
df["Meth"] = df["Meth"].apply(eval)
Why would this be?
You need to evaluate the expression (e.g. '1/350') in order to get the result, for which you can use Python's eval() function.
By wrapping Panda's apply() function around it, you can then execute the eval() function on every value in your column. Example:
df["Column1"].apply(eval)
As you're interpreting literals, you can also use the ast.literal_eval function as noted in the docs. Update: this won't work, as literal_eval() is still restricted to literals plus addition and subtraction (source).
Remark: as mentioned in other answers and comments on this question, the use of eval() is not without risks, as you're basically executing whatever input is passed in. In other words, if your input contains malicious code, you're giving it a free pass.
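As a safer alternative for plain fractions (not from the original answers): the standard library's fractions.Fraction parses 'a/b' strings directly, without executing code:
from fractions import Fraction

import pandas as pd

df = pd.DataFrame({'Column1': ['1/350', '3/4']})

# Fraction('1/350') parses the ratio; float() converts the result.
df['Column1'] = df['Column1'].apply(lambda x: float(Fraction(x)))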
Alternative option:
# Define a custom div function
def div(a, b):
    return int(a) / int(b)

# Split each string and pass the values to div
df_floats = df['col1'].apply(lambda x: div(*x.split('/')))
Second alternative in case of unclean data:
Using regular expressions, we can strip any non-digit characters that appear before the numerator and after the denominator.
# Define a custom div function (unchanged)
def div(a, b):
    return int(a) / int(b)

# Import the re module and define a precompiled pattern
import re
regex = re.compile(r'\D*(\d+)/(\d+)\D*')

df_floats = df['col1'].apply(lambda x: div(*regex.findall(x)[0]))
We'll lose a bit of performance, but the upside is that even with input like '!erefdfs?^dfsdf1/350dqsd qsd qs d', we still end up with the value of 1/350.
Performance:
When timing the options on a dataframe with 100,000 rows, the second option (using the user-defined div function) clearly wins:
using eval: 1 loop, best of 3: 1.41 s per loop
using div: 10 loops, best of 3: 159 ms per loop
using re: 1 loop, best of 3: 275 ms per loop
I hate advocating for the use of eval. I didn't want to spend time on this answer but I was compelled because I don't want you to use eval.
So I wrote this function that works on a pd.Series
def do_math_in_string(s):
    # On Python 3, division is __truediv__ (the original used __div__,
    # which only exists on Python 2).
    op_map = {'/': '__truediv__', '*': '__mul__', '+': '__add__', '-': '__sub__'}
    df = s.str.extract(r'(\d+)(\D+)(\d+)', expand=True)
    df = df.stack().str.strip().unstack()
    df.iloc[:, 0] = pd.to_numeric(df.iloc[:, 0]).astype(float)
    df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2]).astype(float)

    def do_op(x):
        # Look up the operator's dunder method on the left operand.
        return getattr(x[0], op_map[x[1]])(x[2])

    return df.T.apply(do_op)
Demonstration
s = pd.Series(['1/2', '3/4', '4/5'])
do_math_in_string(s)
0 0.50
1 0.75
2 0.80
dtype: float64
do_math_in_string(pd.Series(['1/2', '3/4', '4/5', '6+5', '11-7', '9*10']))
0 0.50
1 0.75
2 0.80
3 11.00
4 4.00
5 90.00
dtype: float64
Please don't use eval.
You can do it by applying eval to the column:
data = {'one':['1/20', '2/30']}
df = pd.DataFrame(data)
In [8]: df['one'].apply(eval)
Out[8]:
0 0.050000
1 0.066667
Name: one, dtype: float64
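A somewhat safer variant is pandas' own pd.eval, which accepts only a restricted expression syntax rather than arbitrary Python:
# pd.eval evaluates a limited arithmetic grammar, not full Python.
df['one'].apply(pd.eval)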
I've seen this and this on formatting floating-point numbers for display in pandas, but I'm interested in doing the same thing for integers.
Right now, I have:
pd.options.display.float_format = '{:,.2f}'.format
That works on the floats in my data, but will either leave annoying trailing zeroes on integers that are cast to floats, or I'll have plain integers that don't get formatted with commas.
The pandas docs mention a SeriesFormatter class about which I haven't been able to find any information.
Alternatively, if there's a way to write a single string formatter that will format floats as '{:,.2f}' and floats with zero trailing decimal as '{:,d}', that'd work too.
You could monkey-patch pandas.io.formats.format.IntArrayFormatter:
import contextlib
import numpy as np
import pandas as pd
import pandas.io.formats.format as pf

np.random.seed(2015)

@contextlib.contextmanager
def custom_formatting():
    orig_float_format = pd.options.display.float_format
    orig_int_format = pf.IntArrayFormatter
    pd.options.display.float_format = '{:0,.2f}'.format

    class IntArrayFormatter(pf.GenericArrayFormatter):
        def _format_strings(self):
            formatter = self.formatter or '{:,d}'.format
            fmt_values = [formatter(x) for x in self.values]
            return fmt_values

    pf.IntArrayFormatter = IntArrayFormatter
    yield
    pd.options.display.float_format = orig_float_format
    pf.IntArrayFormatter = orig_int_format

df = pd.DataFrame(np.random.randint(10000, size=(5, 3)), columns=list('ABC'))
df['D'] = np.random.random(df.shape[0]) * 10000

with custom_formatting():
    print(df)
yields
A B C D
0 2,658 2,828 4,540 8,961.77
1 9,506 2,734 9,805 2,221.86
2 3,765 4,152 4,583 2,011.82
3 5,244 5,395 7,485 8,656.08
4 9,107 6,033 5,998 2,942.53
while outside of the with-statement:
print(df)
yields
A B C D
0 2658 2828 4540 8961.765260
1 9506 2734 9805 2221.864779
2 3765 4152 4583 2011.823701
3 5244 5395 7485 8656.075610
4 9107 6033 5998 2942.530551
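As to the question's idea of a single formatter for both cases, the float_format callable can branch on whether the value is integral; a sketch (the NaN guard is an assumption about how the formatter gets called):
import pandas as pd

def smart_float_format(x):
    if x != x:                          # NaN guard
        return 'NaN'
    if x == int(x):                     # integral: thousands separators, no decimals
        return '{:,d}'.format(int(x))
    return '{:,.2f}'.format(x)          # otherwise: two decimals

pd.options.display.float_format = smart_float_format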
Another option for Jupyter notebooks is to use df.style.format('{:,}'), but it only works on a single dataframe as far as I know, so you would have to call this every time:
table.style.format('{:,}')

         col1       col2
0s  9,246,452  6,669,310
>0  2,513,002  5,090,144

table

       col1     col2
0s  9246452  6669310
>0  2513002  5090144
See: Styling — pandas 1.1.2 documentation.
Starting with Pandas 1.3.0, you can specify df.style.format(thousands=',') to use commas to separate thousands in floats, complex numbers, and integers.
See: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.format.html.
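For example, thousands can be combined with precision:
# pandas >= 1.3: thousands separators plus two-decimal floats.
df.style.format(thousands=',', precision=2)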
Although it's been years since the question was asked: in my case, even though I set the format at the beginning, the format changed after using add. We can use astype to convert it back.
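The answer's screenshots are not reproduced here; a guess at the scenario it describes, where arithmetic upcasts an integer column and astype converts it back:
import pandas as pd

df = pd.DataFrame({'A': [9246452, 6669310]})   # int64
df['A'] = df['A'].add(0.0)                     # float arithmetic upcasts to float64
df['A'] = df['A'].astype('int64')              # astype restores the integer dtype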
As a follow-up to this post (python pandas complex number), and now that complex numbers work fine with pandas, I want to save them but without the parentheses -
when I use the following commands, the last column (the complex numbers) is printed inside parentheses.
EDIT: here is the full code, to read the data file (sample here)
import numpy as np
import pandas as pd
df = pd.read_csv('final.dat', sep=",", header=None)
df.columns=['X.1', 'X.2', 'X.3', 'X.4','X.5', 'X.6', 'X.7', 'X.8']
df['X.8'] = df['X.8'].str.replace('i', 'j').apply(lambda x: complex(x))
df1 = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
df1['X.3'] = df["X.3"] #add extra columns
df1['X.4']=df["X.4"]
df1['X.6']=df["X.6"]
df1['X.7']=df["X.7"]
sorted_data = df1.reindex_axis(sorted(df1.columns), axis=1)
sorted_data.to_csv('final_sorted.dat', sep=',', header=False)
all works well, but in the output csv file the complex numbers are inside parentheses; I cannot use them this way, so I want to remove them
Pandas could probably have better support for reading/writing complex numbers, but at the moment this will work:
In [25]: df = DataFrame([[1+2j],[2-1j]],columns=list('A'))
In [26]: df
Out[26]:
A
0 (1+2j)
1 (2-1j)
In [27]: df['A'] = df['A'].apply(str).str.replace(r'\(|\)', '', regex=True)
In [28]: df
Out[28]:
A
0 1+2j
1 2-1j
In [29]: df.to_csv('test.csv')
In [30]: !cat test.csv
,A
0,1+2j
1,2-1j
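To read such a file back, a converter can rebuild the complex values (a sketch, assuming the file written above):
import pandas as pd

# complex() parses strings like '1+2j', so a converter restores the dtype.
df_back = pd.read_csv('test.csv', index_col=0, converters={'A': complex})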