I have a column Column1 in a pandas dataframe which is of type str, values which are in the following form:
import pandas as pd
df = pd.read_table("filename.dat")
type(df["Column1"].iloc[0]) #outputs 'str'
print(df["Column1"].iloc[0])
which outputs '1/350'. So, this is currently a string. I would like to convert it into a float.
I tried this:
df["Column1"] = df["Column1"].astype('float64', raise_on_error = False)
But this didn't change the values into floats.
This also failed:
df["Column1"] = df["Column1"].convert_objects(convert_numeric=True)
And this failed:
df["Column1"] = df["Column1"].apply(pd.to_numeric, args=('coerce',))
How do I convert all the values of column "Column1" into floats? Could I somehow use regex to remove the parentheses?
EDIT:
The line
df["Meth"] = df["Meth"].apply(eval)
works, but only if I use it twice, i.e.
df["Meth"] = df["Meth"].apply(eval)
df["Meth"] = df["Meth"].apply(eval)
Why would this be?
You need to evaluate the expression (e.g. '1/350') in order to get the result, for which you can use Python's eval() function.
By wrapping Panda's apply() function around it, you can then execute the eval() function on every value in your column. Example:
df["Column1"].apply(eval)
As you're interpreting literals, you can also use the ast.literal_eval function as noted in the docs. Update: This won't work, as the use of literal_eval() is still restricted to additions and subtractions (source).
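A minimal sketch illustrating that restriction (literal_eval accepts literals plus limited + and - used for complex numbers, but rejects a division expression):
import ast
ast.literal_eval('1+350')   # works: 351, addition of numeric literals is allowed
ast.literal_eval('1/350')   # raises ValueError: malformed node or string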
Remark: as mentioned in other answers and comments on this question, the use of eval() is not without risks, as you're basically executing whatever input is passed in. In other words, if your input contains malicious code, you're giving it a free pass.
Alternative option:
# Define a custom div function
def div(a, b):
    return int(a) / int(b)

# Split each string and pass the values to div
df_floats = df['col1'].apply(lambda x: div(*x.split('/')))
Second alternative in case of unclean data:
By using regular expressions, we can strip any non-digit characters appearing before the numerator and after the denominator.
# Define a custom div function (unchanged)
def div(a, b):
    return int(a) / int(b)

# We'll import the re module and define a precompiled pattern
import re
regex = re.compile(r'\D*(\d+)/(\d+)\D*')

df_floats = df['col1'].apply(lambda x: div(*regex.findall(x)[0]))
We'll lose a bit of performance, but the upside is that even with input like '!erefdfs?^dfsdf1/350dqsd qsd qs d', we still end up with the value of 1/350.
Performance:
When timing these options on a dataframe with 100,000 rows, the second option (using the user-defined div function) clearly wins:
using eval: 1 loop, best of 3: 1.41 s per loop
using div: 10 loops, best of 3: 159 ms per loop
using re: 1 loop, best of 3: 275 ms per loop
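For reference, a rough sketch of how such a timing comparison could be set up (the 100,000-row frame below is an illustrative assumption, not the original benchmark data):
import re
import pandas as pd

# Hypothetical test data: 100,000 '1/350'-style strings
df = pd.DataFrame({'col1': ['1/350'] * 100_000})

def div(a, b):
    return int(a) / int(b)

regex = re.compile(r'\D*(\d+)/(\d+)\D*')

# In IPython/Jupyter, each variant can then be timed with %timeit:
# %timeit df['col1'].apply(eval)
# %timeit df['col1'].apply(lambda x: div(*x.split('/')))
# %timeit df['col1'].apply(lambda x: div(*regex.findall(x)[0]))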
I hate advocating for the use of eval. I didn't want to spend time on this answer but I was compelled because I don't want you to use eval.
So I wrote this function that works on a pd.Series
def do_math_in_string(s):
    op_map = {'/': '__truediv__', '*': '__mul__', '+': '__add__', '-': '__sub__'}
    df = s.str.extract(r'(\d+)(\D+)(\d+)', expand=True)
    df = df.stack().str.strip().unstack()
    df.iloc[:, 0] = pd.to_numeric(df.iloc[:, 0]).astype(float)
    df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2]).astype(float)

    def do_op(x):
        return getattr(x[0], op_map[x[1]])(x[2])

    return df.T.apply(do_op)
Demonstration
s = pd.Series(['1/2', '3/4', '4/5'])
do_math_in_string(s)
0 0.50
1 0.75
2 0.80
dtype: float64
do_math_in_string(pd.Series(['1/2', '3/4', '4/5', '6+5', '11-7', '9*10']))
0 0.50
1 0.75
2 0.80
3 11.00
4 4.00
5 90.00
dtype: float64
Please don't use eval.
You can do it by applying eval to the column:
data = {'one':['1/20', '2/30']}
df = pd.DataFrame(data)
In [8]: df['one'].apply(eval)
Out[8]:
0 0.050000
1 0.066667
Name: one, dtype: float64
I have a data frame with strings in Column_A: row1:Anna, row2:Mark, row3:Emy
I would like to get something like: row1:(Anna), row2:(Mark), row3:(Emy)
I have found some examples on how to remove the brackets; however, I have not found anything on how to add them.
Hence, any clue would be much appreciated.
Using apply from pandas you can create a function which adds the brackets. In this case the function is a lambda function using an f-string.
df['Column_A'] = df['Column_A'].apply(lambda x: f'({x})')
# Example:
l = ['Anna', 'Mark', 'Emy']
df = pd.DataFrame(l, columns=['Column_A'])
Column_A
0 Anna
1 Mark
2 Emy
df['Column_A'] = df['Column_A'].apply(lambda x: f'({x})')
Column_A
0 (Anna)
1 (Mark)
2 (Emy)
Approach 1: Using direct Concatenation (Simpler)
df['Column_A'] = '(' + df['Column_A'] + ')'
Approach 2: Using apply() function and f-strings
df.Column_A.apply(lambda val: f'({val})')
Approach 3: Using map() function
df = pd.DataFrame(map(lambda val: f'({val})', df.Column_A))
I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
a b c d
0 app; 1 2 3
1 app; web; 4 5 6
2 web; 7 8 9
3 1 4 5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to determine the same something like:
df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'
My end goal is not to do an absolute match (df.a.isin(["app", "web"])) but rather a 'contains' logic that returns True even if those characters are merely present somewhere in that cell of the data frame.
Note: I can of course use apply method to create my own function for the same logic such as:
elementsToLookFor = ["app","web"]
df[header] = df['a'].apply(lambda element: all(a in element for a in elementsToLookFor))
But I am more interested in the optimal algorithm for this and so prefer to use a native pandas function within pandas, or else the next most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
also this should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
So many solutions; which one is the most efficient?
The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
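The definitions above are not timed in place; a rough sketch of how one might compare them (the repeated test frame and the timing loop are illustrative assumptions):
import timeit
import pandas as pd

# Hypothetical test frame built by repeating the sample rows
df = pd.concat([pd.DataFrame({'a': ['app;', 'app; web;', 'web;', '']})] * 25_000,
               ignore_index=True)

for func in (replace_dummies_all, findall_map, lower_contains,
             contains_concat_all, contains):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f'{func.__name__}: {seconds:.3f} s per call')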
Try with str.get_dummies
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
Update
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2 (to make the match case-insensitive and to ensure the column dtype is str, without which the logic may fail):
elementList = ['app', 'web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower()  # ensure case-insensitive matching and a string dtype
result = df[header].str.contains(valueString)
I've seen this and this on formatting floating-point numbers for display in pandas, but I'm interested in doing the same thing for integers.
Right now, I have:
pd.options.display.float_format = '{:,.2f}'.format
That works on the floats in my data, but will either leave annoying trailing zeroes on integers that are cast to floats, or I'll have plain integers that don't get formatted with commas.
The pandas docs mention a SeriesFormatter class about which I haven't been able to find any information.
Alternatively, if there's a way to write a single string formatter that will format floats as '{:,.2f}' and floats with no fractional part as '{:,d}', that'd work too.
You could monkey-patch pandas.io.formats.format.IntArrayFormatter:
import contextlib
import numpy as np
import pandas as pd
import pandas.io.formats.format as pf

np.random.seed(2015)

@contextlib.contextmanager
def custom_formatting():
    orig_float_format = pd.options.display.float_format
    orig_int_format = pf.IntArrayFormatter

    pd.options.display.float_format = '{:0,.2f}'.format

    class IntArrayFormatter(pf.GenericArrayFormatter):
        def _format_strings(self):
            formatter = self.formatter or '{:,d}'.format
            fmt_values = [formatter(x) for x in self.values]
            return fmt_values

    pf.IntArrayFormatter = IntArrayFormatter
    yield
    pd.options.display.float_format = orig_float_format
    pf.IntArrayFormatter = orig_int_format

df = pd.DataFrame(np.random.randint(10000, size=(5, 3)), columns=list('ABC'))
df['D'] = np.random.random(df.shape[0]) * 10000

with custom_formatting():
    print(df)
yields
A B C D
0 2,658 2,828 4,540 8,961.77
1 9,506 2,734 9,805 2,221.86
2 3,765 4,152 4,583 2,011.82
3 5,244 5,395 7,485 8,656.08
4 9,107 6,033 5,998 2,942.53
while outside of the with-statement:
print(df)
yields
A B C D
0 2658 2828 4540 8961.765260
1 9506 2734 9805 2221.864779
2 3765 4152 4583 2011.823701
3 5244 5395 7485 8656.075610
4 9107 6033 5998 2942.530551
Another option for Jupyter notebooks is to use df.style.format('{:,}'), but it only works on a single dataframe as far as I know, so you would have to call this every time:
table.style.format('{:,}')
col1 col2
0s 9,246,452 6,669,310
>0 2,513,002 5,090,144
table
col1 col2
0s 9246452 6669310
>0 2513002 5090144
Styling — pandas 1.1.2 documentation
Starting with Pandas 1.3.0, you can specify df.style.format(thousands=',') to use commas to separate thousands in floats, complex numbers, and integers.
See: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.format.html.
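For example, a minimal usage sketch (illustrative data; requires pandas 1.3.0 or later):
import pandas as pd

df = pd.DataFrame({'ints': [9246452, 6669310], 'floats': [8961.7653, 2221.8648]})

# Comma thousands separators for both integer and float columns;
# precision=2 additionally rounds the float display to two decimals
df.style.format(thousands=',', precision=2)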
Although it has been years since the question was asked: in my case, even though I set the format at the beginning, the format changed after using add. We can use astype to convert the column back.
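A minimal sketch of that idea (illustrative values, assuming a float column that should really hold integers):
import pandas as pd

pd.options.display.float_format = '{:,.2f}'.format

df = pd.DataFrame({'total': [8961.0, 2221.0]})   # displayed as 8,961.00 / 2,221.00
df['total'] = df['total'].astype('int64')        # displayed as plain integers again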
How can I copy a DataFrame to_clipboard and paste it in excel with commas as decimal?
In R this is simple.
write.table(obj, 'clipboard', dec = ',')
But I cannot figure out in pandas to_clipboard.
I unsuccessfully tried changing:
import locale
locale.setlocale(locale.LC_ALL, '')
# 'Spanish_Argentina.1252'
or
df.to_clipboard(float_format = '%,%')
Since Pandas 0.16 you can use
df.to_clipboard(decimal=',')
to_clipboard() passes extra kwargs to to_csv(), which has other useful options.
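For instance, a small sketch combining decimal with a couple of those to_csv options (which extra options make sense depends on how you want to paste into Excel):
import pandas as pd

df = pd.DataFrame({'A': [1.5, 2.25], 'B': [3.75, 4.5]})

# decimal=',' writes comma decimals; sep=';' and index=False are common
# companions when pasting into an Excel set to a European locale
df.to_clipboard(decimal=',', sep=';', index=False)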
There are a few different ways to achieve this. First, it is possible with float_format and your locale, although the usage is not entirely obvious (but simple once you know it: the float_format argument should be a callable):
df.to_clipboard(float_format='{:n}'.format)
A small illustration:
In [97]: df = pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
In [98]: df
Out[98]:
A B
0 1.125438 -1.015477
1 0.900816 1.283971
2 0.874250 1.058217
3 -0.013020 0.758841
4 -0.030534 -0.395631
In [99]: df.to_clipboard(float_format='{:n}'.format)
gives:
A B
0 1,12544 -1,01548
1 0,900816 1,28397
2 0,87425 1,05822
3 -0,0130202 0,758841
4 -0,0305337 -0,395631
If you don't want to rely on the locale setting but still have comma decimal output, you can do this:
class CommaFloatFormatter:
    def __mod__(self, x):
        return str(x).replace('.', ',')

df.to_clipboard(float_format=CommaFloatFormatter())
or simply do the conversion before writing the data to clipboard:
df.applymap(lambda x: str(x).replace('.',',')).to_clipboard()
I am trying to import a csv and deal with faulty values, e.g. a wrong decimal separator or strings in int/double columns. I use converters to do the error fixing. In the case of strings in number columns, the user sees an input box where he has to fix the value. Is it possible to get the column name and/or the row which is actually being 'imported'? If not, is there a better way to do the same?
example csv:
------------
description;elevation
point a;-10
point b;10,0
point c;35.5
point d;30x
from PyQt4 import QtGui
import numpy
from pandas import read_csv

def fixFloat(x):
    # return x as float if possible
    try:
        return float(x)
    except:
        # if not, test if there is a ',' inside, replace it with a '.' and return it as float
        try:
            return float(x.replace(",", "."))
        except:
            changedValue, ok = QtGui.QInputDialog.getText(
                None, 'Fehlerhafter Wert',
                'Bitte korrigieren sie den fehlerhaften Wert:', text=x)
            if ok:
                return fixFloat(changedValue)
            else:
                return -9999999999

def fixEmptyStrings(s):
    if s == '':
        return None
    else:
        return s

converters = {
    'description': fixEmptyStrings,
    'elevation': fixFloat
}

dtypes = {
    'description': object,
    'elevation': numpy.float64
}

csvData = read_csv('/tmp/csv.txt',
                   error_bad_lines=True,
                   dtype=dtypes,
                   converters=converters
                   )
If you want to iterate over them, the built-in csv.DictReader is pretty handy. I wrote up this function:
import csv
import pandas

def read_points(csv_file):
    point_names, elevations = [], []
    message = (
        "Found bad data for {0}'s row: {1}. Type new data to use "
        "for this value: "
    )
    with open(csv_file, 'r') as open_csv:
        r = csv.DictReader(open_csv, delimiter=";")
        for row in r:
            tmp_point = row.get("description", "some_default_name")
            tmp_elevation = row.get("elevation", "some_default_elevation")
            point_names.append(tmp_point)
            try:
                tmp_elevation = float(tmp_elevation.replace(',', '.'))
            except:
                while True:
                    user_val = raw_input(message.format(tmp_point,
                                                        tmp_elevation))
                    try:
                        tmp_elevation = float(user_val)
                        break
                    except:
                        tmp_elevation = user_val
            elevations.append(tmp_elevation)
    return pandas.DataFrame({"Point": point_names, "Elevation": elevations})
And for the four-line test file, it gives me the following:
In [41]: read_points("/home/ely/tmp.txt")
Found bad data for point d's row: 30x. Type new data to use for this value: 30
Out[41]:
Elevation Point
0 -10.0 point a
1 10.0 point b
2 35.5 point c
3 30.0 point d
[4 rows x 2 columns]
Displaying a whole Qt dialog box seems way overkill for this task. Why not just a command prompt? You can also add more conversion functions and change some things, like the delimiter, to be keyword arguments if you want it to be more customizable.
One question is how much data there is to iterate through. If it's a lot of data, this will be time-consuming and tedious. In that case, you may just want to discard observations like the '30x', or write their point ID names to some other file so you can go back and deal with them all in one swoop inside something like Emacs or Vim, where manipulating a big swath of text at once will be easier.
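If you go the "deal with them later" route, a minimal sketch of that idea (the output file name and column handling are illustrative assumptions):
import csv

def split_good_bad(csv_file, bad_rows_file='bad_rows.txt'):
    # Collect parseable rows; log unparseable elevations for later editing
    good = []
    with open(csv_file) as src, open(bad_rows_file, 'w') as bad:
        for row in csv.DictReader(src, delimiter=';'):
            raw = row.get('elevation', '')
            try:
                row['elevation'] = float(raw.replace(',', '.'))
                good.append(row)
            except ValueError:
                bad.write('{};{}\n'.format(row.get('description', ''), raw))
    return good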
I would take a different approach here.
Rather than at read_csv time, I would read the csv naively and then fix / convert to float:
In [11]: df = pd.read_csv(csv_file, sep=';')
In [12]: df['elevation']
Out[12]:
0 -10
1 10,0
2 35.5
3 30x
Name: elevation, dtype: object
Now just iterate through this column:
In [13]: df['elevation'] = df['elevation'].apply(fixFloat)
This is going to make it much easier to reason about the code (which columns you're applying functions to, how to access other columns etc. etc.).