I have a data frame with strings in Column_A: row1:Anna, row2:Mark, row3:Emy
I would like to get something like: row1: (Anna), row2: (Mark), row3: (Emy)
I have found some examples on how to remove the brackets, however have not found anything on how to add them.
Hence, any clue would be much appreciated.
Using apply from pandas, you can create a function which adds the brackets. In this case the function is a lambda function using an f-string.
df['Column_A'] = df['Column_A'].apply(lambda x: f'({x})')
# Example:
l = ['Anna', 'Mark', 'Emy']
df = pd.DataFrame(l, columns=['Column_A'])
Column_A
0 Anna
1 Mark
2 Emy
df['Column_A'] = df['Column_A'].apply(lambda x: f'({x})')
Column_A
0 (Anna)
1 (Mark)
2 (Emy)
Approach 1: Using direct Concatenation (Simpler)
df['Column_A'] = '(' + df['Column_A'] + ')'
Approach 2: Using apply() function and f-strings
df.Column_A.apply(lambda val: f'({val})')
Approach 3: Using map() function
df = pd.DataFrame(map(lambda val: f'({val})', df.Column_A), columns=['Column_A'])
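For a quick sanity check (a minimal sketch reusing the example frame from above), all three approaches produce the same strings:
df = pd.DataFrame({'Column_A': ['Anna', 'Mark', 'Emy']})
print('(' + df['Column_A'] + ')')
# 0    (Anna)
# 1    (Mark)
# 2    (Emy)
# Name: Column_A, dtype: object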
Related
How can I apply a merge function (or any other method) to column A?
For example, in layman's terms, I want to convert the string "(A|B|C,D)|(A,B|C|D)|(B|C|D)" into
"(D A|D B|D C)|(A B|A C|A D)|(B|C|D)"
The (B|C|D) group will remain the same because it has no comma to merge. Basically, I want to merge the comma-separated values into each of the other values.
I have the data frame below.
import pandas as pd
data = {'A': ['(A|B|C,D)|(A,B|C|D)|(B|C|D)'],
        'B(Expected)': ['(D A|D B|D C)|(A B|A C|A D)|(B|C|D)']}
df = pd.DataFrame(data)
print(df)
My expected result is shown in column B(Expected).
Below are the methods I tried:
(1)
df['B(Expected)'] = df['A'].apply(lambda x: x.replace("|", " ").replace(",", "|") if "|" in x and "," in x else x)
(2)
# Split the string by the pipe character
df['string'] = df['string'].str.split('|')
df['string'] = df['string'].apply(lambda x: '|'.join([' '.join(i.split(' ')) for i in x]))
You can use a regex to extract the values in parentheses, then a custom function with itertools.product to reorganize the values:
from itertools import product
def split(s):
    return '|'.join([' '.join(x) for x in product(*[x.split('|') for x in s.split(',')])])
df['B'] = df['A'].str.replace(r'([^()]+)', lambda m: split(m.group()), regex=True)
print(df)
Note that this requires non-nested parentheses.
Output:
A B
0 (A|B|C,D)|(A,B|C|D)|(B|C|D) (A D|B D|C D)|(A B|A C|A D)|(B|C|D)
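To see what the helper does in isolation, a quick check on the contents of a single group:
print(split('A|B|C,D'))
# A D|B D|C D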
For a particular column (dtype = object), how can I add '-' to the start of the string, given that it ends with '-'?
i.e. convert 'MAY500-' to '-MAY500-'
(I need to add this to every element in the column)
Try something like this:
#setup
df = pd.DataFrame({'col':['aaaa','bbbb-','cc-','dddddddd-']})
mask = df.col.str.endswith('-')
df.loc[mask, 'col'] = '-' + df.loc[mask, 'col']
Output
df
col
0 aaaa
1 -bbbb-
2 -cc-
3 -dddddddd-
You can use np.select
Given a dataframe like this:
df
values
0 abcd-
1 a-bcd
2 efg-
You can use np.select as follows:
df['values'] = np.select([df['values'].str.endswith('-')], ['-' + df['values']], df['values'])
output:
df
values
0 -abcd-
1 a-bcd
2 -efg-
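As a design note: np.select generalizes to several condition/choice pairs; with a single condition, np.where is an equivalent, slightly simpler sketch (assuming the same df):
df['values'] = np.where(df['values'].str.endswith('-'), '-' + df['values'], df['values'])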
def add_prefix(text):
    # If text is null or an empty string, the -1 index would raise an IndexError
    if text and text[-1] == "-":
        return "-" + text
    return text
df = pd.DataFrame(data={'A': ["MAY500", "MAY500-", "", None, np.nan]})
# Change the column to string dtype first (note: this turns None and NaN
# into the literal strings 'None' and 'nan', as seen in the output below)
df['A'] = df['A'].astype(str)
df['A'] = df['A'].apply(add_prefix)
0 MAY500
1 -MAY500-
2
3 None
4 nan
Name: A, dtype: object
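As an aside, a NaN-safe sketch that skips the astype(str) conversion entirely (assuming the original df, before the cast), using the .str accessor's na= handling so missing values stay missing:
mask = df['A'].str.endswith('-', na=False)  # None/NaN count as False
df.loc[mask, 'A'] = '-' + df.loc[mask, 'A']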
I tend to use apply with lambda functions a lot; it makes the code much easier to read.
df['value'] = df['value'].apply(lambda x: '-'+str(x) if str(x).endswith('-') else x)
I have a pandas DataFrame:
star = pd.DataFrame({'Country':['Canada','USA', 'Mexico'],'Rating':[1,2,3], 'Score':[70,80,90]})
I want to give Canada a Rating of 3, and this code works:
star.loc[star['Country'] == 'Canada', 'Rating'] = 3
But I want to do it with a lambda function:
star.Rating.map(lambda x: 3 if star.Country == 'Canada')
Gives a syntax error
File "<ipython-input-41-544a311d7f86>", line 1
star.Rating.map(lambda x: 3 if star.Country == 'Canada')
^
SyntaxError: invalid syntax
I want help with the lambda function.
That is indeed a syntax error. You should do:
star['Rating'] = star.apply(lambda x: 3 if x.Country == 'Canada' else x.Rating, axis=1)
However, your original solution is much better.
I suggest you avoid apply (or map) for these kinds of problems; np.where is faster and easier to implement:
star["Rating"] = np.where(star.Country=="Canada", 3, star.Rating)
Sometimes the string numbers in my DataFrames have commas in them, representing either the decimal separator or the thousands separator; some do not. The dataframe below is an example of the range of price formats I receive via an API; they vary depending on the currency. These are prices and the decimals will always be two digits. I need to convert the string prices to floats so I can sum them, separate them into other dataframes, or use them for plotting graphs. I have created a loop to convert them, but is there a quicker way to do this without the loop?
My DataFrame and working loop are as follows:
data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)
fees = []
sales = []
for items in df['amount']:
    if items[-7:-6] == ',':
        items = float(items.replace(',', '').replace(' ', ''))
    elif items[-3:-2] == ',':  # elif, so we don't try to index the float created above
        items = float(items.replace(',', '.').replace(' ', ''))
    items = float(items)
    if items <= 0:
        fees.append(items)
    else:
        sales.append(items)
I have attempted to do this without the loop but can't seem to work out where I have gone wrong.
df["amount"] = np.where((df['amount'][-7:-6] == ','),
df["amount"][-7:-6].str.replace(',', '').replace(' ',''),
df["amount"])
df["amount"] = np.where((df['amount'][-3:-2] == ','),
df["amount"][-3:-2].str.replace(',', '').replace(' ',''),
df["amount"])
Any help would be much appreciated. Thank you in advance.
Since you mention the last two digits are always decimals, the ',' needs to be replaced with '.' to make the value a float. But you also have values like 1,000.00 that would be corrupted if every ',' were replaced with '.', hence you can use a regex to identify which values to replace:
data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)
df
First the regex matches strings that end with ',' followed by two decimal digits; then replace substitutes the match with a '.' plus the captured digits (the 99 from ,99):
df['amount'] = df['amount'].str.replace(r'(,)(\d{2}$)', r'.\2', regex=True)
# here r'.\2' is the '.' plus the second captured group of the regex
Then, to make 1,000.00 convertible to float, we replace the remaining ',' with an empty string:
df['amount'] = df['amount'].str.replace(',','')
And then convert the data type to float
df['amount'] = df['amount'].astype(float)
print(df)
amount
0 7.99
1 6.99
2 9.99
3 -6.99
4 1000.00
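For convenience, the three steps can also be chained into a single expression (same logic, just composed):
df['amount'] = (df['amount']
                .str.replace(r'(,)(\d{2}$)', r'.\2', regex=True)
                .str.replace(',', '', regex=False)
                .astype(float))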
You can use lambdas instead of numpy:
lambda1 = lambda items: float(str(items).replace(',', '').replace(' ','')) if str(items)[-7:-6] == ',' else items
lambda2 = lambda items: float(str(items).replace(',', '.').replace(' ','')) if str(items)[-3:-2] == ',' else items
to_float = lambda items: float(items)
df['amount_clean'] = df["amount"].map(lambda1).map(lambda2).map(to_float)
Edit: what are lambdas
In python, lambda functions are small anonymous functions with a single expression (see https://www.w3schools.com/python/python_lambda.asp)
Example with condition:
lambda x: x + 1 if x < 0 else x
This is equivalent to:
def my_lambda_function(x):
    if x < 0:
        return x + 1
    else:
        return x
When passed to the column of a pandas dataframe via the map function, the lambda expression will be applied to the value in each row of the column.
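For example, a minimal sketch on a small Series:
s = pd.Series([-2, 5])
print(s.map(lambda x: x + 1 if x < 0 else x))
# 0   -1
# 1    5
# dtype: int64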
Hope this helps!
Try using split and join:
df.amount.str.split(',').str.join('').astype(float)
Output
0 799.00
1 699.00
2 9.99
3 -699.00
4 1000.00
Name: amount, dtype: float64
I have a column Column1 in a pandas dataframe whose values are strings of the following form:
import pandas as pd
df = pd.read_table("filename.dat")
type(df["Column1"].ix[0]) #outputs 'str'
print(df["Column1"].ix[0])
which outputs '1/350'. So, this is currently a string. I would like to convert it into a float.
I tried this:
df["Column1"] = df["Column1"].astype('float64', raise_on_error = False)
But this didn't change the values into floats.
This also failed:
df["Column1"] = df["Column1"].convert_objects(convert_numeric=True)
And this failed:
df["Column1"] = df["Column1"].apply(pd.to_numeric, args=('coerce',))
How do I convert all the values of column "Column1" into floats? Could I somehow use a regex to parse out the numbers?
EDIT:
The line
df["Meth"] = df["Meth"].apply(eval)
works, but only if I use it twice, i.e.
df["Meth"] = df["Meth"].apply(eval)
df["Meth"] = df["Meth"].apply(eval)
Why would this be?
You need to evaluate the expression (e.g. '1/350') in order to get the result, for which you can use Python's eval() function.
By wrapping pandas' apply() function around it, you can then execute the eval() function on every value in your column. Example:
df["Column1"].apply(eval)
As you're interpreting literals, you can also use the ast.literal_eval function as noted in the docs. Update: This won't work, as the use of literal_eval() is still restricted to additions and subtractions (source).
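For illustration, a minimal check of that limitation (literal_eval rejects the division):
from ast import literal_eval

try:
    literal_eval('1/350')
except ValueError as e:
    print(e)  # malformed node or string (exact message varies by Python version)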
Remark: as mentioned in other answers and comments on this question, the use of eval() is not without risks, as you're basically executing whatever input is passed in. In other words, if your input contains malicious code, you're giving it a free pass.
Alternative option:
# Define a custom div function
def div(a, b):
    return int(a) / int(b)
# Split each string and pass the values to div
df_floats = df['col1'].apply(lambda x: div(*x.split('/')))
Second alternative in case of unclean data:
By using regular expressions, we can strip any non-digits appearing before the numerator and after the denominator.
# Define a custom div function (unchanged)
def div(a, b):
    return int(a) / int(b)
# We'll import the re module and define a precompiled pattern
import re
regex = re.compile(r'\D*(\d+)/(\d+)\D*')
df_floats = df['col1'].apply(lambda x: div(*regex.findall(x)[0]))
We'll lose a bit of performance, but the upside is that even with input like '!erefdfs?^dfsdf1/350dqsd qsd qs d', we still end up with the value of 1/350.
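To illustrate, a quick check of the precompiled pattern on that noisy input:
print(regex.findall('!erefdfs?^dfsdf1/350dqsd qsd qs d'))
# [('1', '350')]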
Performance:
When timing the options on a dataframe with 100,000 rows, the second option (using the user-defined div function) clearly wins:
using eval: 1 loop, best of 3: 1.41 s per loop
using div: 10 loops, best of 3: 159 ms per loop
using re: 1 loop, best of 3: 275 ms per loop
I hate to see eval advocated. I didn't want to spend time on this answer, but I was compelled because I don't want you to use eval.
So I wrote this function that works on a pd.Series
def do_math_in_string(s):
    # Map operator symbols to float dunder methods (Python 3: '/' is __truediv__, not __div__)
    op_map = {'/': '__truediv__', '*': '__mul__', '+': '__add__', '-': '__sub__'}
    df = s.str.extract(r'(\d+)(\D+)(\d+)', expand=True)
    df = df.stack().str.strip().unstack()
    df.iloc[:, 0] = pd.to_numeric(df.iloc[:, 0]).astype(float)
    df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2]).astype(float)

    def do_op(x):
        return getattr(x[0], op_map[x[1]])(x[2])

    return df.T.apply(do_op)
Demonstration
s = pd.Series(['1/2', '3/4', '4/5'])
do_math_in_string(s)
0 0.50
1 0.75
2 0.80
dtype: float64
do_math_in_string(pd.Series(['1/2', '3/4', '4/5', '6+5', '11-7', '9*10']))
0 0.50
1 0.75
2 0.80
3 11.00
4 4.00
5 90.00
dtype: float64
Please don't use eval.
You can do it by applying eval to the column:
data = {'one':['1/20', '2/30']}
df = pd.DataFrame(data)
In [8]: df['one'].apply(eval)
Out[8]:
0 0.050000
1 0.066667
Name: one, dtype: float64