Say I start with a Series of unformatted phone numbers (as strings), and I would like to format them as (XXX) YYY-ZZZZ.
I can get the sub-components of my input using regular expressions with str.match or str.extract, and I can perform the formatting on the result of either:
import pandas as pd
ser = pd.Series(data=['1234567890', '2345678901', '3456789012'])
# In older pandas, str.match with capture groups returned the groups themselves;
# newer versions return booleans, so str.extract is the modern equivalent.
matched = ser.str.match(r'(\d{3})(\d{3})(\d{4})')
extracted = ser.astype(str).str.extract(r'(?P<first>\d{3})(?P<second>\d{3})(?P<third>\d{4})')
formatmatched = matched.apply(lambda x: '({0}) {1}-{2}'.format(*x))
print('formatmatched')
print(formatmatched)
formatextracted = extracted.apply(lambda x: '({first}) {second}-{third}'.format(**x.to_dict()), axis=1)
print('formatextracted')
print(formatextracted)
Results:
formatmatched
0 (123) 456-7890
1 (234) 567-8901
2 (345) 678-9012
dtype: object
formatextracted
0 (123) 456-7890
1 (234) 567-8901
2 (345) 678-9012
dtype: object
Is there a vectorized way to apply that formatting command in either context?
You can do this directly with Series.str.replace():
In [47]: s = pandas.Series(["1234567890", "5552348866", "13434"])
In [49]: s
Out[49]:
0 1234567890
1 5552348866
2 13434
dtype: object
In [50]: s.str.replace(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", regex=True)
Out[50]:
0 (123) 456-7890
1 (555) 234-8866
2 13434
dtype: object
You could also imagine doing another transformation first to remove any non-digit characters.
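For example, a minimal sketch (dirty is a hypothetical series containing dashes and dots):
dirty = pandas.Series(["123-456-7890", "555.234.8866"])
cleaned = dirty.str.replace(r"\D", "", regex=True)   # drop every non-digit character
cleaned.str.replace(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", regex=True)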
Why don't you try this:
import pandas as pd
ser = pd.Series(data=['1234567890', '2345678901', '3456789012'])
def f(val):
    return '({0}) {1}-{2}'.format(val[:3], val[3:6], val[6:])
print(ser.apply(f))
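A fully vectorized variant of the same slicing idea, without apply (a sketch that assumes every value is exactly ten digits):
formatted = '(' + ser.str[:3] + ') ' + ser.str[3:6] + '-' + ser.str[6:]
print(formatted)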
Related
One column of my dataset looks like this:
0 10,000+
1 500,000+
2 5,000,000+
3 50,000,000+
4 100,000+
Name: Installs, dtype: object
and I want to change these 'xxx,yyy,zzz+' strings to integers.
first I tried this function:
df['Installs'] = pd.to_numeric(df['Installs'])
and I got this error:
ValueError: Unable to parse string "10,000" at position 0
and then I tried to remove '+' and ',' with this method:
df['Installs'] = df['Installs'].str.replace('+','',regex = True)
df['Installs'] = df['Installs'].str.replace(',','',regex = True)
but nothing changed!
How can I convert these strings to integers?
With regex=True, the + (plus) character is interpreted specially, as a regex quantifier. You can either disable regular-expression replacement (regex=False), or better, change your pattern to match + or , and remove them both at once:
df['Installs'] = df['Installs'].str.replace('[+,]', '', regex=True).astype(int)
Output:
>>> df['Installs']
0 10000
1 500000
2 5000000
3 50000000
4 100000
Name: Installs, dtype: int64
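For reference, the regex=False variant mentioned above would look roughly like this (a sketch, chaining the two literal replacements):
df['Installs'] = (df['Installs'].str.replace('+', '', regex=False)
                                .str.replace(',', '', regex=False)
                                .astype(int))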
A bare + is not a valid regex (it is a quantifier with nothing to repeat), so strip every non-digit character instead:
df['Installs'] = pd.to_numeric(df['Installs'].str.replace(r'\D', '', regex=True))
I have the following dataframe:
contract
0 Future(conId=482048803, symbol='ESTX50', lastT...
1 Future(conId=497000453, symbol='XT', lastTrade...
2 Stock(conId=321100413, symbol='SXRS', exchange...
3 Stock(conId=473087271, symbol='ETHEEUR', excha...
4 Stock(conId=80268543, symbol='IJPA', exchange=...
5 Stock(conId=153454120, symbol='EMIM', exchange...
6 Stock(conId=75776072, symbol='SXR8', exchange=...
7 Stock(conId=257200855, symbol='EGLN', exchange...
8 Stock(conId=464974581, symbol='VBTC', exchange...
9 Future(conId=478135706, symbol='ZN', lastTrade...
I want to create a new "symbol" column with all symbols (ESTX50, XT, SXRS...).
In order to extract the substring between "symbol='" and the following single quote, I tried the following:
df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
but I get a column of NaN.
What am I doing wrong? Thanks
It looks like that is a column of objects, not strings:
import pandas as pd
class Future:
    def __init__(self, symbol):
        self.symbol = symbol
    def __repr__(self):
        return f"Future(symbol='{self.symbol}')"
df = pd.DataFrame({'contract': [Future(symbol='ESTX50'), Future(symbol='XT')]})
df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
print(df)
df:
contract symbol
0 Future(symbol='ESTX50') NaN
1 Future(symbol='XT') NaN
Notice that pandas stores strings with object dtype, so the .str accessor still attempts the operation on this column. However, it cannot extract anything because the values are not strings.
We can either convert to string first with astype:
df['symbol'] = df.contract.astype(str).str.extract(r"symbol='(.*?)'")
df:
contract symbol
0 Future(symbol='ESTX50') ESTX50
1 Future(symbol='XT') XT
However, a faster approach is to read the attribute off the objects directly:
df['symbol'] = [getattr(x, 'symbol', None) for x in df.contract]
Or with apply (which can be slower than the comprehension)
df['symbol'] = df.contract.apply(lambda x: getattr(x, 'symbol', None))
Both produce:
contract symbol
0 Future(symbol='ESTX50') ESTX50
1 Future(symbol='XT') XT
I am fairly new to Pandas and I am working on project where I have a column that looks like the following:
AverageTotalPayments
$7064.38
$7455.75
$6921.90
ETC
I am trying to filter on the cost, where the cost could be anything above 7000. This column is an object, so I know I probably cannot compare it to a number directly. The code I have looks like the following:
import pandas as pd
health_data = pd.read_csv("inpatientCharges.csv")
state = input("What is your state: ")
issue = input("What is your issue: ")
#This line of code will create a new dataframe based on the two letter state code
state_data = health_data[(health_data.ProviderState == state)]
#With the new data set I search it for the injury the person has.
issue_data=state_data[state_data.DRGDefinition.str.contains(issue.upper())]
#I then make it replace the $ sign with a '' so I have a number. I also believe at this point my code may be starting to break down.
issue_data = issue_data['AverageTotalPayments'].str.replace('$', '')
#Since the previous line took out the $ I convert it from an object to a float
issue_data = issue_data[['AverageTotalPayments']].astype(float)
#I attempt to print out the values.
cost = issue_data[(issue_data.AverageTotalPayments >= 10000)]
print(cost)
When I run this code I simply get NaN back, which is not what I want. Any help with what is wrong would be great! Thank you in advance.
Try this:
In [83]: df
Out[83]:
AverageTotalPayments
0 $7064.38
1 $7455.75
2 $6921.90
3 aaa
In [84]: df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000
Out[84]:
0 True
1 True
2 False
3 False
Name: AverageTotalPayments, dtype: bool
In [85]: df[df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000]
Out[85]:
AverageTotalPayments
0 $7064.38
1 $7455.75
Consider the pd.Series s
s
0 $7064.38
1 $7455.75
2 $6921.90
Name: AverageTotalPayments, dtype: object
This gets the float values
pd.to_numeric(s.str.replace('$', '', regex=False), errors='ignore')
0 7064.38
1 7455.75
2 6921.90
Name: AverageTotalPayments, dtype: float64
Filter s
s[pd.to_numeric(s.str.replace('$', '', regex=False), errors='ignore') > 7000]
0 $7064.38
1 $7455.75
Name: AverageTotalPayments, dtype: object
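To filter the whole dataframe rather than just the one column, the same numeric mask can be applied to the frame (a sketch, assuming df is the frame s was taken from; errors='coerce' makes non-numeric entries fail the comparison):
mask = pd.to_numeric(df['AverageTotalPayments'].str.replace('$', '', regex=False), errors='coerce') > 7000
df[mask]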
I have the following data frame (consisting of both negative and positive numbers):
df.head()
Out[39]:
Prices
0 -445.0
1 -2058.0
2 -954.0
3 -520.0
4 -730.0
I am trying to change the 'Prices' column to display as currency when I export it to an Excel spreadsheet. The following command I use works well:
df['Prices'] = df['Prices'].map("${:,.0f}".format)
df.head()
Out[42]:
Prices
0 $-445
1 $-2,058
2 $-954
3 $-520
4 $-730
Now my question is: what would I do if I wanted the negative sign BEFORE the dollar sign? In the output above, the dollar signs come before the negative signs. I am looking for something like this:
-$445
-$2,058
-$954
-$520
-$730
Please note there are also positive numbers as well.
You can use np.where to test whether the values are negative and, if so, prepend the negative sign in front of the dollar sign, casting the series to string with astype:
In [153]:
df['Prices'] = np.where( df['Prices'] < 0, '-$' + df['Prices'].astype(str).str[1:], '$' + df['Prices'].astype(str))
df['Prices']
Out[153]:
0 -$445.0
1 -$2058.0
2 -$954.0
3 -$520.0
4 -$730.0
Name: Prices, dtype: object
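If the comma grouping and whole-dollar rounding from the desired output are wanted as well, the sign can be handled inside the format call instead (a sketch over the original numeric column, before any string conversion):
df['Prices'] = df['Prices'].map(lambda x: ('-' if x < 0 else '') + '${:,.0f}'.format(abs(x)))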
You can use the locale module and the _override_localeconv dict. It's not well documented, but it's a trick I found in another answer that has helped me before.
import pandas as pd
import locale
locale.setlocale( locale.LC_ALL, 'English_United States.1252')
# Made an assumption with that locale. Adjust as appropriate.
locale._override_localeconv = {'n_sign_posn':1}
# Load dataframe into df
df['Prices'] = df['Prices'].map(locale.currency)
This creates a dataframe that looks like this:
Prices
0 -$445.00
1 -$2058.00
2 -$954.00
3 -$520.00
4 -$730.00
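If the thousands separators shown in the desired output are needed too, locale.currency accepts a grouping flag (a sketch):
df['Prices'] = df['Prices'].map(lambda v: locale.currency(v, grouping=True))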
I have a column named "KL" containing values like, for example:
sem_0405M4209F2057_1.000
sem_A_0103M5836F4798_1.000
Now I want to extract the four digits after "M" and the four digits after "F". But with df["KL"].str.extract I can't get it to work.
The locations of M and F vary, so just slicing with [9:13] won't work for the whole column.
If you want to use str.extract, here's how:
>>> df['KL'].str.extract(r'M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})')
M F
0 4209 2057
1 5836 4798
Here, M(?P<M>[0-9]{4}) matches the character 'M' and then captures 4 digits following it (the [0-9]{4} part). This is put in the column M (specified with ?P<M> inside the capturing group). The same thing is done for F.
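To attach those captures to the original frame as integer columns, the extract result can be joined back (a minimal sketch):
df = df.join(df['KL'].str.extract(r'M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})').astype(int))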
You could use split to achieve this, though a better way probably exists:
In [147]:
s = pd.Series(['sem_0405M4209F2057_1.000','sem_A_0103M5836F4798_1.000'])
s
Out[147]:
0 sem_0405M4209F2057_1.000
1 sem_A_0103M5836F4798_1.000
dtype: object
In [153]:
m = s.str.split('M').str[1].str.split('F').str[0].str[:4]
f = s.str.split('M').str[1].str.split('F').str[1].str[:4]
print(m)
print(f)
0 4209
1 5836
dtype: object
0 2057
1 4798
dtype: object
You can also use regex:
import re
import pandas as pd

def get_data(x):
    data = re.search(r'M(\d{4})F(\d{4})', x)
    if data:
        m = data.group(1)
        f = data.group(2)
        return m, f
df = pd.DataFrame(data={'a': ['sem_0405M4209F2057_1.000', 'sem_0405M4239F2027_1.000']})
df['data'] = df['a'].apply(lambda x: get_data(x))
Output:
a data
0 sem_0405M4209F2057_1.000 (4209, 2057)
1 sem_0405M4239F2027_1.000 (4239, 2027)
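If separate columns are preferred over the tuple, the 'data' column can be expanded afterwards (a sketch; the column names M and F are placeholders, and it assumes every row matched):
df[['M', 'F']] = pd.DataFrame(df['data'].tolist(), index=df.index)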