I am fairly new to Pandas and I am working on a project where I have a column that looks like the following:
AverageTotalPayments
$7064.38
$7455.75
$6921.90
ETC
I am trying to get the cost factor out of it, where the cost could be anything above 7000. First, this column is an object dtype, so I know that I probably cannot compare it to a number directly. My code looks like the following:
import pandas as pd
health_data = pd.read_csv("inpatientCharges.csv")
state = input("What is your state: ")
issue = input("What is your issue: ")
#This line of code will create a new dataframe based on the two letter state code
state_data = health_data[(health_data.ProviderState == state)]
#With the new data set I search it for the injury the person has.
issue_data=state_data[state_data.DRGDefinition.str.contains(issue.upper())]
#I then replace the $ sign with '' so I have a number. I also believe my code may be starting to break down at this point.
issue_data = issue_data['AverageTotalPayments'].str.replace('$', '')
#Since the previous line took out the $ I convert it from an object to a float
issue_data = issue_data[['AverageTotalPayments']].astype(float)
#I attempt to print out the values.
cost = issue_data[(issue_data.AverageTotalPayments >= 10000)]
print(cost)
When I run this code I simply get NaN back, which is not what I want. Any help with what is wrong would be great! Thank you in advance.
Try this:
In [83]: df
Out[83]:
AverageTotalPayments
0 $7064.38
1 $7455.75
2 $6921.90
3 aaa
In [84]: df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000
Out[84]:
0 True
1 True
2 False
3 False
Name: AverageTotalPayments, dtype: bool
In [85]: df[df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000]
Out[85]:
AverageTotalPayments
0 $7064.38
1 $7455.75
Consider the pd.Series s
s
0 $7064.38
1 $7455.75
2 $6921.90
Name: AverageTotalPayments, dtype: object
This gets the float values (regex=False keeps the '$' literal; errors='coerce' turns anything non-numeric into NaN):
pd.to_numeric(s.str.replace('$', '', regex=False), errors='coerce')
0 7064.38
1 7455.75
2 6921.90
Name: AverageTotalPayments, dtype: float64
Filter s
s[pd.to_numeric(s.str.replace('$', '', regex=False), errors='coerce') > 7000]
0 $7064.38
1 $7455.75
Name: AverageTotalPayments, dtype: object
I need to check whether the string ends with | or not.
Student,"Details"
Joe|"December 2017|maths"
Bob|"April 2018|History|Biology|Physics|"
sam|"December 2018|physics"
I have tried with the below code and it's not working as expected.
def Pipe_in_variant(path):
    df = pd.read_csv(path, sep='|')
    mask = (df['Details'])
    result = mask.endswith("|")
    print("...................")
    print(result)
Your example input is unclear; however, assuming you want to check whether the items in a column end with something, use str.endswith.
Example:
df = pd.DataFrame({'Details': ['ab|c', 'acb|']})
df['Details'].str.endswith('|')
output:
0 False
1 True
Name: Details, dtype: bool
printing the matching rows:
df[df['Details'].str.endswith('|')]
output:
Details
1 acb|
Note - I think the first row of the input should be Student|"Details" instead of Student,"Details".
Here is what you can do
import pandas as pd
dframe = pd.read_csv('input.txt', sep='|')
dframe['ends_with_vbar'] = dframe['Details'].str.endswith('|')
dframe
Output:
Student Details ends_with_vbar
0 Joe December 2017|maths False
1 Bob April 2018|History|Biology|Physics| True
2 sam December 2018|physics False
Then you can print the matching rows as follows
for _, row in dframe[dframe['ends_with_vbar']].iterrows():
print(f'{row["Student"]} - {row["Details"]}')
Output:
Bob - April 2018|History|Biology|Physics|
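An equivalent without iterrows, using boolean indexing and zip on a small inline frame (the Student/Details values are copied from the question):

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Joe', 'Bob', 'sam'],
                   'Details': ['December 2017|maths',
                               'April 2018|History|Biology|Physics|',
                               'December 2018|physics']})

# Keep only the rows whose Details end with the vertical bar.
matches = df[df['Details'].str.endswith('|')]
for student, details in zip(matches['Student'], matches['Details']):
    print(f'{student} - {details}')
```

Zipping two columns avoids the per-row overhead of iterrows when all you need is a couple of fields.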
I am creating a simulator for a football dice game and I am running into an issue getting values out of the DataFrame.
My code for the result of a play is this:
def off_play(off_team_chart):
    off_UPCID = int(input('Enter UPCID for Offense Play: '))
    off_play = off_team_chart[off_team_chart['UPCID'] == off_UPCID]
    oRoll = dice.oDice()
    oPlay = off_play[off_play['DieRoll'] == oRoll]
    print(oPlay)
    oResult = oPlay['ResultCodeID']
    oYards = oPlay['Yards']
    return oResult, oYards
which when run outputs the following:
TeamChartDetailD TeamChartID UPCID ... ResultCodeID Yards OutOfBounds
108292 866811 874 8 ... 8 19 False
[1 rows x 7 columns]
108292 8
Name: ResultCodeID, dtype: int64 108292 19
Name: Yards, dtype: object
I would like oResult to be the int 8 and oYards to be the int 19 in this scenario; the pandas documentation seemed to suggest that I would need to know the 108292 index label in order to get the value. Is there a way around this?
The code is printing the row index.
In order to print the return correctly, you should use the .item() method as follows:
oResult = oPlay['ResultCodeID'].item()
oYards = oPlay['Yards'].item()
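A minimal sketch of this on a hypothetical one-row selection shaped like the dice-game lookup (the index label 108292 and values are taken from the question's output; .iloc[0] is a position-based alternative that also avoids knowing the label):

```python
import pandas as pd

# Stand-in for the single-row oPlay selection from the question.
oPlay = pd.DataFrame({'ResultCodeID': [8], 'Yards': [19]}, index=[108292])

# .item() works only when the selection has exactly one element
# and returns a plain Python scalar.
oResult = oPlay['ResultCodeID'].item()

# .iloc[0] selects by position, so the index label never matters.
oYards = oPlay['Yards'].iloc[0]
```

Both approaches sidestep the index label entirely; .item() additionally raises if the selection unexpectedly has more or fewer than one row, which can be a useful sanity check here.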
Assignment to DataFrame not working but dtypes changed.
New to data science, I want to assign the target_frame to the empty_frame, but it does not work until I assign it a second time. During the assignments, the dtype of empty_frame changes from int32 to float64 and finally ends up as int64.
I try to simplify my model as the code below, they have the same problem.
import pandas as pd
import numpy as np
dataset = [[[i for i in range(5)], ] for i in range(5)]
dataset = pd.DataFrame(dataset, columns=['test'])
empty_numpy = np.arange(25).reshape(5, 5)
empty_numpy.fill(np.nan)
# Solution 1: change the below code into 'empty_frame = pd.DataFrame(empty_numpy)' then everything will be fine
empty_frame = pd.DataFrame(empty_numpy, columns=[str(i) for i in range(5)])
series = dataset['test']
target_frame = pd.DataFrame(list(series))
# Solution 2: run `empty_frame[:] = target_frame` twice, work fine to me.
# ==================================================================
# First try.
empty_frame[:] = target_frame
print("="*40)
print(f"Data types of empty_frame: {empty_frame.dtypes}")
print("="*40)
print("Result of first try: ")
print(empty_frame)
print("="*40)
# Second try.
empty_frame[:] = target_frame
print(f"Data types of empty_frame: {empty_frame.dtypes}")
print("="*40)
print("Result of second try: ")
print(empty_frame)
print("="*40)
# ====================================================================
I expect the output of code above should be:
========================================
Data types of empty_frame: 0 int64
1 int64
2 int64
3 int64
4 int64
dtype: object
========================================
Result of first try:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
========================================
but it does not work on the first try.
There are two workarounds for this problem, but I don't know why they work:
as I showed in my code, run the assignment twice in one go.
remove the columns' name when creating empty_frame.
Two things I want to figure out:
why empty_frame's data types changed.
why the solutions showed in my code can solve this assignment problem.
Thanks.
If I understand your question correctly, your problem starts when you create the empty_numpy matrix.
My favourite solution would be to use empty_numpy = np.empty([5,5]) instead (the default dtype is float64 here). Then the "Result of first try: " is correct. That means:
import pandas as pd
import numpy as np
dataset = [[[i for i in range(5)],] for i in range(5)]
dataset = pd.DataFrame(dataset, columns=['test'])
empty_numpy = np.empty([5,5])
# here you may add empty_numpy.fill(np.nan) but it's not necessary,result is the same
empty_frame = pd.DataFrame(empty_numpy, columns=[str(i) for i in range(5)])
series = dataset['test']
target_frame = pd.DataFrame(list(series))
# following assignment is correct then
empty_frame[:] = target_frame
print('='*40)
print(f'Data types of empty_frame: {empty_frame.dtypes}')
print('='*40)
print("Result of first try: ")
print(empty_frame)
print("="*40)
Or just pass the dtype argument to your np.arange call, like this:
empty_numpy = np.arange(25, dtype=float).reshape(5, 5)
Then it works too (but it's a little boring ;o).
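A quick sketch of the dtype point the answer above relies on: np.empty defaults to float64, so NaN fits without coercion, while np.arange(25) gives an integer array that cannot hold NaN.

```python
import numpy as np

# Default dtype of np.empty is float64, so NaN assignment is safe.
empty_numpy = np.empty([5, 5])
print(empty_numpy.dtype)  # float64

# np.arange without a dtype argument yields a platform integer array.
print(np.arange(25).dtype)

# Passing dtype=float up front gives the same safe starting point.
print(np.arange(25, dtype=float).reshape(5, 5).dtype)  # float64
```

Starting from a float array means the later frame assignment never has to change the column dtypes mid-flight, which is what made the first try appear to fail.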
I'm trying to convert a column in my DataFrame to numbers. The input is email domains extracted from email addresses. Sample:
>>> data['emailDomain']
0 [gmail]
1 [gmail]
2 [gmail]
3 [aol]
4 [yahoo]
5 [yahoo]
I want to create a new column where if the domain is gmail or aol, the column entry would be a 1 and 0 otherwise.
I created a method which goes like this:
def convertToNumber(row):
    try:
        if row['emailDomain'] == '[gmail]':
            return 1
        elif row['emailDomain'] == '[aol]':
            return 1
        elif row['emailDomain'] == '[outlook]':
            return 1
        elif row['emailDomain'] == '[hotmail]':
            return 1
        elif row['emailDomain'] == '[yahoo]':
            return 1
        else:
            return 0
    except TypeError:
        print("TypeError")
and used it like:
data['validEmailDomain'] = data.apply(convertToNumber, axis=1)
However, my output column is 0 even when I know there are gmail and aol emails present in the input column.
Any idea what could be going wrong?
Also, I think this usage of conditional statements might not be the most efficient way to tackle this problem. Is there any other approach to getting this done?
You can use Series.isin:
providers = {'gmail', 'aol', 'yahoo','hotmail', 'outlook'}
data['emailDomain'].isin(providers)
searching the provider
Instead of applying a regex to each email in each row, you can use the Series.str methods to do it on a whole column at a time:
pattern2 = r'(?<=@)([^.]+)(?=\.)'
df['email'].str.extract(pattern2, expand=False)
so this becomes something like this:
pattern2 = r'(?<=@)([^.]+)(?=\.)'
providers = {'gmail', 'aol', 'yahoo', 'hotmail', 'outlook'}
df = pd.DataFrame(data={'email': ['test.1@gmail.com', 'test.2@aol.com', 'test3@something.eu']})
provider_series = df['email'].str.extract(pattern2, expand=False)
0 gmail
1 aol
2 something
Name: email, dtype: object
interested_providers = provider_series.isin(providers)
0 True
1 True
2 False
Name: email, dtype: bool
If you really want 0s and 1s, you can add a .astype(int)
Your code would work if your series contained strings; as it stands, it likely contains single-element lists, in which case you need to extract the first element.
I would also utilise pd.Series.map instead of using any row-wise logic. Below is a complete example:
df = pd.DataFrame({'emailDomain': [['gmail'], ['gmail'], ['gmail'], ['aol'],
['yahoo'], ['yahoo'], ['else']]})
domains = {'gmail', 'aol', 'outlook', 'hotmail', 'yahoo'}
df['validEmailDomain'] = df['emailDomain'].map(lambda x: x[0]).isin(domains)\
.astype(int)
print(df)
# emailDomain validEmailDomain
# 0 [gmail] 1
# 1 [gmail] 1
# 2 [gmail] 1
# 3 [aol] 1
# 4 [yahoo] 1
# 5 [yahoo] 1
# 6 [else] 0
You could sum up the occurrence checks for every provider via a list comprehension and write the resulting list into data['validEmailDomain']:
providers = ['gmail', 'aol', 'outlook', 'hotmail', 'yahoo']
data['validEmailDomain'] = [np.sum([p in e for p in providers]) for e in data['emailDomain'].values]
I have the following data frame (consisting of both negative and positive numbers):
df.head()
Out[39]:
Prices
0 -445.0
1 -2058.0
2 -954.0
3 -520.0
4 -730.0
I am trying to change the 'Prices' column to display as currency when I export it to an Excel spreadsheet. The following command I use works well:
df['Prices'] = df['Prices'].map("${:,.0f}".format)
df.head()
Out[42]:
Prices
0 $-445
1 $-2,058
2 $-954
3 $-520
4 $-730
Now my question is: what would I do if I wanted the output to have the negative sign BEFORE the dollar sign? In the output above, the dollar signs come before the negative signs. I am looking for something like this:
-$445
-$2,058
-$954
-$520
-$730
Please note there are also positive numbers as well.
You can use np.where to test whether the values are negative and, if so, prepend the negative sign in front of the dollar sign, casting the series to string using astype:
In [153]:
df['Prices'] = np.where( df['Prices'] < 0, '-$' + df['Prices'].astype(str).str[1:], '$' + df['Prices'].astype(str))
df['Prices']
Out[153]:
0 -$445.0
1 -$2058.0
2 -$954.0
3 -$520.0
4 -$730.0
Name: Prices, dtype: object
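An alternative sketch using a plain format function with map instead of np.where: abs() drops the sign, which is re-added before the dollar sign for negatives, and the thousands separator from the original "${:,.0f}" format is preserved (a positive row is added to the question's data to show both cases):

```python
import pandas as pd

df = pd.DataFrame({'Prices': [-445.0, -2058.0, -954.0, -520.0, -730.0, 1250.0]})

def currency(x):
    # Put the minus sign, if any, before the dollar sign.
    return ('-' if x < 0 else '') + '${:,.0f}'.format(abs(x))

formatted = df['Prices'].map(currency)
print(formatted)
```

This keeps the formatting logic in one small function, which is easier to tweak (e.g. for decimal places) than string surgery on an already-formatted column.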
You can use the locale module and the _override_localeconv dict. It's not well documented, but it's a trick I found in another answer that has helped me before.
import pandas as pd
import locale
locale.setlocale( locale.LC_ALL, 'English_United States.1252')
# Made an assumption with that locale. Adjust as appropriate.
locale._override_localeconv = {'n_sign_posn':1}
# Load dataframe into df
df['Prices'] = df['Prices'].map(locale.currency)
This creates a dataframe that looks like this:
Prices
0 -$445.00
1 -$2058.00
2 -$954.00
3 -$520.00
4 -$730.00