Pandas: take number out of string - python

In my data, I have this column "price_range".
Dummy dataset:
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146', 'No pricing available']})
I am using pandas. What is the most efficient way to get the upper and lower bounds of the price range into separate columns?

Alternatively, you can parse the string accordingly (if you want the limits for each row, rather than the total range):
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146']})

def get_lower_limit(some_string):
    a = some_string.split(' - ')
    return int(a[0].split('€')[-1])

def get_upper_limit(some_string):
    a = some_string.split(' - ')
    return int(a[1].split('€')[-1])

df['lower_limit'] = df.price_range.apply(get_lower_limit)
df['upper_limit'] = df.price_range.apply(get_upper_limit)
Output:
Out[153]:
price_range lower_limit upper_limit
0 €4 - €25 4 25
1 €3 - €14 3 14
2 €25 - €114 25 114
3 €112 - €146 112 146
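As an aside, a vectorized sketch of the same extraction (assuming the '€low - €high' format above; rows like 'No pricing available' simply come out as NaN, hence the float cast):
# capture the two numbers around the dash; non-matching rows become NaN
bounds = df['price_range'].str.extract(r'€(?P<lower_limit>\d+)\s*-\s*€(?P<upper_limit>\d+)')
df[['lower_limit', 'upper_limit']] = bounds.astype(float)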

You can do the following. First create two extra columns, lower and upper, which contain the lower and upper bound from each row. Then take the minimum of the lower column and the maximum of the upper column.
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146', 'No pricing available']})
df.loc[df.price_range != 'No pricing available', 'lower'] = df['price_range'].str.split('-').str[0]
df.loc[df.price_range != 'No pricing available', 'upper'] = df['price_range'].str.split('-').str[1]
df['lower'] = df.lower.str.replace('€', '').astype(float)
df['upper'] = df.upper.str.replace('€', '').astype(float)
price_range = [df.lower.min(), df.upper.max()]
Output:
>>> price_range
[3.0, 146.0]

Related

Python: Getting "list index out of range" error; I know why but don't know how to resolve this

I am currently working on a data science project. The idea is to clean the data from "glassdoor_jobs.csv" and present it in a much more understandable manner.
import pandas as pd
df = pd.read_csv('glassdoor_jobs.csv')
#salary parsing
#Removing "-1" Ratings
#Clean up "Founded"
#state field
#Parse out job description
df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df = df[df['Salary Estimate'] != '-1']
Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$',''))
minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))
I am getting the error at that last line. After digging a bit, I found that in minus_hr, some of the 'Salary Estimate' values contain only one number instead of a range:
index  Salary Estimate
0      150
1      58
2      130
3      125-150
4      110-140
5      200
6      67- 77
And so on. Now I'm trying to figure out how to work around the "list index out of range" error, and make max_salary the same as min_salary for the cells with only one value.
I am also trying to get the average of the min and max salary, and if the cell only has a single value, make that value the average.
So in the end, something like index 0 would look like:
index  min  max  average
0      150  150  150
You'll have to add in a conditional statement somewhere.
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]) if '-' in x else int(x))
The above might do it, or you can define a function.
def max_salary(cell_value):
    # the number after the dash, or the lone number if there is no dash
    if '-' in cell_value:
        return int(cell_value.split('-')[1])
    return int(cell_value)

df['max_salary'] = minus_hr.apply(max_salary)
def avg_salary(cell_value):
    if '-' in cell_value:
        salaries = [int(s) for s in cell_value.split('-')]
        return sum(salaries) / len(salaries)
    return int(cell_value)

df['avg_salary'] = minus_hr.apply(avg_salary)
Swap in min_salary and repeat for the minimum, as sketched below.
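For completeness, a minimal sketch of that swap (same pattern, taking the element before the dash):
def min_salary(cell_value):
    # the number before the dash, or the lone number if there is no dash
    if '-' in cell_value:
        return int(cell_value.split('-')[0])
    return int(cell_value)

df['min_salary'] = minus_hr.apply(min_salary)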
Test the length of x.split('-') before accessing the elements, inside a function applied to each row:
def split_salary(x):
    salaries = x.split('-')
    if len(salaries) == 1:
        # only one salary number is given, so use it for both min and max
        return int(salaries[0]), int(salaries[0])
    # two salary numbers are given
    return int(salaries[0]), int(salaries[1])

df['min_salary'], df['max_salary'] = zip(*minus_hr.apply(split_salary))
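As an aside, int(x.split('-')[-1]) also covers both cases for the maximum in one expression, since [-1] is the only element when there is no dash.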
If you want to avoid .apply()...
Try:
import numpy as np
# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
sals = df['Salary Estimate'].str.extractall(r'(?P<min_salary>\d+)[^0-9]*(?P<max_salary>\d*)')
# reset the new frame's index (extractall returns a row/match MultiIndex)
sals = sals.reset_index(drop=True)
# join the extracted min/max salary columns to the original dataframe,
# converting empty matches to nan
df = df.join(sals[['min_salary', 'max_salary']].replace('', np.nan))
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Or without using regex:
import numpy as np
# extract the two numbers (if there are two numbers) from the 'Salary Estimate' column
df['sals'] = df['Salary Estimate'].str.split('-')
# expand the list in sals to two columns filling with nan
df[['min_salary', 'max_salary']] = pd.DataFrame(df.sals.tolist()).fillna(np.nan)
# delete the sals column
del df['sals']
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:, ['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Output:
Salary Estimate min_salary max_salary average_salary
0 150 150 150 150
1 58 58 58 58
2 130 130 130 130
3 125-150 125 150 137
4 110-140 110 140 125
5 200 200 200 200
6 67- 77 67 77 72

Pandas - positive volume index without for loop

I want to compute the positive volume index on a stock time series (https://www.investopedia.com/terms/p/pvi.asp), and I'd like to do it in one shot (without a for loop).
The input is a classic stock dataframe with OHLC and Volume as columns and datetime as index.
This is what I was thinking:
df['PVI'] = 0
df['Volume'].diff() > 0
prev_c = df['Close'].shift()
prev_pvi = df['PVI'].shift()
df['PVI'] = np.where(df['Volume'].diff() > 0,
                     prev_pvi + (df['Close'] - prev_c / prev_c * df['PVI'].shift()),
                     df['PVI'].shift())
But I get a ValueError: cannot set a row with mismatched columns.
I can easily split it into a for loop with an if condition on volume, but I wanted to write more pandas-idiomatic code.
In order to have a working sample just:
import yfinance as yf
df = yf.download(tickers='AAPL', start='2019-01-01', end='2021-01-01', interval='1d')
Thanks for any help/suggestion
ADDITION:
the for loop approach could be (given a df.reset_index()):
df['pvi'] = 0.
for index, row in df.iterrows():
    if index > 0:
        prev_pvi = df.at[index - 1, 'pvi']
        prev_close = df.at[index - 1, 'Close']
        if row['Volume'] > df.at[index - 1, 'Volume']:
            pvi = prev_pvi + (row['Close'] - prev_close) / prev_close * prev_pvi
        else:
            pvi = prev_pvi
    else:
        pvi = 1000  # dummy value
    df.at[index, 'pvi'] = pvi
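For reference, a vectorized sketch of the same recurrence (assuming the standard PVI definition from the link above, with a base value of 1000): PVI only changes on days when volume rises, so those days' returns can be accumulated with a cumulative product.
ret = df['Close'].pct_change().fillna(0)   # daily close-to-close returns
vol_up = df['Volume'].diff() > 0           # True where volume increased
# compound the return only on volume-up days, otherwise stay flat
df['pvi'] = 1000 * (1 + ret.where(vol_up, 0)).cumprod()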

Python compare and count row values

I'd like to compare two columns row by row and count when a specific value in each row is not correct. For instance:
group landing_page
control new_page
control old_page
treatment new_page
treatment old_page
control old_page
I'd like to count the number of times a treatment row does not have new_page, or a control row does not have old_page. It could be the opposite too, I guess, i.e. counting when treatment does have new_page.
Use pandas groupby to find the counts of group/landing page pairs.
Use groupby again to find the group counts.
To find the count of other landing pages within each group, subtract each landing page count from the group count.
df = pd.DataFrame({'group': ['control', 'control', 'treatment',
'treatment', 'control'],
'landing_page': ['new_page', 'old_page', 'new_page',
'old_page', 'old_page']})
# find counts per pairing
df_out = df.groupby(['group', 'landing_page'])['landing_page'].count().to_frame() \
.rename(columns={'landing_page': 'count'}).reset_index()
# find totals for groups
df_out['grp_total'] = df_out.groupby('group')['count'].transform('sum')
# find count not equal to landing page
df_out['inverse_count'] = df_out['grp_total'] - df_out['count']
print(df_out)
group landing_page count grp_total inverse_count
0 control new_page 1 3 2
1 control old_page 2 3 1
2 treatment new_page 1 2 1
3 treatment old_page 1 2 1
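As an aside (not part of the steps above), pd.crosstab gives the same pair counts in a single call:
# pair counts per group/landing_page combination
counts = pd.crosstab(df['group'], df['landing_page'])
print(counts)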
This sounds like a job for the zip() function.
First, setup your inputs and counters:
group = ["control", "control", "treatment", "treatment", "control"]
landingPage = ["new_page", "old_page", "new_page", "old_page", "old_page"]
treatmentNotNew = 0
controlNotOld = 0
Then zip the two inputs you are comparing into an iterator of tuples:
zipped = zip(group, landingPage)
Now you can iterate over the tuple values a (group) and b (landing page) while counting each time that treatment != new_page and control != old_page:
for a, b in zipped:
    if (a == "treatment") and (b != "new_page"):
        treatmentNotNew += 1
    if (a == "control") and (b != "old_page"):
        controlNotOld += 1
Finally, print your result!
print("treatmentNotNew = " + str(treatmentNotNew))
print("controlNotOld = " + str(controlNotOld))
>> treatmentNotNew = 1
>> controlNotOld = 1
I would create a new column with map that maps your desired output given the input. Then you can easily test if the new mapping column equals the landing_page column.
df = pd.DataFrame({
'group': ['control', 'control', 'treatment', 'treatment', 'control'],
'landing_page': ['old_page', 'old_page', 'new_page', 'old_page', 'new_page']
})
df['mapping'] = df.group.map({'control': 'old_page', 'treatment': 'new_page'})
(df['landing_page'] != df['mapping']).sum()
# 2

Building complex subsets in Pandas DataFrame

I'm making my way around GroupBy, but I still need some help. Let's say that I have a DataFrame with a column Group, giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 closest objects of the group (so I keep 4 objects in each group; we can assume that there is no group smaller than 4 objects if needed).
We assume here that we have defined the following functions:
# deg to rad
def d2r(x):
    return x * np.pi / 180.0

# rad to deg
def r2d(x):
    return x * 180.0 / np.pi

# computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that the separation between two objects is given by r2d(calc_sep(RA1, Dec1, RA2, Dec2)), with RA1 as the RA of the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glosses over a lot of details; see here for some details on the internals if you're interested.)
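To make that concrete, a quick illustration using the mock df above:
# GroupBy behaves like an iterator of (group id, sub-DataFrame) pairs
for group_id, sub_df in df.groupby('Group'):
    print(group_id, sub_df.shape)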
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep + 1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
Dec Group R RA
Group
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)
    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # relevant 2 scalars within this row
    # Calculate separation for each pair including the minimum R row
    # The result is a Series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    # Get index values of the `keep+1` smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row itself, where separation=0
    idx = sep.nsmallest(keep + 1).index
    # Restrict the result to those 3 index labels + your minimum R row
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
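For example (same result, just without sorting the group keys):
print(df.groupby('Group', sort=False).apply(sep_df))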
I want to built a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = merged.sort_values(['Group', 'sep'], ascending=[True, True]) \
                .groupby('Group')[df.columns].head(4)
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454

Binning in python

Example
Input has one column:
Time
02.10
02.40
02.50
Output
Since the average time difference is 20 min ((30 min + 10 min) / 2), I need a data frame which buckets the data by this average. It needs to add the average time to the first record; if the resulting time is present in the data, it belongs to bin 1, otherwise to bin 0, and then continue.
Desired Output
Time - Bin
02.10 - 1
02.30 - 0
02.50 - 1
03.10 - 0
Thanks in advance.
First, you should always share what you have tried. Anyway, try this; it should work:
mean = df.Time.diff().mean()
start = df.loc[0, 'Time']
end = df.loc[df.shape[0] - 1, 'Time']
n = int((end - start) / mean) + 1
timeSeries = [start + i * mean for i in range(n)]
df['Bin'] = 0
df.loc[df['Time'].isin(timeSeries), 'Bin'] = 1
This will create 'Bin' as you expect, provided you first parse 'Time' into a proper datetime.
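A minimal sketch of that preparation, assuming the '02.10'-style values are 'HH.MM' strings (adjust the format to your real data):
import pandas as pd

# hypothetical preparation: parse 'HH.MM' strings into datetimes so that
# .diff(), subtraction and .isin() above operate on real times
df = pd.DataFrame({'Time': ['02.10', '02.40', '02.50']})
df['Time'] = pd.to_datetime(df['Time'], format='%H.%M')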
