I have a dataset with a column named 'Discount' whose values are strings like '20% off', '25% off', etc.
What I want is to keep just the number in the column and remove the '%' symbol and the 'off' string.
I'm using this code to do it:
df['discount'] = df['discount'].apply(lambda x: x.lstrip('%').rstrip('off'))
However, when I apply it, all the values in the 'discount' column become NaN.
I also tried this:
df['discount'] = df['discount'].str.replace('off', '')
However, it does the same thing.
Is there another way of handling this? I just want every value in that column to be the bare number (25, 20, 10, ...) without the percent sign and the trailing string.
Try this:
df['discount'] = df['discount'].str.replace(r'(%|\s*off)', '', regex=True).astype(int)
Output:
>>> df
discount
0 20
1 25
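If stray characters other than '%' and 'off' can appear, extracting the leading digits is a slightly more defensive variant (a sketch, assuming every entry starts with an integer):

# Pull out the leading run of digits and convert to int
df['discount'] = df['discount'].str.extract(r'(\d+)', expand=False).astype(int)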
I came up with this solution:
df['discount'] = df['discount'].str.split('%').str[0]
or as int:
df['discount'] = df['discount'].str.split('%').str[0].astype(int)
We chop each string in two at the '%' sign and keep the first part, the number.
If every value has the fixed '% off' suffix, the most efficient option is to just drop the last 5 characters:
df['discount'] = df['discount'].str[:-5].astype(int)
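A quick sanity check on the sample values (assuming every value follows the exact 'NN% off' pattern, so the suffix is always 5 characters):

import pandas as pd

df = pd.DataFrame({'discount': ['20% off', '25% off']})
df['discount'] = df['discount'].str[:-5].astype(int)
print(df['discount'].tolist())  # [20, 25]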
This is the column of a dataframe that I have (values are str):
Values
7257.5679
6942.0949714286
5780.0125476250005
This is how I want the record to go to the database:
Values
7.257,56
6.942,09
5.780,01
How can I do this? Thanks in advance!
df["Values"] = df["Values"].apply(lambda x: "{:,.2f}".format(float(x)))
Output:
Values
0 7,257.57
1 6,942.09
2 5,780.01
To get values in the format 7.257,56, you can chain the replace function:
df["Values"] = df["Values"].apply(lambda x: "{:,.2f}".format(float(x)).replace(".", ",").replace(",", ".", 1))
But chained replace may not be the most efficient or concise option on a larger dataset; in that case you might want to look into translate, which is the better approach here:
trans_column = str.maketrans(",.", ".,")
df["Values"] = df["Values"].apply(lambda x: "{:,.2f}".format(float(x)).translate(trans_column))
Output:
Values
0 7.257,57
1 6.942,09
2 5.780,01
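The same conversion can also be written as a small pipeline over the column (a sketch; .map still formats element-wise, but str.translate handles the separator swap for the whole column):

import pandas as pd

df = pd.DataFrame({"Values": ["7257.5679", "6942.0949714286", "5780.0125476250005"]})

trans_column = str.maketrans(",.", ".,")
# Format the whole column, then swap the separators in one pass
df["Values"] = (
    df["Values"].astype(float)
                .map("{:,.2f}".format)
                .str.translate(trans_column)
)
print(df)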
I'm working on a dataset where a column has numbers separated by commas.
I want to convert the values to numbers and replace each anomalous entry with their mean.
e.g. for a cell containing 50,45,30,20
I want to compute the mean and replace the cell's current value with it.
You can simply define a function that unpacks those values and then takes their mean.

def get_mean(x):
    # Split into a list of strings
    parts = x.split(',')
    # Convert to numbers
    y = [float(n) for n in parts]
    return sum(y) / len(y)

# Apply on the desired column
df['col'] = df['col'].apply(get_mean)
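For reference, a minimal run on sample data (the column name 'col' is assumed from the snippet above):

import pandas as pd

df = pd.DataFrame({'col': ['50,45,30,20', '10,20']})
df['col'] = df['col'].apply(get_mean)
print(df)
#      col
# 0  36.25
# 1  15.00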
from numpy import mean
data.apply(lambda x: mean(list(map(lambda y: int(y.strip()), x.split(",")))))
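A fully vectorized alternative avoids the Python-level loop inside the lambda (a sketch, assuming data is a Series of comma-separated strings):

# Split each cell into columns, convert to float, and average across the row
means = data.str.split(',', expand=True).astype(float).mean(axis=1)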
You can apply a custom function like GabrielBoehme suggests, but if you are in control of the data import, handling the issue at the import stage may be a bit cleaner.
import pandas as pd
data = pd.read_csv('foobar.csv', sep=',', thousands=',')
Obviously you are going to need to make sure everything is quoted appropriately so that the CSV is parsed correctly.
Mine is a longer explanation and the others here are probably better... but this might be easier to understand if you are newer to Python.

cell_num = "1,2,3,4,5,6,7"

# Split the string on ',' to make a list of number strings
cell_numbers = cell_num.split(",")

# Loop to sum the values in the list
sum_num = 0
for num in cell_numbers:
    sum_num += int(num)

# Get the mean
mean = sum_num / len(cell_numbers)

# Print the final number
print(mean)
If you have decimals... be sure to swap int with float.
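For completeness, the standard library's statistics module does the same in one line:

from statistics import mean

cell_num = "1,2,3,4,5,6,7"
print(mean(float(n) for n in cell_num.split(",")))  # 4.0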
I have a list of phone numbers in pandas like this:
Phone Number
923*********
0923********
03**********
0923********
I want to clean the phone numbers based on two rules
If the length of string is 11, number should start with '03'
If the length of string is 12, number should start with '923'
I want to discard all other numbers.
So far I have tried creating two separate columns with the following code:
before_cust['digits'] = before_cust['Information'].str.len()
before_cust['starting'] = before_cust['Information'].astype(str).str[:3]
before_cust.loc[((before_cust['digits'] == 11) & before_cust[before_cust['starting'].str.contains("03")==True]) | ((before_cust['digits'] == 12) & (before_cust[before_cust['starting'].str.contains("923")==True]))]
However this code doesn't work. Is there a more efficient way to do this?
Create 2 boolean masks for each condition then filter out your dataframe:
# If the length of string is 11, number should start with '03'
m1 = df['Information'].str.len().eq(11) & df['Information'].str.startswith('03')
# If the length of string is 12, number should start with '923'
m2 = df['Information'].str.len().eq(12) & df['Information'].str.startswith('923')
out = df.loc[m1|m2]
print(out)
# Output:
Information
0 923*********
Note: I think it doesn't work because you use str.contains rather than str.startswith.
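If the entries are purely digits (the asterisks above are just masking), both rules can also be collapsed into a single regular expression; a sketch using str.match, which anchors at the start of the string:

# 11 chars starting with '03' or 12 chars starting with '923'
out = df[df['Information'].str.match(r'(?:03\d{9}|923\d{9})$')]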
Assuming you want to get rid of all the rows that do not satisfy your condition (you haven't included any other information about the dataframe), I'd go with this approach:
func = lambda num :(len(num) == 11 and num.startswith("03")) or (len(num) == 12 and num.startswith("923"))
df = df[df["Information"].apply(func)].reset_index(drop = True)
The lambda returns True when your condition is met and False otherwise.
Then simply apply this filter to your dataframe to get rid of all the other rows!
I'm looking to compare a database of movie reviews from IMDb and RT to each other. To do that, I'd like to convert the RT percentage score to a 0.0 to 10.0 score so I can graph them on the same axis properly. Is there a way to do this by dropping the '%' in the RT column and then moving the decimal place one place to the left? For reference, the table looks like this.
Try this:
df["Rotten Tomatoes"] = df["Rotten Tomatoes"].apply(lambda x: float(x.replace("%", ""))/10)
Something like
df = pd.DataFrame({
    "IMDb": [8.8, 8.7, 8.5, 8.5, 8.8],
    "Rotten Tomatoes": ['87%', '87%', '84%', '96%', '97%']
})
df["RT"] = df["Rotten Tomatoes"].apply(lambda x: float(x[:-1])/10)
The x[:-1] will strip the last character, the '%'.
float() will convert the string to a floating point number.
Finally divide by 10 to adjust the scale.
If you want a faster way of doing it, the last line can be changed to:
df['Rotten Tomatoes'].str.replace('%','').astype(float).div(10)
Thanks to @Rajith Thennakoon for the idea to use replace.
I have a data frame like this:
MONTH TIME PATH RATE
0 Feb 15:24:11 enp1s0 14.71Kb
I want to create a function which can identify whether 'Kb' or 'Mb' appears in the RATE column. If an entry in RATE ends with 'Kb' or 'Mb', it should strip the suffix and perform an operation to convert the value into plain bytes (b).
Here's my code so far, where RATE is treated by the DataFrame as an object:

df = pd.DataFrame(listOfLists)

def strip(bytesData):
    if "Kb" in bytesData:
        bytesData/1000
    elif "Mb" in bytesData:
        bytesData/1000000

df['RATE'] = df.apply(lambda x: strip(x['byteData']), axis=1)
How can I get it to change the value within the column while stripping the unwanted characters and converting it into the format I need? I know once this operation is complete I'll have to change it to an int; however, I can't seem to alter the data the way I need.
Thanks in advance!
I modified your function a bit and used map(lambda x:) instead of apply, since we are working with a Series, not the full DataFrame. I also added some lines to provide examples for both Kb and Mb, and for when neither is present:
import numpy as np
import pandas as pd

example_df = pd.DataFrame({'Month': [0, 1, 2, 3],
                           'Time': ['15:32', '16:42', '17:11', '15:21'],
                           'Path': ['xxxxx', 'yyyyy', 'zzzzz', 'aaaaa'],
                           'Rate': ['14.71Kb', '18.21Mb', '19.01Kb', 'Error_1']})

def case_1(value):
    if value[-2:] == 'Kb':
        return float(value[:-2]) * 1000
    elif value[-2:] == 'Mb':
        return float(value[:-2]) * 1000000
    else:
        return np.nan

example_df['Rate'] = example_df['Rate'].map(lambda x: case_1(x))
The logic of the function: if the value ends with 'Kb', multiply by 1,000; else if it ends with 'Mb', multiply by 1,000,000; otherwise return NaN, because neither condition is satisfied.
Output:
   Month   Time   Path        Rate
0      0  15:32  xxxxx     14710.0
1      1  16:42  yyyyy  18210000.0
2      2  17:11  zzzzz     19010.0
3      3  15:21  aaaaa         NaN
Here's an alternative way I might approach this. This solution handles other abbreviations as well, though it relies on the standard library's re regex package.
This approach makes a new column called Bytes. I often find it helpful to keep the original RATE column in cases like this, to verify there aren't any edge cases I haven't thought of. I also use a mapping from label to the power of 1000 needed to get the correct number of bytes. I did add the code required to drop the original RATE column and rename the new one.
import re

def convert_to_bytes(contents):
    # Split e.g. '14.71Kb' into the numeric part and the unit label
    value, label, _ = re.split('([A-Za-z]+)', contents)
    # Map each label to the power of 1000 it represents
    factors = {'Kb': 1, 'Mb': 2, 'Gb': 3, 'Tb': 4}
    return float(value) * 1000**(factors[label])
df['Bytes'] = df['RATE'].map(convert_to_bytes)
# Drop original RATE column
df = df.drop('RATE', axis=1)
# Rename Bytes column to RATE
df = df.rename({'Bytes': 'RATE'}, axis='columns')
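A quick check on a fresh hypothetical frame, before the drop/rename step:

import pandas as pd

df = pd.DataFrame({'RATE': ['14.71Kb', '18.21Mb', '2.5Gb']})
df['Bytes'] = df['RATE'].map(convert_to_bytes)
print(df['Bytes'].tolist())  # [14710.0, 18210000.0, 2500000000.0]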