Grabbing characters after a certain value - python

Below is the dataframe. PIC_1 and Wgt are strings, and p.lgth and p_lgth are integers. If p_lgth is not equal to 30, I want to find '42' in PIC_1 and grab '42' plus the 15 digits that come after it.
PIC_1                                               Wgt  p.lgth  p_lgth
**PARTIAL-DECODE***P / 42011721930018984390078...  112      53      53
So the output from above should be 42011721930018984
My code that does not work follows:
def pic_mod(row):
    if row['p_lgth'] != 30:
        PIC_loc = row['PIC_1'].find('42')
        PIC_2 = row['PIC_1'].str[PIC_loc:PIC_loc + 15]
    elif row['p_lgth'] == 30:
        PIC_2 = PIC_1
    return PIC_2
row_1 is just a row from the larger df that is identical to the example row given above
row_1 = df71[2:3]
pic_mod(row_1)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I did type() on the variables and got
type(df71['PIC_1']) = pandas.core.series.Series
type(df71['p_lgth']) = pandas.core.series.Series
type(df71['Wgt']) = pandas.core.series.Series
I'm fairly new to Python. Should these data types come back as int and str? df71 is a df.

Based on the error message in your post, perhaps try this:
def pic_mod(row):
    if row['p_lgth'].any() != 30:
        PIC_loc = row['PIC_1'].str.find('42')[0]
        PIC_2 = row['PIC_1'].str[PIC_loc:PIC_loc + 17]
    elif row['p_lgth'].any() == 30:
        PIC_2 = row['PIC_1']
    return PIC_2
However, if your data is already structured in a pandas dataframe, you normally wouldn't write such an explicit function.
E.g. the initial filtering of all rows where p_lgth is not equal to 30 would be a single line:
df_fltrd = df[df['p_lgth']!=30]
Having done this, you can apply any function to the entries of the PIC_1 column, e.g. in your case the substring of length 17 starting at '42':
df_fltrd['PIC_1'].apply(lambda x: x[x.find('42'):x.find('42')+17])
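Putting the two lines together, a minimal runnable sketch (the frame contents and values below are only illustrative, mirroring the column names in the question):

import pandas as pd

# Illustrative data in the shape described in the question.
df71 = pd.DataFrame({
    "PIC_1": ["**PARTIAL-DECODE***P / 42011721930018984390078", "42011721930018984390"],
    "p_lgth": [53, 30],
})

# Rows where p_lgth != 30: take '42' plus the 15 digits after it (17 characters total).
mask = df71["p_lgth"] != 30
df71.loc[mask, "PIC_2"] = df71.loc[mask, "PIC_1"].apply(
    lambda x: x[x.find("42"):x.find("42") + 17]
)
# Rows where p_lgth == 30: keep PIC_1 unchanged.
df71.loc[~mask, "PIC_2"] = df71.loc[~mask, "PIC_1"]
print(df71)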

Related

Python: Trying to encode all DF strings into ints

I am trying to build a function that checks every column of the DataFrame and, if it finds a string in a column, encodes that string as an int. The function also ensures that the binary model label is rounded to 1 or 0.
However, this gives me a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error comes from the if statement, which makes me wonder if there is a better way to approach this problem.
This is what I got so far:
credit_key = "consumer"
model_label = "sc"
def df_column_headers(df):
headers = []
for column in df.columns:
headers.append(column)
return headers
def control_df_modelling(path_df_modelling, credit_key, model_label):
path_df_modelling = path_df_modelling + f"{file_name_modelling}_{credit_key}.{file_format_modelling}"
df_modelling = pd.read_csv(path_df_modelling)
print(f"Conducting Quality Checks: DF Modelling {credit_key}")
count_string_values = 0
df_modelling[model_label] = [round(x) for x in df_modelling[model_label]]
columns_df_modelling = df_column_headers(df_modelling)
for column in df_modelling[columns_df_modelling]:
column_element = df_modelling[column]
if column_element != np.number:
print(f"String Value Found: {column_element}")
print(f"String Value {column_element} is Converted into {hash(column_element)}")
count_string_values += 1
column_element = hash(column_element)
else:
print(f"No String Values in DF {credit_key} Found")
print(f"Total Number of String Values Found in DF Modelling {credit_key}: {count_string_values}")
return df_modelling
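For what it's worth, a minimal sketch of a vectorized alternative that sidesteps comparing a whole column inside an if statement; the helper below is hypothetical and assumes string columns are stored with dtype object, using pd.factorize to map each distinct string to an integer code:

import pandas as pd

def encode_string_columns(df_modelling):
    # Work on a copy so the original frame is left untouched.
    df_encoded = df_modelling.copy()
    # Select only the string (object-dtype) columns instead of testing each value.
    string_columns = df_encoded.select_dtypes(include="object").columns
    for column in string_columns:
        # factorize maps each distinct string to a stable integer code.
        df_encoded[column], _ = pd.factorize(df_encoded[column])
    return df_encoded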

I want to make a function if the common key is found in both dataframes

I have two dataframes, df1 and df2, each with a column containing a product code and a product price. I want to check the difference between the prices in the two dataframes and store the result of the function I created in a new dataframe df3 containing the product code and the final price. Here is my attempt:
Function to calculate the difference in the way I want:
def range_calc(z, y):
    final_price = pd.DataFrame(columns = ["Market_price"])
    res = z - y
    abs_res = abs(res)
    if abs_res == 0:
        return (z)
    if z > y:
        final_price = (abs_res / z) * 100
    else:
        final_price = (abs_res / y) * 100
    return (final_price)
The for loop I created to check the two dataframes and use the function:
Last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
for i in df1["product_ID"]:
    for x in df2["product_code"]:
        if i == x:
            Last_df["Product_number"] = i
            Last_df["Market_Price"] = range_calc(df1["full_price"], df2["tot_price"])
The problem is that I am getting this error every time:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Why you got the error message "The truth value of a Series is ambiguous"
You got the error message "The truth value of a Series is ambiguous" because you passed a pandas.Series as the condition of an if statement:
nums = pd.Series([1.11, 2.22, 3.33])

if nums == 0:
    print("nums == zero (nums is equal to zero)")
else:
    print("nums != zero (nums is not equal to zero)")
# AN EXCEPTION IS RAISED!
The error message is something like the following:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Somehow, a Series got inside your if condition. Actually, I know how it happened, but it will take me a moment to explain.
Suppose that you want the value from a specific row and column of a pandas dataframe. If you attempt to extract a single value out of a pandas table at a specific row and column, that value is sometimes a Series object, not a number.
Consider the following example:
# DATA:
# Name Age Location
# 0 Nik 31 Toronto
# 1 Kate 30 London
# 2 Evan 40 Kingston
# 3 Kyra 33 Hamilton
To create the dataframe above, we can write:
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
    'Age': [31, 30, 40, 33],
    'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
Now, let us try to get a specific row of data:
evans_row = df.loc[df['Name'] == 'Evan']
and we try to get a specific value out of that row of data:
evans_age = evans_row['Age']
You might think that evans_age is the integer 40, but you would be wrong.
Let us see what evans_age really is:
print(80*"*", "EVAN\'s AGE", type(Evans_age), sep="\n")
print(Evans_age)
We have:
EVAN's AGE
<class 'pandas.core.series.Series'>
2 40
Name: Age, dtype: int64
Evan's age is not returned as a number: evans_age is an instance of pandas.Series.
After extracting a single cell out of a pandas dataframe you can write .tolist()[0] to extract the number out of that cell.
evans_real_age = evans_age.tolist()[0]
print(80*"*", "EVAN\'s REAL AGE", type(evans_real_age), sep="\n")
print(evans_real_age)
EVAN's REAL AGE
<class 'numpy.int64'>
40
The exception in your original code was probably thrown by if abs_res == 0.
If abs_res is a pandas.Series, then abs_res == 0 returns another Series (of booleans), and there is no single answer to whether an entire list of numbers is equal to zero.
Normally people just enter one input to an if-clause.
if (912):
    print("912 is True")
else:
    print("912 is False")
When an if-statement receives more than one value, then the python interpreter does not know what to do.
For example, what should the following do?
import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])
if data:
    print("data is true")
else:
    print("data is false")
You should only input one value into an if-condition. Instead, you entered a pandas.Series object as input to the if-clause.
In your case, the pandas.Series only had one number in it. However, in general, pandas.Series contain many values.
The authors of the python pandas library assume that a series contains many numbers, even if it only has one.
The computer thought that you tried to put many different numbers inside of one single if-clause; a few ways to reduce a Series to a single truth value are shown below.
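For completeness, here is a minimal illustration of the reductions the error message suggests:

import pandas as pd

nums = pd.Series([1.11, 2.22, 3.33])

# Reduce the boolean Series to a single True/False before the if:
if (nums == 0).any():      # True if at least one element equals zero
    print("at least one element is zero")
if (nums == 0).all():      # True only if every element equals zero
    print("every element is zero")

single = pd.Series([40])
if single.item() == 40:    # .item() works only for a one-element Series
    print("the single value is 40")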
The difference between a "function definition" and a "function call"
Your original question was,
"I want to make a function if the common key is found"
Your use of the phrase "make a function" is incorrect. You probably meant, "I want to call a function if a common key is found."
The following are all examples of function "calls":
import pandas as pd
import numpy as np
z = foo(1, 91)
result = funky_function(811, 22, "green eggs and ham")
output = do_stuff()
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=dates, columns=list("ABCD"))
Suppose that you have two containers (two dictionaries, say).
If you truly want to "make" a function when a common key is found, then you would have code like the following:
dict1 = {'age': 26, 'phone': "303-873-9811"}
dict2 = {'name': "Bob", 'phone': "303-873-9811"}

def foo(dict1, dict2):
    union = set(dict2.keys()).intersection(set(dict1.keys()))
    # if there is a shared key...
    if len(union) > 0:
        # make (create) a new function
        def bar(*args, **kwargs):
            pass
        return bar

new_function = foo(dict1, dict2)
print(new_function)
In Python, you "make" (define) a function with the def keyword; using a function without def is known as a function call.
I think that your question should be re-titled. You could write, "How do I call a function if two pandas dataframes have a common key?"
A second good question would be something like, "What went wrong when we see the error message ValueError: The truth value of a Series is ambiguous?"
Your question was worded strangely, but I think I can answer it.
Generating Test Data
Your question did not include test data. If you ask a question on stack overflow again, please provide a small example of some test data.
The following is an example of data we can use:
product_ID full_price
0 prod_id 1-1-1-1 11.11
1 prod_id 2-2-2-2 22.22
2 prod_id 3-3-3-3 33.33
3 prod_id 4-4-4-4 44.44
4 prod_id 5-5-5-5 55.55
5 prod_id 6-6-6-6 66.66
6 prod_id 7-7-7-7 77.77
------------------------------------------------------------
product_code tot_price
0 prod_id 3-3-3-3 34.08
1 prod_id 4-4-4-4 45.19
2 prod_id 5-5-5-5 56.30
3 prod_id 6-6-6-6 67.41
4 prod_id 7-7-7-7 78.52
5 prod_id 8-8-8-8 89.63
6 prod_id 9-9-9-9 100.74
Products 1 and 2 are unique to data-frame 1
Products 8 and 9 are unique to data-frame 2
Both data-frames contain data for products 3, 4, 5, ..., 7.
The prices are slightly different between data-frames.
The test data above is generated by the following code:
import pandas as pd
from copy import copy

raw_data = [
    [
        "prod_id {}-{}-{}-{}".format(k, k, k, k),
        int("{}{}{}{}".format(k, k, k, k)) / 100
    ] for k in range(1, 10)
]
raw_data = [row for row in raw_data]

df1 = pd.DataFrame(data=copy(raw_data[:-2]), columns=["product_ID", "full_price"])
df2 = pd.DataFrame(data=copy(raw_data[2:]), columns=["product_code", "tot_price"])

for rowid in range(0, len(df2.index)):
    df2.at[rowid, "tot_price"] += 0.75

print(df1)
print(60*"-")
print(df2)
Add some error checking
It is considered best practice to make sure that your function inputs are in the correct format.
You wrote a function named range_calc(z, y). I recommend making sure that z and y are numbers, and not something else (such as a pandas Series object).
import inspect
import io

def range_calc(z, y):
    try:
        z = float(z)
        y = float(y)
    # float() of a bad string raises ValueError; float() of a multi-element Series raises TypeError
    except (ValueError, TypeError):
        function_name = inspect.stack()[0][3]
        with io.StringIO() as string_stream:
            print(
                "Error: In " + function_name + "(). Inputs should be like decimal numbers.",
                "Instead, we have: " + str(type(z)) + " \'" + repr(str(z))[1:-1] + "\'",
                file=string_stream,
                sep="\n"
            )
            err_msg = string_stream.getvalue()
        raise ValueError(err_msg)
    # DO STUFF
    return
Now we get error messages:
import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])
range_calc("I am supposed to be an integer", data)
# ValueError: Error: In range_calc(). Inputs should be like decimal numbers.
# Instead, we have: <class 'str'> 'I am supposed to be an integer'
Code which Accomplishes what you Wanted.
The following is some rather ugly code which computes what you wanted:
# You can continue to use your original `range_calc()` function unmodified
# Use the test data I provided earlier in this answer.
def foo(df1, df2):
    last_df = pd.DataFrame(columns=["Product_number", "Market_Price"])
    df1_ids = set(df1["product_ID"].tolist())
    df2_ids = set(df2["product_code"].tolist())
    pids = df1_ids.intersection(df2_ids)  # common product ids
    for pid in pids:
        row1 = df1.loc[df1["product_ID"] == pid]
        row2 = df2.loc[df2["product_code"] == pid]
        price1 = row1["full_price"].tolist()[0]
        price2 = row2["tot_price"].tolist()[0]
        price3 = range_calc(price1, price2)
        row3 = pd.DataFrame([[pid, price3]], columns=["Product_number", "Market_Price"])
        last_df = pd.concat([last_df, row3])
    return last_df

# ---------------------------------------
last_df = foo(df1, df2)
The result is:
Product_number Market_Price
0 prod_id 6-6-6-6 1.112595
0 prod_id 7-7-7-7 0.955171
0 prod_id 4-4-4-4 1.659659
0 prod_id 5-5-5-5 1.332149
0 prod_id 3-3-3-3 2.200704
Note that one of many reasons my solution is ugly is the following line of code:
last_df = pd.concat([last_df, row3])
If last_df is large (thousands of rows), then the code will run very slowly. This is because instead of inserting a new row of data, we:
copy the original dataframe,
append a new row of data to the copy,
delete/destroy the original dataframe.
It is really silly to copy 10,000 rows of data only to add one new value and then delete the old 10,000 rows; a faster pattern is sketched below.
However, my solution has fewer bugs than your original code, relatively speaking.
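A sketch of the usual workaround: collect the rows in a plain Python list and build the dataframe once at the end, instead of concatenating inside the loop (it reuses range_calc() and the column names from above):

def foo_faster(df1, df2):
    rows = []
    pids = set(df1["product_ID"]).intersection(set(df2["product_code"]))
    for pid in pids:
        price1 = df1.loc[df1["product_ID"] == pid, "full_price"].tolist()[0]
        price2 = df2.loc[df2["product_code"] == pid, "tot_price"].tolist()[0]
        # Accumulate plain lists; no per-iteration dataframe copies.
        rows.append([pid, range_calc(price1, price2)])
    # One dataframe construction at the end.
    return pd.DataFrame(rows, columns=["Product_number", "Market_Price"])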
Sometimes when you check a condition on a Series or DataFrame, the output is itself a Series (for example a one-element boolean Series), not a single True/False.
In this case you must use any(), all(), item(), etc.
Use the print function on your condition to see the Series.
Also, I must say your code is very slow, O(n**2). You can first calculate df3 by joining df1 and df2 and then use the apply method for a fast calculation, as sketched below.
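A minimal sketch of that merge-then-apply idea, reusing the column names and the range_calc() function from the question:

import pandas as pd

# Join the two frames on the product key, then compute the price difference row by row.
df3 = df1.merge(df2, left_on="product_ID", right_on="product_code")
df3["Market_Price"] = df3.apply(
    lambda row: range_calc(row["full_price"], row["tot_price"]), axis=1
)
Last_df = df3[["product_ID", "Market_Price"]].rename(
    columns={"product_ID": "Product_number"}
)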

SMOTE in python

I am trying to use SMOTE in python and looking if there is any way to manually specify the number of minority samples.
Suppose we have 100 records of one class and 10 records of another class: if we use ratio = 1 we get 100:100, and if we use ratio = 1/2 we get 100:200. But I am looking for a way to manually specify the number of instances to be generated for each class.
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = Ndf_class_0_records.DIED.value_counts()
Ndf_class_1_record_counts = Ndf_class_1_records.DIED.value_counts()
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
In the above code, I am trying to manually specify the number for each of the classes, but I am getting the following error at the last line of code
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I understand you correctly and the documentation here, you are not passing numbers as the ratio values; you are instead passing Series objects.
The accepted types for ratio are:
float, str, dict or callable (default='auto')
Please try:
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = len(Ndf_class_0_records) ##### CHANGED THIS
Ndf_class_1_record_counts = len(Ndf_class_1_records) ##### CHANGED THIS
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
This should now work, please try!
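One note: in newer releases of imbalanced-learn the ratio argument has been renamed to sampling_strategy and fit_sample to fit_resample. If you are on such a version, passing a dict of desired per-class sample counts would look roughly like this (a sketch reusing the counts computed above):

from imblearn.over_sampling import SMOTE

# The dict values are the desired number of samples per class after resampling;
# for an over-sampler they must be at least the original class counts.
smt = SMOTE(sampling_strategy={0: Ndf_class_0_record_counts,
                               1: Ndf_class_1_record_counts * 2})
X_smote_res, y_smote_res = smt.fit_resample(X_smote, y_smote)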

How to apply a function in Pandas to a cell in every row where a different cell in that same row meets a condition?

I am trying to use the pandas string method str.zfill to add leading zeros to a cell in one column, for every row in the dataframe where another cell in that row meets a certain condition. So for any given row in my DataFrame excodes, when the value in column LOB_SVC_CD is "MTG", apply str.zfill(5) to the cell in column PS_CD. When the value in LOB_SVC_CD is not "MTG", leave the value in PS_CD as is.
I've tried a few custom functions, "np.where" and a few apply/map lambdas. I'm getting errors on all of them.
#Custom Function
def add_zero(column):
    if excodes.loc[excodes.LOB_SVC_CD == 'MTG']:
        excodes.PS_CD.str.zfill(5)
    else:
        return excodes.PS_CD
excodes['code'] = excodes.apply(add_zero)

#Custom Function with For Loop
def add_zero2(column):
    code = []
    for row(i) in column:
        if excodes.LOB_SVC_CD == 'MTG':
            code.append(excodes.PS_CD.str.zfill(5))
        else:
            code.append(excodes.PS_CD)
    excodes['Code'] = code
excodes['code'] = excodes.apply(add_zero)

#np.Where
mask = excodes[excodes.LOB_SVC_CD == 'MTG']
excodes['code'] = pd.DataFrame[np.where(mask, excodes.PS_CD.str.zfill(5), excodes.PS_CD)]

#Lambda
excodes['code'] = excodes['LOB_SVC_CD'].map(lambda x: excodes.PS_CD.str.zfill(5)) if x[excodes.LOB_SVC_CD == 'MTG'] else excodes.PS_CD)

#Assign with a "Where"
excodes.assign((excodes.PS_CD.str.zfill(5)).where(excodes.LOB_SVC_CD == 'MTG'))
The expected result is either:
a new column called "code" in which the values from "PS_CD" are given leading zeros in rows where excodes.LOB_SVC_CD == 'MTG', or
leading zeros added to the values in excodes["PS_CD"] where excodes['LOB_SVC_CD'] == 'MTG'.
The error messages I'm getting on each of the approaches I've tried are:
#Custom Function:
"ValueError: ('The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index PS_CD')"
# Custom Function with For Loop:
"SyntaxError: can't assign to function call"
#np.Where:
"ValueError: operands could not be broadcast together with shapes (152,7) (720,) (720,)"
#Apply Lambda:
"string indices must be integers"
#Assign with a "Where":
"TypeError: assign() takes 1 positional argument but 2 were given"
This seems to work :)
# Ensure the data in the PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)

# Iterate over all rows
for index in data.index:
    # If the LOB_SVC_CD is "MTG"
    if (data.loc[index, "LOB_SVC_CD"] == "MTG"):
        # Apply the zfill(5) in the PS_CD on the same row (index)
        data.loc[index, "PS_CD"] = data.loc[index, "PS_CD"].zfill(5)

# Print the result
print(data)
Alternative way (maybe a bit more Python-ish) :)
# Ensure the data in the PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)
# Custom function for applying the zfill
def my_zfill(x, y):
    return y.zfill(5) if x == "MTG" else y
# Iterate over the data applying the custom function on each row
data["PS_CD"] = pd.Series([my_zfill(x, y) for x, y in zip(data["LOB_SVC_CD"], data["PS_CD"])])
My take:
>>> import pandas
>>> df = pandas.DataFrame(data = [['123', 'MTG'],['321', 'CLOC']], columns = ['PS_CD', 'LOB_SVC_CD'])
>>> df
PS_CD LOB_SVC_CD
0 123 MTG
1 321 CLOC
>>>
>>> df['PS_CD'] = df.apply(lambda row: row['PS_CD'].zfill(5) if row['LOB_SVC_CD'] == 'MTG' else row['PS_CD'], axis='columns')
>>> df
PS_CD LOB_SVC_CD
0 00123 MTG
1 321 CLOC
The lambda returns a value for every row: the zero-filled PS_CD if LOB_SVC_CD is MTG, otherwise the original PS_CD.
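For reference, a vectorized sketch with np.where (using the column names from the question) avoids row-wise apply entirely:

import numpy as np

# Zero-fill PS_CD only where LOB_SVC_CD equals 'MTG'; keep it as-is otherwise.
excodes["PS_CD"] = excodes["PS_CD"].astype(str)
excodes["code"] = np.where(
    excodes["LOB_SVC_CD"] == "MTG",
    excodes["PS_CD"].str.zfill(5),
    excodes["PS_CD"],
)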

Defining a variable between two points with a datetime pandas series

I have a pandas dataframe, and I want to calculate a variable based on certain hours of the day. I already pulled the hours as integers out of the datetime series. When I write my conditional statements between two hours and execute my script, I get the warning "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
When I put in any() or all() in my script, the script runs but it doesn't calculate the value between the two hours. I just get back a value that is not in the conditions. Can anyone help me out?
Here is my code so far
METdata = pd.read_csv('C:\Schoolwork\GEOL 701s_HW1\MET_station\MET_Data_3.26_hourly.csv', infer_datetime_format = True, na_values = '', header = [1], skiprows = [2, 3], index_col = [0])
hour = METdata.index.hour
NET_rad_Wm2 = np.array(METdata['NR_Wm2_Avg'])
Nr = NET_rad_Wm2 * 0.0036
g_day = Nr * 0.1
g_night = Nr * 0.5
def func(hour):
    if ((hour > 8) and (hour < 17)):
        return g_night
    else:
        return g_day

g = func(hour)
If you want a Series as the return value, wrap the hours in a Series and call apply instead of calling the function directly:
g = pd.Series(hour).apply(func)
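Alternatively, since hour, g_day and g_night are all arrays of the same length, a vectorized sketch with np.where avoids the if statement entirely:

import numpy as np

# Pick g_night where the hour is strictly between 8 and 17, g_day everywhere else
# (the same condition as func above).
g = np.where((hour > 8) & (hour < 17), g_night, g_day)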
