Problem transforming a variable into logs in Python

I am using Python. I would like to create a new column which is the log transformation of column 'lights1992'.
I am using the following code:
log_lights1992 = np.log(lights1992)
I obtain the following error:
I have tried two things: 1) converting the column 'lights1992' to numeric, and 2) adding 1 to each value:
city_join['lights1992'] = pd.to_numeric(city_join['lights1992'])
city_join["lights1992"] = city_join["lights1992"] + 1
However, neither of those two solutions has worked. The variable 'lights1992' is of type float64. Do you know what the problem could be?
Edit:
The variable 'lights1992' comes from running zonal_stats on a raster 'junk1992'; maybe this affects it.
zs1 = zonal_stats(city_join, junk1992, stats=['mean'], nodata=np.nan)
city_join['lights1992'] = [x['mean'] for x in zs1]

The traceback states:
'DatasetReader' object has no attribute 'log'
Did you re-assign numpy to something else at some point? I can't find much about 'DatasetReader'. Is that a custom class?
EDIT:
I think you need to pass in the whole column, because your edit doesn't show a standalone variable named 'lights1992',
so instead of:
np.log(lights1992)
can you try passing in the Dataframe's column to log?:
np.log(city_join['lights1992'])
2ND EDIT:
Since you've reported back that it works I'll dive into the why a little bit.
In your original statement you called the log function and gave it an argument, then you assigned the result to a variable name:
log_lights1992 = np.log(lights1992)
The problem here is that when you give Python text without any quotes, it thinks you are giving it a variable name. (See how you have log_lights1992 on the left of the equals sign? You wanted to assign the result of the operation on the right-hand side to the variable name log_lights1992.) But in this case I don't think lights1992 had ever been given a value!
So there were two ways to make it work, either what I said earlier:
Instead of giving it a variable name, you give .log the column of the city_join dataframe (that's what city_join["lights1992"] is) directly.
Or
You assign the value of that column to a variable name first, then pass it in to .log, like this:
lights1992 = city_join["lights1992"]
log_lights1992 = np.log(lights1992)
Hope that clears it up for you!
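To round this out, here is a minimal runnable sketch of the working call (the dataframe contents are invented for illustration). Since the question also adds 1 to each value before logging, np.log1p, which computes log(1 + x) in one step, is worth knowing as a shortcut:

```python
import numpy as np
import pandas as pd

# Toy stand-in for city_join; the values are invented for illustration
city_join = pd.DataFrame({"lights1992": [0.0, 9.0, 99.0]})

# Pass the column itself to np.log; adding 1 first avoids log(0) = -inf
city_join["log_lights1992"] = np.log(city_join["lights1992"] + 1)

# np.log1p computes log(1 + x) directly and is equivalent here
same = np.log1p(city_join["lights1992"])
```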

Related

Looping Through Data Frames with Dynamic withColumn Injection

I'm looking to create a dynamic .withColumn, with the column expression ("rules") being replaced by an element of a list depending on the file being processed.
For example: File A has a column called "Validated" that is based on a different condition than File B, but the column has the same name in both files. So can we loop through all files A-Z, applying a different rule to the same column in each file?
Here I am trying to validate many dataframes, creating an EmailAddress_Validation field on each one. Each dataframe has a different email validation rule set. The rules are stored in a list called EmailRuleList, and as we loop through each dataset the corresponding rule EmailRuleList[i] is passed in from the list.
The code below has the syntax. Also shown, commented out with a "#" (hash), is an example of a rule.
Interestingly, if I hard-code the rule instead of taking it from the list (the # comment), the code works, except it then obviously applies the same rule to all files.
i = 0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    # EmailAddress_Validation = when((regexp_extract(col("EmailAddress"), EmailRegEx, 0)) == (col("EmailAddress")), 0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i = i + 1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", EmailAddress_Validation)
Error Message: col should be Column
EmailRuleList is something like...
['when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),1).otherwise(0)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx2,0))==(col("EmailAddress")),0).otherwise(1)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx3,0))==(col("EmailAddress")),0).otherwise(1)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx4,0))==(col("EmailAddress")),0).otherwise(1)']
I have tried lots of different things but am a bit stuck.
The error is in the last line of the for loop. The when condition that you want to check in .withColumn() is actually a string (each element of EmailRuleList is a string).
Since withColumn expects its second argument to be a Column, it raises the error. Here is a similar error when I pass something like your code to withColumn():
from pyspark.sql.functions import when,col
df.withColumn("check","when(col('gname')=='Ana','yes').otherwise('No')").show()
To make it work, I have used the eval function. The following code doesn't throw an error:
from pyspark.sql.functions import when,col
df.withColumn("check",eval("when(col('gname')=='Ana','yes').otherwise('No')")).show()
So, modify your code to the one given below to make it work:
i = 0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    # EmailAddress_Validation = when((regexp_extract(col("EmailAddress"), EmailRegEx, 0)) == (col("EmailAddress")), 0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i = i + 1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", eval(EmailAddress_Validation))
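The mechanism that makes this work is Python's built-in eval, which turns a string into the expression it spells out. Here is a Spark-free sketch of the same idea (the regex and address are invented for illustration). Note that eval executes arbitrary code, so only use it on rule strings you control:

```python
import re

# Hypothetical pattern standing in for the real EmailRegEx
EmailRegEx = r"[\w.]+@[\w.]+"

# A rule stored as a string, like the elements of EmailRuleList
rule = "1 if re.fullmatch(EmailRegEx, addr) else 0"

addr = "dan@example.com"
checked = eval(rule)   # without eval, `rule` is just text, not a condition
```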

The reason for using 'temp' variables?

The question I'm asking is: why do we use temporary variables after taking an input? For example, in the code below we've requested a 'num' from the user, then copied it into a 'temp'. Why don't we simply continue with 'num'? I can't see any point in moving it into a different variable. Why doesn't the code work if we don't make this copy? Thanks.
It is because in the while loop in the last row you change the value of temp. If you used num instead, you would change its value, and in the if/else statement you could no longer compare the sum with the original input number.
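The original code isn't shown, but the answer describes a very common pattern. Here is a hypothetical reconstruction (an Armstrong-number check) showing why the copy matters: the loop destroys its working value digit by digit, so it must operate on temp while num keeps the original for the final comparison:

```python
num = 153           # sample input; 153 = 1**3 + 5**3 + 3**3
temp = num          # copy: the loop below grinds temp down to 0
total = 0
while temp > 0:
    digit = temp % 10
    total += digit ** 3
    temp //= 10     # if this ran on num, num would be 0 by the end

# num is untouched, so the comparison still sees the original input
is_armstrong = (total == num)
```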

Python string as variable reporting NaN

I am extracting a name from a document with the code below:
find_name = re.search(r'^[^\d]*', clean_content)
Name = find_name.group(0)
NameUp = Name.upper()
Which works fine... it equals DAN STEPP as needed.
I then open up an excel file:
data1 = pd.read_excel(config.Excel1)
I pass it into a dataframe and give it headers; all this works:
df = pd.DataFrame(data1)
header = df.iloc[0]
Now when I do the search below, it erroneously returns NaN:
row_numberd1 = df[df['Member Name'].str.contains(NameUp)].index.min()
My NameUp variable equals DAN STEPP when I print and test it, so it does contain the correct value. However, when I use the variable in the search above, I get NaN.
When I replace NameUp with the literal "DAN STEPP", not using the variable, the row is found. Any thoughts on this? i.e. .str.contains("DAN STEPP")
Would you mind doing repr(NameUp)? It's slightly different from str(NameUp) in that it will print exactly what's in the string. Besides that I have no idea what to make of
row_numberd1 = df[df['Member Name'].str.contains(NameUp)].index.min()
I don't use pandas, but that's a lot of stuff in one line. I would check each step individually to see what's wrong. Since you said it returns the wrong thing with the NameUp variable, I would deconstruct df['Member Name'].str.contains(NameUp) to see what it spits out and make sure it's consistent with your testing. Have you tried any other names/values?
TL;DR: if the variable is not working and manually typing the string is, one of two things is happening: either the two strings differ in some minor way, or the two tests are not actually running the same process.
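To illustrate the repr suggestion with a made-up example: a trailing space captured by the regex is invisible to print but visible to repr, and it would make str.contains fail to find the row:

```python
name_from_doc = "DAN STEPP "      # hypothetical capture with a trailing space
print(name_from_doc)              # looks exactly like "DAN STEPP"
print(repr(name_from_doc))        # 'DAN STEPP ' -- the stray space is now visible

cleaned = name_from_doc.strip()   # strip whitespace before searching
```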

How to populate new column with function output on existing column

I have a dataframe where one of the columns is a first name. I would like to pass the first name through the gender-guesser library to get the best guess of the name's gender. However, when I attempt to create a new 'Gender' column and pass in the data from the 'First Name' column with:
df_names['Gender'] = gender.Detector().get_gender(df_names['First Name'])
I get the error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I think it has something to do with what gender-guesser is doing under the hood, but I'm not 100% sure. I get tracebacks to both gender-guesser and pandas. I am able to pass strings to the guesser and get a return without issue. I am also able to write my own super simple function that concatenates the 'First Name' data with another string and get a valid output, like:
def concat(x):
    return x + " something more"

df_names['More'] = concat(df_names['First Name'])
And that works as expected as well; creating a new column with the matching contents.
I am also able to get a single, correct, return using iloc. I have been able to get a for loop to work, but it takes too long to be practical.
It looks like you're running into an implementation detail of the get_gender method, it is most likely trying to use the First Name as the key to a dictionary, which would cause python to call the first name object's __hash__ method and throw the error (which you can see in the code).
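The hashing point can be seen with any mutable object. This toy example uses a list as the unhashable key and a plain dict standing in for the guesser's internal name table:

```python
lookup = {"dan": "male"}          # toy stand-in for the guesser's name dictionary

value = lookup["dan"]             # an immutable str key hashes fine
try:
    lookup[["dan"]]               # a mutable list (like a Series) as a key...
    failed = False
except TypeError:                 # ...cannot be hashed
    failed = True
```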
As you've already observed with your concat method, the key to getting around this might be just to cast the first name object to a string:
df_names['Gender'] = gender.Detector().get_gender(
    str(df_names['First Name'])  # make First Name a generic str instance
)
Never used gender detector, but I'm guessing this should work:
gd = gender.Detector()
df_names['Gender'] = df_names['First Name'].apply(gd.get_gender)

NetCDF variables that have a fill value / missing value

I have a variable in a NetCDF file that is given a default fill value wherever the data is null. How do you remove this value, or change it to 0, when the variable is missing a value?
It sounds like the problem is that when the variable is written into the NetCDF file, some default value is inserted for values that are missing. I am assuming that you need to remove these default values after the file has been written, while you are working with the data.
So (depending on how you are accessing the variable) I would pull the variable out of the NetCDF file and assign it to a Python variable. The first method that comes to mind is to use a for loop to step through and replace that default value with 0:
variable = NetCDF_variable  # assume the default fill value is 1e10
cleaned_list = []
for i in variable:
    if i == 1e10:
        cleaned_list.append(0)  # 0 or whatever you want to fill here
    else:
        cleaned_list.append(i)
If the default value is a float, you may want to look into numpy.isclose if the above code isn't working. You might also be interested in masking your data in case any computations you do would be thrown off by inserting a 0.
EDIT: User N1B4 provided a much cleaner and more efficient way of doing the exact same thing as above.
variable[variable == 1e10] = 0
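On the numpy.isclose note above: an exact == comparison can miss float fill values that carry rounding noise. Here is a stdlib sketch of the same idea using math.isclose for scalars (the values are invented for illustration):

```python
import math

FILL = 1e10
values = [3.5, 1e10, 10000000000.000002, 7.0]   # third value is FILL plus float noise

# Tolerant comparison catches both the exact and the noisy fill values
cleaned = [0 if math.isclose(v, FILL) else v for v in values]
```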
