How to populate new column with function output on existing column

I have a DataFrame in which one of the columns is a first name. I would like to pass the first names through the gender-guesser library to get the best guess of each name's gender. However, when I attempt to create a new 'Gender' column and pass the data from the 'First Name' column with:
df_names['Gender'] = gender.Detector().get_gender(df_names['First Name'])
I get the error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I think it has something to do with what the gender guesser is doing under the hood, but I'm not 100% sure. I get tracebacks to both the gender-guesser and pandas. I am able to pass strings to the guesser and get a return without issue. I am also able to write my own super simple function to concatenate the 'First Name' data with another string and get a valid output; like:
def concat(x):
    return x + " something more"

df_names['More'] = concat(df_names['First Name'])
And that works as expected as well; creating a new column with the matching contents.
I am also able to get a single, correct, return using iloc. I have been able to get a for loop to work, but it takes too long to be practical.

It looks like you're running into an implementation detail of the get_gender method. It is most likely using the first name as a dictionary key, which makes Python call the __hash__ method of whatever you pass in; a whole Series is not hashable, hence the error (which you can see in the code).
Your concat function works because + operates on a Series element by element, whereas get_gender expects a single string. The way around this is to cast the values to strings and call get_gender once per value:
gd = gender.Detector()
df_names['Gender'] = df_names['First Name'].astype(str).apply(gd.get_gender)  # one plain str per call

Never used gender detector but I'm guessing this should work
gd = gender.Detector()
df_names['Gender'] = df_names['First Name'].apply(gd.get_gender)
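The difference between passing the whole column and applying the function per value can be sketched as follows. This is only an illustration: `guess` and the `KNOWN` table are hypothetical stand-ins for `gender.Detector().get_gender`, assumed here to behave like a dictionary lookup, so the snippet runs without the gender-guesser package.

```python
import pandas as pd

# Hypothetical stand-in for gender.Detector().get_gender: a plain dict lookup,
# which (like any dict lookup) has to hash its argument.
KNOWN = {"Alice": "female", "Bob": "male"}

def guess(name):
    return KNOWN.get(name, "unknown")

df_names = pd.DataFrame({"First Name": ["Alice", "Bob", "Sam"]})

# Passing the whole column hands the function a Series, which cannot be hashed.
try:
    guess(df_names["First Name"])
    raised = False
except TypeError:
    raised = True

# apply() calls the function once per value, so each call receives a single str.
df_names["Gender"] = df_names["First Name"].apply(guess)
```

Creating the detector once and reusing it inside apply, as in the answer above, avoids rebuilding it for every row.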

Looping Through Data Frames with Dynamic withColumn Injection

I'm looking to create a dynamic .withColumn, with the column "rules" being supplied from a list depending on the file being processed.
For example: File A has a column called "Validated" that is based on a different condition from the one used for File B, but the column has the same name in both. So can we loop through all files A-Z, applying different rules for the same column in each file?
Here I am trying to validate many dataframes by creating an EmailAddress_Validation field on each one. Each dataframe has a different email validation rule set. The rules are stored in a list called EmailRuleList. As we loop through each dataset, the corresponding rule "EmailRuleList[i]" is passed in from the list.
The code below shows the syntax. An example of a rule is also included, commented out with a "#" (hash).
Interestingly, if I supply the rule without the loop (the # comment), the code works, except that it then obviously applies the same rule to all files.
i = 0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    #EmailAddress_Validation = when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i = i + 1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", EmailAddress_Validation)
Error Message: col should be Column
EmailRuleList is something like...
['when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),1).otherwise(0)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx2,0))==(col("EmailAddress")),0).otherwise(1)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx3,0))==(col("EmailAddress")),0).otherwise(1)',
'when((regexp_extract(col("EmailAddress"),EmailRegEx4,0))==(col("EmailAddress")),0).otherwise(1)']
I have tried lots of different things but am a bit stuck.

The error is in the last line of the for loop. Each element of EmailRuleList is a string, so the when condition you pass to .withColumn() is a string rather than a Column expression.
Since withColumn expects its second argument to be a Column, it raises the error. I get a similar error when I pass a string in the same way:
from pyspark.sql.functions import when,col
df.withColumn("check","when(col('gname')=='Ana','yes').otherwise('No')").show()
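To make it work, I have used the eval function, which evaluates the string as Python code and returns the resulting Column (only do this with strings you trust). The following code does not throw an error: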
from pyspark.sql.functions import when,col
df.withColumn("check",eval("when(col('gname')=='Ana','yes').otherwise('No')")).show()
So, modify your code to the one given below to make it work:
i = 0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    #EmailAddress_Validation = when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i = i + 1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", eval(EmailAddress_Validation))
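An alternative that avoids both eval and the vars() lookups is to store the rules as real objects rather than strings, keyed by file name. The sketch below shows the idea in plain Python so it runs without Spark; the patterns, file names, and sample rows are all hypothetical. In PySpark the dictionary values would be the when(...).otherwise(...) Column expressions themselves, written without quotes.

```python
import re

# Hypothetical patterns standing in for EmailRegEx, EmailRegEx2, ...
STRICT = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}")
LOOSE = re.compile(r"[^@\s]+@[^@\s]+")

def make_rule(pattern):
    """Return a function giving 0 when the whole value matches and 1 otherwise."""
    def rule(email):
        return 0 if pattern.fullmatch(email) else 1
    return rule

# One rule per file, keyed by name instead of by list position.
rules_by_file = {"FileA": make_rule(STRICT), "FileB": make_rule(LOOSE)}

# Hypothetical stand-ins for the dataframes: a list of email values per file.
rows_by_file = {
    "FileA": ["a@example.com", "bad@host"],
    "FileB": ["bad@host", "nope"],
}

# Apply each file's own rule to that file's rows.
validated = {
    name: [(email, rules_by_file[name](email)) for email in rows]
    for name, rows in rows_by_file.items()
}
```

Keying by name also removes the manual counter i, so a rule can never drift out of step with its file.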

Problem transforming a variable in logs, python

I am using Python. I would like to create a new column which is the log transformation of column 'lights1992'.
I am using the following code:
log_lights1992 = np.log(lights1992)
I obtain the following error:
I have tried two things: adding 1 to each value, and converting the column 'lights1992' to numeric.
city_join['lights1992'] = pd.to_numeric(city_join['lights1992'])
city_join["lights1992"] = city_join["lights1992"] + 1
However, neither solution has worked. The variable 'lights1992' is of type float64. Do you know what the problem could be?
Edit:
The variable 'lights1992' comes from running zonal statistics on a raster 'junk1992'; maybe this affects it.
zs1 = zonal_stats(city_join, junk1992, stats=['mean'], nodata=np.nan)
city_join['lights1992'] = [x['mean'] for x in zs1]

The traceback states:
'DatasetReader' object has no attribute 'log'.
Did you re-assign numpy to something else at some point? I can't find much about 'DatasetReader'. Is that a custom class?
EDIT:
I think you would need to pass the whole column because your edit doesn't show a variable named 'lights1992'
so instead of:
np.log(lights1992)
can you try passing in the Dataframe's column to log?:
np.log(city_join['lights1992'])
2ND EDIT:
Since you've reported back that it works I'll dive into the why a little bit.
In your original statement you called the log function and gave it an argument, then you assigned the result to a variable name:
log_lights1992 = np.log(lights1992)
The problem here is that when you give Python text without any quotes, it treats it as a variable name (see how you have log_lights1992 on the left of the equals sign? You wanted to assign the result of the operation on the right-hand side to the variable name log_lights1992). In this case, though, lights1992 did not refer to your column. Judging by the traceback, it was most likely bound to a DatasetReader object (for example an opened raster) rather than to numeric data.
So there were two ways to make it work, either what I said earlier:
Instead of giving it a variable name, you give .log the column of the city_join dataframe directly (that's what city_join["lights1992"] is).
Or
You assign the value of that column to the variable name first then you pass it in to .log, like this:
lights1992 = city_join["lights1992"]
log_lights1992 = np.log(lights1992)
Hope that clears it up for you!
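A small sketch of the two situations, assuming numpy and pandas are available. FakeReader is a hypothetical stand-in for whatever non-numeric object the bare name lights1992 was bound to, and the sample values are made up. The exact exception type raised by np.log on such an object depends on the NumPy version, so both are caught.

```python
import numpy as np
import pandas as pd

class FakeReader:
    """Hypothetical stand-in for a non-numeric object such as an opened raster."""

city_join = pd.DataFrame({"lights1992": [0.0, 1.5, 9.0]})
lights1992 = FakeReader()  # the bare name refers to this object, not to the column

# np.log cannot take the logarithm of an arbitrary object.
try:
    np.log(lights1992)
    failed = False
except (AttributeError, TypeError):
    failed = True

# Passing the dataframe column works; adding 1 first keeps zeros finite.
city_join["log_lights1992"] = np.log(city_join["lights1992"] + 1)
```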

Calling a specific Pandas Dataframe from user input to use in a function?

This might have been answered, but I can't quite find the right group of words to search for to find the answer to the problem I'm having.
Situation:
I have several data frames that could be plugged into a function. The function requires that I pass in the data frame so that it can read its shape.
def heatmap(table, cmap=cm.inferno, vmin=None, vmax=None, inner_r=0.25, pie_args={}):
    n, m = table.shape
I want the user to be able to specify the data frame to use as the table like this:
table_name= input('specify Table to Graph: ')
heatmap(table_name)
Expectation: If the user input was TableXYZ then the variable table_name would reference TableXYZ so the function would be able to find the shape of TableXYZ to use that information in other parts of the function.
What actually happens: When I try to run the code I get an "AttributeError: 'str' object has no attribute 'shape'". I see that the table_name input is a string object, but I'm trying to reference the data frame itself, not the name.
I feel like I'm missing a step to turn the user's input to something the function can actually take the shape of.

I'd recommend assigning each of your dataframes to a dictionary, then retrieving the dataframe by name from the dictionary to pass it to the heatmap function.
For example:
df_by_name = {"df_a": pd.DataFrame(), "df_b": pd.DataFrame()}
table_name= input('specify Table to Graph: ')
df = df_by_name[table_name]
heatmap(df)
Replace pd.DataFrame() in the first line with the actual data frames that you want to select from.
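Since the name comes from user input, it may be worth handling names that are not in the dictionary. A minimal sketch, assuming pandas is available; pick_table and the sample tables are hypothetical, and the name is hard-coded where input() would normally supply it.

```python
import pandas as pd

def pick_table(name, tables):
    """Return the data frame registered under name, or raise a readable error."""
    try:
        return tables[name]
    except KeyError:
        raise ValueError(
            "Unknown table {!r}; choose from {}".format(name, sorted(tables))
        ) from None

# Hypothetical tables; in practice these are the real data frames.
tables = {
    "TableXYZ": pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
    "TableABC": pd.DataFrame({"a": [1]}),
}

# table_name would normally come from input('specify Table to Graph: ')
table_name = "TableXYZ"
n, m = pick_table(table_name, tables).shape
```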

How can I manipulate a DataFrame name within a function?

How can I manipulate a DataFrame name within a function so that I can have a new DataFrame with a new name that is derived from the input DataFrame name in return?
Let's say I have this:
def some_func(df):
    # some operations
    return df_copy
and whatever df I put inside this function it should return the new df as ..._copy, e.g. some_func(my_frame) should return my_frame_copy.
Things that I considered are as follows:
As in string operations;
new_df_name = "{}_copy".format(df) -- I know this will not work since the df refers to an object but it just helps to explain what I am trying to do.
def date_timer(df):
    df_copy = df.copy()
    dates = df_copy.columns[df_copy.columns.str.contains('date')]
    for i in range(len(dates)):
        df_copy[dates[i]] = pd.to_datetime(df_copy[dates[i]].str.replace('T', ' '), errors='coerce')
    return df_copy
Actually this was the first thing that I tried. If only DataFrame had a "name" attribute that let us manipulate the name, but that is not there either:
df.name
Maybe an f-string or some other kind of string operation could make it happen. If not, it might not be possible in Python.
I think this might be related to the variable name assignment rules in Python. In a sense, what I want is to reverse engineer that, but it is probably not possible.
Please advise...

It looks like you're trying to access / dynamically set the global/local namespace of a variable from your program.
Unless your data object belongs to a more structured namespace object, I'd discourage you from dynamically setting names with such a method since a lot can go wrong, as per the docs:
Changes may not affect the values of local and free variables used by the interpreter.
The name attribute of your df is not an ideal solution since the state of that attribute will not be set on default. Nor is it particularly common. However, here is a solid SO answer which addresses this.
You might be better off storing your data objects in a dictionary, using dates or something meaningful as keys. Example:
my_data = {}
for my_date in dates:
    df_temp = df.copy(deep=True)  # deep copy ensures no changes are translated to the parent object
    # Modify your df here (not sure what you are trying to do exactly)
    df_temp[my_date] = "foo"
    # Now save that df
    my_data[my_date] = df_temp
Hope this answers your Q. Feel free to clarify in the comments.
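Applied to the question's own naming scheme, the dictionary keys can carry the derived names, so "my_frame_copy" becomes a key rather than a variable. A sketch under that assumption, with pandas available; the frames dictionary and the sample data are hypothetical, and date_timer follows the question's function.

```python
import pandas as pd

def date_timer(df):
    """Return a copy in which every column whose name contains 'date' is parsed as datetimes."""
    df_copy = df.copy()
    for column in df_copy.columns[df_copy.columns.str.contains("date")]:
        df_copy[column] = pd.to_datetime(
            df_copy[column].str.replace("T", " "), errors="coerce"
        )
    return df_copy

# Hypothetical input: frames are registered under the names you would have used as variables.
frames = {
    "my_frame": pd.DataFrame({"start_date": ["2020-01-02T03:04:05", "not a date"]})
}

# Store each result under a name derived from the input name.
for name in list(frames):
    frames["{}_copy".format(name)] = date_timer(frames[name])
```

Looking up frames["my_frame_copy"] then plays the role of the my_frame_copy variable the question asks for.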

Instancing maya objects with a sequential suffix, object name string not seen by cmds.instance

I have a question about string usage in lists in Python for Maya. I am writing a script meant to take a selected object, then instance it 100 times with random translate, scale, and orient attributes. The script itself works and does what it's meant to. However, I haven't been able to work out how to instance the objects with the original object name plus a suffix of the form "_instance#", where # assigns 1, 2, 3, etc. in order to the copies of the original mesh. This is where I'm at so far:
#Capture selected objects, sort into list
thing = MC.ls(sl=True)
print thing
#Create instances of objects
instanceObj = MC.instance(thing, name='thing' + '_instance#')
This returns a result that looks like "thing_instance1, thing_instance2".
Following this, I figured the single quotes around the object's name were causing it to be named literally "thing", so I attempted to write it as follows:
MC.instance(thing, name=thing + '_instance1')
I guess because instance uses a list, it's not accepting the second usage of the string as valid, and it returns a concatenation error. I've tried rewriting this a few times and the closest I get is with
instanceObj = MC.instance(thing)
which results in a list of (pCube1,2,3,4), but is lacking the suffix.
I'm not sure where to go from here to end up with a result where the instanced objects are named with the convention "pCube1_instance1, pCube1_instance2" etc.
Any assistance would be appreciated.

It is not clear whether you want to use only one source object or several. In any case,
MC.ls(sl=True)
returns a list of strings, and concatenating a list and a string does not work. So use thing[0] or simply
MC.ls(sl=True)[0]
If you get error messages, please always include the message in your question; it helps a lot to see which error appears.
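Building the names from the first selected object can be sketched in plain Python as below. The helper instance_names and the hard-coded selection are hypothetical, so the snippet runs outside Maya; inside Maya, each name would then be passed along the lines of MC.instance(selection[0], name=name).

```python
def instance_names(source, count):
    """Build names such as 'pCube1_instance1' ... 'pCube1_instanceN' for one source object."""
    return ["{}_instance{}".format(source, i) for i in range(1, count + 1)]

# Hypothetical stand-in for MC.ls(sl=True), which returns a list of strings.
selection = ["pCube1"]

# Index the list to get a single string before concatenating or formatting.
names = instance_names(selection[0], 3)
```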
