Using .apply in Python to apply a mapper

this should be very simple but I can't figure it out.
I have a 'mapper' DataFrame that looks something like this:
import pandas as pd

mapper = {'old_values': [105, 312, 269], 'new_values': [849, 383, 628]}
df = pd.DataFrame(mapper)
I then have another dataframe with a column that contains old values. I simply want to convert them all to their new values (e.g. all 105's should become 849's). I think I need to use df.apply but I can't find an example of how to do this.
Thanks in advance.

It's better to use the Series.map method, which behaves much like a Python dictionary for mapping values from one series to another, than to reach for a slow apply function here.
df['old_values'].map(df.set_index('old_values')['new_values'])
Out[12]:
0 849
1 383
2 628
Name: old_values, dtype: int64
The only modification you need to make here is:
new_df['old_values'].map(old_df.set_index('old_values')['new_values'])
But do note that this introduces NaN for keys not found in the original DF (any unseen value encountered in the new DF would be coerced to NaN).
If this is the behavior you'd expect then map is an ideal method.
However, if your intention is to simply replace the values and leave the missing keys as they were before, you can opt for the Series.replace method.
new_df['old_values'].replace(old_df.set_index('old_values')['new_values'])
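To illustrate the difference between the two, here is a minimal sketch (the new_df below is hypothetical, with 999 standing in for a value that does not appear in the mapper):
import pandas as pd
mapper = pd.DataFrame({'old_values': [105, 312, 269], 'new_values': [849, 383, 628]})
lookup = mapper.set_index('old_values')['new_values']
new_df = pd.DataFrame({'old_values': [105, 312, 999]})  # 999 is not a known key
print(new_df['old_values'].map(lookup))      # 849.0, 383.0, NaN -> unseen key becomes NaN
print(new_df['old_values'].replace(lookup))  # 849, 383, 999 -> unseen key is left as-is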

Related

How do I pull only the rows where the values are in the 1980s and 1990s?

I'm currently working in Pandas on Python and I was wondering how to pull the values listed above from this dataframe. They will be used to create a new dataframe called data1_CBS_80s_90s. For reference I also attached a screenshot.
As a complement to @Yu-Sheng Li's answer (probably the best one), if the column type is string you can match with a regex:
data1_CBS_80s_90s = data1_CBS[data1_CBS['Year'].str.match('19[89].')]
EDIT: upper bound modified from 1990 to 1999. Thanks @mozway for the comment.
EDIT: this assumes the column Year is of type int.
If it is of type str then you need to convert it first.
Not sure if the following is what you want.
data1_CBS_80s_90s = data1_CBS[data1_CBS['Year'].between(1980, 1999)]
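If Year is stored as strings, a minimal sketch of the conversion mentioned above (assuming the values are plain four-digit years):
data1_CBS['Year'] = data1_CBS['Year'].astype(int)
data1_CBS_80s_90s = data1_CBS[data1_CBS['Year'].between(1980, 1999)]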

pandas cleaning 1+1 values in a column

I have a column that has the following data
column
------
1+1
2+3
4+5
How do I get pandas to sum these values so that the output is 2, 5, 9 instead of the above?
Many thanks
Your column obviously contains strings, so you must somehow evaluate them. Use the pd.eval function, e.g.
frame['column'].apply(pd.eval)
If you are interested in performance, you should probably use an alternative method, like ast.literal_eval. Thanks to user @Serge Ballesta for mentioning it.
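As a rough alternative sketch that avoids evaluating strings altogether (assuming every cell has the simple 'a+b' form from the question), you could also split on '+' and sum the parts:
import pandas as pd
frame = pd.DataFrame({'column': ['1+1', '2+3', '4+5']})
# split each string on '+' and sum the integer parts
totals = frame['column'].str.split('+').apply(lambda parts: sum(map(int, parts)))
print(totals.tolist())  # [2, 5, 9]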

What is the best way to calculate the mean of the values of a pandas dataframe with np.nan in it?

I'm trying to calculate the mean of the values (all of them numeric, not like in the 'How to calculate the mean of a pandas DataFrame with NaN values' question) of a pandas dataframe containing a lot of np.nan in it.
I've come up with this code, which works quite well by the way:
my_df = pd.DataFrame([[0, 10, np.nan, 220],
                      [1, np.nan, 21, 221],
                      [2, 12, 22, np.nan],
                      [np.nan, 13, np.nan, np.nan]])
print(my_df.values.flatten()[~np.isnan(my_df.values.flatten())].mean())
However, I found that the following line of code gives the same result, and I don't understand why:
print(my_df.values[~np.isnan(my_df.values)].mean())
Is this really the same, and can I use it safely?
I mean, my_df.values[~np.isnan(my_df.values)] is still an array that is not flat, so what happened to the np.nan in it?
Any improvement is welcome if you see a more efficient and pythonic way to do that.
Thanks a lot.
Is this really the same, and can I use it safely?
Yes, since numpy here masks away the NaNs and then calculates the mean over that array. But you are overcomplicating it here.
You can use numpy's nanmean(..) [numpy-doc] here:
>>> np.nanmean(my_df)
52.2
The NaN values are thus not taken into account (neither in the sum nor in the count of the mean). I think this is more declarative than calculating the mean with masking, since the above says what you are doing, not so much how you are doing it.
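To see why both of the original expressions agree: indexing a 2-D array with a boolean mask already returns a flat 1-D array of the unmasked values, so the extra flatten() is redundant. A quick check:
import numpy as np
import pandas as pd
my_df = pd.DataFrame([[0, 10, np.nan, 220], [1, np.nan, 21, 221], [2, 12, 22, np.nan], [np.nan, 13, np.nan, np.nan]])
masked = my_df.values[~np.isnan(my_df.values)]
print(masked.ndim)   # 1 -> already flat
print(masked.mean()) # 52.2, same as np.nanmean(my_df)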
In case you want to count the NaNs, we can replace them with 0 like @abdullah.cu says:
>>> my_df.fillna(0).values.mean()
32.625

How to ignore NaN in the dataframe for Mann-whitney u test?

I have a dataframe as below.
I want p-value of Mann-whitney u test by comparing each column.
As an example, I tried below.
from scipy.stats import mannwhitneyu
mannwhitneyu(df['A'], df['B'])
This results in the following values.
MannwhitneyuResult(statistic=3.5, pvalue=1.8224273379076809e-05)
I wondered whether NaN affected the result, so I made the df2 and df3 dataframes described in the figure and tried the following.
mannwhitneyu(df2, df3)
This resulted in
MannwhitneyuResult(statistic=3.5, pvalue=0.00025322465545184154)
So I think NaN values affected the result.
Does anyone know how to ignore NaN values in the dataframe?
You can use df.dropna(); you can find extensive documentation here: dropna.
As per your example, the syntax would go something like this:
mannwhitneyu(df['A'].dropna(),df['B'])
As you can see, there is no argument in the mannwhitneyu function allowing you to specify its behavior when it encounters NaN values, but if you inspect its source code, you can see that it doesn't take NaN values into account when calculating some of the key values (n1, n2, ranked, etc.). This makes me suspicious of any results that you'd get when some of the input values are missing. If you don't feel like implementing the function yourself with NaN-ignoring capabilities, probably the best thing to do is to either create new arrays without missing values as you've done, or use df['A'].dropna() as suggested in the other answer.
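For completeness, a minimal sketch (with hypothetical data standing in for the original df) that drops the NaNs from each column independently before running the test; mannwhitneyu does not require the two samples to have the same length:
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
df = pd.DataFrame({'A': [1.2, 3.4, np.nan, 2.2, 5.1], 'B': [0.9, np.nan, 2.8, 3.3, 4.0]})
result = mannwhitneyu(df['A'].dropna(), df['B'].dropna())
print(result)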

What is the best way to modify (e.g., perform math functions on) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds a new column, using the index number as a string-type column name. Is there something akin to the Pandas method of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces; however, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
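For example, a minimal sketch (with a hypothetical two-column frame) that negates one column on every partition via .map_partitions:
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({'x': [1.0, -2.0, 3.0], 'y': [4.0, 5.0, 6.0]})
df = dd.from_pandas(pdf, npartitions=2)
def negate_x(part):
    # each partition is a plain pandas DataFrame
    part = part.copy()
    part['x'] = part['x'] * -1
    return part
df = df.map_partitions(negate_x)
print(df.compute())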
In the future we recommend asking separate questions separately on Stack Overflow.
