Cell-wise calculations in a Pandas Dataframe - python

I have what I'm sure is a fundamental lack of understanding about how dataframes work in Python. I am sure this is an easy question, but I have looked everywhere and can't find a good explanation. I am trying to understand why sometimes dataframe calculations seem to run on a row-by-row (or cell by cell) basis, and sometimes seem to run for an entire column... For example:
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df
Which gives:
Name Depth
0 49-037-23094 20
1 49-029-21476 21
2 49-029-20812 7
3 49-041-21318 18
Now I know I can do:
df['DepthDouble'] = df['Depth']*2
And get:
Name Depth DepthDouble
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
Which is what I would expect. But this doesn't always work, and I'm trying to understand why. For example, I am trying to run this code to modify the name:
df['newName']=''.join(re.findall('\d',str(df['Name'])))
which gives:
Name Depth DepthDouble \
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
newName
0 04903723094149029214762490292081234904121318
1 04903723094149029214762490292081234904121318
2 04903723094149029214762490292081234904121318
3 04903723094149029214762490292081234904121318
So it is taking all the values in my name column, removing the dashes, and concatenating them. Of course, I'd just like it to be a new name column exactly the same as the original "Name" column, but without the dashes.
So, can anyone help me understand what I'm doing wrong here? I don't understand why DataFrame calculations for one column are sometimes done row by row (e.g., the DepthDouble column) and sometimes Python seems to take all values in the entire column and run the calculation once (e.g., the newName column).
Surely the way to get around this isn't to make a loop over every index in the df to force it to run individually for each row of a given column?

If the output you're looking for is:
Name Depth newName
0 49-037-23094 20 4903723094
1 49-029-21476 21 4902921476
2 49-029-20812 7 4902920812
3 49-041-21318 18 4904121318
The way to get this is:
import re

df['newName'] = df['Name'].map(lambda name: ''.join(re.findall(r'\d', name)))
map is like apply but specifically for Series objects. Since you're applying to only the Name column you are operating on a Series.
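For instance, Series.apply works just as well here, since both call the function once per element of the Series; an equivalent sketch:
df['newName'] = df['Name'].apply(lambda name: ''.join(re.findall(r'\d', name)))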
If the lambda part is confusing, an equivalent way to write it is:
def find_digits(name):
    return ''.join(re.findall(r'\d', name))

df['newName'] = df['Name'].map(find_digits)
The equivalent operation with a traditional for loop is:
new_names = []
for name in df['Name']:
    # collect the digits of each name, one row at a time
    new_names.append(''.join(re.findall(r'\d', name)))
df['newName'] = new_names
Even with the loop written cleanly, you can see how much simpler the first approach is, and it's also faster. As you've already indicated you know, avoid for loops when using pandas.

The issue is that with str(df['Name']) you are converting the entire Name column of your DataFrame into one single string. What you want instead is to use one of pandas' own string methods, which is applied to every single element of the column.
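You can see what that single string looks like by printing it; roughly (display details may vary with your pandas version):
print(str(df['Name']))
# 0    49-037-23094
# 1    49-029-21476
# 2    49-029-20812
# 3    49-041-21318
# Name: Name, dtype: object
re.findall('\d', ...) then collects every digit in that printed block, including the index labels 0-3, which is exactly why your newName values start with a 0 and have extra digits interleaved between the names.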
For example, you could use pandas' replace method for strings:
import pandas as pd
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df['newName'] = df['Name'].str.replace('-', '')
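If you instead want to strip everything that isn't a digit (matching the findall approach above), a regex sketch of the same idea:
df['newName'] = df['Name'].str.replace(r'\D', '', regex=True)  # drop all non-digit characters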

Related

Using an `if` statement inside a Pandas DataFrame's `assign` method

Intro and reproducible code snippet
I'm having a hard time performing an operation on a few columns that requires checking a condition with an if/else statement.
More specifically, I'm trying to perform this check within the confines of the assign method of a Pandas Dataframe. Here is an example of what I'm trying to do
# Importing Pandas
import pandas as pd
# Creating synthetic data
my_df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'col2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 1010]})

# Creating a separate output DataFrame that doesn't overwrite
# the original input DataFrame
out_df = my_df.assign(
    # Successfully creating a new column called `col3` using a lambda function
    col3=lambda row: row['col1'] + row['col2'],
    # Using a new lambda function to perform an operation on the newly
    # generated column.
    bleep_bloop=lambda row: 'bleep' if (row['col3'] % 8 == 0) else 'bloop')
The code above yields a ValueError:
ValueError: The truth value of a Series is ambiguous
When trying to investigate the error, I found this SO thread. It seems that lambda functions don't always play nicely with conditional logic on a DataFrame, mostly because assign hands the lambda entire Series/DataFrames rather than single rows.
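A minimal sketch of the underlying problem, assuming nothing beyond pandas itself: if needs a single boolean, but a comparison on a Series produces a whole Series of booleans.
import pandas as pd

s = pd.Series([8, 9, 16])
mask = (s % 8 == 0)  # a Series of booleans, not one True/False
try:
    label = 'bleep' if mask else 'bloop'
except ValueError as err:
    print(err)  # "The truth value of a Series is ambiguous. ..."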
A few dirty workarounds
Use apply
A dirty workaround would be to make col3 using the assign method as indicated above, but then create the bleep_bloop column using an apply method instead:
out_sr = (my_df.assign(
              col3=lambda row: row['col1'] + row['col2'])
          .apply(lambda row: 'bleep' if (row['col3'] % 8 == 0)
                 else 'bloop', axis=1))
The problem here is that the code above returns only a Series with the results of the bleep_bloop column, instead of a new DataFrame with both col3 and bleep_bloop.
On the fly vs. multiple commands
Yet another approach would be to break one command into two:
out_df_2 = my_df.assign(col3=lambda row: row['col1'] + row['col2'])
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3'] % 8 == 0)
                                         else 'bloop', axis=1)
This also works, but I'd really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.
Back to the main question
Given that the workarounds I showed above are messy and don't really get the job done like I need, is there any other way I can create a new column based on a conditional if/else statement?
The example I gave here is pretty simple, but consider that the real world application would likely involve applying custom-made functions (e.g.: out_df=my_df.assign(new_col=lambda row: my_func(row)), where my_func is some complex function that uses several other columns from the same row as inputs).
Your mistake is that you considered the lambda to act on rows, while inside assign it receives the full DataFrame and acts on whole columns in a vectorized way. You need to use vectorized functions:
import numpy as np

out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    bleep_bloop=lambda d: np.where(d['col3'] % 8, 'bloop', 'bleep')
)
print(out_df)
Output:
   col1  col2  col3 bleep_bloop
0     1    11    12       bloop
1     2    22    24       bleep
2     3    33    36       bloop
3     4    44    48       bleep
4     5    55    60       bloop
5     6    66    72       bleep
6     7    77    84       bloop
7     8    88    96       bleep
8     9    99   108       bloop
9    10  1010  1020       bloop
Or for more than 2 conditions you can use np.select:
import numpy as np
out_df = (my_df.assign(
    col3=lambda df_: df_['col1'] + df_['col2'],
    bleep_bloop=lambda df_: np.select(condlist=[df_['col3'] % 8 == 0,
                                                df_['col3'] % 8 == 1,
                                                df_['col3'] > 100],
                                      choicelist=['bleep',
                                                  'bloop',
                                                  'bliip'],
                                      default='bluup')))
The good thing about np.select is that it works like np.where (vectorized, therefore faster) and you can supply as many conditions as you want.
Since, as you mentioned, you will need complex logic for your final column, it makes sense to define a separate function for it and apply it to the rows.
def my_func(x):
    if (x['col1'] + x['col2']) % 8 == 0:
        return 'bleep'
    else:
        return 'bloop'

my_df['bleep_bloop'] = my_df.apply(my_func, axis=1)
When you pass x to the function, you are in fact passing each row, and you can use any of the column values inside your function, like x['col1'] and so on. This way you can create as complex a function as you need. Note that axis=1 is required here to pass rows rather than columns.
I did not include the creation of col3, just to keep the sample short.
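If you still want everything in one chained command, one sketch (relying on the documented fact that assign evaluates its keyword arguments in order, so later ones can see col3) is to do the row-wise apply inside assign:
out_df = my_df.assign(
    col3=lambda d: d['col1'] + d['col2'],
    # d here is the DataFrame with col3 already added, so a row-wise
    # apply can use the new column for any arbitrarily complex logic
    bleep_bloop=lambda d: d.apply(
        lambda row: 'bleep' if row['col3'] % 8 == 0 else 'bloop', axis=1))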

Iterating over a DataFrame and using the replace method based on conditions

I am attempting to iterate over a specific column in my dataframe.
The column is:
df['column'] = ['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill']
I am trying to clean this column and eventually get it all to integers to do more work with. I am stuck on the step to clean out "million." I would like to replace the "million" with five zeros when there is a decimal (ie 1.4million becomes 1.400000) and the "million" with six zeros when there is no decimal (ie 100million becomes 100000000).
To simplify, the first step I'm trying is to just focus on filtering out the values with a decimal and replace those with 5 zeros. I have attempted to use np.where for this, however I cannot use the replace method with numpy.
I also attempted to use pd.DataFrame.where, but am getting an error:
for i, row in df.iterrows():
    df.at[i, 'column'] = pd.DataFrame.where('.' in df.at[i, 'column'],
                                            df.at[i, 'column'].replace('million', ''),
                                            df.at[i, 'column'])
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I'm sure there is something I'm missing here. (I'm also sure that I'll be told I don't need to use iterrows here, so I am open to suggestions on that as well.)
Given your sample data - it looks like you can strip out commas and then take all digits (and . characters) until the string mill or end of string and split those out, eg:
x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')
This'll give you:
         0        1
0      1.4  million
1  1235000      NaN
2      100  million
3      NaN      NaN
4       14  million
5      2.5     mill
Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, eg:
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
That'll give you:
0      1400000.0
1      1235000.0
2    100000000.0
3            NaN
4     14000000.0
5      2500000.0
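Putting the two steps together, a self-contained sketch assuming the sample column from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': ['1.4million', '1,235,000', '100million',
                              np.nan, '14million', '2.5mill']})
# strip commas, then split the number from any trailing 'mill...' suffix
x = df['column'].str.replace(',', '').str.extract(r'(.*?)(mill.*)?$')
# multiply by a million only where a suffix was present
df['as_number'] = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)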
Try this:
df['column'].apply(lambda x: x.replace('million', '00000'))
Make sure your dtype is string before applying this.
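A sketch of that dtype coercion (note that astype(str) turns NaN into the literal string 'nan', so you may want to drop or fill missing values first):
df['column'] = df['column'].astype(str)  # NaN becomes the string 'nan' here
df['column'] = df['column'].apply(lambda x: x.replace('million', '00000'))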
For the given data:
df['column'].apply(lambda x: float(str(x).split('m')[0]) * 10**6
                   if 'million' in str(x) or 'mill' in str(x) else x)
If the column may contain many different forms of 'million', use a regex search instead.
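For example, a sketch of such a regex-based approach (the pattern and helper name are just illustrative):
import re

def parse_millions(x):
    s = str(x).replace(',', '')
    # treat any 'mill...' suffix ('mill', 'million', ...) as a x1,000,000 multiplier
    if re.search(r'mill', s):
        return float(re.sub(r'mill.*', '', s)) * 10**6
    return x

df['column'] = df['column'].apply(parse_millions)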

Python Pandas: Indexing Instances of a Dataframe Separately within a For Loop

I have the following Dataframe:
Rec    Channel  Value1  Value2
Pre             10      20
Pre             35      42
Event  A        23      39
FF              50      75
Post   A        79      11
Post   B        88      69
And have got to the point where with the following code:
res = df[df['Channel'].isin({'A', 'B'})]
I am able to find all the instances in the DataFrame where the column 'Channel' has values of either A or B. I am now trying to find a way to use a for loop so that it will go through and print each row where A or B is found, separately.
The reasoning for a For Loop is that this is just a sample Dataframe, my application is going to have a dynamic value of A and B's found depending on the Dataframe and I would like to be able to call upon each individually regardless of the number of instances.
Additionally, I would like an easy way to index the first and last instance where an A or B is found (again, the location is going to change from DataFrame to DataFrame), so I can't just do:
res1 = res.loc[4]
to identify the first one in this case. I need something more robust, so that regardless of the index I can call upon the first and last instance. Can someone please assist?
It would go something like this:
res = df[df.Channel.isin(['A', 'B'])]
for row in res.iterrows():
    row_index = row[0]
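To also cover the first/last part of the question, a sketch using positional indexing, which works whatever the index labels happen to be:
res = df[df['Channel'].isin(['A', 'B'])]

# print each matching row separately
for idx, row in res.iterrows():
    print(idx, row['Channel'], row['Value1'], row['Value2'])

# first and last matching rows, regardless of their index values
first_row, last_row = res.iloc[0], res.iloc[-1]
first_idx, last_idx = res.index[0], res.index[-1]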

Apply Feature Hashing to specific columns from a DataFrame

I'm a bit lost with the use of Feature Hashing in Python pandas.
I have a DataFrame with multiple columns, containing information of different types. One column represents a class for the data.
Example:
   col1  col2 colType
1     1     2     'A'
2     1     1     'B'
3     2     4     'C'
My goal is to apply feature hashing to colType, in order to be able to apply a machine learning algorithm.
I have created a separate DataFrame for the colType, having something like this:
  colType  value
1     'A'      1
2     'B'      2
3     'C'      3
4     'D'      4
Then I applied feature hashing to this class DataFrame. But I don't understand how to add the result of the feature hashing to my DataFrame with the info, in order to use it as input to a machine learning algorithm.
This is how I use FeatureHashing:
from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=10, input_type='string')
result = fh.fit_transform(categoriesDF)
How do I insert this FeatureHasher result, to my DataFrame? How bad is my approach? Is there any better way to achieve what I am doing?
Thanks!
I know this answer comes in late, but I stumbled upon the same problem and found this works:
fh = FeatureHasher(n_features=8, input_type='string')
sp = fh.fit_transform(df['colType'])
# build a DataFrame from the sparse matrix the hasher returns
hashed = pd.DataFrame(sp.toarray(),
                      columns=['fh1', 'fh2', 'fh3', 'fh4', 'fh5', 'fh6', 'fh7', 'fh8'])
# concatenate the hashed features onto the original DataFrame
df = pd.concat([df, hashed], axis=1)
This creates a DataFrame out of the sparse matrix returned by the FeatureHasher and concatenates it to the existing DataFrame.
I have since switched to one-hot encoding, using something like this:
categoriesDF = pd.get_dummies(categoriesDF)
This creates an indicator column, filled with 1 or 0, for every distinct category value.
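For completeness, a sketch of applying it directly to the whole frame from the question (column names as above):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [2, 1, 4],
                   'colType': ['A', 'B', 'C']})
# adds one indicator column per category (colType_A, colType_B, colType_C)
# and drops the original colType column
encoded = pd.get_dummies(df, columns=['colType'])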

I need to create a python list object, or any object, out of a pandas DataFrame object grouping pieces of values from different rows

My DataFrame has a string in the first column, and a number in the second one:
GEOSTRING IDactivity
9 wydm2p01uk0fd2z 2
10 wydm86pg6r3jyrg 2
11 wydm2p01uk0fd2z 2
12 wydm80xfxm9j22v 2
39 wydm9w92j538xze 4
40 wydm8km72gbyuvf 4
41 wydm86pg6r3jyrg 4
42 wydm8mzt874p1v5 4
43 wydm8mzmpz5gkt8 5
44 wydm86pg6r3jyrg 5
45 wydm8w1q8bjfpcj 5
46 wydm8w1q8bjfpcj 5
What I want to do is to manipulate this DataFrame in order to have a list object that contains a string, made out of the 5th character for each "GEOSTRING" value, for each different "IDactivity" value.
So in this case, I have 3 different "IDactivity" values, and I will have in my list object 3 strings that look like this:
['2828', '9888','8888']
where again, the symbols you see in each string, are the 5th value of each "GEOSTRING" value.
What I'm asking for is a solution, or an approach, that doesn't involve an overly complicated for loop and is as efficient as possible, since I have to manipulate lots of data. I'd like it to be clean and fast.
I hope it's clear enough.
This can be done easily as a one-liner (considered to be pretty fast, too):
result = df.groupby('IDactivity')['GEOSTRING'].apply(lambda x:''.join(x.str[4])).tolist()
This groups the dataframe by the values of IDactivity, then selects from each corresponding GEOSTRING the 5th character (index 4) and joins it with the other corresponding characters. Finally, we add the tolist() method to get the output as a list rather than a pandas Series.
output:
['2828', '9888', '8888']
Documentation:
pandas.groupby
pandas.apply
Here's a solution involving a temp column, and taking inspiration for the key operation from this answer:
# create a temp column with the character we want from each string
dframe['Temp'] = dframe['GEOSTRING'].apply(lambda x: x[4])
# groupby ID and then concatenate using a sneaky call to .sum()
dframe.groupby('IDactivity')['Temp'].sum().tolist()
Result:
['2828', '9888', '8888']
