How to apply a UDF to a dataframe? - python

I am trying to create a function that will clean up a dataframe that I pass through it. But I noticed that the returned df is cleaned up, while the original df is left unchanged.
How can I run a UDF on a dataframe and keep the updated dataframe saved in place?
P.S. I know I can combine these filters into one line, but the function I am creating is a lot more complex, so I don't want to combine them for this example.
df = pd.DataFrame({'Key': ['3', '9', '9', '9', '9','34','34', '34'],
'LastFour': ['2290', '0087', 'M433','M433','25','25','25','25'],
'NUM': [20120528, 20120507, 20120615,20120629,20120621,20120305,20120506,20120506]})
def cleaner(x):
    x = x[x['Key'] == '9']
    x = x[x['LastFour'] == 'M433']
    x = x[x['NUM'] == 20120615]
    return x
cleaner(df)
Result from the UDF:
Key LastFour NUM
2 9 M433 20120615
But if I run the df after the function then I still get the original dataset:
Key LastFour NUM
0 3 2290 20120528
1 9 0087 20120507
2 9 M433 20120615
3 9 M433 20120629
4 9 25 20120621
5 34 25 20120305
6 34 25 20120506
7 34 25 20120506

You need to assign the result of cleaner(df) back to df, like so:
df = cleaner(df)
An alternative method is to use pd.DataFrame.pipe to pass your dataframe through a function:
df = df.pipe(cleaner)
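As a minimal, self-contained sketch of the fix (using a cut-down version of the question's data), both forms produce the same filtered frame; the key point is that the result must be bound back to df:

```python
import pandas as pd

df = pd.DataFrame({'Key': ['3', '9'],
                   'LastFour': ['2290', 'M433'],
                   'NUM': [20120528, 20120615]})

def cleaner(x):
    # Each boolean filter returns a new DataFrame; the caller's df is never mutated.
    x = x[x['Key'] == '9']
    x = x[x['LastFour'] == 'M433']
    x = x[x['NUM'] == 20120615]
    return x

df = df.pipe(cleaner)   # equivalent to df = cleaner(df)
```

Without the reassignment, the filtered copy is simply discarded when the function returns.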

Related

How to merge an itertools generated dataframe and a normal dataframe in pandas?

I have generated a dataframe containing all possible two-way combinations of electrocardiogram (ECG) leads using itertools, with the code below:
source = [ 'I-s', 'II-s', 'III-s', 'aVR-s', 'aVL-s', 'aVF-s', 'V1-s', 'V2-s', 'V3-s', 'V4-s', 'V5-s', 'V6-s', 'V1Long-s', 'IILong-s', 'V5Long-s', 'Information-s' ]
target = [ 'I-t', 'II-t', 'III-t', 'aVR-t', 'aVL-t', 'aVF-t', 'V1-t', 'V2-t', 'V3-t', 'V4-t', 'V5-t', 'V6-t', 'V1Long-t', 'IILong-t', 'V5Long-t', 'Information-t' ]
from itertools import product
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
The test dataframe contains 256 rows covering all the possible pair combinations.
The value for each combination is initialised to zero as follows:
test['value'] = 0
I have another dataframe called diagramDF that contains the combinations where the value column is non-zero. diagramDF is significantly smaller than the test dataframe:
source target value
0 I-s II-t 137
1 II-s I-t 3
2 II-s III-t 81
3 II-s IILong-t 13
4 II-s V1-t 21
5 III-s II-t 3
6 III-s aVF-t 19
7 IILong-s II-t 13
8 IILong-s V1Long-t 353
9 V1-s aVL-t 11
10 V1Long-s IILong-t 175
11 V1Long-s V3-t 4
12 V1Long-s aVF-t 4
13 V2-s V3-t 8
14 V3-s V2-t 6
15 V3-s V6-t 2
16 V5-s aVR-t 5
17 V6-s III-t 4
18 aVF-s III-t 79
19 aVF-s V1Long-t 235
20 aVL-s I-t 1
21 aVL-s aVF-t 16
22 aVR-s aVL-t 1
Note that the first two columns source and target have the same notations
I have tried to replace the zero values of the test dataframe with the nonzero values of the diagramDF using merge like below:
df = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
However, I get an error informing me that:
ValueError: The column label 'source' is not unique. For a
multi-index, the label must be a tuple with elements corresponding to
each level
Is there something that I am getting wrong? Is there a more efficient and fast way to do this?
This might help:
pd.merge(test, diagramDF, how='left', on=['source', 'target'], right_index=True, left_index=True)
Check this:
test = test.reset_index()
diagramDF = diagramDF.reset_index()
new = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
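Beyond fixing the index, a sketch of one way to get the intended result (an assumption about the goal, using small stand-in data): drop the zero placeholder column before merging, then fill the combinations absent from diagramDF with 0:

```python
import pandas as pd
from itertools import product

source = ['I-s', 'II-s', 'III-s']
target = ['I-t', 'II-t', 'III-t']
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
test['value'] = 0

# Stand-in for the question's diagramDF: only the non-zero combinations.
diagramDF = pd.DataFrame({'source': ['I-s', 'II-s'],
                          'target': ['II-t', 'III-t'],
                          'value': [137, 81]})

# Merge on the key columns only, so there is a single 'value' column afterwards,
# then fill the unmatched rows with 0.
new = test.drop(columns='value').merge(diagramDF, how='left', on=['source', 'target'])
new['value'] = new['value'].fillna(0).astype(int)
```

Merging without dropping the placeholder would instead produce value_x/value_y columns that then need combining.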

How to multiply the first two digits in a single dataframe column in pandas?

I have tried many ways of doing this, but none of them work for my case. Most examples multiply column by column; in my case I need to get the first two digits from a single column and multiply them together.
The data is a single column in a dataset; I need to get the first two digits of each value and multiply them with each other.
For example: for the first row I need to multiply 4 by 5, and the result will be stored in a new column.
May I know how to do it?
Thank you in advance ^^
For the following dataframe:
import pandas as pd
d={'locationID':[12,234,34]}
data=pd.DataFrame(data=d)
print(data)
locationID
0 12
1 234
2 34
If you want to multiply all the digits, define a function that multiplies them:
def multiplyall(number):
    result = 1
    for i in range(len(str(number))):
        result = int(str(number)[i]) * result
    return result
Create the column and add the values according to the function in one line with insert(location, column name, column values):
data.insert(len(data.columns), 'new_column', data['locationID'].apply(lambda x: multiplyall(x)).tolist())
you'll get the following output
print(data)
locationID new_column
0 12 2
1 234 24
2 34 12
If you want to multiply ONLY the 1st and 2nd digits, define a function for just those two:
def multiply_firstsecond(number):
    result = number
    if len(str(number)) > 1:
        result = int(str(number)[0]) * int(str(number)[1])
    return result
Similarly,
data.insert(len(data.columns), 'new_column', data['locationID'].apply(lambda x: multiply_firstsecond(x)).tolist())
output for this one,
print(data)
locationID new_column
0 12 2
1 234 6
2 34 12
PLEASE make sure you have no NaN or non-numeric values in the column to avoid errors.
Like this:
ID = ['45.0',
'141.0',
'191.0',
'143.0',
'243.0']
N = [f"{int(s[0])*int(s[1])}" for s in ID]
print(N)
Output:
['20',
'4',
'9',
'4',
'8']
This should work
data = pd.DataFrame([54423, 2023, 4353, 76754], columns=["number_1"])
data["number_2"] = 0
def calculation(num):
    mult = num
    if len(str(num)) >= 2:
        str_num = str(num)
        mult = int(str_num[0]) * int(str_num[1])
    return mult
data["number_2"] = data["number_1"].apply(calculation)
print(data)
number_1 number_2
0 54423 20
1 2023 0
2 4353 12
3 76754 42
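As a variation on the answers above (my sketch, not from the original answers): the same first-two-digits product can be computed without apply, using pandas' string accessor, assuming every value has at least two digits:

```python
import pandas as pd

data = pd.DataFrame({'locationID': [12, 234, 34]})

# Convert to string once, then index the first two characters vectorised.
s = data['locationID'].astype(str)
data['new_column'] = s.str[0].astype(int) * s.str[1].astype(int)
```

For single-digit or NaN values this would fail or produce NaN, so the apply-based functions above are safer for messy data.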

Python: Populate new df column based on if statement condition

I'm trying something new. I want to populate a new df column based on some conditions affecting another column with values.
I have a data frame with two columns (ID,Retailer). I want to populate the Retailer column based on the ids in the ID column. I know how to do this in SQL, using a CASE statement, but how can I go about it in python?
I've had a look at this example, but it isn't exactly what I'm looking for:
Python : populate a new column with an if/else statement
import pandas as pd
data = {'ID':['112','5898','32','9985','23','577','17','200','156']}
df = pd.DataFrame(data)
df['Retailer']=''
if df['ID'] in (112,32):
    df['Retailer']='Webmania'
elif df['ID'] in (5898):
    df['Retailer']='DataHub'
elif df['ID'] in (9985):
    df['Retailer']='TorrentJunkie'
elif df['ID'] in (23):
    df['Retailer']='Apptronix'
else:
    df['Retailer']='Other'
print(df)
The output I'm expecting to see would be something along these lines:
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Use numpy.select, and to test multiple values use Series.isin. Note also that the sample data contains strings, so compare against strings like '112' rather than numbers like 112:
import numpy as np

m1 = df['ID'].isin(['112','32'])
m2 = df['ID'] == '5898'
m3 = df['ID'] == '9985'
m4 = df['ID'] == '23'
vals = ['Webmania', 'DataHub', 'TorrentJunkie', 'Apptronix']
masks = [m1, m2, m3, m4]
df['Retailer'] = np.select(masks, vals, default='Other')
print(df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
If there are many categories, it is also possible to use your approach with a custom function:
def get_data(x):
    if x in ('112','32'):
        return 'Webmania'
    elif x == '5898':
        return 'DataHub'
    elif x == '9985':
        return 'TorrentJunkie'
    elif x == '23':
        return 'Apptronix'
    else:
        return 'Other'
df['Retailer'] = df['ID'].apply(get_data)
print (df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Or use map with a dictionary; values with no match become NaN, hence the added fillna:
d = {'112': 'Webmania','32':'Webmania',
'5898':'DataHub',
'9985':'TorrentJunkie',
'23':'Apptronix'}
df['Retailer'] = df['ID'].map(d).fillna('Other')
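When several IDs share a retailer, the mapping dictionary can also be built from a retailer-to-IDs dict instead of repeating names. A small sketch (the grouped dict layout is my own convenience, not from the answer above):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['112', '5898', '32', '9985', '23', '577']})

# One entry per retailer, listing all of its IDs.
retailers = {'Webmania': ['112', '32'],
             'DataHub': ['5898'],
             'TorrentJunkie': ['9985'],
             'Apptronix': ['23']}

# Invert retailer -> ids into id -> retailer for use with Series.map.
d = {i: name for name, ids in retailers.items() for i in ids}
df['Retailer'] = df['ID'].map(d).fillna('Other')
```

This keeps the source of truth in one place if the ID lists grow.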

Python Pandas fillna doesn't work in for loop?

Given a set up such as below:
import pandas as pd
import numpy as np
#Create random number dataframes
df1 = pd.DataFrame(np.random.rand(10,4))
df2 = pd.DataFrame(np.random.rand(10,4))
df3 = pd.DataFrame(np.random.rand(10,4))
#Create list of dataframes
data_frame_list = [df1, df2, df3]
#Introduce some NaN values
df1.iloc[4,3] = np.NaN
df2.iloc[1:4,2] = np.NaN
#Create loop to ffill any NaN values
for df in data_frame_list:
    df = df.fillna(method='ffill')
This still leaves df2 (for example) as:
0 1 2 3
0 0.946601 0.492957 0.688421 0.582571
1 0.365173 0.507617 NaN 0.997909
2 0.185005 0.496989 NaN 0.962120
3 0.278633 0.515227 NaN 0.868952
4 0.346495 0.779571 0.376018 0.750900
5 0.384307 0.594381 0.741655 0.510144
6 0.499180 0.885632 0.13413 0.196010
7 0.245445 0.771402 0.371148 0.222618
8 0.564510 0.487644 0.121945 0.095932
9 0.401214 0.282698 0.0181196 0.689916
Although the individual line of code:
df2 = df2.fillna(method='ffill')
does work. I thought the issue might be due to the way I was naming variables, so I introduced globals()[df], but this didn't seem to work either.
I am wondering whether it is possible to ffill an entire dataframe in a for loop, or am I going wrong somewhere in my approach?
No, it unfortunately does not. You are calling fillna not in place and it results in the generation of a copy, which you then reassign back to the variable df. You should understand that reassigning this variable does not change the contents of the list.
If you want to do that, iterate over the index or use a list comprehension.
data_frame_list = [df.ffill() for df in data_frame_list]
Or,
for i in range(len(data_frame_list)):
    data_frame_list[i].ffill(inplace=True)
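To make the copy-versus-reference point concrete, here is a minimal, self-contained sketch (using a one-column stand-in frame rather than the question's data):

```python
import pandas as pd
import numpy as np

frames = [pd.DataFrame({'a': [1.0, np.nan]})]

# Rebinding the loop variable only changes the local name; the list still
# holds the original, unfilled frame.
for df in frames:
    df = df.ffill()
assert frames[0]['a'].isna().any()

# Reassigning by index (or calling ffill(inplace=True)) updates the list contents.
for i in range(len(frames)):
    frames[i] = frames[i].ffill()
assert not frames[0]['a'].isna().any()
```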
You can modify the DataFrames inside the list in place; since the list holds references to the same objects as df1 - df3, those are filled as well when ffill is used with inplace=True:
data_frame_list = [df1, df2, df3]
for df in data_frame_list:
    df.ffill(inplace=True)
print (data_frame_list)
[ 0 1 2 3
0 0.506726 0.057531 0.627580 0.132553
1 0.131085 0.788544 0.506686 0.412826
2 0.578009 0.488174 0.335964 0.140816
3 0.891442 0.086312 0.847512 0.529616
4 0.550261 0.848461 0.158998 0.529616
5 0.817808 0.977898 0.933133 0.310414
6 0.481331 0.382784 0.874249 0.363505
7 0.384864 0.035155 0.634643 0.009076
8 0.197091 0.880822 0.002330 0.109501
9 0.623105 0.999237 0.567151 0.487938, 0 1 2 3
0 0.104856 0.525416 0.284066 0.658453
1 0.989523 0.644251 0.284066 0.141395
2 0.488099 0.167418 0.284066 0.097982
3 0.930415 0.486878 0.284066 0.192273
4 0.210032 0.244598 0.175200 0.367130
5 0.981763 0.285865 0.979590 0.924292
6 0.631067 0.119238 0.855842 0.782623
7 0.815908 0.575624 0.037598 0.532883
8 0.346577 0.329280 0.606794 0.825932
9 0.273021 0.503340 0.828568 0.429792, 0 1 2 3
0 0.491665 0.752531 0.780970 0.524148
1 0.635208 0.283928 0.821345 0.874243
2 0.454211 0.622611 0.267682 0.726456
3 0.379144 0.345580 0.694614 0.585782
4 0.844209 0.662073 0.590640 0.612480
5 0.258679 0.413567 0.797383 0.431819
6 0.034473 0.581294 0.282111 0.856725
7 0.352072 0.801542 0.862749 0.000285
8 0.793939 0.297286 0.441013 0.294635
9 0.841181 0.804839 0.311352 0.171094]
Or you can concat:
df = pd.concat([df1, df2, df3], keys=['df1', 'df2', 'df3'])
[x for _, x in df.groupby(level=0).ffill().groupby(level=0)]

Pandas groupby expanding optimization of syntax

I am using the data from the example shown here: http://pandas.pydata.org/pandas-docs/stable/groupby.html, under the subheading "New syntax to window and resample operations".
At the command prompt, the new syntax works as shown in the pandas documentation. But I want to add a new column with the expanded data to the existing dataframe, as would be done in a saved program.
Before a syntax upgrade to the groupby expanding code, I was able to use the following single line code:
df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
This gives the expected results, but also gives an 'expanding_sum is deprecated' message. Expected results are:
A B Sum of B
0 1 0 0
1 1 1 1
2 1 2 3
3 1 3 6
4 1 4 10
5 1 5 15
6 1 6 21
7 1 7 28
8 1 8 36
9 1 9 45
10 5 10 10
11 5 11 21
12 5 12 33
13 5 13 46
14 5 14 60
15 5 15 75
16 5 16 91
17 5 17 108
18 5 18 126
19 5 19 145
I want to use the new syntax to replace the deprecated syntax. If I try the new syntax, I get the error message:
df['Sum of B'] = df.groupby('A').expanding().B.sum()
TypeError: incompatible index of inserted column with frame index
I did some searching on here, and saw something that might have helped, but it gave me a different message:
df['Sum of B'] = df.groupby('A').expanding().B.sum().reset_index(level = 0)
ValueError: Wrong number of items passed 2, placement implies 1
The only way I can get it to work is to assign the result to a temporary df, then merge the temporary df into the original df:
temp_df = df.groupby('A').expanding().B.sum().reset_index(level = 0).rename(columns = {'B' : 'Sum of B'})
new_df = pd.merge(df, temp_df, on = 'A', left_index = True, right_index = True)
print (new_df)
This code gives the expected results as shown above.
I've tried different variations using transform as well, but have not been able to come up with coding this in one line as I did before the deprecation. Is there a single line syntax that will work? Thanks.
It seems you need a cumsum:
df.groupby('A')['B'].cumsum()
TL;DR
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())
Explanation
We start from the offending line:
df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
Let's read carefully the warning you mentioned:
FutureWarning: pd.expanding_sum is deprecated for Series and will be
removed in a future version, replace with
Series.expanding(min_periods=1).sum()
After reading Pandas 0.17.0: pandas.expanding_sum it becomes clear that the Series the warning is talking about is the first parameter of the pd.expanding_sum. I.e. in our case it is x.
Now we apply the code transformation suggested in the warning. So pd.expanding_sum(x) becomes x.expanding(min_periods=1).sum().
According to Pandas 0.22.0: pandas.Series.expanding min_periods has a default value of 1 so in your case it can be omitted altogether, hence the final result.
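Both one-liners discussed here can be checked against each other; a per-group expanding sum is the same thing as a per-group cumulative sum on this data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})

# Modern replacement for the deprecated pd.expanding_sum.
via_expanding = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())

# The shorter equivalent suggested in the first answer.
via_cumsum = df.groupby('A')['B'].cumsum()
```

cumsum avoids the lambda and the expanding-window machinery entirely, so it is the simpler choice when a plain running sum is all that is needed.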
