Adding a column in Pandas based on multiple conditions on another column - python

I'm trying to add a Year column to my DataFrame based on the value that already exists in the Rk column. I've tried using the code below; however, it automatically sets all values to 0.
df['Year'] = np.where(df['Rk'] <= 540, '2017/2018', 0)
df['Year'] = np.where((df['Rk'] >= 541) & (df['Rk'] <= 1135), '2016/2017', 0)
df['Year'] = np.where((df['Rk'] >= 1136) & (df['Rk'] <= 1713), '2016/2017', 0)

Use cut with specified bins:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Rk': [10, 540, 541, 1135, 1136, 1713, 1714, 2000],
})
labs = ['2017/2018', '2016/2017', '2015/2016', '0']
df['Year'] = pd.cut(df['Rk'], bins=[-np.inf, 540, 1135, 1713, np.inf], labels=labs)
print (df)
Rk Year
0 10 2017/2018
1 540 2017/2018
2 541 2016/2017
3 1135 2016/2017
4 1136 2015/2016
5 1713 2015/2016
6 1714 0
7 2000 0
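Why the original code returned 0: each successive np.where call rebuilds the entire Year column, so the second and third assignments overwrite the matches produced by the earlier ones. If you prefer to keep the condition-based style, a minimal sketch with np.select (same sample df as above) avoids the overwriting because all conditions are evaluated in one call:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Rk': [10, 540, 541, 1135, 1136, 1713, 1714, 2000]})
# conditions are checked in order; the first match wins, everything else gets the default
conditions = [df['Rk'] <= 540,
              df['Rk'].between(541, 1135),
              df['Rk'].between(1136, 1713)]
choices = ['2017/2018', '2016/2017', '2015/2016']
df['Year'] = np.select(conditions, choices, default='0')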

Related

How to write a for-loop/if-statement for a dataframe (integer) column

I have a dataframe with a column of integers that represent birth years. Each row has 20xx or 19xx in it, but some rows have only the xx part.
What I want to do is add 19 in front of the numbers with only two digits if the integer is bigger than 22 (counting from 0), and add 20 in front of those that are smaller than or equal to 22.
This is what I wrote:
for x in DF.loc[DF["Year"] >= 2022]:
    x + 1900
    if:
        x >= 22
    else:
        x + 2000
You can also change the code completely; I would just like you to explain what exactly your code does.
Thanks to everybody who takes the time to answer this.
Instead of iterating through the rows, use where to change the whole column:
y = df["Year"] # just to save typing
df["Year"] = y.where(y > 99, (y + 1900).where(y > 22, y + 2000))
or indexing:
df["Year"][df["Year"].between(0, 21)] += 2000
df["Year"][df["Year"].between(22, 99)] += 1900
or loc:
df.loc[df["Year"].between(0, 21), "Year"] += 2000
df.loc[df["Year"].between(22, 99), "Year"] += 1900
You can do it in one line with the apply method.
Example:
df = pd.DataFrame({'date': [2002, 95, 1998, 3, 56, 1947]})
print(df)
date
0 2002
1 95
2 1998
3 3
4 56
5 1947
Then:
df['date'] = df['date'].apply(lambda x: x + 1900 if 22 < x < 100 else (x + 2000 if x <= 22 else x))
print(df)
date
0 2002
1 1995
2 1998
3 2003
4 1956
5 1947
It is basically what you did, an if inside a for:
new_list_of_years = []
for year in DF["Year"]:
    full_year = year + 1900 if 22 < year < 100 else (year + 2000 if year <= 22 else year)
    new_list_of_years.append(full_year)
DF["Year"] = new_list_of_years
Edit: You can do it with a for-if list comprehension as well:
DF["Year"] = [year + 1900 if 22 < year < 100 else (year + 2000 if year <= 22 else year) for year in DF["Year"]]
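Run on the same sample data as the apply answer above, the comprehension gives the identical result (a quick sanity check):

import pandas as pd

DF = pd.DataFrame({"Year": [2002, 95, 1998, 3, 56, 1947]})
DF["Year"] = [year + 1900 if 22 < year < 100 else (year + 2000 if year <= 22 else year)
              for year in DF["Year"]]
print(DF["Year"].tolist())  # [2002, 1995, 1998, 2003, 1956, 1947]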

Merge_asof but only let the nearest merge on key

I am currently trying to merge two dataframes using the merge_asof method. However, when using this method I stumbled upon the issue that if I have an empty gap in any of my data, then there will be issues with duplicate cells in the merged dataframe. For clarification, I have two dataframes that look like this:
1.
index Meter_Indication (km) Fuel1 (l)
0 35493 245
1 35975 267
2 36000 200
3 36303 160
4 36567 300
5 38653 234
2.
index Meter_Indication (km) Fuel2 (l)
0 35494 300
1 35980 203
2 36573 323
3 38656 233
These two dataframes contain data about refueling vehicles, where the fuel column is the refueled amount in liters and Meter_Indication indicates how many km the car has driven in total (something that cannot decrease over time, which is why it is a great key to merge on). However, as you can see, there are fewer rows in df2 than in df1, which currently (in my case) makes the values merge on the nearest value. Like this:
(merged df)
index Meter_Indication (km) Fuel1 (l) Fuel2(l)
0 35493 245 300
1 35975 267 203
2 36000 200 203
3 36303 160 323
4 36567 300 323
5 38653 234 233
As you can see, there are duplicates of the values 203 and 323. My goal is that instead of the dataframe containing all six rows, the ones that don't have a "nearest" match are excluded. I want only the actual nearest rows to merge with the values. In other words, my desired dataframe is:
index Meter_Indication (km) Fuel1 (l) Fuel2(l)
0 35493 245 300
1 35975 267 203
3 36567 300 323
4 38653 234 233
You can see here that the values that were not a "closest" match with another value were dropped.
I have tried looking for this everywhere but can't find anything that matches my desired outcome.
My current code is:
#READS PROVIDED DOCUMENTS.
df1 = pd.read_excel(
    filepathname1, "CWA107 Event", na_values=["NA"], skiprows=1,
    usecols="A, B, C, D, E, F")
df2 = pd.read_excel(
    filepathname2,
    na_values=["NA"],
    skiprows=1,
    usecols=["Fuel2 (l)", "Unnamed: 3", "Meter_Indication"],)
# Drop NaN rows.
df2.dropna(inplace=True)
df1.dropna(inplace=True)
#Filters out rows with the keywords listed in 'blacklist'.
df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
df1 = df1[~df1.Bränsle.isin(blacklist)]
df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)
#Creates new column for the difference in fuellevel column.
df1["Difference (%)"] = df1["Bränslenivå (%)"]
df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()
# Renames time-column so that they match.
df2.rename(columns={"Unnamed: 3": "Tid"}, inplace=True)
# Drops rows where the difference is equal to 0.
df1filt = df1[(df1["Difference (%)"] != 0)]
# Converts time-column to only year, month and date.
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
df1filt.reset_index(level=0, inplace=True)
#Renames the index column to "row" in order to later use the "row" column
df1filt.rename(columns={"index": "row"}, inplace=True)
# Creates a new column for the difference in total driven kilometers (used for matching)
df1filt["Match"] = df1filt["Vägmätare (km)"]
df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()
#Merges refuels that were previously separated because of the time intervals, for example when a refuel takes a lot of time and gets split into two different refuels.
ROWRANGE = len(df1filt) + 1
thevalue = 0
for currentrow in range(ROWRANGE - 1):
    if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
        thevalue = 0
        thevalue += df1filt.loc[currentrow, 'Difference (%)']
        df1filt.loc[currentrow, 'Match'] = "SUMMED"
    if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
        thevalue += df1filt.loc[currentrow, 'Difference (%)']
    if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
        df1filt.loc[currentrow-1, 'Difference (%)'] = thevalue
        df1filt.loc[currentrow-1, 'Match'] = "OFFICIAL"
        thevalue = 0
#Removes single "refuels" that are lower than 5
df1filt = df1filt[(df1filt['Difference (%)'] > 5)]
#Creates a new dataframe for the summed values
df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]
#Creates an estimated refueled amount column for the automatic data
df1filt2["Fuel1 (l)"] = df1filt2["Difference (%)"]
df1filt2["Fuel1 (l)"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity
#Renames total kilometer column so that the two documents can match
df1filt2.rename(columns={"Vägmätare (km)": "Meter_Indication"}, inplace=True)
#Filters out rows where refuel and kilometer = NaN (Manual)
df2filt = df2[(df2['Fuel2 (l)'] != NaN) & (df2['Meter_Indication'] != NaN)]
#Drops first row
df2filt.drop(df2filt.index[0], inplace=True)
#Adds prefix for the time column so that they match (not used anymore because km is used to match)
df2filt['Tid'] = '20' + df2filt['Tid'].astype(str)
#Rounds numeric columns
decimals = 0
df2filt['Meter_Indication'] = pd.to_numeric(df2filt['Meter_Indication'],errors='coerce')
df2filt['Fuel2 (l)'] = pd.to_numeric(df2filt['Fuel2 (l)'],errors='coerce')
df2filt['Meter_Indication'] = df2filt['Meter_Indication'].apply(lambda x: round(x, decimals))
df2filt['Fuel2 (l)'] = df2filt['Fuel2 (l)'].apply(lambda x: round(x, decimals))
#Removes last number (makes the two excels matchable)
df2filt['Meter_Indication'] //= 10
df1filt2['Meter_Indication'] //= 10
#Creates merged dataframe with the two
merged_df = df1filt2.merge(df2filt, on='Meter_Indication')
Hopefully this was enough information! Thank you in advance.
Try this:
# Assign new column to keep meter indication from df2
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']),
                   on='Meter_Indication (km)', direction='nearest')
# Calculate absolute difference
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()
# Sort values, drop duplicates (keep the ones with the smallest diff) and do some clean-up
df = (df.sort_values(by=['meter_indication_2', 'meter_indication_diff'])
        .drop_duplicates(subset=['meter_indication_2'])
        .sort_index()
        .drop(['meter_indication_2', 'meter_indication_diff'], axis=1))
# Output
Meter_Indication (km) Fuel1 (l) Fuel2 (l)
0 35493 245 300
1 35975 267 203
4 36567 300 323
5 38653 234 233
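For reference, a self-contained reproduction of this approach on the sample data from the question (a sketch; the column names are taken from the tables above):

import pandas as pd

df1 = pd.DataFrame({'Meter_Indication (km)': [35493, 35975, 36000, 36303, 36567, 38653],
                    'Fuel1 (l)': [245, 267, 200, 160, 300, 234]})
df2 = pd.DataFrame({'Meter_Indication (km)': [35494, 35980, 36573, 38656],
                    'Fuel2 (l)': [300, 203, 323, 233]})

# nearest-match merge, then keep only the row with the smallest distance per df2 row
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']),
                   on='Meter_Indication (km)', direction='nearest')
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()
df = (df.sort_values(by=['meter_indication_2', 'meter_indication_diff'])
        .drop_duplicates(subset=['meter_indication_2'])
        .sort_index()
        .drop(['meter_indication_2', 'meter_indication_diff'], axis=1))
print(df)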

Applying lambda row on df with if statement

I have two data frames and am trying to use entries from df1 to limit amounts in df2, then add them up. It seems like my code is capping correctly, but not adding the amounts up.
Code:
import pandas as pd
df1 = pd.DataFrame({'Caps':['25','45','65']})
df2 = pd.DataFrame({'Amounts':['45','25','65','35','85']})
df1['Capped'] = df1.apply(lambda row: df2['Amounts'].where(
    df2['Amounts'] <= row['Caps'], row['Caps']).sum(), axis=1)
Output:
>>> df1
Caps Capped
0 25 2525252525
1 45 4525453545
2 65 4525653565
First, it is necessary to convert the values to integers with Series.astype:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
df1['Capped'] = df1.apply(lambda row: df2['Amounts'].where(
    df2['Amounts'] <= row['Caps'], row['Caps']).sum(), axis=1)
print (df1)
Caps Capped
0 25 125
1 45 195
2 65 235
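The original concatenated output comes from the columns holding strings: where then compares them character by character and sum joins them. Two lines illustrate this:

print('45' <= '25')                   # False -- lexicographic comparison, not numeric
print(pd.Series(['25', '45']).sum())  # '2545' -- string concatenation, not addition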
To improve performance, it is possible to use numpy.where with broadcasting:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
df1['Capped'] = np.where(am <= ca[:, None], am[None, :], ca[:, None]).sum(axis=1)
print (df1)
Caps Capped
0 25 125
1 45 195
2 65 235
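Since capping each amount at a cap is just an element-wise minimum, the broadcasting version can also be written with numpy.minimum (a sketch reusing the am and ca arrays from above):

# (3, 1) caps against (5,) amounts broadcast to a (3, 5) grid;
# summing each row gives one capped total per cap
df1['Capped'] = np.minimum(am, ca[:, None]).sum(axis=1)
print(df1['Capped'].tolist())  # [125, 195, 235]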

Python: Populate new df column based on if statement condition

I'm trying something new. I want to populate a new df column based on some conditions affecting another column with values.
I have a data frame with two columns (ID, Retailer). I want to populate the Retailer column based on the ids in the ID column. I know how to do this in SQL, using a CASE statement, but how can I go about it in Python?
I've had a look at this example but it isn't exactly what I'm looking for.
Python : populate a new column with an if/else statement
import pandas as pd
data = {'ID':['112','5898','32','9985','23','577','17','200','156']}
df = pd.DataFrame(data)
df['Retailer']=''
if df['ID'] in (112, 32):
    df['Retailer'] = 'Webmania'
elif df['ID'] in (5898):
    df['Retailer'] = 'DataHub'
elif df['ID'] in (9985):
    df['Retailer'] = 'TorrentJunkie'
elif df['ID'] in (23):
    df['Retailer'] = 'Apptronix'
else:
    df['Retailer'] = 'Other'
print(df)
The output I'm expecting to see would be something along these lines:
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Use numpy.select, and for testing multiple values use Series.isin. Also note that since the sample data contains strings, you need to test against strings, i.e. '112' rather than the number 112:
m1 = df['ID'].isin(['112','32'])
m2 = df['ID'] == '5898'
m3 = df['ID'] == '9985'
m4 = df['ID'] == '23'
vals = ['Webmania', 'DataHub', 'TorrentJunkie', 'Apptronix']
masks = [m1, m2, m3, m4]
df['Retailer'] = np.select(masks, vals, default='Other')
print(df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
If there are many categories, it is also possible to use your solution with a custom function:
def get_data(x):
    if x in ('112', '32'):
        return 'Webmania'
    elif x == '5898':
        return 'DataHub'
    elif x == '9985':
        return 'TorrentJunkie'
    elif x == '23':
        return 'Apptronix'
    else:
        return 'Other'
df['Retailer'] = df['ID'].apply(get_data)
print (df)
ID Retailer
0 112 Webmania
1 5898 DataHub
2 32 Webmania
3 9985 TorrentJunkie
4 23 Apptronix
5 577 Other
6 17 Other
7 200 Other
8 156 Other
Or use map with a dictionary; IDs with no match become NaN, hence the added fillna:
d = {'112': 'Webmania',
     '32': 'Webmania',
     '5898': 'DataHub',
     '9985': 'TorrentJunkie',
     '23': 'Apptronix'}
df['Retailer'] = df['ID'].map(d).fillna('Other')
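If the IDs are needed as numbers elsewhere, a variant of the map approach (a sketch on the same sample data) converts the column once and maps with integer keys:

import pandas as pd

df = pd.DataFrame({'ID': ['112', '5898', '32', '9985', '23', '577', '17', '200', '156']})
d = {112: 'Webmania', 32: 'Webmania', 5898: 'DataHub',
     9985: 'TorrentJunkie', 23: 'Apptronix'}
df['ID'] = df['ID'].astype(int)
df['Retailer'] = df['ID'].map(d).fillna('Other')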

Python pandas resampling

I have the following dataframe:
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
The table goes on and on like that. The first column is a timestamp in milliseconds. S_time1 and End_Time_1 are the duration in which a particular sign (number) appears. For example, if we take the 5th row, S_time1 is 2526631, End_Time_1 is 2520631, and the corresponding Sign_1 is 10, which means that from 2526631 to 2520631 the sign 10 will be displayed. The same goes for S_time2 and End_time_2: the corresponding values in Sign_2 will appear in the duration from S_time2 to End_time_2.
I want to resample the index column (Timestamp) into 100-millisecond bins and check which bins the signs belong to. For instance, between each start time and end time there is a difference of 2000 milliseconds, so the corresponding sign number will appear repeatedly in around 20 consecutive bins, because each bin is 100 milliseconds. So I need to have only two columns: one with the bin times and the second with the signs, like the following table (I am just making up the bin times for this example):
Bin_time signs
...100 0
...200 0
...300 10
...400 10
...500 10
...600 10
The sign 10 will be for the duration of the corresponding S_time1 to End_time_1. Then the next sign which is 80 continues for the duration of S_time2 to End_time_2. I am not sure if this can be done in pandas or not. But I really need help either in pandas or other methods.
Thanks for your help and suggestion in advance.
Input:
print df
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
Two approaches, timed on the input above:
In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop
In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop
def m(df):
    #resample column Timestamp by 100ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df['i'] = 1
    df = df.set_index('Timestamp')
    df1 = df[[]].resample('100ms', how='first').reset_index()
    df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #helper column i for merging
    df1['i'] = 1
    #print df1
    out = df1.merge(df, on='i', how='left')
    out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
    out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]
    out1 = out1.rename(columns={'Sign_1': 'Bin_time'})
    out2 = out2.rename(columns={'Sign_2': 'Bin_time'})
    df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
    df1 = df1.set_index('Timestamp')
    df = df.set_index('Timestamp')
    df = df.reindex(df1.index).reset_index()
    #print df.head(10)
def s(df):
    #resample column Timestamp by 100ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df = df.set_index('Timestamp')
    out = df[[]].resample('100ms', how='first')
    out = out.reset_index()
    out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #print out.head(10)
    #search start and end times
    def search(x):
        mask1 = (df.S_time1 <= x['Timestamp']) & (df.End_Time_1 >= x['Timestamp'])
        #if at least one True, return first value of the series (positional, via iloc)
        if mask1.any():
            return df.loc[mask1].Sign_1.iloc[0]
        #check second start and end time
        else:
            mask2 = (df.S_time2 <= x['Timestamp']) & (df.End_time_2 >= x['Timestamp'])
            if mask2.any():
                #if at least one True, return first value
                return df.loc[mask2].Sign_2.iloc[0]
            else:
                #if all False, return NaN
                return np.nan
    out['Bin_time'] = out.apply(search, axis=1)
    #print out.head(10)
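Note that resample('100ms', how='first') and the "best of 3" timings above come from an old pandas version; in current pandas the how= argument is gone and the aggregation is a method call such as .resample('100ms').first(). One simple alternative builds the same 100 ms bin grid directly with numpy.arange; a minimal sketch, assuming the integer Timestamp column from the question before any timedelta conversion:

import numpy as np
import pandas as pd

# floor min and max to the 100 ms grid, then enumerate bin starts
lo = int(df['Timestamp'].min()) // 100 * 100
hi = int(df['Timestamp'].max()) // 100 * 100
out = pd.DataFrame({'Timestamp': np.arange(lo, hi + 100, 100)})
# the search/apply step from s(df) can then be used on `out` unchanged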
