User input to create a column in Pandas DataFrame - python

I have a pandas DataFrame:
import pandas as pd

sample_data = {'Sample': ['A', 'B', 'A', 'B'],
               'Surface': ['Top', 'Bottom', 'Top', 'Bottom'],
               'Intensity': [21, 32, 14, 45]}
sample_dataframe = pd.DataFrame(data=sample_data)
And I have a function that gets user input to create a 'Condition' column for each 'Sample':
def get_choice(df, column):
    user_input = []
    for i in df[column]:
        print('\n', i)
        user_input.append(input('Condition= '))
    df['Condition'] = user_input
    return df

get_choice(sample_dataframe, 'Sample')
This works; however, the user is prompted once for every row a 'Sample' occupies. That is not a problem in this example, where each Sample has only two rows, but it gets tedious when the DataFrame is larger and samples span many rows.
How do I write a function that fills the 'Condition' column for every row a 'Sample' occupies while asking for the input only once?
I tried having the function return a dictionary and then .apply() it to the DataFrame, but it still asks for input for every instance of the 'Sample'.

If I understand your question correctly, you want to ask for user input only once per unique value and then create the 'Condition' column:
import pandas as pd

sample_data = {'Sample': ['A', 'B', 'A', 'B'],
               'Surface': ['Top', 'Bottom', 'Top', 'Bottom'],
               'Intensity': [21, 32, 14, 45]}
sample_dataframe = pd.DataFrame(data=sample_data)

def get_choice(df, column):
    m = {}
    for v in df[column].unique():
        m[v] = input('Condition for [{}] = '.format(v))
    df['Condition'] = df[column].map(m)
    return df

print(get_choice(sample_dataframe, 'Sample'))
Prints (for example):
Condition for [A] = 1
Condition for [B] = 2
  Sample Surface  Intensity Condition
0      A     Top         21         1
1      B  Bottom         32         2
2      A     Top         14         1
3      B  Bottom         45         2
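A detail worth knowing about the map() call (my note, not part of the original answer): values missing from the dict become NaN, so a default can be supplied with fillna. A minimal sketch with a hypothetical partial mapping:

# hypothetical: only 'A' has a recorded condition
partial = {'A': 'treated'}
sample_dataframe['Condition'] = (sample_dataframe['Sample'].map(partial)
                                 .fillna('unknown'))  # default for unmapped samples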

Related

python pandas dataframe add colour to adjusted and inserted row

I have the following DataFrame:
import pandas as pd

df = pd.DataFrame()
df['number'] = (651, 651, 651, 4267, 4267, 4267, 4267, 4267, 4267, 4267,
                8806, 8806, 8806, 6841, 6841, 6841, 6841)
df['name'] = ('Alex', 'Alex', 'Alex', 'Ankit', 'Ankit', 'Ankit', 'Ankit',
              'Ankit', 'Ankit', 'Ankit', 'Abhishek', 'Abhishek', 'Abhishek',
              'Blake', 'Blake', 'Blake', 'Blake')
df['hours'] = (8.25, 7.5, 7.5, 7.5, 14, 12, 15, 11, 6.5, 14, 15, 15, 13.5,
               8, 8, 8, 8)
df['loc'] = ('Nar', 'SCC', 'RSL', 'UNIT-C', 'UNIT-C', 'UNIT-C', 'UNIT-C',
             'UNIT-C', 'UNIT-C', 'UNIT-C', 'UNI', 'UNI', 'UNI', 'UNKING',
             'UNKING', 'UNKING', 'UNKING')
print(df)
If the running balance of an individual's hours reaches 38, the cell that crossed the 38th hour is adjusted, a duplicate row is inserted, and the remaining hours are carried into that following row. The following code performs this; the difference between the original and the adjusted data can be seen in the output.
s = df.groupby('number')['hours'].cumsum()                    # running balance per person
m = s.gt(38)                                                  # rows where the balance exceeds 38
idx = m.groupby(df['number']).idxmax()                        # first such row per person
delta = s.groupby(df['number']).shift().rsub(38).fillna(s)    # hours needed to reach exactly 38
out = df.loc[df.index.repeat((df.index.isin(idx) & m) + 1)]   # duplicate the crossing row
out.loc[out.index.duplicated(keep='last'), 'hours'] = delta   # first copy: top up to 38
out.loc[out.index.duplicated(), 'hours'] -= delta             # second copy: the remainder
print(out)
I then output to CSV with the following:
out.to_csv('Output.csv', index = False)
I need to have the row that got adjusted and the row that got inserted highlighted in a color (any color) when exported to CSV.
UPDATE: as CSV does not accept colours in its output, any way to tag the adjusted and inserted rows is acceptable.
You can't add any kind of formatting, including colors, to a CSV. You can, however, color records in a DataFrame.
# single-index:
# load a dataset
import seaborn as sns
df = sns.load_dataset('planets')

# now let's group the data
groups = df.groupby('method').mean()
groups

# highlight the maximum values
groups.style.highlight_max(color='lightgreen')
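Note (my addition, not from the original answer): a Styler only renders in HTML contexts such as notebooks; to keep the colours in a file you would export it with Styler.to_excel, assuming an engine such as openpyxl is installed:

# sketch: persist the highlighting to a spreadsheet instead of a CSV
groups.style.highlight_max(color='lightgreen').to_excel('groups.xlsx')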
# multi-index:
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', 'A', 100, 3], ['two', 'A', 101, 4],
                   ['three', 'A', 102, 6], ['one', 'B', 103, 6],
                   ['two', 'B', 104, 0], ['three', 'B', 105, 3]],
                  columns=['c1', 'c2', 'c3', 'c4']).set_index(['c1', 'c2']).sort_index()
print(df)

def highlight_min(data):
    color = 'red'
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_min = data == data.min()
        return [attr if v else '' for v in is_min]
    else:
        is_min = data.groupby(level=0).transform('min') == data
        return pd.DataFrame(np.where(is_min, attr, ''),
                            index=data.index, columns=data.columns)

df = df.style.apply(highlight_min, axis=0)
df
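Given the UPDATE, one way to tag rather than colour the rows (my sketch, reusing the out frame and the duplicated-index trick from the question, not part of the original answer) is to write a marker column before exporting:

# assumption: 'out' is the adjusted frame built in the question's code
out['tag'] = ''
out.loc[out.index.duplicated(keep='last'), 'tag'] = 'adjusted'  # first copy of a split row
out.loc[out.index.duplicated(), 'tag'] = 'inserted'             # second copy of a split row
out.to_csv('Output.csv', index=False)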

Replace value based on a corresponding value but keep value if criteria not met

Given the following DataFrame,
INPUT df:
   Cost_centre Pool_costs
0        90272          A
1        92705          A
2        98754          A
3        91350          A
Replace the Pool_costs value with 'B' when the Cost_centre value appears in a list, but keep the existing Pool_costs value when the Cost_centre value does not appear in the list.
OUTPUT df:
   Cost_centre Pool_costs
0        90272          B
1        92705          A
2        98754          A
3        91350          B
Current code:
This works up until the else side of the lambda; getting the existing Pool_costs value back is the hard part.
import pandas as pd

df = pd.DataFrame({'Cost_centre': [90272, 92705, 98754, 91350],
                   'Pool_costs': ['A', 'A', 'A', 'A']})
pool_cc = [90272, 91350]
pool_cc_set = set(pool_cc)
df['Pool_costs'] = df['Cost_centre'].apply(lambda x: 'B' if x in pool_cc_set else df['Pool_costs'])
print(df)
I have used the following and it works, but it gets hard to read and modify when there are a lot of cost centres to change.
df = pd.DataFrame({'Cost_centre': [90272, 92705, 98754, 91350],
                   'Pool_costs': ['A', 'A', 'A', 'A']})
filt = (df['Cost_centre'] == 90272) | (df['Cost_centre'] == 91350)
df.loc[filt, 'Pool_costs'] = 'B'
IIUC, you can use isin:
filt = df['Cost_centre'].isin([90272, 91350])
df.loc[filt, 'Pool_costs'] = 'B'
print(df)
   Cost_centre Pool_costs
0        90272          B
1        92705          A
2        98754          A
3        91350          B
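If many cost centres need to map to different pool values, a dictionary plus Series.map keeps things readable; this is a sketch with assumed example values (not from the original answer), falling back to the existing column for non-matches:

# hypothetical mapping: extend with as many cost centres as needed
pool_map = {90272: 'B', 91350: 'B'}
df['Pool_costs'] = df['Cost_centre'].map(pool_map).fillna(df['Pool_costs'])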

handling a geohash dict look up with spatial joins

I have a dictionary with geohashes as keys and a value associated with each. I am looking up values from the dict to create a new column in my pandas DataFrame.
import pandas as pd

geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
df = pd.DataFrame({'geohash': ['9q5dx', '9qh0g', '9q9hv', '9q5dv'],
                   'label': ['a', 'b', 'c', 'd']})
df['value'] = df.apply(lambda x: geo_dict[x.geohash], axis=1)
I need to be able to handle non-matches, i.e. geohashes that do not exist in the dictionary. Expected handling:
- find the k nearest geohashes and compute the mean of their values
- assign that mean to the pandas column
Questions:
- Is there a library I can use to find nearby geohashes?
- How do I code up this solution?
The module pygeodesy has several functions to calculate the distance between geohashes. We can wrap one in a function that first checks if a match exists in the dict, and otherwise returns the mean value of the n closest geohashes:
import pandas as pd
import pygeodesy as pgd

geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
geo_df = pd.DataFrame(zip(geo_dict.keys(), geo_dict.values()), columns=['geohash', 'value'])
df = pd.DataFrame({'geohash': ['9q5dx', '9qh0g', '9q9hv', '9q5dv'],
                   'label': ['a', 'b', 'c', 'd']})

def approximate_distance(geohash1, geohash2):
    return pgd.geohash.distance_(geohash1, geohash2)
    # alternative ways to calculate the distance:
    # return pgd.geohash.equirectangular_(geohash1, geohash2)
    # return pgd.geohash.haversine_(geohash1, geohash2)

def get_value(x, n=2):  # n sets how many of the closest geohashes to average
    val = geo_df.loc[geo_df['geohash'] == x]
    if not val.empty:
        return val['value'].iloc[0]
    else:
        geo_df['tmp_dist'] = geo_df['geohash'].apply(lambda y: approximate_distance(y, x))
        return geo_df.nsmallest(n, 'tmp_dist')['value'].mean()  # smallest distance = closest

df['value'] = df['geohash'].apply(get_value)
result:
  geohash label  value
0   9q5dx     a   10.0
1   9qh0g     b   12.5
2   9q9hv     c   15.0
3   9q5dv     d   20.0
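If the frame contains many repeated geohashes, every non-match recomputes all distances; as a small optimisation sketch (my addition, assuming the get_value above), the lookup can be memoised per distinct geohash:

from functools import lru_cache

@lru_cache(maxsize=None)
def get_value_cached(x, n=2):
    # distances are computed only once per distinct geohash
    return get_value(x, n)

df['value'] = df['geohash'].apply(get_value_cached)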

How to transpose values from top few rows in python dataframe into new columns

I am trying to select the values from the top 3 records of each group in a sorted pandas DataFrame and put them into new columns. I have a function that processes each group, but I am having difficulty finding the right method to extract and rename the series, then combine the results into a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd

data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Price', 'Qty'])
Below I reproduce 2 examples of the functions I've tested; I am trying to find a more efficient option that still works, especially if I have to process many more columns and records.
Function best3_prices_v1 works, but I have to explicitly specify each column or variable, which becomes a real issue as I add more columns.
def best3_prices_v1(x):
    d = {}
    # get the best 3 records if available, else set the volumes to zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
A second attempt to reduce the code and the explicit names for each variable ... not complete and not working.
def best3_prices_v2(x):
    d = {}
    # get the best 3 records if available, else set the volumes to zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get record values for the best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in = (df_in.set_index([df_in.groupby('Product').cumcount().add(1), 'Product'])
              .unstack(0, fill_value=0))
df_in.columns = [f"{x}_{y}" for x, y in df_in.columns]
df_in = df_in.reset_index()
OR via pivot():
df_in = (df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
              .pivot(index='Product', columns='key', values=['Price', 'Qty'])
              .fillna(0, downcast='infer'))
df_in.columns = [f"{x}_{y}" for x, y in df_in.columns]
df_in = df_in.reset_index()
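For reference, the cumcount key that both variants build simply numbers the rows within each Product group; a quick check on the original df_in (my addition, run before df_in is overwritten):

print(df_in.groupby('Product').cumcount().add(1).tolist())
# [1, 2, 3, 4, 1, 1, 2]  -> A has 4 rows, B has 1, C has 2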
Based on @AnuragDabas's pivot solution and @ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Model', 'Price', 'Qty', 'Ratings'])
group_list = ['Product', 'Model']
stats_list = ['Price', 'Qty', 'Ratings']

df_out = df_in.groupby(group_list).head(3)
df_out = (df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
                .pivot(index=group_list, columns='key', values=stats_list)
                .fillna(0, downcast='infer'))
df_out.columns = [f"{x}_{y}" for x, y in df_out.columns]
df_out = df_out.reset_index()
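As a sanity check (my addition; the key range depends on the largest group, here 3 rows), the flattened columns should come out as:

print(df_out.columns.tolist())
# ['Product', 'Model', 'Price_1', 'Price_2', 'Price_3',
#  'Qty_1', 'Qty_2', 'Qty_3', 'Ratings_1', 'Ratings_2', 'Ratings_3']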

Can I join two data frames using one column in df1 and one of any values in a cell in df2?

I'm working with some geospatial data, df_geo, and have a CSV of values I'd like to join to the location data frame, called df_data.
My issue, however, is that there are multiple ways to spell the values in the column I'd like to join the two data frames on (region names). Look at the Catalonia example below, in df_geo: there are 6 different ways to spell the region name, depending on the language.
My question is this: if the row is named "Catalonia" in df_data, how would I go about joining df_data to df_geo?
Since the rows are unique to a region, you can create a dictionary that maps any name in 'VARNAME_1' to the index from df_geo.
Then use this to map the names in df_data to a dummy column, and you can do a simple merge on the index in df_geo and the mapped column in df_data.
To get the dictionary do:
d = dict((y, ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
Sample Data:
import pandas as pd

df_geo = pd.DataFrame({'VARNAME_1': [r'Catalogna\Catalogne\Catalonia', r'A\B\C\D\E\F\G']})
df_data = pd.DataFrame({'Name': ['Catalogna', 'Seven', 'E'],
                        'Vals': [1, 2, 3]})
Code:
d = dict((y, ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
# {'A': 1,
#  'B': 1,
#  'C': 1,
#  'Catalogna': 0,
#  'Catalogne': 0,
#  'Catalonia': 0,
#  'D': 1,
#  'E': 1,
#  'F': 1,
#  'G': 1}

df_data['ID'] = df_data.Name.map(d)
df_data.merge(df_geo, left_on='ID', right_index=True, how='left').drop(columns='ID')
Output:
        Name  Vals                      VARNAME_1
0  Catalogna     1  Catalogna\Catalogne\Catalonia
1      Seven     2                            NaN
2          E     3                  A\B\C\D\E\F\G
How the dictionary works:
df_geo.VARNAME_1.str.split(r'\\') splits each string in VARNAME_1 on the '\' character and places the separated values in a Series of lists. Using .items() on that Series gives you a tuple (which we unpack into two separate values): the first value is the index, which is the same as the index of the original DataFrame, and the second is the list of names produced by the split.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    print(f'id:{ids} and val:{val}')
# id:0 and val:['Catalogna', 'Catalogne', 'Catalonia']
# id:1 and val:['A', 'B', 'C', 'D', 'E', 'F', 'G']
So now val is a list, which we again want to iterate over to create our dictionary.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    for y in val:
        print(f'id:{ids} and y:{y}')
# id:0 and y:Catalogna
# id:0 and y:Catalogne
# id:0 and y:Catalonia
# id:1 and y:A
# id:1 and y:B
# id:1 and y:C
# id:1 and y:D
# id:1 and y:E
# id:1 and y:F
# id:1 and y:G
And so the dictionary I created was with y as the key, and the original DataFrame index ids as the value.
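As a side note (my sketch, not part of the original answer), on pandas 0.25+ the same mapping can be built with Series.explode instead of the nested comprehension:

# each split name becomes its own row, keeping the original df_geo index
names = df_geo.VARNAME_1.str.split(r'\\').explode()
d = {name: idx for idx, name in names.items()}  # same dict as above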
