Here is the code to filter a dataframe based on field and wellname using two dropdowns. The filter is applied to a pandas dataframe, and I want the filtered output (common_filter) to also be a pandas dataframe. Currently, the output is of a widgets type. Is there any way of getting it as a dataframe?
The code below is taken from TowardsDataScience and modified a bit.
"unique_sorted_values" function simply returns a list of unique sorted values of passed array, in this case FieldID and WellnameID
import ipywidgets as widgets
import pandas as pd

# helper described above: sorted list of a column's unique values
def unique_sorted_values(array):
    return sorted(array.unique())

# dummy data
df = pd.DataFrame({'FieldID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'WellnameID': ['1_A', '1_A', '2_A', '1_B', '1_B', '2_B', '2_B'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})

output = widgets.Output()

dropdown_field = widgets.Dropdown(options=unique_sorted_values(df.FieldID))
dropdown_wellname = widgets.Dropdown(options=unique_sorted_values(df[df.FieldID == dropdown_field.value].WellnameID))

def common_filtering(field, wellname):
    output.clear_output()
    common_filter = df[(df.FieldID == field) & (df.WellnameID == wellname)]
    with output:
        display(common_filter)

def dropdown_field_eventhandler(change):
    common_filtering(change.new, dropdown_wellname.value)

def dropdown_wellname_eventhandler(change):
    common_filtering(dropdown_field.value, change.new)

dropdown_field.observe(dropdown_field_eventhandler, names='value')
dropdown_wellname.observe(dropdown_wellname_eventhandler, names='value')

input_widgets = widgets.HBox([dropdown_field, dropdown_wellname])
display(input_widgets)
display(output)
You cannot use the function's return value for the dataframe, because the return value is never assigned to anything in the main body of the code (the function is only invoked from the observe event handlers). As you want to create a whole new dataframe (rather than modify an existing one), a simple way is to use the global keyword on a copied version of the initial data.
After choosing values in the dropdowns, you should be able to access the filtered dataframe in a cell below and see the impact of the filters. If you need anything more complex, you probably want to construct a class object to track the state of the data, apply filters, etc.; a sketch follows the code below.
import ipywidgets as widgets
import pandas as pd

# dummy data
df = pd.DataFrame({'FieldID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'WellnameID': ['1_A', '1_A', '2_A', '1_B', '1_B', '2_B', '2_B'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})

filtered = pd.DataFrame()
output = widgets.Output()

dropdown_field = widgets.Dropdown(options=sorted(df.FieldID.unique()))
dropdown_wellname = widgets.Dropdown(options=sorted(df[df.FieldID == dropdown_field.value].WellnameID.unique()))

def common_filtering(field, wellname):
    global filtered
    output.clear_output()
    filtered = df[(df.FieldID == field) & (df.WellnameID == wellname)]
    with output:
        display(filtered)

def dropdown_field_eventhandler(change):
    common_filtering(change.new, dropdown_wellname.value)

def dropdown_wellname_eventhandler(change):
    common_filtering(dropdown_field.value, change.new)

dropdown_field.observe(dropdown_field_eventhandler, names='value')
dropdown_wellname.observe(dropdown_wellname_eventhandler, names='value')

input_widgets = widgets.HBox([dropdown_field, dropdown_wellname])
display(input_widgets)
display(output)
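If the global variable becomes awkward, here is a minimal sketch of the class-based approach mentioned above; the class and method names are illustrative, not from the original article:

import ipywidgets as widgets
import pandas as pd

class FilterState:
    """Tracks the dropdown state and keeps the filtered frame as an attribute."""
    def __init__(self, df):
        self.df = df
        self.filtered = df.copy()
        self.output = widgets.Output()
        self.dropdown_field = widgets.Dropdown(options=sorted(df.FieldID.unique()))
        self.dropdown_wellname = widgets.Dropdown(
            options=sorted(df[df.FieldID == self.dropdown_field.value].WellnameID.unique()))
        self.dropdown_field.observe(self._refilter, names='value')
        self.dropdown_wellname.observe(self._refilter, names='value')

    def _refilter(self, change):
        # recompute the filtered frame from the current widget values
        self.filtered = self.df[(self.df.FieldID == self.dropdown_field.value) &
                                (self.df.WellnameID == self.dropdown_wellname.value)]
        self.output.clear_output()
        with self.output:
            display(self.filtered)

    def show(self):
        display(widgets.HBox([self.dropdown_field, self.dropdown_wellname]))
        display(self.output)

state = FilterState(df)
state.show()
# in a later cell: state.filtered is a plain pandas DataFrame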
I have the following dataframe
import pandas as pd
df = pd.DataFrame()
df['number'] = (651,651,651,4267,4267,4267,4267,4267,4267,4267,8806,8806,8806,6841,6841,6841,6841)
df['name']=('Alex','Alex','Alex','Ankit','Ankit','Ankit','Ankit','Ankit','Ankit','Ankit','Abhishek','Abhishek','Abhishek','Blake','Blake','Blake','Blake')
df['hours']=(8.25,7.5,7.5,7.5,14,12,15,11,6.5,14,15,15,13.5,8,8,8,8)
df['loc']=('Nar','SCC','RSL','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNI','UNI','UNI','UNKING','UNKING','UNKING','UNKING')
print(df)
If the running balance of an individual's hours reaches 38, an adjustment is made to the cell that reached the 38th hour: a duplicate row is inserted and the balance of hours is carried over to the following row. The following code performs this, and the difference between the original data and the adjusted data can be seen in the output.
# running total of hours within each individual
s = df.groupby('number')['hours'].cumsum()
# rows where the running total exceeds 38
m = s.gt(38)
# index of the first row over 38 for each individual
idx = m.groupby(df['number']).idxmax()
# hours that bring the running total to exactly 38
# (the first row of a group falls back to its own total)
delta = s.groupby(df['number']).shift().rsub(38).fillna(s)
# duplicate the row that crosses 38
out = df.loc[df.index.repeat((df.index.isin(idx) & m) + 1)]
# first copy gets the hours up to 38, second copy gets the remainder
out.loc[out.index.duplicated(keep='last'), 'hours'] = delta
out.loc[out.index.duplicated(), 'hours'] -= delta
print(out)
I then output to csv with the following.
out.to_csv('Output.csv', index = False)
I need to have the row that got adjusted and the row that got inserted highlighted in a color (any color) when it is exported to csv.
UPDATE: as CSV does not support colours in the output, any way to tag the adjusted and inserted rows is acceptable.
You can't add any kind of formatting, including colors, to a CSV. You can however color records in a dataframe.
# single-index:
# load a dataset
import seaborn as sns
df = sns.load_dataset('planets')

# now let's group the data
groups = df.groupby('method').mean()
groups

# highlight the maximum values
groups.style.highlight_max(color='lightgreen')
# multi-index:
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', 'A', 100, 3], ['two', 'A', 101, 4],
                   ['three', 'A', 102, 6], ['one', 'B', 103, 6],
                   ['two', 'B', 104, 0], ['three', 'B', 105, 3]],
                  columns=['c1', 'c2', 'c3', 'c4']).set_index(['c1', 'c2']).sort_index()
print(df)

def highlight_min(data):
    color = 'red'
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_min = data == data.min()
        return [attr if v else '' for v in is_min]
    else:
        is_min = data.groupby(level=0).transform('min') == data
        return pd.DataFrame(np.where(is_min, attr, ''),
                            index=data.index, columns=data.columns)

# style the DataFrame rather than overwrite it with CSS strings
df.style.apply(highlight_min, axis=0)
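For the update: since a CSV cannot carry formatting, a simple alternative is to tag the rows instead of colouring them. A minimal sketch reusing the index.duplicated logic from the question's own code (the column name adjusted is just illustrative):

# the adjusted row and the inserted duplicate share an index value,
# so index.duplicated(keep=False) marks exactly those rows
out['adjusted'] = out.index.duplicated(keep=False)
out.to_csv('Output.csv', index=False)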
I am trying to select the values from the top 3 records of each group in a sorted pandas dataframe and put them into new columns. I have a function that processes each group, but I am having difficulty finding the right method to extract the values, rename the resulting series, and combine the result into a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd

data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Price', 'Qty'])
I am reproducing below two examples of the functions I've tested; I am trying to find a more efficient option that works, especially if I have to process many more columns and records.
Function best3_prices_v1 works, but I have to explicitly specify each column or variable, which becomes an issue as I add more columns.
def best3_prices_v1(x):
    d = {}
    # get best 3 records if available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
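For reference, with the sample data this produces something like the following (exact dtypes may vary by pandas version):

print(df_out)
#   Product  Price_1  Qty_1  Price_2  Qty_2  Price_3  Qty_3
# 0       A     25.0   15.0     30.5   13.0     50.0   14.0
# 1       B    120.0    5.0      0.0    0.0      0.0    0.0
# 2       C    650.0    2.0    680.0    1.0      0.0    0.0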
My second attempt tries to reduce the code and avoid an explicit name for each variable; it is not complete and not working.
def best3_prices_v2(x):
    d = {}
    # get best 3 records if available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get record values for the best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in=(df_in.set_index([df_in.groupby('Product').cumcount().add(1),'Product'])
.unstack(0,fill_value=0))
df_in.columns=[f"{x}_{y}" for x,y in df_in]
df_in=df_in.reset_index()
OR via pivot()
df_in = (df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
              .pivot(index='Product', columns='key', values=['Price', 'Qty'])
              .fillna(0, downcast='infer'))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
Based on #AnuragDabas's pivot solution and #ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Model', 'Price', 'Qty', 'Ratings'])

group_list = ['Product', 'Model']
stats_list = ['Price', 'Qty', 'Ratings']

# keep at most the top 3 rows per group
df_out = df_in.groupby(group_list).head(3)
df_out = (df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
                .pivot(index=group_list, columns='key', values=stats_list)
                .fillna(0, downcast='infer'))
df_out.columns = [f"{x}_{y}" for x, y in df_out]
df_out = df_out.reset_index()
I am writing logic to compare a few values.
I have three lists of values and one rule list
new_values = [1,0,0,0,1,1]
old_1 = [1,1,1,0,0,1]
old_2 = [1,0,1,0,1,1]
# 'a'       -> when a is correct
# 'b'       -> when b is correct
# 'combine' -> when both are correct
# 'leave'   -> when both are wrong
rules = ['a', 'b', 'combine', 'leave']
What I am looking for is to compare new_values against old_1 and old_2 and, based on the result, select a rule from the rules list.
Something like this:
def logic(new_values, old_values, rules):
    rules_result = []
    for new_value, old_value_1, old_value_2 in zip(new_values, old_values[0], old_values[1]):
        if new_value == old_value_1 and new_value == old_value_2:
            # both are correct
            rules_result.append(rules[2])
        elif new_value == old_value_1:
            # a is correct
            rules_result.append(rules[0])
        elif new_value == old_value_2:
            # b is correct
            rules_result.append(rules[1])
        elif new_value != old_value_1 and new_value != old_value_2:
            # both are wrong
            rules_result.append(rules[3])
    return rules_result
Running this code with one rules list gives me this result:
logic(new_values, [old_1, old_2], rules)
output
['combine', 'b', 'leave', 'combine', 'b', 'combine']
I am facing an issue making this code dynamic when I have to compare more than two lists of old values. Say I have three lists of old values; then my rules list expands to cover each combination:
new_values = [1,0,0,0,1,1]
old_1 = [1,1,1,0,0,1]
old_2 = [1,0,1,0,1,1]
old_3 = [0,0,0,1,1,1]
# 'a'           -> when a is correct
# 'b'           -> when b is correct
# 'combine a_b' -> when a and b are correct
# 'select c'    -> when a and c are correct
# 'combine b_c' -> when b and c are correct
# 'select a'    -> when all three are correct
# 'combine'     -> when all three are wrong
rules = ['a', 'b', 'combine a_b', 'select c', 'combine b_c', 'select a', 'combine']
I am getting the rules and values from a different function. I am looking for a rule-selection function: pass in the list of old-value lists (for example 2, 3, or 4 lists) along with the new values and the rules list, then dynamically compare each old list against the new values and select the rule from the rules list.
How can the logic function be made dynamic so it works on more than two lists of old values?
This problem can be solved easily using the concept of a truth table. Your rules list defines the outcome for some boolean values. The outcome doesn't consist of 1's and 0's, so it can't be expressed by truth functions like and, or, xor, but that's not a problem. You can simply rearrange your list by considering the order in the truth table:
# for 2 boolean variables, there are 2 ^ 2 = 4 possibilities
# ab ab ab ab
# 00 01 10 11
rules = ["leave", "b", "a", "combine"]
You can also turn this into a dict so you don't need to comment them to remember which one is what (and as a bonus, it will look like a truth table :)):
# ab
rules = {"00": "leave",
"01": "b",
"10": "a",
"11": "combine"}
Now, define a function to get the related key value for your boolean variables:
def get_rule_key(reference, values):
    """Compares all the values against reference and returns a string for the result."""
    return "".join(str(int(value == reference)) for value in values)
And your logic function will be simply this:
def logic(new_values, old_values, rules):
    rules_result = []
    # zip(new_values, *old_values) yields (new_value, old_1[i], old_2[i], ...)
    for new_value, *old_values in zip(new_values, *old_values):
        key = get_rule_key(new_value, old_values)
        rules_result.append(rules.get(key))
    return rules_result
print(logic(new_values, [old_1, old_2], rules))
# ['combine', 'b', 'leave', 'combine', 'b', 'combine']
For triples, update your rules accordingly:
# for 3 boolean variables, there are 2 ^ 3 = 8 possibilities
# abc
rules = { "000": "combine",
# "001": Not defined in your rules,
"010": "b",
"011": "combine b_c",
"100": "a",
"101": "select c",
"110": "combine a_b"}
"111": "select a"}
print(logic(new_values, [old_1, old_2, old_3], rules))
# ['combine a_b', 'combine b_c', None, 'combine a_b', 'combine b_c', 'select a']
Notes:
None appears in the output because your rules don't define an output for "001", and dict.get returns None by default.
If you want to use a list to define the rules, you have to define all the rules in order and convert the result of get_rule_key to an integer: "011" -> 3. You can manage this with int(x, base=2).
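A minimal sketch of that list-based variant (the helper logic_list is hypothetical; index 0 corresponds to key "000", index 7 to "111"):

# all 8 outcomes in truth-table order; the undefined "001" slot is filled with None
rules_list = ["combine", None, "b", "combine b_c",
              "a", "select c", "combine a_b", "select a"]

def logic_list(new_values, old_values, rules_list):
    result = []
    for new_value, *olds in zip(new_values, *old_values):
        index = int(get_rule_key(new_value, olds), base=2)  # e.g. "011" -> 3
        result.append(rules_list[index])
    return result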
With unknown inputs it will be difficult to get the exact labels you specify.
It would be easy, however, to map which of the old values corresponds to the same new value (positionally speaking). You can use a generic test function that gets a "new" value and all "old" values at that position, maps the old values to 'a', 'b', ... and returns which ones correlate:
new_values = [1,0,0,0,1,1]
old_1 = [1,1,1,0,0,1]
old_2 = [1,0,1,0,1,1]
old_3 = [0,0,0,1,1,1]
old_4 = [0,0,0,1,1,1]
old_5 = [0,0,0,1,1,1]
def test(args):
    nv, remain = args[0], list(args[1:])
    start = ord("a")
    # create dict from letter to its corresponding value
    rv = {chr(start + i): v for i, v in enumerate(remain)}
    # return tuples of the input and the matching outputs
    return ((nv, remain), [k for k, v in rv.items() if v == nv])
rv = []
for values in zip(new_values, old_1, old_2, old_3, old_4, old_5):
    rv.append(test(values))

print(*rv, sep="\n")
print([b for a, b in rv])
Output (manually spaced out):
# nv old_1 old_2 old_3 old_4 old_5
# a b c d e
((1, [ 1, 1, 0, 0, 0]), ['a', 'b'])
((0, [ 1, 0, 0, 0, 0]), ['b', 'c', 'd', 'e'])
((0, [ 1, 1, 0, 0, 0]), ['c', 'd', 'e'])
((0, [ 0, 0, 1, 1, 1]), ['a', 'b'])
((1, [ 0, 1, 1, 1, 1]), ['b', 'c', 'd', 'e'])
((1, [ 1, 1, 1, 1, 1]), ['a', 'b', 'c', 'd', 'e'])
[['a', 'b'], ['b', 'c', 'd', 'e'], ['c', 'd', 'e'], ['a', 'b'],
['b', 'c', 'd', 'e'], ['a', 'b', 'c', 'd', 'e']]
You could then map the joined results to some output:
# mapping needs completion - you'll need to hardcode that,
# but if you consider inputs of up to 5 old values, you only need to specify
# the combinations that can happen for 5, as 4, 3, 2 are automatically included
# if you omit "all of them" as a result.
mapper = {"a": "only a", "ab": "a and b", "ac": "a and c", "abcde": "all of them",
          # complete to your satisfaction on your own
          }

for inp, outp in rv:
    result = ''.join(outp)
    print(mapper.get(result, f"->'{result}' not mapped!"))
to get an output of:
a and b
->'bcde' not mapped!
->'cde' not mapped!
a and b
->'bcde' not mapped!
all of them
I have a pandas DataFrame:
import pandas as pd

sample_data = {'Sample': ['A', 'B', 'A', 'B'],
               'Surface': ['Top', 'Bottom', 'Top', 'Bottom'],
               'Intensity': [21, 32, 14, 45]}
sample_dataframe = pd.DataFrame(data=sample_data)
And I have a function that gets user input to create a 'Condition' column for each 'Sample':
def get_choice(df, column):
    user_input = []
    for i in df[column]:
        print('\n', i)
        user_input.append(input('Condition= '))
    df['Condition'] = user_input
    return df

get_choice(sample_dataframe, 'Sample')
This works; however, the user is prompted for every row in which a 'Sample' exists. That is not a problem in this example, where each Sample has two rows, but when the DataFrame is larger and individual samples occupy many rows, it gets tedious.
How do I create a function that fills the 'Condition' column for every row a 'Sample' occupies while only asking for the input once?
I tried creating a function that returns a dictionary and then .apply()ing it to the DataFrame, but that still asks for input for each instance of the 'Sample'.
If I understand your question right, you want to get user input only once for each unique value and then create column 'Condition':
import pandas as pd

sample_data = {'Sample': ['A', 'B', 'A', 'B'],
               'Surface': ['Top', 'Bottom', 'Top', 'Bottom'],
               'Intensity': [21, 32, 14, 45]}
sample_dataframe = pd.DataFrame(data=sample_data)

def get_choice(df, column):
    m = {}
    # ask once per unique value...
    for v in df[column].unique():
        m[v] = input('Condition for [{}] = '.format(v))
    # ...then broadcast the answers to every matching row
    df['Condition'] = df[column].map(m)
    return df

print(get_choice(sample_dataframe, 'Sample'))
Prints (for example)
Condition for [A] = 1
Condition for [B] = 2
Sample Surface Intensity Condition
0 A Top 21 1
1 B Bottom 32 2
2 A Top 14 1
3 B Bottom 45 2
I'm working with some geospatial data, df_geo, and have a CSV of values I'd like to join to the location data frame, called df_data.
My issue, however, is that there are multiple ways to spell the values in the column I'd like to join the two data frames on (region names). Look at the Catalonia example below, in df_geo: there are 6 different ways to spell the region name, depending on the language.
My question is this: if the row is named "Catalonia" in df_data, how would I go about joining df_data to df_geo?
Since the rows are unique to a region, you can create a dictionary that maps any name in 'VARNAME_1' to the index from df_geo.
Then use this to map the names in df_data to a dummy column, and you can do a simple merge on the index in df_geo and the mapped column in df_data.
To get the dictionary do:
d = dict((y,ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
for y in val)
Sample Data:
import pandas as pd

# raw strings so the backslashes are literal
df_geo = pd.DataFrame({'VARNAME_1': [r'Catalogna\Catalogne\Catalonia', r'A\B\C\D\E\F\G']})
df_data = pd.DataFrame({'Name': ['Catalogna', 'Seven', 'E'],
                        'Vals': [1, 2, 3]})
Code
d = dict((y,ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
for y in val)
#{'A': 1,
# 'B': 1,
# 'C': 1,
# 'Catalogna': 0,
# 'Catalogne': 0,
# 'Catalonia': 0,
# 'D': 1,
# 'E': 1,
# 'F': 1,
# 'G': 1}
df_data['ID'] = df_data.Name.map(d)
df_data.merge(df_geo, left_on='ID', right_index=True, how='left').drop(columns='ID')
Output:
Name Vals VARNAME_1
0 Catalogna 1 Catalogna\Catalogne\Catalonia
1 Seven 2 NaN
2 E 3 A\B\C\D\E\F\G
How the dictionary works:
df_geo.VARNAME_1.str.split(r'\\') splits the string in VARNAME_1 on the '\' character and places the separated values in a Series of lists. Using .items() on that Series gives you tuples (which we unpack into two separate values), with the first value being the index, which is the same as the index of the original DataFrame, and the second being the list of split names.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    print(f'id:{ids} and val:{val}')

#id:0 and val:['Catalogna', 'Catalogne', 'Catalonia']
#id:1 and val:['A', 'B', 'C', 'D', 'E', 'F', 'G']
So now val is a list, which we again want to iterate over to create our dictionary.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    for y in val:
        print(f'id:{ids} and y:{y}')
#id:0 and y:Catalogna
#id:0 and y:Catalogne
#id:0 and y:Catalonia
#id:1 and y:A
#id:1 and y:B
#id:1 and y:C
#id:1 and y:D
#id:1 and y:E
#id:1 and y:F
#id:1 and y:G
And so the dictionary is created with y as the key and the original DataFrame index ids as the value.
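As a side note, an equivalent lookup can be built with Series.explode; a minimal sketch, assuming pandas >= 1.4 for the regex=False argument (the variable name lookup is just illustrative):

# one row per alternative spelling, indexed by the original df_geo row
lookup = df_geo['VARNAME_1'].str.split('\\', regex=False).explode()
d = {name: ids for ids, name in lookup.items()}  # same dictionary as above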