MultiIndex for pandas dataframe - python

I have this code:
new_dict = {'x':[1,2,3,4,5], 'y':[11,22,33,44,55], 'val':[100, 200, 300, 400, 500]}
df = pd.DataFrame.from_dict(new_dict)
val x y
0 100 1 11
1 200 2 22
2 300 3 33
3 400 4 44
4 500 5 55
I want to be able to use values of x and y in combination as index into val,
for example
df[3][33]
300
What's the best way to achieve this? I know it must have to do with multi index, but I am not sure exactly how.

You can either define 2 boolean conditions as a mask and use them with .loc:
df.loc[(df['x']==3) & (df['y']==33), 'val']
otherwise just set the index and then you can use those values to index into the df:
In [233]:
df = df.set_index(['x','y'])
df.loc[3,33]
Out[233]:
val 300
Name: (3, 33), dtype: int64
You could wrap the first version in a function quite easily.
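For example, a minimal sketch of such a wrapper (the name lookup_val is just illustrative, not from the answer):
def lookup_val(frame, x, y):
    # boolean-mask lookup; returns the matching 'val' entries as a Series
    return frame.loc[(frame['x'] == x) & (frame['y'] == y), 'val']
lookup_val(df, 3, 33)  # 2    300
With the set_index version, df.loc[(3, 33), 'val'] likewise returns the scalar 300.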

You can define a function:
new_dict = {'x':[1,2,3,4,5], 'y':[11,22,33,44,55], 'val':[100, 200, 300, 400, 500]}
df = pd.DataFrame.from_dict(new_dict)
def multindex(x, y):
    return df.set_index(['x', 'y']).loc[x, y]
multindex(1, 11)  # will return the row for (1, 11), whose val is 100

Related

Efficient way to approximate values based on position in range in pandas

I'm sure there's a better way to describe what I'm trying to do, but here's an example.
Say I have a dataframe:
d = {'col1': [1, 5, 10, 22, 36, 57], 'col2': [100, 450, 1200, 2050, 3300, 6000]}
df = pd.DataFrame(data=d)
df
   col1  col2
0     1   100
1     5   450
2    10  1200
3    22  2050
4    36  3300
5    57  6000
and a second dataframe (or series I suppose):
d2 = {'col2': [100, 200, 450, 560, 900, 1200, 1450, 1800, 2050, 2600, 3300, 5000, 6000]}
df2 = pd.DataFrame(data=d2)
df2
col2
0 100
1 200
2 450
3 560
4 900
5 1200
6 1450
7 1800
8 2050
9 2600
10 3300
11 5000
12 6000
I need some efficient way to assign values to a second column in df2 as follows:
if the value in df2['col2'] matches a value in df['col2'], assign the value of df['col1'] from the same row;
if there isn't a matching value, find the range it falls in and approximate the value based on that, e.g. for df2.loc[1, 'col2'] the col2 value is 200, which falls between 100 and 450 in the first dataframe, so the new value would be (5-1)/(450-100) * 200 = 2.2857
Edit: the correct example should be (5 - 1) / (450 - 100) * (200 - 100) +1 = 2.1429
Now that you have confirmed your requirement, we can put together a solution. We can use a loop to find segments that are bounded by non-NaN values and linearly interpolate the points in between.
This algorithm only works when col1 is anchored by non-NaN values on both ends, hence the assert statement.
import numpy as np

col1, col2 = df2.merge(df, how='left', on='col2')[['col1', 'col2']].to_numpy().T
assert ~np.isnan(col1[[0, -1]]).any(), 'First and last elements of col1 must not be NaN'

n = len(col1)
i = 0
while i < n:
    j = i + 1
    while j < n and np.isnan(col1[j]):
        j += 1
    if j - i > 1:
        # The linear equation
        f = np.polyfit(col2[[i, j]], col1[[i, j]], deg=1)
        # Apply the equation on all points between i and j
        col1[i:j+1] = np.polyval(f, col2[i:j+1])
    i = j
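The interpolated values end up in the col1 array; writing them back into the frame is one assignment (a small addition to the answer, assuming df2's row order is preserved by the left merge, which holds here since col2 has no duplicates):
df2['col1'] = col1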
Have you considered training a regression model on your first dataframe, then predicting the values on your second?
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
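A minimal sketch of that idea, assuming a plain linear fit is acceptable (the data is not exactly linear, so this approximates rather than reproduces the piecewise interpolation asked for):
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[['col2']], df['col1'])            # learn col1 as a function of col2
df2['col1_pred'] = model.predict(df2[['col2']])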

Dataframe filtering with multiple conditions on different columns

Let's say we have the following dataframe:
data = {'Item': ['1', '2', '3', '4', '5'],
        'A': [142, 11, 50, 60, 12],
        'B': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'D': [60, 0, 150, 170, 130],
        'E': [230, 200, 5, 10, 160]}
df = pd.DataFrame(data)
representing different items and the corresponding values of some of their parameters (e.g. length, width, and so on). However, not all the reported values are acceptable; in fact, each parameter has a different range of allowed values:
A and B go from -100 to +100
C goes from -70 to +70
D and E go from +100 to +300
So, as you can see, some items show values outside limits for just one parameter (e.g. item 2 is outside for parameter D), while others are outside for more than one parameter (e.g. item 1 is outside for both parameters A and D).
I need to analyze these data and get a table reporting:
how many items are outside the limits for just one parameter, and the name of that parameter
how many items are outside the limits for more than one parameter, and the names of those parameters
To be clearer: I need a simple way to know how many items fail and for which parameters. For example: four items out of five fail, and 2 of them (item #1 and item #3) for two parameters (item #1 for A and D, item #3 for B and E), while items #2 and #4 are out for one parameter (item #2 for D, item #4 for E).
I have tried to define the following masks:
df_mask_1 = abs(df['A'])>=100
df_mask_2 = abs(df['B'])>=100
df_mask_3 = abs(df['C'])>=70
df_mask_4 = ((df['D']<=110) | (df['D']>=800))
df_mask_5 = ((df['E']<=110) | (df['E']>=800))
to get the filtered dataframe:
filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5]
but what I obtain is just an empty dataframe. I have also tried with
filtered_df = df.loc[(df_mask_1) & (df_mask_2) & (df_mask_3) & (df_mask_4) & (df_mask_5)]
but the result does not change.
Any suggestion?
Use a condition list, flatten your dataframe with melt, keep the rows where the condition is False (~x), then collect the failed parameter names per item with groupby + apply:
condlist = [df['A'].between(-100, 100),
            df['B'].between(-100, 100),
            df['C'].between(-70, 70),
            df['D'].between(100, 300),
            df['E'].between(100, 300)]

df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)
# Output:
Item A B C D E Fail
0 1 142 55 68 60 230 [A, D]
1 2 11 65 -18 0 200 [D]
2 3 50 130 65 150 5 [B, E]
3 4 60 14 16 170 10 [E]
4 5 12 69 17 130 160 NaN
Note: if your dataframe is large and you only need to display failed items, use df[df['Fail'].notna()] to filter your dataframe.
Note 2: variable and value are the default column names when you melt a dataframe.
The result from the example code is what it should be. The code filtered_df = df[df_mask_1 & df_mask_2 & df_mask_3 & df_mask_4 & df_mask_5] applies all masks to the data at once, so df_mask_1 and df_mask_2 alone would already result in an empty table (there are no items where both A and B are at or above 100 in absolute value).
If you are looking to create a table with information on how many items have parameters outside limits, I'd suggest doing according to Corralien's answer and summing per row to get the amount of limits broken.
Taking Corralien's answer, then
a = pd.concat([df['A'].between(-100, 100),
               df['B'].between(-100, 100),
               df['C'].between(-70, 70),
               df['D'].between(100, 300),
               df['E'].between(100, 300)], axis=1)
b = (~a).sum(axis=1)
where b gives, per item, the number of limits "broken" (the conditions are True when a value is within limits, so they are negated before summing)
0    2
1    1
2    2
3    1
4    0
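If a summary table like the one described in the question is needed (how many items fail exactly one parameter and how many fail several), a minimal sketch building on the Fail column from Corralien's answer:
n_fails = df['Fail'].str.len().fillna(0).astype(int)
print((n_fails == 1).sum())                     # items out of limits for exactly one parameter
print((n_fails > 1).sum())                      # items out of limits for more than one parameter
print(df.loc[n_fails > 0, ['Item', 'Fail']])    # which items fail, and for which parameters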
Another approach is to use the pandas.DataFrame.agg function like so:
import pandas as pd

# Define a limit checker function
def limit_check(values, limit):
    return (abs(values) >= limit).sum()

# Apply a different limit check to each column
res = df.agg({
    'A': lambda x: limit_check(x, 100),
    'B': lambda x: limit_check(x, 100),
    'C': lambda x: limit_check(x, 70),
    'D': lambda x: limit_check(x, 800),
    'E': lambda x: limit_check(x, 800),
})
print(res)
For the sample data you provided it will result in
A 1
B 1
C 0
D 0
E 0
dtype: int64
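Note that the D and E thresholds above mirror the question's own masks (abs(...) >= 800). If the goal is instead to count values outside the stated ranges, the same agg pattern works with between-based checks; a sketch, not part of the original answer:
res = df.agg({
    'A': lambda x: (~x.between(-100, 100)).sum(),
    'B': lambda x: (~x.between(-100, 100)).sum(),
    'C': lambda x: (~x.between(-70, 70)).sum(),
    'D': lambda x: (~x.between(100, 300)).sum(),
    'E': lambda x: (~x.between(100, 300)).sum(),
})
print(res)  # A 1, B 1, C 0, D 2, E 2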
Using the answer by Corralien, taking advantage of what tiitinha wrote, and considering the possibility of having some NaN values, here is how I put it all together (note that the parentheses around the == comparisons are needed, since | binds tighter than ==):
import numpy as np

df.replace([np.nan], np.inf, inplace=True)

condlist = [df['A'].between(-100, 100) | (df['A'] == np.inf),
            df['B'].between(-100, 100) | (df['B'] == np.inf),
            df['C'].between(-70, 70) | (df['C'] == np.inf),
            df['D'].between(100, 300) | (df['D'] == np.inf),
            df['E'].between(100, 300) | (df['E'] == np.inf)]
To get the total number of failed parameters for each item:
bool_df = ~pd.concat(condlist, axis=1).astype('bool')
df['#Fails'] = bool_df.sum(axis=1)
To know which parameters are out of limits for each item:
df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
               .loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)
In this way I get two columns with the wanted results.

How can I make this kind of aggregation in pandas?

I have a dataframe that has categorical columns and numerical columns, and I want some aggregation of the values in the numerical columns (max, min, sum...) depending on the values of the categorical ones (so I have to create new columns for each value that each categorical column can take).
To make it more understandable, it's better to use a toy example.
Say that I have this dataframe:
import pandas as pd
df = pd.DataFrame({
    'ref': [1, 1, 1, 2, 2, 3],
    'value_type': ['A', 'B', 'A', 'C', 'C', 'A'],
    'amount': [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
value_type amount
ref
1 A 100
1 B 50
1 A 20
2 C 300
2 C 150
3 A 70
And I want to aggregate the amounts over the values of value_type, also grouping by each reference. The result in this case (supposing that only the sum is needed) would be this one:
df_result = pd.DataFrame({
    'ref': [1, 2, 3],
    'sum_amount_A': [120, 0, 70],
    'sum_amount_B': [50, 0, 0],
    'sum_amount_C': [0, 450, 0]
}).set_index('ref')
sum_amount_A sum_amount_B sum_amount_C
ref
1 120 50 0
2 0 0 450
3 70 0 0
I have tried something that works but it's extremely inefficient. It takes several minutes to process approximately 30,000 rows.
What I have done is this (I have a dataframe called df_final with only one row per index ref):
df_grouped = df.groupby(['ref'])
for ref in df_grouped.groups:
    df_aux = df.loc[[ref]]
    column = 'A'  # I have more columns, but for illustration one is enough
    for value in df_aux[column].unique():
        df_aux_column_value = df_aux.loc[df_aux[column] == value]
        df_final.at[ref, 'sum_' + column + '_' + str(value)] = np.sum(df_aux_column_value[column])
I'm sure there must be better ways of doing this aggregation... Thanks in advance!!
EDIT:
The answer given is correct when there is only one column to group by. In the real dataframe I have several columns over which I want to calculate some agg functions, but over the values of each column separately. I mean that I don't want an aggregated value for each combination of the values of the columns, but only for each column by itself.
Let's make an example.
import pandas as pd
df = pd.DataFrame({
    'ref': [1, 1, 1, 2, 2, 3],
    'sexo': ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
    'lugar_trabajo': ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
    'dificultad': ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
    'amount': [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
This dataframe looks like that:
sexo lugar_trabajo dificultad amount
ref
1 Hombre Campo Alta 100
1 Hombre Ciudad Media 50
1 Hombre Campo Alta 20
2 Mujer Ciudad Media 300
2 Mujer Ciudad Baja 150
3 Hombre Campo Alta 70
If I group by several columns, or make a pivot table (which in a way is equivalent, as far as I know), doing this:
df.pivot_table(index='ref',columns=['sexo','lugar_trabajo','dificultad'],values='amount',aggfunc=[np.sum,np.min,np.max,len], dropna=False)
I will get a dataframe with 48 columns (because I have 3 * 2 * 2 different values, and 4 agg functions).
A way to achieve the result I want is this:
df_agregado = pd.DataFrame(df.index).set_index('ref')
for col in ['sexo', 'lugar_trabajo', 'dificultad']:
    df_agregado = pd.concat([df_agregado,
                             df.pivot_table(index='ref', columns=[col], values='amount',
                                            aggfunc=[np.sum, np.min, np.max, len])], axis=1)
I do each groupby separately and concat all of the results. This way I get 28 columns (2 * 4 + 3 * 4 + 2 * 4). It works and it's fast, but it's not very elegant. Is there another way of getting this result?
The more efficient way is to use pandas built-in functions instead of for loops. There are two main steps.
First, you need to group not only by the index, but by the index and the column together:
res = df.groupby(['ref','value_type']).sum()
print(res)
The output is like this at this step:
amount
ref value_type
1 A 120
B 50
2 C 450
3 A 70
Second, you need to unstack the multi index, as follows:
df2 = res.unstack(level='value_type',fill_value=0)
The output will be your desired output:
amount
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
As an optional step you can use droplevel to flatten it:
df2.columns = df2.columns.droplevel()
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
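If you also want the column names from the question (sum_amount_A and so on), one small extra step could be, for example:
df2 = df2.add_prefix('sum_amount_')
print(df2.columns.tolist())  # ['sum_amount_A', 'sum_amount_B', 'sum_amount_C']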

pandas replace all values of a column with values that increment by n starting at 0

Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K = (6 * 125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all the data, you just need to do
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
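Applied to the question's frame, the same idea would look roughly like the sketch below. One caveat (an assumption, not from the original answer): the assigned list must have exactly one value per row, and range(0, 751, 125) produces seven values, so it only fits if the file really has seven TS rows.
import pandas as pd

# hypothetical stand-in for df = pd.read_fwf('filename.ene'); only TS matters here
df = pd.DataFrame({'TS': [1, 2, 3, 1, 2, 3]})

# one value per row, stepping by 125: 0, 125, 250, ...
df['TS'] = list(range(0, len(df) * 125, 125))
print(df['TS'].tolist())  # [0, 125, 250, 375, 500, 625]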

What is the right way to loop over a pandas dataframe and apply a condition?

I am trying to loop through a list of dictionaries, comparing a value to a pair of columns in a Pandas dataframe and adding a value to a third column under a certain condition.
My list of dictionaries that looks like this:
dict_list = [{'type': 'highlight', 'id': 0, 'page_number': 4, 'location_number': 40, 'content': 'Foo'}, {'type': 'highlight', 'id': 1, 'page_number': 12, 'location_number': 96, 'content': 'Bar'}, {'type': 'highlight', 'id': 2, 'page_number': 128, 'location_number': 898, 'content': 'Some stuff'}]
My dataframe looks like this:
start end note_count
1 1 100 0
2 101 200 0
3 201 300 0
For each dictionary, I want to pull the "page_number" value and compare it to the "start" and "end" columns in the dataframe rows. If page_number is within the range of those two values in a row, I want to add 1 to the "note_count" column for that row. This is my current code:
for dict in dict_list:
    page_number = dict['page_number']
    for index, row in ventile_frame.iterrows():
        ventile_frame["note_count"][(ventile_frame["start"] <= page_number) & (ventile_frame["end"] >= page_number)] += 1
print(ventile_frame)
I would expect to see a result like this.
start end note_count
1 1 100 2
2 101 200 1
3 201 300 0
Instead, I am seeing this.
start end note_count
1 1 100 9
2 101 200 0
3 201 300 0
Thanks for any help!
You don't need to iterate on the rows of ventile_frame - and that's the beauty of it!
(ventile_frame["start"] <= page_number) & (ventile_frame["end"] >= page_number) will produce a boolean mask indicating whether page_number is within the range of each row. Try it with a fixed value for page_number to understand what's going on:
print((ventile_frame["start"] <= 4) & (ventile_frame["end"] >= 4))
Bottom line is, you just need to iterate on the dicts:
for single_dict in dict_list:
    page_number = single_dict['page_number']
    ventile_frame["note_count"][(ventile_frame["start"] <= page_number) & (ventile_frame["end"] >= page_number)] += 1
print(ventile_frame)
Note that I replaced dict by single_dict in the above code, it's best to avoid shadowing built-in python names.
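One side note (not from the original answer): chained indexing like ventile_frame["note_count"][mask] += 1 can raise SettingWithCopyWarning and may silently stop updating the frame under newer pandas copy-on-write behavior; .loc does the same thing in a single step:
for single_dict in dict_list:
    page_number = single_dict['page_number']
    mask = (ventile_frame["start"] <= page_number) & (ventile_frame["end"] >= page_number)
    ventile_frame.loc[mask, "note_count"] += 1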
Here is a way using IntervalIndex:
m = pd.DataFrame(dict_list)
s = pd.IntervalIndex.from_arrays(df.start, df.end, 'both')
# output -> IntervalIndex([[1, 100], [101, 200], [201, 300]],
#                         closed='both',
#                         dtype='interval[int64]')
n = m.set_index(s).loc[m['page_number']].groupby(level=0)['page_number'].count()
n.index = pd.MultiIndex.from_arrays([n.index])
final = df.set_index(['start', 'end']).assign(new_note_count=n).reset_index()
final['new_note_count'] = final['new_note_count'].fillna(0)
Output:
start end note_count new_note_count
0 1 100 0 2.0
1 101 200 0 1.0
2 201 300 0 0.0
Details:
Once we have the interval index, set it as the index of m and look up the page_number values with .loc[]:
print(m.set_index(s).loc[m['page_number']])
type id page_number location_number content
[1, 100] highlight 0 4 40 Foo
[1, 100] highlight 0 4 40 Foo
[101, 200] highlight 1 12 96 Bar
Then use groupby() to get the counts, convert to a MultiIndex, and assign it back.
I would do this with DataFrame.apply:
First, create a series with the page numbers contained in the dictionaries:
page_serie=pd.Series([dict_t['page_number'] for dict_t in dict_list])
print(page_serie)
0 4
1 12
2 128
dtype: int64
Then, for each row of your dataframe, you determine whether the values of the series fall between 'start' and 'end', and sum the matches across rows:
df['note_count']=df.apply(lambda x: page_serie.between(x['start'],x['end']),axis=1).sum(axis=1)
print(df)
start end note_count
1 1 100 2
2 101 200 1
3 201 300 0
