I'm sure there's a better way to describe what I'm trying to do, but here's an example.
Say I have a dataframe:
d = {'col1': [1, 5, 10, 22, 36, 57], 'col2': [100, 450, 1200, 2050, 3300, 6000]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 100
1 5 450
2 10 1200
3 22 2050
and a second dataframe (or series I suppose):
d2 = {'col2': [100, 200, 450, 560, 900, 1200, 1450, 1800, 2050, 2600, 3300, 5000, 6000]}
df2 = pd.DataFrame(data=d2)
df2
col2
0 100
1 200
2 450
3 560
4 900
5 1200
6 1450
7 1800
8 2050
9 2600
10 3300
11 5000
12 6000
I need some efficient way to assign a value to a second column in df2 in the following way:
if the value in df2['col2'] matches a value in df['col2'], assign the value of df['col1'] in the same row.
if there isn't a matching value, find the range it fits in and approximate the value based on that. e.g for df2.loc[1,'col2'], the col2 value is 200, and it belongs between 100 and 450 in the first dataframe, so the new value would be (5-1)/(450-100) *200 = 2.2857
Edit: the correct example should be (5 - 1) / (450 - 100) * (200 - 100) +1 = 2.1429
Now that you have confirmed your requirement, here is a solution. We can use a loop to find segments that are bounded by non-NaN values and linearly interpolate the points in between.
This algorithm only works when col1 is anchored by non-NaN values on both ends, hence the assert statement.
col1, col2 = df2.merge(df, how='left', on='col2')[['col1', 'col2']].to_numpy().T
assert ~np.isnan(col1[[0, -1]]).any(), 'First and last elements of col1 must not be NaN'
n = len(col1)
i = 0
while i < n:
    j = i + 1
    while j < n and np.isnan(col1[j]):
        j += 1
    if j - i > 1:
        # The linear equation
        f = np.polyfit(col2[[i,j]], col1[[i,j]], deg=1)
        # Apply the equation on all points between i and j
        col1[i:j+1] = np.polyval(f, col2[i:j+1])
    i = j
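If a built-in routine is acceptable, the same piecewise-linear rule can probably be expressed with np.interp, which returns the known col1 value at exact matches and interpolates linearly in between. This is a minimal sketch rather than a drop-in replacement. It assumes df['col2'] is sorted in increasing order (np.interp requires that) and that every value in df2['col2'] lies inside the range of df['col2']; values outside that range are clamped to the end points, not extrapolated.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 5, 10, 22, 36, 57],
                   'col2': [100, 450, 1200, 2050, 3300, 6000]})
df2 = pd.DataFrame({'col2': [100, 200, 450, 560, 900, 1200, 1450,
                             1800, 2050, 2600, 3300, 5000, 6000]})

# x = points to evaluate, xp = known col2 values, fp = known col1 values
df2['col1'] = np.interp(df2['col2'], df['col2'], df['col1'])
print(df2.head(3))
```

For df2.loc[1, 'col2'] == 200 this evaluates (5 - 1) / (450 - 100) * (200 - 100) + 1, matching the corrected example in the question.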
Have you considered training a regression model on your first dataframe, then predicting the values for your second?
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Please advise how to get the following output:
df1 = pd.DataFrame([['1, 2', '2, 2','3, 2','1, 1', '2, 1','3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np
df22 = df2.rename(index = lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), axis=1)
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 holds 'addresses' into df2 in row, col format (1,2 is the first row, second column, which is 2; 2,2 is 4; 3,2 is 6; etc.).
I need to combine these with the values from the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear in your question why df1 is a Dataframe or if it can have more than one row, therefore I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j]*df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
I have a dataframe that has categorical columns and numerical columns, and I want to aggregate the values of the numerical columns (max, min, sum...) depending on the value of the categorical ones (so I have to create new columns for each value that each categorical column can take).
To make it more understandable, here is a toy example.
Say that I have this dataframe:
import pandas as pd
df = pd.DataFrame({
    'ref' : [1, 1, 1, 2, 2, 3],
    'value_type' : ['A', 'B', 'A', 'C', 'C', 'A'],
    'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
value_type amount
ref
1 A 100
1 B 50
1 A 20
2 C 300
2 C 150
3 A 70
And I want to group the amounts on the values of value_type, grouping also for each reference. The result in this case (supposing that only the sum was needed) will be this one:
df_result = pd.DataFrame({
    'ref' : [1, 2, 3],
    'sum_amount_A' : [120, 0, 70],
    'sum_amount_B' : [50, 0, 0],
    'sum_amount_C' : [0, 450, 0]
}).set_index('ref')
sum_amount_A sum_amount_B sum_amount_C
ref
1 120 50 0
2 0 0 450
3 70 0 0
I have tried something that works, but it's extremely inefficient: it takes several minutes to process approximately 30,000 rows.
What I have done is this (df_final is a dataframe with a single row for each index ref):
df_grouped = df.groupby(['ref'])
for ref in df_grouped.groups:
    df_aux = df.loc[[ref]]
    column = 'value_type' # I have more columns, but for illustration one is enough
    for value in df_aux[column].unique():
        df_aux_column_value = df_aux.loc[df_aux[column] == value]
        df_final.at[ref,'sum_' + column + '_' + str(value)] = np.sum(df_aux_column_value['amount'])
I'm sure there should be better ways of doing this aggregation... Thanks in advance!!
EDIT:
The answer given is correct when there is only one column to group by. In the real dataframe I have several columns for which I want to calculate some agg functions, but on the values of each column separately. That is, I don't want an aggregated value for each combination of values across the columns, only for each column by itself.
Let's make an example.
import pandas as pd
df = pd.DataFrame({
    'ref' : [1, 1, 1, 2, 2, 3],
    'sexo' : ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
    'lugar_trabajo' : ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
    'dificultad' : ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
    'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
This dataframe looks like this:
sexo lugar_trabajo dificultad amount
ref
1 Hombre Campo Alta 100
1 Hombre Ciudad Media 50
1 Hombre Campo Alta 20
2 Mujer Ciudad Media 300
2 Mujer Ciudad Baja 150
3 Hombre Campo Alta 70
If I group by several columns, or make a pivot table (which in a way is equivalent, as far as I know), doing this:
df.pivot_table(index='ref',columns=['sexo','lugar_trabajo','dificultad'],values='amount',aggfunc=[np.sum,np.min,np.max,len], dropna=False)
I will get a dataframe with 48 columns (because I have 3 * 2 * 2 different values, and 4 agg functions).
One way to achieve the result that I want is this:
df_agregado = pd.DataFrame(df.index.unique()).set_index('ref')  # one row per ref
for col in ['sexo','lugar_trabajo','dificultad']:
    df_agregado = pd.concat([df_agregado, df.pivot_table(index='ref',columns=[col],values='amount',aggfunc=[np.sum,np.min,np.max,len])],axis=1)
I do each group by alone, and concat all of them. In this way I get 28 columns (2 * 4 + 3 * 4 + 2 * 4). It works and it's fast, but it's not very elegant. Is there another way of getting this result??
A more efficient way is to use pandas built-in functions instead of for loops. There are two main steps.
First, group by both the index and the column, not by the index alone:
res = df.groupby(['ref','value_type']).sum()
print(res)
The output is like this at this step:
amount
ref value_type
1 A 120
B 50
2 C 450
3 A 70
Second, you need to unstack the multi index, as follows:
df2 = res.unstack(level='value_type',fill_value=0)
The output will be your desired output:
amount
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
As an optional step you can use droplevel to flatten it:
df2.columns = df2.columns.droplevel()
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
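For the several-column case described in the question's EDIT, one possible extension of the same groupby/unstack idea is to melt the categorical columns into long format first, so that each column is aggregated on its own and not in combination with the others. This is only a sketch under assumptions: the names 'column', 'category', long_df and wide are invented for illustration, and 'count' stands in for the len aggregation used in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ref': [1, 1, 1, 2, 2, 3],
    'sexo': ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
    'lugar_trabajo': ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
    'dificultad': ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
    'amount': [100, 50, 20, 300, 150, 70],
})

# Long format: one row per (ref, amount, categorical column, category value).
long_df = df.melt(id_vars=['ref', 'amount'],
                  var_name='column', value_name='category')

# Aggregate per ref and per (column, category), then spread the pairs
# into columns; missing pairs are filled with 0.
wide = (long_df.groupby(['ref', 'column', 'category'])['amount']
        .agg(['sum', 'min', 'max', 'count'])
        .unstack(['column', 'category'], fill_value=0))

print(wide['sum'])
```

With the toy data this should give 28 columns (7 column/category pairs times 4 aggregations), the same count as the concat-based approach in the question.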
Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K=(6*125)+1
m = []
for i in range(0,K,125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all the data, you just need to do
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
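One caveat with direct assignment: the new values must have exactly as many entries as the frame has rows, otherwise pandas raises a ValueError. range(0, 751, 125) yields seven values (0, 125, ..., 750), so it would not fit a six-row frame. The sketch below derives the values from the row count instead. It assumes a hypothetical six-row frame standing in for the file contents, and that a start of 0 with a step of 125 is what is wanted.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from the .ene file.
df = pd.DataFrame({'TS': [1, 2, 3, 1, 2, 3]})

step = 125
# One value per row: 0, 125, 250, ... up to step * (len(df) - 1).
df['TS'] = list(range(0, step * len(df), step))
print(df)
```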
I have many dataframes with individual counts (e.g. df_boston below). Each row defines a data point that is uniquely identified by its marker and its point. I have a summary dataframe (df_inventory_master) that has custom bins (the points above map to the Begin-End coordinates in the master). I want to add a column to this dataframe for each individual city that sums the counts from that city in a new column. An example is shown.
Two quirks are that the bins in the master frame can be overlapping (the count should be added to both) and that some counts may not fall in the master (the count should be ignored).
I can do this in pure Python but since the data are in dataframes it would be helpful and likely faster to do the manipulations in pandas. I'd appreciate any tips here!
This is the master frame:
>>> df_inventory_master = pd.DataFrame({'Marker': [1, 1, 1, 2],
... 'Begin': [100, 300, 500, 100],
... 'End': [200, 600, 900, 250]})
>>> df_inventory_master
Begin End Marker
0 100 200 1
1 300 600 1
2 500 900 1
3 100 250 2
This is data for one city:
>>> df_boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
... 'Point': [140, 180, 250, 500],
... 'Count': [14, 600, 1000, 700]})
>>> df_boston
Count Marker Point
0 14 1 140
1 600 1 180
2 1000 1 250
3 700 1 500
This is the desired output.
- Note that the count of 700 (Marker 1, Point 500) falls in 2 master bins and is counted for both.
- Note that the count of 1000 (Marker 1, Point 250) does not fall in a master bin and is not counted.
- Note that nothing maps to Marker 2 because df_boston does not have any Marker 2 data.
>>> desired_frame
Begin End Marker boston
0 100 200 1 614
1 300 600 1 700
2 500 900 1 700
3 100 250 2 0
What I've tried: I looked at the pd.cut() function, but with the nature of the bins overlapping, and in some cases absent, this does not seem to fit. I can add the column filled with 0 values to get part of the way there but then will need to find a way to sum the data in each frame, using bins defined in the master.
>>> df_inventory_master['boston'] = pd.Series([0 for x in range(len(df_inventory_master.index))], index=df_inventory_master.index)
>>> df_inventory_master
Begin End Marker boston
0 100 200 1 0
1 300 600 1 0
2 500 900 1 0
3 100 250 2 0
Here is how I approached it: a SQL-style left join using the pandas merge operation, then apply() across the row axis with a lambda that decides whether each record falls inside the bin, and finally a groupby and sum:
df_merged = df_inventory_master.merge(df_boston, on=['Marker'], how='left')
# logical overwrite of count: keep it only when the point falls inside the bin
df_merged['Count'] = df_merged.apply(lambda x: x['Count'] if x['Begin'] <= x['Point'] <= x['End'] else 0, axis=1)
df_agged = df_merged[['Begin','End','Marker','Count']].groupby(['Begin','End','Marker']).sum()
df_agged_resorted = df_agged.sort_index(level=['Marker','Begin','End'])
df_agged_resorted = df_agged_resorted.astype(int)
df_agged_resorted.columns = ['boston'] # rename the count column to boston.
print(df_agged_resorted)
And the result is
boston
Begin End Marker
100 200 1 614
300 600 1 700
500 900 1 700
100 250 2 0
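The row-wise apply() can likely be avoided with a boolean mask, which tends to matter on larger frames. The following is a sketch of that variation, not the answer's original method. It assumes the master frame has a unique index so results can be mapped back to its rows, and the variable names (master, boston, merged, in_bin, counts) are chosen for illustration.

```python
import pandas as pd

master = pd.DataFrame({'Marker': [1, 1, 1, 2],
                       'Begin': [100, 300, 500, 100],
                       'End': [200, 600, 900, 250]})
boston = pd.DataFrame({'Marker': [1, 1, 1, 1],
                       'Point': [140, 180, 250, 500],
                       'Count': [14, 600, 1000, 700]})

# Keep the master row number as a column ('index') so overlapping bins
# can each collect their own counts.
merged = master.reset_index().merge(boston, on='Marker', how='left')

# True where the point lies inside the bin (inclusive); NaN points are False.
in_bin = merged['Point'].between(merged['Begin'], merged['End'])

counts = merged.loc[in_bin].groupby('index')['Count'].sum()
master['boston'] = counts.reindex(master.index, fill_value=0).astype(int)
print(master)
```

Repeating the last four statements per city frame would add one column per city to the master.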
I have this code:
new_dict = {'x':[1,2,3,4,5], 'y':[11,22,33,44,55], 'val':[100, 200, 300, 400, 500]}
df = pd.DataFrame.from_dict(new_dict)
val x y
0 100 1 11
1 200 2 22
2 300 3 33
3 400 4 44
4 500 5 55
I want to be able to use values of x and y in combination as index into val,
for example
df[3][33]
300
What's the best way to achieve this? I know it must have to do with multi index, but I am not sure exactly how.
You can either define 2 boolean conditions as a mask and use with .loc:
df.loc[(df['x']==3) & (df['y']==33), 'val']
Otherwise, just set the index and then you can use those values to index into the df:
In [233]:
df = df.set_index(['x','y'])
df.loc[3,33]
Out[233]:
val 300
Name: (3, 33), dtype: int64
You could wrap the first version into a function quite easily.
You can define a function:
new_dict = {'x':[1,2,3,4,5], 'y':[11,22,33,44,55], 'val':[100, 200, 300, 400, 500]}
df = pd.DataFrame.from_dict(new_dict)
def multindex(x,y):
    return df.set_index(['x','y']).loc[x,y]
multindex(1,11) # returns the matching row as a Series (val = 100)
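Since the function above rebuilds the index on every call, it may be worth setting the index once and reusing it when many lookups are needed. A minimal sketch, assuming each (x, y) pair is unique; the names indexed, value and lookup are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [11, 22, 33, 44, 55],
                   'val': [100, 200, 300, 400, 500]})

# Build the MultiIndex once and reuse it for every lookup.
indexed = df.set_index(['x', 'y'])

# Passing the (x, y) tuple plus the column name returns a scalar.
value = indexed.loc[(3, 33), 'val']

# For many plain lookups, a dict keyed by (x, y) tuples is another option.
lookup = indexed['val'].to_dict()
print(value, lookup[(1, 11)])
```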