Pandas adding calculated vectors into df

My goal is to add formula-based vectors (columns) to the following df:
Day Name a b 1 2 x1 x2
1 ijk 1 2 3 3 0 1
2 mno 2 1 1 3 1 1
outcome:
Day Name a b 1 2 x1 x2 y1 y2 z1 z2
1 ijk 1 2 3 3 0 1 (1*2)+3 (1*2)+3 (1+2)*(3*1+0*1) (1+2)*(3*2+1*2)
2 mno 2 1 1 3 1 1 (2*1)+1 (2*1)+3 (2+1)*(1*1+1*1) (2+1)*(3*2+1*2)
This is my tedious approach:
df['y1'] = df['a'] * df['b'] + df[1]  # y1 = a*b + value of column 1
df['y2'] = df['a'] * df['b'] + df[2]  # y2 = a*b + value of column 2
If column 3 and x3 were added, then y3 = a*b + value of column 3; if column 4 and x4 were added, then y4 = a*b + value of column 4, and so on.
# z1 = (a+b) * [(value of column 1)*1 + (value of column x1)*1]; the "1" comes from the column names 1 and x1
df['z1'] = (df['a'] + df['b']) * (df[1] * 1 + df['x1'] * 1)
# z2 = (a+b) * [(value of column 2)*2 + (value of column x2)*2]; the "2" comes from the column names 2 and x2
df['z2'] = (df['a'] + df['b']) * (df[2] * 2 + df['x2'] * 2)
If column 3 and x3 were added, then z3 = (a+b) * [(value of column 3)*3 + (value of column x3)*3], and so on.
This works fine, but it gets tedious as more columns are added (for example "3 4 ... x3 x4 ..."). Is there a better approach, perhaps using a loop?
Many thanks :)

This is one way:
import pandas as pd

df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 2, 0, 1],
                   [2, 'mno', 2, 1, 1, 3, 1, 1, 1]],
                  columns=['Day', 'Name', 'a', 'b', 1, 2, 3, 'x1', 'x2'])

# Columns 1, 2 and 3 are labelled with integers, so df[i] selects them directly
for i in range(1, 4):
    df['y' + str(i)] = df['a'] * df['b'] + df[i]

# output
# Day Name a b 1 2 3 x1 x2 y1 y2 y3
# 1   ijk  1 2 3 3 2 0  1  5  5  4
# 2   mno  2 1 1 3 1 1  1  3  5  3
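The loop above only covers the y columns. A minimal sketch that extends the same idea to the z columns might look like the following. It assumes the numbered columns are labelled with integers (as in the answer) and that every numbered column i has a matching 'x' + str(i) column; the name `indices` is introduced here purely for illustration.

```python
import pandas as pd

# Data from the question: columns 1 and 2 paired with x1 and x2.
df = pd.DataFrame(
    [[1, "ijk", 1, 2, 3, 3, 0, 1],
     [2, "mno", 2, 1, 1, 3, 1, 1]],
    columns=["Day", "Name", "a", "b", 1, 2, "x1", "x2"],
)

# Assumption: only integer-labelled columns with an "x<i>" partner take part.
indices = [c for c in df.columns if isinstance(c, int) and f"x{c}" in df.columns]

for i in indices:
    # y_i = a*b + value of column i
    df[f"y{i}"] = df["a"] * df["b"] + df[i]
    # z_i = (a+b) * (column_i * i + x_i * i)
    df[f"z{i}"] = (df["a"] + df["b"]) * (df[i] * i + df[f"x{i}"] * i)

print(df)
```

Because the column pairs are discovered from the column labels, adding 3 and x3 (or 4 and x4) to the DataFrame should not require any change to the loop.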

Related

How to convert two rows of data into a single row

I want to convert the data below into a single row per ID using pandas.
Original data:
ID  Name  m1  m2  m3
1   X     2   6   6
1   Y     1   2   3
2   A     2   4   7
2   y     5   6   7
I want to convert it into the format below using the pandas library:
ID  Name1  m1  m2  m3  Name2  m1  m2  m3
1   X      2   6   6   Y      1   2   3
2   A      2   4   7   y      5   6   7

Let's assume this is your data:
import pandas as pd

data = {'ID': [1, 1, 2, 2],
        'Name': ['X', 'Y', 'A', 'y'],
        'm1': [2, 1, 2, 5], 'm2': [6, 2, 4, 6],
        'm3': [6, 3, 7, 7]}
df = pd.DataFrame(data)
Step 1: Sort the data by ID:
df = df.sort_values(by=['ID'])
Step 2: drop duplicates and keep the first records
df1 = df.drop_duplicates(subset=['ID'], keep='first')
Step 3: again drop duplicates but keep the last records
df2 = df.drop_duplicates(subset=['ID'], keep='last')
Step 4: finally, merge the two dataframe on the same ID
df = df1.merge(df2, on='ID')
The output would look like this:
   ID Name_x  m1_x  m2_x  m3_x Name_y  m1_y  m2_y  m3_y
0   1      X     2     6     6      Y     1     2     3
1   2      A     2     4     7      y     5     6     7
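The steps above rely on each ID having exactly two rows (a first and a last). As a hedged alternative, here is a sketch that numbers the rows within each ID and moves that number into the columns, so it would also cope with three or more rows per ID. The helper column `n`, the `fields` list and the `Name_1`-style output names are choices made for this example, not something from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Name": ["X", "Y", "A", "y"],
    "m1": [2, 1, 2, 5],
    "m2": [6, 2, 4, 6],
    "m3": [6, 3, 7, 7],
})

fields = ["Name", "m1", "m2", "m3"]

# Number the rows within each ID: 1 for the first row, 2 for the second, ...
df["n"] = df.groupby("ID").cumcount() + 1

# Pivot the row number into the columns; columns become (field, n) pairs.
wide = df.set_index(["ID", "n"]).unstack("n")

# Put all fields of row 1 first, then all fields of row 2, and so on.
numbers = range(1, int(df["n"].max()) + 1)
wide = wide[[(field, n) for n in numbers for field in fields]]

# Flatten the two-level column labels into single strings.
wide.columns = [f"{field}_{n}" for field, n in wide.columns]
wide = wide.reset_index()

print(wide)
```

IDs with fewer rows than the maximum would get missing values in the trailing columns.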

Split rows to create new rows in Pandas Dataframe with same other row values

I have a pandas dataframe in which one column of text strings contains multiple comma-separated values. I want to split each field and create a new row per entry only where the number of commas is >= 2. For example, a should become b:
In [7]: a
Out[7]:
var1 var2 var3
0 a,b,c,d 1 X1
1 a,b,c,d 1 X2
2 a,b,c,d 1 X3
3 a,b,c,d 1
4 e,f,g 2 Y1
5 e,f,g 2 Y2
6 e,f,g 2
7 h,i 3 Z1
In [8]: b
Out[8]:
var1 var2 var3
0 a,d 1 X1
1 b,d 1 X2
3 c,d 1 X3
4 e,g 2 Y1
5 f,g 2 Y2
6 h,i 3 Z1

You could use a custom function:
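def custom_split(r):
    # Assumes a missing var3 is None or an empty string; such rows return None
    if r['var3']:
        s = r['var1']
        i = int(r['var3'][1:]) - 1  # 'X1' -> 0, 'X2' -> 1, ...
        l = s.split(',')
        return l[i] + ',' + l[-1]

df['var1'] = df.apply(custom_split, axis=1)
df = df.dropna()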
output:
var1 var2 var3
0 a,d 1 X1
1 b,d 1 X2
2 c,d 1 X3
4 e,g 2 Y1
5 f,g 2 Y2
7 h,i 3 Z1

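# Position of each row within its group of identical var1 strings (0, 1, 2, ...)
df['cc'] = df.groupby('var1')['var1'].cumcount()
df['var1'] = df['var1'].str.split(',')
# Pair the element at that position with the last element of the list
df['var1'] = df[['cc','var1']].apply(lambda x: x['var1'][x['cc']]+','+x['var1'][-1],axis=1)
# Rows with a missing var3 are dropped here, then the helper column is removed
df = df.dropna().drop(columns=['cc']).reset_index(drop=True)
df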

You can do so by splitting var1 on the comma into lists. The integer in var3 minus 1 can be interpreted as the index of the item in the var1 list to keep:
import pandas as pd
import io
data = ''' var1 var2 var3
0 a,b,c,d 1 X1
1 a,b,c,d 1 X2
2 a,b,c,d 1 X3
3 a,b,c,d 1
4 e,f,g 2 Y1
5 e,f,g 2 Y2
6 e,f,g 2
7 h,i 3 Z1'''
df = pd.read_csv(io.StringIO(data), sep = r'\s\s+', engine='python')
# Split each string into a list and pair every item except the last with the last item
df['var1'] = df["var1"].str.split(',').apply(lambda x: [[i, x[-1]] for i in x[:-1]])
# Drop rows where var3 is missing; copy so the assignment below does not act on a slice
df = df[df['var3'].notnull()].copy()
# Keep only the pair whose position in the list is the integer in var3 minus 1
df['var1'] = df.apply(lambda x: x['var1'][0 if not x['var3'] else int(x['var3'][1:])-1], axis=1)
Output:
   var1        var2  var3
0  ['a', 'd']  1     X1
1  ['b', 'd']  1     X2
2  ['c', 'd']  1     X3
4  ['e', 'g']  2     Y1
5  ['f', 'g']  2     Y2
7  ['h', 'i']  3     Z1
Run df['var1'] = df['var1'].str.join(',') to reconvert var1 to a string.
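For reference, the same idea can be sketched without parsing a text table, by building the DataFrame explicitly and producing the joined string directly. This is only an illustration under the assumption that missing var3 values are None; the names `out`, `pos` and `parts` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "var1": ["a,b,c,d"] * 4 + ["e,f,g"] * 3 + ["h,i"],
    "var2": [1, 1, 1, 1, 2, 2, 2, 3],
    "var3": ["X1", "X2", "X3", None, "Y1", "Y2", None, "Z1"],
})

# Keep only the rows that carry a var3 label; copy before modifying.
out = df[df["var3"].notna()].copy()

# Position encoded in var3: "X1" -> 0, "Y2" -> 1, ...
pos = out["var3"].str[1:].astype(int) - 1

# Pair the item at that position with the last item of the split list.
parts = out["var1"].str.split(",")
out["var1"] = [f"{p[i]},{p[-1]}" for p, i in zip(parts, pos)]

print(out)
```

The original row labels (0, 1, 2, 4, 5, 7) are preserved; calling reset_index(drop=True) would renumber them.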

Calculate mean of selected columns with multilevel header

I have a dataframe with multilevel headers for the columns like this:
name  1     2     3     4
      x  y  x  y  x  y  x  y
A     1  4  3  7  2  1  5  2
B     2  2  6  1  4  5  1  7
How can I calculate the mean for 1x, 2x and 3x, but not 4x?
I tried:
df['mean']= df[('1','x'),('2','x'),('3','x')].mean()
This did not work; it raises a KeyError. I would like to get:
name  1     2     3     4     mean
      x  y  x  y  x  y  x  y
A     1  4  3  7  2  1  5  2  2
B     2  2  6  1  4  5  1  7  4
Is there a way to calculate the mean while keeping the first column header as an integer?

This is only one solution:
import pandas as pd
iterables = [[1, 2, 3, 4], ["x", "y"]]
array = [
[1, 4, 3, 7, 2, 1, 5, 2],
[2, 2, 6, 1, 4, 5, 1, 7]
]
index = pd.MultiIndex.from_product(iterables)
df = pd.DataFrame(array, index=["A", "B"], columns=index)
df["mean"] = df.xs("x", level=1, axis=1).loc[:,1:3].mean(axis=1)
print(df)
   1     2     3     4     mean
   x  y  x  y  x  y  x  y
A  1  4  3  7  2  1  5  2  2.0
B  2  2  6  1  4  5  1  7  4.0
Steps:
Select all the "x"-columns with df.xs("x", level=1, axis=1)
Select only columns 1 to 3 with .loc[:,1:3]
Calculate the mean value with .mean(axis=1)
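As a complement to the steps above, the columns can also be picked explicitly. In the question's attempt, the comma-separated tuples inside the square brackets form one tuple of tuples, which pandas looks up as a single key; wrapping them in a list selects several columns instead. The sketch below assumes integer first-level labels, as in the answer's setup, so the keys are 1, 2, 3 rather than '1', '2', '3'.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([[1, 2, 3, 4], ["x", "y"]])
df = pd.DataFrame(
    [[1, 4, 3, 7, 2, 1, 5, 2],
     [2, 2, 6, 1, 4, 5, 1, 7]],
    index=["A", "B"],
    columns=columns,
)

# A list of (first level, second level) tuples selects exactly those columns.
selected = df[[(1, "x"), (2, "x"), (3, "x")]]

# axis=1 averages across the selected columns, giving one value per row.
row_means = selected.mean(axis=1)

print(row_means)
```

Listing the tuples is more verbose than the xs/loc chain, but it also works when the wanted columns are not a contiguous range.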

Apply function to dataframe based on column with other dataframe based on index

I would like to perform some operation (e.g. x*apples^y) on the values of column apples, based on their color. The corresponding values are in a separate dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'apples': [2, 1, 5, 6, 7], 'color': [1, 1, 1, 2, 2]})
df2 = pd.DataFrame({'x': [100, 200], 'y': [0.5, 0.3]}).set_index(pd.Index(np.array([1, 2]), name='color'))
I am looking for the following result:
apples color
0 100*2^0.5 1
1 100*1^0.5 1
2 100*5^0.5 1
3 200*6^0.3 2
4 200*7^0.3 2

Use DataFrame.join (a left join by default) first, and then compute with the appended columns:
df = df1.join(df2, on='color')
df['apples'] = df['x'] * df['apples'] ** df['y']
print (df)
apples color x y
0 141.421356 1 100 0.5
1 100.000000 1 100 0.5
2 223.606798 1 100 0.5
3 342.353972 2 200 0.3
4 358.557993 2 200 0.3
Because this is a left join, the row order and index of df1 are preserved, so assigning the result to a column of df1 also works:
df = df1.join(df2, on='color')
df1['apples'] = df['x'] * df['apples'] ** df['y']
print (df1)
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2
Another idea is to use map twice:
df1['apples'] = df1['color'].map(df2['x']) * df1['apples'] ** df1['color'].map(df2['y'])
print (df1)
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2

I think you need pandas.merge:
temp = df1.merge(df2, left_on='color', right_index= True, how='left')
df1['apples'] = (temp['x']*(temp['apples'].pow(temp['y'])))
Output
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2

Data frame group ID, create value: count in column

Given the following sample dataset:
import numpy as np
import pandas as pd
df1 = (pd.DataFrame(np.random.randint(3, size=(5, 4)), columns=('ID', 'X1', 'X2', 'X3')))
print(df1)
ID X1 X2 X3
0 2 2 0 2
1 1 0 2 1
2 1 2 1 1
3 1 2 0 2
4 2 0 0 0
d = {'ID' : pd.Series([1, 2, 1, 4, 5]), 'Tag' : pd.Series(['One', 'Two', 'Two', 'Four', 'Five'])}
df2 = (pd.DataFrame(d))
print(df2)
ID Tag
0 1 One
1 2 Two
2 1 Two
3 4 Four
4 5 Five
df1['Merged_Tags'] = df1.ID.map(df2.groupby('ID').Tag.apply(list))
print(df1)
ID X1 X2 X3 Merged_Tags
0 2 2 0 2 [Two]
1 1 0 2 1 [One, Two]
2 1 2 1 1 [One, Two]
3 1 2 0 2 [One, Two]
4 2 0 0 0 [Two]
Expected output for ID = 1:
1. How would one group by each key and generate a "Tag: Frequency" format in the Merged_Tags column?
   ID  X1  X2  X3  Merged_Tags
1   1   0   2   1  [One: 3, Two: 3]
2. Create a new column for the number of rows with that ID
   ID  X1  X2  X3  Merged_Tags       Frequency
1   1   0   2   1  [One: 3, Two: 3]  3
3. Add the values of column X3 in each row occurrence with the same ID
   ID  X1  X2  X3  Merged_Tags       Frequency  X3++
1   1   0   2   1  [One: 3, Two: 3]  3          4

Shouldn't
1 0 2 1 [One: 3, Two: 3]
be [One: 2, Two: 3] instead? Considering that:
1 : [One, Two]
0 : None
2 : [Two]
1 : [One, Two]
Do you want a total count of each key across the row?
Please help me understand the intuition behind [One: 3, Two: 3] in case I am missing something; otherwise your question should be easy to solve.
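One possible reading of the expected output, offered only as an assumption, is that every tag attached to an ID in df2 is counted once per df1 row with that ID, which would give [One: 3, Two: 3] for ID 1 because ID 1 occurs three times. Under that reading, a sketch could look like this. It uses the printed sample values instead of random data, and the column name `X3_sum` stands in for the question's "X3++".

```python
import pandas as pd

# Fixed copy of the sample printed in the question (the original used random data).
df1 = pd.DataFrame({
    "ID": [2, 1, 1, 1, 2],
    "X1": [2, 0, 2, 2, 0],
    "X2": [0, 2, 1, 0, 0],
    "X3": [2, 1, 1, 2, 0],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 1, 4, 5],
    "Tag": ["One", "Two", "Two", "Four", "Five"],
})

# Number of df1 rows that share each ID.
df1["Frequency"] = df1["ID"].map(df1["ID"].value_counts())

# Sum of X3 over all df1 rows with the same ID.
df1["X3_sum"] = df1.groupby("ID")["X3"].transform("sum")

# Tags attached to each ID in df2.
tags_per_id = df2.groupby("ID")["Tag"].apply(list)

# Assumed reading: each tag of an ID is counted once per df1 row with that ID.
df1["Merged_Tags"] = [
    {tag: freq for tag in tags_per_id.get(id_, [])}
    for id_, freq in zip(df1["ID"], df1["Frequency"])
]

print(df1)
```

If the intended counts are different (for example, how often each tag appears in df2), only the last step would need to change.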
