My goal is to add formula-based vectors to the following df:
Day Name a b 1 2 x1 x2
1 ijk 1 2 3 3 0 1
2 mno 2 1 1 3 1 1
outcome:
Day Name a b 1 2 x1 x2 y1 y2 z1 z2
1 ijk 1 2 3 3 0 1 (1*2)+3 (1*2)+3 (1+2)*(3*1+0*1) (1+2)*(3*2+1*2)
2 mno 2 1 1 3 1 1 (2*1)+1 (2*1)+3 (2+1)*(1*1+1*1) (2+1)*(3*2+1*2)
This is my tedious approach:
df['y1'] = df['a']*df['b']+df[1] # y1 = a*b + value of column 1
df['y2'] = df['a']*df['b']+df[2] # y2 = a*b + value of column 2
If column 3 and x3 were added, then y3 = a*b + value of column 3; if column 4 and x4 were added, then y4 = a*b + value of column 4, and so on.
df['z1'] = (df['a']+df['b'])*(df[1]*1+df['x1']*1) # z1 = (a+b)*[(value of column 1)*1 + (value of column x1)*1]; the "1" comes from the column names 1 and x1
df['z2'] = (df['a']+df['b'])*(df[2]*2+df['x2']*2) # z2 = (a+b)*[(value of column 2)*2 + (value of column x2)*2]; the "2" comes from the column names 2 and x2
If column 3 and x3 were added, then z3 = (a+b)*[(value of column 3)*3 + (value of column x3)*3], and so on.
This works fine, but it gets tedious as more columns are added (for example 3, 4, ... and x3, x4, ...). Is there a better approach, perhaps using a loop?
Many thanks :)
This is one way:
import pandas as pd
df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 2, 0, 1],
[2, 'mno', 2, 1, 1, 3, 1, 1, 1]],
columns=['Day', 'Name', 'a', 'b', 1, 2, 3, 'x1', 'x2'])
for i in range(1, 4):
    df['y'+str(i)] = df['a'] * df['b'] + df[i]
#output
#Day Name a b 1 2 3 x1 x2 y1 y2 y3
#1 ijk 1 2 3 3 2 0 1 5 5 4
#2 mno 2 1 1 3 1 1 1 3 5 3
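The loop above only covers the y columns. A minimal sketch of how the same idea might extend to the z columns is shown below, assuming the numbered columns have integer labels and each has a matching 'x<i>' column. The `suffixes` helper list is a name introduced here for illustration, and the frame is the two-column version from the question.

```python
import pandas as pd

df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 0, 1],
                   [2, 'mno', 2, 1, 1, 3, 1, 1]],
                  columns=['Day', 'Name', 'a', 'b', 1, 2, 'x1', 'x2'])

# Assumption: numbered columns are ints and each has a matching 'x<i>' column.
# Collect the labels first so new columns added in the loop are not revisited.
suffixes = [c for c in df.columns
            if isinstance(c, int) and f'x{c}' in df.columns]

for i in suffixes:
    # y_i = a*b + value of column i
    df[f'y{i}'] = df['a'] * df['b'] + df[i]
    # z_i = (a+b) * (value of column i * i + value of column x_i * i)
    df[f'z{i}'] = (df['a'] + df['b']) * (df[i] * i + df[f'x{i}'] * i)

print(df)
```

Adding a column 3 together with x3 would then produce y3 and z3 without touching the loop.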
I want to convert the data below into one row per ID using pandas.
Original data
ID Name m1 m2 m3
1 X 2 6 6
1 Y 1 2 3
2 A 2 4 7
2 y 5 6 7
I want to convert it into the format below using the pandas library:
ID Name1 m1 m2 m3 Name2 m1 m2 m3
1 X 2 6 6 Y 1 2 3
2 A 2 4 7 y 5 6 7
Let's assume this is your data:
import pandas as pd
data = {'ID':[1, 1, 2, 2],
'Name':['X', 'Y', 'A', 'y'],
'm1':[2, 1, 2, 5], 'm2':[6,2,4,6],
'm3':[6, 3, 7, 7] }
df = pd.DataFrame(data)
Step 1: Sort the data by ID:
df = df.sort_values(by=['ID'])
Step 2: drop duplicates and keep the first records
df1 = df.drop_duplicates(subset=['ID'], keep='first')
Step 3: again drop duplicates but keep the last records
df2 = df.drop_duplicates(subset=['ID'], keep='last')
Step 4: finally, merge the two dataframe on the same ID
df = df1.merge(df2, on='ID')
The output looks like this:
ID Name_x m1_x m2_x m3_x Name_y m1_y m2_y m3_y
0 1 X 2 6 6 Y 1 2 3
1 2 A 2 4 7 y 5 6 7
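The drop_duplicates approach above relies on each ID having exactly two rows. As a rough sketch for the case where an ID can have more rows, one option is to number the rows within each ID and unstack them. The helper column `n`, the `value_cols` list and the `Name_1`-style output names are choices made here for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Name': ['X', 'Y', 'A', 'y'],
                   'm1': [2, 1, 2, 5],
                   'm2': [6, 2, 4, 6],
                   'm3': [6, 3, 7, 7]})

value_cols = ['Name', 'm1', 'm2', 'm3']

# Number the rows within each ID: 1, 2, ...
df['n'] = df.groupby('ID').cumcount() + 1

# One row per ID; columns become (original name, occurrence number) pairs
wide = df.set_index(['ID', 'n'])[value_cols].unstack('n')

# Flatten ('Name', 1) -> 'Name_1', then order the columns by occurrence
wide.columns = [f'{col}_{n}' for col, n in wide.columns]
order = [f'{col}_{n}'
         for n in range(1, df['n'].max() + 1)
         for col in value_cols]
wide = wide[order].reset_index()

print(wide)
```

IDs with fewer rows than the maximum would get missing values in the trailing column blocks.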
I have a pandas dataframe in which one column of text strings contains multiple comma-separated values. I want to split each field and create a new row per entry only where the number of commas is >= 2. For example, a should become b:
In [7]: a
Out[7]:
var1 var2 var3
0 a,b,c,d 1 X1
1 a,b,c,d 1 X2
2 a,b,c,d 1 X3
3 a,b,c,d 1
4 e,f,g 2 Y1
5 e,f,g 2 Y2
6 e,f,g 2
7 h,i 3 Z1
In [8]: b
Out[8]:
var1 var2 var3
0 a,d 1 X1
1 b,d 1 X2
3 c,d 1 X3
4 e,g 2 Y1
5 f,g 2 Y2
6 h,i 3 Z1
You could use a custom function:
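import pandas as pd

def custom_split(r):
    # Rows without a var3 label (empty or NaN) return None and are dropped below
    if pd.notna(r['var3']) and r['var3']:
        s = r['var1']
        i = int(r['var3'][1:])-1  # 'X2' -> index 1
        l = s.split(',')
        return l[i]+','+l[-1]
df['var1'] = df.apply(custom_split, axis=1)
df = df.dropna()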
output:
var1 var2 var3
0 a,d 1 X1
1 b,d 1 X2
2 c,d 1 X3
4 e,g 2 Y1
5 f,g 2 Y2
7 h,i 3 Z1
Another option is to use groupby with cumcount:
df['cc'] = df.groupby('var1')['var1'].cumcount()
df['var1'] = df['var1'].str.split(',')
df['var1'] = df[['cc','var1']].apply(lambda x: x['var1'][x['cc']]+','+x['var1'][-1],axis=1)
df = df.dropna().drop(columns=['cc']).reset_index(drop=True)
df
You can do so by splitting var1 on the comma into lists. The integer in var3 minus 1 can be interpreted as the index of the item in the var1 list to keep:
import pandas as pd
import io
data = ''' var1 var2 var3
0 a,b,c,d 1 X1
1 a,b,c,d 1 X2
2 a,b,c,d 1 X3
3 a,b,c,d 1
4 e,f,g 2 Y1
5 e,f,g 2 Y2
6 e,f,g 2
7 h,i 3 Z1'''
df = pd.read_csv(io.StringIO(data), sep = r'\s\s+', engine='python')
# Split the string into a list and pair every item with the last item in the list
df['var1'] = df["var1"].str.split(',').apply(lambda x: [[i,x[-1]] for i in x[:-1]])
# Drop rows where var3 is missing (copy to avoid assigning into a slice)
df = df[df['var3'].notnull()].copy()
# Keep only the pair in var1 whose index is the integer in var3 minus 1
df['var1'] = df.apply(lambda x: x['var1'][0 if not x['var3'] else int(x['var3'][1:])-1], axis=1)
Output:
   var1        var2  var3
0  ['a', 'd']  1     X1
1  ['b', 'd']  1     X2
2  ['c', 'd']  1     X3
4  ['e', 'g']  2     Y1
5  ['f', 'g']  2     Y2
7  ['h', 'i']  3     Z1
Run df['var1'] = df['var1'].str.join(',') to reconvert var1 to a string.
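For comparison, here is a small self-contained sketch of the same idea without a row-wise apply. It builds the frame directly, with missing var3 values as None, and makes the same assumption as the answers above: var3 is a single letter followed by a 1-based position. The names `out`, `pos` and `parts` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'var1': ['a,b,c,d'] * 4 + ['e,f,g'] * 3 + ['h,i'],
    'var2': [1, 1, 1, 1, 2, 2, 2, 3],
    'var3': ['X1', 'X2', 'X3', None, 'Y1', 'Y2', None, 'Z1'],
})

# Keep only rows that have a var3 label
out = df.dropna(subset=['var3']).copy()

# Zero-based position encoded in var3, e.g. 'X2' -> 1
pos = out['var3'].str[1:].astype(int) - 1
parts = out['var1'].str.split(',')

# Pair the item at that position with the last item of each list
out['var1'] = [f'{p[i]},{p[-1]}' for p, i in zip(parts, pos)]

print(out)
```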
I have a dataframe with multilevel headers for the columns like this:
name 1 2 3 4
x y x y x y x y
A 1 4 3 7 2 1 5 2
B 2 2 6 1 4 5 1 7
How can I calculate the mean for 1x, 2x and 3x, but not 4x?
I tried:
df['mean']= df[('1','x'),('2','x'),('3','x')].mean()
This did not work; it raises a KeyError. I would like to get:
name 1 2 3 4 mean
x y x y x y x y
A 1 4 3 7 2 1 5 2 2
B 2 2 6 1 4 5 1 7 4
Is there a way to calculate the mean while keeping the first column header as an integer?
This is only one solution:
import pandas as pd
iterables = [[1, 2, 3, 4], ["x", "y"]]
array = [
[1, 4, 3, 7, 2, 1, 5, 2],
[2, 2, 6, 1, 4, 5, 1, 7]
]
index = pd.MultiIndex.from_product(iterables)
df = pd.DataFrame(array, index=["A", "B"], columns=index)
df["mean"] = df.xs("x", level=1, axis=1).loc[:,1:3].mean(axis=1)
print(df)
1 2 3 4 mean
x y x y x y x y
A 1 4 3 7 2 1 5 2 2.0
B 2 2 6 1 4 5 1 7 4.0
Steps:
Select all the "x"-columns with df.xs("x", level=1, axis=1)
Select only columns 1 to 3 with .loc[:,1:3]
Calculate the mean value with .mean(axis=1)
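As a variation on the steps above, the columns can also be picked with an explicit list of (level 0, level 1) tuples, which is close to what the question attempted. This is only a sketch: it assumes the first header level holds integers, as in the frame built above, so the tuples use 1 rather than '1'. The names `wanted` and `row_mean` are introduced here for illustration.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([[1, 2, 3, 4], ['x', 'y']])
df = pd.DataFrame([[1, 4, 3, 7, 2, 1, 5, 2],
                   [2, 2, 6, 1, 4, 5, 1, 7]],
                  index=['A', 'B'], columns=columns)

# A list of tuples (note the double brackets in df[[...]]) selects those columns
wanted = [(1, 'x'), (2, 'x'), (3, 'x')]

# axis=1 averages across the selected columns for each row
row_mean = df[wanted].mean(axis=1)
df['mean'] = row_mean

print(df)
```

Two things differ from the attempt in the question: the tuples are wrapped in a list, and mean is called with axis=1 so the result has one value per row.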
I would like to perform some operation (e.g. x*apples^y) on the values of column apples, based on their color. The corresponding values are in a separate dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'apples': [2, 1, 5, 6, 7], 'color': [1, 1, 1, 2, 2]})
df2 = pd.DataFrame({'x': [100, 200], 'y': [0.5, 0.3]}, index=pd.Index([1, 2], name='color'))
I am looking for the following result:
apples color
0 100*2^0.5 1
1 100*1^0.5 1
2 100*5^0.5 1
3 200*6^0.3 2
4 200*7^0.3 2
Use DataFrame.join (a left join by default) first, then compute with the appended columns:
df = df1.join(df2, on='color')
df['apples'] = df['x'] * df['apples'] ** df['y']
print (df)
apples color x y
0 141.421356 1 100 0.5
1 100.000000 1 100 0.5
2 223.606798 1 100 0.5
3 342.353972 2 200 0.3
4 358.557993 2 200 0.3
Because this is a left join, the result keeps the index of df1, so assigning the computed values back to a column in df1 should also work:
df = df1.join(df2, on='color')
df1['apples'] = df['x'] * df['apples'] ** df['y']
print (df1)
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2
Another idea is to use map twice:
df1['apples'] = df1['color'].map(df2['x']) * df1['apples'] ** df1['color'].map(df2['y'])
print (df1)
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2
I think you need pandas.merge -
temp = df1.merge(df2, left_on='color', right_index= True, how='left')
df1['apples'] = (temp['x']*(temp['apples'].pow(temp['y'])))
Output
apples color
0 141.421356 1
1 100.000000 1
2 223.606798 1
3 342.353972 2
4 358.557993 2
Given the following sample dataset:
import numpy as np
import pandas as pd
df1 = (pd.DataFrame(np.random.randint(3, size=(5, 4)), columns=('ID', 'X1', 'X2', 'X3')))
print(df1)
ID X1 X2 X3
0 2 2 0 2
1 1 0 2 1
2 1 2 1 1
3 1 2 0 2
4 2 0 0 0
d = {'ID' : pd.Series([1, 2, 1, 4, 5]), 'Tag' : pd.Series(['One', 'Two', 'Two', 'Four', 'Five'])}
df2 = (pd.DataFrame(d))
print(df2)
ID Tag
0 1 One
1 2 Two
2 1 Two
3 4 Four
4 5 Five
df1['Merged_Tags'] = df1.ID.map(df2.groupby('ID').Tag.apply(list))
print(df1)
ID X1 X2 X3 Merged_Tags
0 2 2 0 2 [Two]
1 1 0 2 1 [One, Two]
2 1 2 1 1 [One, Two]
3 1 2 0 2 [One, Two]
4 2 0 0 0 [Two]
Expected output for ID = 1:
1. How would one group by each key and generate a Tag: Frequency format in the Merged_Tags column?
ID X1 X2 X3 Merged_Tags
1 1 0 2 1 [One: 3, Two: 3]
2. Create a new column for the number of rows with that ID
ID X1 X2 X3 Merged_Tags Frequency
1 1 0 2 1 [One: 3, Two: 3] 3
3. Add the values of column X3 across all rows with the same ID
ID X1 X2 X3 Merged_Tags Frequency X3++
1 1 0 2 1 [One: 3, Two: 3] 3 4
1 0 2 1 [One: 3, Two: 3]
Shouldn't this be [One: 2, Two: 3] instead? Considering that:
1 : [One, Two]
0 : None
2 : [Two]
1 : [One, Two]
Do you want a total count of each key in the row?
Please help me understand the intuition behind [One: 3, Two: 3] in case I am missing something here; otherwise your question should be easy to solve.
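One possible reading of [One: 3, Two: 3] is that each tag attached to an ID in df2 is counted once for every df1 row with that ID (ID 1 appears in three rows, each carrying [One, Two]). The sketch below works under that assumption, which the question does not confirm. It hard-codes df1 from the printed sample instead of drawing random values, and the names `X3_sum`, `rows_per_id` and `tag_counts` are introduced here for illustration.

```python
from collections import Counter

import pandas as pd

# df1 copied from the printed sample, so the result is reproducible
df1 = pd.DataFrame({'ID': [2, 1, 1, 1, 2],
                    'X1': [2, 0, 2, 2, 0],
                    'X2': [0, 2, 1, 0, 0],
                    'X3': [2, 1, 1, 2, 0]})
df2 = pd.DataFrame({'ID': [1, 2, 1, 4, 5],
                    'Tag': ['One', 'Two', 'Two', 'Four', 'Five']})

# 2. Number of df1 rows that share each ID
rows_per_id = df1['ID'].value_counts()
df1['Frequency'] = df1['ID'].map(rows_per_id)

# 3. Sum of X3 over all rows with the same ID
df1['X3_sum'] = df1.groupby('ID')['X3'].transform('sum')

# 1. Assumed reading: tag count in df2 times the number of df1 rows for that ID
tag_counts = {key: Counter(tags) for key, tags in df2.groupby('ID')['Tag']}
df1['Merged_Tags'] = [
    {tag: count * int(freq) for tag, count in tag_counts.get(key, {}).items()}
    for key, freq in zip(df1['ID'], df1['Frequency'])
]

print(df1)
```

If the intended count is something else, only the last step would need to change.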