Iterate over columns of Pandas dataframe and create new variables - python

I am having trouble figuring out how to iterate over variables in a pandas dataframe and perform the same arithmetic function on each.
I have a dataframe df that contains three numeric variables x1, x2 and x3. I want to create three new variables by multiplying each by 2. Here's what I am doing:
existing = ['x1', 'x2', 'x3']
new = ['y1', 'y2', 'y3']
for i in existing:
    for j in new:
        df[j] = df[i] * 2
The above code does in fact create three new variables y1, y2 and y3 in the dataframe. But the values of y1 and y2 get overwritten by the values of y3, so all three variables end up with the same values, corresponding to those of y3. I am not sure what I am missing.
I'd really appreciate any guidance/suggestions. Thanks.

You are looping 9 times here: the inner loop writes all three new columns on every pass of the outer loop, so each outer iteration overwrites the previous one and all three columns end up equal to x3 * 2.
You may want something like
for e, n in zip(existing, new):
    df[n] = df[e] * 2
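For a quick sanity check, here is that fix run end to end on a tiny made-up frame (the values are purely illustrative):
import pandas as pd

df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'x3': [5, 6]})
existing = ['x1', 'x2', 'x3']
new = ['y1', 'y2', 'y3']
for e, n in zip(existing, new):
    df[n] = df[e] * 2
print(df)
#    x1  x2  x3  y1  y2  y3
# 0   1   3   5   2   6  10
# 1   2   4   6   4   8  12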

I would do something more generic:
# existing = ['x1', 'x2', 'x3']
existing = df.columns
new = existing.str.replace('x', 'y')
# (for a plain list instead of an Index, you would need a comprehension/map here)
for col_existing, col_new in zip(existing, new):
    df[col_new] = df[col_existing] * 2
# maybe there is a more elegant way using the pandas assign function
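Picking up that last comment, here is a minimal sketch of the assign-based variant, assuming the three column names from the question:
# build all three doubled columns in one call via a dict comprehension
df = df.assign(**{e.replace('x', 'y'): df[e] * 2 for e in ['x1', 'x2', 'x3']})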

You can concatenate the original DataFrame with the columns with doubled values:
cols_to_double = ['x0', 'x1', 'x2']
new_cols = list(df.columns) + [c.replace('x', 'y') for c in cols_to_double]
df = pd.concat([df, 2 * df[cols_to_double]], axis=1, copy=True)
df.columns = new_cols
So, if your input DataFrame df is:
x0 x1 x2 other0 other1
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
after executing the previous lines, you get:
x0 x1 x2 other0 other1 y0 y1 y2
0 0 1 2 3 4 0 2 4
1 0 1 2 3 4 0 2 4
2 0 1 2 3 4 0 2 4
3 0 1 2 3 4 0 2 4
4 0 1 2 3 4 0 2 4
Here is the code to create df:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data=np.column_stack([np.full((5,), i) for i in range(5)]),
    columns=[f'x{i}' for i in range(3)] + [f'other{i}' for i in range(2)]
)
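A variant of the same idea skips rebuilding the full column list by renaming the doubled slice before concatenating (a sketch using DataFrame.rename with a callable):
df = pd.concat(
    [df, (2 * df[cols_to_double]).rename(columns=lambda c: c.replace('x', 'y'))],
    axis=1,
)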

Related

How to create a pandas dataframe vector that has values based on a groupby

Given the following data:
x1 = 'one'
x2 = 'two'
x3 = 'three'
y1 = 'yes'
y2 = 'no'
n = 3
df = pd.DataFrame(dict(
    a=[x1]*n + [x2]*n + [x3]*n,
    b=[
        y1,
        y1,
        y2,
        y2,
        y2,
        y2,
        y2,
        y2,
        y1,
    ],
))
which looks like:
Out[5]:
a b
0 one yes
1 one yes
2 one no
3 two no
4 two no
5 two no
6 three no
7 three no
8 three yes
I want to know if it's possible to create column c as follows:
Out[5]:
a b c
0 one yes 1
1 one yes 1
2 one no 1
3 two no 0
4 two no 0
5 two no 0
6 three no 1
7 three no 1
8 three yes 1
where c is defined as 1 if, for that group in a, column b contains at least one yes (and 0 otherwise).
I tried the following:
group_results = df.groupby('a').apply(lambda x: 'yes' in x.b.to_list())
group_results = group_results.reset_index()
group_results = group_results.rename(columns={0: 'c'})
df = pd.merge(df, group_results, left_on='a', right_on='a', how='left').copy()
But I feel as though there is a better approach.
Use Series.isin to test for groups with at least one yes in column b, then convert the boolean mask to integers with Series.view:
df['c'] = df['a'].isin(df.loc[df['b'].eq('yes'), 'a']).view('i1')
print(df)
a b c
0 one yes 1
1 one yes 1
2 one no 1
3 two no 0
4 two no 0
5 two no 0
6 three no 1
7 three no 1
8 three yes 1
Detail:
print(df.loc[df['b'].eq('yes'), 'a'])
0 one
1 one
8 three
Name: a, dtype: object
IIUC, you can use groupby + transform with 'any' after grouping the conditional series (which checks whether df['b'] equals 'yes') by df['a'], then chain either astype(int) or view for an integer representation.
df['c'] = df['b'].eq('yes').groupby(df['a']).transform('any').view('i1')
print(df)
a b c
0 one yes 1
1 one yes 1
2 one no 1
3 two no 0
4 two no 0
5 two no 0
6 three no 1
7 three no 1
8 three yes 1
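One caveat on both answers above: newer pandas releases deprecate Series.view, so a more future-proof spelling uses astype instead (same result on the df above):
df['c'] = df['b'].eq('yes').groupby(df['a']).transform('any').astype('i1')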

Adding values of multiple different rows to one row using pandas?

We want to add the values of several different rows into one single row. In the image you can see an example of what we want to do: on the left (columns ABC) the data we have, on the right the data we want.
We have a large dataset and thus want to write a script. Currently we have a pandas dataframe. We want to add five rows into one.
Does anyone have a simple solution?
Image (left what we have, right what we want)
You can do this:
import pandas as pd

# flattens the 2D matrix of all values into a single 1D list,
# which becomes the one row of the new DataFrame
pd.DataFrame([
    [j for i in df.values for j in i]
])
The outer [] in pd.DataFrame([...]) means that the whole flattened list is treated as the first (and only) row, i.e. horizontal formatting.
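A quick self-contained run of that comprehension, assuming a tiny two-row frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(pd.DataFrame([[j for i in df.values for j in i]]))
#    0  1  2  3
# 0  1  3  2  4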
Here's a way you can try:
from itertools import product
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame(np.random.randint(1, 10, size=9).reshape(-1, 3), columns=['X', 'Y', 'Z'])
X Y Z
0 2 6 5
1 5 6 2
2 2 4 5
# total number of values in the frame
total_values = df.count().sum()
# existing column names
cols = df.columns
nums = [1, 2, 3]
# create new column names
new_cols = ['_'.join(str(i) for i in x) for x in product(cols, nums)]
# reshape column by column (order='F') so that X_1..X_3 really are the three X values
df2 = pd.DataFrame(df.values.reshape((-1, total_values), order='F'), columns=new_cols)
X_1 X_2 X_3 Y_1 Y_2 Y_3 Z_1 Z_2 Z_3
0 2 5 2 6 6 4 5 2 5
I'd do this:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3), columns=["X", "Y", "Z"])
print(df)
X Y Z
0 1 2 3
1 4 5 6
2 7 8 9
dat = df.to_numpy()
# stack each column as a 1-row vector, side by side: all of X, then all of Y, then all of Z
d = np.column_stack([dat[:, x].reshape(1, dat.shape[0]) for x in range(dat.shape[1])])
pd.DataFrame(d, columns=(x + str(y) for x in df.columns for y in range(len(df))))
X0 X1 X2 Y0 Y1 Y2 Z0 Z1 Z2
0 1 4 7 2 5 8 3 6 9
Assuming this is a NumPy array (if it's a CSV, you can read it in as a NumPy array):
yourArray.flatten(order='C')
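For completeness, a minimal sketch combining the flatten call with generated column names; order='F' flattens column by column, matching the X0 X1 X2 layout shown above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3), columns=["X", "Y", "Z"])
row = pd.DataFrame(
    [df.to_numpy().flatten(order='F')],
    columns=[c + str(i) for c in df.columns for i in range(len(df))],
)
print(row)
#    X0  X1  X2  Y0  Y1  Y2  Z0  Z1  Z2
# 0   1   4   7   2   5   8   3   6   9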

Return running count of values in a pandas df

I am trying to return a running count based on two columns in a pandas df.
For the df below, I'm trying to determine the count based on column 'Event' and column 'Who'.
import pandas as pd
import numpy as np
d = {
    'Event': ['A', 'B', 'E', '', 'C', 'B', 'B', 'B', 'B', 'E', 'C', 'D'],
    'Space': ['X1', 'X1', 'X2', '', 'X3', 'X3', 'X3', 'X4', 'X3', 'X2', 'X2', 'X1'],
    'Who': ['Home', 'Home', 'Even', 'Out', 'Home', 'Away', 'Home', 'Away', 'Home', 'Even', 'Away', 'Home'],
}
d = pd.DataFrame(data=d)
I have tried the following.
df = d.groupby(['Event', 'Who'])['Space'].count().reset_index(name="count")
Which produces this:
Event Who count
0 Out 1
1 A Home 1
2 B Away 2
3 B Home 3
4 C Away 1
5 C Home 1
6 D Home 1
7 E Even 2
But I would like it to be a running count rather than a total count.
Can df = d.groupby(['Event', 'Who'])['Space'].count().reset_index(name="count") be amended to apply additional constraints, or will it have to be a mask function or similar?
So my intended Output is:
A_Away A_Home B_Away B_Home C_Away C_Home D_Away D_Home Event Space Who
0 1 A X1 Home
1 B X1 Home
2 E X2 Even
3 Out
4 1 C X3 Home
5 1 B X3 Away
6 1 B X3 Home
7 B X4 Away
8 2 B X3 Home
9 2 E X2 Even
10 1 C X2 Away
11 1 D X1 Home
So the count gets added to the row. Not a total count for the entire dataset.
Here are the steps needed to get to your result:
1. Prepare "Who" and "Event" as part of the index
2. Get a cumulative count for each group using groupby and cumcount
3. Reshape your DataFrame to a tabular format using unstack
4. Fix the column headers
5. Concatenate this result with the original using pd.concat
# set the index
v = df.set_index(['Who', 'Event'], append=True)['Space']
# assign `v` the values for the cumulative count
v[:] = df.groupby(['Event', 'Who']).cumcount().add(1)
# reshape `v`
v = v.unstack([1, 2], fill_value='')
# fix your headers
v.columns = v.columns.map('{0[1]}_{0[0]}'.format)
# concatenate the result
pd.concat([v.loc[:, ~v.columns.str.contains('Out')], df], axis=1)
A_Home B_Home E_Even C_Home B_Away C_Away D_Home Event Space Who
0 1 A X1 Home
1 1 B X1 Home
2 1 E X2 Even
3 Out
4 1 C X3 Home
5 1 B X3 Away
6 2 B X3 Home
7 2 B X4 Away
8 3 B X3 Home
9 2 E X2 Even
10 1 C X2 Away
11 1 D X1 Home
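The heart of this answer is cumcount. As a minimal standalone check before any reshaping, this adds the running count per (Event, Who) pair directly to the d frame from the question:
# 1-based running count within each (Event, Who) group
d['running'] = d.groupby(['Event', 'Who']).cumcount() + 1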

Sum up non-unique rows in DataFrame

I have a dataframe like this:
id = [1,1,2,3]
x1 = [0,1,1,2]
x2 = [2,3,1,1]
df = pd.DataFrame({'id':id, 'x1':x1, 'x2':x2})
df
id x1 x2
1 0 2
1 1 3
2 1 1
3 2 1
Some rows have the same id. I want to sum up such rows (over x1 and x2) to obtain a new dataframe with unique ids:
df_new
id x1 x2
1 1 5
2 1 1
3 2 1
An important detail is that the real number of columns x1, x2,... is large, so I cannot apply a function that requires manual input of column names.
As discussed, you can use the pandas groupby function to sum based on the id value:
df.groupby(df.id).sum()
# or
df.groupby('id').sum()
If you don't want id to become the index, then you can:
df.groupby('id').sum().reset_index()
# or
df.groupby('id', as_index=False).sum() # #John_Gait
With pivot_table:
In [31]: df.pivot_table(index='id', aggfunc=sum)
Out[31]:
x1 x2
id
1 1 5
2 1 1
3 2 1

Update in pandas on specific columns

I want to update values in one pandas data frame based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the "key" for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
You see, what it did is replace y and z in df_a with the respective columns from df_b, based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I'd like to do the replacement (in the real problem, I have to update a dataset with a new dataset, where rows of the two match on two or three columns and the values to update come from a fourth column)?
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
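One common pattern worth sketching here (assuming the key values uniquely identify rows): make the key column(s) the index on both frames so that update aligns on them, then restore the index afterwards. A list of columns works too for multi-column keys.
# align on 'y' instead of the default integer index
df_a = df_a.set_index('y')
df_a.update(df_b.set_index('y'))
df_a = df_a.reset_index()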
