Add values to multiple columns in one go with a new index - pandas - python

import pandas as pd

df = pd.DataFrame(columns=['w', 'x', 'y', 'z'])
I'm trying to insert new index labels row by row, adding values to certain columns as I go.
If I were adding one value to a specific column, I could do: df.loc['a', 'x'] = 2
But what if I'd like to add values to several columns in one go, like this:
{'x': 2, 'z': 3}
Is there a way to do this in pandas?

reindex and assign
df.reindex(['a']).assign(**d)
     w  x    y  z
a  NaN  2  NaN  3
Where:
d = {'x':2, 'z':3}
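Note that reindex(['a']) conforms the frame to exactly the index ['a']; on a non-empty frame it would keep only that row, so this pattern suits the empty starting frame in the question.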

df = pd.DataFrame(d, index=['a']).combine_first(df)
     w  x    y  z
a  NaN  2  NaN  3
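Unlike the reindex approach, combine_first takes the union of the indices and only fills in what is missing, so any rows already in df would be preserved.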

Use loc, selecting multiple columns and assigning an iterable (like a list or tuple):
df.loc['a', ['x', 'z']] = [2, 3]
Or, as suggested by @jfaccioni, in case the data is a dictionary d:
df.loc['a', list(d.keys())] = list(d.values())
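For completeness, a minimal end-to-end run of the loc approach on the empty frame from the question:
import pandas as pd

df = pd.DataFrame(columns=['w', 'x', 'y', 'z'])
d = {'x': 2, 'z': 3}

# assigning to the new label 'a' enlarges the frame, creating the row
df.loc['a', list(d.keys())] = list(d.values())
print(df)
#      w  x    y  z
# a  NaN  2  NaN  3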

Related

How to shift row values left in a dataframe to replace NaN

I have a huge dataframe with 40 columns (10 groups of 4 columns), with values in some groups and NaN in others. I want each row's values shifted left, so that wherever values are present in a row, the final dataframe is filled Group 1 -> Group 2 -> Group 3 and so on.
(A sample dataframe and the required output were shown as images in the original post.)
I have used the code below to shift the values left. However, if a value is missing within an available group (e.g. Item 2 type-1, or Item 3 cat-2), this code ignores that gap and pulls in the value to its right, and so on.
v = df1.values
# one row index per column position, for fancy indexing
a = [[n] * v.shape[1] for n in range(v.shape[0])]
# stable argsort of the NaN mask: non-NaN column positions come first, in order
b = pd.isnull(v).argsort(axis=1, kind='mergesort')
df2 = pd.DataFrame(v[a, b], df1.index, df1.columns)
How to achieve this?
Thanks.
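For context, a minimal reproduction of what that argsort trick does, with a made-up df1 (the original sample was an image): every row's non-NaN values slide left regardless of group boundaries, which is exactly the behaviour described above.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[np.nan, 1.0, np.nan, 2.0],
                    [3.0, np.nan, 4.0, np.nan]], columns=list('abcd'))

a = [[n] * df1.shape[1] for n in range(df1.shape[0])]
b = pd.isnull(df1.values).argsort(axis=1, kind='mergesort')
print(pd.DataFrame(df1.values[a, b], df1.index, df1.columns))
#      a    b   c   d
# 0  1.0  2.0 NaN NaN
# 1  3.0  4.0 NaN NaN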

Pandas. What is the best way to insert additional rows in a dataframe based on cell values?

I have a dataframe like this:
id  name  emails
1   a     a#e.com,b#e.com,c#e.com,d#e.com
2   f     f#gmail.com
I need to iterate over the emails and, where there is more than one, create additional rows in the dataframe for the extra emails (which no longer correspond to a name). It should end up like this:
id  name  emails
1   a     a#e.com
2   f     f#gmail.com
3   NaN   b#e.com
4   NaN   c#e.com
5   NaN   d#e.com
What is the best way to do this, apart from iterrows with append or concat? Is it OK to modify the dataframe while iterating over it?
Thanks.
Use DataFrame.explode on the values split by Series.str.split, then compare the part before the # against name and set name to missing where there is no match; finally, sort so the missing names land at the end of the DataFrame and assign a fresh range to the id column:
import numpy as np

# one row per email
df = df.assign(emails=df['emails'].str.split(',')).explode('emails')
# keep name only where it matches the part before the '#'
mask = df['name'].eq(df['emails'].str.split('#').str[0])
df['name'] = np.where(mask, df['name'], np.nan)
# push rows with missing names to the end, then renumber
df = df.sort_values('name', key=lambda x: x.isna(), ignore_index=True)
df['id'] = range(1, len(df) + 1)
print(df)
   id name       emails
0   1    a      a#e.com
1   2    f  f#gmail.com
2   3  NaN      b#e.com
3   4  NaN      c#e.com
4   5  NaN      d#e.com
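A version note, in case it matters: DataFrame.explode needs pandas >= 0.25, and the key argument to sort_values needs pandas >= 1.1. On older pandas, sorting with na_position='last' (the sort_values default) yields the same "missing names at the end" ordering for this data.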

Store nth row elements in a list - pandas dataframe

I am new to Python. Could you help with the following?
I have a dataframe as follows.
a, d, f & g are the column names; the dataframe can be called df1:
a d f g
20 30 20 20
0 1 NaN NaN
I need to put the second row of df1 into a list, without the NaNs.
Ideally it would be:
x = [0, 1]
Select the second row using df.iloc[1], then remove the NaN values with .dropna(), and finally convert the resulting Series into a Python list with the .tolist() method.
Use:
x = df.iloc[1].dropna().astype(int).tolist()
# x = [0, 1]
Check itertuples().
So you would have something like that:
for row in df1.itertuples():
    row[0]  # the row's index; the rest of the tuple holds the row's values, so you can do whatever you want with them
You can also use iloc and dropna() like this:
row_2 = df1.iloc[1].dropna().to_list()
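A minimal end-to-end run, with a df1 built to match the question's sample (an assumption, since only the printed frame is given):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [20, 0], 'd': [30, 1],
                    'f': [20, np.nan], 'g': [20, np.nan]})

x = df1.iloc[1].dropna().astype(int).tolist()
print(x)  # [0, 1]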

How to "partially transpose" dataframe in Pandas?

I have a CSV file like this:
A,B,C,X
a,a,a,1.0
a,a,a,2.1
a,b,b,1.2
a,b,b,2.4
a,b,b,3.6
b,c,c,1.1
b,c,d,1.0
(A, B, C) is a "primary key" in this dataset, meaning this combination of columns should be unique. What I need to do is find the duplicates and present the associated values (X column) in separate columns, like this:
A,B,C,X1,X2,X3
a,a,a,1.0,2.1,
a,b,b,1.2,2.4,3.6
I somehow know how to find duplicates and aggregate X values into tuples:
df = data.groupby(['A', 'B', 'C']).filter(lambda group: len(group) > 1).groupby(['A', 'B', 'C']).aggregate(tuple)
This is basically what I need, but I struggle with transforming it further.
I don't know how many duplicates for a given key I have in my data, so I need to find some max value and compute columns:
df['items'] = df['X'].apply(lambda x: len(x))
columns = [f'x_{i}' for i in range(1, df['items'].max() + 1)]
and then create new dataframe with new columns:
df2 = pd.DataFrame(df['X'].tolist(), columns=columns)
But at this point I lose the index :shrug:
This page on Pandas docs suggests I should use something like this:
df.pivot(columns=columns, values=['X'])
because df already contains an index, but I get this (confusing) error:
KeyError: "None of [Index(['x_1', 'x_2'], dtype='object')] are in the [columns]"
What am I missing here?
I originally marked this as a duplicate of the infamous pivot question, but since this is a bit different, here's an answer:
(df.assign(col=df.groupby(['A','B','C']).cumcount().add(1))  # 1-based position within each (A, B, C) key
   .pivot_table(index=['A','B','C'], columns='col', values='X')  # spread X across those positions
   .add_prefix('X')  # col 1, 2, 3 -> X1, X2, X3
   .reset_index()
)
Output:
col  A  B  C   X1   X2   X3
0    a  a  a  1.0  2.1  NaN
1    a  b  b  1.2  2.4  3.6
2    b  c  c  1.1  NaN  NaN
3    b  c  d  1.0  NaN  NaN
Note: this only differs from the linked question/answer in that you groupby/pivot on a set of columns instead of one column.
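One detail worth flagging: the desired output in the question keeps only the duplicated keys, while this result keeps every key. If that matters, the frame can be pre-filtered with standard pandas before the pivot:
# keep only (A, B, C) combinations that appear more than once
dup = df[df.duplicated(['A', 'B', 'C'], keep=False)]
and the same assign/pivot_table chain applied to dup.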

Pandas: create a dataframe relating a column to other two columns

I have a dataframe with three columns: A, B, C. Let's say A and B are integer series ranging from 0 to 10. I'd like to create a new dataframe in which the unique values of A form the index, the unique values of B form the columns, and each cell holds the mean of C over the rows where A=i and B=j.
So for instance if we grouped the dataframe like this:
Cvalues = df.groupby(['A','B'],as_index=False).mean()
in the (i,j) position of the dataframe I'd like to create there would be:
Cvalues.loc[Cvalues.A==i].loc[Cvalues.B==j].C
What is the easiest way to do that?
You are almost there. You can either pivot your Cvalues or, better yet, go directly for pivot_table and use its built-in aggfunc option.
df = pd.DataFrame({'A': [2, 0, 1, 1, 2, 0, 1, 0],
                   'B': [1, 2, 1, 0, 1, 2, 1, 1],
                   'C': [10, 20, 30, 40, 50, 60, 70, 80]})
Recommended One-Liner:
res = df.pivot_table(index='A', columns='B', values='C', aggfunc='mean')
Making Your Method Work:
Cvalues = df.groupby(['A','B'],as_index=False).mean()
res = Cvalues.pivot(index='A', columns='B', values='C')
There is little reason to, but just in case, you can make this a little more compact:
res = df.groupby(['A','B'],as_index=False).mean().pivot(index='A', columns='B', values='C')
Here is the result of both ways:
B      0     1     2
A
0    NaN  80.0  40.0
1   40.0  50.0   NaN
2    NaN  30.0   NaN
where, at the intersection of A=2 and B=1: 30.0 = (10 + 50)/2
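As an aside, 'mean' is pivot_table's default aggfunc, so df.pivot_table(index='A', columns='B', values='C') alone gives the same result here.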
