Pandas DataFrame filter - python

My question is about the pandas.DataFrame.filter command. It seems that pandas creates a copy of the data frame to write any changes. How am I able to write on the data frame itself?
In other words:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.filter(regex='col1').iloc[0]=10
Output:
col1 col2
0 1 3
1 2 4
Desired Output:
col1 col2
0 10 3
1 2 4

I think you need to extract the column names first and then assign with loc or iloc:
cols = df.filter(regex='col1').columns
df.loc[0, cols]=10
Or:
df.iloc[0, df.columns.get_indexer(cols)] = 10
print (df)
col1 col2
0 10 3
1 2 4
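You cannot assign through filter, because it returns a new Series/DataFrame whose data may be a view or a copy of the original, so the assignment is not guaranteed to reach df. That is why a SettingWithCopyWarning is possible there (or an exception, if you set the mode.chained_assignment option to 'raise').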
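A minimal sketch of the approach above, using the question's own data: filter is used only to find the matching column labels, and the write goes through df itself in a single .loc call. The name `subset` is introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# filter() builds a new DataFrame object; it is not df itself.
subset = df.filter(regex='col1')
print(subset is df)

# Use filter() only to discover the matching column labels...
cols = df.filter(regex='col1').columns

# ...then assign through df in one indexing step.
df.loc[0, cols] = 10
print(df)
```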

Related

Creating a New Column in a Pandas Dataframe in a more pythonic way

I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop which appears to work, however I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n=0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n+=1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
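As an alternative sketch, DataFrame.dot computes the same row-wise dot product without apply or an explicit loop. The frames below are made-up stand-ins for business_df and rating_df (the index labels 'b1' and 'b2' and the 'name' column are assumptions); they only need to share the same index, as described in the question.

```python
import numpy as np
import pandas as pd

fixed_vector = np.array([1, 2, 3])

# Hypothetical stand-ins for the question's frames, sharing one index.
rating_df = pd.DataFrame(
    {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]},
    index=['b1', 'b2'],
)
business_df = pd.DataFrame({'name': ['first', 'second']}, index=['b1', 'b2'])

# One dot product per row of rating_df; the result is a Series that
# is aligned to business_df by index on assignment.
business_df['dot_prod'] = rating_df.dot(fixed_vector)
print(business_df)
```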

Create new dataframe and new column at the same time

I wonder if we can create new DataFrame and new column at once as below.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
# Can I combine the 2 rows into 1
new_df = df
new_df['new column'] = new_df['col1'] * 2 + new_df['col2'] / 4
print(new_df)
You can do this with the .assign() method of a data frame, creating a copy and adding a new column at the same time:
>>> df
col1 col2
0 1 3
1 2 4
>>> new_df = df.assign(col3=df["col1"] * 2 + df["col2"] / 4)
>>> new_df
col1 col2 col3
0 1 3 2.75
1 2 4 5.00
If you just want to make the code look shorter, call assign right after creating the data frame.
The code snippet can look like this:
df = pd.DataFrame(d).assign(new_column=lambda x: x['col1'] * 2 + x['col2'] / 4)
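One detail worth checking in the question's code: `new_df = df` does not create a new data frame, it only gives the same object a second name, so adding a column through new_df also changes df. The sketch below contrasts that with assign; the name `alias` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Plain assignment: both names refer to the same object.
alias = df

# assign returns a new DataFrame and leaves df untouched.
new_df = df.assign(col3=df['col1'] * 2 + df['col2'] / 4)

print(alias is df)
print('col3' in df.columns)
print(new_df)
```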

Pandas DataFrame.assign() doesn't work properly for multiple columns

I am trying to reassign multiple columns in DataFrame with modifications.
The below is a simplified example.
import pandas as pd
d = {'col1':[1,2], 'col2':[3,4]}
df = pd.DataFrame(d)
print(df)
col1 col2
0 1 3
1 2 4
I use assign() method to add 1 to both 'col1' and 'col2'.
However, the result is to add 1 only to 'col2' and copy the result to 'col1'.
df2 = df.assign(**{c: lambda x: x[c] + 1 for c in ['col1','col2']})
print(df2)
col1 col2
0 4 4
1 5 5
Can someone explain why this is happening, and also suggest a correct way to apply assign() to multiple columns?
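The lambda cannot be used this way inside the dict comprehension: each lambda looks up c only when assign calls it, and by then the loop has finished, so every lambda sees c = 'col2'. Pass the computed columns directly instead: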
df.assign(**{c: df[c] + 1 for c in ['col1','col2']})
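If the lambda form is still wanted (for example in a method chain), a common workaround is to bind the current value of c as a default argument, so each lambda keeps its own column name. This is a sketch of that idiom, not the only possible fix.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# c=c freezes the loop variable's value at the time each lambda is created.
df2 = df.assign(**{c: (lambda x, c=c: x[c] + 1) for c in ['col1', 'col2']})
print(df2)
```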

How to split a column by a delimiter, while respecting the relative position of items to be separated

Below is my script for a generic data frame in Python using pandas. I am hoping to split a certain column in the data frame that will create new columns, while respecting the original orientation of the items in the original column.
Please see below for clarity. Thank you in advance!
My script:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
print(df)
Here's what I want
df = pd.DataFrame({'col1': ['x',np.nan,np.nan],
'col2': ['y','a',np.nan],
'col3': ['z','b','c']})
print(df)
Here's what I get
df = pd.DataFrame({'col1': ['x','a','c'],
'col2': ['y','b',np.nan],
'col3': ['z',np.nan,np.nan]})
print(df)
You can use the justify function from this answer with Series.str.split:
dfn = pd.DataFrame(
justify(df['col1'].str.split(',', expand=True).to_numpy(),
invalid_val=None,
axis=1,
side='right')
).add_prefix('col')
col0 col1 col2
0 x y z
1 None a b
2 None None c
Here is a way of tweaking the split:
max_delim = df['col1'].str.count(',').max() #count the max occurrence of `,`
delim_to_add = max_delim - df['col1'].str.count(',') #get difference of count from max
# multiply the delimiter and add it to series, followed by split
df[['col1','col2','col3']] = (df['col1'].radd([','*i for i in delim_to_add])
.str.split(',',expand=True).replace('',np.nan))
print(df)
col1 col2 col3
0 x y z
1 NaN a b
2 NaN NaN c
Try something like
s=df.col1.str.count(',')
#(s.max()-s).map(lambda x : x*',')
#0
#1 ,
#2 ,,
#Name: col1, dtype: object
(s.max()-s).map(lambda x : x*',').add(df.col1).str.split(',',expand=True)
   0  1  2
0  x  y  z
1     a  b
2        c
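Another way to sketch the same right-alignment, without the external justify helper, is to split each string into a list and pad it on the left with NaN up to the longest row. The names `parts`, `width`, `padded` and `out` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})

# Split each string into a list of items.
parts = df['col1'].str.split(',')

# Length of the longest list decides how many columns are needed.
width = int(parts.str.len().max())

# Left-pad shorter lists with NaN so the items end up right-aligned.
padded = parts.apply(lambda p: [np.nan] * (width - len(p)) + p)

out = pd.DataFrame(
    padded.tolist(),
    index=df.index,
    columns=[f'col{i + 1}' for i in range(width)],
)
print(out)
```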

Pandas: Create dataframe from data and column order

what i'm asking must be something very easy, but i honestly can't see it.... :(
I have an array, lets say
data = [[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10,11,12]]
and i want to put it in a dataframe.
I do df = pd.DataFrame(data, columns={'col1', 'col2', 'col3'})
aiming for:
col1 col2 col3
1 2 3
4 5 6
7 8 9
10 11 12
but i am getting:
col3 col1 col2
1 2 3
4 5 6
7 8 9
10 11 12
(notice the discrepancy between column names and data)
I know i can re-arrange the column names order in the dataframe creation, but i'm trying to understand how it works.
Am i doing something wrong, or it's normal behaviour? (why though?)
You are using a {set} of columns, which is NOT an ordered collection (dictionaries only preserve insertion order from Python 3.7 on).
Try with a (tuple), or simply a [list]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
You have to pass a tuple or list as the value of the columns parameter.
In your example you're using a set of columns which is an unordered collection.
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
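A short sketch of the point above. If the column names happen to arrive as a set from elsewhere, one option (an assumption for illustration, not something from the question) is to sort them into a list first so the order is deterministic.

```python
import pandas as pd

data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12]]

# A set has no defined order, so turn it into an ordered list first.
names = sorted({'col1', 'col2', 'col3'})

df = pd.DataFrame(data, columns=names)
print(df)
```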
