I wonder if we can create a new DataFrame and a new column in one step, as below.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
# Can I combine the next 2 lines into 1?
new_df = df
new_df['new column'] = new_df['col1'] * 2 + new_df['col2'] / 4
print(new_df)
You can do this with the .assign() method of a data frame, creating a copy and adding a new column at the same time:
>>> df
col1 col2
0 1 3
1 2 4
>>> new_df = df.assign(col3=df["col1"] * 2 + df["col2"] / 4)
>>> new_df
col1 col2 col3
0 1 3 2.75
1 2 4 5.00
If you just want to make the code look shorter, call assign right after creating the dataframe.
The code snippet can look like this:
df = pd.DataFrame(d).assign(new_column=lambda x: x['col1'] * 2 + x['col2'] / 4)
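One detail worth spelling out: in the question, new_df = df only binds a second name to the same object, so the new column ends up in df as well. A minimal sketch of the difference, assuming pandas is imported as pd (the name alias is purely illustrative):

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

# Plain assignment creates an alias, not a copy.
alias = df
print(alias is df)

# assign() returns a new DataFrame and leaves df untouched.
new_df = df.assign(new_column=lambda x: x['col1'] * 2 + x['col2'] / 4)
print(list(df.columns))
print(new_df)
```

If a column name contains a space, as 'new column' does in the question, it cannot be written as a keyword argument directly; unpacking a dict, e.g. df.assign(**{'new column': ...}), should work instead.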
I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop which appears to work, however I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n=0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n+=1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
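If every column of rating_df is numeric, the row-wise apply can probably be replaced by a single matrix-vector product. A small sketch with made-up data (the contents of rating_df and business_df below are placeholders, not the asker's data):

```python
import numpy as np
import pandas as pd

fixed_vector = np.array([1, 2, 3])
rating_df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]},
                         index=['b1', 'b2'])
business_df = pd.DataFrame({'name': ['first', 'second']}, index=['b1', 'b2'])

# DataFrame.dot multiplies every row by the vector in one vectorized call.
# The result is a Series indexed like rating_df, so the assignment
# aligns on the shared index (business_id in the question).
business_df['dot_prod'] = rating_df.dot(fixed_vector)
print(business_df)
```

Because the assignment aligns on the index, the two frames do not need to be in the same row order, only to share the same index values.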
Below is my script for a generic data frame in Python using pandas. I am hoping to split a certain column in the data frame that will create new columns, while respecting the original orientation of the items in the original column.
Please see below for clarity. Thank you in advance!
My script:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
print(df)
Here's what I want
df = pd.DataFrame({'col1': ['x',np.nan,np.nan],
'col2': ['y','a',np.nan],
'col3': ['z','b','c']})
print(df)
Here's what I get
df = pd.DataFrame({'col1': ['x','a','c'],
'col2': ['y','b',np.nan],
'col3': ['z',np.nan,np.nan]})
print(df)
You can use the justify function from this answer with Series.str.split:
dfn = pd.DataFrame(
justify(df['col1'].str.split(',', expand=True).to_numpy(),
invalid_val=None,
axis=1,
side='right')
).add_prefix('col')
col0 col1 col2
0 x y z
1 None a b
2 None None c
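The justify helper above comes from another answer and is not part of pandas. As a rough stand-in when it is not at hand, the split pieces can be left-padded by hand. This is only a sketch under that assumption; the names parts, width and padded are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})

# Split into Python lists of varying length.
parts = df['col1'].str.split(',')

# The longest list decides how many columns are needed.
width = int(parts.str.len().max())

# Left-pad each list with NaN so its items end up right-aligned.
padded = [[np.nan] * (width - len(p)) + p for p in parts]

out = pd.DataFrame(padded, index=df.index,
                   columns=[f'col{i + 1}' for i in range(width)])
print(out)
```

This loops in Python, so it is likely slower than a NumPy-based justify on large frames, but it has no external dependency.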
Here is a way of tweaking the split:
max_delim = df['col1'].str.count(',').max() #count the max occurrence of `,`
delim_to_add = max_delim - df['col1'].str.count(',') #get difference of count from max
# multiply the delimiter and add it to series, followed by split
df[['col1','col2','col3']] = (df['col1'].radd([','*i for i in delim_to_add])
.str.split(',',expand=True).replace('',np.nan))
print(df)
col1 col2 col3
0 x y z
1 NaN a b
2 NaN NaN c
Try something like
s=df.col1.str.count(',')
#(s.max()-s).map(lambda x : x*',')
#0
#1 ,
#2 ,,
#Name: col1, dtype: object
(s.max()-s).map(lambda x : x*',').add(df.col1).str.split(',',expand=True)
0 1 2
0 x y z
1 a b
2 c
I have a large dataframe whose contents are similar to the one below.
d = {'col1': [1, -2.654, 3, 1.995]}
df = pd.DataFrame(data=d)
Output
col1
0 1
1 -2.654
2 3
3 1.995
I would like to delete the rows holding non-integer values, so rows 1 and 3 would be removed.
Thanks for any help!
Try:
d = {'col1': [1, -2.654, 3, 1.995]}
df = pd.DataFrame(data=d)
df[df.col1 == round(df.col1)]
# col1
# 0 1.0
# 2 3.0
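An exact == comparison can miss values that are whole numbers only up to floating-point rounding. A hedged variant that compares with a small absolute tolerance (the 1e-9 threshold is an arbitrary assumption, adjust it to your data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, -2.654, 3, 1.995]})

# Keep rows whose value is numerically a whole number.
# rtol=0 makes the check purely absolute, so large values
# are not accepted just because they are large.
is_whole = np.isclose(df['col1'], df['col1'].round(), rtol=0, atol=1e-9)
result = df[is_whole]
print(result)
```

Note that the column stays float64 after filtering; result['col1'].astype(int) would convert it if integers are wanted.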
My question is about the pandas.DataFrame.filter method. It seems that pandas creates a copy of the data frame, so any changes are written to the copy. How can I write to the data frame itself?
In other words:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.filter(regex='col1').iloc[0]=10
Output:
col1 col2
0 1 3
1 2 4
Desired Output:
col1 col2
0 10 3
1 2 4
I think you need to extract the column names and then use loc or iloc:
cols = df.filter(regex='col1').columns
df.loc[0, cols]=10
Or:
df.iloc[0, df.columns.get_indexer(cols)] = 10
print (df)
col1 col2
0 10 3
1 2 4
You cannot assign through the filter function, because the subset it returns is a new Series/DataFrame whose data may or may not be a view of the original. That's why SettingWithCopyWarning is possible there (or an exception, if you set the option to raise).
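The same idea can be written without calling filter at all, by building a boolean mask over the column labels. A minimal sketch, assuming the regex only needs to match column names (mask is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Boolean mask over the column labels; str.contains treats the
# pattern as a regex by default, like filter(regex=...).
mask = df.columns.str.contains('col1')

# One .loc call on df itself, so the assignment reaches the original.
df.loc[0, mask] = 10
print(df)
```

The key point is that row and column selection happen in a single indexing call on df, rather than in a chain that first produces an intermediate object.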
Assuming that I have a dataframe with the following values:
df:
col1 col2 value
1 2 3
1 2 1
2 3 1
I want to first group my dataframe by the first two columns (col1 and col2) and then average the values of the third column (value). So the desired output would look like this:
col1 col2 avg-value
1 2 2
2 3 1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby. What you passed as the second argument was interpreted as the axis param, which is why it raised an error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
           avg
col1 col2
1    2       3
     3       3
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or slightly more verbose, for the sake of getting the word 'avg' in your aggregated dataframe:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
# Named aggregation (pandas 0.25+); the older nested-dict form
# agg({'value': {'avg': np.mean}}) was removed in pandas 1.0.
print(df.groupby(['col1','col2']).agg(avg=('value', 'mean')))
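To get the flat layout shown in the question (col1, col2, avg-value as ordinary columns), one option is as_index=False plus a rename. A sketch using the three rows from the question's table; the frame is built directly here so the columns get a numeric dtype:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2],
                   'col2': [2, 2, 3],
                   'value': [3, 1, 1]})

# as_index=False keeps the grouping keys as regular columns
# instead of moving them into a MultiIndex.
result = (df.groupby(['col1', 'col2'], as_index=False)['value']
            .mean()
            .rename(columns={'value': 'avg-value'}))
print(result)
```

The rename step is needed because 'avg-value' contains a hyphen and so cannot be used as a keyword in named aggregation.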