I am trying to update the names in a pandas dataframe column. I want:
[IN]
B17.31
107.34
34
B50.56
[OUT]
B17.31
B107.34
B34
B50.56
The code I am using is:
for file in df1.loc[:, '#filename']:
    new = str(file)
    if new[0] != 'B':
        final = new[:0] + 'B' + new[0:]
    else:
        final = new
    print(final)
    df1.replace(new, final)
print(df1['#filename'])
df1.to_csv('updated_name_data.csv')
I can not work out why it will print out the updated name but will not update in the dataframe or csv. Any help or a pointer in the right direction would be greatly appreciated.
This should work:
for file in df1.loc[:, '#filename']:
    new = str(file)
    if new[0] != 'B':
        final = new[:0] + 'B' + new[0:]
    else:
        final = new
    print(final)
    # replace() returns a new DataFrame by default, so without inplace=True
    # (or assigning the result back) df1 is never actually modified
    df1.replace(new, final, inplace=True)
print(df1['#filename'])
df1.to_csv('updated_name_data.csv')
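Alternatively, since replace() returns a new DataFrame, assigning the result back works just as well. A minimal sketch using the sample values from the question (the astype(str) step is an assumption to normalise the mixed string/numeric values first):

```python
import pandas as pd

# Sample values taken from the question; convert everything to strings first
df1 = pd.DataFrame({'#filename': ['B17.31', 107.34, 34, 'B50.56']})
df1['#filename'] = df1['#filename'].astype(str)

for name in df1['#filename']:
    if not name.startswith('B'):
        df1 = df1.replace(name, 'B' + name)  # replace() returns a new frame

print(df1['#filename'].tolist())  # ['B17.31', 'B107.34', 'B34', 'B50.56']
```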
You should aim to use vectorised operations rather than a manual loop. For example, you can isolate numeric values and prefix with "B":
s = pd.Series(['B17.31', 107.34, 34, 'B50.56'])
mask = pd.to_numeric(s, errors='coerce').notnull()
s.loc[mask] = 'B' + s.astype(str)
print(s)
0 B17.31
1 B107.34
2 B34
3 B50.56
dtype: object
In pandas it is best to avoid loops, because they are slow; vectorized functions are faster. You can create a boolean mask with str.startswith and then prepend B to the original column with numpy.where:
mask = df1['#filename'].astype(str).str.startswith('B')
df1['#filename'] = np.where(mask, df1['#filename'], 'B' + df1['#filename'].astype(str))
A similar solution inverts the mask with ~:
df1.loc[~mask, '#filename'] = 'B' + df1['#filename'].astype(str)
print (df1)
#filename
0 B17.31
1 B107.34
2 B34
3 B50.56
I have a DataFrame:
value,combined,value_shifted,Sequence_shifted,long,short
12834.0,2.0,12836.0,3.0,2.0,-2.0
12813.0,-2.0,12781.0,-3.0,-32.0,32.0
12830.0,2.0,12831.0,3.0,1.0,-1.0
12809.0,-2.0,12803.0,-3.0,-6.0,6.0
12822.0,2.0,12805.0,3.0,-17.0,17.0
12800.0,-2.0,12807.0,-3.0,7.0,-7.0
12773.0,2.0,12772.0,3.0,-1.0,1.0
12786.0,-2.0,12787.0,1.0,1.0,-1.0
12790.0,2.0,12784.0,3.0,-6.0,6.0
I want to combine the long and short columns according to the value of the combined column:
If df.combined == 2, keep the value from long
If df.combined == -2, keep the value from short
Expected result:
value,combined,value_shifted,Sequence_shifted,calc
12834.0,2.0,12836.0,3.0,2.0
12813.0,-2.0,12781.0,-3.0,32
12830.0,2.0,12831.0,3.0,1.0
12809.0,-2.0,12803.0,-3.0,6.0
12822.0,2.0,12805.0,3.0,-17
12800.0,-2.0,12807.0,-3.0,-1.0
12773.0,2.0,12772.0,3.0,-1.0
12786.0,-2.0,12787.0,1.0,-6.0
12790.0,2.0,12784.0,3.0,20.0
If the combined column can contain 2, -2, or other values, use numpy.select:
df['calc'] = np.select([df['combined'].eq(2), df['combined'].eq(-2)],
                       [df['long'], df['short']])
Or, if there are only 2 and -2 values, use numpy.where:
df['calc'] = np.where(df['combined'].eq(2), df['long'], df['short'])
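As a quick end-to-end check, here is a sketch using the first few rows of the sample data; default=np.nan is an assumption for any other combined values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'combined': [2.0, -2.0, 2.0],
                   'long': [2.0, -32.0, 1.0],
                   'short': [-2.0, 32.0, -1.0]})

# One condition/choice pair per combined value; default covers anything else
df['calc'] = np.select([df['combined'].eq(2), df['combined'].eq(-2)],
                       [df['long'], df['short']],
                       default=np.nan)
print(df['calc'].tolist())  # [2.0, 32.0, 1.0]
```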
Try this:
df['calc'] = df['long'].where(df['combined'] == 2, df['short'])
Or build the column explicitly with boolean masks:
df['calc'] = np.nan
mask_2 = df['combined'] == 2
df.loc[mask_2, 'calc'] = df.loc[mask_2, 'long']
mask_minus_2 = df['combined'] == -2
df.loc[mask_minus_2, 'calc'] = df.loc[mask_minus_2, 'short']
then you can drop the long and short columns:
df.drop(columns=['long', 'short'], inplace=True)
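A self-contained check of the where approach, using two sample rows taken from the question's data:

```python
import pandas as pd

df = pd.DataFrame({'combined': [2.0, -2.0],
                   'long': [2.0, -32.0],
                   'short': [-2.0, 32.0]})

# Keep 'long' where combined == 2, otherwise fall back to 'short'
df['calc'] = df['long'].where(df['combined'] == 2, df['short'])
df.drop(columns=['long', 'short'], inplace=True)
print(df['calc'].tolist())  # [2.0, 32.0]
```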
data = {"marks":[1,2,3,4,5,6,7,8,9,10,11,12], "month":['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']}
df2 = pd.DataFrame(data)
So far I have tried the code below, but it does not give the result above:
for i in df2['month']:
    if (i=='jan' or i=='feb' or i=='mar'):
        df2['q'] = '1Q'
    else:
        df2['q']='other'
Convert the column to datetimes, use Series.dt.quarter, and prepend q:
df2['new'] = 'q' + pd.to_datetime(df2['month'], format='%b').dt.quarter.astype(str)
Or use Series.map with a dictionary:
d = {'jan': 'q1', 'feb': 'q1', 'mar': 'q1',
     'apr': 'q2', 'may': 'q2', 'jun': 'q2',
     'jul': 'q3', 'aug': 'q3', 'sep': 'q3',
     'oct': 'q4', 'nov': 'q4', 'dec': 'q4'}
df2['new'] = df2['month'].map(d)
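The two variants can be checked against each other. A sketch: %b parses abbreviated month names case-insensitively, and the missing year defaults to 1900, which is harmless here since only the quarter is read.

```python
import pandas as pd

df2 = pd.DataFrame({'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                              'jul', 'aug', 'sep', 'oct', 'nov', 'dec']})

# Variant 1: parse month abbreviations, then read the quarter
via_datetime = 'q' + pd.to_datetime(df2['month'], format='%b').dt.quarter.astype(str)

# Variant 2: explicit mapping
d = {'jan': 'q1', 'feb': 'q1', 'mar': 'q1',
     'apr': 'q2', 'may': 'q2', 'jun': 'q2',
     'jul': 'q3', 'aug': 'q3', 'sep': 'q3',
     'oct': 'q4', 'nov': 'q4', 'dec': 'q4'}
via_map = df2['month'].map(d)

print(via_datetime.tolist() == via_map.tolist())  # True
```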
For a particular column (dtype = object), how can I add '-' to the start of the string, given that it ends with '-'?
i.e. convert 'MAY500-' to '-MAY500-'
(I need to add this to every element in the column)
Try something like this:
# setup
df = pd.DataFrame({'col': ['aaaa', 'bbbb-', 'cc-', 'dddddddd-']})
mask = df.col.str.endswith('-')
df.loc[mask, 'col'] = '-' + df.loc[mask, 'col']
Output
df
col
0 aaaa
1 -bbbb-
2 -cc-
3 -dddddddd-
You can use np.select
Given a dataframe like this:
df
values
0 abcd-
1 a-bcd
2 efg-
You can use np.select as follows:
df['values'] = np.select([df['values'].str.endswith('-')],
                         ['-' + df['values']], df['values'])
output:
df
values
0 -abcd-
1 a-bcd
2 -efg-
def add_prefix(text):
    # If text is null or an empty string, the -1 index would raise an IndexError
    if text and text[-1] == "-":
        return "-" + text
    return text

df = pd.DataFrame(data={'A': ["MAY500", "MAY500-", "", None, np.nan]})
# Change the column to string dtype first
df['A'] = df['A'].astype(str)
df['A'] = df['A'].apply(add_prefix)
0 MAY500
1 -MAY500-
2
3 None
4 nan
Name: A, dtype: object
I have a knack for using apply with lambda functions a lot. It just makes the code a lot easier to read.
df['value'] = df['value'].apply(lambda x: '-'+str(x) if str(x).endswith('-') else x)
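A quick check of the lambda on the question's example values (sketch):

```python
import pandas as pd

df = pd.DataFrame({'value': ['MAY500', 'MAY500-']})
# Prefix '-' only where the string already ends with '-'
df['value'] = df['value'].apply(lambda x: '-' + str(x) if str(x).endswith('-') else x)
print(df['value'].tolist())  # ['MAY500', '-MAY500-']
```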
I have a dataframe "bb" like this:
Response Unique Count
I love it so much! 246_0 1
This is not bad, but can be better. 246_1 2
Well done, let's do it. 247_0 1
If Count is larger than 1, I would like to split the string so that the dataframe "bb" becomes this (the result I expect):
Response Unique
I love it so much! 246_0
This is not bad 246_1_0
but can be better. 246_1_1
Well done, let's do it. 247_0
My code:
bb = DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(),
               index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response','Unique']
bb=bb.replace('', np.nan)
bb=bb.dropna()
print(bb)
But the result is like this:
Response Unique
0 This is not bad 246_1
1 but can be better. 246_1
How can I keep the original dataframe in this case?
First split only the values matching the condition into a new helper Series, then append counter values from GroupBy.cumcount, but only for the duplicated index values identified by Index.duplicated:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0
EDIT: If the _0 suffix is needed for all the other values as well, remove the mask:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0_0
Step by step, we can solve this problem as follows:
Split your dataframes by count
Use this function to explode the string to rows
We groupby on index and use cumcount to get the correct unique column values.
Finally we concat the dataframes together again.
df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1
df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter
# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)
df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0
Function used from linked answer:
def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
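For reference, here is how explode_str behaves on a single multi-part row (a sketch with one row reconstructed from the question's data; note that split(',') keeps the leading space on the second fragment):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row index once per delimiter-separated fragment
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

bb = pd.DataFrame({'Response': ['This is not bad, but can be better.'],
                   'Unique': ['246_1']})
out = explode_str(bb, 'Response', ',')
print(out['Response'].tolist())  # ['This is not bad', ' but can be better.']
```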
Goal: Create new column that outputs strings based on value in original column
Below is my data frame table. I want to create the new column highlighted in yellow.
Below is my business logic:
1. If value in 'Cat_Priority_1' = 'Cat_1' then the new column ('Cat_Priority_1_Rationale') is equal to the string values in 'Age_Flag', 'Salary_Flag', and 'Education_Flag' columns.
2. If value in 'Cat_Priority_1' = 'Cat_3' then the new column ('Cat_Priority_1_Rationale') is equal to the string values in 'Race_Flag'
This is the code I tried, but it didn't work:
Any help greatly appreciated!
You can use np.where, which acts like a vectorized if statement (older answers access it as pd.np.where, but pd.np has been removed from pandas, so import numpy directly):
df['Cat_Priority_1_Rationale'] = np.where(df['Cat_Priority_1'] == 'Cat_1',
                                          df['Age_Flag'] + ";" + df['Salary_Flag'] + ";" + df['Education_Flag'],
                                          df['Race_Flag'])
The apply function with axis=1 is used to iterate over the rows of the dataframe.
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
gives this dataframe
Year quarter period
0 2014 q1 2014q1
1 2015 q2 2015q2
Or you can send each row to a separate function that handles the if condition and returns the concatenated string.
This is how you can directly implement your business logic.
def bus_log(row):
    if row['Cat_Priority_1'] == 'Cat_1':
        result = []
        result.append(row['Age_Flag'])
        result.append(row['Salary_Flag'])
        result.append(row['Education_Flag'])
        result = ';'.join(result)
        if result.startswith(';'):
            result = result[1:]
        return result
    elif row['Cat_Priority_1'] == 'Cat_3':
        return row['Race_Flag']
    elif ....:  ## another condition could go here
        ## calculate a result
        return result
    elif ....:  ## another condition could go here
        ## calculate a result
        return result
    else:
        return ''

df['Cat_Priority_1_Rationale'] = df.apply(bus_log, axis=1)
There are two points I should mention: (1) You should clear away instances of NaN from your data in favour of empty strings before you do this. (2) I suspect an error in the third row of your data, in the 'Salary_Flag' value.
You could use something like this. Broadcasting is usually faster and more readable than iterating over rows. The last line exploits the fact that False * s == '' and True * s == s for any string s.
bs = df.Cat_Priority_1 == 'Cat_1'
s1 = df.Race_Flag
s3 = df.Age_Flag + ';' + df.Educ_Flag + ';' + df.Salary_Flag
df['new_col'] = bs * s3 + (1 - bs) * s1
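Putting the broadcasting trick together on tiny made-up flag values (the column names follow the answer above; the flag strings here are placeholders, not from the question's data):

```python
import pandas as pd

df = pd.DataFrame({'Cat_Priority_1': ['Cat_1', 'Cat_3'],
                   'Age_Flag': ['age_ok', ''],
                   'Educ_Flag': ['educ_ok', ''],
                   'Salary_Flag': ['sal_ok', ''],
                   'Race_Flag': ['', 'race_ok']})

bs = df.Cat_Priority_1 == 'Cat_1'
s1 = df.Race_Flag
s3 = df.Age_Flag + ';' + df.Educ_Flag + ';' + df.Salary_Flag

# True * s == s and False * s == '' for strings, so this picks per row
df['new_col'] = bs * s3 + (1 - bs) * s1
print(df['new_col'].tolist())  # ['age_ok;educ_ok;sal_ok', 'race_ok']
```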