If I have more than 200 columns, each with long names and I want to remove the first part of the names, how do I do that using pandas?
You could loop through them and drop the first n characters:
n = 3
li = []
for col in df.columns:
    # collect the shortened name; reassigning col alone changes nothing
    li.append(col[n:])
df.columns = li
Or perform any other form of string manipulation; I'm not sure what you mean by "remove the first part".
I'd just use rename:
df.rename(columns = lambda x: x[n:])
Here the lambda can be anything; for example, it could also strip surrounding whitespace. You can also define a named function and pass it to rename instead of a lambda. Note that rename returns a new DataFrame, so assign the result back to df.
Use indexing with str:
N = 5
df.columns = df.columns.str[N:]
If you just want to remove a certain number of characters:
df.rename(columns=lambda col: col[n:])
If you want to selectively remove based on a prefix:
# cols = 'a_A', 'a_B', 'b_A'
df.rename(columns=lambda col: col[len('a_'):] if col.startswith('a_') else col)
How complicated your rules are is up to you.
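As a rough sketch of the prefix case with a named function instead of a lambda (the prefix and the column names here are made up for illustration, they are not from the question):

```python
import pandas as pd

PREFIX = "a_"  # hypothetical prefix, chosen only for this example


def strip_prefix(col: str) -> str:
    """Return col without PREFIX when it starts with it, otherwise unchanged."""
    return col[len(PREFIX):] if col.startswith(PREFIX) else col


# Hypothetical column names; "b_a_C" contains "a_" but does not start with it
df = pd.DataFrame(columns=["a_A", "a_B", "b_A", "b_a_C"])

# rename returns a new DataFrame, so keep the result
renamed = df.rename(columns=strip_prefix)
print(list(renamed.columns))
```

Checking with startswith rather than `in` leaves names like "b_a_C" alone, which is usually what "remove a prefix" is meant to do.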
I want to replace some characters in column names at scale and I was able to do it through columns with str.replace(). However, I want to know if I could do this through lambda functions so I could bring them into my other pandas workflow instead of doing it independently.
dat.columns = (
    dat.columns
    .str.replace(r"park_1_city", "us1state")
    .str.replace(r"park_2_city", "us2state")
    .str.replace(r"park_3_city", "us3state")
    .str.replace(r"us1tree", "us1garden")
    .str.replace(r"us2tree", "us2garden")
    .str.replace(r"us3tree", "us3garden")
)
Simply do:
your_function = lambda col: col # Or whatever you would like to do with the names
dat.columns = [your_function(col) for col in dat.columns]
You can also use any normal function, instead of a lambda, of course.
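If the goal is to keep the renaming inside a longer method chain, one option is to pass the function to rename. This is only a sketch: the replacement table, the extra "other" column and the "total" column are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical literal replacements, applied in insertion order
REPLACEMENTS = {"park_1_city": "us1state", "us1tree": "us1garden"}


def clean_name(col: str) -> str:
    """Apply every literal replacement to a single column name."""
    for old, new in REPLACEMENTS.items():
        col = col.replace(old, new)
    return col


dat = pd.DataFrame({"park_1_city": [1], "us1tree": [2], "other": [3]})

result = (
    dat
    .rename(columns=clean_name)  # renaming happens as one step of the chain
    .assign(total=lambda d: d["us1state"] + d["us1garden"])  # later steps see the new names
)
print(list(result.columns))
```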
Use a dictionary of replacements. Here \d+ matches one or more digits and \1 reinserts the captured digits; Series.replace accepts such a dictionary when regex=True:
dat = pd.DataFrame(columns=['park_1_city','park_2_city','park_3_city',
                            'us1tree','us2tree','us30tree'])
d = {r"park_(\d+)_city": r"us\1state", r"us(\d+)tree": r"us\1garden"}
dat.columns = dat.columns.to_series().replace(d, regex=True)
print (dat)
Empty DataFrame
Columns: [us1state, us2state, us3state, us1garden, us2garden, us30garden]
Index: []
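To see what the capture group and \1 are doing without pandas in the way, here is a small sketch with the standard re module. The helper name apply_patterns and the sample names are mine, not part of the answer.

```python
import re

# Same two patterns as in the dictionary above
patterns = {r"park_(\d+)_city": r"us\1state", r"us(\d+)tree": r"us\1garden"}


def apply_patterns(name: str) -> str:
    """Run every pattern over one name; \\1 reinserts the digits captured by (\\d+)."""
    for pattern, repl in patterns.items():
        name = re.sub(pattern, repl, name)
    return name


names = ["park_1_city", "us30tree", "unrelated"]  # hypothetical sample names
renamed = [apply_patterns(n) for n in names]
print(renamed)
```

Names that match neither pattern pass through unchanged, so the same function can be used on all columns.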
I have a dataframe in which I want to delete columns whose names start with "test", "id_1", "vehicle" and so on.
I use the code below to delete one column:
df1.drop(*filter(lambda col: 'test' in col, df.columns))
How can I specify all columns at once in this line?
This doesn't work:
df1.drop(*filter(lambda col: 'test','id_1' in col, df.columns))
You can do something like the following (note any, not all: a name only has to start with one of the prefixes):
expression = lambda col: any(col.startswith(i) for i in ['test', 'id_1', 'vehicle'])
df1.drop(*filter(expression, df1.columns))
In PySpark version 2.1.0, it is possible to drop multiple columns using drop by providing a list of strings (with the names of the columns you want to drop) as argument to drop. (See documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop).
In your case, you may create a list containing the names of the columns you want to drop. For example:
cols_to_drop = [x for x in df1.columns if (x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle'))]
And then apply drop, unpacking the list:
df1.drop(*cols_to_drop)
Ultimately, it is also possible to achieve a similar result by using select. For example:
# Define columns you want to keep
cols_to_keep = [x for x in df1.columns if x not in cols_to_drop]
# create new dataframe, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)
Note that by using select you don't need to unpack the list.
Please note that this question also addresses a similar issue.
I hope this helps.
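The name selection itself can be tried without a Spark session. Below is a minimal sketch on a plain list of names; the column names are invented for the example, and the Spark calls only appear as comments.

```python
# str.startswith accepts a tuple, so one call covers all prefixes
prefixes = ("test", "id_1", "vehicle")

# Hypothetical column names; "latest" contains "test" but does not start with it
columns = ["test_a", "id_1_x", "vehicle_type", "price", "latest"]

cols_to_drop = [c for c in columns if c.startswith(prefixes)]
cols_to_keep = [c for c in columns if c not in cols_to_drop]

# With a Spark DataFrame df1 you would then call, for example:
#   df1.drop(*cols_to_drop)    or    df1.select(cols_to_keep)
print(cols_to_drop)
print(cols_to_keep)
```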
Well, it seems you can use a regular column filter, as follows (Scala):
val forColumns = df.columns.filter(x => (x.startsWith("test") || x.startsWith("id_1") || x.startsWith("vehicle"))) ++ Array("c_007")
I have data loaded into a dataframe but cannot figure out how to compare the parsed data against the other column and return only matches.
This seems like it should be easy but I just don't see it. I've tried splitting the values out to compare but here's where I get stuck.
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
# output something like...
df['output'] = [None, ';c1312;', ';d1310;']
I'd expect to see something like -
1st row - return null, as t9010 is not contained in col2_split
2nd row - return c1312, as it is in col2_split
3rd row - return d1310 but not c1512, as only d1310 is in col2_split
Lastly, the final text should be semicolon-delimited with leading and trailing semicolons, e.g. ;t9010; or ;c1312;, or ;d1310;c1512; if there is more than one.
The part where you split on ";" is correct. After that, you need to compare each element in col1_split with the elements in col2_split. You can write a small function for that and let pandas apply call it row by row.
Here is sample code for this:
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
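def value_check(list1, list2):
    # keep the non-empty tokens of list1 that also appear in list2
    matches = [i for i in list1 if i and i in list2]
    # wrap in leading/trailing semicolons; None when nothing matches
    return ";" + ";".join(matches) + ";" if matches else None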
df['output'] = df.apply(lambda x: value_check(x.col1_split, x.col2_split), axis=1)
You could try this method to get every value in col1 that also appears in col2. Split the string in each row into a list and drop the empty strings first. Then look up each remaining col1 value in the col2 list and write the matches to the output column.
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
# split and drop the empty strings
df['col1_split'] = df.col1.apply(lambda x: [s for s in x.split(';') if s])
df['col2_split'] = df.col2.apply(lambda x: [s for s in x.split(';') if s])
def check(list1, list2):
    res = ''  # must be initialised before appending
    for i in list1:
        if i in list2:
            res += ';' + str(i)
    # close the string with a trailing semicolon
    if len(res) > 0:
        res += ';'
    return res
df['output']=df.apply(lambda x: check(x.col1_split, x.col2_split), axis=1)
Hope this can help you.
We can use a nested list comprehension for this:
df['common'] = pd.Series([[sub for sub in left if sub in right] for left, right in zip(df['col1_split'], df['col2_split'])]).str.join(';')
0 ;
1 ;c1312;
2 ;d1310;
Name: common, dtype: object
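For comparison, here is a self-contained sketch that skips the helper columns and returns a missing value when nothing matches, as the question asks. The helper names tokens and common are my own, and using a set for the lookup is just one possible choice.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [";t9010;", ";c1312;", ";d1310;c1512;"],
    "col2": [";t1010;d1010;c1012;", ";t1210;d1210;c1312;", ";t1310;d1310;c1412;"],
})


def tokens(value: str) -> list:
    """Split on ';' and drop the empty strings left by the outer semicolons."""
    return [t for t in value.split(";") if t]


def common(left: str, right: str):
    """Tokens of left that also occur in right, re-wrapped in semicolons."""
    allowed = set(tokens(right))
    matches = [t for t in tokens(left) if t in allowed]
    return ";" + ";".join(matches) + ";" if matches else None


df["output"] = [common(a, b) for a, b in zip(df["col1"], df["col2"])]
print(df["output"])
```

Joining the matches once, instead of appending ";x;" per match, avoids doubled semicolons when a row has several matches.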
The following works fine to drop columns whose names contain the string basket anywhere. How can I modify the code below to pass a list of strings to be filtered out instead of just a single string?
banned_columns = ["basket","cricket","ball"]
condition = lambda col: "basket" in col
new_df = df.drop(*filter(condition, df.columns))
The above just filters basket. How can I filter out basket, cricket and ball from df.columns?
You can exclude all the columns that contain any of the banned words using the built-in any() function:
banned_columns = ["basket","cricket","ball"]
condition = lambda col: any(word in col for word in banned_columns)
new_df = df.drop(*filter(condition, df.columns))
The built-in function any() comes in handy here:
condition = lambda col: any(item in col for item in banned_columns)
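A quick way to check the condition is to run it over a plain list of names first. This is just a sketch, and the column names are invented for the example.

```python
banned_columns = ["basket", "cricket", "ball"]

# Hypothetical column names standing in for df.columns
columns = ["basketball_score", "cricket_runs", "team", "football", "city"]

# True when at least one banned word occurs anywhere in the name
condition = lambda col: any(word in col for word in banned_columns)

to_drop = list(filter(condition, columns))
kept = [c for c in columns if not condition(c)]
print(to_drop)
print(kept)
```

Because this is a substring test, "football" is dropped as well (it contains "ball"); use startswith or an exact comparison if that is too broad.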
How do I use list comprehension, or any other technique to refactor the code I have? I'm working on a DataFrame, modifying values in the first example, and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'],errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'],errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines
I think a simple loop is all you need, especially if you want to avoid apply and have many columns:
cols = ['start_d','end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c],errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is necessary because the result is a list of Series. Assign back to df[cols] so the other columns are kept:
comp = [pd.to_datetime(df[c],errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df[cols] = pd.concat(comp, axis=1)
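A solution with apply is still possible; note that .dt.strftime has to go inside the lambda, because .dt only exists on a Series, not on a DataFrame:
df[cols]=df[cols].apply(lambda x: pd.to_datetime(x ,errors='coerce').dt.strftime('%Y-%b-%d'))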
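Both examples from the question can also be folded into a single assign call with dict comprehensions. This is a sketch on made-up sample data; the names date_cols and new_cols are mine.

```python
import pandas as pd

# Hypothetical sample data; "not a date" shows what errors="coerce" does
df = pd.DataFrame({
    "start_d": ["2021-01-05", "not a date"],
    "end_d": ["2021-02-10", "2021-03-15"],
})

date_cols = ["start_d", "end_d"]  # Example 1: columns to reformat
new_cols = ["col1", "col2"]       # Example 2: constant columns to add

df = df.assign(
    # unparseable values become NaT and stay missing after strftime
    **{c: pd.to_datetime(df[c], errors="coerce").dt.strftime("%Y-%b-%d") for c in date_cols},
    # a scalar is broadcast to every row
    **{c: "NA" for c in new_cols},
)
print(df)
```

The month abbreviation produced by %b depends on the active locale, so the exact strings may differ between machines.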