If I have more than 200 columns, each with long names and I want to remove the first part of the names, how do I do that using pandas?
You could loop through them and omit the first n characters:
n = 3
li = []
for col in df.columns:
    col = col[n:]
    li.append(col)
df.columns = li
Or perform any other form of string manipulation, I'm not sure what you mean by "to remove the first part".
I'd just use rename:
n = 5
df = df.rename(columns=lambda x: x[n:])
Note that rename returns a new DataFrame, hence the assignment (or pass inplace=True). The lambda can be anything here: you could also strip extra whitespace, and in fact you can pass any callable to rename, so you don't even need a lambda.
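For example, a minimal sketch with a named function instead of a lambda (the 'pre_' prefix and the clean_name helper are made up for illustration):
import pandas as pd

df = pd.DataFrame(columns=['pre_a', 'pre_b ', 'pre_c'])  # hypothetical column names

def clean_name(name, n=4):
    # drop the first n characters, then strip surrounding whitespace
    return name[n:].strip()

df = df.rename(columns=clean_name)
print(df.columns.tolist())  # ['a', 'b', 'c']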
Use indexing with str:
N = 5
df.columns = df.columns.str[N:]
If you just want to remove a certain number of characters:
df.rename(columns=lambda col: col[n:])
If you want to selectively remove based on a prefix:
# cols = 'a_A', 'a_B', 'b_A'
df.rename(columns=lambda col: col.split('a_')[1] if 'a_' in col else col)
How complicated your rules are is up to you.
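As an illustration, here is a small sketch that combines a few rules in one callable (the column names and the clean helper below are made up, not from the question):
import pandas as pd

df = pd.DataFrame(columns=['a_First Name', 'a_Last Name', 'b_Age'])  # hypothetical names

def clean(col):
    # drop a known prefix, collapse spaces to underscores, lowercase the rest
    col = col.split('a_', 1)[1] if col.startswith('a_') else col
    return col.strip().replace(' ', '_').lower()

df = df.rename(columns=clean)
print(df.columns.tolist())  # ['first_name', 'last_name', 'b_age']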
Related
I want to replace some characters in column names at scale, and I was able to do it on the columns with str.replace(). However, I want to know whether I could do this with lambda functions, so I could fold it into the rest of my pandas workflow instead of doing it separately.
dat.columns = (
    dat.columns
    .str.replace(r"park_1_city", "us1state")
    .str.replace(r"park_2_city", "us2state")
    .str.replace(r"park_3_city", "us3state")
    .str.replace(r"us1tree", "us1garden")
    .str.replace(r"us2tree", "us2garden")
    .str.replace(r"us3tree", "us3garden")
)
Simply do:
your_function = lambda col: col # Or whatever you would like to do with the names
dat.columns = [your_function(col) for col in dat.columns]
You can also use any normal function, instead of a lambda, of course.
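For the replacements in the question, a named function could look like this (just a sketch that mirrors the str.replace chain above; dat is assumed to be the asker's DataFrame):
def rename_col(col):
    # literal replacements mirroring the str.replace chain in the question
    replacements = {
        "park_1_city": "us1state",
        "park_2_city": "us2state",
        "park_3_city": "us3state",
        "us1tree": "us1garden",
        "us2tree": "us2garden",
        "us3tree": "us3garden",
    }
    for old, new in replacements.items():
        col = col.replace(old, new)
    return col

dat.columns = [rename_col(col) for col in dat.columns]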
Use a dictionary to replace the substrings: here \d+ matches the digits and \1 puts the captured value back into the replacement. Series.replace accepts such a dictionary when you pass regex=True:
dat = pd.DataFrame(columns=['park_1_city','park_2_city','park_3_city',
'us1tree','us2tree','us30tree'])
d = {r"park_(\d+)_city": r"us\1state", r"us(\d+)tree": r"us\1garden"}
dat.columns = dat.columns.to_series().replace(d, regex=True)
print(dat)
Empty DataFrame
Columns: [us1state, us2state, us3state, us1garden, us2garden, us30garden]
Index: []
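If you'd rather keep this inside rename so it composes with the rest of a method chain, a minimal sketch built on the same d dictionary and dat frame from above (the apply_rules helper is just an illustration):
import re

def apply_rules(name, rules=d):
    # apply each regex substitution in turn to a single column name
    for pattern, repl in rules.items():
        name = re.sub(pattern, repl, name)
    return name

dat = dat.rename(columns=apply_rules)
print(dat.columns.tolist())
# ['us1state', 'us2state', 'us3state', 'us1garden', 'us2garden', 'us30garden']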
I have a dataframe in which I want to delete the columns whose names start with "test", "id_1", "vehicle" and so on.
I use the code below to delete columns matching a single prefix:
df1.drop(*filter(lambda col: 'test' in col, df.columns))
How do I specify all the prefixes at once in this line?
This doesn't work:
df1.drop(*filter(lambda col: 'test','id_1' in col, df.columns))
You can do something like the following:
expression = lambda col: any(col.startswith(prefix) for prefix in ['test', 'id_1', 'vehicle'])
df1.drop(*filter(expression, df1.columns))
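To sanity-check which columns the filter will catch before dropping anything, here's a quick sketch with hypothetical column names:
# hypothetical column names, purely to illustrate what the filter matches
columns = ['test_score', 'id_1_key', 'vehicle_type', 'price', 'id_2_key']

expression = lambda col: any(col.startswith(prefix) for prefix in ['test', 'id_1', 'vehicle'])
print(list(filter(expression, columns)))
# ['test_score', 'id_1_key', 'vehicle_type']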
In PySpark 2.1.0 and later, drop accepts multiple column names, so you can drop several columns at once (see the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop).
In your case, you may create a list containing the names of the columns you want to drop. For example:
cols_to_drop = [x for x in df1.columns if x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle')]
And then apply the drop unpacking the list:
df1.drop(*cols_to_drop)
Ultimately, it is also possible to achieve a similar result by using select. For example:
# Define columns you want to keep
cols_to_keep = [x for x in df1.columns if x not in cols_to_drop]
# create new dataframe, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)
Note that, by using select you don't need to unpack the list.
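For a quick end-to-end check, here is a minimal sketch with made-up data and column names showing both approaches side by side:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical data, just to compare drop and select on the same column list
df1 = spark.createDataFrame([(1, 2, 3, 4)],
                            ['test_a', 'id_1_b', 'vehicle_type', 'price'])

cols_to_drop = [x for x in df1.columns
                if x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle')]

df1.drop(*cols_to_drop).show()                                        # unpack the list for drop
df1.select([c for c in df1.columns if c not in cols_to_drop]).show()  # pass the list directly to select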
Please note that this question also addresses a similar issue.
I hope this helps.
Well, it seems you can also use a regular column filter, for example in Scala:
val forColumns = df.columns.filter(x => x.startsWith("test") || x.startsWith("id_1") || x.startsWith("vehicle")) :+ "c_007"
df.drop(forColumns: _*)
I have data loaded into a dataframe but cannot figure out how to compare the parsed data against the other column and return only matches.
This seems like it should be easy but I just don't see it. I've tried splitting the values out to compare but here's where I get stuck.
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
# desired output, something like...
df['output'] = [None, ';c1312;', ';d1310;']
I'd expect to see something like -
1st row - return null, as t9010 is not contained in col2_split
2nd row - return c1312, as it is in col2_split
3rd row - return d1310 but not c1512, as only d1310 is in col2_split
Lastly, the final text should be returned semicolon-delimited, with leading and trailing semicolons, i.e. ;t9010; or ;c1312; or ;d1310;c1512; if there is more than one match.
Splitting on ";" is the right first step. After that, you need to compare each element of col1_split with the elements of col2_split. You can write a simple function for that comparison and use the pandas apply method to do the rest.
Here is the sample code for the same
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
def value_check(list1, list2):
    string = ""
    for i in list1:
        if (i in list2) and (len(i) > 0):
            string += ";" + i + ";"
    return string
df['output'] = df.apply(lambda x: value_check(x.col1_split, x.col2_split), axis=1)
df
Output
You can try this method to get all the values in col1 that also appear in col2. The idea is to split the string values in each row into a list, dropping the empty strings that the split produces, then look up the remaining col1 values in col2 and write the matches to the output column.
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
#splitting & omitting the empty values
df['col1_split'] = df.col1.apply(lambda x: [s for s in x.split(';') if len(s) > 0])
df['col2_split'] = df.col2.apply(lambda x: [s for s in x.split(';') if len(s) > 0])
def check(list1, list2):
    res = ''
    for i in list1:
        if i in list2:
            res += ';' + str(i)
    # add the closing semicolon at the end of the string in each row
    if len(res) > 0:
        res += ';'
    return res
df['output']=df.apply(lambda x: check(x.col1_split, x.col2_split), axis=1)
df
Output:
Hope this can help you.
We can use a nested list comprehension for this:
df['common'] = pd.Series(
    [[sub for sub in left if sub in right]
     for left, right in zip(df['col1_split'], df['col2_split'])]
).str.join(';')
print(df['common'])
Output:
0          ;
1    ;c1312;
2    ;d1310;
Name: common, dtype: object
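If you also want the asker's exact format, with None where nothing matched, one small post-processing sketch (assuming a join of only-empty strings should count as no match) is:
# turn a join of only-empty strings (just ';') into None, keep real matches as-is
df['output'] = [s if s.strip(';') else None for s in df['common']]
print(df['output'].tolist())
# [None, ';c1312;', ';d1310;']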
The following works fine to drop columns whose names contain the string basket anywhere. How can I modify the code below to pass a list of strings to filter out instead of a single string?
banned_columns = ["basket","cricket","ball"]
condition = lambda col: "basket" in col
new_df = df.drop(*filter(condition, df.columns))
The above just filters on basket. How can I filter out basket, cricket and ball from df.columns?
You can exclude all the columns that contain any of the banned words using the built-in any() function:
banned_columns = ["basket","cricket","ball"]
condition = lambda col: any(word in col for word in banned_columns)
new_df = df.drop(*filter(condition, df.columns))
The built-in function any() comes in handy here:
condition = lambda col: any(item in col for item in banned_columns)
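A quick sketch with made-up column names to show what the condition flags:
# hypothetical column names, purely to show which ones the condition flags
columns = ['basket_count', 'cricket_score', 'football', 'price']
banned_columns = ["basket", "cricket", "ball"]
condition = lambda col: any(word in col for word in banned_columns)
print([c for c in columns if condition(c)])
# ['basket_count', 'cricket_score', 'football']
Note that 'football' is flagged too, because the check is a plain substring test; switch to col.startswith(word) if you only want prefix matches.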
How do I use list comprehension, or any other technique to refactor the code I have? I'm working on a DataFrame, modifying values in the first example, and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'],errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'],errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines
I think you simply need a loop, especially if you want to avoid apply and have many columns:
cols = ['start_d','end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is necessary because the result is a list of Series:
comp = [pd.to_datetime(df[c],errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df = pd.concat(comp, axis=1)
But a solution with apply is still possible:
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))
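The question's second example, adding several constant columns at once, can be handled the same way. A small sketch using assign with a dict comprehension (the column list is just the one from the question):
# Example 2: add several constant columns in one statement
new_cols = ['col1', 'col2']
df = df.assign(**{c: 'NA' for c in new_cols})

# or, mirroring the loop above:
# for c in new_cols:
#     df[c] = 'NA'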