I have a dataframe with many columns; some of them are strings and some are numbers. Right now, when I print the column names as a list I get something like this:
df.columns.tolist()
>>> ['name', 'code', 'date', '401.2', '405.6', '507.3'...]
I would like to get the numerical column names as float numbers and not as strings. I haven't found a way to do that yet; is it possible to do something like this?
My goal in the end is to be able to create a list of only the numerical column names, so if you know another way to separate them while they are still strings, that could work as well.
Use a custom function with a try-except statement:

import pandas as pd

df = pd.DataFrame(columns=['name', 'code', 'date', '401.2', '405.6', '507.3'])

def f(x):
    try:
        return float(x)
    except ValueError:
        return x

df.columns = df.columns.map(f)
print(df.columns.tolist())
['name', 'code', 'date', 401.2, 405.6, 507.3]
Using a list comprehension (note that the simple isnumeric check below won't match negative numbers or scientific notation, but it covers names like '401.2'):
df.columns = [float(col) if col.replace('.', '').isnumeric() else col for col in df.columns]
res = df.columns.to_list()
print(res)
Output:
['name', 'code', 'date', 401.2, 405.6, 507.3]
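Since the stated end goal is a list of only the numerical column names, here is a minimal sketch building on the try-except approach above (the to_float helper name is just for illustration):

import pandas as pd

df = pd.DataFrame(columns=['name', 'code', 'date', '401.2', '405.6', '507.3'])

def to_float(x):
    # convert if possible, otherwise keep the original string
    try:
        return float(x)
    except ValueError:
        return x

# keep only the names that convert cleanly to float
numeric_cols = [to_float(c) for c in df.columns if isinstance(to_float(c), float)]
print(numeric_cols)  # [401.2, 405.6, 507.3]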
I am trying to apply a function to multiple columns in a pandas dataframe, comparing the values of two columns to create a third new column based on that comparison. The code runs, but the output is not correct. For example, this code:
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

i = 0
for item in lst2:
    df[str(item) + "_2"] = df.apply(lambda x: conditions(x, column1=x[item], column2=x[lst1[i]]), axis=1)
    i = i + 1
The output should mark the first row as containing incorrect instances, but it marks everything as correct. The correct result would be that col4_4_2 and col5_5_2 are marked as incorrect.
Is it not possible to apply a function in this way on multiple columns and pass the column names as arguments in pandas? If so, how should it be done?
You didn't provide a df, so I used this:
df = pd.DataFrame([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col5',
                           'col1_1', 'col2_2', 'col3_3', 'col4_4', 'col5_5',
                           'col1_1_2', 'col2_2_2', 'col3_3_2', 'col4_4_2', 'col5_5_2'])
Your conditions function expects a row and the names of two of its columns, but you are supplying it the row and then two values. One way to solve your problem is to change your comparison function to compare the values directly, like this (note that the function no longer needs to index into x):
def conditions(x, column1, column2):
    print(column1, column2)
    if column1 != column2:
        return "incorrect"
    else:
        return "correct"
Alternatively, you could change the line with the lambda in it to something like this:
df[str(item)+"_2"] = df.apply(lambda x: conditions(x, lst2[i], lst1[i]) , axis=1)
I first had to add the columns and fill them with zeros, then apply the function.
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

for item in lst2:
    df[str(item) + "_2"] = 0

i = 0
for item in df.columns[-5:]:
    df[item] = df.apply(lambda x: conditions(x, column1=lst1[i], column2=lst2[i]), axis=1)
    i = i + 1
I have a column named keywords in my pandas dataset. The values of the column are like this:
[jdhdhsn, cultuere, jdhdy]
I want my output to be
jdhdhsn, cultuere, jdhdy
Try this:

keywords = ['jdhdhsn', 'cultuere', 'jdhdy']

if isinstance(keywords, list):
    output = ', '.join(keywords)
else:
    # if it is actually a string like "[jdhdhsn, cultuere, jdhdy]", strip the brackets
    output = keywords[1:-1]
The column of your dataframe seems to hold lists.
Lists are formatted with brackets around the repr() of each of their elements.
Pandas has built-in functions for dealing with strings:
df['column_name'].str lets you apply a string operation to each element in the column, just like ', '.join(['foo', 'bar', 'baz']).
Thus df['column_name_str'] = df['column_name'].str.join(', ') will produce a new column with the formatting you're after.
You can also use the .apply to perform arbitrary lambda functions on a column, such as:
df['column_name'].apply(lambda row: ', '.join(row))
But since pandas has the .str accessor built in, this isn't needed for this example.
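A minimal sketch of both approaches, assuming the column actually holds Python lists (if the values are strings that merely look like lists, see the next answer):

import pandas as pd

df = pd.DataFrame({'keywords': [['jdhdhsn', 'cultuere', 'jdhdy']]})

# built-in string accessor
df['joined'] = df['keywords'].str.join(', ')

# equivalent apply version
df['joined_apply'] = df['keywords'].apply(lambda row: ', '.join(row))

print(df['joined'][0])  # jdhdhsn, cultuere, jdhdy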
Try this:

data = ["[jdhdhsn, cultuere, jdhdy]"]
df = pd.DataFrame(data, columns=["keywords"])

# here the value is a string, so slicing off the first and last characters removes the brackets
new_df = df['keywords'].str[1:-1]
print(df)
print(new_df)
I have a dataframe in which I want to delete columns whose names start with "test", "id_1", "vehicle", and so on.
I use the code below to delete one column:
df1.drop(*filter(lambda col: 'test' in col, df.columns))
How do I specify all the prefixes at once in this line? This doesn't work:
df1.drop(*filter(lambda col: 'test','id_1' in col, df.columns))
You can do something like the following (note any rather than all, since a column should be dropped if it starts with any one of the prefixes):

expression = lambda col: any(col.startswith(i) for i in ['test', 'id_1', 'vehicle'])
df1.drop(*filter(expression, df.columns))
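As a quick sanity check of the filter logic, with hypothetical column names:

columns = ['test_a', 'id_1_b', 'vehicle_type', 'keep_me']
expression = lambda col: any(col.startswith(i) for i in ['test', 'id_1', 'vehicle'])
print([c for c in columns if expression(c)])  # ['test_a', 'id_1_b', 'vehicle_type']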
In PySpark version 2.1.0, it is possible to drop multiple columns using drop by providing a list of strings (with the names of the columns you want to drop) as argument to drop. (See documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop).
In your case, you may create a list containing the names of the columns you want to drop. For example:
cols_to_drop = [x for x in df1.columns if x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle')]
And then apply the drop unpacking the list:
df1.drop(*cols_to_drop)
Ultimately, it is also possible to achieve a similar result by using select. For example:
# Define columns you want to keep
cols_to_keep = [x for x in df1.columns if x not in cols_to_drop]
# create new dataframe, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)
Note that, by using select you don't need to unpack the list.
Please note that this question also addresses a similar issue. I hope this helps.
Well, it seems you can use a regular column filter, as follows:

for_columns = [x for x in df.columns if x.startswith("test") or x.startswith("id_1") or x.startswith("vehicle")] + ["c_007"]
df.drop(*for_columns)
If I have more than 200 columns, each with long names, and I want to remove the first part of the names, how do I do that using pandas?
You could loop through them and omit the first n characters:

n = 3
li = []
for col in df.columns:
    col = col[n:]
    li.append(col)
df.columns = li
Or perform any other form of string manipulation; I'm not sure what you mean by "remove the first part".
I'd just use rename:
n = 5
df.rename(columns=lambda x: x[n:])

The lambda can be anything; you could also strip further whitespace. In fact, you can pass any callable here, without even using a lambda. Note that rename returns a new dataframe by default, so assign the result back to keep it.
Use indexing with str:
N = 5
df.columns = df.columns.str[N:]
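For example, a small sketch with made-up column names:

import pandas as pd

df = pd.DataFrame(columns=['prefix_alpha', 'prefix_beta'])
N = 7  # length of 'prefix_'
df.columns = df.columns.str[N:]
print(df.columns.tolist())  # ['alpha', 'beta']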
If you just want to remove a certain number of characters:
df.rename(columns=lambda col: col[n:])
If you want to selectively remove based on a prefix:
# cols = 'a_A', 'a_B', 'b_A'
df.rename(columns=lambda col: col.split('a_')[1] if 'a_' in col else col)
How complicated your rules are is up to you.
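For instance, with the hypothetical columns from the comment above:

import pandas as pd

df = pd.DataFrame(columns=['a_A', 'a_B', 'b_A'])
df = df.rename(columns=lambda col: col.split('a_')[1] if 'a_' in col else col)
print(df.columns.tolist())  # ['A', 'B', 'b_A']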
I have data loaded into a dataframe but cannot figure out how to compare the parsed data against the other column and return only the matches. This seems like it should be easy, but I just don't see it. I've tried splitting the values out to compare, but here's where I get stuck.
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
# desired output, something like...
df['output'] = [None, ';c1312;', ';d1310;']
I'd expect to see something like -
1st row - return null, as t9010 is not contained in col2_split
2nd row - return c1312, as it is in col2_split
3rd row - return d1310 but not c1512, as only d1310 is in col2_split
Lastly, the final text should be returned semicolon-delimited and with leading and trailing semicolons, i.e. ;t9010; or ;c1312;, or ;d1310;c1512; if there is more than one match.
The part where you split on ";" is correct. After that, you need to compare each element in col1_split with the elements in col2_split. You can write a simple comparison function and use pandas' apply to run it row by row.
Here is the sample code for the same
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
def value_check(list1, list2):
    string = ""
    for i in list1:
        if (i in list2) and (len(i) > 0):
            string += ";" + i + ";"
    return string
df['output'] = df.apply(lambda x: value_check(x.col1_split, x.col2_split), axis=1)
df
Output: the output column is '' for the first row, ';c1312;' for the second, and ';d1310;' for the third.
You can also try this method to get all the values in col1 that are also in col2. First split the string values in each row into a list, omitting the empty values, and then search for the values in col1 that match col2, writing the result to the output column.
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
# splitting & omitting the empty values
df['col1_split'] = df.col1.apply(lambda x: [s for s in x.split(';') if len(s) > 0])
df['col2_split'] = df.col2.apply(lambda x: [s for s in x.split(';') if len(s) > 0])

def check(list1, list2):
    res = ''
    for i in list1:
        if i in list2:
            res += ';' + str(i)
    # closing semicolon at the end of the string in each row
    if len(res) > 0:
        res += ';'
    return res

df['output'] = df.apply(lambda x: check(x.col1_split, x.col2_split), axis=1)
df
Output: the output column is '' for the first row, ';c1312;' for the second, and ';d1310;' for the third. Hope this can help you.
We can use a nested list comprehension for this:
df['common'] = pd.Series(
    [[sub for sub in left if sub in right]
     for left, right in zip(df['col1_split'], df['col2_split'])]
).str.join(';')
print(df['common'])
Output:
0 ;
1 ;c1312;
2 ;d1310;
Name: common, dtype: object
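Note that row 0 comes out as ';' rather than empty because the split lists from above keep the empty strings at each end, and those empty strings match each other; filter them out first if you want '' when there are no real matches.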