I have a dataframe that contains strings in columns that should be only floats. I saw several solutions on how to drop a row with a specific string or parts of it from an individual column.
So for an individual column, I suppose one could do it like this:
new_df = df[df['Column'].dtypes != object]
But this
new_df = df[df.dtypes != object]
did not work. One could iterate over all columns via a loop, but is there a way to drop the strings for all columns at once?
Use DataFrame.select_dtypes:
# exclude object columns
new_df = df.select_dtypes(exclude=object)

# only float columns
new_df = df.select_dtypes(include=float)

# only numeric columns (np is numpy: import numpy as np)
new_df = df.select_dtypes(include=np.number)
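For example, a minimal sketch with an assumed mixed-dtype frame:
import numpy as np
import pandas as pd

# hypothetical frame: one float column, one string column
df = pd.DataFrame({'a': [1.0, 2.5], 'b': ['x', 'y']})

print(df.select_dtypes(exclude=object).columns.tolist())    # ['a']
print(df.select_dtypes(include=np.number).columns.tolist()) # ['a']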
EDIT: If you instead need to drop the rows that contain strings (keeping all columns), convert everything to numeric, coercing strings to NaN, and drop those rows:
new_df = df.apply(pd.to_numeric, errors='coerce').dropna()
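For instance, assuming a small frame where one cell holds a stray string:
import pandas as pd

df = pd.DataFrame({'a': [1.0, 'bad', 3.0], 'b': [4.0, 5.0, 6.0]})
new_df = df.apply(pd.to_numeric, errors='coerce').dropna()
print(new_df)
#      a    b
# 0  1.0  4.0
# 2  3.0  6.0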
I have a DataFrame called "df" with 4 numerical columns [frame, id, x, y].
I made a loop that creates two dataframes called df1 and df2. Both df1 and df2 are subsets of the original dataframe.
What I want to do (and am not understanding how to do) is this: I want to CHECK whether df1 and df2 have the same VALUES in the column called "id". If they do, I want to concatenate those rows of df2 (the ones with matching id values) to df1.
For example: if df1 has rows with id values (1, 6, 4, 8) and df2 has these id values (12, 7, 8, 10), I want to concatenate the df2 rows that have id value 8 to df1. That is all I need.
This is my code:
for i in range(0, max(df['frame']), 30):
    df1 = df[df['frame'].between(i, i+30)]
    df2 = df[df['frame'].between(i-30, i)]
There are several ways to accomplish what you need.
The simplest one is to get the slice of df2 that contains the values you need with .isin() and concatenate it with df1 in one line.
df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis=0)
To gain more control and avoid errors that might stem from df1 and df2 being updated elsewhere, you may want to take this one-liner apart.
look_for_vals = set(df1['id'].tolist())
# do some stuff
need_ix = df2[df2["id"].isin(look_for_vals)].index
# do more stuff
df3 = pd.concat([df1, df2.loc[need_ix, :]], axis=0)
Instead of set(), you may also use df1['id'].unique().
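A quick sanity check, using toy frames with the id values from the question (the x values are just placeholders):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 6, 4, 8], 'x': [0, 1, 2, 3]})
df2 = pd.DataFrame({'id': [12, 7, 8, 10], 'x': [4, 5, 6, 7]})

df3 = pd.concat([df1, df2[df2.id.isin(df1.id)]], axis=0)
print(df3['id'].tolist())  # [1, 6, 4, 8, 8]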
My pandas DataFrame has several columns whose names contain 'verified'. I want to drop every column whose name contains 'verified' except the one named 'verified_90'. I am trying the following code, but it removes all columns that contain the word.
Column names: verified_30, verified_60, verified_90, verified_365, logo.verified., verified.at, etc.
df = df[df.columns.drop(list(df.filter(regex='verified')))]
You might be able to use a regex approach here:
df = df[df.columns.drop(list(df.filter(regex='^(?!verified_90$).*verified.*$')))]
Filter for columns that either do not contain 'verified' or are exactly 'verified_90' with DataFrame.loc; here : means select all rows, and the boolean mask selects the columns:
df.loc[:, ~df.columns.str.contains('verified') | (df.columns == 'verified_90')]
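Both approaches should keep only 'verified_90'; here is a sketch using the column names from the question (the row values are assumed placeholders):
import pandas as pd

cols = ['verified_30', 'verified_60', 'verified_90',
        'verified_365', 'logo.verified.', 'verified.at']
df = pd.DataFrame([[0] * len(cols)], columns=cols)

# regex approach
out1 = df[df.columns.drop(list(df.filter(regex='^(?!verified_90$).*verified.*$')))]
# boolean-mask approach
out2 = df.loc[:, ~df.columns.str.contains('verified') | (df.columns == 'verified_90')]

print(out1.columns.tolist())  # ['verified_90']
print(out2.columns.tolist())  # ['verified_90']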
I am dealing with DNA sequencing data, and each value in the column looks something like "ACCGTGC". I would like to transform this into several columns, where each column contains only one character. How can I do this in Python pandas?
For performance, convert the values to lists and pass them to the DataFrame constructor:
df1 = pd.DataFrame([list(x) for x in df['col']], index=df.index)
If need add to original:
df = df.join(df1)
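For instance, assuming the sequences live in a column named 'col':
import pandas as pd

df = pd.DataFrame({'col': ['ACCGTGC', 'TGCATGA']})
df1 = pd.DataFrame([list(x) for x in df['col']], index=df.index)
print(df1)
#    0  1  2  3  4  5  6
# 0  A  C  C  G  T  G  C
# 1  T  G  C  A  T  G  A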
I need to create some additional columns in my table (or a separate table) based on the following:
I have a table
and I need to create additional columns where the column indexes (the column names) are inserted as values, like this:
How to do it in pandas? Any ideas?
Thank you
If you need the matched column names only for values equal to 1:
df = (df.set_index('name')
        .eq(1)
        .dot(df.columns[1:].astype(str) + ',')
        .str.rstrip(',')
        .str.split(',', expand=True)
        .add_prefix('c')
        .reset_index())
print(df)
Explanation:
The idea is to create a boolean mask with True for the values that should be replaced by column names: compare against 1 with DataFrame.eq, then use matrix multiplication via DataFrame.dot with all columns except the first, each with a separator appended. True * 'col,' yields 'col,' and False * 'col,' yields '', so the dot product concatenates the matching names per row. Then strip the trailing separator with Series.str.rstrip, split into new columns with Series.str.split(expand=True), and rename the columns with DataFrame.add_prefix.
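A minimal sketch, assuming a 'name' column plus 0/1 indicator columns:
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'],
                   'x': [1, 0],
                   'y': [0, 1],
                   'z': [1, 1]})

out = (df.set_index('name')
         .eq(1)
         .dot(df.columns[1:].astype(str) + ',')
         .str.rstrip(',')
         .str.split(',', expand=True)
         .add_prefix('c')
         .reset_index())
print(out)
#   name c0 c1
# 0    a  x  z
# 1    b  y  z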
Another solution:
df1 = df.set_index('name').eq(1).apply(lambda x: x.index[x].tolist(), axis=1)
df = pd.DataFrame(df1.values.tolist(), index=df1.index).add_prefix('c').reset_index()
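On the same assumed data, x.index[x] picks the column labels where the row mask is True, so this yields the identical result:
df = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 0], 'y': [0, 1], 'z': [1, 1]})  # same assumed data
df1 = df.set_index('name').eq(1).apply(lambda x: x.index[x].tolist(), axis=1)
print(pd.DataFrame(df1.values.tolist(), index=df1.index).add_prefix('c').reset_index())
#   name c0 c1
# 0    a  x  z
# 1    b  y  z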
I recently started working with pandas dataframes.
I have a list of dataframes called 'arr'.
Edit: All the dataframes in 'arr' have the same columns but different data.
Also, I have an empty dataframe 'ndf' which I need to fill in using the above list.
How do I iterate through 'arr' to fill the max values of each column of an element of 'arr' into a row of 'ndf'?
So, we'll have
Number of rows in ndf = Number of elements in arr
I'm looking for something like this:
columns = ['time', 'Open', 'High', 'Low', 'Close']
ndf = DataFrame(columns=columns)
ndf['High'] = arr[i].max(axis=0)
Based on your description, I assume a basic example of your data looks something like this:
import pandas as pd
data =[{'time':'2013-09-01','open':249,'high':254,'low':249,'close':250},
{'time':'2013-09-02','open':249,'high':256,'low':248,'close':250}]
data2 =[{'time':'2013-09-01','open':251,'high':253,'low':248,'close':250},
{'time':'2013-09-02','open':245,'high':251,'low':243,'close':247}]
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
arr = [df, df2]
If that's the case, then you can simply iterate over the list of dataframes (via enumerate()) and the columns of each dataframe (via iteritems(), see http://pandas.pydata.org/pandas-docs/stable/basics.html#iteritems), populating each new row via a dictionary comprehension (see "Create a dictionary with list comprehension in Python"):
ndf = pd.DataFrame(columns=df.columns)
for i, df in enumerate(arr):
    ndf = ndf.append(pd.DataFrame(data={colName: max(colData) for colName, colData in df.iteritems()}, index=[i]))
If some of your dataframes have any additional columns, the resulting dataframe ndf will have NaN entries in the relevant places.
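Note that DataFrame.append() and iteritems() were removed in pandas 2.0. A rough equivalent for current pandas, assuming the same arr as above, collects the per-column maxima with items() and builds ndf in one call:
# one dict of column maxima per dataframe, then construct the frame once
rows = [{colName: colData.max() for colName, colData in d.items()} for d in arr]
ndf = pd.DataFrame(rows)
print(ndf)
#          time  open  high  low  close
# 0  2013-09-02   249   256  249    250
# 1  2013-09-02   251   253  248    250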