I'm trying to create a dataframe by selecting rows that meet specific conditions from a different dataframe.
Technicians can only pick one of several values for Column 1 from a dropdown menu, so I want to match one specific value. Column 2, however, is free-text entry, so I'm looking for two specific keywords in any spelling/case.
I want all columns from the rows in the new dataframe.
Any help or insight would be much appreciated.
import pandas as pd

df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')
filter = ['x', 'y']
columns = df.columns
data = pd.DataFrame(columns=columns)
for row in df.iterrows():
    if 'Column 1' == 'a':
        row.data.append()
    elif df['Column 2'].str.contains('filter', case='false'):
        row.data.append()
print(data.head())
In general, it's best to prefer a vectorized solution, so I'll put mine as follows (there are many ways to do this; this is one that came to my head). Here, you can use a simple boolean mask to filter out the rows you don't want, since you've already clearly defined your criteria: df['Column 1'] == 'a', or df['Column 2'] containing one of your keywords regardless of case.
As such, you can simply build a boolean mask from those criteria. By itself, df['Column 1'] == 'a' creates a boolean Series with the structure [True, False, True, True, ...], where each entry says whether the condition holds for the corresponding row. Once you have that, you can index back into the original frame with df[df['Column 1'] == 'a'] to return your filtered rows.
Of course, since you have two conditions here joined by an "or", combine them with | rather than &, wrapping each side in parentheses: df[(df['Column 1'] == 'a') | df['Column 2'].str.contains('x|y', case=False)]. Note case=False (the boolean), not case='false': a non-empty string is truthy, so the original would not do what you want.
I'm not at my development computer, so this might not work exactly as written, but this is the general idea. This one line should replace your entire df.iterrows block. Hope this helps :)
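Putting it together, a minimal sketch of the whole thing (the file name, sheet name, column names, and keywords are the placeholders from your question):

import pandas as pd

df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')

keywords = ['x', 'y']
pattern = '|'.join(keywords)  # regex that matches either keyword

# case=False ignores case; na=False treats missing cells as non-matches
mask = (df['Column 1'] == 'a') | df['Column 2'].str.contains(pattern, case=False, na=False)
data = df[mask]
print(data.head())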
I have a mock pandas dataframe df where I want to create a new column 'fruit_cost', and I was wondering the easiest way to do this. The new column 'fruit_cost' will take the integer from the 'cost' column where the item type is equal to 'fruit'. What would the standard way of doing this in pandas be? Should I use conditional logic, or is there a simpler way? If anyone has good practice tutorials for this type of thing, that would also be beneficial.
In SQL I would create it using a case:
SQL
case
when item_type = 'fruit' then cost
else 0
end
as fruit_cost
Python
import pandas as pd

list_of_customers = [
    ['patrick', 'lemon', 'fruit', 10],
    ['paul', 'lemon', 'fruit', 20],
    ['frank', 'lemon', 'fruit', 10],
    ['jim', 'lemon', 'fruit', 20],
    ['wendy', 'watermelon', 'fruit', 39],
    ['greg', 'watermelon', 'fruit', 32],
    ['wilson', 'carrot', 'vegetable', 34],
    ['maree', 'carrot', 'vegetable', 22],
    ['greg', '', '', None],
    ['wilmer', 'sprite', 'drink', 22],
]
df = pd.DataFrame(list_of_customers, columns=['customer', 'item', 'item_type', 'cost'])
print(df)

# create new field 'fruit_cost' -- pseudocode for what I want:
# df['fruit_cost'] = if df['item_type'] == 'fruit':
#                        df['cost']
#                    else:
#                        0
df["fruit_cost"] = df["cost"].where(df["item_type"] == "fruit", other=0)
Here are some solutions:
np.where
import numpy as np

df['fruit_cost'] = np.where(df['item_type'] == 'fruit', df['cost'], 0)
Series.where
df['fruit_cost'] = df['cost'].where(df['item_type'] == 'fruit', 0)
There isn't really a standard since there are so many ways to do this; it's a matter of preference. I suggest you take a look at these links:
Pandas: Column that is dependent on another value
Set Pandas Conditional Column Based on Values of Another Column
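For concreteness, here's a minimal runnable sketch using the question's sample data (trimmed to two rows); both approaches above produce the same column:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['patrick', 'lemon', 'fruit', 10],
     ['wilson', 'carrot', 'vegetable', 34]],
    columns=['customer', 'item', 'item_type', 'cost'])

# cost where item_type is 'fruit', 0 everywhere else
df['fruit_cost'] = np.where(df['item_type'] == 'fruit', df['cost'], 0)
print(df)
#   customer    item  item_type  cost  fruit_cost
# 0  patrick   lemon      fruit    10          10
# 1   wilson  carrot  vegetable    34           0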
I have the following code snippet
(dataset: https://www.internationalgenome.org/data-portal/sample)
genome_data = pd.read_csv('../genome')
genome_data_columns = genome_data.columns
genPredict = genome_data[genome_data_columns[genome_data_columns != 'Geuvadis']]
This drops the column Geuvadis, is there a way I can include more than one column?
You could use DataFrame.drop like genome_data.drop(['Geuvadis', 'C2', ...], axis=1).
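For example (column names other than 'Geuvadis' are placeholders for your real ones):

import pandas as pd

genome_data = pd.read_csv('../genome')

# drop several columns at once; 'C2' is a stand-in name
genPredict = genome_data.drop(['Geuvadis', 'C2'], axis=1)
# equivalently: genome_data.drop(columns=['Geuvadis', 'C2'])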
Is it ok for you to not read them in the first place?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
The ‘usecols’ option in read_csv lets you specify the columns of data you want to include in the DataFrame.
Venkatesh-PrasadRanganath's answer is the correct way to drop multiple columns.
But if you want to avoid reading data into memory which you're not going to use, genome_data = pd.read_csv('../genome', usecols=["only", "required", "columns"]) is the syntax to use.
I think #Venkatesh-PrasadRanganath's answer is better, but taking a similar approach to your attempt, this is how I would do it:
Identify all of the columns with columns.to_list().
Create a list of columns to be excluded.
Subtract the excluded columns from the full list with list(set() - set()).
Select the remaining columns.
One caveat: set() does not preserve order, so the remaining columns may come back in a different order than in the original file.
genome_data = pd.read_csv('../genome')
all_genome_data_columns = genome_data.columns.to_list()
excluded_genome_data_columns = ['a', 'b', 'c'] #Type in the columns that you want to exclude here.
genome_data_columns = list(set(all_genome_data_columns) - set(excluded_genome_data_columns))
genPredict = genome_data[genome_data_columns]
I have a dataframe that has 100 variables, var1-var100. I want to bring var40, var20, and var30 to the front, with the other variables remaining in their original order. I've searched online; methods like
1: df[['var40', 'var20', 'var30', 'var1', ...]]
2: columns = ['var40', 'var20', 'var30', 'var1', ...]
all require specifying every variable in the dataframe. With 100 variables in my dataframe, how can I do this efficiently?
I am a SAS user; in SAS, we can use a retain statement before the set statement to achieve this. Is there an equivalent way in Python too?
Thanks
Consider reindex with a conditional list comprehension:
first_cols = ['var30', 'var40', 'var20']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
axis = 'columns')
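A quick sketch of what that does, using a five-column stand-in for your 100-variable frame:

import pandas as pd

df = pd.DataFrame(columns=['var1', 'var20', 'var30', 'var40', 'var50'])

first_cols = ['var30', 'var40', 'var20']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')
print(df.columns.tolist())
# ['var30', 'var40', 'var20', 'var1', 'var50']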
This is really trivial, but I can't believe I've wandered around for an hour and still can't find the answer, so here you are:
df = pd.DataFrame({"cats":["a","b"], "vals":[1,2]})
df.cats = df.cats.astype("category")
df
My problem is how to select the rows whose "cats" value is the category "a". I know that df.loc[df.cats == "a"] will work, but that's based on element-wise equality. Is there a way to select based on the levels of the category?
This works:
df.cats[df.cats=='a']
UPDATE
The question was updated. New solution:
df[df.cats.cat.categories == ['a']]
Careful, though: df.cats.cat.categories == ['a'] compares against the category index, whose length is the number of categories, not the number of rows, so this only happens to work when those two lengths match (as in this toy example). df[df.cats == 'a'] is the robust form.
For those who are trying to filter rows based on a numerical categorical column:
df[df['col'] == pd.Interval(46, 53, closed='right')]
This would keep the rows where the col column has category (46, 53].
This kind of categorical column is common when you discretize numerical columns using the pd.qcut() method.
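For instance, a minimal sketch (the data here is made up; inspect df['bin'].cat.categories to see the intervals qcut actually produced before filtering on one):

import pandas as pd

df = pd.DataFrame({'col': [45, 47, 50, 53, 60, 70]})
df['bin'] = pd.qcut(df['col'], q=3)  # three quantile-based interval bins

# pick one of the generated intervals and keep only its rows
target = df['bin'].cat.categories[0]
print(df[df['bin'] == target])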
You can query the categorical levels using df.cats.cat.categories, which prints the output
Index(['a', 'b'], dtype='object')
For this case, to select rows with the category 'a', which is df.cats.cat.categories[0], you just use:
df[df.cats == df.cats.cat.categories[0]]
Using the isin function to create a boolean index is an approach that will extend to multiple categories, similar to R's %in% operator.
# will return desired subset
df[df.cats.isin(['a'])]
# can be extended to multiple categories
df[df.cats.isin(['a', 'b'])]
Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!), you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for the latter point:
import pandas as pd
df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>>> df.columns
Index([u'a'], dtype='object')
For some reason, though, df.b returns the correct results.
They do return the same thing. The column names in pandas are akin to dictionary keys that refer to a series. The column names themselves are named attributes that are part of the dataframe object.
The first method is preferred, as it allows for spaces and other characters that are illegal in attribute names.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
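A quick check (with a toy column) confirming the two forms return identical values:

import pandas as pd

df = pd.DataFrame({'int_rate': [0.1, 0.2, 0.3]})
print(df['int_rate'].equals(df.int_rate))  # True -- same Series either way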
They're the same, but the first method handles spaces in column names and illegal characters, so it's preferred. Example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a, a, 1a]
Index: []
In [116]:
print(df.a) # works
print(df[' a']) # works
print(df.1a) # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
Really, when you use dot notation, pandas looks the name up as an attribute; if for some reason you have a column name that matches an existing attribute, dot access will not do what you expect.
Example:
In [121]:
import numpy as np
df = pd.DataFrame(columns=['index'], data=np.random.randn(3))
df
Out[121]:
index
0 0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above now shows the DataFrame's index rather than the column named 'index'.
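Bracket access, by contrast, still reaches the column (continuing the same example; the In/Out numbering is illustrative):
In [123]:
df['index']
Out[123]:
0    0.062698
1   -1.066654
2   -1.560549
Name: index, dtype: float64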
If you are working on an ML project and want to extract the feature and target variables separately, the code below may be useful. It selects the features by slicing the column list and applying it to the dataframe; in this code, data is the DataFrame.
total_col = list(data.columns)
Target_col_Y = total_col[-1]     # last column is assumed to be the target
Feature_col_X = total_col[0:-1]  # everything before it is a feature
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output looks like this:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]
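For what it's worth, a more compact equivalent using iloc, assuming the target really is the last column of data:

X = data.iloc[:, :-1]  # every column except the last -> features
y = data.iloc[:, -1]   # last column -> target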