The part highlighted in red is pd.read_csv, which returns a DataFrame; the part highlighted in blue is a lambda function used for indexing (it filters by account ID as the CSV file is read in).
This method seems very clever, but it's a bit confusing to me. Could anyone explain how this works as a filter? Thank you very much.
The [...] part is called indexing: you're creating an anonymous ("lambda") function and indexing the DataFrame with it. What you get out of it is all the rows where acct_id is OVIWFZA.
It's identical to this:
df = pd.read_csv('/content/drive/client.csv', nrows=5)
df = df[df['acct_id'] == 'OVIWFZA']
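The question's exact read_csv call isn't shown, but the mechanism is plain pandas: indexing a DataFrame with a callable makes pandas call that function with the DataFrame and use the returned boolean Series as a row mask. A minimal sketch with toy data (the column values here are made up for illustration):

```python
import pandas as pd

# Toy data standing in for the CSV file from the question
df = pd.DataFrame({
    'acct_id': ['OVIWFZA', 'ABC1234', 'OVIWFZA'],
    'amount': [10, 20, 30],
})

# Indexing with a lambda: pandas calls the function with df itself
# and uses the returned boolean Series to select rows
filtered = df[lambda d: d['acct_id'] == 'OVIWFZA']

# The equivalent two-step version
mask = df['acct_id'] == 'OVIWFZA'
same = df[mask]
```

Both forms produce the same result; the lambda form is mainly useful inside method chains where the intermediate DataFrame has no name yet.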
I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, when using pandas DataFrames, I find that mutability is especially difficult to understand.
For example: let's say I have a DataFrame (df) with some raw data. I want to use a function that receives df as a parameter and outputs a modified version of that df, but keeping the original df. If I wrote the function, maybe I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (let's say, from some package), should I always pass my input df as df.copy()?
In my case, I'm trying to write some custom function that transforms a df using a WoE encoder. The data parameter is a DataFrame with feature columns and a label column. It kinda looks like this:
def my_function(data, var_list, label_column):
    # var_list = columns to be encoded
    encoder = category_encoders.WOEEncoder(cols=var_list)
    fit_encoder = encoder.fit(
        X=data[var_list],
        y=data[label_column]
    )
    new_data = fit_encoder.transform(
        data[var_list]
    )
    new_data[label_column] = data[label_column]
    return new_data
So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df will modify it in place, or will it return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that pandas sometimes produces views and sometimes not, depending on the operation you apply to whatever subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.
The exercise at https://www.statology.org/pandas-copy-dataframe/ shows that if you don't use .copy() when manipulating a subset of your DataFrame, you can end up changing values in the original DataFrame as well. Since that is not what you want, you should use .copy() when passing your DataFrame to your function.
The example on the link I listed above really illustrates this concept well (and no I'm not affiliated with their site lol, I was just searching for this answer myself).
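A minimal sketch of the defensive-copy idea (toy data, not the site's exact example): copying explicitly before mutating guarantees the original is untouched, regardless of whether the subset would otherwise have been a view or a copy.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Copy explicitly before mutating; without .copy(), whether a write
# leaks back into df depends on pandas internals (view vs copy),
# and you may get a SettingWithCopyWarning instead of a clear answer.
safe = df[['a']].copy()
safe['a'] = [10, 20]

# df is guaranteed unchanged because we copied first
```

This is why "pass my_function(data.copy(), ...)" is a reasonable habit when you can't inspect the function you're calling.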
I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source'] == 'T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)  # note: DataFrame.append was removed in pandas 2.0; pd.concat([df_tmeter, rows]) is the modern equivalent
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
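A sketch of the usual pandas answer to "write values back for only some rows": assign through .loc with a boolean mask. The healing() below is a hypothetical stand-in (it just adds a fixed offset so the example runs); the real stateful function isn't shown in the question. Series.apply also visits rows in order, so the prior-row state should still accumulate correctly, but that's worth verifying against the real healing().

```python
import pandas as pd

# Hypothetical stand-in for the author's stateful healing()
def healing(ts):
    return ts + 100

df = pd.DataFrame({
    'Source': ['T-meter', 'Other', 'T-meter'],
    'Timestamp': [1, 2, 3],
})

# Boolean mask selects only the T-meter rows; .loc assigns the
# healed values back into df in place, leaving other rows untouched
mask = df['Source'] == 'T-meter'
df.loc[mask, 'Timestamp'] = df.loc[mask, 'Timestamp'].apply(healing)
```

This avoids the full-row apply and the wrapper function entirely.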
I have been trying to learn to analyze the Big Mart Sales Data Set from this website. I am unable to decode a line of code that is a little bit complex. I tried to demystify it but I wasn't able to. Kindly help me understand this line at
In [26]
df['Item_Visibility_MeanRatio'] = df.apply(lambda x: x['Item_Visibility']/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0],axis=1).astype(float)
Thank you very much in advance. Happy coding.
df['Item_Visibility_MeanRatio']
This is the new column name
= df.apply(lambda x:
applying a function to the dataframe
x['Item_Visibility']
take the Item_Visibility column from the original dataframe
/visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
divide by the value of the Item_Visibility column in the pivot table, taken from the row whose index matches the current row's Item_Identifier
,axis=1)
apply the function once per row (axis=1 moves across the columns)
.astype(float)
convert to float type
Also, it looks like .apply is used a lot on the link you attached. It should be noted that apply is generally the slow way to do things, and there are usually alternatives to avoid using apply.
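For instance, assuming visibility_item_avg is a pivot table of mean visibility per Item_Identifier (as it is in that tutorial), the whole row-wise apply can be replaced by a single .map lookup. A sketch with toy data (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Item_Identifier': ['A', 'B', 'A'],
    'Item_Visibility': [0.2, 0.3, 0.4],
})

# Stand-in for the tutorial's pivot table: mean visibility per item
visibility_item_avg = df.pivot_table(values='Item_Visibility',
                                     index='Item_Identifier')

# One vectorised lookup instead of a per-row apply:
# .map aligns each Item_Identifier with the pivot table's index
df['Item_Visibility_MeanRatio'] = (
    df['Item_Visibility']
    / df['Item_Identifier'].map(visibility_item_avg['Item_Visibility'])
)
```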
Let's go through it step by step:
df['Item_Visibility_MeanRatio']
This part is creating a column in the data frame and its name is Item_Visibility_MeanRatio.
df.apply(lambda...)
Apply a function along an axis of the DataFrame.
x['Item_Visibility']
It is getting the data from Item_Visibility column in the data frame.
visibility_item_avg['Item_Visibility'][visibility_item_avg.index == x['Item_Identifier']][0]
This part compares visibility_item_avg's index against the current row's Item_Identifier, producing a boolean mask. That mask then selects the matching elements of visibility_item_avg['Item_Visibility'], and the [0] at the end takes the first element of the resulting array.
axis=1
1 : apply function to each row.
astype(float)
This is for changing the value types to float.
To make the code easier to digest, you can always split it into separate parts and work through it little by little.
To make the code faster, you can use vectorization instead of applying a lambda.
Refer to the link here.
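One vectorised formulation (a sketch with toy data, not the tutorial's exact code): groupby().transform('mean') broadcasts each item's mean visibility back onto its own rows, so no intermediate lookup table and no apply are needed at all.

```python
import pandas as pd

df = pd.DataFrame({
    'Item_Identifier': ['A', 'B', 'A'],
    'Item_Visibility': [0.2, 0.3, 0.4],
})

# transform('mean') returns a Series aligned with df's rows,
# holding each row's group mean
avg = df.groupby('Item_Identifier')['Item_Visibility'].transform('mean')
df['Item_Visibility_MeanRatio'] = df['Item_Visibility'] / avg
```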
I am looking to filter a dataset based off of whether a certain ID does not appear in a different dataframe.
While I'm not particularly attached to this approach (I'm open to a better way I'm not familiar with), my plan is to apply a Boolean function to my dataset, put the results in a new column, and then filter the entire dataset on that True/False result.
My main dataframe is df, and my other dataframe with the ID's in it is called ID:
def groups():
    if df['owner_id'] not in ID['owner_id']:
        return True
    return False
This ends up being accepted (no syntax problems), so I then go to apply it to my dataframe, which fails:
df['ID Groups?'] = df.apply(lambda row: groups(), axis=1)
Result:
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')
It seems that somewhere the data I'm trying to use (the IDs contain both letters and numbers, so they're strings) is incorrectly formatted.
I have two questions:
Is my proposed method the best way of going about this?
How can I fix the error that I'm seeing?
My apologies if it's something super obvious, I have very limited exposure to Python and coding as a whole, but I wasn't able to find anywhere where this type of question had already been addressed.
Expression to keep only the rows in df whose owner_id appears in ID (negate the mask with ~ to keep the rows whose owner_id does not appear, which is what your groups() function is trying to do):
df = df[df['owner_id'].isin(ID['owner_id'])]
A lambda expression is going to be way slower than this.
isin is the Pandas way. not in is the Python collections way.
The reason you are getting this error is that df['owner_id'] not in ID['owner_id'] hashes the left-hand side to check whether it is present in the right-hand side. df['owner_id'] is a Series and is not hashable, as the error reports. Luckily, it is not needed.
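A small self-contained illustration of both directions (toy data, made up for this sketch):

```python
import pandas as pd

df = pd.DataFrame({'owner_id': ['a1', 'b2', 'c3'], 'val': [1, 2, 3]})
ID = pd.DataFrame({'owner_id': ['b2']})

# Rows whose owner_id appears in ID
matching = df[df['owner_id'].isin(ID['owner_id'])]

# Rows whose owner_id does NOT appear in ID: negate the mask with ~
not_matching = df[~df['owner_id'].isin(ID['owner_id'])]
```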
I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') and have been working on this.
I now want to delete the columns contained within 'public' from the larger DataFrame ('main').
I've tried the following instructions:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
Python Pandas - Deleting multiple series from a data frame in one command
without any success, along with various other statements that have been unsuccessful.
The columns that make up 'public' are not consecutive - i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same Index. [Not sure if this is important, but 'public' was created using the 'join' function].
Yes, I'm being lazy - I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance I'd be most grateful.
(Have Python 2.7 and am using Pandas, numpy, math, pylab etc.)
Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it.
I was using the statement from the Stack Overflow question linked above:
df.drop(df.columns[1:], axis=1)
and this was not working (note that drop returns a new DataFrame, so the result must be assigned back). I have instead used
df = df.drop(df2, axis=1)
and this worked (df = main, df2 = public). Simple really once you don't overthink it.
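For anyone reading later, the more explicit form is to drop by the smaller frame's column labels. A sketch with toy frames standing in for 'main' and 'public':

```python
import pandas as pd

main = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
public = main[['b']]

# Drop every column that exists in the smaller frame,
# and assign the result back (drop does not work in place by default)
main = main.drop(columns=public.columns)
```

Passing the DataFrame itself to drop happens to work because pandas iterates it as its column labels, but spelling out public.columns makes the intent unmistakable.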