remove for loop for df.drop [duplicate] - python

This question already has answers here:
How to exclude multiple columns in Spark dataframe in Python
(4 answers)
Closed 4 years ago.
I am working with pyspark 2.0.
My code is :
for col in to_exclude:
    df = df.drop(col)
I cannot do df = df.drop(*to_exclude) directly because in 2.0 the drop method accepts only one column at a time.
Is there a way to change my code and remove the for loop ?

First of all, worry not. Even if you drop the columns in a loop, Spark does not execute a separate query for each drop. Queries are lazy, so Spark builds one big execution plan first and then executes everything at once (but you probably know that anyway).
However, if you still want to get rid of the loop within the 2.0 API, I'd go with the opposite of what you've implemented: instead of dropping columns, select only the ones you need:
df.select([col for col in df.columns if col not in to_exclude])
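For completeness, here is a minimal runnable sketch of the select-based approach; the sample data and to_exclude below are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data and exclusion list, just to show the pattern
df = spark.createDataFrame([(1, 'a', True)], ['id', 'name', 'flag'])
to_exclude = {'name', 'flag'}

df = df.select([col for col in df.columns if col not in to_exclude])
df.printSchema()  # only 'id' remains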

Related

how to fix groupby elimination of index? [duplicate]

This question already has answers here:
Pandas reset index is not taking effect [duplicate]
(4 answers)
Closed 3 months ago.
(Screenshots omitted.) After creating a dataframe, I use groupby and then reset the index, only to find that the 'county' column is still not visible as a regular column of the dataframe. Please help me rectify this.
df.reset_index() is not an in-place operation by default, but you can make it behave as one with the inplace parameter.
1. Either use inplace=True -
mydf.reset_index(inplace=True)
2. Or save the df into another (or the same) variable -
mydf = mydf.reset_index()
This should fix your issue.
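To see the whole flow in one place, here is a minimal sketch with made-up data: groupby moves the grouping key into the index, and reset_index moves it back out as a regular column:

import pandas as pd

# Hypothetical data, just to illustrate the pattern
mydf = pd.DataFrame({'county': ['A', 'A', 'B'], 'sales': [1, 2, 3]})

mydf = mydf.groupby('county').sum()  # 'county' becomes the index here
mydf = mydf.reset_index()            # and becomes a regular column again

print(mydf.columns.tolist())  # ['county', 'sales']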

Removing the index when appending data and rewriting CSV using pandas [duplicate]

This question already has answers here:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
(11 answers)
Closed 1 year ago.
I have a script that runs on a daily basis to collect data.
I record this data in a CSV file using the following code:
old_df = pd.read_csv('/Users/tdonov/Desktop/Python/Realestate Scraper/master_data_for_realestate.csv')
old_df = old_df.append(dataframe_for_cvs, ignore_index=True)
old_df.to_csv('/Users/tdonov/Desktop/Python/Realestate Scraper/master_data_for_realestate.csv')
I am using append(ignore_index=True), but after every run of the code I still get an additional column created at the start of my CSV. I delete these manually, but is there a way to prevent them from the code itself? I looked at the function, but I am still not sure if it is possible.
My result file gets an extra unnamed index column added at the start after every run, one per run (screenshot omitted). This is really annoying to have to delete every time.
Update: the data includes an id column (sample omitted), but the id is not unique; it can repeat from day to day. It is the id of an online offer, and an offer can be available for one day, a couple of days, or five months.
Did you try to_csv(index=False)? By default, to_csv writes the DataFrame index as an unnamed first column, which then shows up as an extra "Unnamed: 0"-style column the next time you read the file back in.
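A sketch of the fix in context, reusing the path and variables from the question (note: DataFrame.append was removed in pandas 2.0, so on recent versions you would use pd.concat instead):

import pandas as pd

path = '/Users/tdonov/Desktop/Python/Realestate Scraper/master_data_for_realestate.csv'

old_df = pd.read_csv(path)
old_df = old_df.append(dataframe_for_cvs, ignore_index=True)  # on pandas 2.0+: pd.concat([old_df, dataframe_for_cvs], ignore_index=True)

# index=False stops pandas from writing the row index as an extra column
old_df.to_csv(path, index=False)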

How to use apply and lambda to change a pandas data frame column [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I cannot figure this out. I want to change the "type" column in this dataset to 0/1 values.
url = "http://www.stats.ox.ac.uk/pub/PRNN/pima.tr"
Pima_training = pd.read_csv(url,sep = '\s+')
Pima_training["type"] = Pima_training["type"].apply(lambda x : 1 if x == 'Yes' else 0)
I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
This is a warning, and it won't break your code. It happens when pandas detects chained assignment: when you use multiple indexing operations in a row, there may be ambiguity about whether you are modifying the original dataframe or a copy of it. More experienced programmers have explained it in depth in another SO thread, so feel free to give that a read for a fuller explanation.
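As an illustration (with a made-up column name), the warning typically comes from a pattern like this:

# Chained assignment: the first indexing step may return a view or a copy,
# so pandas cannot tell whether the second step modifies the original frame.
sub = df[df['score'] > 0]
sub['score'] = 1  # may emit SettingWithCopyWarning

# One unambiguous .loc call on the original frame avoids the problem:
df.loc[df['score'] > 0, 'score'] = 1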
In your particular example, you don't need .apply at all (see this question for why not; in short, apply on a single column is inefficient because it loops over the rows internally). It makes more sense to use .replace instead and pass it a dictionary:
Pima_training['type'] = Pima_training['type'].replace({"No":0,"Yes":1})
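Alternatively, since the column only ever holds 'Yes'/'No' values, a fully vectorized comparison-and-cast works too (a sketch, not from the original answer):

Pima_training['type'] = (Pima_training['type'] == 'Yes').astype(int)  # True/False -> 1/0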

Best way to select columns in python pandas dataframe [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I have two ways in my script of selecting specific rows from a dataframe:
1.
df2 = df1[(df1['column_x']=='some_value')]
2.
df2 = df1.loc[df1['column_x'].isin(['some_value'])]
From an efficiency perspective, and from a Pythonic perspective (as in, what is the most Pythonic way of coding), which method of selecting specific rows is preferred?
P.S. Also, I feel there are probably even more ways to achieve the same thing.
P.P.S. I feel this question has already been asked, but I couldn't find it. Please link it if this is a duplicate.
They are different. df1[(df1['column_x']=='some_value')] is fine if you're just looking for a single value. The advantage of isin is that you can pass it multiple values. For example: df1.loc[df1['column_x'].isin(['some_value', 'another_value'])]
It's interesting to see that from a performance perspective, the first method (using ==) actually seems to be significantly slower than the second (using isin):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.choice(['a', 'b', 'c'], 10000)})

def method1(df=df):
    return df[df['x'] == 'b']

def method2(df=df):
    return df[df['x'].isin(['b'])]
>>> timeit.timeit(method1,number=1000)/1000
0.001710233046906069
>>> timeit.timeit(method2,number=1000)/1000
0.0008507879299577325
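As for the P.S. about other ways: pandas also offers query(), which reads closest to an SQL-style filter. A minimal sketch using the same names as above:

df2 = df1.query("column_x == 'some_value'")  # note the inner quotes around the string value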

SQL-like statements in pandas? [duplicate]

This question already has an answer here:
index a Python Pandas dataframe with multiple conditions SQL like where statement
(1 answer)
Closed 7 years ago.
I'm trying to run the following SQL statement (obviously in Python code) in pandas, but am getting nowhere:
select year, contraction, indiv_count, total_words from dataframe
where contraction in ("i'm","we're","we've","it's","they're")
Here contraction is a char (string) column, and year, indiv_count, and total_words are int.
I'm not too familiar with pandas. How do I create a similar statement in Python?
I'd recommend reading the docs listed in Anton's comment if you haven't already, but they lack documentation for the .isin() method, which is what you need to replicate the SQL in clause.
df[df['contraction'].isin(["i'm","we're","we've","it's","they're"])]
The column selection can then be done with .loc[] or whatever your favorite method for that is (there are many).
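Putting the row filter and the column selection together, a minimal sketch of the full SQL equivalent, using the names from the question, looks like this:

contractions = ["i'm", "we're", "we've", "it's", "they're"]
cols = ['year', 'contraction', 'indiv_count', 'total_words']

# WHERE contraction IN (...) and the SELECT column list in a single .loc call
result = dataframe.loc[dataframe['contraction'].isin(contractions), cols]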
