Get row-index values of Pandas DataFrame as list? [duplicate] - python

This question already has answers here:
How do I convert a Pandas series or index to a NumPy array? [duplicate]
(8 answers)
Closed 4 years ago.
I'm probably using poor search terms when trying to find this answer. Right now, before indexing a DataFrame, I'm getting a list of values in a column this way...
list = list(df['column'])
...then I'll set_index on the column. This seems like a wasted step. When trying the above on an index, I get a key error.
How can I grab the values in an index (both single and multi) and put them in a list or a list of tuples?

To get the index values as a list/list of tuples for Index/MultiIndex do:
df.index.values.tolist() # an ndarray method, you probably shouldn't depend on this
or
list(df.index.values) # this will always work in pandas

If you're only getting these to manually pass into df.set_index(), that's unnecessary. Just call df.set_index('your_col_name', drop=False) directly.
It's very rare in pandas that you need to get an index as a Python list (unless you're doing something pretty funky, or else passing them back to NumPy), so if you're doing this a lot, it's a code smell that you're doing something wrong.
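To make the single- versus multi-index behavior concrete, here is a minimal sketch (the frames and labels are made up for illustration); note that pandas Index objects also expose .tolist() directly:

```python
import pandas as pd

# Flat index -> a plain Python list
df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
print(df.index.tolist())   # ['x', 'y']

# MultiIndex -> a list of tuples, one tuple per row
mi = pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["k", "n"])
mdf = pd.DataFrame({"a": [1, 2]}, index=mi)
print(mdf.index.tolist())  # [('x', 1), ('y', 2)]
```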

How to use apply and lambda to change a pandas data frame column [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I cannot figure this out. I want to change the "type" column in this dataset to 0/1 values.
url = "http://www.stats.ox.ac.uk/pub/PRNN/pima.tr"
Pima_training = pd.read_csv(url,sep = '\s+')
Pima_training["type"] = Pima_training["type"].apply(lambda x : 1 if x == 'Yes' else 0)
I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
This is a warning, not an error, and it won't break your code. Pandas raises it when it detects chained assignment: multiple indexing operations in a row, where it is ambiguous whether you are modifying the original DataFrame or a copy of it. More experienced programmers have explained it in depth in another SO thread, so feel free to give it a read for a fuller explanation.
In your particular example, you don't need .apply at all (see this question for why not: applying a Python function to a single column loops over the rows internally and is very inefficient). I think it makes more sense to use .replace instead and pass a dictionary.
Pima_training['type'] = Pima_training['type'].replace({"No":0,"Yes":1})
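As a side note, Series.map with the same dictionary does the same job, with one difference worth knowing: values absent from the dictionary become NaN under map, while replace leaves them unchanged. A sketch with made-up data:

```python
import pandas as pd

s = pd.Series(["Yes", "No", "Yes"])

# replace: values missing from the dict pass through untouched
print(s.replace({"No": 0, "Yes": 1}).tolist())  # [1, 0, 1]

# map: values missing from the dict become NaN, so only use it
# when the dictionary covers every value in the column
print(s.map({"No": 0, "Yes": 1}).tolist())      # [1, 0, 1]
```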

pandas - is there any difference between df.loc[df['column_label'] == filter_value] and df[df['column_label'] == filter_value] [duplicate]

This question already has an answer here:
Pandas, loc vs non loc for boolean indexing
(1 answer)
Closed 2 years ago.
I am learning pandas and want to know the best practice for filtering rows of a DataFrame by column values.
According to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html, the recommendation is to use optimized pandas data access methods such as .loc
An example from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html -
df.loc[df['shield'] > 6]
However, according to https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#where, a construction like tips[tips['time'] == 'Dinner'] could be used.
Why is the recommended .loc omitted? Is there any difference?
With .loc you can also correctly set a value: without it, assignment can trigger the A value is trying to be set on a copy of a slice from a DataFrame warning. For reading values out of your DataFrame there might be performance differences, but I haven't measured them.
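A sketch with made-up data illustrating the difference between the two cases:

```python
import pandas as pd

df = pd.DataFrame({"shield": [4, 7, 9], "name": ["a", "b", "c"]})

# For *reading*, both forms select exactly the same rows
mask = df["shield"] > 6
assert df.loc[mask].equals(df[mask])

# For *writing*, use a single .loc call so pandas knows you mean
# the original frame rather than a temporary copy
df.loc[df["shield"] > 6, "name"] = "strong"
print(df["name"].tolist())  # ['a', 'strong', 'strong']
```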

python id()'s function returns different values for the same object [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 4 years ago.
I have a small dataframe, say this one :
Mass32 Mass44
12 0.576703 0.496159
13 0.576658 0.495832
14 0.576703 0.495398
15 0.576587 0.494786
16 0.576616 0.494473
...
I would like to have a rolling mean of column Mass32, so I do this:
x['Mass32s'] = pandas.rolling_mean(x.Mass32, 5).shift(-2)
It works as in I have a new column named Mass32s which contains what I expect it to contain but I also get the warning message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I'm wondering if there's a better way to do it, notably to avoid getting this warning message.
This warning appears because your dataframe x is (or may be) a copy of a slice of another DataFrame. Pandas cannot always tell; it depends on how x reached its current state.
You can turn x into an independent dataframe by doing
x = x.copy()
This will remove the warning, but it is a workaround rather than the idiomatic fix.
You should be using the DataFrame.loc method, as the warning suggests, like this:
x.loc[:,'Mass32s'] = pandas.rolling_mean(x.Mass32, 5).shift(-2)
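Note that pandas.rolling_mean was removed in later pandas versions; the modern equivalent calls .rolling on the column itself. A sketch with made-up values in place of the original data:

```python
import pandas as pd

x = pd.DataFrame({"Mass32": [0.5767, 0.5766, 0.5767, 0.5765, 0.5766, 0.5764]})
x = x.copy()  # make x its own frame, not a slice of another one

# 5-point rolling mean, shifted as in the original code
x.loc[:, "Mass32s"] = x["Mass32"].rolling(5).mean().shift(-2)

# Only rows with a full 5-point window get a value (2 of the 6 here)
print(int(x["Mass32s"].notna().sum()))  # 2
```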

Best way to select columns in python pandas dataframe [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I have two ways in my script of selecting specific rows from a DataFrame:
1.
df2 = df1[(df1['column_x']=='some_value')]
2.
df2 = df1.loc[df1['column_x'].isin(['some_value'])]
From an efficiency perspective, and from a Pythonic perspective (as in, the most Pythonic way of coding), which method of selecting specific rows is preferred?
P.S. Also, I feel there are probably even more ways to achieve the same.
P.P.S. I feel this question has already been asked, but I couldn't find it. Please link it if this is a duplicate.
They are different. df1[(df1['column_x']=='some_value')] is fine if you're just looking for a single value. The advantage of isin is that you can pass it multiple values. For example: df1.loc[df1['column_x'].isin(['some_value', 'another_value'])]
It's interesting to see that from a performance perspective, the first method (using ==) actually seems to be significantly slower than the second (using isin):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.choice(['a', 'b', 'c'], 10000)})

def method1(df=df):
    return df[df['x'] == 'b']

def method2(df=df):
    return df[df['x'].isin(['b'])]

>>> timeit.timeit(method1, number=1000) / 1000
0.001710233046906069
>>> timeit.timeit(method2, number=1000) / 1000
0.0008507879299577325
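Since you suspect there are even more ways to do this: DataFrame.query is a third option that expresses the same filter as a string. A sketch with made-up data showing all three agree:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "c", "b"]})

m1 = df[df["x"] == "b"]            # boolean mask with ==
m2 = df[df["x"].isin(["b"])]       # isin, extensible to many values
m3 = df.query("x == 'b'")          # query string

assert m1.equals(m2) and m2.equals(m3)
print(len(m3))  # 2
```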

Why does formatting of a dataframe automatically get applied to another? [duplicate]

This question already has answers here:
why should I make a copy of a data frame in pandas
(8 answers)
Closed 4 years ago.
Sorry if this is a really dumb question. I'm a total noob at pandas and can't even figure out what key words to use to search for a solution for the problem I have.
Basically, I have a numeric data frame,
numeric_df = pd.DataFrame({"colA": [1.23, 2.34, 3.45],
                           "colB": [1.00, 2.00, 3.00]})
Now I create a second df that I intend to be a duplicate of numeric_df
formatted_df = numeric_df
Then I format the two columns in "formatted_df" according to my needs. I'm doing it this way because I want to keep the values in numeric_df as numbers, so I can operate on them later.
formatted_df["colA"] = formatted_df["colA"].map("${:}".format)
formatted_df["colB"] = formatted_df["colB"].map("{:}Years".format)
But now, if I view numeric_df, its columns have also been formatted and converted into strings. What is causing the problem? Why does my map method modify the original data frame?
Thank you in advance for any help you can give.
Using formatted_df = numeric_df means both variables reference the same object; no data is duplicated. To manipulate one independently, you need a separate object, and for that pandas offers copy():
formatted_df = numeric_df.copy()
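A minimal sketch of the difference between aliasing and copying:

```python
import pandas as pd

numeric_df = pd.DataFrame({"colA": [1.23, 2.34]})

alias = numeric_df                 # same object under two names
assert alias is numeric_df

independent = numeric_df.copy()    # a new object with its own data
assert independent is not numeric_df

# Mutating the copy leaves the original untouched
independent["colA"] = independent["colA"].map("${:}".format)
print(numeric_df["colA"].tolist())  # [1.23, 2.34] -- still numeric
```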
