How to change the sequence of some variables in a Pandas dataframe? - python

I have a dataframe that has 100 variables, var1--var100. I want to bring var40, var20, and var30 to the front, with the other variables remaining in their original order. I've searched online; methods like
1: df[[var40, var20, var30, var1....]]
2: columns= [var40, var20, var30, var1...]
all require specifying every variable in the dataframe. With 100 variables in my dataframe, how can I do this efficiently?
I am a SAS user; in SAS, we can use a retain statement before the set statement to achieve this. Is there an equivalent way in Python too?
Thanks

Consider reindex with a conditional list comprehension:
first_cols = ['var30', 'var40', 'var20']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')
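For example, a minimal self-contained sketch on a small hypothetical frame (var1..var5 standing in for your 100 variables) shows the three columns moving to the front while the rest keep their order:
import pandas as pd

# hypothetical toy frame standing in for the real 100-variable data
df = pd.DataFrame({f'var{i}': range(3) for i in range(1, 6)})

first_cols = ['var4', 'var2', 'var3']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')

print(df.columns.tolist())  # ['var4', 'var2', 'var3', 'var1', 'var5']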


Changes to a pandas dataframe in a for loop are only partially saved

I have two dfs and want to manipulate them in some way with a for loop.
I have found that creating a new column within the loop updates the df, but other commands like set_index or dropping columns do not stick.
import pandas as pd
import numpy as np

gen1 = pd.DataFrame(np.random.rand(12, 3))
gen2 = pd.DataFrame(np.random.rand(12, 3))
df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)
all_df = [df1, df2]

for x in all_df:
    x['test'] = x[1] + 1
    x = x.set_index(0).drop(2, axis=1)
    print(x)
Note that when each df is printed inside the loop, all the commands have clearly been applied. But when I call either df afterwards, only the new column 'test' is there; the set_index and the column drop are undone.
Am I missing something as to why only one of the commands has been made permanent? Thank you.
Here's what's going on:
x is a variable that at the start of each iteration of your for loop initially refers to an element of the list all_df. When you assign to x['test'], you are using x to update that element, so it does what you want.
However, when you assign something new to x, you are simply causing x to refer to that new thing without touching the contents of what x previously referred to (namely, the element of all_df that you are hoping to change).
You could try something like this instead:
for x in all_df:
    x['test'] = x[1] + 1
    x.set_index(0, inplace=True)
    x.drop(2, axis=1, inplace=True)

print(df1)
print(df2)
Please note that using inplace is often discouraged (see here for example), so you may want to consider whether there's a way to achieve your objective using new DataFrame objects created based on df1 and df2.
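For instance, a minimal sketch of the same loop rewritten without inplace, building new DataFrame objects and rebinding the original names (one possible approach, not the only one):
new_dfs = []
for x in all_df:
    y = x.copy()                      # leave the original untouched
    y['test'] = y[1] + 1
    y = y.set_index(0).drop(2, axis=1)
    new_dfs.append(y)

df1, df2 = new_dfs                    # rebind the names to the new DataFrames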

Creating new dataframe by appending rows from an old dataframe

I'm trying to create a dataframe by selecting rows that meet only specific conditions from a different dataframe.
Technicians can only select one of several fields for Column 1 using a dropdown menu, so I want to match that specific field. However, Column 2 is a free-text entry, so I'm looking for two specific keywords with any spelling/case.
I want all columns from the rows in the new dataframe.
Any help or insight would be much appreciated.
import pandas as pd

df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')
filter = ['x', 'y']
columns = df.columns
data = pd.DataFrame(columns=columns)

for row in df.iterrows():
    if 'Column 1' == 'a':
        row.data.append()
    elif df['Column 2'].str.contains('filter', case='false'):
        row.data.append()

print(data.head())
In general, it's best to have a vectorized solution, so I'll put my solution as follows (there are many ways to do this; this is one of the ways that came to my head). Here, you can use a simple boolean mask to filter rows, since you've already clearly defined your criteria (df['Column 1'] == 'a' or df['Column 2'].str.contains(..., case=False)).
As such, you can simply create a boolean mask from that criteria. By itself, df['Column 1'] == 'a' creates a boolean Series with the structure [True, False, True, True, ...], where each entry says whether the condition holds for the corresponding row of the original frame. Once you have that, you can index back into the original frame with df[df['Column 1'] == 'a'] to return your filtered rows.
Of course, since you have two conditions here (which follow an "or" clause), you can feed both of them into the boolean mask, such as df[(df['Column 1'] == 'a') | df['Column 2'].str.contains(..., case=False)].
I'm not at my development computer, so this might not work as expected due to a couple minor issues, but this is the general idea. This line should replace your entire df.iterrows block. Hope this helps :)
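For what it's worth, a tightened-up sketch of that single line might look like the following (parentheses around each condition, | for "or", case=False as a boolean, and the keyword list joined into a regex for str.contains; the file and column names are the ones from the question):
import pandas as pd

df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')
keywords = ['x', 'y']

# keep rows where Column 1 is 'a' OR Column 2 contains any keyword (case-insensitive)
mask = (df['Column 1'] == 'a') | df['Column 2'].str.contains('|'.join(keywords), case=False, na=False)
data = df[mask]

print(data.head())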

How to use multiple columns in filter and lambda functions pyspark

I have a dataframe in which I want to delete columns whose names start with "test", "id_1", "vehicle", and so on.
I use below code to delete one column
df1.drop(*filter(lambda col: 'test' in col, df.columns))
how can I specify all the prefixes at once in this line?
this doesn't work:
df1.drop(*filter(lambda col: 'test','id_1' in col, df.columns))
You can do something like the following:
expression = lambda col: any(col.startswith(i) for i in ['test', 'id_1', 'vehicle'])
df1.drop(*filter(expression, df.columns))
In PySpark version 2.1.0, it is possible to drop multiple columns using drop by providing a list of strings (with the names of the columns you want to drop) as argument to drop. (See documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop).
In your case, you may create a list containing the names of the columns you want to drop. For example:
cols_to_drop = [x for x in df1.columns if (x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle'))]
And then apply the drop unpacking the list:
df1.drop(*cols_to_drop)
Ultimately, it is also possible to achieve a similar result by using select. For example:
# Define columns you want to keep
cols_to_keep = [x for x in df.columns if x not in cols_to_drop]
# create new dataframe, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)
Note that, by using select you don't need to unpack the list.
Please note that this question also addresses a similar issue.
I hope this helps.
Well, it seems you can also build the drop list with a regular filter over the column names (plus any extra columns you need):
for_columns = [x for x in df.columns if x.startswith("test") or x.startswith("id_1") or x.startswith("vehicle")] + ["c_007"]
df.drop(*for_columns)

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
import pandas as pd
import pandas_datareader.data as web   # provides get_data_yahoo
import statsmodels.api as sm

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = pd.DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use iteritems() (items() in recent pandas, where iteritems() has been removed):
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values[0]))
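Applied to the regression in the question, a minimal sketch along those lines (assuming returns and sm are defined as in the question):
resids = {}
for name, series in returns.items():
    if name == 'FSTMX':
        continue                      # skip the regressor itself
    # missing='drop' discards the NaN row that pct_change() produces
    resids[name] = sm.OLS(series, returns['FSTMX'], missing='drop').fit().resid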
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the columns' names in the DF. Now that isn't very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly, to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using iloc (the older ix accessor has been removed from recent pandas versions).
df1.iloc[:, 1]
This returns the column at position 1, for example (positions start at 0).
df1.iloc[0]
This returns the first row.
This would be the value at the intersection of row 0 and column 1:
df1.iloc[0, 1]
and so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using list comprehension, you can get all the columns names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The above df[column] type is Series, which can simply be converted into numpy ndarrays:
import numpy as np

for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on DataFrame called aft_tmt. Feel free to extrapolate to your use case..
import pandas as pd

# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

import statsmodels.formula.api as smf
import itertools

# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)

# results DF
regression_res = pd.DataFrame(columns=["Rsq", "predictors", "excluded"])

# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    # columns excluded from this combination of predictors
    excluded = "+".join(y for y in itercols if y not in x)
    regression_res = pd.concat([regression_res,
                                pd.DataFrame([[f.rsquared, lmstr, excluded]],
                                             columns=["Rsq", "predictors", "excluded"])])

regression_res.sort_values(by="Rsq", ascending=False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
Assuming an X feature matrix and a (multicolumn) y label:
columns = [c for c in _df.columns if c in ['col1', 'col2', 'col3']]  # or '..c not in..'
_df.set_index(columns, inplace=True)
print(_df.index)
X, y = _df.iloc[:, :4].values, _df.index.values

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (except a manual lookup using the index 'i') to make persistent changes on the original DataFrame?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy):
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
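If the new value really is a more complex calculation rather than a constant, the same pattern still applies, because the right-hand side can be any aligned Series; a small sketch (the doubling is just a made-up example):
mask = df['Product'] == 'A'
df.loc[mask, 'Quantity'] = df.loc[mask, 'Quantity'] * 2   # stand-in for the "more complex calculation"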
I guess the best way that comes to mind is to generate a new vector with the desired result: you can loop over it all you want, then reassign it back to the column.
# make a copy of the column
P = df.Product.copy()

# do the operation, or loop if you really must
P[P == "A"] = "A1"

# reassign to the original df
df["Product"] = P
