I am trying to winsorize a data set that contains a few hundred columns of data. I'd like to add a new column to the dataframe that contains the winsorized result of its row's data. How can I do this with a pandas dataframe without having to specify each column (I'd like to use all columns)?
Edit: I want to use the function winsorize(list, limits=[0.1, 0.1]), but I'm not sure how to format the dataframe rows to work as a list.
Some tips:
You can use the pandas apply function with axis=1 to apply a function to every row.
The applied function will receive a pandas Series, which you can easily convert to a list with the tolist() method.
For example:
df.apply(lambda x: winsorize(x.tolist(), limits=[0.1,0.1]), axis=1)
Alternatively, you can apply winsorize to the NumPy version of your dataframe, obtained with to_numpy(). Note that winsorize returns a masked array, so .data is used below to recover the plain values:
from scipy.stats.mstats import winsorize
ma = winsorize(df.to_numpy(), axis=1, limits=[0.1, 0.1])
out = pd.DataFrame(ma.data, index=df.index, columns=df.columns)
I am reading data from Excel into a pandas DataFrame:
df = pd.read_excel(file, sheet_name='FactoidList', ignore_index=False, sort=False)
Applying sort=False preserves the original order of my columns. But when I apply a numpy condition list, which generates a numpy array, the order of the columns changes.
Numpy orders the columns alphabetically from A to Z and I do not know how I can prevent it. Is there an equivalent to sort=False?
I searched online but could not find a solution. The problem is that I want to re-convert the numpy array to a dataframe in the original format, re-applying the original column names.
ADDITION: code for condition list used in script:
condlist = [f['pers_name'].str.contains('|'.join(qn)) ^ f['pers_name'].isin(qn),
f['inst_name'].isin(qi),
f['pers_title'].isin(qt),
f['pers_function'].isin(qf),
f['rel_pers'].str.contains('|'.join(qr)) ^ f['rel_pers'].isin(qr)]
choicelist = [f['pers_name'],
f['inst_name'],
f['pers_title'],
f['pers_function'],
f['rel_pers']]
output = np.select(condlist, choicelist)
print(output) # this print output already shows an inversion of columns
rows = np.where(output)
new_array = f.to_numpy()
result_array = new_array[rows]
Reviewing my script, I figured out that the problem isn't numpy but pandas.
Before applying my condition list, I was appending the dataframe df (read with the explicit sort=False) to another dataframe f with the exact same structure, but I made the wrong assumption that the new combined dataframe would inherit sort=False.
Instead, I had to make it explicit:
f = f.append(df, ignore_index=False, sort=False)
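For reference, the same row-wise concatenation can be written with pd.concat, which also accepts sort=False (a sketch, assuming f and df share the same columns):
import pandas as pd
# stack the rows of df onto f without letting pandas reorder the shared columns alphabetically
f = pd.concat([f, df], ignore_index=False, sort=False)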
I have a list of columns from a dataframe
df_date = [df[var1], df[var2]]
I want to change the data in those columns to datetime type:
for t in df_date:
    pd.DatetimeIndex(t)
For some reason it's not working.
I would like to understand what a more general solution is for applying several operations to several columns.
As an alternative, you can do:
for column_name in ["var1", "var2"]:
    df[column_name] = pd.DatetimeIndex(df[column_name])
You can use pandas.to_datetime and pandas.DataFrame.apply to convert a dataframe's entire content to datetime. You can also filter out the columns you need and apply it only to them.
df[['column1', 'column2']] = df[['column1', 'column2']].apply(pd.to_datetime)
Note that a list of series and a DataFrame are not the same thing.
A DataFrame is accessed like this:
df[[columns]]
While a list of Series looks like this:
[seriesA, seriesB]
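As a tiny illustration of the difference (hypothetical column names):
import pandas as pd
df = pd.DataFrame({"var1": ["2020-01-01"], "var2": ["2020-02-01"]})
sub_df = df[["var1", "var2"]]            # a DataFrame with two columns
series_list = [df["var1"], df["var2"]]   # a plain Python list holding two Series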
Suppose I have a pandas data frame df with columns A, B and C. I would like to compute the row-wise minimum of arithmetic expressions on the columns, specifically df['D'] = min(df['A'] + df['B']*3, df['C']*np.sqrt(12)). I have seen related questions, and it would seem that I need to first create two columns holding the arguments of the min function and then take the min over axis=1. I was wondering if there was another way, without creating the temporary columns.
Without creating new columns, you can use apply:
df['D'] = df.apply(lambda x: min(x['A'] + x['B']*3, x['C']*np.sqrt(12)), axis=1)
But it's best to just do:
df['D'] = np.minimum(df['A'] + df['B']*3, df['C']*np.sqrt(12))
which creates two intermediate columns/series but is much faster thanks to vectorization.
I am using Python 2.7 with dask.
I have a dataframe with one column of tuples that I created like this:
table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame)
I want to convert this tuple column back into two separate columns.
In pandas I would do it like this:
table[[col1,col2]] = table[col].apply(pd.Series)
The point of doing this is that a dask dataframe does not support a multi-index, and I want to group by multiple columns, so I wish to create a column of tuples that gives me a single index containing all the values I need (please ignore efficiency vs. a multi-index, since dask dataframes do not yet fully support it).
When I try to unpack the tuple column with dask using this code:
rxTable[["a","b"]] = rxTable["tup"].apply(lambda x: s(x), meta = pd.DataFrame, axis = 1)
I get this error
AttributeError: 'Series' object has no attribute 'columns'
When I try:
rxTable[["a","b"]] = rxTable["tup"].apply(dd.Series, axis = 1, meta = pd.DataFrame)
I get the same error.
How can I take a column of tuples and convert it to two columns, as I do in pandas with no problem?
Thanks
The best approach I found so far is to convert to a pandas dataframe, convert the column there, and then go back to dask:
df1 = df.compute()
df1[["a","b"]] = df1["c"].apply(pd.Series)
df = dd.from_pandas(df1, npartitions=1)
This works well. If the df is too big for memory, you can instead:
1. Compute only the wanted column, convert it into two columns, and then merge the split results back into the original df (see the sketch below).
2. Split the df into chunks, convert each chunk and append it to an HDF5 file, then read the entire HDF5 file back into a dask dataframe.
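A rough sketch of option 1, assuming the tuple column is called 'tup' and joining on the index is acceptable (all names here are illustrative):
import pandas as pd
import dask.dataframe as dd
# pull only the tuple column into memory
tup = df["tup"].compute()
# unpack the tuples into two pandas columns, keeping the original index
unpacked = pd.DataFrame(tup.tolist(), index=tup.index, columns=["a", "b"])
# wrap the small frame back into dask and join it onto the original dataframe on the index
df = df.join(dd.from_pandas(unpacked, npartitions=df.npartitions))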
I found this methodology works well and avoids converting the Dask DataFrame to Pandas:
df['a'] = df['tup'].str.partition(sep)[0]
df['b'] = df['tup'].str.partition(sep)[2]
where sep is whatever delimiter you were using in the column to separate the two elements.
If I read a csv file into a pandas dataframe, followed by using a groupby (pd.groupby([column1,...])), why is it that I cannot call the to_excel attribute on the new grouped object?
import pandas as pd
data = pd.read_csv("some file.csv")
data2 = data.groupby(['column1', 'column2'])
data2.to_excel("some file.xlsx") #spits out an error about series lacking the attribute 'to_excel'
data3 = pd.DataFrame(data=data2)
data3.to_excel("some file.xlsx") #works just perfectly!
Can someone explain why pandas needs to go through the whole process of converting from a dataframe to a series to group the rows?
I believe I was unclear in my question.
Re-framed question: Why does pandas convert the dataframe into a different kind of object (groupby object) when you use pd.groupby()? Clearly, you can cast this object as a dataframe, where the grouped columns become the (multi-level) indices.
Why not do this by default (without the user having to manually cast it as a dataframe)?
To answer your reframed question about why groupby gives you a groupby object and not a DataFrame: it does this for efficiency. The groupby object doesn't duplicate all the info about the original data; it essentially stores indices into the original DataFrame, indicating which group each row is in. This allows you to use a single groupby object for multiple aggregating group operations, each of which may use different columns (e.g., you can do g = df.groupby('Blah') and then separately do g.SomeColumn.sum() and g.OtherColumn.mean()).
In short, the main point of groupby is to let you do aggregating computations on the groups. Simply pivoting the values of a single column out to an index level isn't what most people do with groupby. If you want to do that, you have to do it yourself.
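For instance, a minimal sketch with made-up data showing the usual pattern: aggregate the groups, then call reset_index if you want a flat DataFrame you can write out:
import pandas as pd
df = pd.DataFrame({
    "column1": ["x", "x", "y"],
    "column2": ["a", "b", "a"],
    "value":   [1, 2, 3],
})
g = df.groupby(["column1", "column2"])           # a GroupBy object, not a DataFrame
summed = g["value"].sum()                        # aggregation yields a Series with a MultiIndex
summed.reset_index().to_excel("some file.xlsx")  # a flat DataFrame again, so to_excel works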