reshaping data frame containing non-numeric value using Pandas

I am an R user currently trying to learn Python. At work I often need to reshape data frames in which each cell contains a string. Reshaping is easy for me using dcast from the reshape2 package in R. I want to do something similar with pandas, like the script below:
import numpy as np
import pandas as pd
temp = pd.DataFrame(index=np.arange(10), columns=['a','b','c','d'])
temp['a'] = 'A'
temp['b'] = 'B'
temp['c'] = 'C'
temp['d'] = 'D'
temp = pd.melt(temp, id_vars=['a','b'])
temp
pd.pivot_table(temp,index=['a','b'],columns='variable',values='value')
It keeps raising DataError: No numeric types to aggregate. I think aggfunc is the issue, because its default is np.mean. Is there another aggfunc that lists the cell values rather than computing a number from them?

pd.pivot_table(temp, index=['a','b'], columns='variable', values='value',
               aggfunc=lambda x: ', '.join(x.unique()))
You can pass your own function as aggfunc. Here the lambda joins the unique strings in each group into one comma-separated string.
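To make the effect of a custom aggfunc concrete, here is a minimal sketch on a small made-up frame (the data and the names long_df, joined and first are illustrative, not from the question). It also shows aggfunc='first', which may be enough when each group is known to hold a single value.

```python
import pandas as pd

# Illustrative data: column "c" holds two distinct strings for the same (a, b) key.
temp = pd.DataFrame({
    "a": ["A", "A", "A"],
    "b": ["B", "B", "B"],
    "c": ["C", "C", "X"],
    "d": ["D", "D", "D"],
})
long_df = pd.melt(temp, id_vars=["a", "b"])

# Join the unique strings of each group into a single cell.
joined = pd.pivot_table(
    long_df, index=["a", "b"], columns="variable", values="value",
    aggfunc=lambda x: ", ".join(x.unique()),
)

# Alternative: keep only the first string of each group.
first = pd.pivot_table(
    long_df, index=["a", "b"], columns="variable", values="value",
    aggfunc="first",
)

print(joined)
print(first)
```

Joining keeps every distinct value visible, while 'first' silently drops the others, so the choice depends on whether duplicates per key are expected.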

Related

Using a Lambda Function to Convert a Column That's a List to a Set

I am trying to figure out a way to convert a column in a data frame that currently holds lists so that it holds sets instead.
#Converting column from a list to a set
df['Column2']=s.apply(lambda x: [x])
The error I am getting is:
NameError: name 's' is not defined

Hope this helps. It uses a lambda to go over the column and wrap each value in a set.
import pandas as pd
df = pd.DataFrame({'Column1':[1,2,3],'Column2':[4,5,6]})
df['Column1'] = df['Column1'].apply(lambda x: {x})
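If the cells already hold lists, as the question describes, a sketch along these lines converts each list to a set by passing the built-in set to apply. The sample values are made up for illustration.

```python
import pandas as pd

# Illustrative frame: Column2 holds a list in every row.
df = pd.DataFrame({
    "Column1": [1, 2],
    "Column2": [[4, 4, 5], [6, 7, 6]],
})

# Call apply on the column itself (not on an undefined name such as `s`)
# and let set() turn each list into a set, dropping duplicates.
df["Column2"] = df["Column2"].apply(set)

print(df)
```

Note that `lambda x: {x}` would fail here, because a list is not hashable and cannot be an element of a set.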

Change DataTypes of Pandas Columns by selecting columns by regex

I have a Pandas dataframe with a lot of columns looking like p_d_d_c0, p_d_d_c1, ... p_d_d_g1, p_d_d_g2, ....
df =
a b c p_d_d_c0 p_d_d_c1 p_d_d_c2 ... p_d_d_g0 p_d_d_g1 ...
All the columns that conform to the regex need to be selected and their data types changed from object to float. In particular, the columns that look like p_d_d_c* and p_d_d_g* are all of object type, and I would like to change them to float. Is there a way to select columns in bulk using a regular expression and change them to float?
I tried the answer from here, but it takes a lot of time and memory as I have hundreds of these columns.
df.filter(regex=("p_d_d_.*"))
I also tried:
df.select(lambda col: col.startswith('p_d_d_g'), axis=1)
But, it gives an error:
AttributeError: 'DataFrame' object has no attribute 'select'
My Pandas version is 1.0.1
So, how can I select columns in bulk using a regex and change their data types?

Try this:
import pandas as pd
# sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4], "abc":[5,4,3,2]})
# select all columns whose name contains "co"
floatcols = [col for col in df.columns if "co" in col]
for floatcol in floatcols:
    df[floatcol] = df[floatcol].astype(float)

From the same link, and with some astype magic.
column_vals = df.columns.map(lambda x: x.startswith("p_d_d_"))
train_temp = df.loc(axis=1)[column_vals]
train_temp = train_temp.astype(float)
EDIT:
To modify the original dataframe, do something like this:
column_vals = [x for x in df.columns if x.startswith("p_d_d_")]
df[column_vals] = df[column_vals].astype(float)
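Since the question asks specifically for a regular expression, here is a minimal sketch that uses DataFrame.filter(regex=...) only to collect the matching column names and then converts those columns in one assignment. The sample data and the exact pattern are assumptions made for illustration.

```python
import pandas as pd

# Illustrative frame: numeric values stored as strings in the p_d_d_* columns.
df = pd.DataFrame({
    "a": ["x", "y"],
    "p_d_d_c0": ["1.5", "2"],
    "p_d_d_g1": ["3", "4.25"],
    "p_d_d_z9": ["keep", "me"],  # does not match the pattern below
})

# filter() matches the regex against column names; keep only the names.
cols = list(df.filter(regex=r"^p_d_d_[cg]\d+$").columns)

# Convert all matching columns at once and write them back.
df[cols] = df[cols].astype(float)

print(df.dtypes)
```

If some cells may not parse as numbers, applying pd.to_numeric with errors="coerce" to each selected column is a common alternative to astype(float).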

SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index

I keep getting the following error on Databricks:
SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index, got You are trying to use pandas function .iloc[..., ...], use spark function select, where
This is my code:
import re
import nltk
import heapq
corpus = []
for i in range(0, len(Y)):
    describe = re.sub('[^a-zA-Z]', ' ', Y.iloc[i, 0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
The code works fine in Spyder, but not in Databricks.

I successfully reproduced your issue with the code below.
import numpy as np
import pandas as pd
import databricks.koalas as ks
dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df = ks.from_pandas(pdf)
print(pdf.iloc[0,0])
print(df.iloc[0,0])
You did not describe your variable Y, so I assume it is a dataframe. The difference is that it is a pandas dataframe in your local Spyder but a Koalas dataframe in Databricks.
According to the Koalas documentation for databricks.koalas.DataFrame.iloc, the operation iloc[int, int] is not supported on a Koalas dataframe.
So if you want to operate on the first-column value of each row in Databricks, there are two options:
1. Make sure Y is a pandas dataframe in the same script on Databricks.
2. If Y must be a Koalas dataframe, try the code below.
# Here, `Y` is a Koalas dataframe
for _, row in Y.iterrows():
    # row is a Series; take its first value by position
    describe = re.sub('[^a-zA-Z]', ' ', row.iloc[0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
The iterrows function yields each row in turn, which lets you read the first-column value of every row.
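For the first option, where Y is a plain pandas dataframe, the same cleaning can be sketched without an explicit loop by using the string methods of the first column. This is a pandas-only sketch on made-up data; whether each of these methods behaves identically on a Koalas dataframe is not something it assumes.

```python
import pandas as pd

# Illustrative stand-in for Y: one text column.
Y = pd.DataFrame({"text": ["Hello, World!  42", "A--b"]})

cleaned = (
    Y.iloc[:, 0]                                   # first column as a Series
    .str.replace(r"[^a-zA-Z]", " ", regex=True)    # non-letters become spaces
    .str.lower()
    .str.split()                                   # split on runs of whitespace
    .str.join(" ")                                 # rejoin with single spaces
)
corpus = cleaned.tolist()

print(corpus)
```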

Rename column values using pandas DataFrame

In one of the columns of my dataframe I have five values:
1,G,2,3,4
How can I change every "G" to 1?
I tried:
df = df['col_name'].replace({'G': 1})
I also tried:
df = df['col_name'].replace('G',1)
"G" is in fact 1 (I do not know why the naming is mixed).
Edit:
This works correctly:
df['col_name'] = df['col_name'].replace({'G': 1})

If I am understanding your question correctly, you are trying to change the values in a column and not the column name itself.
Given the mixed data types, I assume the column is of type object, so the numbers are read as strings.
df['col_name'] = df['col_name'].str.replace('G', '1')

You could try the following line:
df.replace('G', 1, inplace=True)

Use NumPy:
import numpy as np
df['a'] = np.where((df.a =='G'), 1, df.a)

You can try this. Let's say your data looks like:
ab=pd.DataFrame({'a':[1,2,3,'G',5]})
And you replace it like this:
ab1=ab.replace('G',4)
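Which answer applies depends on what the column actually holds. The sketch below, on made-up data, covers both cases: a column mixing integers with the string 'G', and a column where everything is a string. In each case the result is assigned back to the column rather than to df, and then cast to int.

```python
import pandas as pd

# Case 1: integers mixed with the string "G" (object dtype).
mixed = pd.DataFrame({"col_name": [1, "G", 2, 3, 4]})
mixed["col_name"] = mixed["col_name"].replace({"G": 1}).astype(int)

# Case 2: every value is a string, so use the string accessor first.
text = pd.DataFrame({"col_name": ["1", "G", "2", "3", "4"]})
text["col_name"] = text["col_name"].str.replace("G", "1").astype(int)

print(mixed["col_name"].tolist())
print(text["col_name"].tolist())
```

Writing `df = df['col_name'].replace(...)`, as in the question's first attempts, would overwrite the whole dataframe with a single Series.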

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF and reading and entering new values for each row, this code iterates over DF and enters the same values for each row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.

It turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them. Then set the column to be filled equal to the equation applied to those variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...")  # df is a large 2D table
A = df[0]  # column labelled 0
B = df[1]  # column labelled 1
# define f(A, B) = .... as an ordinary function of the two columns
df[3] = f(A, B)  # evaluated for every row at once

# If your equations are simple enough, do the operations column-wise in pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test  # default column names are 0, 1
test[0]  # this is column 0
test.iloc[:, 0]  # column at position 0, returned as a Series (icol has been removed from pandas)
test.columns = ['S','Q']  # column names are easier to use
test  # column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test  # results stored in the DataFrame
# For more complicated stuff, try apply, as in "Python pandas apply on more columns".
# With axis=1 the function receives one row at a time:
def toyfun(row):
    return row['S'] - row['Q']**2
test['out2'] = test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you create the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]], columns=list('AB'))
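Applied to the shape of the original question (two input columns, two equations, two output columns), the column-wise approach might look like the sketch below. The functions equation_1 and equation_2 and the column names m, n, p, q are placeholders with made-up formulas, since the question does not give the real ones.

```python
import numpy as np
import pandas as pd

# Hypothetical equations; each works on whole columns at once.
def equation_1(x, y):
    return x**2 + y

def equation_2(x, y):
    return np.sqrt(x) * y

# Illustrative data standing in for the CSV file.
df = pd.DataFrame({"m": [1.0, 4.0, 9.0], "n": [2.0, 3.0, 4.0]})

# No loop: each call computes one result per row and aligns on the index.
df["p"] = equation_1(df["m"], df["n"])
df["q"] = equation_2(df["m"], df["n"])

print(df)
```

The loop in the question produced the same values on every pass because DF[m] and DF[n] select entire columns regardless of the current row, so each iteration simply repeated this whole-column computation.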
