Map pandas dataframe column to a matrix - python

The following operation
import pandas as pd
import numpy as np
data = pd.read_csv(fname,sep=",",quotechar='"')
will create a 650,000 x 9 dataframe. The first column contains dates, and the following function is designed to turn a single date stamp into 5 separate features.
import time

def timepartition(elm):
    tm = time.strptime(elm, "%Y-%m-%d %H:%M:%S")
    return tm[0], tm[1], tm[2], tm[3], tm[4]
data["Dates"].map(timepartition)
What I would like is to assign those 5 values to a 650,000 x 7 NumPy matrix.
xtrn = np.zeros(shape=(data.shape[0],7))
xtrn[:,0:4] = np.asarray(data["Dates"].map(timepartition))
#above returns error ValueError: could not broadcast input array from shape (650000) into shape (650000,4)

You might try using some of the built-in pandas features.
dates = pd.to_datetime(data['Dates'])
date_df = pd.DataFrame(dict(
    year=dates.dt.year,
    month=dates.dt.month,
    day=dates.dt.day,
    # etc.
))
xtrn[:, :5] = date_df.values  # use date_df[['year', 'month', 'day', etc.]] if the order comes out wrong
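If you want all five features from the question (year, month, day, hour, minute), the same pattern extends directly. A sketch, selecting the columns explicitly since dict ordering is not guaranteed on older Pythons:
dates = pd.to_datetime(data['Dates'])
date_df = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day': dates.dt.day,
    'hour': dates.dt.hour,
    'minute': dates.dt.minute,
})[['year', 'month', 'day', 'hour', 'minute']]  # explicit order, so the slice assignment is well-defined
xtrn = np.zeros(shape=(data.shape[0], 7))
xtrn[:, :5] = date_df.values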

Mapping over a Series produces a new Series, and because timepartition returns tuples, the result comes back as a Series of dtype object rather than a 2-D numeric array.
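To see this concretely, a minimal sketch: the mapped series has dtype object, and stacking its tuples into a real 2-D array is enough to make the slice assignment work:
mapped = data["Dates"].map(timepartition)
print(mapped.dtype)  # object -- each element is a 5-tuple, not a numeric row
xtrn[:, 0:5] = np.array(mapped.tolist())  # stack the tuples into an (n, 5) array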
Another approach is the following.
Make the following change to timepartition:
def timepartition(elm):
    tm = time.strptime(elm, "%Y-%m-%d %H:%M:%S")
    return [tm[i] for i in range(5)]
This now returns a list instead of a tuple. The following code creates a matrix from the dataframe series with the desired dimensions and maps it to xtrn.
xtrn[:,0:5] = np.matrix(map(timepartition, data["Dates"].tolist()))
np.matrix infers the 650,000 x 5 shape from the nested lists produced by applying the partitioning function to a flat list representation of the series.
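Note that np.matrix(map(...)) relies on Python 2, where map returns a list; under Python 3, map returns an iterator, so materialize the rows first. A sketch:
# Python 3: map() is lazy, so build the nested list explicitly.
xtrn[:, 0:5] = np.array([timepartition(s) for s in data["Dates"]])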

The following worked for me. I'm not sure which method is faster, but it was easier for me to understand logically what's going on. Here my dataset "crimes" is your "data" and our time formats are a bit different.
def timepartition(elm):
    tm = time.strptime(elm, "%m/%d/%Y %H:%M:%S %p")
    return tm[0:5]
zeros = np.zeros(shape=(crimes.shape[0], 3), dtype=int)
dates = np.array([timepartition(crimes["Date"][i]) for i in range(0,len(crimes))])
new = np.hstack((dates,zeros))

Related

Is there a way to add two arrays in two columns into a third array using pandas

I am working on a project which uses a pandas data frame. I have received some values into the columns, as below.
I need to add the pos_vec column and the word_vec column elementwise and create a new column called sum_of_arrays. The arrays in the third column should also have size 2.
Eg:
pos_vec                      word_vec                   sum_of_arrays
[-0.22683072, 0.32770252]    [0.3655883, 0.2535131]     [0.13875758, 0.58121562]
Is there anyone who can help me? I'm stuck in here. :(
If you convert them to np.array you can simply sum them.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'pos_vec': [[-0.22683072,0.32770252],[0.14382899,0.049593687],[-0.24300802,-0.0908088],[-0.2507714,-0.18816864],[0.32294357,0.4486494]],
    'word_vec': [[0.3655883,0.2535131],[0.33788466,0.038143277],[-0.047320127,0.28842866],[0.14382899,0.049593687],[-0.24300802,-0.0908088]],
})
If you want to use numpy
df['col_sum'] = df[['pos_vec','word_vec']].applymap(lambda x: np.array(x)).sum(1)
If you don't want to use numpy
df['col_sum'] = df.apply(lambda r: [sum(t) for t in zip(r.pos_vec, r.word_vec)], axis=1)
There may be cleaner approaches that iterate over the columns with pandas itself; however, this is the solution I came up with by extracting the data from the DataFrame as lists:
# Extract data as lists
pos_vec = df["pos_vec"].tolist()
word_vec = df["word_vec"].tolist()
# Create new list with desired calculation
sum_of_arrays = [[x+y for x,y in zip(l1, l2)] for l1,l2 in zip(pos_vec, word_vec)]
# Add new list to DataFrame
df["sum_of_arrays"] = sum_of_arrays

statsmodels has trouble predicting on formulas using functions like log on rows of heterogeneous type

I have a pandas DataFrame whose rows contain data of multiple types. I want to fit a model based on this data using statsmodels.formula.api and then make some predictions. For my application I want to make predictions a single row at a time. If I do this naively I get AttributeError: 'numpy.float64' object has no attribute 'log' for the reason described in this answer. Here's some sample code:
import string
import random
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# Generate an example DataFrame
N = 100
z = np.random.normal(size=N)
u = np.random.normal(size=N)
w = np.exp(1 + u + 2*z)
x = np.exp(z)
y = np.log(w)
names = ["".join(random.sample(string.ascii_lowercase, 4)) for lv in range(N)]
df = pd.DataFrame({"x": x, "y": y, "names": names})
reg_spec = "y ~ np.log(x)"
fit = smf.ols(reg_spec, data=df).fit()
series = df.iloc[0] # In reality it would be `apply` extracting the rows one at a time
print(series.dtype) # gives `object` if `names` is in the DataFrame
print(fit.predict(series)) # AttributeError: 'numpy.float64' object has no attribute 'log'
The problem is that apply feeds me rows as Series, not DataFrames, and because I'm working with multiple types, the Series have type object. Sadly np.log doesn't like Series of objects even if all the objects are in fact floats. Swapping apply for transform doesn't help. I could create an intermediate DataFrame with only numeric columns or change my regression specification to y ~ np.log(x.astype('float64')). In the context of a larger program with a more complicated formula these are both pretty ugly. Is there a cleaner approach I'm missing?
Although you said you don't want to create an intermediate DataFrame with only numeric columns because it's pretty ugly, I think using select_dtypes to create a numbers-only subset of your Series on the fly is quite elegant and doesn't involve large code modifications:
series = df.select_dtypes(include='number').iloc[0]
Another solution that dawned on me as I was doing some other work is to convert the Series that apply gives me into a DataFrame consisting of a single row. This works:
row_df = pd.DataFrame([series])
print(fit.predict(row_df))
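Combining that with apply gives one-row-at-a-time predictions without touching the formula. A sketch (preds is a hypothetical name):
# Wrap each row back into a one-row DataFrame before predicting.
preds = df.apply(lambda row: np.asarray(fit.predict(pd.DataFrame([row])))[0], axis=1)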

Vectorising pandas dataframe apply function for user defined function in python

I want to compute the week of the month for a specified date. For this, I currently use the user-defined function below.
Input and output data frames: (shown as images in the original post; the output is the input with an added week_of_month column)
Here is what I have tried:
from math import ceil

def week_of_month(dt):
    """
    Returns the week of the month for the specified date.
    """
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))
After this,
import pandas as pd
df = pd.read_csv("input_dataframe.csv")
df.date = pd.to_datetime(df.date)
df['year_of_date'] = df.date.dt.year
df['month_of_date'] = df.date.dt.month
df['day_of_date'] = df.date.dt.day
import datetime
wom = pd.Series()
# worker function for creating week of month series
def convert_date(t):
    global wom
    wom = wom.append(pd.Series(week_of_month(datetime.datetime(t[0], t[1], t[2]))), ignore_index=True)
# calling worker function for each row of dataframe
_ = df[['year_of_date','month_of_date','day_of_date']].apply(convert_date, axis = 1)
# adding new computed column to dataframe
df['week_of_month'] = wom
# here this updated dataframe should look like Output data frame.
For each row of the data frame, this computes the week of the month using the given function. The computation gets slower as the data frame grows, and I currently have more than 10M rows.
I am looking for a faster way of doing this. What changes can I make to this code to vectorize the operation across all rows?
Thanks in advance.
Edit: What worked for me after reading answers is below code,
first_day_of_month = pd.to_datetime(df.date.values.astype('datetime64[M]'))
df['week_of_month'] = np.ceil((df.date.dt.day + first_day_of_month.weekday) / 7.0).astype(int)
The week_of_month method can be vectorized. It could be beneficial to skip the conversion to datetime objects entirely and instead use pandas-only methods:
first_day_of_month = df.date.dt.to_period("M").dt.to_timestamp()
df["week_of_month"] = np.ceil((df.date.dt.day + first_day_of_month.dt.weekday) / 7.0).astype(int)
Right off the bat, without even going into your code or mentioning X/Y problems, etc.: try to get a list of unique dates. I'm sure that among the 10M rows, many are duplicates.
Steps (a sketch follows):
create a 2nd df that contains only the columns you need and no duplicates (drop_duplicates)
run your function on the small dataframe
merge the large and small dfs
(optional) drop the small one
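A minimal sketch of that flow, assuming the parsed date column from the question and the week_of_month function above ('day' is a hypothetical helper column):
df['day'] = df['date'].dt.normalize()                        # week of month ignores the time part
small = df[['day']].drop_duplicates()                        # 2nd df: only what we need, no duplicates
small['week_of_month'] = small['day'].apply(week_of_month)   # run the function on the small df
df = df.merge(small, on='day', how='left')                   # merge the large and small dfs
df = df.drop('day', axis=1)                                  # (optional) drop the helper column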

What is the correct way of passing parameters to stats.friedmanchisquare based on a DataFrame?

I am trying to pass values to stats.friedmanchisquare from a dataframe df that has shape (11, 17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to matrix form and leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on #Ami Tavory's and #vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to unpack the list of rows with the * (star/unpack) operator, as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster; as it turns out, converting to a numpy array first is about twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing a single list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list
Like this
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
import numpy as np
from scipy.stats import friedmanchisquare
a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe, I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, then go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF and reading and entering new values for each row, this code iterates over DF and enters the same values for every row. I am not sure what I am doing wrong here.
Also, from what I have read, it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. I'm not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the column to be filled equal to the equation applied to those variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...") # df is a large 2D array
A = df[0]
B = df[1]
def f(a, b):
    ...  # your equation in terms of a and b
df[3] = f(A, B)
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test # Default column names are 0, 1
test[0] # This is column 0
test.iloc[:, 0]  # column 0 selected by position, returned as a Series
test.columns = ['S', 'Q']  # Column names are easier to use
test  # Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test # results stored in DataFrame
# For more complicated stuff, try apply, as in Python pandas apply on more columns :
def toyfun(df):
    return df['S'] - df['Q']**2
test['out2']=test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]],columns = (list('AB')))
