creating a pandas dataframe from a list of image files - python

I am trying to create a pandas dataframe from a list of image files (.png files)
samples = []
img = misc.imread('a.png')
X = img.reshape(-1, 3)
samples.append(X)
I added multiple .png files in samples like this. I am then trying to create a pandas dataframe from this.
df = pd.DataFrame(samples)
It throws the error "ValueError: Must pass 2-d input". What is wrong here? Is it actually possible to convert a list of image files to a pandas dataframe? I am totally new to pandas, so please excuse me if this looks silly.
For ex.
X = [[1,2,3,4],[2,3,4,5]]
df = pd.DataFrame(X)
gives me a nice dataframe of 2 samples as expected (2 rows, 4 columns), but this does not happen with image files.

you can use:
df = pd.DataFrame.from_records(samples)

If you want to create a DataFrame from a list, the easiest way to do this is to create a pandas.Series, like the following example:
import pandas as pd
samples = ['a','b','c']
s = pd.Series(samples)
print(s)
output:
0 a
1 b
2 c

X = img.reshape(-1, 3)
samples.append(X)
So X is a 2D array of shape (number_of_pixels, 3), and that makes samples a 3D list of shape (number_of_images, number_of_pixels, 3). So the error you're getting ("ValueError: Must pass 2-d input") is legitimate.
What you probably want is:
X = img.flatten()
or
X = img.reshape(-1)
Either gives you X of shape (number_of_pixels*3,) and samples of shape (number_of_images, number_of_pixels*3).
You will need to take extra care to ensure that all images have the same number of pixels and channels; otherwise the rows will have different lengths.
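A minimal sketch of the flattening approach, using small synthetic arrays in place of real .png files. The 4x4 image size and the `images` list are assumptions for illustration, not the asker's data:

```python
import numpy as np
import pandas as pd

# Three synthetic 4x4 RGB "images" standing in for arrays loaded from .png files.
images = [np.full((4, 4, 3), fill_value=i, dtype=np.uint8) for i in range(3)]

# Flatten each image to 1-D so that the list of samples is 2-D overall.
samples = [img.reshape(-1) for img in images]

df = pd.DataFrame(samples)
print(df.shape)  # one row per image, one column per pixel-channel value
```

Each row of the resulting frame holds one image, so the frame has 4*4*3 = 48 columns here.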

You can use reshape(-1)
x.append((img[::2,::2]/255.0).reshape(-1))
df = pd.DataFrame(x)

Related

Is there a way to add two arrays in two columns into a third array using pandas

I am working on a project that uses a pandas data frame. I received some values in the columns as shown below.
I need to add the pos_vec column and the word_vec column element-wise and create a new column called sum_of_arrays. Each array in the new column should have size 2.
Eg: pos_vec Word_vec sum_of_arrays
[-0.22683072, 0.32770252] [0.3655883, 0.2535131] [0.13875758,0.58121562]
Is there anyone who can help me? I'm stuck in here. :(
If you convert them to np.array you can simply sum them.
import pandas as pd
import numpy as np
df = pd.DataFrame({'pos_vec':[[-0.22683072,0.32770252],[0.14382899,0.049593687],[-0.24300802,-0.0908088],[-0.2507714,-0.18816864],[0.32294357,0.4486494]],
'word_vec':[[0.3655883,0.2535131],[0.33788466,0.038143277], [-0.047320127,0.28842866],[0.14382899,0.049593687],[-0.24300802,-0.0908088]]})
If you want to use numpy
df['col_sum'] = df[['pos_vec','word_vec']].applymap(lambda x: np.array(x)).sum(1)
If you don't want to use numpy
df['col_sum'] = df.apply(lambda x: [sum(x) for x in zip(x.pos_vec,x.word_vec)], axis=1)
There are maybe cleaner approaches possible using pandas to iterate over the columns, however this is the solution I came up with by extracting the data from the DataFrame as lists:
# Extract data as lists
pos_vec = df["pos_vec"].tolist()
word_vec = df["word_vec"].tolist()
# Create new list with desired calculation
sum_of_arrays = [[x+y for x,y in zip(l1, l2)] for l1,l2 in zip(pos_vec,word_vec)]
# Add new list to DataFrame
df["sum_of_arrays"] = sum_of_arrays
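Another possible sketch, assuming every list in a column has the same length: stack each column into a 2-D NumPy array, add the arrays element-wise, and store the result back as lists. The frame below uses only the first two rows of the sample data, and the intermediate names `pos` and `word` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos_vec": [[-0.22683072, 0.32770252], [0.14382899, 0.049593687]],
    "word_vec": [[0.3655883, 0.2535131], [0.33788466, 0.038143277]],
})

# Stack the per-row lists into (n_rows, 2) arrays and add them element-wise.
pos = np.array(df["pos_vec"].tolist())
word = np.array(df["word_vec"].tolist())

# Convert back to a list of lists so each cell holds one 2-element list.
df["sum_of_arrays"] = (pos + word).tolist()

print(df["sum_of_arrays"].iloc[0])
```

This does the addition in a single vectorized step, which may matter for large frames; it would fail if the lists had differing lengths.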

Write two dataframes to two different columns in CSV

I converted two arrays into two dataframes and would like to write them to a CSV file in two separate columns. There are no common columns in the dataframes. I tried the solutions as follows and also from stack exchange but did not get the result. Solution 2 has no error but it prints all the data into one column. I am guessing that is a problem with how the arrays are converted to df? I basically want two column values of Frequency and PSD exported to csv. How do I do that ?
Solution 1:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_BP_frq['tmp'] = 1
df_BP_psd['tmp'] = 1
df_500 = pd.merge(df_BP_frq, df_BP_psd, on=['tmp'], how='outer')
df_500 = df_500.drop('tmp', axis=1)
Error: Unable to allocate 2.00 TiB for an array with shape (274870566961,) and data type int64
Solution 2:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_500 = df_BP_frq.merge(df_BP_psd, left_on='Frequency', right_on='PSD', how='outer')
No Error.
Result: The PSD values are all 0 and are seen below the frequency values in the lower rows.
Solution 3:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_500 = pd.merge(df_BP_frq, df_BP_psd, on='tmp').ix[:, ('Frequency','PSD')]
Error: KeyError: 'tmp'
Exporting to csv using:
df_500.to_csv("PSDvalues500.csv", index = False, sep=',', na_rep = 'N/A', encoding = 'utf-8')
You can directly store the arrays as columns of the dataframe. If both arrays have the same length, the following method works.
df_500 = pd.DataFrame()
df_500['Frequency'] = freq_BP[L_BP]
df_500['PSD'] = PSDclean_BP[L_BP]
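If the lengths of the arrays are different, you can convert them to Series and then add them as columns in the following way. This fills the missing values in the dataframe with NaN. Note that the longer array should be assigned first, because later columns are aligned to the index that already exists.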
df_500 = pd.DataFrame()
df_500['Frequency'] = pd.Series(freq_BP[L_BP])
df_500['PSD'] = pd.Series(PSDclean_BP[L_BP])
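For context on why Solution 1 in the question ran out of memory: merging on a constant `tmp` key pairs every row of one frame with every row of the other, so two frames of n rows produce n × n rows. The reported shape, 274870566961, equals 524281 squared, which is consistent with that. A small sketch with made-up three-element arrays (the names `freq` and `psd` are illustrative):

```python
import numpy as np
import pandas as pd

freq = np.array([1.0, 2.0, 3.0])
psd = np.array([0.1, 0.2, 0.3])

# Merging on a constant key yields every combination of rows (a cross join).
left = pd.DataFrame({"Frequency": freq, "tmp": 1})
right = pd.DataFrame({"PSD": psd, "tmp": 1})
crossed = pd.merge(left, right, on="tmp", how="outer").drop("tmp", axis=1)
print(len(crossed))  # 3 x 3 rows

# Building the frame from both arrays keeps them aligned row by row.
side_by_side = pd.DataFrame({"Frequency": freq, "PSD": psd})
print(len(side_by_side))  # 3 rows
```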
From your question, I understand that you have two arrays that you want to store in separate columns of one dataframe, and then save that dataframe to a CSV file with separate columns.
Creating two NumPy arrays of equal length:
import numpy as np
import pandas as pd
n1 = np.arange(2, 100, 0.01)
n2 = np.arange(3, 101, 0.01)
Creating an empty dataframe and storing the above arrays as columns of the dataframe
n = pd.DataFrame()
n['feq']= n1
n['psd'] = n2
Storing into Csv
n.to_csv(r"C\:...\dataframe.csv",index= False)
If the arrays have unequal lengths, convert them to Series first and then store them in the empty dataframe.
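A sketch of the unequal-length case, using `pd.concat` on named Series so that the result does not depend on which array is added first. The arrays are made up, and the CSV is written to an in-memory buffer rather than a real file so the example is self-contained:

```python
import io

import numpy as np
import pandas as pd

freq = np.arange(5)        # 5 values
psd = np.arange(3) * 0.5   # 3 values

# Wrapping each array in a Series lets pandas align on the index and pad with NaN.
df = pd.concat(
    [pd.Series(freq, name="Frequency"), pd.Series(psd, name="PSD")],
    axis=1,
)

# Write to an in-memory buffer here; pass a file path instead to write to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False, na_rep="N/A")
print(buffer.getvalue())
```

The `na_rep` argument controls how the padded cells appear in the file, matching the `na_rep = 'N/A'` used in the question.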

statsmodels has trouble predicting on formulas using functions like log on rows of heterogeneous type

I have a pandas DataFrame whose rows contain data of multiple types. I want to fit a model based on this data using statsmodels.formula.api and then make some predictions. For my application I want to make predictions a single row at a time. If I do this naively I get AttributeError: 'numpy.float64' object has no attribute 'log' for the reason described in this answer. Here's some sample code:
import string
import random
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# Generate an example DataFrame
N = 100
z = np.random.normal(size=N)
u = np.random.normal(size=N)
w = np.exp(1 + u + 2*z)
x = np.exp(z)
y = np.log(w)
names = ["".join(random.sample(string.ascii_lowercase, 4)) for lv in range(N)]
df = pd.DataFrame({"x": x, "y": y, "names": names})
reg_spec = "y ~ np.log(x)"
fit = smf.ols(reg_spec, data=df).fit()
series = df.iloc[0] # In reality it would be `apply` extracting the rows one at a time
print(series.dtype) # gives `object` if `names` is in the DataFrame
print(fit.predict(series)) # AttributeError: 'numpy.float64' object has no attribute 'log'
The problem is that apply feeds me rows as Series, not DataFrames, and because I'm working with multiple types, the Series have type object. Sadly np.log doesn't like Series of objects even if all the objects are in fact floats. Swapping apply for transform doesn't help. I could create an intermediate DataFrame with only numeric columns or change my regression specification to y ~ np.log(x.astype('float64')). In the context of a larger program with a more complicated formula these are both pretty ugly. Is there a cleaner approach I'm missing?
Although you said you don't want to create an intermediate DataFrame with only numeric columns because it's pretty ugly, I think using select_dtypes to create a numbers-only subset of your Series on the fly is quite elegant and doesn't involve large code modifications:
series = df.select_dtypes(include='number').iloc[0]
Another solution that dawned on me as I was doing some other work is to convert the Series that apply gives me into a DataFrame consisting of a single row. This works:
row_df = pd.DataFrame([series])
print(fit.predict(row_df))
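A related sketch that shows the dtype behaviour without statsmodels. It assumes the original DataFrame is still accessible at the point of prediction, which may not hold inside `apply`; the two-row frame is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [0.5, 1.5], "names": ["abcd", "efgh"]})

# A single row taken as a Series must hold floats and strings together,
# so its dtype falls back to object.
series = df.iloc[0]
print(series.dtype)

# Indexing with a list returns a one-row DataFrame that keeps per-column dtypes.
row_df = df.iloc[[0]]
print(row_df.dtypes["x"])

# np.log works here because the column is still float64.
print(np.log(row_df["x"]))
```

A one-row DataFrame obtained this way can be passed to `fit.predict` in the same manner as `row_df` above.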

Map pandas dataframe column to a matrix

The following operation
import pandas as pd
import numpy as np
data = pd.read_csv(fname,sep=",",quotechar='"')
will create a 650,000 x 9 dataframe. The first column contains dates, and the following function is designed to turn a single date stamp into 5 separate features.
import time
def timepartition(elm):
    tm = time.strptime(elm,"%Y-%m-%d %H:%M:%S")
    return tm[0], tm[1], tm[2], tm[3], tm[4]
data["Dates"].map(timepartition)
What I would like is to assign those 5 values to a 650,000x7 np matrix.
xtrn = np.zeros(shape=(data.shape[0],7))
xtrn[:,0:4] = np.asarray(data["Dates"].map(timepartition))
#above returns error ValueError: could not broadcast input array from shape (650000) into shape (650000,4)
You might try using some of the builtin pandas features.
dates = pd.to_datetime(data['Dates'])
date_df = pd.DataFrame(dict(
    year=dates.dt.year,
    month=dates.dt.month,
    day=dates.dt.day,
    # etc.
))
xtrn[:, :5] = date_df.values # use date_df[['year', 'month', 'day', etc.]] if the order comes out wrong
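A complete sketch of this idea with all five components spelled out. The two date strings are made up, and `np.column_stack` is used here in place of an intermediate dataframe so the column order is explicit:

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the 650,000-row "Dates" column.
data = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2015-01-02 04:05:00"]})

dates = pd.to_datetime(data["Dates"], format="%Y-%m-%d %H:%M:%S")

# Stack the five datetime components as columns of an (n_rows, 5) array.
parts = np.column_stack([
    dates.dt.year,
    dates.dt.month,
    dates.dt.day,
    dates.dt.hour,
    dates.dt.minute,
])

# Fill the first five columns of the 7-column matrix; the rest stay zero.
xtrn = np.zeros(shape=(data.shape[0], 7))
xtrn[:, :5] = parts
print(xtrn[0])
```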
The map function applied to a dataframe is mapping to a new series object, and by returning tuples, it will come back as an object series.
Another approach is the following.
make the following change to timepartition:
def timepartition(elm):
    tm = time.strptime(elm,"%Y-%m-%d %H:%M:%S")
    return [tm[i] for i in range(5)]
This now returns a list instead of a tuple. The following code creates a matrix with the desired dimensions from the dataframe series and assigns it to xtrn.
xtrn[:,0:5] = np.matrix(list(map(timepartition, data["Dates"].tolist())))
np.matrix infers a matrix from the nested lists produced by applying the partitioning function to a list representation of the series, which is flat in this case. The list() call is needed on Python 3, where map returns an iterator.
The following worked for me. I'm not sure which method is faster, but it was easier for me to understand logically what's going on. Here my dataset "crimes" is your "data" and our time formats are a bit different.
def timepartition(elm):
    tm = time.strptime(elm,"%m/%d/%Y %H:%M:%S %p")
    return tm[0:5]
zeros = np.zeros(shape=(crimes.shape[0],3), dtype=int)
dates = np.array([timepartition(crimes["Date"][i]) for i in range(0,len(crimes))])
new = np.hstack((dates,zeros))

What is the correct way of passing parameters to stats.friedmanchisquare based on a DataFrame?

I am trying to pass values to stats.friedmanchisquare from a dataframe df, that has shape (11,17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to a matrix-form leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on @Ami Tavory and @vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator defined here, but better explained here, as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
in this manner you iterate over the dataframe in its native format.
Note that I ran some timeit tests to see which way is faster. As it turns out, converting the dataframe to a numpy array first is almost twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing a single list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list,
like this:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
a = np.array([[1, 2, 3], [2, 3, 4] ,[4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))
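To see why the list version fails, it may help to look at what a variadic function actually receives. The sketch below uses a hypothetical stand-in, `count_samples`, instead of `stats.friedmanchisquare` so that it runs without SciPy. It also uses `to_numpy()`, since `as_matrix()` was removed in pandas 1.0:

```python
import numpy as np
import pandas as pd

def count_samples(*samples):
    # Hypothetical stand-in for a variadic function such as
    # stats.friedmanchisquare: it reports how many positional
    # arguments it received.
    return len(samples)

df = pd.DataFrame(np.arange(12).reshape(4, 3))
rows = [row for _, row in df.iterrows()]

print(count_samples(rows))            # the whole list arrives as one argument
print(count_samples(*rows))           # each row arrives as its own argument
print(count_samples(*df.to_numpy()))  # unpacking a 2-D array yields its rows
```

Without the star the function sees a single sample, which is consistent with the "Less than 3 levels" error reported in the question.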
