I have an Excel sheet with 2 columns and 1000 rows.
I want to pass this as input to a linear regression fit using sklearn.
When I create a DataFrame using pandas, how do I pass the inputs?
Something like df_x = pd.DataFrame(...)
Without a DataFrame, this worked:
npMatrix=np.matrix(raw_data)
X,Y=npMatrix[:,1],npMatrix[:,2]
md1=LinearRegression().fit(X,Y)
Can you help me with how to access the rows in pandas?
I think you can convert a pandas DataFrame to a NumPy array with np.array().
This is discussed here: Quora: How does python-pandas go along with scikit-learn library?
The example, by Muktabh Mayank, is copied below:
>>> from pandas import *
>>> from numpy import *
>>> new_df = DataFrame(array([[1,2,3,4],[5,6,7,8],[9,8,10,11],[16,45,67,88]]))
>>> new_df.index= ["A1","A2","A3","A4"]
>>> new_df.columns= ["X1","X2","X3","X4"]
>>> new_df
X1 X2 X3 X4
A1 1 2 3 4
A2 5 6 7 8
A3 9 8 10 11
A4 16 45 67 88
>>> array(new_df)
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 8, 10, 11],
[16, 45, 67, 88]], dtype=int64)
>>>
And btw, people are actually working on bridging sklearn and pandas: sklearn-pandas
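You can read the Excel file:
df = pd.read_excel(...)
You can select a single column by its number (this works when the columns are labelled 0, 1, ..., for example after reading with header=None):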
X = df[0]
Y = df[1]
If the columns have names, e.g. "column1", "column2":
X = df["column1"]
Y = df["column2"]
But this gives a single column as a Series.
If you need a single column as a DataFrame, use a list of columns:
X = df[ [0] ]
Y = df[ [1] ]
More: How to get column by number in Pandas?
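Putting these pieces together, a minimal sketch might look like the following. It assumes scikit-learn is installed, and it builds a small in-memory DataFrame with the hypothetical column names "x" and "y" instead of calling pd.read_excel, since the actual file is not shown. The double brackets keep X two-dimensional, which is the shape scikit-learn estimators expect for the feature matrix.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for pd.read_excel(...): two columns with hypothetical names.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

X = df[["x"]]  # DataFrame, shape (4, 1): 2D feature matrix
y = df["y"]    # Series, shape (4,): 1D target

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

The sample data follow y = 2x + 1 exactly, so the fitted slope and intercept should come out very close to 2 and 1.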
So I want to multiply each row of a dataframe with a multiplier vector, and I am managing, but it looks ugly. Can this be improved?
import pandas as pd
import numpy as np
# original data
df_a = pd.DataFrame([[1,2,3],[4,5,6]])
print(df_a, '\n')
# multiplier vector
df_b = pd.DataFrame([2,2,1])
print(df_b, '\n')
# multiply by a list - it works
df_c = df_a*[2,2,1]
print(df_c, '\n')
# multiply by the dataframe - it works
df_c = df_a*df_b.T.to_numpy()
print(df_c, '\n')
"It looks ugly" is subjective, that said, if you want to multiply all rows of a dataframe with something else you either need:
- a DataFrame of a compatible shape (and with compatible indices, since pandas aligns them before operating, which is why df_a*df_b.T would only work for the common index: 0)
- a 1D vector, which in pandas is a Series
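The index alignment mentioned in the first point can be seen in a short sketch that reuses df_a and df_b from the question. Multiplying by df_b.T without converting it to NumPy keeps the row labels, so only the row labelled 0 finds a partner:

```python
import pandas as pd

df_a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df_b = pd.DataFrame([2, 2, 1])

# df_b.T has a single row labelled 0, so only row 0 of df_a is matched.
aligned = df_a * df_b.T
print(aligned)

# Row 1 has no matching label in df_b.T and is filled with NaN.
row0 = aligned.loc[0].tolist()
row1_missing = bool(aligned.loc[1].isna().all())
```

This is why the question's version needs .to_numpy(): it drops the labels, so NumPy broadcasting applies instead of alignment.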
Using a Series:
df_a*df_b[0]
output:
0 1 2
0 2 4 3
1 8 10 6
Of course, it is better to define a Series directly if you don't really need a 2D container:
s = pd.Series([2,2,1])
df_a*s
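If you prefer to spell out which axis the Series is matched against, DataFrame.mul takes an axis argument. The sketch below is only an illustration; row_factors is a made-up second vector to show the other direction.

```python
import pandas as pd

df_a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
s = pd.Series([2, 2, 1])

# Same as df_a * s: s is matched against the column labels 0, 1, 2.
by_columns = df_a.mul(s, axis="columns")

# With axis="index" the Series is matched against the row labels instead,
# so each row is scaled by a single number.
row_factors = pd.Series([10, 100])
by_rows = df_a.mul(row_factors, axis="index")

print(by_columns)
print(by_rows)
```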
Just for the beauty, you can use Einstein summation:
>>> np.einsum('ij,ji->ij', df_a, df_b)
array([[ 2, 4, 3],
[ 8, 10, 6]])
The title may come across as confusing (honestly, not quite sure how to summarize it in a sentence), so here is a much better explanation:
I'm currently handling a dataFrame A regarding different attributes, and I used a .groupby[].count() function on a data column age to create a list of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values obtained by element-wise division. Each element should be divided by the sum of all elements whose index is greater than or equal to its own. For example, for an age of 3, it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation should apply to all the elements. Ideally, the outcome should be in a similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np
ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = sum(ages[k:])
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The initialization of coefficients to 0 and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
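As a quick sanity check, the sketch below computes the coefficients both ways on the same five sample values and confirms that they agree. The names loop_coefficients and flip_coefficients are only used here for the comparison.

```python
import numpy as np

ages = np.asarray([316, 249, 221, 219, 262])

# Loop version: sum of the element and everything after it.
loop_coefficients = np.array([ages[k:].sum() for k in range(len(ages))])

# Vectorized version: cumulative sum of the reversed array, reversed back.
flip_coefficients = np.flip(np.cumsum(np.flip(ages)))

output = ages / flip_coefficients
print(output)
```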
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
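Here is the same idea on a tiny, made-up DataFrame so the numbers can be checked by hand. The frame A below is a hypothetical stand-in for the asker's data: three people aged 1, two aged 2 and one aged 3.

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame A.
A = pd.DataFrame({"age": [1, 1, 1, 2, 2, 3]})

# Counts per age, oldest first, so cumsum() adds up "this age and older".
A_sub = A["age"].value_counts().sort_index(ascending=False)

# Divide, then flip back to ascending age order.
ratio = (A_sub / A_sub.cumsum()).iloc[::-1]
print(ratio)
```

The result is still a Series indexed by age, so ratio.plot() should work directly if matplotlib is available.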
No reason to use numpy, pandas already includes everything we need.
A_sub seems to be a Series where age is the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but can easily be modified to work with DataFrames.
import numpy as np
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")
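Because the snippets above use random data, here is a deterministic sketch with hypothetical counts that can be verified by hand. Age 3 is missing from the index and gets a count of 0 after reindexing.

```python
import pandas as pd

# Hypothetical counts; age 3 is missing from the index.
s = pd.Series([2, 3, 5], index=[1, 2, 4], name="age")

# Fill the gaps between the smallest and largest age with zero counts.
s_all_idx = s.reindex(range(s.index.min(), s.index.max() + 1), fill_value=0)

# Reverse, accumulate, reverse back: sum of "this age and older".
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
print(res)
```

The largest age always has a non-zero count, so the denominator is never zero, and the filled-in ages simply get a ratio of 0.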
I have this DataFrame. The ArraysDate column contains arrays with several elements. I want to number the elements and loop over each array, as I would with a for loop over an array in Java. I have not found a solution; do you have any ideas?
For example, with CustomerNumber = 4, ArraysDate has 4 elements, which I want to refer to as i1, i2, i3, i4 and use in calculations.
Thank you
CustomerNumber ArraysDate
1 [ 1 13 ]
2 [ 3 ]
3 [ 0 ]
4 [ 2 60 30 40]
If I understand correctly, you want to get an array of data from 'ArraysDate' based on column 'CustomerNumber'.
Basically, you can use loc
import pandas as pd
data = {'c': [1, 2, 3, 4], 'date': [[1,2],[3],[0],[2,60,30,40]]}
df = pd.DataFrame(data)
df.loc[df['c']==4, 'date']
df.loc[df['c']==4, 'date'] = df.loc[df['c']==4, 'date'].apply(lambda i: sum(i))
Result:
[2, 60, 30, 40]
c date
0 1 [1, 2]
1 2 [3]
2 3 [0]
3 4 132
You can use the lambda to sum all items in the array per row.
Step 1: Create a dataframe
import pandas as pd
import numpy as np
d = {'ID': [[1,2,3],[1,2,43]]}
df = pd.DataFrame(data=d)
Step 2: Sum the items in the array
df['ID2']=df['ID'].apply(lambda x: sum(x))
df
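If the goal is to number the elements and loop over them rather than sum them, one possible sketch is shown below. It uses the column names from the question; the names values, long_df and position are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerNumber": [1, 2, 3, 4],
    "ArraysDate": [[1, 13], [3], [0], [2, 60, 30, 40]],
})

# Pull out the list for one customer and loop over it with a 1-based counter.
values = df.loc[df["CustomerNumber"] == 4, "ArraysDate"].iloc[0]
for position, value in enumerate(values, start=1):
    print(f"i{position} = {value}")

# Alternative: one row per element, with a running position per customer.
long_df = df.explode("ArraysDate").reset_index(drop=True)
long_df["position"] = long_df.groupby("CustomerNumber").cumcount() + 1
print(long_df)
```

The exploded form is often easier to work with afterwards, because ordinary column operations and groupby apply to it directly.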
I'm using Python pandas to organize some measurement values in a DataFrame.
One of the columns holds a value that I want to convert to a 2D vector. Say the column contains these values:
col1
25
12
14
21
I want to have the values of this column changed one by one (in a for loop):
for value in values:
    df['col1'][value] = convert2Vector(df['col1'][value])
So that the column col1 becomes:
col1
[-1. 21.]
[-1. -2.]
[-15. 54.]
[11. 2.]
The values are only examples and the function convert2Vector() converts the angle to a 2D-vector.
The for loop I wrote doesn't work; I get the error:
ValueError: setting an array element with a sequence.
Which I can understand.
So the question is: How to do it?
That exception comes from the fact that you want to insert a list or array in a column (array) that stores ints. And arrays in Pandas and NumPy can't have a "ragged shape" so you can't have 2 elements in one row and 1 element in all the others (except maybe with masking).
To make it work you need to store "general" objects. For example:
import pandas as pd
df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
df.col1[0] = [1, 2]
# ValueError: setting an array element with a sequence.
But this works:
>>> df.col1 = df.col1.astype(object)
>>> df.col1[0] = [1, 2]
>>> df
col1
0 [1, 2]
1 12
2 14
3 21
Note: I wouldn't recommend doing that because object columns are much slower than specifically typed columns. But since you're iterating over the Column with a for loop it seems you don't need the performance so you can also use an object array.
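As a side note, df.col1[0] = ... is chained indexing, which newer pandas versions warn about and which may not modify the original frame. A sketch of the same object-column idea using the .at accessor to set a single cell could look like this:

```python
import pandas as pd

df = pd.DataFrame({"col1": [25, 12, 14, 21]})

# An object column can hold arbitrary Python objects, including lists.
df["col1"] = df["col1"].astype(object)

# .at sets exactly one cell, addressed by row label and column label.
df.at[0, "col1"] = [1, 2]
print(df)
```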
What you should do if you want it to be fast is vectorize the convert2Vector function and assign the result to two columns:
import pandas as pd
import numpy as np
def convert2Vector(angle):
    """I don't know what your function does so this is just something that
    calculates the sin and cos of the input..."""
    ret = np.zeros((angle.size, 2), dtype=float)
    ret[:, 0] = np.sin(angle)
    ret[:, 1] = np.cos(angle)
    return ret
>>> df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
>>> df['col2'] = [0]*len(df)
>>> df[['col1', 'col2']] = convert2Vector(df.col1)
>>> df
col1 col2
0 -0.132352 0.991203
1 -0.536573 0.843854
2 0.990607 0.136737
3 0.836656 -0.547729
You should call a higher-order function like df.apply or df.transform, which creates a new column that you then assign back:
In [1022]: df.col1.apply(lambda x: [x, x // 2])
Out[1022]:
0 [25, 12]
1 [12, 6]
2 [14, 7]
3 [21, 10]
Name: col1, dtype: object
In your case, you would do:
df['col1'] = df.col1.apply(convert2Vector)
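For a complete, runnable picture, here is a sketch with a hypothetical convert2Vector. The real function is not shown in the question, so this stand-in simply returns the unit vector for an angle given in degrees.

```python
import numpy as np
import pandas as pd

def convert2Vector(angle_deg):
    # Hypothetical stand-in: unit vector for an angle given in degrees.
    rad = np.deg2rad(angle_deg)
    return np.array([np.cos(rad), np.sin(rad)])

df = pd.DataFrame({"col1": [0, 90, 180]})

# apply() calls the function once per value and collects the results
# in a new object-dtype Series, which then replaces the old column.
df["col1"] = df["col1"].apply(convert2Vector)

first, second = df["col1"].iloc[0], df["col1"].iloc[1]
print(df)
```

Because the whole column is replaced in one assignment, pandas never has to fit a sequence into a single integer cell, which is what triggered the ValueError in the loop.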
I want to get a 2d-numpy array from a column of a pandas dataframe df having a numpy vector in each row. But if I do
df.values.shape
I get: (3,) instead of getting: (3,5)
(assuming that each numpy vector in the dataframe has 5 dimensions, and that the dataframe has 3 rows)
What is the correct method?
Ideally, avoid getting into this situation by finding a different way to define the DataFrame in the first place. However, if your DataFrame looks like this:
s = pd.Series([np.random.randint(20, size=(5,)) for i in range(3)])
df = pd.DataFrame(s, columns=['foo'])
# foo
# 0 [4, 14, 9, 16, 5]
# 1 [16, 16, 5, 4, 19]
# 2 [7, 10, 15, 13, 2]
then you could convert it to a DataFrame of shape (3,5) by calling pd.DataFrame on a list of arrays:
pd.DataFrame(df['foo'].tolist())
# 0 1 2 3 4
# 0 4 14 9 16 5
# 1 16 16 5 4 19
# 2 7 10 15 13 2
pd.DataFrame(df['foo'].tolist()).values.shape
# (3, 5)
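If only the 2D NumPy array is needed and not an intermediate DataFrame, stacking the per-row vectors directly is another option. This is a sketch under the assumption that every row holds a vector of the same length; the sample values are made up so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Three rows, each holding a length-5 vector (deterministic sample data).
s = pd.Series([np.arange(5) + 10 * i for i in range(3)])
df = pd.DataFrame({"foo": s})

# Stack the per-row vectors into one (n_rows, vector_length) array.
arr = np.vstack(df["foo"].tolist())
print(arr.shape)
```

If the vectors have different lengths, np.vstack raises an error, which is usually preferable to silently getting an object array back.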
I am not sure what you want. But df.values.shape seems to be giving the correct result.
import pandas as pd
import numpy as np
from pandas import DataFrame
df3 = DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df3)
# a b c d e
#0 -0.221059 1.206064 -1.359214 0.674061 0.547711
#1 0.246188 0.628944 0.528552 0.179939 -0.019213
#2 0.080049 0.579549 1.790376 -1.301700 1.372702
df3.values.shape
# (3, 5)
df3["a"]
#0 -0.221059
#1 0.246188
#2 0.080049
df3[:1]
# a b c d e
#0 -0.221059 1.206064 -1.359214 0.674061 0.547711