ValueError while trying to convert pandas dataframe into dask dataframe - python

I am trying to convert a pandas dataframe into a dask dataframe. Here is what my dataframe looks like; it only consists of file names and vectors:
file_names \
0 C:\Users\pilot_project\pilot_2/...
1 C:\Users\pilot_project\pilot_2/...
2 C:\Users\pilot_project\pilot_2/...
3 C:\Users\pilot_project\pilot_2/...
4 C:\Users\Yilmaz\Desktop\pilot_project\pilot_2/...
vectors
0 [0.011174, 0.011548, 0.011642, 0.000159, 2.3e-...
1 [0.003017, 0.003247, 0.003309, 9e-06, 6e-06, 8...
2 [0.008307, 0.008461, 0.008461, 0.0, 0.0, 2.8e-...
3 [0.007146, 0.007241, 0.007261, 0.000392, 2.4e-...
4 [0.007226, 0.007281, 0.007336, 9.9e-05, 1.9e-0...
Here is the simple code:
import numpy as np
import dask.dataframe as dd
import pandas as pd

df1 = pd.read_pickle('output.p')
df1['vectors'] = df1['vectors'].apply(lambda x: np.array(x))  # This line didn't solve my problem
df = dd.from_pandas(df1, npartitions=8)
I get:
ValueError: setting an array element with a sequence.
Do you have any ideas? Thank you very much in advance.
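For reference: this ValueError usually means numpy tried to coerce a column of sequences into a single rectangular array. A minimal sketch of one possible workaround, assuming every vector has the same length, is to expand the object column into plain numeric columns before calling dd.from_pandas (the v0, v1, ... column names below are invented for illustration):
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.read_pickle('output.p')
# stack the equal-length vectors into a 2-D array, one row per file
vec = pd.DataFrame(np.vstack(df1['vectors'].values), index=df1.index)
vec.columns = ['v%d' % i for i in vec.columns]  # v0, v1, ...
flat = pd.concat([df1[['file_names']], vec], axis=1)
ddf = dd.from_pandas(flat, npartitions=8)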

Related

Compute number of floats in an int range - Python

I have the following dataframe containing floats as input and would like to compute how many values fall in the ranges 0-90 and 90-180. The output dataframe was obtained using the FREQUENCY() function in Excel.
[Input dataframe]
[Desired output]
I'd like to do the same thing with Python but haven't found a solution. Do you have any suggestions?
I can also provide source files if needed.
Here's one way: divide the columns by 90, then use groupby and count:
import numpy as np
import pandas as pd

data = [
    [87.084, 5.293],
    [55.695, 0.985],
    [157.504, 2.995],
    [97.701, 179.593],
    [97.67, 170.386],
    [118.713, 177.53],
    [99.972, 176.665],
    [124.849, 1.633],
    [72.787, 179.459],
]
df = pd.DataFrame(data, columns=['Var1', 'Var2'])
df = (df / 90).astype(int)
df1 = pd.DataFrame([["0-90"], ["90-180"]])
df1['Var1'] = df.groupby('Var1').count()
df1['Var2'] = df.groupby('Var2').count()
print(df1)
Output:
0 Var1 Var2
0 0-90 3 4
1 90-180 6 5
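For reference, the same counts can also be produced with pd.cut on the original (undivided) values; a small sketch (pd.cut's bins are right-inclusive, which makes no difference for this data):
df = pd.DataFrame(data, columns=['Var1', 'Var2'])  # the original floats
bins = [0, 90, 180]
labels = ['0-90', '90-180']
counts = pd.DataFrame({col: pd.cut(df[col], bins=bins, labels=labels)
                                .value_counts(sort=False)
                       for col in df.columns})
print(counts)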

create multiple rows if a column has more than one value in a dataframe

I have a df like the one below:
import pandas as pd

# initialise data of lists
data = {'cust': ['fnwp', 'utp'], 'events': [['abhi', 'ashu'], 'abhi']}
# create DataFrame
df = pd.DataFrame(data)
# print the output
df
My expected outcome is:
You can use the DataFrame.explode() function:
>>> df.explode('events').reset_index(drop=True)
cust events
0 fnwp abhi
1 fnwp ashu
2 utp abhi
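Note that explode() requires pandas 0.25 or newer. On an older version, a rough manual equivalent (a sketch, wrapping the bare scalar 'abhi' in a list so the row lengths line up):
import numpy as np

# wrap scalars in lists so every row holds a list of events
events = df['events'].apply(lambda x: x if isinstance(x, list) else [x])
out = pd.DataFrame({
    'cust': np.repeat(df['cust'].values, events.str.len()),
    'events': np.concatenate(events.tolist()),
})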

Pandas read_csv: Columns are being imported as rows

Edit: I believe this was all user error. I have been typing df.T by default, and it just occurred to me that this is the TRANSPOSE output. Typing df outputs the data frame normally (headers as columns). Thank you to those who stepped up to try and help. In the end, it was just my misunderstanding of pandas syntax.
Original Post
I'm not sure if I am making a simple mistake, but the columns of a .csv file are being imported as rows by pd.read_csv. The resulting dataframe turns out to be 5 rows by 2000 columns. I am importing only 5 columns out of 14, so I set up a list to hold the names of the columns I want; they match those in the .csv file exactly. What am I doing wrong here?
import os
import numpy as np
import pandas as pd

fp = 'C:/Users/my/file/path'
os.chdir(fp)

cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']

df = pd.read_csv('measurement_file.csv',
                 usecols=cols_to_use,
                 dtype={'EQUIPMENT_NUMBER': np.int64,
                        'AXLE': np.int64},
                 parse_dates=[2],
                 infer_datetime_format=True)
Output:
0 ... 2603
VCOMPNO_CURRENT T92656 ... T5M247
MEASUREMENT_DATETIME 7/26/2018 13:04 ... 9/21/2019 3:21
EQUIPMENT_NUMBER 208 ... 537
AXLE 1 ... 6
POSITION L ... R
[5 rows x 2000 columns]
Thank you.
Edit: To note, if I import the entire .csv with the standard pd.read_csv('measurement_file.csv'), the columns are imported properly.
Edit 2: Sample csv:
VCOMPNO_CURRENT,MEASUREMENT_DATETIME,REPAIR_ORDER_NUMBER,EQUIPMENT_NUMBER,AXLE,POSITION,FLANGE_THICKNESS,FLANGE_HEIGHT,FLANGE_SLOPE,DIAMETER,RO_NUMBER_SRC,CL,VCOMPNO_AT_MEAS,VCOMPNO_SRC
T92656,10/19/2018 7:11,5653054,208,1,L,26.59,27.34,6.52,691.3,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,1,R,26.78,27.25,6.64,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,L,26.6,27.13,6.49,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,R,26.61,27.45,6.75,691.6,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T7L672,10/19/2018 7:11,5653054,208,3,L,26.58,27.14,6.58,644.4,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
T7L672,10/19/2018 7:11,5653054,208,3,R,26.21,27.44,6.17,644.5,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
A simple workaround here is just to take the transpose of the dataframe.
Link to Pandas Documentation
df = df.transpose()  # equivalently: df = df.T
Can you try it like this?
import pandas as pd

dataset = pd.read_csv('yourfile.csv')
# filter to the wanted columns here (cols_to_use from the question)
dataset = dataset[cols_to_use]

Problems Sorting Data out of a text-file

I have a csv file imported into a dataframe and am having trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
Here's the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np

df = pd.read_csv('powercurve.csv', encoding='utf-8', skiprows=42)
df.columns = ['Data']
no_of_rows = df.Data.str.count("WindSpeed").sum() / 2
rows = no_of_rows.astype(np.uint32)
TRBX = pd.DataFrame(index=range(0, abs(rows)),
                    columns=['WSpd[m/s]', 'Power[kW]'], dtype='float')
i = 0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
Is this a suitable approach? So far, no values are written into the TRBX dataframe. Where is my mistake?
The code below should help you if your df is indeed in the format shown above:
import re
import pandas as pd

split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or fill in some N/A values afterwards. You may also want to rename the variables and series to names that make more sense to you.
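As a quick illustration (a sketch on two of the rows from the question), re.split('<|>', ...) turns each row into ['', tag, value, closing tag, ''], so position 1 is the feature name and position 2 its value:
import re
import pandas as pd

df = pd.DataFrame({'Data': ['<WindSpeed>0.69</WindSpeed>',
                            '<PowerOutput>0</PowerOutput>']})
split_series = df.Data.apply(lambda x: re.split('<|>', str(x)))
# e.g. ['', 'WindSpeed', '0.69', '/WindSpeed', '']
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
print(pd.DataFrame(data).set_index(features).T)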

What is the Pythonic way to apply a function to a multi-index, multi-column dataframe?

Given the multi-index, multi-column dataframe below, I want to apply LinearRegression to each block of the dataframe (for example, the block at index (X, 1), column A) and compute the predicted dataframe as df_result.
A B
X 1 1997-01-31 -0.061332 0.630682
1997-02-28 -2.671818 0.377036
1997-03-31 0.861159 0.303689
...
1998-01-31 0.535192 -0.076420
...
1998-12-31 1.430995 -0.763758
Y 1 1997-01-31 -0.061332 0.630682
1997-02-28 -2.671818 0.377036
1997-03-31 0.861159 0.303689
...
1998-01-31 0.535192 -0.076420
...
1998-12-31 1.430995 -0.763758
Here is what I tried:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

N = 24
dates = pd.date_range('19970101', periods=N, freq='M')
df = pd.DataFrame(np.random.randn(len(dates), 2), index=dates, columns=list('AB'))
df2 = pd.concat([df, df], keys=[('X', '1'), ('Y', '1')])
regr = LinearRegression()
# df_result will be reassigned; copy the index and metadata from df2
df_result = df2.copy()
# I know the double loop below is not a clever idea. What is the right way?
for row in df2.index.to_series().unique():
    for col in df2.columns:
        # df2 can contain missing values
        lenX = np.count_nonzero(df2.loc[row[:1], col].notnull().values.ravel())
        X = np.array(range(lenX)).reshape(lenX, 1)
        y = df2.loc[row[:1], col]
        y = y[y.notnull()]
        # train the model
        regr.fit(X, y)
        df_result.loc[row[:1], col][:lenX] = regr.predict(X)
The problem is that the double loop above makes the computation quite slow: more than ten minutes for a 100 kB data set. What is the pythonic way to do this?
EDIT:
A second issue, concerning the last line of the code above, is that I am working with a copy of a slice of the dataframe, so some columns of df_result are not updated by this operation.
EDIT2:
Some columns of the original data can contain missing values, and we cannot apply the regression to them directly. For example,
df2.loc[('X', '1', '1997-12-31'), 'A'] = np.nan
df2.loc[('Y', '1', '1998-12-31'), 'A'] = np.nan
I don't quite understand the row looping.
Anyhow, to maintain consistency in the numbers, I put np.random.seed(1) at the top.
In short, I think you can achieve what you want with a function, groupby, and a call to .transform().
def do_regression(y):
    X = np.array(range(len(y))).reshape(len(y), 1)
    regr.fit(X, y)
    return regr.predict(X)

df_regressed = df2.groupby(level=[0, 1]).transform(do_regression)
print(df_regressed.head())
A B
X 1 1997-01-31 0.779476 -1.222119
1997-02-28 0.727184 -1.138630
1997-03-31 0.674892 -1.055142
1997-04-30 0.622601 -0.971653
1997-05-31 0.570309 -0.888164
which matches your df_result output.
print(df_result.head())
A B
X 1 1997-01-31 0.779476 -1.222119
1997-02-28 0.727184 -1.138630
1997-03-31 0.674892 -1.055142
1997-04-30 0.622601 -0.971653
1997-05-31 0.570309 -0.888164
oh and a couple of alternatives for:
X=np.array(range(len(y))).reshape(len(y),1)
1.) X = np.expand_dims(range(len(y)), axis=1)
2.) X = np.arange(len(y))[:,np.newaxis]
Edit for empty data
OK, two suggestions:
Would it be legitimate to use the interpolate method to fill the null values?
df2 = df2.interpolate()
OR
do the regression on the non-null values and then pop the nulls back in at the appropriate index positions:
def do_regression(y):
    x_s = np.arange(len(y))
    x_s_non_nulls = x_s[y.notnull().values]
    x_s_non_nulls = np.expand_dims(x_s_non_nulls, axis=1)
    y_non_nulls = y[y.notnull()]  # get the non-nulls
    regr.fit(x_s_non_nulls, y_non_nulls)  # regression
    results = regr.predict(x_s_non_nulls)
    # pop the nulls back in at their original index positions
    for idx in np.where(y.isnull().values)[0]:
        results = np.insert(results, idx, np.nan)
    return results
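This version plugs into the same groupby/transform call as before:
df_regressed = df2.groupby(level=[0, 1]).transform(do_regression)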
