Compute number of floats in an int range - Python

I have the following dataframe containing floats as input and would like to compute how many values fall in the ranges 0–90 and 90–180. The output dataframe was obtained using the FREQUENCY() function in Excel.
[Input dataframe]
[Desired output]
I'd like to do the same thing with Python but haven't found a solution. Do you have any suggestions?
I can also provide source files if needed.

Here's one way, by dividing the columns by 90, then using groupby and count:
import pandas as pd

data = [
    [87.084, 5.293],
    [55.695, 0.985],
    [157.504, 2.995],
    [97.701, 179.593],
    [97.67, 170.386],
    [118.713, 177.53],
    [99.972, 176.665],
    [124.849, 1.633],
    [72.787, 179.459],
]
df = pd.DataFrame(data, columns=['Var1', 'Var2'])

# integer division by 90 maps values in 0-90 to 0 and 90-180 to 1
df = (df / 90).astype(int)

# count rows per bin for each column; the counts align on the 0/1 index
df1 = pd.DataFrame([["0-90"], ["90-180"]])
df1['Var1'] = df.groupby('Var1').count()
df1['Var2'] = df.groupby('Var2').count()
print(df1)
Output:
        0  Var1  Var2
0    0-90     3     4
1  90-180     6     5
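As a hedged alternative sketch (not from the original answer): pd.cut with explicit bin edges and labels gets the same counts without the integer-division trick. The column names and bin edges below are taken from the question; the sample values are a subset of the data above.

import pandas as pd

df = pd.DataFrame({'Var1': [87.084, 157.504, 97.701, 72.787],
                   'Var2': [5.293, 179.593, 1.633, 179.459]})

bins = [0, 90, 180]          # edges for the (0, 90] and (90, 180] bins
labels = ['0-90', '90-180']

# bin each column, then count how many values landed in each bin
counts = df.apply(lambda col: pd.cut(col, bins=bins,
                                     labels=labels).value_counts().sort_index())
print(counts)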

Related

How to implement a custom Python function on a dictionary of dataframes

I have a dictionary that contains 3 dataframes.
How do I apply a custom function to each dataframe in the dictionary?
In simpler terms, I want to apply the function find_outliers, as seen below,
# User defined function : find_outliers
# (I)
import numpy as np
import pandas as pd
from scipy import stats

outlier_threshold = 1.5
ddof = 0

def find_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with the corresponding styling strings
    return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd

df = {
    'col_A': ['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002', 'B_1002', 'B_1002', 'B_1002', 'D_1003', 'D_1003', 'D_1003', 'D_1003'],
    'col_X': [110.21, 191.12, 190.21, 12.00, 245.09, 4321.8, 122.99, 122.88, 134.28, 148.14, 161.17, 132.17],
    'col_Y': [100.22, 199.10, 191.13, 199.99, 255.19, 131.22, 144.27, 192.21, 7005.15, 12.02, 185.42, 198.00],
    'col_Z': [140.29, 291.07, 390.22, 245.09, 4122.62, 4004.52, 395.17, 149.19, 288.91, 123.93, 913.17, 1434.85],
}
df = pd.DataFrame(df)
df
# dictionary_of_dataframes
# (II)
dict_of_dfs = dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to have find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
In summary, I want to apply (I) to (II), and finally flag the outliers as in (III).
Thanks for your help. :)
Desired output should look like this, but for the three dataframes
This may not be exactly what you want, but it's how I'd approach it. You'll have to work out the details of the function, because you have it written to receive a series rather than a dataframe. groupby().apply() will send the subsets of rows, and then you can perform the actions on that subset and return the result.
For consideration:
Inside the function you may be able to handle all columns like so:
def find_outliers(x):
    for col in ['col_X', 'col_Y', 'col_Z']:
        outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
        # mark flagged cells with the string 'outlier', blank otherwise
        x[col] = ['outlier' if val else '' for val in outlier_mask]
    return x

newdf = df.groupby('col_A').apply(find_outliers)
     col_A    col_X    col_Y  col_Z
0   A_1001           outlier
1   A_1001
2   A_1001
3   A_1001  outlier
4   B_1002           outlier
5   B_1002  outlier
6   B_1002
7   B_1002
8   D_1003           outlier
9   D_1003
10  D_1003
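If the goal is the colour-highlighted Styler output from (III), a minimal sketch, assuming a Jupyter notebook to render the styles, is to build one Styler per dataframe with the original series-based function:

# hedged sketch: find_outliers, desired_cols and dict_of_dfs as defined above
styled = {key: frame.style.apply(find_outliers, subset=desired_cols)
          for key, frame in dict_of_dfs.items()}
# in a notebook, display e.g. styled['A_1001'] to see the highlighting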

How to convert byte values in a dataframe to regular integers in Python

Here is how I imported the dataset into Python after downloading it from the UCI Machine Learning repository.
Here is the link to the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00327/
from scipy.io import arff
import pandas as pd

data = arff.loadarff('Training Dataset.arff')
data = pd.DataFrame(data[0])
The values in my dataset look like this: b'-1', b'1' and b'0'.
How do I change these values to regular integers like -1, 1 and 0?
Update
After applying the code below, my output is not in a dataframe format, which is not what I want.
I would like the output to be a dataframe.
for col in df:
    df = df[col].astype(str).str.decode("utf-8")
-1
1
-1
0
-1
1
Here is how:
for col in df:
    # assign back per column so the result stays a DataFrame
    df[col] = df[col].str.decode("utf-8").astype(int)
For the updated question:
lst = [b'-1', b'1', b'0']
lst = [int(s.decode()) for s in lst]
print(lst)
Output:
[-1, 1, 0]
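If you want the whole dataframe decoded in one pass while keeping its shape, a minimal sketch, assuming every column holds byte strings, could use applymap; the small stand-in frame below replaces the real arff.loadarff output:

import pandas as pd

df = pd.DataFrame({'a': [b'-1', b'1'], 'b': [b'0', b'1']})

# decode each byte cell and cast it to int; the result is still a DataFrame
df = df.applymap(lambda v: int(v.decode('utf-8')))
print(df.dtypes)  # int64 columns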

Problems Sorting Data out of a text-file

I have a csv file imported into a dataframe and have trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
HereĀ“s the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np

df = pd.read_csv('powercurve.csv', encoding='utf-8', skiprows=42)
df.columns = ['Data']
no_of_rows = df.Data.str.count("WindSpeed").sum()/2
rows = no_of_rows.astype(np.uint32)
TRBX = pd.DataFrame(index=range(0, abs(rows)),
                    columns=['WSpd[m/s]', 'Power[kW]'], dtype='float')
i = 0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
Is this a suitable approach? If yes, so far there are no values written into the TRBX dataframe. Where is my mistake?
The code below should help you if your df is indeed in the format shown:
import re
import pandas as pd

# split each row like '<WindSpeed>0.69</WindSpeed>' into
# ['', 'WindSpeed', '0.69', '/WindSpeed', '']
split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
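As an alternative sketch, the tag/value pairs could also be pulled out with a single regex via str.extract; the 'Data' column name and the three sample rows below are assumed from the question:

import pandas as pd

df = pd.DataFrame({'Data': ['<WindSpeed>0.69</WindSpeed>',
                            '<PowerOutput>0</PowerOutput>',
                            '</DataPoint>']})

# capture the tag name and the value between the opening and closing tags;
# rows without a full tag (like '</DataPoint>') come back as NaN and are dropped
extracted = df['Data'].str.extract(r'<(?P<tag>\w+)>(?P<value>[^<]*)</')
wide = extracted.dropna().set_index('tag').T
print(wide)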

ValueError while trying to convert pandas dataframe into dask dataframe

I am trying to convert a pandas dataframe into a dask dataframe. Here is what my dataframe looks like; it only consists of file names and vectors:
file_names \
0 C:\Users\pilot_project\pilot_2/...
1 C:\Users\pilot_project\pilot_2/...
2 C:\Users\pilot_project\pilot_2/...
3 C:\Users\pilot_project\pilot_2/...
4 C:\Users\Yilmaz\Desktop\pilot_project\pilot_2/...
vectors
0 [0.011174, 0.011548, 0.011642, 0.000159, 2.3e-...
1 [0.003017, 0.003247, 0.003309, 9e-06, 6e-06, 8...
2 [0.008307, 0.008461, 0.008461, 0.0, 0.0, 2.8e-...
3 [0.007146, 0.007241, 0.007261, 0.000392, 2.4e-...
4 [0.007226, 0.007281, 0.007336, 9.9e-05, 1.9e-0...
Here is the simple code
import dask.dataframe as dd
import numpy as np
import pandas as pd

df1 = pd.read_pickle('output.p')
df1['vectors'] = df1['vectors'].apply(lambda x: np.array(x))  # This line didn't solve my problem
df = dd.from_pandas(df1, npartitions=8)
I get:
ValueError: setting an array element with a sequence.
Do you have any ideas? Thank you very much in advance.

Pandas assign each row the mean of its bin

I have the following dataframe (p1.head(7)):
ColA
0 6.286333
1 3.317000
2 13.24889
3 26.20667
4 26.25556
5 60.59000
6 79.59000
7 1.361111
I can get the bin ranges using:
pandas.qcut(p1.ColA, 4)
Is there a way I can create a new column where each value corresponds to the mean value of its bin? I.e. for each bin (a, b], I want (a+b)/2.
The key here is the retbins option on qcut.
import numpy
import pandas

df = pandas.DataFrame(numpy.random.random(100)*100, columns=['val1'])
pctiles = pandas.qcut(df['val1'], 4, retbins=True)
pctile_object = pctiles[0]
pctile_boundaries = pctiles[1]
Here pctile_object is just what qcut would return if you hadn't passed retbins=True, and pctile_boundaries is a numpy array of the interval boundaries.
bin_halfway = pctile_boundaries[:-1] + (numpy.diff(pctile_boundaries)/2)
This gives us the halfway points of the bins.
Now we make a dataframe with just the bin intervals and their halfway points.
df2 = pandas.DataFrame({'quartile boundaries': pctile_object.cat.categories,
                        'midway point': bin_halfway})
Finally, merge the bin halfway points back into the original dataframe.
df['quartile boundaries'] = pctile_object
pandas.merge(df,df2,on='quartile boundaries')
Then you can drop quartile boundaries if you want.
I wrote a function to utilize @exp1orer's logic:
import numpy as np
import pandas as pd

def midway_quantiles(feature_series, q=4):
    pctiles = pd.qcut(feature_series, q, retbins=True)
    pctile_object = pctiles[0]
    df1 = pd.DataFrame({"feature": feature_series, "q_bound": pctile_object})
    pctile_boundaries = pctiles[1]
    bin_halfway = pctile_boundaries[:-1] + (np.diff(pctile_boundaries)/2)
    df2 = pd.DataFrame({"q_bound": pctile_object.cat.categories,
                        "midpoint": bin_halfway})
    df3 = pd.merge(df1, df2, on="q_bound", how="left")
    return df3["midpoint"]
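For what it's worth, on recent pandas versions the same result falls out of Interval.mid directly, since qcut returns Interval objects whose .mid is exactly (a+b)/2; a minimal hedged sketch:

import numpy as np
import pandas as pd

p1 = pd.DataFrame({'ColA': np.random.random(100) * 100})

# each row's bin is an Interval; .mid is the midpoint of its edges
bins = pd.qcut(p1['ColA'], 4)
p1['bin_mean'] = bins.apply(lambda interval: interval.mid).astype(float)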
