Python pandas DataFrame sort_values does not work

I have the following pandas data frame which I want to sort by 'test_type'
  test_type         tps          mtt        mem        cpu       90th
0  sso_1000  205.263559  4139.031090  24.175933  34.817701  4897.4766
1  sso_1500  201.127133  5740.741266  24.599400  34.634209  6864.9820
2  sso_2000  203.204082  6610.437558  24.466267  34.831947  8005.9054
3   sso_500  189.566836  2431.867002  23.559557  35.787484  2869.7670
My code to load and sort the dataframe is below; the first print statement produces the dataframe shown above.
df = pd.read_csv(file)  # reads from a csv file
print(df)
df = df.sort_values(by=['test_type'], ascending=True)
print('\nAfter sort...')
print(df)
After doing the sort and printing the dataframe's contents, it still looks like this:
Program output:
After sort...
  test_type         tps          mtt        mem        cpu       90th
0  sso_1000  205.263559  4139.031090  24.175933  34.817701  4897.4766
1  sso_1500  201.127133  5740.741266  24.599400  34.634209  6864.9820
2  sso_2000  203.204082  6610.437558  24.466267  34.831947  8005.9054
3   sso_500  189.566836  2431.867002  23.559557  35.787484  2869.7670
I expect row 3 (test_type: sso_500) to be at the top after sorting. Can someone help me figure out why it's not working as it should?

The sort actually is working: 'test_type' is a string column, so it sorts lexicographically, which puts 'sso_500' after 'sso_2000' because the character '5' sorts after '2'. Presumably, what you're trying to do is sort by the numerical value after sso_. You can do this as follows:
import numpy as np
df.iloc[np.argsort(df.test_type.str.split('_').str[-1].astype(int).values)]
This:
splits the strings at _
converts what follows that character to its numerical value
finds the indices sorted according to those numerical values
reorders the DataFrame according to these indices
Example
In [15]: df = pd.DataFrame({'test_type': ['sso_1000', 'sso_500']})
In [16]: df.sort_values(by=['test_type'], ascending=True)
Out[16]:
test_type
0 sso_1000
1 sso_500
In [17]: df.iloc[np.argsort(df.test_type.str.split('_').str[-1].astype(int).values)]
Out[17]:
test_type
1 sso_500
0 sso_1000

Alternatively, you could extract the numbers from test_type, sort them, and then reindex the DataFrame according to those sorted indices.
df.reindex(df['test_type'].str.extract(r'(\d+)', expand=False)
           .astype(int).sort_values().index).reset_index(drop=True)
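On pandas 1.1 and newer, sort_values also accepts a key callable, which turns this into a one-liner. A minimal sketch, assuming the same sso_<number> format:
import pandas as pd

df = pd.DataFrame({'test_type': ['sso_1000', 'sso_1500', 'sso_2000', 'sso_500']})
# pandas >= 1.1: sort by the numeric suffix via the key callable
df_sorted = df.sort_values(
    by='test_type',
    key=lambda s: s.str.split('_').str[-1].astype(int),
)
print(df_sorted)  # sso_500 now comes first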

Related

Filter csv table to have just 2 columns. Python pandas

I got a .csv file with lines like this:
result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
I need to make them look like this:
time value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
I will need that for my anomaly detection code so I don't have to manually delete columns and so on, at least not all of them. I can't do it with the program that works with the machine that collects the wattage info.
I tried this, but it doesn't work well enough:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index = '_time', columns = '_field', values = '_value')
df.interpolate(method='linear') # not necessary
It gives this output:
0
9 83.908
10 80.342
11 79.178
12 75.621
13 72.826
... ...
73522 10.726
73523 5.241
Here is the canonical way to project down to a subset of columns in the pandas ecosystem.
df = df[['_time', '_value']]
You can simply use the usecols keyword argument of pandas.read_csv:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])
NB: If you need to read the entire data of your .csv and only then select a subset of columns, the pandas core developers suggest using pandas.DataFrame.loc. Otherwise, with the df = df[subset_of_cols] syntax, the moment you start doing operations on the (new?) sub-dataframe you'll get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So, in your case you can use:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]] #instead of df[["_time", "_value"]]
Another option is pandas.DataFrame.copy,
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()
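To make the difference concrete, here is a small sketch (with made-up column values) of the subset pattern that can trigger the warning versus the .copy() variant that avoids it:
import pandas as pd

df = pd.DataFrame({'_time': ['t0', 't1'], '_value': [1.0, 2.0], 'extra': [0, 0]})

sub = df[['_time', '_value']]
sub['_value'] = sub['_value'] * 2    # may raise SettingWithCopyWarning

sub2 = df[['_time', '_value']].copy()
sub2['_value'] = sub2['_value'] * 2  # independent copy, no warning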
.read_csv has a usecols parameter to specify which columns you want in the DataFrame.
df = pd.read_csv(f, header=0, usecols=['_time', '_value'])
print(df)
_time _value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
5 2022-10-24T12:12:57Z 55.68
6 2022-10-24T12:13:02Z 25.92
7 2022-10-24T12:13:08Z 5.71
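If you also want the headers to match the desired output (time, value), you can rename the columns after selecting them. A small sketch along those lines:
import pandas as pd

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv',
                 usecols=['_time', '_value'])
df = df.rename(columns={'_time': 'time', '_value': 'value'})
print(df.head())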

Column wise concatenation for each set of values

I am trying to concatenate rows column-wise for every set of 4 rows/values.
I have 11 values: the first 4 values should go into one concatenated row, rows 5 to 8 into another, and the last 3 rows into a final one, even though that last group has fewer than four values.
df_in = pd.DataFrame({'Column_IN': ['text 1','text 2','text 3','text 4','text 5','text 6','text 7','text 8','text 9','text 10','text 11']})
and my expected output is as follows
df_out = pd.DataFrame({'Column_OUT': ['text 1&text 2&text 3&text 4','text 5&text 6&text 7&text 8','text 9&text 10&text 11']})
I have tried to get my desired output df_out as below.
df_2 = df_in.iloc[:-7].agg('&'.join).to_frame()
What modification is required to get the desired output?
Try using groupby and agg:
>>> df_in.groupby(df_in.index // 4).agg('&'.join)
Column_IN
0 text 1&text 2&text 3&text 4
1 text 5&text 6&text 7&text 8
2 text 9&text 10&text 11
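Here df_in.index // 4 labels rows 0-3 as group 0, rows 4-7 as group 1 and rows 8-10 as group 2, so the shorter last group is handled automatically. If you also want the result shaped exactly like df_out, a sketch assuming the default RangeIndex:
out = (df_in.groupby(df_in.index // 4)['Column_IN']
            .agg('&'.join)
            .rename('Column_OUT')
            .reset_index(drop=True)
            .to_frame())
print(out)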

How to get the whole row from data frame having maximum value for every 7 records in the pandas data frame?

For ex. consider the data frame given below:
Timestamp in_speed
1625638530 268.78
1625638590 262.75
1625638650 265.43
1625638710 270.67
1625638770 261.13
1625638830 265.49
1625638890 266.51
1625638950 270.54
1625639010 275.12
1625639070 267.62
1625639130 267.20
1625639190 265.29
1625639250 261.95
1625639310 264.39
1625639370 270.76
1625639430 291.18
I want to extract the whole row containing the maximum value for every 7 rows. Hence, desired output will be:
1625638710 270.67
1625639010 275.12
1625639430 291.18
Use DataFrameGroupBy.idxmax to get the indices of the maximal values, then select the rows with DataFrame.loc:
df = df.loc[df.groupby(df.index // 7)['in_speed'].idxmax()]
#alternative for a non-default index (requires import numpy as np)
#df = df.loc[df.groupby(np.arange(len(df)) // 7)['in_speed'].idxmax()]
print(df)
Timestamp in_speed
3 1625638710 270.67
8 1625639010 275.12
15 1625639430 291.18
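As a quick sanity check of the grouping logic: df.index // 7 labels rows 0-6 as group 0, rows 7-13 as group 1, and so on, and idxmax returns the index label of the maximum within each group. A toy sketch:
import pandas as pd

s = pd.Series([1, 9, 2, 3, 4, 5, 6, 7, 8, 0])
# group 0 holds rows 0-6 (max 9 at label 1), group 1 holds rows 7-9 (max 8 at label 8)
print(s.groupby(s.index // 7).idxmax())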

Multiple dataframe transformations based on a single column

I was looking for a similar question but did not find a solution for what I want to do; any help is welcome.
Here is the code to build an example of my DataFrame:
import pandas as pd
L = [[0.1998,'IN TIME,IN TIME','19708,19708','MR SD#5 W/Z SD#6 X/Y',20.5],
[0.3983,'LATE,IN TIME','11206,18054','MR SD#4 A/B SD#1 C/D',19.97]]
df = pd.DataFrame(L,columns=['Time','status','F_nom','info','Delta'])
Output:
     Time           status        F_nom                  info  Delta
0  0.1998  IN TIME,IN TIME  19708,19708  MR SD#5 W/Z SD#6 X/Y  20.50
1  0.3983     LATE,IN TIME  11206,18054  MR SD#4 A/B SD#1 C/D  19.97
I would like to create two new rows for each row in my main dataframe, based on the 'info' column.
As we can see, each row of the 'info' column contains two different SD# entries;
I would like to have only one SD# per row.
I would also like to keep the corresponding values of the columns Time, status, F_nom and Delta.
Finally, I'd like a new column 'type_info' that contains the specific string for each SD# (W/Z or A/B etc.), all while keeping the index of my main dataframe.
Here is the desired result:
I hope I was clear enough; thanks in advance.
Use:
#split values by comma or whitespace
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
#select values by indexing
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
#reshape to Series
s = df.set_index(['Time','Delta']).stack()
#create new DataFrame and reshape to expected output
df1 = (pd.DataFrame(s.values.tolist(), index=s.index)
         .stack()
         .unstack(2)
         .reset_index(level=2, drop=True)
         .reset_index())
print(df1)
Time Delta status F_nom info type_info
0 0.1998 20.50 IN TIME 19708 SD#5 W/Z
1 0.1998 20.50 IN TIME 19708 SD#6 X/Y
2 0.3983 19.97 LATE 11206 SD#4 A/B
3 0.3983 19.97 IN TIME 18054 SD#1 C/D
Another solution:
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
from itertools import chain
lens = df['status'].str.len()
df = pd.DataFrame({
    'Time': df['Time'].values.repeat(lens),
    'status': list(chain.from_iterable(df['status'].tolist())),
    'F_nom': list(chain.from_iterable(df['F_nom'].tolist())),
    'info': list(chain.from_iterable(df['info'].tolist())),
    'Delta': df['Delta'].values.repeat(lens),
    'type_info': list(chain.from_iterable(df['type_info'].tolist())),
})
print(df)
Time status F_nom info Delta type_info
0 0.1998 IN TIME 19708 SD#5 20.50 W/Z
1 0.1998 IN TIME 19708 SD#6 20.50 X/Y
2 0.3983 LATE 11206 SD#4 19.97 A/B
3 0.3983 IN TIME 18054 SD#1 19.97 C/D
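On pandas 1.3 and newer, the same reshape can also be written with DataFrame.explode, which accepts several list-valued columns at once. A sketch under that assumption:
df['status'] = df['status'].str.split(',')
df['F_nom'] = df['F_nom'].str.split(',')
info = df.pop('info').str.split()
df['info'] = info.str[1::2]
df['type_info'] = info.str[2::2]
# explode all four list columns in one call (pandas >= 1.3)
df = df.explode(['status', 'F_nom', 'info', 'type_info'], ignore_index=True)
print(df)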

Shift down one row then rename the column

My data is looking like this:
pd.read_csv('/Users/admin/desktop/007538839.csv').head()
105586.18
0 105582.910
1 105585.230
2 105576.445
3 105580.016
4 105580.266
I want to move that 105586.18 to index 0 because right now it is the column name. After that, I want to name this column 'flux'. I've tried
pd.read_csv('/Users/admin/desktop/007538839.csv', sep='\t', names = ["flux"])
but it did not work, probably because the dataframe is not in the right format.
How can I achieve that?
Your code works fine for me:
import pandas as pd
from io import StringIO

temp = """105586.18
105582.910
105585.230
105576.445
105580.016
105580.266"""
#after testing replace 'StringIO(temp)' with '/Users/admin/desktop/007538839.csv'
df = pd.read_csv(StringIO(temp), sep='\t', names=["flux"])
print(df)
flux
0 105586.180
1 105582.910
2 105585.230
3 105576.445
4 105580.016
5 105580.266
To overwrite the original file with the same data under the new header flux:
df.to_csv('/Users/admin/desktop/007538839.csv', index=False)
Try this:
df = pd.read_csv('/Users/admin/desktop/007538839.csv', header=None)
df.columns = ['flux']
header=None is your friend here.
