Load CSV to Pandas as MultiIndex

At the moment I am trying to read a *.txt file with read_csv, which works fine so far.
In[1]: df = pd.read_csv('Data.txt', skiprows=range(0, 4), sep='\t', header=0, skipinitialspace=True)
With header=0 I get the element labels, but they repeat for each value of CTF1, CTF2, CTF3, and so on. So there are multiple columns in the header with the same label:
20052065, 20052065.1, 20052065.2, ..., 20052065.11
In[2]: print(df)
Out[2]:
Unnamed: 0 ELEMENT 20052065 20052066 20052082 20052087 20052089 \
0 TIME[s] TEMP[C] CTF1 CTF1 CTF1 CTF1 CTF1
1 0.000 24.000 -4.234 -6.728 -14.386 -4.356 -6.926
2 60.000 36.137 -29.308 -24.795 -26.937 -30.134 -24.735
3 120.000 49.013 -48.825 -36.383 -29.986 -49.897 -35.748
20052090 20052116 20052119 ... 20052116.10 20052119.10 20052065.11 \
0 CTF1 CTF1 CTF1 ... CU3 CU3 CU_M
1 -10.205 -9.934 -14.012 ... 0.001 0.001 0.003
2 -23.474 -23.982 -27.175 ... -0.016 -0.015 0.023
3 -28.007 -28.904 -29.788 ... -0.035 -0.032 0.036
So I would like to create a MultiIndex with CTF1, CTF2, CTF3, ... as the upper level and the element labels below it. In the end I would like to select a value by its first-level and second-level index. I have no idea how to get this to work. :-/
The *.txt looks like:
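One way to get there is to let read_csv build the MultiIndex itself. A minimal sketch, assuming (as the printed frame suggests) that the element labels and the CTF/CU labels sit in two consecutive rows directly above the data: pass both rows to header, then swap the levels so CTF1, CTF2, ... end up on top.
import pandas as pd

# Two header rows -> MultiIndex columns of (element label, CTF label)
df = pd.read_csv('Data.txt', skiprows=range(0, 4), sep='\t',
                 header=[0, 1], skipinitialspace=True)

# Put CTF1, CTF2, ... on the upper level; sorting the columns makes
# label-based selection work cleanly
df.columns = df.columns.swaplevel(0, 1)
df = df.sort_index(axis=1)
# Note: the TIME/TEMP columns keep their own two-level keys and may need cleanup.

# Select a whole column by (level-0, level-1) key (labels come in as strings) ...
print(df[('CTF1', '20052065')])
# ... or a single value with .loc
print(df.loc[0, ('CTF1', '20052065')])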

Related

Express pandas operations as pipeline

df = df.loc[:, dict_lup.values()].rename(columns={v: k for k, v in dict_lup.items()})
df['cover'] = df.loc[:, 'cover'] * 100.
df['id'] = df['condition'].map(constants.dict_c)
df['temperature'] = (df['min_t'] + df['max_t']) / 2.
Is there a way to express the code above as a pandas pipeline? I am stuck at the first step where I rename some columns in the dataframe and select a subset of the columns.
-- EDIT:
Data is here:
max_t col_a min_t cover condition pressure
0 38.02 1523106000 19.62 0.48 269.76 1006.64
1 39.02 1523196000 20.07 0.29 266.77 1008.03
2 39 1523282400 19.48 0.78 264.29 1008.29
3 39.11 1523368800 20.01 0.7 263.68 1008.29
4 38.59 1523455200 20.88 0.83 262.35 1007.36
5 39.33 1523541600 22 0.65 261.87 1006.82
6 38.96 1523628000 24.05 0.57 259.27 1006.96
7 39.09 1523714400 22.53 0.88 256.49 1007.94
I think you need assign:
df = (df.loc[:, dict_lup.values()]
        .rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover=lambda d: d['cover'] * 100.,
                id=lambda d: d['condition'].map(constants.dict_c),
                temperature=lambda d: (d['min_t'] + d['max_t']) / 2.))
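Note the lambdas: assign accepts callables and evaluates them against the DataFrame as it stands at that point in the chain, so d already carries the renamed columns. Referencing df['cover'] directly would read the original, pre-rename frame instead.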

get means and SEM in one df with pandas groupby

I'd like to find an efficient way to use the df.groupby() function in pandas to return both the means and the standard errors of the mean (SEM) of a data frame - preferably in one shot!
import pandas as pd
df = pd.DataFrame({'case': [1, 1, 2, 2, 3, 3],
                   'condition': [1, 2, 1, 2, 1, 2],
                   'var_a': [0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b': [0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})
with that data, I'd like an easier way (if there is one!) to perform the following:
grp_means = df.groupby('case', as_index=False).mean()
grp_sems = df.groupby('case', as_index=False).sem()
grp_means.rename(columns={'var_a': 'var_a_mean', 'var_b': 'var_b_mean'},
                 inplace=True)
grp_sems.rename(columns={'var_a': 'var_a_SEM', 'var_b': 'var_b_SEM'},
                inplace=True)
grouped = pd.concat([grp_means, grp_sems[['var_a_SEM', 'var_b_SEM']]], axis=1)
grouped
Out[1]:
   case  condition  var_a_mean  var_b_mean  var_a_SEM  var_b_SEM
0     1        1.5       0.900        0.18      0.020       0.03
1     2        1.5       0.845        0.13      0.055       0.03
2     3        1.5       0.895        0.20      0.045       0.03
I also recently learned of the .agg() function and tried df.groupby('grouper column').agg('var': 'mean', 'var': 'sem'), but this just returns a SyntaxError.
I think you need DataFrameGroupBy.agg, then remove the ('condition', 'sem') column and use map to flatten the MultiIndex columns:
df = df.groupby('case').agg(['mean','sem']).drop(('condition','sem'), axis=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
case condition_mean var_a_mean var_a_sem var_b_mean var_b_sem
0 1 1.5 0.900 0.020 0.18 0.03
1 2 1.5 0.845 0.055 0.13 0.03
2 3 1.5 0.895 0.045 0.20 0.03
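On pandas 0.25 or newer, named aggregation gets you flat column names in one step, with no MultiIndex to flatten afterwards - a sketch, starting from the original df:
grouped = df.groupby('case').agg(condition_mean=('condition', 'mean'),
                                 var_a_mean=('var_a', 'mean'),
                                 var_a_sem=('var_a', 'sem'),
                                 var_b_mean=('var_b', 'mean'),
                                 var_b_sem=('var_b', 'sem')).reset_index()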

How can I read in from two files, insert new columns, and compute functions like mean if there are blank values?

I have this file called 'test.txt' and it looks like this:
3.H5 5.40077
2.H8 7.75894
3.H6 7.60437
3.H5 5.40001
5.H5 5.70502
4.H8 7.55438
5.H1' 5.43574
5.H6 7.96472
""
""
""
""
""
""
6.H6 7.96178
6.H5 5.71068
""
""
7.H8 8.29385
7.H1' 6.01136
""
""
""
""
8.H5 5.51053
8.H6 7.67437
I want to see if the values in the first column are the same (i.e.: if 8.H5 occurs more than once), and if they are, I want to count how many times and take their average. I want my output to look like this:
Atom nVa predppm avgppm stdev delta QPred QMulti qTotal
1.H1' 1 5.820 5.737 0.000 0.000 0.985 1.000 0.995
2.H1' 1 5.903 5.892 0.000 0.000 0.998 1.000 0.999
3.H1' 1 5.549 5.454 0.000 0.000 0.983 1.000 0.994
4.H1' 1 5.741 5.737 0.000 0.000 0.999 1.000 1.000
6.H1' 1 5.543 5.600 0.000 0.000 0.990 1.000 0.997
8.H1' 1 5.363 5.359 0.000 0.000 0.999 1.000 1.000
10.H1' 1 5.378 5.408 0.000 0.000 0.995 1.000 0.998
11.H1' 1 5.501 5.497 0.000 0.000 0.999 1.000 1.000
14.H1' 1 5.962 5.893 0.000 0.000 0.988 1.000 0.996
Right now, my code reads from test.txt and computes the count and the mean of the values and gives an output which looks like this (output.txt):
Atom nVa avgppm
1.H1' 1 5.737
2.H1' 1 5.892
3.H1' 1 5.454
4.H1' 1 5.737
6.H1' 1 5.600
But it does not account for the "" rows. How can I get my code to skip lines that have ""?
I also have a file called test2.txt which looks like this:
5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
8.H6 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3
12.H6 7.72805 0.3
13.H6 8.02977 0.3
14.H6 7.69624 0.3
17.H8 7.24899 0.3
16.H8 8.27957 0.3
18.H6 7.6439 0.3
19.H8 7.65501 0.3
20.H8 7.78512 0.3
21.H8 8.06057 0.3
22.H8 7.47677 0.3
23.H6 7.7306 0.3
24.H6 7.80104 0.3
I want to read in values from the first column of test.txt and values from the first column of test2.txt and see if they are the same (i.e.: if 20.H8 = 20.H8). If they are, I want to insert a column in my output.txt between the nVa column and the avgppm column and put in the values from test2.txt. How can I insert a column into this output file that also accounts for the blank spaces by not using those lines?
This is my current code:
import pandas as pd
import os
import sys
test = 'test.txt'
test2 = 'test2.txt'
df = pd.read_csv(test, sep = ' ', header = None)
df.columns = ["Atom","ppm"]
gb = (df.groupby("Atom", as_index=False)
        .agg({"ppm": ["count", "mean"]})
        .rename(columns={"count": "nVa", "mean": "avgppm"}))
gb.head()
gb.columns = gb.columns.droplevel()
gb = gb.rename(columns={"":"Atom"})
gb.to_csv("output.txt", sep =" ", index=False)
df2 = pd.read_csv(test2, sep=r'\s+', header=None)
df2.columns = ["Atoms","ppms","error"]
shift1 = df2["Atoms"]
shift2 = df2["ppms"]
I'm not exactly sure how to proceed.
To drop the rows with "" as the values, use the dropna method of the DataFrame. You can follow this with reset_index to reset the row numbering:
df = pd.read_csv(test, sep = ' ', header = None)
df.columns = ["Atom","ppm"]
df = df.dropna().reset_index(drop=True)
gb = ...
To find matching values, you can use the merge method and compare the columns of interest.
df2 = pd.read_csv(test2, sep=r'\s+', header=None)
df2.columns = ["Atoms","ppms","error"]
gb.merge(df2, left_on='Atom', right_on='Atoms', how='left').drop(['Atoms','ppms'], axis=1)
This will leave you with NA values if the value in gb is not in df2.
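If you want the test2 values to land between nVa and avgppm as in the desired output, keep the merged ppms column instead of dropping it and reorder the columns. A sketch, assuming the flattened gb from the question's code; the predppm name is borrowed from the desired output:
out = (gb.merge(df2, left_on='Atom', right_on='Atoms', how='left')
         .drop(columns=['Atoms', 'error'])
         .rename(columns={'ppms': 'predppm'}))
out = out[['Atom', 'nVa', 'predppm', 'avgppm']]  # predppm sits between nVa and avgppm
out.to_csv('output.txt', sep=' ', index=False)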
A left merge() should be able to bring df and df2 together the way you want.
df = pd.read_csv("test.txt", sep=" ", header=None, names=["Atom", "ppm"])
df2 = pd.read_csv("test2.txt", sep=" ", header=None, names=["Atom", "ppms", "error"])
gb = df.groupby("Atom").agg(["count", "mean"])
gb.merge(df2.set_index("Atom"), how="left", left_index=True, right_index=True)
(ppm, count) (ppm, mean) ppms error
Atom
2.H8 1 7.75894 NaN NaN
3.H5 2 5.40039 NaN NaN
3.H6 1 7.60437 NaN NaN
4.H8 1 7.55438 NaN NaN
5.H1' 1 5.43574 NaN NaN
5.H5 1 5.70502 NaN NaN
5.H6 1 7.96472 7.72158 0.3
6.H5 1 5.71068 NaN NaN
6.H6 1 7.96178 7.70272 0.3
7.H1' 1 6.01136 NaN NaN
7.H8 1 8.29385 8.16859 0.3
8.H5 1 5.51053 NaN NaN
8.H6 1 7.67437 7.65014 0.3
Note: It doesn't seem that you even need dropna() for the missing rows in df. read_csv() interprets the "" values as NaN, and groupby() ignores NaN when grouping.

Pandas mean() for multiindex

I have df:
CU Parameters 1 2 3
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.3
I would like to create two separate dfs, getting the mean of column values by grouping index on level 0 (CU):
df1: (379-H and 625-H)
Parameters 1 2 3
Output Energy, (Wh/h) 1.37 0.63 0.657
df2: (the rest)
Parameters 1 2 3
Output Energy, (Wh/h) 0.69 0.74 0.65
I can get the mean for all rows by grouping on level 1:
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all').groupby(level=1).mean()
but how do I group these according to level 0?
SOLUTION:
lightsonly = ["379-H", "625-H"]
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all')
mask = df.index.get_level_values(0).isin(lightsonly)
df1 = df[mask].groupby(level=1).mean()
df2 = df[~mask].groupby(level=1).mean()
Use get_level_values + isin to build a True/False index, then take the mean and rename with a dict:
d = {True: '379-H and 625-H', False: 'the rest'}
df.index = df.index.get_level_values(0).isin(['379-H', '625-H'])
df = df.mean(level=0).rename(d)
print (df)
1 2 3
the rest 0.691 0.7485 0.650
379-H and 625-H 1.370 0.6395 0.657
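(Note: in recent pandas versions the level argument to mean() is deprecated and later removed; the equivalent is df.groupby(level=0).mean().rename(d).)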
For separate dfs it is also possible to use boolean indexing:
mask= df.index.get_level_values(0).isin(['379-H', '625-H'])
df1 = df[mask].mean().rename('379-H and 625-H').to_frame().T
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
df2 = df[~mask].mean().rename('the rest').to_frame().T
print (df2)
1 2 3
the rest 0.691 0.7485 0.65
Another NumPy solution, with the DataFrame constructor:
a1 = df[mask].values.mean(axis=0)
#alternatively
#a1 = df.values[mask].mean(axis=0)
df1 = pd.DataFrame(a1.reshape(-1, len(a1)), index=['379-H and 625-H'], columns=df.columns)
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
Consider the dataframe df where CU and Parameters are assumed to be in the index.
1 2 3
CU Parameters
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0.000
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.300
Then we can groupby the truth values of whether the first level values are in the list ['379-H', '625-H'].
m = {True: 'Main', False: 'Rest'}
l = ['379-H', '625-H']
g = df.index.get_level_values('CU').isin(l)
df.groupby(g).mean().rename(index=m)
1 2 3
Rest 0.691 0.7485 0.650
Main 1.370 0.6395 0.657
# Use a lambda function to map the index to 2 groups, then groupby using the modified index.
df.groupby(by=lambda x:'379-H,625-H' if x[0] in ['379-H','625-H'] else 'Others').mean()
Out[22]:
1 2 3
379-H,625-H 1.370 0.6395 0.657
Others 0.691 0.7485 0.650

pandas long to wide multicolumn reshaping

I have a pandas data frame as follows:
request_id crash_id counter num_acc_x num_acc_y num_acc_z
745109.0 670140638.0 0 0.010 0.000 -0.045
745109.0 670140638.0 1 0.016 -0.006 -0.034
745109.0 670140638.0 2 0.016 -0.006 -0.034
My id vars are "request_id" and "crash_id"; the target vars are num_acc_x, num_acc_y and num_acc_z.
I would like to create a new DataFrame where the target vars are reshaped wide, that is, adding max(counter)*3 new vars like num_acc_x_0, num_acc_x_1, ..., num_acc_y_0, num_acc_y_1, ..., num_acc_z_0, num_acc_z_1, preferably without a pivot as the final result (I would like a true DataFrame as in R).
Thanks in advance for your attention.
I think you need set_index with unstack, last create columns names from MultiIndex by map:
df = df.set_index(['request_id','crash_id','counter']).unstack()
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
Another solution, aggregating duplicates with pivot_table:
df = df.pivot_table(index=['request_id','crash_id'], columns='counter', aggfunc='mean')
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
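Or aggregate the duplicates with groupby + mean first, then unstack the same way: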
df = df.groupby(['request_id','crash_id','counter']).mean().unstack()
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
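Note: the pivot_table and groupby variants exist because plain set_index(...).unstack() raises ValueError: Index contains duplicate entries, cannot reshape if the same (request_id, crash_id, counter) triple occurs more than once; the first solution assumes those triples are unique.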
