Read in .dat file with headers throughout - python

I'm trying to read in a .dat file, but it consists of chunks of non-columnar data with headers throughout.
I've tried reading it in with pandas:
new_df = pd.read_csv(os.path.join(pathname, item), delimiter='\t', skiprows = 2)
And it helpfully comes out like this:
Cyclic Acquisition Unnamed: 1 Unnamed: 2 24290-15 Y Unnamed: 4 \
0 Stored at: 100 cycle NaN NaN
1 Points: 2 NaN NaN NaN
2 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
3 in lbf s segments NaN
4 -0.036677472 -149.27879 19.976563 198 NaN
5 0.031659406 149.65636 20.077148 199 NaN
6 Cyclic Acquisition NaN NaN 24290-15 Y NaN
7 Stored at: 200 cycle NaN NaN
8 Points: 2 NaN NaN NaN
9 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
10 in lbf s segments NaN
11 -0.036623772 -149.73801 39.975586 398 NaN
12 0.031438459 149.48193 40.078125 399 NaN
13 Cyclic Acquisition NaN NaN 24290-15 Y NaN
14 Stored at: 300 cycle NaN NaN
Do I need to resort to np.genfromtxt(), or is there a panda-riffic way to accomplish this?

I developed a work-around. I needed the Displacement data pairs, as well as some data that was all evenly divisible by 100.
To get the Displacement data, I first pretended 'Cyclic Acquisition' was a valid column name, coerced the column to numeric with errors='coerce', and kept only the values that actually worked out to numbers:
displacement = new_df['Cyclic Acquisition'][pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce').notnull()]
4 -0.036677472
5 0.031659406
11 -0.036623772
12 0.031438459
Then, because the remaining values were paired low and high values that needed to be operated on together, I selected every other value starting with the 0th for the "low" values, and applied the same logic to the "high" values. I reset the index because my plan was to build a different DataFrame with the necessary info, and I wanted the values to keep their relationship to each other.
displacement_low = displacement[::2].reset_index(drop = True)
0 -0.036677472
1 -0.036623772
displacement_high = displacement[1::2].reset_index(drop = True)
0 0.031659406
1 0.031438459
Then, to get the cycles, I followed the same basic principle to reduce that column to just numbers, put the values into a list, used a list comprehension to require divisibility by 100, and switched the result back to a Series.
cycles = new_df['Unnamed: 1'][pd.to_numeric(new_df['Unnamed: 1'], errors='coerce').notnull()].astype('float').tolist()
[100.0, 2.0, -149.27879, 149.65636, 200.0, 2.0, -149.73801, 149.48193...]
cycles = pd.Series([val for val in cycles if val%100 == 0])
0 100.0
1 200.0
...
I then created a new df with that data and named the columns as desired:
df = pd.concat([displacement_low, displacement_high, cycles], axis = 1)
df.columns = ['low', 'high', 'cycles']
low high cycles
0 -0.036677 0.031659 100.0
1 -0.036624 0.031438 200.0
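Putting those steps together, here is a minimal sketch of the whole work-around as a single function (assuming new_df has been read with the read_csv call above, so the labels 'Cyclic Acquisition' and 'Unnamed: 1' match the output shown; this is just the logic described above collected in one place):

import pandas as pd

def extract_displacement_and_cycles(new_df):
    """Collect the low/high displacement pairs and the cycle counts from the raw frame."""
    # keep only the rows of 'Cyclic Acquisition' that parse as numbers, then cast them
    displacement = new_df['Cyclic Acquisition'][
        pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce').notnull()
    ].astype(float)

    # the pairs are stored low-then-high, so take every other value
    displacement_low = displacement[::2].reset_index(drop=True)
    displacement_high = displacement[1::2].reset_index(drop=True)

    # the cycle counts are the numeric values in 'Unnamed: 1' that divide evenly by 100
    numeric = pd.to_numeric(new_df['Unnamed: 1'], errors='coerce').dropna()
    cycles = pd.Series([val for val in numeric if val % 100 == 0])

    out = pd.concat([displacement_low, displacement_high, cycles], axis=1)
    out.columns = ['low', 'high', 'cycles']
    return out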

Related

What is the best way to create running total columns in pandas?

What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,'X','X','X','X',np.nan,'X','X','X','X','X','X',np.nan,np.nan,'X','X'
df['desired_output_level_1'] = np.nan,np.nan,'1','1','1','1',np.nan,'2','2','2','2','2','2',np.nan,np.nan,'3','3'
df['desired_output_level_2'] = np.nan,np.nan,'1','2','3','4',np.nan,'1','2','3','4','5','6',np.nan,np.nan,'1','2'
output:
test desired_output_level_1 desired_output_level_2
0 NaN NaN NaN
1 NaN NaN NaN
2 X 1 1
3 X 1 2
4 X 1 3
5 X 1 4
6 NaN NaN NaN
7 X 2 1
8 X 2 2
9 X 2 3
10 X 2 4
11 X 2 5
12 X 2 6
13 NaN NaN NaN
14 NaN NaN NaN
15 X 3 1
16 X 3 2
The test column can only contain X's or NaNs.
The number of consecutive X's is random.
In the 'desired_output_level_1' column, I'm trying to number each series of X's (a running count of the series).
In the 'desired_output_level_2' column, I'm trying to count up within each series (ending at its duration).
Can anyone help? Thanks in advance.
Perhaps not the most pandastic way, but it seems to yield what you are after.
Three key points:
We operate only on rows that are not NaN, so let's create a mask:
mask = df['test'].notna()
For the level 1 computation, it's easy to detect a change from NaN to not-NaN by shifting the rows by one:
df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()
For the level 2 computation, it's a bit trickier. One way is to run the computation for each level_1 group and use .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
    df.loc[mask, ["level_1"]]
    .assign(level_2=1)
    .groupby("level_1")["level_2"]
    .transform("cumsum")
)
The last step (if needed) is to convert the columns to strings:
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
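Not part of the answer above, but an equivalent sketch that uses groupby().cumcount() for the level-2 step reads fairly directly, and it also covers a frame whose first row is already 'X' (the construction below simply repeats the question's test column):

import numpy as np
import pandas as pd

# the question's test column
df = pd.DataFrame({'test': [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan, 'X', 'X',
                            'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X']})

mask = df['test'].notna()

# level 1: a new run starts wherever a non-NaN row follows a NaN row (or the frame start)
run_id = (mask & ~mask.shift(fill_value=False)).cumsum()
df.loc[mask, 'level_1'] = run_id[mask]

# level 2: cumcount numbers the rows 0..n-1 within each run, so add 1
df.loc[mask, 'level_2'] = df[mask].groupby(run_id[mask]).cumcount() + 1

print(df)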

How to assign/change values to top N values in dataframe using nlargest?

So using .nlargest I can get the top N values from my dataframe.
Now if I run the following code:
df.nlargest(25, 'Change')['TopN']='TOP 25'
I expect all affected values in the TopN column to become TOP 25. But somehow this assignment does not work and those rows remain unaffected. What am I doing wrong?
Assuming you really want the top N (limited to N values, as nlargest would do), use the index from df.nlargest(25, 'Change') with loc:
df.loc[df.nlargest(25, 'Change').index, 'TopN'] = 'TOP 25'
Note the difference from the other approach, which will give you all matching values:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'
Highlighting the difference:
df = pd.DataFrame({'Change': [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]})
df.loc[df.nlargest(4, 'Change').index, 'TOP4 (A)'] = 'X'
df.loc[df['Change'].isin(df['Change'].nlargest(4)), 'TOP4 (B)'] = 'X'
output:
Change TOP4 (A) TOP4 (B)
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 X X
4 5 X X
5 1 NaN NaN
6 2 NaN NaN
7 3 NaN NaN
8 4 NaN X
9 5 X X
10 1 NaN NaN
11 2 NaN NaN
12 3 NaN NaN
13 4 NaN X
14 5 X X
One thing to be aware of is that nlargest does not keep ties by default: if, say, five rows share the value ranked 25th, it returns only 25 rows rather than 29 unless you specify the keep parameter as 'all'.
Using this parameter, the top 25 can be identified as:
df.loc[df.nlargest(25, 'Change', 'all').index, 'TopN'] = 'Top 25'
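To see the tie behaviour in a small sketch (reusing the frame with repeated values from above):

df = pd.DataFrame({'Change': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]})

# default keep='first': exactly 4 rows, ties at the cutoff value are dropped
print(len(df.nlargest(4, 'Change')))               # 4
# keep='all': every row tied with the 4th-ranked value is included
print(len(df.nlargest(4, 'Change', keep='all')))   # 6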
A solution that compares the top-25 values against all the values of the column is:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'

Iterate through rows, identify which column is True, and assign a new column the name of that column header

I have the following DataFrame:
Index  Time Lost  Cause 1  Cause 2  Cause 3
0      40         x        Nan      Nan
1      15         Nan      x        Nan
2      65         x        Nan      Nan
3      10         Nan      Nan      x
There is only one "X" per row, which identifies the cause of the time lost. I am trying to iterate through each row (and column) to determine which column holds the "X". I would then like to add a "Type" column containing the name of the column header that was True for each row. This is what I would like as a result:
Index  Time Lost  Cause 1  Cause 2  Cause 3  Type
0      40         x        Nan      Nan      Cause 1
1      15         Nan      x        Nan      Cause 2
2      65         x        Nan      Nan      Cause 1
3      10         Nan      Nan      x        Cause 3
Currently my code looks like this. I am trying to iterate through the DataFrame, but I'm not sure if there is a function or non-iterative approach to assign the proper value to the "Type" column:
cols = ['Cause 1', 'Cause 2', 'Cause 3']
for index, row in df.iterrows():
    for col in cols:
        if df.loc[index, col] == 'X':
            df.loc[index, 'Type'] = col
            continue
        else:
            df.loc[index, 'Type'] = 'Other'
            continue
The issue with this code is that it only identifies rows matching the last item in the cols list; the remainder go to "Other".
Any help is appreciated!
You could use idxmax on the boolean array of your data:
df['Type'] = df.drop('Time Lost', axis=1).eq('x').idxmax(axis=1)
Note that this only reports the first cause if there are several.
output:
Time Lost Cause 1 Cause 2 Cause 3 Type
0 40 x Nan Nan Cause 1
1 15 Nan x Nan Cause 2
2 65 x Nan Nan Cause 1
3 10 Nan Nan x Cause 3
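Not part of the answer above, but if some rows could contain no "x" at all (the case the question's loop mapped to 'Other'), a refinement combines idxmax with where. The frame below is a hypothetical reconstruction of the question's data using NaN for the empty cells:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time Lost': [40, 15, 65, 10],
    'Cause 1': ['x', np.nan, 'x', np.nan],
    'Cause 2': [np.nan, 'x', np.nan, np.nan],
    'Cause 3': [np.nan, np.nan, np.nan, 'x'],
})

causes = df.drop('Time Lost', axis=1).eq('x')
# idxmax picks the first matching column per row; where() falls back to 'Other'
# for rows that contain no 'x' at all
df['Type'] = causes.idxmax(axis=1).where(causes.any(axis=1), 'Other')
print(df)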

How to purify pandas dataframe efficiently?

It's hard for me to put my question into the proper words, so thank you for reading it.
I have a DataFrame with two columns, high and low, which record the higher values and the lower values.
For example:
high low
0 NaN NaN
1 100.0 NaN
2 NaN 50.0
3 110.0 NaN
4 NaN NaN
5 120.0 NaN
6 100.0 NaN
7 NaN NaN
8 NaN 30.0
9 NaN NaN
10 NaN 20.0
11 NaN NaN
12 110.0 NaN
13 NaN NaN
I want to merge the continuous ones (on the same side) and keep only the highest (lowest) value.
"The continuous ones" means the values in the high column between two values in the low column, or the values in the low column between two values in the high column.
The high values at indices 3, 5 and 6 should be merged, leaving the highest value at index 5 (120).
The low values at indices 8 and 10 should be merged, leaving the lowest value at index 10 (20).
The result should look like this:
high low
0 NaN NaN
1 100.0 NaN
2 NaN 50.0
3 NaN NaN
4 NaN NaN
5 120.0 NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN 20.0
11 NaN NaN
12 110.0 NaN
13 NaN NaN
I tried to write a for loop to handle the data, but it is very slow when the data is large (more than 10,000 rows).
The code is:
import pandas as pd

data = pd.DataFrame(dict(
    high=[None,100,None,110,None,120,100,None,None,None,None,None,110,None],
    low=[None,None,50,None,None,None,None,None,30,None,20,None,None,None]))

flag = None
flag_index = None
for i in range(len(data)):
    if not pd.isna(data['high'][i]):
        if flag == 'flag_high':
            higher = data['high'].iloc[[i, flag_index]].idxmax()
            lower = flag_index if i == higher else i
            flag_index = higher
            data['high'][lower] = None
        else:
            flag = 'flag_high'
            flag_index = i
    elif not pd.isna(data['low'][i]):
        if flag == 'flag_low':
            lower = data['low'].iloc[[i, flag_index]].idxmin()
            higher = flag_index if i == lower else i
            flag_index = lower
            data['low'][higher] = None
        else:
            flag = 'flag_low'
            flag_index = i
Is there any efficient way to do that?
Thank you
For line-oriented iterative processing like this, pandas usually does a poor job, or more exactly, is not efficient at all. But you can always process the underlying numpy arrays directly:
import pandas as pd
import numpy as np

data = pd.DataFrame(dict(
    high=[None,100,None,110,None,120,100,None,None,None,None,None,110,None],
    low=[None,None,50,None,None,None,None,None,30,None,20,None,None,None]))

npdata = data.values
flag = None
flag_index = None
for i in range(len(npdata)):
    if not np.isnan(npdata[i][0]):
        if flag == 'flag_high':
            if npdata[i][0] > npdata[flag_index][0]:
                npdata[flag_index][0] = np.nan
                flag_index = i
            else:
                npdata[i][0] = np.nan
        else:
            flag = 'flag_high'
            flag_index = i
    elif not np.isnan(npdata[i][1]):
        if flag == 'flag_low':
            if npdata[i][1] < npdata[flag_index][1]:
                npdata[flag_index][1] = np.nan
                flag_index = i
            else:
                npdata[i][1] = np.nan
        else:
            flag = 'flag_low'
            flag_index = i
In my test it is close to 10 times faster.
The larger the DataFrame, the higher the gain: at 1,500 rows, using the numpy arrays directly is 30 times faster.
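Not part of the answer above, but for completeness, a fully vectorized sketch using groupby is also possible (ties within a run are not handled specially here; the construction simply repeats the question's frame):

import numpy as np
import pandas as pd

data = pd.DataFrame(dict(
    high=[None,100,None,110,None,120,100,None,None,None,None,None,110,None],
    low=[None,None,50,None,None,None,None,None,30,None,20,None,None,None]))

# label each row with the side it belongs to, carrying that side across the NaN rows
side = pd.Series(np.where(data['high'].notna(), 'high',
                 np.where(data['low'].notna(), 'low', None)),
                 index=data.index).ffill()

# a new group starts whenever the side flips from high to low or back
group = (side != side.shift()).cumsum()

# within each group keep only the extreme value on its side; all-NaN groups compare False
keep_high = data['high'].eq(data.groupby(group)['high'].transform('max'))
keep_low = data['low'].eq(data.groupby(group)['low'].transform('min'))

result = data.copy()
result.loc[~keep_high, 'high'] = np.nan
result.loc[~keep_low, 'low'] = np.nan
print(result)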

pandas - partially updating DataFrame with derived calculations of a subset groupby

I have a DataFrame with some NaN records that I want to fill based on a combination of data from the NaN record (its index, in this example) and from the non-NaN records. The original DataFrame should be modified.
Details of input/output/code below:
I have an initial DataFrame that contains some pre-calculated data:
Initial Input
import numpy as np
import pandas as pd

raw_data = {'raw': [x for x in range(5)] + [np.nan for x in range(2)]}
source = pd.DataFrame(raw_data)
raw
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 NaN
6 NaN
I want to identify the NaN data and perform calculations to "update" it, where the calculations are based on the non-NaN data and on some data from the NaN records themselves.
In this contrived example the calculation is:
Calculate the average/mean of the 'valid' records.
Add this to the index number of each 'invalid' record.
Finally, this needs to be updated on the initial DataFrame.
Desired Output
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
The current solution I have (below) performs the calculation on a copy, then updates the original DataFrame.
# Setup grouping by NaN in 'raw'
source['valid'] = ~np.isnan(source['raw'])*1
subsets = source.groupby('valid')
# Mean of 'valid' is used later to fill 'invalid' records
valid_mean = subsets.get_group(1)['raw'].mean()
# Operate on a copy of group(0), then update the original DataFrame
invalid = subsets.get_group(0).copy()
invalid['raw'] = subsets.get_group(0).index + valid_mean
source.update(invalid)
Is there a less clunky or more efficient way to do this? The real application is on significantly larger DataFrames (and with a significantly longer process for the NaN rows).
Thanks in advance.
You can use combine_first:
# mean omits NaNs by default
m = source['raw'].mean()
# same as
# m = source['raw'].dropna().mean()
print(m)
2.0
# create the valid column if necessary
source['valid'] = source['raw'].notnull().astype(int)
# update the NaNs
source['raw'] = source['raw'].combine_first(source.index.to_series() + m)
print(source)
raw valid
0 0.0 1
1 1.0 1
2 2.0 1
3 3.0 1
4 4.0 1
5 7.0 0
6 8.0 0
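Not from the answer above, but a shorter variant along the same lines uses fillna, which fills each NaN position from an index-aligned Series (starting again from the original source frame):

# mean skips NaNs by default
m = source['raw'].mean()
source['valid'] = source['raw'].notnull().astype(int)
# fillna aligns on the index, so each NaN row gets its own index value plus the mean
source['raw'] = source['raw'].fillna(source.index.to_series() + m)
print(source)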
