It's hard for me to put my question into the proper words, so thank you for reading it.
I have a dataframe with two columns, high and low, which record the higher and lower values.
For example:
high low
0 NaN NaN
1 100.0 NaN
2 NaN 50.0
3 110.0 NaN
4 NaN NaN
5 120.0 NaN
6 100.0 NaN
7 NaN NaN
8 NaN 30.0
9 NaN NaN
10 NaN 20.0
11 NaN NaN
12 110.0 NaN
13 NaN NaN
I want to merge the continuous ones (on the same side) and keep only the highest (or lowest) one.
"The continuous ones" means the values in the high column between two values in the low column, or the values in the low column between two values in the high column.
The high values on index 3, 5, 6 should be merged, and the highest value on index 5 (the value 120) should be left.
The low values on index 8, 10 should be merged, and the lowest value on index 10 (the value 20) should be left.
The result should look like this:
high low
0 NaN NaN
1 100.0 NaN
2 NaN 50.0
3 NaN NaN
4 NaN NaN
5 120.0 NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN 20.0
11 NaN NaN
12 110.0 NaN
13 NaN NaN
I tried writing a for loop to handle the data, but it is very slow when the data is large (more than 10,000 rows).
The code is:
import pandas as pd

data = pd.DataFrame(dict(high=[None,100,None,110,None,120,100,None,None,None,None,None,110,None],
                         low=[None,None,50,None,None,None,None,None,30,None,20,None,None,None]))

flag = None
flag_index = None
for i in range(len(data)):
    if not pd.isna(data['high'][i]):
        if flag == 'flag_high':
            higher = data['high'].iloc[[i, flag_index]].idxmax()
            lower = flag_index if i == higher else i
            flag_index = higher
            data['high'][lower] = None
        else:
            flag = 'flag_high'
            flag_index = i
    elif not pd.isna(data['low'][i]):
        if flag == 'flag_low':
            lower = data['low'].iloc[[i, flag_index]].idxmin()
            higher = flag_index if i == lower else i
            flag_index = lower
            data['low'][higher] = None
        else:
            flag = 'flag_low'
            flag_index = i
Is there any efficient way to do that?
Thank you
For line-oriented, iterative processing like that, pandas usually does a bad job, or more exactly is not efficient at all. But you can always process the underlying numpy arrays directly:
import pandas as pd
import numpy as np

data = pd.DataFrame(dict(high=[None,100,None,110,None,120,100,None,None,None,None,None,110,None],
                         low=[None,None,50,None,None,None,None,None,30,None,20,None,None,None]))

npdata = data.values
flag = None
flag_index = None
for i in range(len(npdata)):
    if not np.isnan(npdata[i][0]):
        if flag == 'flag_high':
            if npdata[i][0] > npdata[flag_index][0]:
                npdata[flag_index][0] = np.nan
                flag_index = i
            else:
                npdata[i][0] = np.nan
        else:
            flag = 'flag_high'
            flag_index = i
    elif not np.isnan(npdata[i][1]):
        if flag == 'flag_low':
            if npdata[i][1] < npdata[flag_index][1]:
                npdata[flag_index][1] = np.nan
                flag_index = i
            else:
                npdata[i][1] = np.nan
        else:
            flag = 'flag_low'
            flag_index = i
In my test it is close to 10 times faster.
The larger the dataframe, the higher the gain: at 1500 rows, working on the numpy arrays directly is 30 times faster.
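If you want to avoid the Python loop entirely, a fully vectorized pandas version is also possible, starting from the original dataframe: label each non-NaN row with its side, forward-fill to build groups of consecutive same-side values, then keep only the maximum high (or minimum low) of each group. This is only a sketch based on the rules stated in the question; note that ties within a group would all be kept.

import numpy as np
import pandas as pd

# 'H' where high is set, 'L' where low is set, missing otherwise
side = pd.Series(np.where(data['high'].notna(), 'H',
                 np.where(data['low'].notna(), 'L', None)), index=data.index)

# consecutive rows on the same side share one group id
group = side.ffill().ne(side.ffill().shift()).cumsum()

# within each group, keep only the extreme value on its side
result = data.copy()
result['high'] = data['high'].where(data['high'].eq(data.groupby(group)['high'].transform('max')))
result['low'] = data['low'].where(data['low'].eq(data.groupby(group)['low'].transform('min')))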
So using .nlargest I can get the top N values from my dataframe.
Now if I run the following code:
df.nlargest(25, 'Change')['TopN']='TOP 25'
I expect all affected values in the TopN column to become TOP 25. But somehow this assignment does not work and those rows remain unaffected. What am I doing wrong?
Assuming you really want the TOPN (limited to N values as nlargest would do), use the index from df.nlargest(25, 'Change') and loc:
df.loc[df.nlargest(25, 'Change').index, 'TopN'] = 'TOP 25'
Note the difference with the other approach that will give you all matching values:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'
Highlighting the difference:
df = pd.DataFrame({'Change': [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]})
df.loc[df.nlargest(4, 'Change').index, 'TOP4 (A)'] = 'X'
df.loc[df['Change'].isin(df['Change'].nlargest(4)), 'TOP4 (B)'] = 'X'
output:
Change TOP4 (A) TOP4 (B)
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 X X
4 5 X X
5 1 NaN NaN
6 2 NaN NaN
7 3 NaN NaN
8 4 NaN X
9 5 X X
10 1 NaN NaN
11 2 NaN NaN
12 3 NaN NaN
13 4 NaN X
14 5 X X
One thing to be aware of is that nlargest does not return ties by default: if, at the 25th position, there are 5 rows where Change equals the 25th-ranked value, nlargest will return only 25 rows rather than 29, unless you set the keep parameter to 'all'.
Using this parameter, it would be possible to identify the top 25 as
df.loc[df.nlargest(25, 'Change', keep='all').index, 'TopN'] = 'TOP 25'
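A minimal illustration of the effect of keep, on made-up data:

df = pd.DataFrame({'Change': [5, 4, 4, 3]})
df.nlargest(2, 'Change')              # 2 rows: 5 and the first 4 (ties broken by position)
df.nlargest(2, 'Change', keep='all')  # 3 rows: 5, 4 and 4 (all ties at the cutoff are kept)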
A solution that compares the top-25 values against all values of the column is:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'
I'm trying to read in a .dat file, but it's composed of chunks of non-columnar data with headers throughout.
I've tried reading it in with pandas:
new_df = pd.read_csv(os.path.join(pathname, item), delimiter='\t', skiprows = 2)
And it helpfully comes out like this:
Cyclic Acquisition Unnamed: 1 Unnamed: 2 24290-15 Y Unnamed: 4 \
0 Stored at: 100 cycle NaN NaN
1 Points: 2 NaN NaN NaN
2 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
3 in lbf s segments NaN
4 -0.036677472 -149.27879 19.976563 198 NaN
5 0.031659406 149.65636 20.077148 199 NaN
6 Cyclic Acquisition NaN NaN 24290-15 Y NaN
7 Stored at: 200 cycle NaN NaN
8 Points: 2 NaN NaN NaN
9 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
10 in lbf s segments NaN
11 -0.036623772 -149.73801 39.975586 398 NaN
12 0.031438459 149.48193 40.078125 399 NaN
13 Cyclic Acquisition NaN NaN 24290-15 Y NaN
14 Stored at: 300 cycle NaN NaN
Do I need to resort to numpy's genfromtxt(), or is there a panda-riffic way to accomplish this?
I developed a work-around. I needed the Displacement data pairs, as well as some data that was all evenly divisible by 100.
To get to the Displacement data, I first treated 'Cyclic Acquisition' as a regular column name, coerced its values to numeric with errors='coerce', and kept only the entries that actually parsed as numbers:
displacement = new_df['Cyclic Acquisition'][pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce').notnull()]
4 -0.036677472
5 0.031659406
11 -0.036623772
12 0.031438459
Then, because the remaining values were paired low and high readings that needed to be operated on together, I selected every other value starting with the 0th for the "low" values, and applied the same logic for the "high" values. I reset the index because my plan was to create a separate DataFrame with the necessary info, and I wanted the values to keep their relationship to each other.
displacement_low = displacement[::2].reset_index(drop = True)
0 -0.036677472
1 -0.036623772
displacement_high = displacement[1::2].reset_index(drop = True)
0 0.031659406
1 0.031438459
Then, to get the cycles, I followed the same basic principle to reduce that column to just numbers, put the values into a list, used a list comprehension to enforce the divisibility by 100, and converted the result back to a Series.
cycles = new_df['Unnamed: 1'][pd.to_numeric(new_df['Unnamed: 1'], errors='coerce').notnull()].astype('float').tolist()
[100.0, 2.0, -149.27879, 149.65636, 200.0, 2.0, -149.73801, 149.48193...]
cycles = pd.Series([val for val in cycles if val%100 == 0])
0 100.0
1 200.0
...
I then created a new df with that data and named the columns as desired:
df = pd.concat([displacement_low, displacement_high, cycles], axis = 1)
df.columns = ['low', 'high', 'cycles']
low high cycles
0 -0.036677 0.031659 100.0
1 -0.036624 0.031438 200.0
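Putting the steps above together, here is a condensed sketch. It assumes the file keeps the same chunk layout as the sample: displacement rows alternate low/high, and the cycle counts are the only numeric values in 'Unnamed: 1' divisible by 100.

acq = pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce')
displacement = new_df['Cyclic Acquisition'][acq.notnull()].astype(float)

raw = pd.to_numeric(new_df['Unnamed: 1'], errors='coerce').dropna()
cycles = pd.Series([v for v in raw if v % 100 == 0])

df = pd.concat([displacement[::2].reset_index(drop=True),   # low values
                displacement[1::2].reset_index(drop=True),  # high values
                cycles], axis=1)
df.columns = ['low', 'high', 'cycles']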
I am having trouble using pandas DataFrame.append(), as it doesn't work the way it is described in help(pandas.DataFrame.append) or online in the various sites, blogs, answered questions, etc.
This is exactly what I am doing
import pandas as pd
import numpy as np
dataset = pd.DataFrame.from_dict({"0": [0,0,0,0]}, orient="index", columns=["time", "cost", "mult", "class"])
row= [3, 1, 3, 1]
dataset = dataset.append(row, sort=True )
Trying to get to this result
time cost mult class
0 0.0 0.0 0.0 0.0
1 1 1 1 1
what I am getting instead is
0 class cost mult time
0 NaN 0.0 0.0 0.0 0.0
0 3.0 NaN NaN NaN NaN
1 1.0 NaN NaN NaN NaN
2 3.0 NaN NaN NaN NaN
3 1.0 NaN NaN NaN NaN
I have tried all sorts of things, but some examples (online and in the documentation) can't be reproduced, since .append() no longer accepts a "columns" parameter:
append(self, other, ignore_index: 'bool' = False, verify_integrity:
'bool' = False, sort: 'bool' = False) -> 'DataFrame'
Append rows of other to the end of caller, returning a new object.

other : DataFrame or Series/dict-like object, or list of these
    The data to append.
ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.
verify_integrity : bool, default False
    If True, raise ValueError on creating index with duplicates.
sort : bool, default False
    Sort columns if the columns of self and other are not aligned.
I have tried all combinations of those parameters, but it keeps adding new rows with the values in a separate new column; moreover, it changes the order of the columns that I defined in the initial dataset. (I have also tried various things with .concat, but it still gave similar problems even with axis=0.)
Since even the examples in the documentation don't show this result despite having the same code structure, it would be great if anyone could enlighten me on what is happening, why, and how to fix it.
In response to the answer, I had already tried
row= pd.Series([3, 1, 3, 1])
row = row.to_frame()
dataset = dataset.append(row, ignore_index=True )
0 class cost mult time
0 NaN 0.0 0.0 0.0 0.0
1 3.0 NaN NaN NaN NaN
2 1.0 NaN NaN NaN NaN
3 3.0 NaN NaN NaN NaN
4 1.0 NaN NaN NaN NaN
alternatively
row= pd.Series([3, 1, 3, 1])
dataset = dataset.append(row, ignore_index=True )
time cost mult class 0 1 2 3
0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 3.0 1.0 3.0 1.0
Without ignore_index, this second case raises this error:
TypeError: Can only append a Series if ignore_index=True or if the
Series has a name
One option is to just explicitly turn the list into a pd.Series:
In [46]: dataset.append(pd.Series(row, index=dataset.columns), ignore_index=True)
Out[46]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
You can also do it natively with a dict:
In [47]: dataset.append(dict(zip(dataset.columns, row)), ignore_index=True)
Out[47]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
The issue you're having is that other needs to be a DataFrame, a Series (or another dict-like object), or a list of DataFrames or Series, not a plain list of integers.
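Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On newer versions the same idea can be written with pd.concat; a sketch using the row list from the question:

new_row = pd.DataFrame([dict(zip(dataset.columns, row))])  # one-row frame with the right column names
dataset = pd.concat([dataset, new_row], ignore_index=True)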
How can I transform the first table into the second table (fast and < 8 GB RAM)?
I have data that's stored in a database in a "key-value pair" format. When I query the data, I get back this "flattened" kind of table, and I need to unflatten it by essentially pivoting each pair of key-value columns. It's easier to understand if you look at my example tables below.
Unlike other related questions, I have some different rules/restrictions:
I don't know which keys, values, or key-value pairs will be missing before I query the data
key-value column names are always identified by "KEY_###" and "VALUE_###" where ### is some number
the ### number identifies the pair (e.g., KEY_1 pairs with VALUE_1; KEY_2 does not pair with VALUE_3)
the KEY's are always alphanumeric characters or missing
the VALUE's are always numbers or missing
not all rows have every key-val pair (e.g., in the table below, entry_ID B has the key N, but A does not)
sometimes there is a key but the value is missing (e.g., in table below, KEY_4 is not missing but VALUE_4 is missing)
if there is a non-missing value, then there will be a non-missing key
There may be more than 1 unique key name in a key column (e.g., KEY_0 has N and O)
If a key shows up in one key column, it will not be in any other key column (e.g., KEY_0 has both N and O, but neither show up in any other KEY_### column)
To bring down the level of abstraction... think of my data as representing the outcomes of a battery of tests on unique samples where not all samples get the same tests. The row entry_ID represents a unique sample. The key represents one of the tests, and the value is the outcome of that test. Not all samples get the same tests. Some samples get a test, but don't complete, so the outcome is missing. For example, in table below, sample A got tests P and Q, but sample B only got test N.
However, the tests change over time, but the database does not, so tests and outcomes are uploaded into the same named columns in the database. This prevents me from simply renaming the columns (e.g., for KEY_1/VALUE_1 I could rename VALUE_1 to "P" and be done, but that would not work for KEY_0/VALUE_0). The example below is a simplified and smaller case.
My typical queries will be at least 10k rows with at least 300 key-val pair columns having more than 300 unique keys that need to be pivoted into a more reasonable format for analysis. The keys are much longer strings and the values are floats... hence my question. Thanks!
First table:
i entry_ID KEY_0 VALUE_0 KEY_1 VALUE_1 KEY_2 VALUE_2 KEY_3 VALUE_3 KEY_4 VALUE_4
0 A None NaN P 183.0 Q 238.0 None NaN R NaN
1 B N 886.0 None NaN None NaN None NaN R NaN
2 C N 156.0 P 905.0 Q 566.0 None NaN R NaN
3 D N 843.0 P 396.0 None NaN None NaN R NaN
4 E None NaN None NaN Q 118.0 None NaN R NaN
5 F N 719.0 P 721.0 Q 526.0 None NaN R NaN
6 G N 894.0 P 136.0 Q 438.0 None NaN R NaN
7 H None NaN P 646.0 None NaN None NaN R NaN
8 I N 447.0 P 978.0 Q 458.0 None NaN R NaN
9 J None NaN None NaN Q 390.0 None NaN R NaN
10 K O 843.0 P 745.0 Q 107.0 None NaN R NaN
11 L O 882.0 None NaN None NaN None NaN R NaN
12 M O 382.0 P 876.0 Q 829.0 None NaN R NaN
Second table:
i entry_ID N O P Q
0 A NaN NaN 183.0 238.0
1 B 886.0 NaN NaN NaN
2 C 156.0 NaN 905.0 566.0
3 D 843.0 NaN 396.0 NaN
4 E NaN NaN NaN 118.0
5 F 719.0 NaN 721.0 526.0
6 G 894.0 NaN 136.0 438.0
7 H NaN NaN 646.0 NaN
8 I 447.0 NaN 978.0 458.0
9 J NaN NaN NaN 390.0
10 K NaN 843.0 745.0 107.0
11 L NaN 882.0 NaN NaN
12 M NaN 382.0 876.0 829.0
Reproducible example to create the First Table above (requires Python 3, pandas, and numpy; tqdm is optional)...
import pandas, string, itertools, numpy, time, os
#from tqdm import tqdm
SOME_LETTERS = string.ascii_uppercase
N_KEYVAL_PAIRS = 100
SCALABLE = 3
entry_ID = [''.join(x) for x in list(itertools.permutations(SOME_LETTERS[:13], r=SCALABLE))] # for first 13 letters, n=154440 with r=5
source_keys = [''.join(x) for x in list(itertools.permutations(SOME_LETTERS[13:], r=SCALABLE))] # for first 13 letters, n=154440 with r=5
dick = dict()
dick['entry_ID'] = entry_ID
value_col_names = ['VALUE_' + str(x) for x in range(N_KEYVAL_PAIRS)]
key_col_names = ['KEY_' + str(x) for x in range(N_KEYVAL_PAIRS)]
list_of_cols = ['entry_ID']
source_key_count = 0
#for keycol, valcol in zip(tqdm(key_col_names), value_col_names):
for keycol, valcol in zip(key_col_names, value_col_names):
    dummy_values = numpy.random.randint(1, high=1000, size=len(entry_ID), dtype='l')
    n_not_null = int(len(entry_ID) * 0.75) # about 25% data is null
    n_nulls = len(entry_ID) - n_not_null
    dum_vals = numpy.concatenate((numpy.full(n_nulls, numpy.nan), dummy_values[:n_not_null]))
    numpy.random.shuffle(dum_vals) # in place!
    dummy_keys = numpy.full(len(dum_vals), source_keys[source_key_count], dtype=object)
    if numpy.isnan(dum_vals[0]):
        source_key_count = source_key_count + 1
        dum_keys = numpy.concatenate((dummy_keys[:n_not_null],
                                      numpy.full(n_nulls, source_keys[source_key_count], dtype=object)))
        #print('yes')
    else:
        dum_keys = dummy_keys
    numpy.place(dum_keys, numpy.isnan(dum_vals), [None]) # in place!
    source_key_count = source_key_count + 1
    dick[keycol] = dum_keys
    dick[valcol] = dum_vals
    list_of_cols.append(keycol)
    list_of_cols.append(valcol)
## Add example of both empty
empty_val = numpy.full(len(entry_ID), numpy.nan)
empty_key = numpy.full(len(entry_ID), None, dtype=object)
empty_k_col = 'KEY_' + str(N_KEYVAL_PAIRS + 0)
empty_v_col ='VALUE_' + str(N_KEYVAL_PAIRS + 0)
dick[empty_k_col] = empty_key
dick[empty_v_col] = empty_val
list_of_cols.append(empty_k_col)
list_of_cols.append(empty_v_col)
## Add example of empty val with key
emptyv_val = numpy.full(len(entry_ID), numpy.nan)
notempty_key = numpy.full(len(entry_ID), source_keys[source_key_count], dtype=object)
notempty_k_col = 'KEY_' + str(N_KEYVAL_PAIRS + 1)
emptyv_v_col ='VALUE_' + str(N_KEYVAL_PAIRS + 1)
dick[notempty_k_col] = notempty_key
dick[emptyv_v_col] = emptyv_val
list_of_cols.append(notempty_k_col)
list_of_cols.append(emptyv_v_col)
my_data = pandas.DataFrame(dick)
my_data = my_data[list_of_cols]
#print(my_data.to_string()) # printing can take some time
Here is my hacky attempt. It works (I think), but it takes so long, especially as the size of the table grows. I don't know what this is in big O, but it's bad. Like several minutes for my real data of 20k rows and 300 key-val pairs. And it consumes lots of RAM.
Run this snippet after the code above...
### PARSE KEY-VALUE PAIRS
# find KV pair columns
df_tgt = my_data
list_KEY_colnames = sorted(list(df_tgt.filter(regex='^KEY_[0-9]{1,3}$').columns))
list_VALUE_colnames = sorted(list(df_tgt.filter(regex='^VALUE_[0-9]{1,3}$').columns))
new_list_KEY_colnames = list_KEY_colnames
new_list_VALUE_colnames = list_VALUE_colnames
allkeys_withnan = pandas.unique(df_tgt[new_list_KEY_colnames].values.ravel()) # assume dupe names from multiple name cols will never be in the same row
allkeys = allkeys_withnan[pandas.notnull(allkeys_withnan)]
df_kv_parsed = pandas.DataFrame(index=df_tgt.index, columns=allkeys) # init
print(time.strftime("%H:%M:%S") + "\tSTARTING PIVOTING\t{}".format(str(os.getpid())))
##### START PIVOTING EACH PAIR ONE BY ONE UGH
#for each_key, each_value in zip(tqdm(new_list_KEY_colnames), new_list_VALUE_colnames):
for each_key, each_value in zip(new_list_KEY_colnames, new_list_VALUE_colnames):
    df_single_col_parsed = df_tgt.loc[:, [each_key, each_value]].dropna().pivot(columns=each_key, values=each_value)
    df_kv_parsed[df_single_col_parsed.columns.values] = df_single_col_parsed
print(time.strftime("%H:%M:%S") + "\tDONE PIVOTING\t{}".format(str(os.getpid())))
##### KILL ORIGINAL KV PAIRS
df_tgt.drop(list_KEY_colnames, axis=1, inplace=True)
df_tgt.drop(list_VALUE_colnames, axis=1, inplace=True)
##### MERGE WITH ORIGINAL AND THEN SAVE
df_fully_parsed = pandas.concat([df_tgt, df_kv_parsed], axis=1, ignore_index=False)
print(time.strftime("%H:%M:%S") + "\tDONE MERGING\t{}".format(str(os.getpid())))
## REMOVE NULL COLUMNS
df_fully_parsed.dropna(axis=1, how='all', inplace=True)
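For what it's worth, one way to avoid pivoting the pairs one by one is to melt the KEY_ and VALUE_ columns into long format and pivot a single time. This is only a sketch that relies on the rules above holding (keys and values pair by their numeric suffix, and a given key never appears in more than one key column, so each (entry_ID, key) combination is unique):

key_cols = my_data.filter(regex=r'^KEY_\d+$').columns
val_cols = my_data.filter(regex=r'^VALUE_\d+$').columns

keys = my_data.melt(id_vars='entry_ID', value_vars=key_cols, var_name='pair', value_name='key')
vals = my_data.melt(id_vars='entry_ID', value_vars=val_cols, var_name='pair', value_name='value')

# align KEY_### with VALUE_### by their numeric suffix
keys['pair'] = keys['pair'].str.replace('KEY_', '', regex=False)
vals['pair'] = vals['pair'].str.replace('VALUE_', '', regex=False)

long_form = keys.merge(vals, on=['entry_ID', 'pair']).dropna(subset=['key', 'value'])
wide = long_form.pivot(index='entry_ID', columns='key', values='value').reset_index()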
I have a dataset:
367235 419895 992194
1999-01-11 8 5 1
1999-03-23 NaN 4 NaN
1999-04-30 NaN NaN 1
1999-06-02 NaN 9 NaN
1999-08-08 2 NaN NaN
1999-08-12 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-12-04 NaN NaN 4
2000-03-04 2 NaN NaN
2000-09-29 9 NaN NaN
2000-09-30 9 NaN NaN
When I plot it, using plt.plot(df, '-o') I get this:
But what I would like is for the datapoints from each column to be connected in a line, like so:
I understand that matplotlib does not connect datapoints that are separated by NaN values. I looked at all the options here for dealing with missing data, but all of them would essentially misrepresent the data in the dataframe. This is because each value within the dataframe represents an incident; if I try to replace the NaNs with scalar values or use the interpolate option, I get a bunch of points that are not actually in my dataset. Here's what interpolate looks like:
df_wanted2 = df.apply(pd.Series.interpolate)
If I try to use dropna I'll lose entire rows/columns from the dataframe, and these rows hold valuable data.
Does anyone know a way to connect up my dots? I suspect I need to extract individual arrays from the dataframe and plot them, as is the advice given here, but this seems like a lot of work (and my actual dataframe is much bigger). Does anyone have a solution?
Use the interpolate method with the parameter 'index':
df.interpolate('index').plot(marker='o')
alternative answer
plot after iteritems
for _, c in df.iteritems():
    c.dropna().plot(marker='o')
extra credit
only interpolate from first valid index to last valid index for each column
for _, c in df.iteritems():
    fi, li = c.first_valid_index(), c.last_valid_index()
    c.loc[fi:li].interpolate('index').plot(marker='o')
Try iterating through the columns with apply, and inside the applied function drop the missing values:
def make_plot(s):
    s.dropna().plot()

df.apply(make_plot)
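A plain-matplotlib variant of the same idea, as a sketch: drop the NaNs per column before plotting, so only the real observations are connected.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for col in df.columns:
    s = df[col].dropna()                       # keep only actual observations
    ax.plot(s.index, s.values, '-o', label=col)
ax.legend()
plt.show()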
An alternative would be to outsource the NaN handling to the graphing library Plotly, using its connectgaps option.
import plotly
import pandas as pd
txt = """367235 419895 992194
1999-01-11 8 5 1
1999-03-23 NaN 4 NaN
1999-04-30 NaN NaN 1
1999-06-02 NaN 9 NaN
1999-08-08 2 NaN NaN
1999-08-12 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-12-04 NaN NaN 4
2000-03-04 2 NaN NaN
2000-09-29 9 NaN NaN
2000-09-30 9 NaN NaN"""
data_points = [line.split() for line in txt.splitlines()[1:]]
df = pd.DataFrame(data_points)
df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors='coerce')  # turn the 'NaN' strings into real gaps
data = list()
for i in range(1, len(df.columns)):
    data.append(plotly.graph_objs.Scatter(
        x=df.iloc[:, 0].tolist(),
        y=df.iloc[:, i].tolist(),
        mode='lines',
        connectgaps=True
    ))
fig = dict(data=data)
plotly.plotly.sign_in('user', 'token')
plot_url = plotly.plotly.plot(fig)
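If you don't have a Plotly account, the same figure dict can also be rendered locally with the offline module instead of signing in (a sketch; it writes an HTML file next to the script):

import plotly.offline
plotly.offline.plot(fig, filename='connectgaps.html')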