Pandas: Multiple Pivots for Key Value Pairs... Faster way? - python

How can I transform the first table into the second table (fast and < 8 GB RAM)?
I have data that's stored in a database in a "key-value pair" format. When I query the data, I get back this "flattened" kind of table, and I need to unflatten it by essentially pivoting each pair of key-value columns. It's easier to understand if you look at my example tables below.
Unlike other related questions, I have some different rules/restrictions:
I don't know which keys, values, or key-value pairs will be missing before I query the data
key-value column names are always identified by "KEY_###" and "VALUE_###" where ### is some number
the ### number identifies the pair (e.g., KEY_1 pairs with VALUE_1; KEY_2 does not pair with VALUE_3)
the KEY's are always alphanumeric characters or missing
the VALUE's are always numbers or missing
not all rows have every key-value pair (e.g., in the table below, entry_ID B has the key N, but A does not)
sometimes there is a key but the value is missing (e.g., in table below, KEY_4 is not missing but VALUE_4 is missing)
if there is a non-missing value, then there will be a non-missing key
There may be more than 1 unique key name in a key column (e.g., KEY_0 has N and O)
If a key shows up in one key column, it will not be in any other key column (e.g., KEY_0 has both N and O, but neither show up in any other KEY_### column)
To bring down the level of abstraction... think of my data as the outcomes of a battery of tests on unique samples, where not all samples get the same tests. The row entry_ID represents a unique sample, the key represents one of the tests, and the value is the outcome of that test. Some samples get a test but don't complete it, so the outcome is missing. For example, in the table below, sample A got tests P and Q, but sample B only got test N.
However, the tests change over time while the database does not, so tests and outcomes keep being uploaded into the same named columns. This prevents me from simply renaming the columns (e.g., for KEY_1/VALUE_1 I could rename VALUE_1 to "P" and be done, but that would not work for KEY_0/VALUE_0). The example below is a simplified, smaller case.
My typical queries return at least 10k rows with at least 300 key-value pair columns holding more than 300 unique keys, all of which need to be pivoted into a more reasonable format for analysis. The keys are much longer strings and the values are floats... hence my question. Thanks!
First table:
i entry_ID KEY_0 VALUE_0 KEY_1 VALUE_1 KEY_2 VALUE_2 KEY_3 VALUE_3 KEY_4 VALUE_4
0 A None NaN P 183.0 Q 238.0 None NaN R NaN
1 B N 886.0 None NaN None NaN None NaN R NaN
2 C N 156.0 P 905.0 Q 566.0 None NaN R NaN
3 D N 843.0 P 396.0 None NaN None NaN R NaN
4 E None NaN None NaN Q 118.0 None NaN R NaN
5 F N 719.0 P 721.0 Q 526.0 None NaN R NaN
6 G N 894.0 P 136.0 Q 438.0 None NaN R NaN
7 H None NaN P 646.0 None NaN None NaN R NaN
8 I N 447.0 P 978.0 Q 458.0 None NaN R NaN
9 J None NaN None NaN Q 390.0 None NaN R NaN
10 K O 843.0 P 745.0 Q 107.0 None NaN R NaN
11 L O 882.0 None NaN None NaN None NaN R NaN
12 M O 382.0 P 876.0 Q 829.0 None NaN R NaN
Second table:
i entry_ID N O P Q
0 A NaN NaN 183.0 238.0
1 B 886.0 NaN NaN NaN
2 C 156.0 NaN 905.0 566.0
3 D 843.0 NaN 396.0 NaN
4 E NaN NaN NaN 118.0
5 F 719.0 NaN 721.0 526.0
6 G 894.0 NaN 136.0 438.0
7 H NaN NaN 646.0 NaN
8 I 447.0 NaN 978.0 458.0
9 J NaN NaN NaN 390.0
10 K NaN 843.0 745.0 107.0
11 L NaN 882.0 NaN NaN
12 M NaN 382.0 876.0 829.0
Reproducible example to create the First Table above (requires Python 3, pandas, and numpy; tqdm is optional)...
import pandas, string, itertools, numpy, time, os
#from tqdm import tqdm
SOME_LETTERS = string.ascii_uppercase
N_KEYVAL_PAIRS = 100
SCALABLE = 3
entry_ID = [''.join(x) for x in list(itertools.permutations(SOME_LETTERS[:13], r=SCALABLE))] # for first 13 letters, n=154440 with r=5
source_keys = [''.join(x) for x in list(itertools.permutations(SOME_LETTERS[13:], r=SCALABLE))] # for the last 13 letters, n=154440 with r=5
dick = dict()
dick['entry_ID'] = entry_ID
value_col_names = ['VALUE_' + str(x) for x in range(N_KEYVAL_PAIRS)]
key_col_names = ['KEY_' + str(x) for x in range(N_KEYVAL_PAIRS)]
list_of_cols = ['entry_ID']
source_key_count = 0
#for keycol, valcol in zip(tqdm(key_col_names), value_col_names):
for keycol, valcol in zip(key_col_names, value_col_names):
    dummy_values = numpy.random.randint(1, high=1000, size=len(entry_ID), dtype='l')
    n_not_null = int(len(entry_ID) * 0.75) # about 25% data is null
    n_nulls = len(entry_ID) - n_not_null
    dum_vals = numpy.concatenate((numpy.full(n_nulls, numpy.nan), dummy_values[:n_not_null]))
    numpy.random.shuffle(dum_vals) # in place!
    dummy_keys = numpy.full(len(dum_vals), source_keys[source_key_count], dtype=object)
    if numpy.isnan(dum_vals[0]):
        source_key_count = source_key_count + 1
        dum_keys = numpy.concatenate((dummy_keys[:n_not_null],
                                      numpy.full(n_nulls, source_keys[source_key_count], dtype=object)))
        #print('yes')
    else:
        dum_keys = dummy_keys
    numpy.place(dum_keys, numpy.isnan(dum_vals), [None]) # in place!
    source_key_count = source_key_count + 1
    dick[keycol] = dum_keys
    dick[valcol] = dum_vals
    list_of_cols.append(keycol)
    list_of_cols.append(valcol)
## Add example of both empty
empty_val = numpy.full(len(entry_ID), numpy.nan)
empty_key = numpy.full(len(entry_ID), None, dtype=object)
empty_k_col = 'KEY_' + str(N_KEYVAL_PAIRS + 0)
empty_v_col ='VALUE_' + str(N_KEYVAL_PAIRS + 0)
dick[empty_k_col] = empty_key
dick[empty_v_col] = empty_val
list_of_cols.append(empty_k_col)
list_of_cols.append(empty_v_col)
## Add example of empty val with key
emptyv_val = numpy.full(len(entry_ID), numpy.nan)
notempty_key = numpy.full(len(entry_ID), source_keys[source_key_count], dtype=object)
notempty_k_col = 'KEY_' + str(N_KEYVAL_PAIRS + 1)
emptyv_v_col ='VALUE_' + str(N_KEYVAL_PAIRS + 1)
dick[notempty_k_col] = notempty_key
dick[emptyv_v_col] = emptyv_val
list_of_cols.append(notempty_k_col)
list_of_cols.append(emptyv_v_col)
my_data = pandas.DataFrame(dick)
my_data = my_data[list_of_cols]
#print(my_data.to_string()) # printing can take some time
Here is my hacky attempt. It works (I think), but it takes far too long, especially as the table grows. I don't know what this is in big-O terms, but it's bad: several minutes for my real data of 20k rows and 300 key-value pairs, and it consumes a lot of RAM.
Run this snippet after the code above...
### PARSE KEY-VALUE PAIRS
# find KV pair columns
df_tgt = my_data
list_KEY_colnames = sorted(list(df_tgt.filter(regex='^KEY_[0-9]{1,3}$').columns))
list_VALUE_colnames = sorted(list(df_tgt.filter(regex='^VALUE_[0-9]{1,3}$').columns))
new_list_KEY_colnames = list_KEY_colnames
new_list_VALUE_colnames = list_VALUE_colnames
allkeys_withnan = pandas.unique(df_tgt[new_list_KEY_colnames].values.ravel()) # assume dupe names from multiple name cols will never be in the same row
allkeys = allkeys_withnan[pandas.notnull(allkeys_withnan)]
df_kv_parsed = pandas.DataFrame(index=df_tgt.index, columns=allkeys) # init
print(time.strftime("%H:%M:%S") + "\tSTARTING PIVOTING\t{}".format(str(os.getpid())))
##### START PIVOTING EACH PAIR ONE BY ONE UGH
#for each_key, each_value in zip(tqdm(new_list_KEY_colnames), new_list_VALUE_colnames):
for each_key, each_value in zip(new_list_KEY_colnames, new_list_VALUE_colnames):
    df_single_col_parsed = df_tgt.loc[:, [each_key, each_value]].dropna().pivot(columns=each_key, values=each_value)
    df_kv_parsed[df_single_col_parsed.columns.values] = df_single_col_parsed
print(time.strftime("%H:%M:%S") + "\tDONE PIVOTING\t{}".format(str(os.getpid())))
##### KILL ORIGINAL KV PAIRS
df_tgt.drop(list_KEY_colnames, axis=1, inplace=True)
df_tgt.drop(list_VALUE_colnames, axis=1, inplace=True)
##### MERGE WITH ORIGINAL AND THEN SAVE
df_fully_parsed = pandas.concat([df_tgt, df_kv_parsed], axis=1, ignore_index=False)
print(time.strftime("%H:%M:%S") + "\tDONE MERGING\t{}".format(str(os.getpid())))
## REMOVE NULL COLUMNS
df_fully_parsed.dropna(axis=1, how='all', inplace=True)
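One direction that may be faster than pivoting pair by pair (a sketch only, not from the original post and not benchmarked): flatten every KEY_i/VALUE_i cell into one long key/value frame with numpy and pivot once. It relies on the stated rule that a given key never appears more than once per entry_ID, and it should be run on a fresh my_data from the data-generation snippet, since the attempt above drops the KEY/VALUE columns from it in place.
# assumes pandas, numpy and my_data from the snippet above are in scope
key_cols = sorted(my_data.filter(regex='^KEY_[0-9]+$').columns, key=lambda c: int(c.split('_')[1]))
val_cols = [c.replace('KEY_', 'VALUE_') for c in key_cols]
# flatten all keys/values row by row and repeat entry_ID to match
keys = my_data[key_cols].to_numpy().ravel()
vals = my_data[val_cols].to_numpy().ravel()
ids = numpy.repeat(my_data['entry_ID'].to_numpy(), len(key_cols))
# keep only cells where both the key and the value are present
mask = pandas.notnull(keys) & pandas.notnull(vals)
long_df = pandas.DataFrame({'entry_ID': ids[mask], 'key': keys[mask], 'value': vals[mask]})
# one pivot instead of one pivot per pair (valid because a key never repeats within an entry_ID)
df_wide = long_df.pivot(index='entry_ID', columns='key', values='value').reset_index()
The result comes back sorted by entry_ID rather than in the original row order, so it may need reindexing if the order matters.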

Related

How to assign/change values to top N values in dataframe using nlargest?

So using .nlargest I can get top N values from my dataframe.
Now if I run the following code:
df.nlargest(25, 'Change')['TopN']='TOP 25'
I expect this to change all affected rows in the TopN column to TOP 25. But somehow this assignment does not work and those rows remain unaffected. What am I doing wrong?
Assuming you really want the TOPN (limited to N values as nlargest would do), use the index from df.nlargest(25, 'Change') and loc:
df.loc[df.nlargest(25, 'Change').index, 'TopN'] = 'TOP 25'
Note the difference with the other approach that will give you all matching values:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'
Highlighting the difference:
df = pd.DataFrame({'Change': [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]})
df.loc[df.nlargest(4, 'Change').index, 'TOP4 (A)'] = 'X'
df.loc[df['Change'].isin(df['Change'].nlargest(4)), 'TOP4 (B)'] = 'X'
output:
Change TOP4 (A) TOP4 (B)
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 X X
4 5 X X
5 1 NaN NaN
6 2 NaN NaN
7 3 NaN NaN
8 4 NaN X
9 5 X X
10 1 NaN NaN
11 2 NaN NaN
12 3 NaN NaN
13 4 NaN X
14 5 X X
One thing to be aware of is that nlargest does not return ties by default. For example, if five rows share the 25th-ranked value of Change, nlargest would still return only 25 rows rather than 29, unless you set the keep parameter to 'all'.
Using this parameter, it would be possible to identify the top 25 as
df.loc[df.nlargest(25, 'Change', 'all').index, 'TopN'] = 'Top 25'
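To make the tie behaviour concrete, here is a small illustration with made-up data (only the column name Change is taken from the question):
df = pd.DataFrame({'Change': [5, 4, 3, 3, 1]})
df.nlargest(3, 'Change')              # 3 rows: 5, 4 and the first 3
df.nlargest(3, 'Change', keep='all')  # 4 rows: both 3s are kept because they tie on the cutoff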
If instead you want to compare against the top-25 values across the whole column (keeping every row whose value matches one of them), the solution is:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'

Read in .dat file with headers throughout

I'm trying to read in a .dat file, but it's made up of chunks of non-columnar data with headers throughout.
I've tried reading it in in pandas:
new_df = pd.read_csv(os.path.join(pathname, item), delimiter='\t', skiprows = 2)
And it helpfully comes out like this:
Cyclic Acquisition Unnamed: 1 Unnamed: 2 24290-15 Y Unnamed: 4 \
0 Stored at: 100 cycle NaN NaN
1 Points: 2 NaN NaN NaN
2 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
3 in lbf s segments NaN
4 -0.036677472 -149.27879 19.976563 198 NaN
5 0.031659406 149.65636 20.077148 199 NaN
6 Cyclic Acquisition NaN NaN 24290-15 Y NaN
7 Stored at: 200 cycle NaN NaN
8 Points: 2 NaN NaN NaN
9 Ch 2 Displacement Ch 2 Force Time Ch 2 Count NaN
10 in lbf s segments NaN
11 -0.036623772 -149.73801 39.975586 398 NaN
12 0.031438459 149.48193 40.078125 399 NaN
13 Cyclic Acquisition NaN NaN 24290-15 Y NaN
14 Stored at: 300 cycle NaN NaN
Do I need to resort to np.genfromtxt() or is there a panda-riffic way to accomplish this?
I developed a work-around. I needed the Displacement data pairs, as well as some data that was all divisible evenly by 100.
To get to the Displacement data, I first pretended 'Cyclic Acquisition' was a valid column name, converted its values to numeric with errors coerced, and kept only the entries that actually came out as numbers:
displacement = new_df['Cyclic Acquisition'][pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce').notnull()]
4 -0.036677472
5 0.031659406
11 -0.036623772
12 0.031438459
Then, because the remaining values were paired low and high readings that needed to be operated on together, I selected every other value starting with the 0th value for the "low" values, and used the same logic offset by one for the "high" values. I reset the index because my plan was to build a new DataFrame from these pieces, and I wanted the values to keep their relationship to each other.
displacement_low = displacement[::2].reset_index(drop = True)
0 -0.036677472
1 -0.036623772
displacement_high = displacement[1::2].reset_index(drop = True)
0 0.031659406
1 0.031438459
Then, to get the cycles, I followed the same basic principle to get that column down to just numbers, put the values into a list, used a list comprehension to keep only the values evenly divisible by 100, and converted the result back to a Series.
cycles = new_df['Unnamed: 1'][pd.to_numeric(new_df['Unnamed: 1'], errors='coerce').notnull()].astype('float').tolist()
[100.0, 2.0, -149.27879, 149.65636, 200.0, 2.0, -149.73801, 149.48193...]
cycles = pd.Series([val for val in cycles if val%100 == 0])
0 100.0
1 200.0
...
I then created a new df with that data and named the columns as desired:
df = pd.concat([displacement_low, displacement_high, cycles], axis = 1)
df.columns = ['low', 'high', 'cycles']
low high cycles
0 -0.036677 0.031659 100.0
1 -0.036624 0.031438 200.0
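Putting the steps together, a compact version of the same workaround might look like this (a sketch; it keeps the displacement values as floats instead of strings and assumes the same column labels as above):
num_disp = pd.to_numeric(new_df['Cyclic Acquisition'], errors='coerce')
displacement = num_disp[num_disp.notnull()]
displacement_low = displacement[::2].reset_index(drop=True)
displacement_high = displacement[1::2].reset_index(drop=True)
num_cyc = pd.to_numeric(new_df['Unnamed: 1'], errors='coerce')
cycles = num_cyc[num_cyc % 100 == 0].reset_index(drop=True)
df = pd.concat([displacement_low, displacement_high, cycles], axis=1)
df.columns = ['low', 'high', 'cycles']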

Remove duplicate values in a pandas column, but ignore one value

I'm sure there is an elegant solution for this, but I cannot find one. In a pandas dataframe, how do I remove all duplicate values in a column while ignoring one value?
repost_of_post_id title
0 7139471603 Man with an RV needs a place to park for a week
1 6688293563 Land for lease
2 None 2B/1.5B, Dishwasher, In Lancaster
3 None Looking For Convenience? Check Out Cordova Par...
4 None 2/bd 2/ba, Three Sparkling Swimming Pools, Sit...
5 None 1 bedroom w/Closet is bathrooms in Select Unit...
6 None Controlled Access/Gated, Availability 24 Hours...
7 None Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent
8 7143099582 Need Help Getting Approved?
9 None *MOVE IN READY APT* REQUEST TOUR TODAY!
What I want is to keep all None values in repost_of_post_id, but omit any duplicates of the numerical values, for example if there are duplicates of 7139471603 in the dataframe.
[UPDATE]
I got the desired outcome using this script, but I would like to accomplish this in a one-liner, if possible.
# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned
ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")
ca_housing_unique = ca_housing_repost_none.append(ca_housing_repost_not_none_unique)
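For reference, the same logic can be written in a single line (a sketch, assuming the missing ids really are stored as the string "None" as shown above):
ca_housing_unique = ca_housing[(ca_housing['repost_of_post_id'] == "None") | ~ca_housing.duplicated(subset='repost_of_post_id')]
Unlike the append-based version, this keeps the original row order.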
You could try dropping the None values, then detecting duplicates, then filtering them out of the original DataFrame.
In [1]: import pandas as pd
...: from string import ascii_lowercase
...:
...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5]
...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])})
...: print(df)
...:
...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
6 2.0 g
7 3.0 h
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
You could use drop_duplicates and merge with the NaNs as follows:
df_cleaned = df.drop_duplicates('post_id', keep='first').merge(df[df.post_id.isnull()], how='outer')
This will keep the first occurrence of each duplicated id and all NaN rows.
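Applied to the question's frame, where the missing ids are the literal string "None" rather than NaN, the equivalent would be something like (a sketch, untested against the real data):
df_cleaned = ca_housing.drop_duplicates('repost_of_post_id', keep='first').merge(
    ca_housing[ca_housing['repost_of_post_id'] == "None"], how='outer')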

How to purify pandas dataframe efficiently?

It's hard for me to put my question into the proper words, so thank you for reading it.
I have a dataframe with two columns, high and low, which record the higher values and lower values.
For example:
high low
0 NaN NaN
1 100.0 NaN
2 NaN 50.0
3 110.0 NaN
4 NaN NaN
5 120.0 NaN
6 100.0 NaN
7 NaN NaN
8 NaN 30.0
9 NaN NaN
10 NaN 20.0
11 NaN NaN
12 110.0 NaN
13 NaN NaN
I want to merge the continuous ones (on the same side), and leave the highest (lowest) one.
"the continuous ones" means the values in the high column between two values in the low column, or the values in the low column between two values in the high column
The high values on index 3, 5, 6 should be merged, and the highest value on index 5 (the value 120) should be left.
The low values on index 8, 10 should be merged, and the lowest value on index 10 (the value 20) should be left.
The result is like that:
high low
0 NaN NaN
1 100.0 NaN
2 NaN 50.0
3 NaN NaN
4 NaN NaN
5 120.0 NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN 20.0
11 NaN NaN
12 110.0 NaN
13 NaN NaN
I tried to write a for loop to handle the data, but it was very slow when the data is large (more than 10,000 rows).
The code is:
import pandas as pd
data=pd.DataFrame(dict(high=[None,100,None,110,None,120,100,None,None,None,None,None,110,None],
low=[None,None,50,None,None,None,None,None,30,None,20,None,None,None]))
flag = None
flag_index = None
for i in range(len(data)):
    if not pd.isna(data['high'][i]):
        if flag == 'flag_high':
            higher = data['high'].iloc[[i, flag_index]].idxmax()
            lower = flag_index if i == higher else i
            flag_index = higher
            data['high'][lower] = None
        else:
            flag = 'flag_high'
            flag_index = i
    elif not pd.isna(data['low'][i]):
        if flag == 'flag_low':
            lower = data['low'].iloc[[i, flag_index]].idxmin()
            higher = flag_index if i == lower else i
            flag_index = lower
            data['low'][higher] = None
        else:
            flag = 'flag_low'
            flag_index = i
Is there any efficient way to do that?
Thank you
For line-oriented, iterative processing like this, pandas usually does a poor job, or more exactly is not efficient at all. But you can always process the underlying numpy arrays directly:
import pandas as pd
import numpy as np
data=pd.DataFrame(dict(high=[None,100,None,110,None,120,100,None,None,None,None,None,110,None],
low=[None,None,50,None,None,None,None,None,30,None,20,None,None,None]))
npdata = data.values
flag = None
flag_index = None
for i in range(len(npdata)):
    if not np.isnan(npdata[i][0]):
        if flag == 'flag_high':
            if npdata[i][0] > npdata[flag_index][0]:
                npdata[flag_index][0] = np.nan
                flag_index = i
            else:
                npdata[i][0] = np.nan
        else:
            flag = 'flag_high'
            flag_index = i
    elif not np.isnan(npdata[i][1]):
        if flag == 'flag_low':
            if npdata[i][1] < npdata[flag_index][1]:
                npdata[flag_index][1] = np.nan
                flag_index = i
            else:
                npdata[i][1] = np.nan
        else:
            flag = 'flag_low'
            flag_index = i
In my test it is close to 10 times faster.
The larger the dataframe, the higher the gain: at 1500 rows, using the numpy arrays directly is 30 times faster.
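If you want to drop the Python-level loop entirely, here is a fully vectorized sketch of the same idea (my own variation, not part of the original answer and not benchmarked): label consecutive rows that sit on the same side, then keep only the extreme of each run. It assumes, as in the example, that rows with neither a high nor a low value do not break a run.
import numpy as np
import pandas as pd
# data is the same DataFrame built above
present = data['high'].notna() | data['low'].notna()
sub = data.loc[present]
is_high = sub['high'].notna()
run_id = (is_high != is_high.shift()).cumsum()  # label consecutive same-side runs
keep_high = sub.loc[is_high, 'high'].groupby(run_id[is_high]).idxmax()
keep_low = sub.loc[~is_high, 'low'].groupby(run_id[~is_high]).idxmin()
result = pd.DataFrame(np.nan, index=data.index, columns=['high', 'low'])
result.loc[keep_high, 'high'] = data.loc[keep_high, 'high']
result.loc[keep_low, 'low'] = data.loc[keep_low, 'low']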

How to append new columns to a pandas groupby object from a list of values

I want to write a script that takes the series values from a column, splits them into strings, and makes a new column for each of the resulting strings (filled with NaN for now). Since the df is grouped by Column1, I want to do this for every group.
My input data frame looks like this:
df1:
Column1 Column2
0 L17 a,b,c,d,e
1 L7 a,b,c
2 L6 a,b,f
3 L6 h,d,e
What I finally want to have is:
Column1 Column2 a b c d e f h
0 L17 a,b,c,d,e nan nan nan nan nan nan nan
1 L7 a,b,c nan nan nan nan nan nan nan
2 L6 a,b,f nan nan nan nan nan nan nan
My code currently looks like this:
def NewCols(x):
    for item, frame in group['Column2'].iteritems():
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
            return x
df1.groupby('Column1').apply(NewCols)
My thought behind this was that the code loops through Column2 of every grouped object, splitting the values contained in frame at the commas and creating a list for that group. So far the code works fine. Then I added
for value in Genes:
    string = value
    x[string] = np.nan
    return x
with the intention of adding a new column for every value contained in the list Genes. However, my output looks like this:
Column1 Column2 d
0 L17 a,b,c,d,e nan
1 L7 a,b,c nan
2 L6 a,b,f nan
3 L6 h,d,e nan
and I am pretty much struck dumb. Can someone explain why only one column gets appended (which is not even named after the first value in the first list of the first group) and suggest how I could improve my code?
I think you just return too early in your function, before the end of the two loops. If you dedent the return by two levels like this:
def NewCols(x):
    for item, frame in group['Column2'].iteritems():
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
    return x
UngroupedResGenesLineage.groupby('Column1').apply(NewCols)
It should work fine!
cols = sorted(list(set(df1['Column2'].apply(lambda x: x.split(',')).sum())))
df = df1.groupby('Column1').agg(lambda x: ','.join(x)).reset_index()
pd.concat([df,pd.DataFrame({c:np.nan for c in cols}, index=df.index)], axis=1)
Column1 Column2 a b c d e f h
0 L17 a,b,c,d,e NaN NaN NaN NaN NaN NaN NaN
1 L6 a,b,f,h,d,e NaN NaN NaN NaN NaN NaN NaN
2 L7 a,b,c NaN NaN NaN NaN NaN NaN NaN
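A slightly shorter variation on the same idea (a sketch, not from the answer above), using str.split to collect the new column names and reindex to add them filled with NaN:
cols = sorted(set(df1['Column2'].str.split(',').sum()))
out = df1.groupby('Column1', as_index=False).agg({'Column2': ','.join})
out = out.reindex(columns=list(out.columns) + cols)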
