Pandas reshape a multicolumn dataframe long to wide with conditional check - python

I have a pandas data frame as follows:
id   group  type  action   cost
101  A      1              10
101  A      1     repair    3
102  B      1               5
102  B      1     repair    7
102  B      1     grease    2
102  B      1     inflate   1
103  A      2              12
104  B      2               9
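For reference, the frame above can be reproduced like this (a sketch; it assumes the blank action cells are empty strings, use np.nan instead if they are missing values):
import pandas as pd

df = pd.DataFrame({
    "id":     [101, 101, 102, 102, 102, 102, 103, 104],
    "group":  ["A", "A", "B", "B", "B", "B", "A", "B"],
    "type":   [1, 1, 1, 1, 1, 1, 2, 2],
    "action": ["", "repair", "", "repair", "grease", "inflate", "", ""],
    "cost":   [10, 3, 5, 7, 2, 1, 12, 9],
})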
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
id   group  type  action_std  action_extra
101  A      1     10          3
102  B      1     5           10
103  A      2     12          0
104  B      2     9           0
In other words, for rows with an empty action field the cost value should go under the action_std column, while for rows with a non-empty action field the cost values should be summed under the action_extra column.
I've tried several combinations of groupby / agg / pivot, but I can't find a fully working solution...

I would suggest you simply split the cost column into a cost and a cost_extra column. Something like the following:
import numpy as np

result = df.assign(
    cost_extra=lambda df: np.where(
        df['action'].notnull(), df['cost'], np.nan
    )
).assign(
    cost=lambda df: np.where(
        df['action'].isnull(), df['cost'], np.nan
    )
).groupby(
    ["id", "group", "type"]
)[["cost", "cost_extra"]].agg(
    "sum"
)
result looks like:
cost cost_extra
id group type
101 A 1 10.0 3.0
102 B 1 5.0 10.0
103 A 2 12.0 0.0
104 B 2 9.0 0.0
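To match the requested layout exactly, you could then flatten the index and rename the columns (a small follow-up, not part of the answer above):
result = result.rename(columns={"cost": "action_std", "cost_extra": "action_extra"}).reset_index()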

Check groupby with unstack
df.cost.groupby([df.id,df.group,df.type,df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
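If you want the column names from the question, you can relabel the two unstacked columns afterwards (a small addition to the answer above; it assumes the blank action cells are empty strings):
out = (
    df.cost.groupby([df.id, df.group, df.type, df.action.eq('')])
      .sum()
      .unstack(fill_value=0)
      .rename(columns={True: 'action_std', False: 'action_extra'})
      .reset_index()
)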

Thanks for your hints, this is the solution that I finally like the most (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])[["action_std", "action_extra"]].sum().reset_index()
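If the blank action cells are stored as NaN rather than empty strings, the same idea works with isna()/notna() in place of the string comparison:
df["action_std"] = df["cost"].where(df["action"].isna())
df["action_extra"] = df["cost"].where(df["action"].notna())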


pandas best way to process string column (not split)

INTRODUCTION TO PROBLEM
I have data encoded in string in one DataFrame column:
id data
0 a 2;0;4208;1;790
1 b 2;0;768;1;47
2 c 2;0;92;1;6
3 d 1;0;341
4 e 3;0;1;2;6;4;132
5 f 3;0;1;1;6;3;492
The data represents counts of how many times certain events happened in our system. There can be 256 different events (each has a numerical id from the range 0-255). Since usually only a few events happen in one measurement period, it doesn't make sense to store all the zeros. That's why the data is encoded as follows: the first number tells how many events happened during the measurement period, then each pair contains an event_id and a counter.
For example:
"3;0;1;1;6;3;492" means:
3 events happened in measurement period
event with id=0 happened 1 time
event with id=1 happened 6 times
event with id=3 happened 492 time
other events didn't happen
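To make the layout concrete, here is a minimal sketch that decodes one such string into an {event_id: count} dict (decode_one is a hypothetical helper used only for illustration):
def decode_one(s: str) -> dict:
    parts = s.split(';')
    n = int(parts[0])  # number of (event_id, counter) pairs that follow
    return {int(parts[2*i + 1]): int(parts[2*i + 2]) for i in range(n)}

print(decode_one("3;0;1;1;6;3;492"))  # {0: 1, 1: 6, 3: 492}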
I need to decode the data to separate columns. Expected result is DataFrame which looks like this:
id data_0 data_1 data_2 data_3 data_4
0 a 4208.0 790.0 0.0 0.0 0.0
1 b 768.0 47.0 0.0 0.0 0.0
2 c 92.0 6.0 0.0 0.0 0.0
3 d 341.0 0.0 0.0 0.0 0.0
4 e 1.0 0.0 6.0 0.0 132.0
5 f 1.0 6.0 0.0 492.0 0.0
QUESTION ITSELF
I came up with the following function to do it:
import pandas as pd

def split_data(data: pd.Series):
    tmp = data.str.split(';', expand=True).astype('Int32').fillna(-1)
    tmp = tmp.apply(
        lambda row: {'{0}_{1}'.format(data.name, row[i*2-1]): row[i*2] for i in range(1, row[0]+1)},
        axis='columns',
        result_type='expand'
    ).fillna(0)
    return tmp
df = pd.concat([df, split_data(df.pop('data'))], axis=1)
The problem is that I have millions of rows to process and it takes A LOT of time.
As I don't have that much experience with pandas, I hope someone will be able to help me find a more efficient way of performing this task.
EDIT - ANSWER ANALYSIS
Ok, so I took all three answers and performed some benchmarking :) .
Starting conditions: I already have a DataFrame (this will be important!).
As expected all of them were waaaaay faster than my code.
For example, for 15 rows with 1000 repeats in timeit:
my code: 0.5827s
Schalton's code: 0.1138s
Shubham's code: 0.2242s
SomeDudes's code: 0.2219s
Seems like Schalton's code wins!
However... for 1500 rows with 50 repeats:
my code: 31.1139s
Schalton's code: 2.4599s
Shubham's code: 0.511s
SomeDudes's code: 17.15s
I decided to check once more, this time only one attempt but for 150 000 rows:
my code: 68.6798s
Schalton's code: 6.3889s
Shubham's code: 0.9520s
SomeDudes's code: 37.8837s
An interesting thing happens: as the size of the DataFrame gets bigger, all versions except Shubham's take much longer! The two fastest are Schalton's and Shubham's versions. This is where the starting point matters! I already have an existing DataFrame, so I have to convert it to a dictionary. The dictionary itself is processed really fast; the conversion, however, takes time. Shubham's solution is more or less independent of size! Schalton's works very well for small data sets, but due to the conversion to dict it gets much slower for large amounts of data.
Another comparison, this time 150000 rows with 30 repeats:
Schalton's code: 170.1538s
Shubham's code: 36.32s
However for 15 rows with 30000 repeats:
Schalton's code: 50.4997s
Shubham's code: 74.0916s
SUMMARY
In the end the choice between Schalton's version and Shubham's depends on the use case:
for a large number of small DataFrames (or with a dictionary at the start) go with Schalton's solution
for very large DataFrames go with Shubham's solution
As mentioned above, I have data sets of around 1 mln rows and more, thus I will go with Shubham's answer.
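For context, a benchmark along these lines can be reproduced with timeit; this is only a sketch of how such a harness might look, with make_df and candidate as hypothetical stand-ins rather than the exact setup used above:
import timeit
import pandas as pd

base = {"a": "2;0;4208;1;790", "b": "2;0;768;1;47", "c": "2;0;92;1;6",
        "d": "1;0;341", "e": "3;0;1;2;6;4;132", "f": "3;0;1;1;6;3;492"}

def make_df(n_rows: int) -> pd.DataFrame:
    # build an n-row frame by repeating the 6 sample rows
    ids, values = zip(*(list(base.items()) * (n_rows // len(base) + 1)))
    return pd.DataFrame({"id": ids[:n_rows], "data": values[:n_rows]})

def candidate(df: pd.DataFrame) -> pd.DataFrame:
    return df  # placeholder: substitute one of the answers' implementations here

df_test = make_df(1500)
# copy inside the lambda so in-place modifications don't leak between runs
print(timeit.timeit(lambda: candidate(df_test.copy()), number=50))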
Code
pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)')
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0)
df[['id']].join(pairs.add_prefix('data_'))
Explained
Extract all pairs using a regex pattern
0 1
match
0 0 0 4208
1 1 790
1 0 0 768
1 1 47
2 0 0 92
1 1 6
3 0 0 341
4 0 0 1
1 2 6
2 4 132
5 0 0 1
1 1 6
2 3 492
Pivot the pairs to reshape into desired format
0 0 1 2 3 4
0 4208 790 0 0 0
1 768 47 0 0 0
2 92 6 0 0 0
3 341 0 0 0 0
4 1 0 6 0 132
5 1 6 0 492 0
Join the reshaped pairs dataframe back with id column
id data_0 data_1 data_2 data_3 data_4
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 0 132
5 f 1 6 0 492 0
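One caveat (not part of the answer itself): str.extractall returns string captures, so the pivoted values are strings; if numeric columns are needed, a cast before and after the pivot should do it, e.g.:
pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)').astype(int)
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0).astype(int)
out = df[['id']].join(pairs.add_prefix('data_'))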
I'd avoid processing this in pandas. Assuming you have the data in some other format, I'd parse it into a list of dictionaries and then load that into pandas.
import pandas as pd
from typing import Dict

data = {
    "a": "2;0;4208;1;790",
    "b": "2;0;768;1;47",
    "c": "2;0;92;1;6",
    "d": "1;0;341",
    "e": "3;0;1;2;6;4;132",
    "f": "3;0;1;1;6;3;492"
}

def get_event_counts(event_str: str, delim: str = ";") -> Dict[str, int]:
    """
    given an event string return a dictionary of events
    """
    EVENT_COUNT_INDEX = 0
    split_event = event_str.split(delim)
    event_count = int(split_event[EVENT_COUNT_INDEX])
    # each (event_id, counter) pair follows the leading count
    events = {
        split_event[index*2 + 1]: int(split_event[index*2 + 2]) for index in range(event_count)
    }
    return events

data_records = [{"id": k, **get_event_counts(v)} for k, v in data.items()]
print(pd.DataFrame(data_records))
id 0 1 2 4 3
0 a 4208 790.0 NaN NaN NaN
1 b 768 47.0 NaN NaN NaN
2 c 92 6.0 NaN NaN NaN
3 d 341 NaN NaN NaN NaN
4 e 1 NaN 6.0 132.0 NaN
5 f 1 6.0 NaN NaN 492.0
If your current df is the starting point, you could try this:
def process_starting_dataframe(starting_dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Create a new dataframe from the original input with two columns "id" and "data"
    """
    data_dict = starting_dataframe.T.to_dict()
    data_records = [{"id": i['id'], **get_event_counts(i['data'])} for i in data_dict.values()]
    return pd.DataFrame(data_records)
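Usage would then look something like this (assuming the original frame has id and data columns, as in the question):
starting_df = pd.DataFrame({"id": list(data), "data": list(data.values())})
print(process_starting_dataframe(starting_df).fillna(0))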
A much more efficient method is to construct dicts from your data.
Notice how the alternating values in the split string are keys and values.
Then apply pd.Series and fillna(0) to get a dataframe with all the required columns for the data.
Then you can concat.
Code:
df_data = df['data'].apply(
    lambda x: dict(zip(x.split(';')[1::2], x.split(';')[2::2]))
).apply(pd.Series).fillna(0)
df_data.columns = df_data.columns.map('data_{}'.format)
df = pd.concat([df.drop('data',axis=1), df_data], axis=1)
output:
id data_0 data_1 data_2 data_4 data_3
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 132 0
5 f 1 6 0 0 492
If you need sorted columns you can just do:
df = df[sorted(df.columns)]

Python dataframe columns from row&column header

I have a data frame df1 like this:
A B C ...
mean 10 100 1
std 11 110 2
median 12 120 3
I want to make another df with separate col for each df1 col. header-row name pair:
A-mean A-std A-median B-mean B-std B-median C-mean C-std C-median ...
10 11 12 100 110 120 1 2 3
Basically I have used the pandas.DataFrame.describe function and now I would like to transpose it this way.
You can unstack your DataFrame into a Series, flatten the Index, turn it back into a DataFrame and transpose the result.
out = (
df.unstack()
.pipe(lambda s:
s.set_axis(s.index.map('-'.join))
)
.to_frame().T
)
print(out)
A-mean A-std A-median B-mean B-std B-median C-mean C-std C-median
0 10 11 12 100 110 120 1 2 3
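If you're starting directly from describe(), the same pipeline works on its output; a small sketch (note that describe() labels the median row "50%", so it needs a rename first):
stats = df.describe().loc[["mean", "std", "50%"]].rename(index={"50%": "median"})
out = (
    stats.unstack()
         .pipe(lambda s: s.set_axis(s.index.map('-'.join)))
         .to_frame().T
)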

Pandas: select rows by random groups while keeping all of the group's variables

My dataframe looks like this:
id std number
A 1 1
A 0 12
B 123.45 34
B 1 56
B 12 78
C 134 90
C 1234 100
C 12345 111
I'd like to select random groups of id while retaining all of the rows (and their other columns) belonging to each selected group, so that the dataframe would look like this:
id std number
A 1 1
A 0 12
C 134 90
C 1234 100
C 12345 111
I tried it with
size = 1000
replace = True
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df2 = df1.groupby('Id', as_index=False).apply(fn)
and
df2 = df1.sample(n=1000).groupby('id')
but obviously that didn't work. Any help would be appreciated.
You need to create random ids first and then compare the original id column with them using Series.isin in boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or:
N = 2
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N))]
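One detail worth noting: np.random.choice samples with replacement by default, so it can pick the same id twice and return fewer than N distinct groups; pass replace=False (or stick with drop_duplicates().sample(N)) to guarantee exactly N groups:
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N, replace=False))]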

How to apply percentile ranking on the pivot table?

Dummy Dataset
import pandas as pd

df = pd.DataFrame({
    "Business": ["Hotel", "Hotel", "Transport", "Agri", "Tele", "Hotel", "Transport", "Agri", "Tele"],
    "Location": ["101", "101", "101", "101", "103", "105", "102", "103", "106"],
    "Area": ['A', 'A', 'A', 'A', 'B', 'C', 'D', 'B', 'F'],
})

activity_cat_countby_subarea = df.groupby(['Area', 'Location', 'Business']).size().reset_index(name='counts')
activity_cat_countby_subarea = activity_cat_countby_subarea.reset_index().sort_values(['counts'], ascending=False)
After converting to the pivot table, here I am applying the ranking at the overall count level.
activity_cat_countby_subarea['overll_pct_rank'] = activity_cat_countby_subarea['counts'].rank(pct=True)
But my requirement is to apply the ranking based on each business's count, i.e. I need to find the ranking within each business (e.g. "Hotel") by its count.
Kindly assist, and let me know if you need more information.
Instead of doing this:
activity_cat_countby_subarea['overll_pct_rank'] = activity_cat_countby_subarea['counts'].rank(pct=True)
Do this:
activity_cat_countby_subarea['overll_pct_rank']=activity_cat_countby_subarea.groupby(['Business','counts']).rank(pct=True)
activity_cat_countby_subarea.sort_index(inplace=True)
#Output
index Area Location Business counts overll_pct_rank
0 0 A 101 Agri 1 0.5
1 1 A 101 Hotel 2 1.0
2 2 A 101 Transport 1 0.5
3 3 B 103 Agri 1 1.0
4 4 B 103 Tele 1 0.5
5 5 C 105 Hotel 1 1.0
6 6 D 102 Transport 1 1.0
7 7 F 106 Tele 1 1.0

Pandas Boolean indexing with two dataframes

I have two pandas dataframes:
df1
'A' 'B'
0 0
0 2
1 1
1 1
1 3
df2
'ID' 'value'
0 62
1 70
2 76
3 4674
4 3746
I want to assign df2.value as a new column D to df1, but only where df1.A == 0.
df1.B and df2.ID are supposed to be the identifiers.
Example output:
df1
'A' 'B' 'D'
0 0 62
0 2 76
1 1 NaN
1 1 NaN
1 3 NaN
I tried the following:
df1['D'][ df1.A == 0 ] = df2['value'][df2.ID == df1.B]
However, since df2 and df1 don't have the same length, I get a ValueError.
ValueError: Series lengths must match to compare
This is quite certainly due to the boolean indexing in the last part: [df2.ID == df1.B]
Does anyone know how to solve the problem without needing to iterate over the dataframe(s)?
Thanks a bunch!
==============
Edit in reply to #EdChum: It worked perfectly with the example data, but I have issues with my real data. df1 is a huge dataset. df2 looks like this:
df2
ID value
0 1 1.00000
1 2 1.00000
2 3 1.00000
3 4 1.00000
4 5 1.00000
5 6 1.00000
6 7 1.00000
7 8 1.00000
8 9 0.98148
9 10 0.23330
10 11 0.56918
11 12 0.53251
12 13 0.58107
13 14 0.92405
14 15 0.00025
15 16 0.14863
16 17 0.53629
17 18 0.67130
18 19 0.53249
19 20 0.75853
20 21 0.58647
21 22 0.00156
22 23 0.00000
23 24 0.00152
24 25 1.00000
After doing the merge, the output is the following: first 133 times 0.98148, then 47 times 0.00025, and then it continues with more sequences of values from df2 until finally a sequence of NaN entries appears...
Out[91]: df1
A B D
0 1 3 0.98148
1 0 9 0.98148
2 0 9 0.98148
3 0 7 0.98148
5 1 21 0.98148
7 1 12 0.98148
... ... ... ...
2592 0 2 NaN
2593 1 17 NaN
2594 1 16 NaN
2596 0 17 NaN
2597 0 6 NaN
Any idea what might have happened here? They are all int64.
==============
Here are two csv with data that reproduces the problem.
df1:
https://owncloud.tu-berlin.de/public.php?service=files&t=2a7d244f55a5772f16aab364e78d3546
df2:
https://owncloud.tu-berlin.de/public.php?service=files&t=6fa8e0c2de465cb4f8a3f8890c325eac
To reproduce:
import pandas as pd
df1 = pd.read_csv("../../df1.csv")
df2 = pd.read_csv("../../df2.csv")
df1['D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']
This one is slightly tricky; there are 2 steps here. First select only the rows in df where 'A' is 0, then merge the other df onto this where 'B' and 'ID' match, performing a 'left' merge, then select the 'value' column from the result and assign it to the df:
In [142]:
df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']
df
Out[142]:
A B D
0 0 0 62
1 0 2 76
2 1 1 NaN
3 1 1 NaN
4 1 3 NaN
Breaking this down will show what is happening:
In [143]:
# boolean mask on condition
df[df.A == 0]
Out[143]:
A B D
0 0 0 62
1 0 2 76
In [144]:
# merge using 'B' and 'ID' columns
df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')
Out[144]:
A B D ID value
0 0 0 62 0 62
1 0 2 76 2 76
After all the above you can then assign directly:
df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']
This works as it will align with the left hand side index, so any missing values will automatically be assigned NaN.
EDIT
Another method, and one that seems to work for your real data, is to use map to perform the lookup for you. map accepts a dict or Series as a param and will look up the corresponding value; in this case you need to set the index to the 'ID' column, which reduces your df to one with just the 'value' column:
df['D'] = df[df.A==0]['B'].map(df1.set_index('ID')['value'])
So the above performs boolean indexing as before, then calls map on the 'B' column and looks up the corresponding 'value' in the other df after we set the index on 'ID'.
Update
I looked at your data and my first method, and I can see why this fails: the alignment to the left hand side df fails, so you get 1192 values in a continuous run and then the rest of the rows are NaN up to row 2500.
What does work is applying the same mask to the left hand side, like so:
df1.loc[df1.A==0, 'D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']
So this masks the rows on the left hand side correctly and assigns the result of the merge.
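Equivalently, the map lookup from the EDIT above can be combined with the same .loc mask, which sidesteps the alignment issue entirely:
df1.loc[df1.A == 0, 'D'] = df1.loc[df1.A == 0, 'B'].map(df2.set_index('ID')['value'])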
