INTRODUCTION TO PROBLEM
I have data encoded in string in one DataFrame column:
id data
0 a 2;0;4208;1;790
1 b 2;0;768;1;47
2 c 2;0;92;1;6
3 d 1;0;341
4 e 3;0;1;2;6;4;132
5 f 3;0;1;1;6;3;492
The data represents counts of how many times certain events happened in our system. There can be 256 different events (each has a numerical id from the range 0-255). Since usually only a few events happen in one measurement period, it doesn't make sense to store all the zeros. That's why the data is encoded as follows: the first number tells how many events happened during the measurement period, then each subsequent pair contains an event_id and its counter.
For example:
"3;0;1;1;6;3;492" means:
3 events happened in measurement period
event with id=0 happened 1 time
event with id=1 happened 6 times
event with id=3 happened 492 times
other events didn't happen
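In other words, a minimal sketch of the decoding rule for a single string (the helper name decode_row is just for illustration):
def decode_row(encoded: str) -> dict:
    # "3;0;1;1;6;3;492" -> {0: 1, 1: 6, 3: 492}
    parts = [int(p) for p in encoded.split(';')]
    n_events = parts[0]
    return {parts[1 + 2*i]: parts[2 + 2*i] for i in range(n_events)}

decode_row("3;0;1;1;6;3;492")   # {0: 1, 1: 6, 3: 492}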
I need to decode the data to separate columns. Expected result is DataFrame which looks like this:
id data_0 data_1 data_2 data_3 data_4
0 a 4208.0 790.0 0.0 0.0 0.0
1 b 768.0 47.0 0.0 0.0 0.0
2 c 92.0 6.0 0.0 0.0 0.0
3 d 341.0 0.0 0.0 0.0 0.0
4 e 1.0 0.0 6.0 0.0 132.0
5 f 1.0 6.0 0.0 492.0 0.0
QUESTION ITSELF
I came up with the following function to do it:
def split_data(data: pd.Series):
    tmp = data.str.split(';', expand=True).astype('Int32').fillna(-1)
    tmp = tmp.apply(
        lambda row: {'{0}_{1}'.format(data.name, row[i*2-1]): row[i*2] for i in range(1, row[0]+1)},
        axis='columns',
        result_type='expand').fillna(0)
    return tmp
df = pd.concat([df, split_data(df.pop('data'))], axis=1)
The problem is that I have millions of lines to process and it takes A LOT of time.
As I don't have that much experience with pandas, I hope someone would be able to help me with more efficient way of performing this task.
EDIT - ANSWER ANALYSIS
Ok, so I took all three answers and performed some benchmarking :).
Starting conditions: I already have a DataFrame (this will be important!).
As expected all of them were waaaaay faster than my code.
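Roughly, each timing below was obtained with a timeit harness along these lines (a sketch; the wrapper function and repeat counts are placeholders):
import timeit

def run_variant(split_func, base_df):
    tmp = base_df.copy()
    return pd.concat([tmp, split_func(tmp.pop('data'))], axis=1)

# e.g. 15 rows, 1000 repeats
elapsed = timeit.timeit(lambda: run_variant(split_data, df), number=1000)
print('{:.4f}s'.format(elapsed))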
For example for 15 rows with 1000 repeats in timeit:
my code: 0.5827s
Schalton's code: 0.1138s
Shubham's code: 0.2242s
SomeDudes's code: 0.2219s
Seems like Schalton's code wins!
However... for 1500 rows with 50 repeats:
my code: 31.1139s
Schalton's code: 2.4599s
Shubham's code: 0.511s
SomeDudes's code: 17.15s
I decided to check once more, this time only one attempt but for 150 000 rows:
my code: 68.6798s
Schalton's code: 6.3889s
Shubham's code: 0.9520s
SomeDudes's code: 37.8837s
An interesting thing happens: as the DataFrame gets bigger, all versions except Shubham's slow down disproportionately! The two fastest are Schalton's and Shubham's versions. This is where the starting point matters! I already have an existing DataFrame, so I have to convert it to a dictionary first. The dictionary itself is processed really fast; the conversion, however, takes time. Shubham's solution is more or less independent of size! Schalton's works very well for small data sets, but due to the conversion to dict it gets much slower for large amounts of data.
Another comparison, this time 150000 rows with 30 repeats:
Schalton's code: 170.1538s
Shubham's code: 36.32s
However for 15 rows with 30000 repeats:
Schalton's code: 50.4997s
Shubham's code: 74.0916s
SUMMARY
In the end the choice between Schalton's version and Shubham's depends on the use case:
for a large number of small DataFrames (or data that starts as a dictionary) go with Schalton's solution
for very large DataFrames go with Shubham's solution.
As mentioned above, I have data sets of around 1 million rows and more, thus I will go with Shubham's answer.
Code
pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)')
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0)
df[['id']].join(pairs.add_prefix('data_'))
Explained
Extract all pairs using a regex pattern
0 1
match
0 0 0 4208
1 1 790
1 0 0 768
1 1 47
2 0 0 92
1 1 6
3 0 0 341
4 0 0 1
1 2 6
2 4 132
5 0 0 1
1 1 6
2 3 492
Pivot the pairs to reshape into desired format
0 0 1 2 3 4
0 4208 790 0 0 0
1 768 47 0 0 0
2 92 6 0 0 0
3 341 0 0 0 0
4 1 0 6 0 132
5 1 6 0 492 0
Join the reshaped pairs dataframe back with id column
id data_0 data_1 data_2 data_3 data_4
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 0 132
5 f 1 6 0 492 0
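One thing to keep in mind (my note, not part of the answer above): str.extractall returns string columns, so to get numeric dtypes like in the expected output you may want to cast, for example:
pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)').astype(int)
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0).astype(int)
result = df[['id']].join(pairs.add_prefix('data_'))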
I'd avoid processing this in pandas. Assuming you have the data in some other format, I'd parse it into a list of dictionaries and then load it into pandas.
import pandas as pd
from typing import Dict
data = {
    "a": "2;0;4208;1;790",
    "b": "2;0;768;1;47",
    "c": "2;0;92;1;6",
    "d": "1;0;341",
    "e": "3;0;1;2;6;4;132",
    "f": "3;0;1;1;6;3;492"
}
def get_event_counts(event_str: str, delim: str = ";") -> Dict[str, int]:
    """
    given an event string return a dictionary of events
    """
    EVENT_COUNT_INDEX = 0
    split_event = event_str.split(delim)
    event_count = int(split_event[EVENT_COUNT_INDEX])
    # one (event_id, count) pair per reported event
    events = {
        split_event[index*2+1]: int(split_event[index*2+2]) for index in range(event_count)
    }
    return events
data_records = [{"id": k, **get_event_counts(v)} for k,v in data.items()]
print(pd.DataFrame(data_records))
id 0 1 2 4 3
0 a 4208 790.0 NaN NaN NaN
1 b 768 47.0 NaN NaN NaN
2 c 92 6.0 NaN NaN NaN
3 d 341 NaN NaN NaN NaN
4 e 1 NaN 6.0 132.0 NaN
5 f 1 6.0 NaN NaN 492.0
If you're situated on your current df as the input, you could try this:
def process_starting_dataframe(starting_df: pd.DataFrame) -> pd.DataFrame:
    """
    Create a new dataframe from the original input with two columns "id" and "data"
    """
    data_dict = starting_df.T.to_dict()
    data_records = [{"id": i['id'], **get_event_counts(i['data'])} for i in data_dict.values()]
    return pd.DataFrame(data_records)
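As a follow-up (my addition, not part of the original answer), the remaining NaNs can be filled and the event columns renamed to match the expected layout, for example:
result = process_starting_dataframe(df).fillna(0)
result = result.rename(columns={c: 'data_{}'.format(c) for c in result.columns if c != 'id'})
result = result[['id'] + sorted(c for c in result.columns if c != 'id')]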
A much more efficient method is to construct dicts from your data.
Notice how the alternating values in the split string are keys and values?
Then apply pd.Series and fillna(0) to get a dataframe with all the required columns for the data.
Then you can concat.
Code:
df_data = df['data'].apply(
    lambda x: dict(zip(x.split(';')[1::2], x.split(';')[2::2]))).apply(pd.Series).fillna(0)
df_data.columns = df_data.columns.map('data_{}'.format)
df = pd.concat([df.drop('data', axis=1), df_data], axis=1)
output:
id data_0 data_1 data_2 data_4 data_3
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 132 0
5 f 1 6 0 0 492
If you need sorted columns you can just do:
df = df[sorted(df.columns)]
Related
I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result, each Dist_first should be binned together with all "z <= 0" values of lower index (within the same group "N") than the distance itself. "Scatters" is a copy of the index left over from an earlier stage of my code that is not relevant here; nonetheless I ended up using it instead of the index in the example below. The bins for the distances and z's are in 10 m and 0.1 m steps, respectively, and I can obtain a result by looping through groups of the DataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow the subtraction below
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1] - N[list(np.arange(j)+1)].sum(axis=1)

# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan
# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]

# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True),
                                     pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    # rename
    binned.index = N_v.index; binned.columns = N_h.index
    # and sum up with earlier chunks
    V = V + binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the whole datasets are excessively large (> 300 million rows each) and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest calculating the criteria in extra columns and then using pandas' standard binning functions, like cut or qcut. They can be applied separately along the 2 binning dimensions. Not the most elegant, but definitely vectorized.
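A minimal sketch of that idea, assuming the relevant per-row criteria have already been computed into hypothetical helper columns crit_z and crit_dist (the bin edges mirror the 0.1 m / 10 m steps from the question):
import numpy as np
import pandas as pd

bins_v = np.arange(0.0, 0.5 + 0.1, 0.1)     # depth bins in 0.1 m steps
bins_h = np.arange(0.0, 40.0 + 10.0, 10.0)  # distance bins in 10 m steps

binned = (
    df.groupby([pd.cut(df['crit_z'], bins_v, include_lowest=True),
                pd.cut(df['crit_dist'], bins_h, include_lowest=True)])
      .size()
      .unstack(fill_value=0)
)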
Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row value is different from the previous value, I need to count the number of previous consecutive values:
x y z
0
1 1
2 2
3 1
4 3
5
6 6 2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any ideas?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
x y z
0
1 1
2 2
3 1
4 3
5
6 6 2
Details:
Use apply, to apply a custom function on each column of the dataframe.
Find the spots where the value changes in the column, then use cumsum to create groups of consecutive values, then groupby and transform to attach the group count to each record, and finally shift and mask the values with where so that only the change spots keep a count.
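To make those steps concrete, here is a breakdown (my own) of the intermediates for the z column of the example frame:
col = df['z']                              # [0, 1, 1, 1, 0, 0, 1]
x = (col != col.shift().bfill())           # [F, T, F, F, T, F, T]   change spots
s = x.cumsum()                             # [0, 1, 1, 1, 2, 2, 3]   run labels
counts = s.groupby(s).transform('count')   # [1, 3, 3, 3, 2, 2, 1]   run lengths
counts.shift().where(x)                    # [NaN, 1, NaN, NaN, 3, NaN, 2]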
You can try the following, where you identify the "runs" first and get their lengths. You only get an entry where the value switches, so the output is the lengths of the runs except the last one.
import pandas as pd
import numpy as np
def func(x, missing=np.nan):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    switches = np.where(np.diff(x) != 0)[0] + 1
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below
    ##out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out
df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with it in Python.
I am joining two tables, left_table and right_table, on non-unique keys, which results in row explosion. I then want to aggregate the rows to match the number of rows in left_table. To do this I aggregate over the left_table columns.
Weirdly, when I save the table, the columns from left_table are doubled. It seems like the columns of left_table become an index for the resulting dataframe...
Left table
k1 k2 s v c target
0 1 3 20 40 2 2
1 1 2 10 20 1 1
2 1 2 10 80 2 1
Right table
k11 k22 s2 v2
0 1 2 0 100
1 2 3 30 200
2 1 2 10 300
Left join
k1 k2 s v c target s2 v2
0 1 3 20 40 2 2 NaN NaN
1 1 2 10 20 1 1 0.0 100.0
2 1 2 10 20 1 1 10.0 300.0
3 1 2 10 80 2 1 0.0 100.0
4 1 2 10 80 2 1 10.0 300.0
Aggregation code
dic = {}
keys_to_agg_over = left_table_col_names
for col in numeric_cols:
    if col in all_cols:
        dic[col] = 'median'

left_join = left_join.groupby(keys_to_agg_over).aggregate(dic)
After aggregation (doubled number of left table cols)
k1 k2 s v c target s2 v2
k1 k2 s v c target
1 2 10 20 1 1 1 2 10 20 1 1 5.0 200.0
80 2 1 1 2 10 80 2 1 5.0 200.0
3 20 40 2 2 1 3 20 40 2 2 NaN NaN
Saved to csv file
k1,k2,s,v,c,target,k1,k2,s,v,c,target,s2,v2
1,2,10,20,1,1,1,2,10,20,1,1,5.0,200.0
1,2,10,80,2,1,1,2,10,80,2,1,5.0,200.0
1,3,20,40,2,2,1,3,20,40,2,2,,
I tried resetting the index with left_join.reset_index(), but I get
ValueError: cannot insert target, already exists
How to fix the issue of column-doubling?
You have a couple of options:
Store the csv without the index: I guess you are using the to_csv method to store the result in a csv. By default it includes your index columns in the generated csv. You can do to_csv(index=False) to avoid storing them (see the sketch after this list).
reset_index dropping it: you can use left_join.reset_index(drop=True) in order to discard the index columns instead of adding them back to the dataframe. By default reset_index inserts the current index columns into the dataframe, which generates the ValueError you got.
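A minimal sketch of the two options (the file name is just a placeholder):
# option 1: don't write the index (the duplicated key columns) into the csv
left_join.to_csv('result.csv', index=False)

# option 2: drop the index copies of the keys first, keeping them only as regular columns
left_join = left_join.reset_index(drop=True)
left_join.to_csv('result.csv', index=False)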
It seems like you are using:
left_join = left_table.merge(right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left")
This will result in a dataframe with repeated rows, since indexes 1 and 2 from the left table can both be joined to indexes 0 and 2 of the right table. If that is the behavior you expected and you just want to get rid of the duplicated rows, you can try using:
left_join = left_join.drop_duplicates()
before aggregating. This won't stop the rows from being duplicated; it rather eliminates them afterwards so they don't cause any trouble.
You can also pass the parameter as_index = False in the groupby function like this:
left_join = left_join.groupby(keys_to_agg_over, as_index = False).aggregate(dic)
To stop getting the "grouping columns" as indexes.
I have a pandas data frame as follows:
id group type action cost
101 A 1 10
101 A 1 repair 3
102 B 1 5
102 B 1 repair 7
102 B 1 grease 2
102 B 1 inflate 1
103 A 2 12
104 B 2 9
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
id group type action_std action_extra
101 A 1 10 3
102 B 1 5 10
103 A 2 12 0
104 B 2 9 0
In other words, for the rows with empty action field the cost value should be put under the action_std column, while for the rows with non-empty action field the cost value should be summarized under the action_extra column.
I've attempted with several combinations of groupby / agg / pivot but I cannot find any fully working solution...
I would suggest you simply split the cost column into a cost and a cost_extra column. Something like the following:
import numpy as np
result = df.assign(
    cost_extra=lambda df: np.where(
        df['action'].notnull(), df['cost'], np.nan
    )
).assign(
    cost=lambda df: np.where(
        df['action'].isnull(), df['cost'], np.nan
    )
).groupby(
    ["id", "group", "type"]
)[["cost", "cost_extra"]].agg(
    "sum"
)
result looks like:
cost cost_extra
id group type
101 A 1 10.0 3.0
102 B 1 5.0 10.0
103 A 2 12.0 0.0
104 B 2 9.0 0.0
Check groupby with unstack
df.cost.groupby([df.id, df.group, df.type, df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
Thanks for your hints, this is the solution that I finally like the most (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])[["action_std", "action_extra"]].sum().reset_index()
I want to multiply a set of columns s_cols with two other columns b, c.
So far, I was doing
s_cols = ['t070101', 't070102', 't070103', 't070104', 't070105', 't070199', 't070201', 't070299']
dfNew = df[s_cols]*df['c']*df['b']
but that operation sucked all the 16GB of memory off my system and crashed my OSX - the table has 148000 rows.
What should I do instead? I guess applying row-wise requires less active memory, but it seems less efficient than a vectorized operation.
The table:
b TELFS t070101 t070102 t070103 t070104 \
TUCASEID
20030100013280 8155462.672158 2 0 0 0 0
20030100013344 1735322.527819 1 0 0 0 0
20030100013352 3830527.482672 2 60 0 0 0
20030100013848 6622022.995205 4 0 0 0 0
20030100014165 3068387.344956 1 0 0 0 0
t070105 t070199 t070201 t070299 \
TUCASEID
20030100013280 0 0 0 0
20030100013344 0 0 0 0
20030100013352 0 0 0 0
20030100013848 0 0 0 0
20030100014165 0 0 0 0
c
TUCASEID
20030100013280 31
20030100013344 31
20030100013352 31
20030100013848 31
20030100014165 31
UPDATE
The issue seems to be using df[s_cols]. Multiplication of a single column happens instantly, but already multiplying df[['t070101', 't070102']] was taking long enough that I was afraid of my system crashing again and preemptively shut down the Python process.
My guess is you actually want to do something like the following:
In [11]: cols = ['a', 'b']
In [12]: df1
Out[12]:
a b c d
0 1 4 1 4
1 2 5 2 10
2 3 6 3 18
In [13]: df1[cols].multiply(df1['c'] * df1['d'], axis=0)
Out[13]:
a b
0 4 16
1 40 100
2 162 324
As you can see with your code on this example, the index values get prepended to the columns (so the size of the DataFrame grows to N^2 in the length, which could potentially cause a memory error / slowdown):
In [21]: df1[cols] * df1['c'] * df1['d']
Out[21]:
0 1 2 a b
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
Aside: you should put parentheses around df1['c'] * df1['d'] here to ensure the right-hand side is calculated first.
Another option for problems like this is to use numexpr; see the enhancing performance with eval section of the pandas docs. However I don't think there is (currently) support for multiple assignment, so in this case it wouldn't help - nonetheless it is worth reading.
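For what it's worth, a rough sketch of the eval route on the toy frame above (just illustrating the API; the intermediate name scale is mine, and this does not by itself fix the alignment issue):
import pandas as pd

# compute the row-wise scale factor via pd.eval (backed by numexpr when installed)
scale = pd.eval("df1.c * df1.d")
out = df1[cols].multiply(scale, axis=0)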
The issue is apparently caused by pandas' suboptimal handling of the data frame slicing df[s_cols].
If instead I do
for col in s_cols:
    df[col] = df[col].multiply(df.monthDays * df.TUFNWGTP)
the operation is done almost instantly.
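For reference (my addition), the loop above should be equivalent to a single vectorized call in the spirit of the accepted answer:
# multiply every selected column by the same per-row factor, aligned on the index
df[s_cols] = df[s_cols].multiply(df.monthDays * df.TUFNWGTP, axis=0)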