I want to multiply a set of columns s_cols with two other columns b, c.
So far, I was doing
s_cols = ['t070101', 't070102', 't070103', 't070104', 't070105', 't070199', 't070201', 't070299']
dfNew = df[s_cols] * df['c'] * df['b']
but that operation consumed all 16 GB of memory on my system and crashed my OS X session - the table has 148,000 rows.
What should I do instead? I guess applying the operation row-wise would need less memory at a time, but it also seems less efficient than a vectorized operation.
The table:
b TELFS t070101 t070102 t070103 t070104 \
TUCASEID
20030100013280 8155462.672158 2 0 0 0 0
20030100013344 1735322.527819 1 0 0 0 0
20030100013352 3830527.482672 2 60 0 0 0
20030100013848 6622022.995205 4 0 0 0 0
20030100014165 3068387.344956 1 0 0 0 0
t070105 t070199 t070201 t070299 \
TUCASEID
20030100013280 0 0 0 0
20030100013344 0 0 0 0
20030100013352 0 0 0 0
20030100013848 0 0 0 0
20030100014165 0 0 0 0
c
TUCASEID
20030100013280 31
20030100013344 31
20030100013352 31
20030100013848 31
20030100014165 31
UPDATE
The issue seems to be using df[s_cols]. Multiplying a single column happens instantly, but even multiplying df[['t070101', 't070102']] was taking long enough that I was afraid of crashing my system again, so I preemptively shut down the Python process.
My guess is you actually want to do something like the following:
In [11]: cols = ['a', 'b']
In [12]: df1
Out[12]:
a b c d
0 1 4 1 4
1 2 5 2 10
2 3 6 3 18
In [13]: df1[cols].multiply(df1['c'] * df1['d'], axis=0)
Out[13]:
a b
0 4 16
1 40 100
2 162 324
If you run your original code on this example you can see what goes wrong: the multiplication aligns the Series against the columns, so the row-index values are added as columns, nothing lines up, and everything becomes NaN. The result is roughly N columns by N rows, i.e. N^2 cells, which would explain the memory error / slowdown:
In [21]: df1[cols] * df1['c'] * df1['d']
Out[21]:
0 1 2 a b
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
Aside: you should put parentheses around the Series product, i.e. (df1['c'] * df1['d']), to ensure that the cheap right-hand side is calculated first.
Another option for problems like this is to use numexpr; see the "Enhancing performance" section of the pandas docs on eval. However, I don't think there is (currently) support for multiple assignment, so it wouldn't help directly in this case - it is nonetheless worth reading.
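For reference, here is a minimal sketch of what an eval/numexpr-backed expression looks like (synthetic columns a, b and c, not the questioner's data); when numexpr is installed, pd.eval evaluates the whole expression in one pass without building large intermediates:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000, 3), columns=['a', 'b', 'c'])

# pd.eval hands the expression to numexpr when it is available
result = pd.eval("df.a * df.b * df.c")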
The issue is apparently caused by pandas' suboptimal handling of the DataFrame slice df[s_cols].
If I instead multiply column by column (monthDays and TUFNWGTP being the actual names of the c and b columns above):
for col in s_cols:
    df[col] = df[col].multiply(df.monthDays * df.TUFNWGTP)
the operation is done almost instantly.
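For completeness, a small sketch with made-up numbers (and only two of the columns) showing the same fix applied to all of s_cols at once via multiply(..., axis=0), which stays vectorized and avoids the per-column loop:
import pandas as pd

df = pd.DataFrame({
    't070101': [0, 60, 0],
    't070102': [0, 0, 5],
    'monthDays': [31, 31, 30],
    'TUFNWGTP': [2.0, 3.0, 4.0],
})
s_cols = ['t070101', 't070102']

# align the Series product on the row index (axis=0), not on the columns
df[s_cols] = df[s_cols].multiply(df['monthDays'] * df['TUFNWGTP'], axis=0)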
INTRODUCTION TO PROBLEM
I have data encoded as strings in one DataFrame column:
id data
0 a 2;0;4208;1;790
1 b 2;0;768;1;47
2 c 2;0;92;1;6
3 d 1;0;341
4 e 3;0;1;2;6;4;132
5 f 3;0;1;1;6;3;492
The data represents counts of how many times certain events happened in our system. There can be 256 different events (each has a numerical id assigned from the range 0-255). Since usually only a few events happen in any one measurement period, it doesn't make sense to store all the zeros. That's why the data is encoded as follows: the first number tells how many events happened during the measurement period, and then each pair contains an event_id and a counter.
For example:
"3;0;1;1;6;3;492" means:
3 events happened in measurement period
event with id=0 happened 1 time
event with id=1 happened 6 times
event with id=3 happened 492 times
other events didn't happen
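Just to make the format concrete, here is a tiny stand-alone decoder for a single string (an illustration only, not the pandas solution I'm after):
def decode(encoded: str):
    """Decode one 'count;id;value;id;value;...' string into {event_id: count}."""
    parts = [int(p) for p in encoded.split(';')]
    n_events = parts[0]                   # number of id/value pairs that follow
    pairs = parts[1:1 + 2 * n_events]
    return dict(zip(pairs[0::2], pairs[1::2]))

print(decode("3;0;1;1;6;3;492"))          # {0: 1, 1: 6, 3: 492}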
I need to decode the data to separate columns. Expected result is DataFrame which looks like this:
id data_0 data_1 data_2 data_3 data_4
0 a 4208.0 790.0 0.0 0.0 0.0
1 b 768.0 47.0 0.0 0.0 0.0
2 c 92.0 6.0 0.0 0.0 0.0
3 d 341.0 0.0 0.0 0.0 0.0
4 e 1.0 0.0 6.0 0.0 132.0
5 f 1.0 6.0 0.0 492.0 0.0
QUESTION ITSELF
I came up with the following function to do it:
def split_data(data: pd.Series):
    tmp = data.str.split(';', expand=True).astype('Int32').fillna(-1)
    tmp = tmp.apply(
        lambda row: {'{0}_{1}'.format(data.name, row[i*2-1]): row[i*2] for i in range(1, row[0]+1)},
        axis='columns',
        result_type='expand').fillna(0)
    return tmp

df = pd.concat([df, split_data(df.pop('data'))], axis=1)
The problem is that I have millions of rows to process and it takes A LOT of time.
As I don't have much experience with pandas, I hope someone can point me to a more efficient way of performing this task.
EDIT - ANSWER ANALYSIS
Ok, so I took all three answers and performed some benchmarking :) .
Starting conditions: I already have a DataFrame (this will be important!).
As expected all of them were waaaaay faster than my code.
For example for 15 rows with 1000 repeats in timeit:
my code: 0.5827s
Schalton's code: 0.1138s
Shubham's code: 0.2242s
SomeDudes's code: 0.2219s
Seems like Schalton's code wins!
However... for 1500 rows with 50 repeats:
my code: 31.1139
Schalton's code: 2.4599s
Shubham's code: 0.511s
SomeDudes's code: 17.15s
I decided to check once more, this time only one attempt but for 150 000 rows:
my code: 68.6798s
Schalton's code: 6.3889s
Shubham's code: 0.9520s
SomeDudes's code: 37.8837s
An interesting thing happens: as the DataFrame gets bigger, all versions except Shubham's take much longer! The two fastest are Schalton's and Shubham's versions. This is where the starting point matters! I already have an existing DataFrame, so I have to convert it to a dictionary first. The dictionary itself is processed really fast, but the conversion takes time. Shubham's solution is more or less independent of size! Schalton's works very well for small data sets, but due to the conversion to dict it gets much slower for large amounts of data.
Another comparison, this time 150000 rows with 30 repeats:
Schalton's code: 170.1538s
Shubham's code: 36.32s
However for 15 rows with 30000 repeats:
Schalton's code: 50.4997s
Shubham's code: 74.0916s
SUMMARY
In the end, the choice between Schalton's version and Shubham's depends on the use case:
for a large number of small DataFrames (or if you start with a dictionary) go with Schalton's solution
for very large DataFrames go with Shubham's solution.
As mentioned above, I have data sets of around 1 million rows and more, so I will go with Shubham's answer.
Code
pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)')
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0)
df[['id']].join(pairs.add_prefix('data_'))
Explained
Extract all pairs using a regex pattern
0 1
match
0 0 0 4208
1 1 790
1 0 0 768
1 1 47
2 0 0 92
1 1 6
3 0 0 341
4 0 0 1
1 2 6
2 4 132
5 0 0 1
1 1 6
2 3 492
Pivot the pairs to reshape into desired format
0 0 1 2 3 4
0 4208 790 0 0 0
1 768 47 0 0 0
2 92 6 0 0 0
3 341 0 0 0 0
4 1 0 6 0 132
5 1 6 0 492 0
Join the reshaped pairs dataframe back with id column
id data_0 data_1 data_2 data_3 data_4
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 0 132
5 f 1 6 0 492 0
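One small caveat (my addition, not part of the answer as posted): str.extractall captures strings, so if numeric dtypes are needed, a cast can be appended:
pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)')
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0).astype(int)
result = df[['id']].join(pairs.add_prefix('data_'))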
I'd avoid processing this in pandas. Assuming you have the data in some other format, I'd parse it into a list of dictionaries and then load it into pandas.
import pandas as pd
from typing import Dict
data = {
"a": "2;0;4208;1;790",
"b": "2;0;768;1;47",
"c": "2;0;92;1;6",
"d": "1;0;341",
"e": "3;0;1;2;6;4;132",
"f": "3;0;1;1;6;3;492"
}
def get_event_counts(event_str: str, delim: str = ";") -> Dict[str, int]:
    """
    Given an event string, return a dictionary of events.
    """
    EVENT_COUNT_INDEX = 0
    split_event = event_str.split(delim)
    event_count = int(split_event[EVENT_COUNT_INDEX])
    events = {
        # note: event_count - 1 // 2 evaluates to event_count (since 1 // 2 == 0),
        # i.e. the number of id/count pairs that follow the leading count
        split_event[index*2+1]: int(split_event[index*2+2]) for index in range(event_count - 1 // 2)
    }
    return events
data_records = [{"id": k, **get_event_counts(v)} for k,v in data.items()]
print(pd.DataFrame(data_records))
id 0 1 2 4 3
0 a 4208 790.0 NaN NaN NaN
1 b 768 47.0 NaN NaN NaN
2 c 92 6.0 NaN NaN NaN
3 d 341 NaN NaN NaN NaN
4 e 1 NaN 6.0 132.0 NaN
5 f 1 6.0 NaN NaN 492.0
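If the exact expected layout is needed, the NaNs can be filled and the event columns prefixed and sorted - a small follow-up sketch (my addition, assuming the data_records list built above):
result = pd.DataFrame(data_records).fillna(0)
event_cols = sorted(c for c in result.columns if c != 'id')
result = result[['id'] + event_cols].rename(columns={c: 'data_' + c for c in event_cols})
print(result)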
If you're starting from your current df as the input, you could try this:
def process_starting_dataframe(starting_dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Create a new dataframe from the original input with two columns, "id" and "data".
    """
    data_dict = starting_dataframe.T.to_dict()
    data_records = [{"id": i['id'], **get_event_counts(i['data'])} for i in data_dict.values()]
    return pd.DataFrame(data_records)
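Hypothetical usage, building a DataFrame like the question's from the data dict above and passing it through:
starting_df = pd.DataFrame({"id": list(data), "data": list(data.values())})
print(process_starting_dataframe(starting_df))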
A much more efficient method is to construct dicts from your data.
Notice how the alternating values in the split string are keys and values.
Then apply pd.Series and fillna(0) to get a DataFrame with all the required columns for the data.
Then you can concat.
Code:
df_data = df['data'].apply(
    lambda x: dict(zip(x.split(';')[1::2], x.split(';')[2::2]))).apply(pd.Series).fillna(0)
df_data.columns = df_data.columns.map('data_{}'.format)
df = pd.concat([df.drop('data',axis=1), df_data], axis=1)
output:
id data_0 data_1 data_2 data_4 data_3
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 132 0
5 f 1 6 0 0 492
If you need sorted columns you can just do:
df = df[sorted(df.columns)]
I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here is a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result, each Dist_first should be binned together with all z <= 0 values that have a lower index within the same group N than the distance itself. Scatters is a copy of the index left over from an operation at an earlier stage of my code which is not relevant here; nonetheless I came to use it instead of the index in the example below. The bins for the distances and the z values are in 10 m and 0.1 m steps, respectively, and I can obtain a result by looping through groups of the DataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1] - N[list(np.arange(j)+1)].sum(axis=1)

# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan

# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]

# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True),
                                     pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index; binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V + binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the whole datasets are excessively large (> 300 million rows each) and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest you calculate the criteria in extra columns and then use pandas' standard binning functions, like cut or qcut. They can be applied separately along the two binning dimensions. Not the most elegant, but definitely vectorized.
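For illustration, a minimal vectorized sketch of the 2-D binning step with pd.cut on two synthetic columns (x and y stand in for the question's distance and depth criteria; this is not the asker's actual data):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.uniform(0, 40, 1000),    # stand-in for distance
                   'y': rng.uniform(0, 0.5, 1000)})  # stand-in for depth

bins_h = np.arange(0, 50, 10)      # 10 m distance bins
bins_v = np.arange(0, 0.6, 0.1)    # 0.1 m depth bins

# group by both binned columns at once and count - no explicit row loop
counts = (df.groupby([pd.cut(df['y'], bins_v, include_lowest=True),
                      pd.cut(df['x'], bins_h, include_lowest=True)])
            .size()
            .unstack())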
The code below provides a cumulative count of how many times a specified value changes. The value has to change to return a count.
import pandas as pd
import numpy as np
d = {
    'Who': ['Out','Even','Home','Home','Even','Away','Home','Out','Even','Away','Away','Home','Away'],
}
df = pd.DataFrame(d)  # the loop below operates on this DataFrame

# Specified values
Teams = ['Home', 'Away']
for who in Teams:
    s = df[df.Who == who].index.to_series().diff() != 1
    df['Change_' + who] = s[s].cumsum()
Output:
Who Change_Home Change_Away
0 Out NaN NaN
1 Even NaN NaN
2 Home 1.0 NaN
3 Home NaN NaN
4 Even NaN NaN
5 Away NaN 1.0
6 Home 2.0 NaN
7 Out NaN NaN
8 Even NaN NaN
9 Away NaN 2.0
10 Away NaN NaN
11 Home 3.0 NaN
12 Away NaN 3.0
I'm trying to further sort the output based on which value precedes Home and Away. The code above doesn't differentiate what Home and Away were changed from; it just counts the number of times the value changed to Home/Away.
Is there a way to alter the code above to split it up by what Home/Away was changed from, or will it have to start again?
My intended output is:
Even_Away Even_Home Swap_Away Swap_Home Who
0 Out
1 Even
2 1 Home
3 Home
4 Even
5 1 Away
6 1 Home
7 Out
8 Even
9 2 Away
10 Away
11 2 Home
12 1 Away
So Even_ represents how many times it went from Even to Home/Away and Swap_ represents how many times it went from Home to Away or vice versa.
The main function here is get_dummies, for a dynamic solution - it creates new columns for all previous values defined in the Teams list:
#create DataFrame
df = pd.DataFrame(d)
Teams = ['Home', 'Away']
#create boolean mask for check value by list and compare with shifted column
shifted = df['Who'].shift().fillna('')
m1 = df['Who'].isin(Teams)
#mask for exclude same previous values Home_Home, Away_Away
m2 = df['Who'] == shifted
#chain together, ~ invert mask
m = m1 & ~m2
#join column by mask and create indicator df
df1 = pd.get_dummies(np.where(m, shifted + '_' + df['Who'], np.nan))
#rename columns dynamically
c = df1.columns[df1.columns.str.startswith(tuple(Teams))]
c1 = ['Swap_' + x.split('_')[1] for x in c]
df1 = df1.rename(columns = dict(zip(c, c1)))
#count values by cumulative sum, add column Who
df2 = df1.cumsum().mask(df1 == 0, 0).join(df[['Who']])
print (df2)
Swap_Home Even_Away Even_Home Swap_Away Who
0 0 0 0 0 Out
1 0 0 0 0 Even
2 0 0 1 0 Home
3 0 0 0 0 Home
4 0 0 0 0 Even
5 0 1 0 0 Away
6 1 0 0 0 Home
7 0 0 0 0 Out
8 0 0 0 0 Even
9 0 2 0 0 Away
10 0 0 0 0 Away
11 2 0 0 0 Home
12 0 0 0 1 Away
Assume I have a DataFrame of the following form where the first column is a random number, and the other columns will be based on the value in the previous column.
For ease of use, let's say I want each number to be the previous one squared, so each column is the square of the column before it.
I know I can write a pretty simple loop to do this, but I also know looping is not usually the most efficient in python/pandas. How could this be done with apply() or rolling_apply()? Or, otherwise be done more efficiently?
My (failed) attempts below:
In [12]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})
In [13]: a
Out[13]:
0 1 2 3
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
In [14]: a = a.apply(lambda x: x**2)
In [15]: a
Out[15]:
0 1 2 3
0 1 0 0 0
1 4 0 0 0
2 9 0 0 0
3 16 0 0 0
4 25 0 0 0
In [16]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})
In [17]: pandas.rolling_apply(a,1,lambda x: x**2)
C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\spyderlib\widgets\externalshell\start_ipython_kernel.py:1: FutureWarning: pd.rolling_apply is deprecated for DataFrame and will be removed in a future version, replace with
DataFrame.rolling(center=False,window=1).apply(args=<tuple>,kwargs=<dict>,func=<function>)
# -*- coding: utf-8 -*-
Out[17]:
0 1 2 3
0 1.0 0.0 0.0 0.0
1 4.0 0.0 0.0 0.0
2 9.0 0.0 0.0 0.0
3 16.0 0.0 0.0 0.0
4 25.0 0.0 0.0 0.0
In [18]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})
In [19]: a = a[:-1]**2
In [20]: a
Out[20]:
0 1 2 3
0 1 0 0 0
1 4 0 0 0
2 9 0 0 0
3 16 0 0 0
In [21]:
So, my issue is mostly how to refer to the previous column value in my DataFrame calculations.
What you're describing is a recurrence relation, and I don't think there is currently any non-loop way to do that. Things like apply and rolling_apply still rely on having all the needed data available before they begin, and outputting all the result data at once at the end. That is, they don't allow you to compute the next value using earlier values of the same series. See this question and this one as well as this pandas issue.
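To make that concrete, here is an invented recurrence (not from the question) where each output needs the previously computed output - exactly the dependency apply and rolling cannot express - so a plain loop is the straightforward way:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 5.0, 9.0])
out = np.empty(len(s))
out[0] = s.iloc[0]
for i in range(1, len(s)):
    # out[i] depends on out[i-1], so values must be computed in order
    out[i] = 0.5 * out[i - 1] ** 2 + s.iloc[i]
result = pd.Series(out, index=s.index)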
In practical terms, for your example, you only have three columns you want to fill in, so doing a three-pass loop (as shown in some of the other answers) will probably not be a major performance hit.
Unfortunately there isn't a way of doing this with no loops, as far as I know. However, you don't have to loop through every value, just each column. You can just call apply on the previous column and set the next one to the returned value:
a = pd.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})
for i in range(3):
    a[i+1] = a[i].apply(lambda x: x**2)
which is equivalent to writing each column out explicitly:
a[1] = a[0].apply(lambda x: x**2)
a[2] = a[1].apply(lambda x: x**2)
a[3] = a[2].apply(lambda x: x**2)
will give you
0 1 2 3
0 1 1 1 1
1 2 4 16 256
2 3 9 81 6561
3 4 16 256 65536
4 5 25 625 390625
In this special case, we know this about the columns:
column 0 will be whatever is there, to the power of 1
column 1 will be whatever is in column 0, to the power of 2
column 2 will be whatever is in column 1, to the power of 2...
    or whatever is in column 0, to the power of 4
column 3 will be whatever is in column 2, to the power of 2...
    or whatever is in column 1, to the power of 4...
    or whatever is in column 0, to the power of 8
So we can indeed vectorize your example with
np.power(df.values[:, [0]], np.power(2, np.arange(4)))
array([[ 1, 1, 1, 1],
[ 2, 4, 16, 256],
[ 3, 9, 81, 6561],
[ 4, 16, 256, 65536],
[ 5, 25, 625, 390625]])
Wrap this in a pretty dataframe
pd.DataFrame(
np.power(df.values[:, [0]], np.power(2, np.arange(4))),
df.index, df.columns)
0 1 2 3
0 1 1 1 1
1 2 4 16 256
2 3 9 81 6561
3 4 16 256 65536
4 5 25 625 390625
I'm trying to calculate what I am calling "delta values", meaning the amount that has changed between two consecutive rows.
For example
A | delta_A
1 | 0
2 | 1
5 | 3
9 | 4
I managed to do that starting with this code (basically copied from a MatLab program I had)
df = df.assign(delta_A=np.zeros(len(df.A)))
df['delta_A'][0] = 0 # start at 'no-change'
df['delta_A'][1:] = df.A[1:].values - df.A[:-1].values
That generates the DataFrame correctly and seems to have no further negative effects.
However, I think there is something wrong with that approach, because I get these messages:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
.../__main__.py:5: SettingWithCopyWarning
So, I didn't really understand what that link was trying to say, and I found this post:
Adding new column to existing DataFrame in Python pandas
The latest edit to the answer there says to use the code below, but I have already used that syntax...
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
So, the question is: is loc() the way to go, or what is the more correct way to get that column?
It seems you need diff and then replace NaN with 0:
df['delta_A'] = df.A.diff().fillna(0).astype(int)
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
Alternative solution with assign
df = df.assign(delta_A=df.A.diff().fillna(0).astype(int))
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
Another solution, if you need to replace only the first NaN value:
df['delta_A'] = df.A.diff()
df.loc[df.index[0], 'delta_A'] = 0
print (df)
A delta_A
0 0 0.0
1 4 4.0
2 7 3.0
3 8 1.0
Your solution can be modified with iloc, but I think it's better to use the diff function:
df['delta_A'] = 0 # convert all values to 0
df['delta_A'].iloc[1:] = df.A[1:].values - df.A[:-1].values
#also works
#df['delta_A'][1:] = df.A[1:].values - df.A[:-1].values
print (df)
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
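As a small aside (my addition, not part of the answer above), the same delta can also be written with shift, which makes the "previous row" explicit:
df['delta_A'] = (df['A'] - df['A'].shift()).fillna(0).astype(int)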