Create new column with various conditional logic between other columns - python

I have the following dataset:
test = pd.DataFrame({'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2019-09-01', '2019-09-02', '2019-09-03', '2020-01-02', '2020-01-03', '2020-01-04', '2020-10-04', '2020-10-05'],
                     'account': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
                     'tot_chg': [2072, 2072, 2072, 322, 322, 322, 483, 483, 483, 140, 570],
                     'denied': [1878, 1036, 1036, 322, 161, 161, 150, 322, 322, 105, 570],
                     'denied_sum': [1878, 2914, 3950, 322, 483, 644, 150, 472, 794, 105, 570]})
in which I would like to append a new column called denied_true based on the following rules:
- while denied_sum is less than tot_chg, return denied
- once denied_sum exceeds tot_chg, return the remaining difference between tot_chg and the sum of all prior denied_true values, then 0 for every later row of that account
- if denied equals tot_chg on an account's first row, just return denied and make the remaining rows for that account 0
The output should effectively look like this:
output = pd.DataFrame({'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2019-09-01', '2019-09-02', '2019-09-03', '2020-01-02', '2020-01-03', '2020-01-04', '2020-10-04', '2020-10-05'],
                       'account': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
                       'tot_chg': [2072, 2072, 2072, 322, 322, 322, 483, 483, 483, 140, 570],
                       'denied': [1878, 1036, 1036, 322, 161, 161, 150, 322, 322, 105, 570],
                       'denied_sum': [1878, 2914, 3950, 322, 483, 644, 150, 472, 794, 105, 570],
                       'denied_true': [1878, 194, 0, 322, 0, 0, 150, 322, 11, 105, 570]})
So far, I have tried the following code using where, but it's missing the condition that subtracts the prior denied_true values from tot_chg:
test['denied_true'] = test.denied_sum.to_numpy()
test.denied_true.where(test.denied_sum.le(test.tot_chg),other=0,inplace=True)
test
However, I'm not really sure how to append multiple conditions to this where function. Maybe I need if/elif loops, or a boolean mask. Any help would be greatly appreciated!

You can convert the DataFrame into an OrderedDict and handle it in this straightforward way:
import pandas as pd
from collections import OrderedDict
test = pd.DataFrame({'date': ['2018-08-01', '2018-08-02', '2018-08-03', '2019-09-01', '2019-09-02', '2019-09-03', '2020-01-02', '2020-01-03', '2020-01-04', '2020-10-04', '2020-10-05'],
                     'account': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
                     'tot_chg': [2072, 2072, 2072, 322, 322, 322, 483, 483, 483, 140, 570],
                     'denied': [1878, 1036, 1036, 322, 161, 161, 150, 322, 322, 105, 570],
                     'denied_sum': [1878, 2914, 3950, 322, 483, 644, 150, 472, 794, 105, 570]})
# convert DataFrame into OrderedDict
od = test.to_dict(into=OrderedDict)
# functions (samples)
def zero(d, row):
    # if denied == denied_sum
    # change the dict...
    return d['denied'][row]

def ex(d, row):
    # if exceeds
    # change the dict...
    return 'exceed()'

def eq(d, row):
    # if equals
    # change the dict...
    return 'equal()'

def get_value(d, row):
    # conditions
    if d['denied'][row] == d['denied_sum'][row]: return zero(d, row)
    if d['denied_sum'][row] < d['tot_chg'][row]: return d['denied'][row]
    if d['denied_sum'][row] > d['tot_chg'][row]: return ex(d, row)
    if d['denied_sum'][row] == d['tot_chg'][row]: return eq(d, row)
# MAIN
# make a list (column) of 'denied_true' values
denied_true_list = [(row, get_value(od, row)) for row in range(len(od["date"]))]
# convert the list into a dict
denied_true_dict = {'denied_true': OrderedDict(denied_true_list)}
# add the dict to the OrderedDict
od.update(OrderedDict(denied_true_dict))
# convert the OrderedDict back into DataFrame
test = pd.DataFrame(od)
Input:
date account tot_chg denied denied_sum
0 2018-08-01 a 2072 1878 1878
1 2018-08-02 a 2072 1036 2914
2 2018-08-03 a 2072 1036 3950
3 2019-09-01 b 322 322 322
4 2019-09-02 b 322 161 483
5 2019-09-03 b 322 161 644
6 2020-01-02 c 483 150 150
7 2020-01-03 c 483 322 472
8 2020-01-04 c 483 322 794
9 2020-10-04 d 140 105 105
10 2020-10-05 e 570 570 570
Output:
date account tot_chg denied denied_sum denied_true
0 2018-08-01 a 2072 1878 1878 1878
1 2018-08-02 a 2072 1036 2914 exceed()
2 2018-08-03 a 2072 1036 3950 exceed()
3 2019-09-01 b 322 322 322 322
4 2019-09-02 b 322 161 483 exceed()
5 2019-09-03 b 322 161 644 exceed()
6 2020-01-02 c 483 150 150 150
7 2020-01-03 c 483 322 472 322
8 2020-01-04 c 483 322 794 exceed()
9 2020-10-04 d 140 105 105 105
10 2020-10-05 e 570 570 570 570
I didn't make a full implementation of your logic in the functions since it's just a sample.
Much the same (probably a bit easier) could be done via DataFrame > JSON > DataFrame.
Update: I've tried to implement the function ex(). Here is how it might look.
def ex(d, row):
    # if exceeds
    denied_true_slice = denied_true_list[0:row]  # <-- global list
    tot_chg_slice = [d['tot_chg'][r] for r in range(row)]
    denied_true_sum = sum(v for r, v in enumerate(denied_true_slice) if tot_chg_slice[r] > v)
    value = tot_chg_slice[-1] - denied_true_sum
    return value if value > 0 else 0
I'm not quite sure it works as supposed, since I don't fully understand the quirky conditions. But I'm sure it looks rather ugly and cryptic, and probably isn't in line with Stack Overflow's best examples.
Now there is a global list, so the MAIN section looks like this:
# MAIN
# make a list (column) of 'denied_true' values
denied_true_list = []  # <-- the global list
for row, _ in enumerate(od['date']):
    denied_true_list.append(get_value(od, row))
denied_true_list = [(row, value) for row, value in enumerate(denied_true_list)]
# convert the list into a dict
denied_true_dict = {'denied_true': OrderedDict(denied_true_list)}
# add the dict to the OrderedDict
od.update(OrderedDict(denied_true_dict))
# convert the OrderedDict back into DataFrame
test = pd.DataFrame(od)
Output:
date account tot_chg denied denied_sum denied_true
0 2018-08-01 a 2072 1878 1878 1878
1 2018-08-02 a 2072 1036 2914 194
2 2018-08-03 a 2072 1036 3950 0
3 2019-09-01 b 322 322 322 322
4 2019-09-02 b 322 161 483 0
5 2019-09-03 b 322 161 644 0
6 2020-01-02 c 483 150 150 150
7 2020-01-03 c 483 322 472 322
8 2020-01-04 c 483 322 794 0
9 2020-10-04 d 140 105 105 105
10 2020-10-05 e 570 570 570 570
I believe it could be done much more cleanly with native Pandas tools.
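For instance, here is a minimal vectorized sketch (my own addition, and it assumes denied_sum really is the per-account running total of denied): cap the running total at tot_chg, then take the per-account increment of the capped series.
# cap the cumulative denials at tot_chg, row by row
capped = test[['denied_sum', 'tot_chg']].min(axis=1)
# denied_true is the per-account increment of the capped running total
test['denied_true'] = (capped - capped.groupby(test['account']).shift(1).fillna(0)).astype(int)
On the sample data this reproduces the desired column, including the 194 for account a and the 11 for account c.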


Unable to allocate 208. GiB for an array with shape (27939587241,) and data type int64?

This is my code:
play_count_with_title = pd.merge(df_count, df_small[['song_id', 'title', 'release']], on = 'song_id' )
final_ratings = pd.merge(play_count_with_title, df_small[['song_id', 'artist_name']], on = 'song_id' )
final_ratings
The error which I got is:
Unable to allocate 208. GiB for an array with shape (27939587241,) and data type int64
The code which triggered this error within the library is:
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:124, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
 93 @Substitution("\nleft : DataFrame or named Series")
 94 @Appender(_merge_doc, indents=0)
95 def merge(
(...)
108 validate: str | None = None,
109 ) -> DataFrame:
110 op = _MergeOperation(
111 left,
112 right,
(...)
122 validate=validate,
123 )
--> 124 return op.get_result(copy=copy)
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:773, in _MergeOperation.get_result(self, copy)
770 if self.indicator:
771 self.left, self.right = self._indicator_pre_merge(self.left, self.right)
--> 773 join_index, left_indexer, right_indexer = self._get_join_info()
775 result = self._reindex_and_concat(
776 join_index, left_indexer, right_indexer, copy=copy
777 )
778 result = result.__finalize__(self, method=self._merge_type)
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:1026, in _MergeOperation._get_join_info(self)
1022 join_index, right_indexer, left_indexer = _left_join_on_index(
1023 right_ax, left_ax, self.right_join_keys, sort=self.sort
1024 )
1025 else:
-> 1026 (left_indexer, right_indexer) = self._get_join_indexers()
1028 if self.right_index:
1029 if len(self.left) > 0:
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:1000, in _MergeOperation._get_join_indexers(self)
998 def _get_join_indexers(self) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]:
999 """return the join indexers"""
-> 1000 return get_join_indexers(
1001 self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
1002 )
File ~\anaconda3\lib\site-packages\pandas\core\reshape\merge.py:1610, in get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
1600 join_func = {
1601 "inner": libjoin.inner_join,
1602 "left": libjoin.left_outer_join,
(...)
1606 "outer": libjoin.full_outer_join,
1607 }[how]
1609 # error: Cannot call function of unknown type
-> 1610 return join_func(lkey, rkey, count, **kwargs)
File ~\anaconda3\lib\site-packages\pandas\_libs\join.pyx:48, in pandas._libs.join.inner_join()
As a beginner I don't understand the error. Can you guys help me out?
It's hard to know what's going on without a sample of your data. However, this looks like the sort of problem you'd see if there are a lot of duplicated values in both dataframes.
Note that if there are multiple rows which match during the merge, then every combination of left and right rows is emitted by the merge.
For example, here's a tiny example of a 3-element DataFrame being merged with itself. The result has 9 elements!
In [7]: df = pd.DataFrame({'a': [1,1,1], 'b': [1,2,3]})
In [8]: df.merge(df, 'left', on='a')
Out[8]:
a b_x b_y
0 1 1 1
1 1 1 2
2 1 1 3
3 1 2 1
4 1 2 2
5 1 2 3
6 1 3 1
7 1 3 2
8 1 3 3
If your song_id column has a lot of duplicates in it, then the number of elements could be as many as N^2, i.e. 154377**2 == 23832258129 in the worst case.
Try using drop_duplicates('song_id') on each of the merge inputs to see what happens in that case.
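For example, a minimal sketch (my own addition, assuming one metadata row per song_id in df_small is enough) that deduplicates before merging and collapses the two merges into one:
# keep a single metadata row per song_id before joining
meta = df_small[['song_id', 'title', 'release', 'artist_name']].drop_duplicates('song_id')
final_ratings = pd.merge(df_count, meta, on='song_id')
If the row count of final_ratings comes back close to len(df_count), duplicated keys were the problem.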

How can I loop though pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all three columns with DataFrame.sort_values, get the per-group difference with DataFrameGroupBy.diff, and replace missing values with a zero timedelta via Series.fillna:
# if the values are already strings, the astype call can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
Also is possible convert timedeltas to seconds - add Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
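As a quick self-contained check, here is a sketch (my own addition) built from the rows shown in the expected output:
import pandas as pd

df = pd.DataFrame({'ID': [995812, 995812, 995812, 995812, 995820, 995820],
                   'Location': [696, 730, 761, 771, 381, 761],
                   'Time': ['07:10:36', '07:11:41', '07:12:30', '07:20:49', '06:55:07', '07:12:44']})
df['Time'] = pd.to_timedelta(df['Time'])
df = df.sort_values(['ID', 'Location', 'Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
print(df)
The Delta column comes out as 0, 00:01:05, 00:00:49, 00:08:19 for ID 995812 and 0, 00:17:37 for ID 995820, matching the desired output above.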
If you just wanted to iterate over the groupby object (based on your original question title), you can do it like this:
for (x, y) in df.groupby(['ID', 'Location', 'Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this works for 10,000 or 100,000 rows, it does not perform well for 10^6 rows or more.

How to apply a function to all the columns in a data frame and take output in the form of dataframe in python

I have two functions that do some calculations and give me results. For now, I am able to apply one of them to a single column and get the result in the form of a dataframe.
I need to know how I can apply the function to all the columns in the dataframe and likewise get the results in the form of a dataframe.
Say I have a data frame as below, and I need to apply a function to each column and get a dataframe with the corresponding results for all the columns.
A B C D E F
1456 6744 9876 374 65413 1456
654 2314 674654 2156 872 6744
875 653 36541 345 4963 9876
6875 7401 3654 465 3547 374
78654 8662 35 6987 6874 65413
658 94512 687 489 8756 5854
Results
A B C D E F
2110 9058 684530 2530 66285 8200
1529 2967 711195 2501 5835 16620
7750 8054 40195 810 8510 10250
85529 16063 3689 7452 10421 65787
Here is a simple example:
df
A B C D
0 10 11 12 13
1 20 21 22 23
2 30 31 32 33
3 40 41 42 43
# Assume your user-defined function is
def mul(x, y):
    return x * y
which will multiply the values
Let's say you want to multiply the first column 'A' by 3:
df['A'].apply(lambda x: mul(x,3))
0 30
1 60
2 90
3 120
Now, you want to apply the mul function to all columns of the dataframe and create a new dataframe with the results:
df1 = df.applymap(lambda x: mul(x, 3))
df1
A B C D
0 30 33 36 39
1 60 63 66 69
2 90 93 96 99
3 120 123 126 129
The pd.DataFrame object also has its own apply method.
From the example given in the documentation linked above:
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Conclusion: you should be able to apply your function to the whole dataframe.
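To make that concrete, here is a small sketch (my own addition) with a custom column-wise function instead of np.sqrt:
def span(col):
    # spread of each column
    return col.max() - col.min()

df.apply(span)
# A    0
# B    0
# dtype: int64
By default df.apply works column by column (axis=0) and returns one result per column.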
It looks like this is what you are trying to do in your output:
df = pd.DataFrame(
    [[1456, 6744, 9876, 374, 65413, 1456],
     [654, 2314, 674654, 2156, 872, 6744],
     [875, 653, 36541, 345, 4963, 9876],
     [6875, 7401, 3654, 465, 3547, 374],
     [78654, 8662, 35, 6987, 6874, 65413],
     [658, 94512, 687, 489, 8756, 5854]],
    columns=list('ABCDEF'))

def fn(col):
    # sum each value with the one in the next row (pairs of consecutive rows)
    return col[:-2].values + col[1:-1].values
Apply the function as mentioned in previous answers:
>>> df.apply(fn)
A B C D E F
0 2110 9058 684530 2530 66285 8200
1 1529 2967 711195 2501 5835 16620
2 7750 8054 40195 810 8510 10250
3 85529 16063 3689 7452 10421 65787

how to multiply all values from a column in a particular year in pandas

I'm trying to multiply all values in a particular year and push the result to another column. With the code below I'm getting this error:
TypeError: ("'NoneType' object is not callable", 'occurred at index
I'm also getting NaT and NaN when I use shift(1). How can I get it to work?
def check_date():
    next_row = df.Date.shift(1)
    first_row = df.Date
    date1 = pd.to_datetime(first_row).year
    date2 = pd.to_datetime(next_row).year
    if date1 == date2:
        df['all_data_in_year'] = date1 * date2

df.apply(check_date(), axis=1)
DataSet:
Date Open High Low Last Close Total Trade Quantity Turnover (Lacs)
31/12/10 816 824.5 807.3 815 818.45 1165987 9529.64
31/01/11 675 680 654 670.1 669.35 535039 3553.92
28/02/11 550 561.6 542 548.5 548.4 749166 4136.09
31/03/11 621.5 624.7 607.1 618 616.25 628572 3866
29/04/11 654.7 657.95 626 631 632.05 833213 5338.91
31/05/11 575 590 565.6 589.3 585.15 908185 5239.36
30/06/11 527 530.7 521.3 524 524.6 534496 2804.89
29/07/11 496.95 502.9 486 486.2 489.7 500743 2477.96
30/08/11 365.95 382.7 365 380 376.65 844439 3171.6
30/09/11 362.4 365.9 348.1 352 352.75 617537 2196.56
31/10/11 430 439.5 425 429.1 431.2 1033903 4493.97
30/11/11 349.05 354.95 344.15 348 350 686735 2404.1
30/12/11 353 355.9 340.1 340.1 342.75 740222 2565.39
31/01/12 443 451.45 428 445.5 446 1344942 5952.77
29/02/12 485.55 505.9 484 497 495.1 1011007 5004.46
30/03/12 421 436.45 418.4 432.5 432.95 867832 3740.04
30/04/12 410.35 419.4 406.85 414.3 414.05 418539 1733.81
31/05/12 362 363.05 351.2 359 358.3 840753 3000.41
29/06/12 385.05 395.3 382.9 388 389.75 1171690 4581.58
31/07/12 377.75 386 367.7 380.5 381.35 499246 1886.06
31/08/12 473.7 473.7 394.25 399 400.85 631225 2544.24
I think it is better to avoid loops (apply uses them under the hood) and use numpy.where. As an aside, the TypeError itself comes from df.apply(check_date(), axis=1): the parentheses call check_date immediately and pass its result (None) to apply, which then tries to call None on each row; you would pass check_date without parentheses.
# sample DataFrame with sample datetimes
rng = pd.date_range('2017-04-03', periods=10, freq='8M')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
date1 = df.Date.shift(1).dt.year
date2 = df.Date.dt.year
df['all_data_in_year'] = np.where(date1 == date2, date1 * date2, np.nan)
print (df)
Date a all_data_in_year
0 2017-04-30 0 NaN
1 2017-12-31 1 4068289.0
2 2018-08-31 2 NaN
3 2019-04-30 3 NaN
4 2019-12-31 4 4076361.0
5 2020-08-31 5 NaN
6 2021-04-30 6 NaN
7 2021-12-31 7 4084441.0
8 2022-08-31 8 NaN
9 2023-04-30 9 NaN
EDIT1: If you need the product of all Close values within each year, use GroupBy.transform:
df['new'] = df.groupby( pd.to_datetime(df['Date']).dt.year)['Close'].transform('prod')
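A tiny sketch of what that transform does (my own addition; dayfirst=True is spelled out because the question's dates are day-first):
import pandas as pd

df = pd.DataFrame({'Date': ['31/12/10', '31/01/11', '28/02/11'],
                   'Close': [2.0, 3.0, 4.0]})
years = pd.to_datetime(df['Date'], dayfirst=True).dt.year
df['new'] = df.groupby(years)['Close'].transform('prod')
# 2010 has one row, so new == 2.0 there; both 2011 rows get 3.0 * 4.0 == 12.0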

Format Pandas Pivot Table

I ran into a problem formatting a pivot table created by Pandas.
I made a matrix table between two columns (A, B) from my source data, using pandas.pivot_table with A as the columns and B as the index.
>> df = PD.read_excel("data.xls")
>> table = PD.pivot_table(df, index=["B"], values='Count', columns=["A"],
                          aggfunc=[NUM.sum], fill_value=0, margins=True, dropna=True)
>> table
>> table
It returns as:
sum
A 1 2 3 All
B
1 23 52 0 75
2 16 35 12 65
3 56 0 0 56
All 95 87 12 196
And I hope to have a format like this:
A All_B
1 2 3
1 23 52 0 75
B 2 16 35 12 65
3 56 0 0 56
All_A 95 87 12 196
How should I do this? Thanks very much in advance.
The table returned by pd.pivot_table is very convenient to work with (it has a single-level index and columns) and normally does NOT require any further format manipulation. But if you insist on changing the format to the one you mentioned in the post, then you need to construct a multi-level index/column using pd.MultiIndex. Here is an example of how to do it.
Before manipulation,
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B 1 2 3 All
A
1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All 1458 1472 1718 4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
A All_B
1 2 3
B 1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All_A 1458 1472 1718 4648
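If all you actually need is to relabel the margins, a simpler sketch (my own addition) keeps the flat index and just renames the 'All' entries:
table = table.rename(index={'All': 'All_A'}, columns={'All': 'All_B'})
The pd.MultiIndex approach above is only needed when you also want the 'A'/'B' group labels displayed alongside the values.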
