Let's say I have a number A, and each A calls several people B:
A B
123 987
123 987
123 124
435 567
435 789
653 876
653 876
999 654
999 654
999 654
999 123
I want to find whom each person in A has called the most, along with the number of times.
OUTPUT:
A B Count
123 987 2
435 567 or 789 1
653 876 2
999 654 3
One way to think of it is:
A B
123 987 2
124 1
435 567 1
789 1
653 876 2
999 654 3
123 1
Can somebody help me out on how to do this?
Try this
# count the unique values in rows
df.value_counts(['A','B']).sort_index()
A B
123 124 1
987 2
435 567 1
789 1
653 876 2
999 123 1
654 3
dtype: int64
To get the highest values for each unique A:
v = df.value_counts(['A','B'])
# remove duplicated rows
v[~v.reset_index(level=0).duplicated('A').values]
A B
999 654 3
123 987 2
653 876 2
435 567 1
dtype: int64
Use SeriesGroupBy.value_counts, which sorts values in descending order by default, then get the first row per A with GroupBy.head:
df = df.groupby('A')['B'].value_counts().groupby(level=0).head(1).reset_index(name='Count')
print(df)
A B Count
0 123 987 2
1 435 567 1
2 653 876 2
3 999 654 3
Another idea:
df = df.value_counts(['A','B']).reset_index(name='Count').drop_duplicates('A')
print(df)
A B Count
0 999 654 3
1 123 987 2
2 653 876 2
4 435 567 1
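For reference, here is a self-contained sketch that rebuilds the sample data from the question (the DataFrame construction itself is an assumption, since the question only shows the printed table) and runs the groupby approach:

```python
import pandas as pd

# Reconstruct the sample calls table shown in the question
df = pd.DataFrame({
    'A': [123, 123, 123, 435, 435, 653, 653, 999, 999, 999, 999],
    'B': [987, 987, 124, 567, 789, 876, 876, 654, 654, 654, 123],
})

# Count each (A, B) pair; value_counts sorts counts descending within
# each group, so head(1) keeps the most frequent B per A
out = (df.groupby('A')['B'].value_counts()
         .groupby(level=0).head(1)
         .reset_index(name='Count'))
print(out)
```

Note that ties (as for A=435) are broken arbitrarily, which matches the "567 or 789" in the expected output.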
Assume the following simplified framework:
I have a 3D Pandas dataframe of parameters composed of 100 rows, 4 classes and 4 features for each instance:
iterables = [list(range(100)), [0,1,2,3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))), index=index, columns=columns)
parameters
instances classes a b c d
0 0 1127 1460 861 1295
1 1131 1096 1725 1045
2 1639 122 467 1239
3 331 1483 88 1397
1 0 1124 872 1688 131
... ... ... ... ...
98 3 1321 1750 779 1431
99 0 1793 814 1637 1429
1 1370 1646 420 1206
2 983 825 1025 1855
3 1974 567 371 936
Let df be a dataframe that, for each instance and each feature (column), reports the observed class.
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))),
                  index=list(range(100)), columns=columns)
a b c d
0 2 0 2 2
1 0 0 2 1
2 2 2 2 2
3 0 2 1 0
4 1 1 1 1
.. .. .. .. ..
95 1 2 0 1
96 2 1 2 1
97 0 0 1 2
98 0 0 0 1
99 1 2 2 2
I would like to create a third dataframe (let's call it new_df) of shape (100, 4) containing the values from the dataframe parameters, selected according to the observed classes in the dataframe df.
For example, in the first row of df, for the first column (a) I observe class 2, so the value I am interested in is the one for class 2 in the first instance of the parameters dataframe, namely 1639, which will populate the first row and column of new_df. Following this method, the first observation for column "b" is class 0, so in the first row, column b of new_df I would like to see 1460, and so on.
With a for loop I can obtain the desired result:
new_df = pd.DataFrame(0, index=list(range(100)), columns=columns)  # initialize the df
for i in range(len(df)):
    for c in df.columns:
        new_df.loc[i, c] = parameters.loc[(i, df.loc[i, c]), c]
new_df
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
However, the original dataset contains millions of rows and hundreds of columns, so proceeding with for loops is infeasible.
Is there a way to vectorize such a problem in order to avoid for loops? (at least over 1 dimension)
Reshape both DataFrames, using stack, into a long format, then perform the merge and reshape, with unstack, back to the wide format. There's a bunch of renaming just so we can reference and align the columns in the merge.
(df.rename_axis(index='instances', columns='cols').stack().to_frame('classes')
.merge(parameters.rename_axis(columns='cols').stack().rename('vals'),
on=['instances', 'classes', 'cols'])
.unstack(-1)['vals']
.rename_axis(index=None, columns=None)
)
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
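An alternative to the merge, for illustration rather than as the answer's method, is to view parameters as an (instances, classes, features) NumPy cube and gather with np.take_along_axis; this is a sketch under the same random setup as the question:

```python
import numpy as np
import pandas as pd

n_inst, n_cls = 100, 4
columns = ['a', 'b', 'c', 'd']
index = pd.MultiIndex.from_product([range(n_inst), range(n_cls)],
                                   names=['instances', 'classes'])
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))),
                          index=index, columns=columns)
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(n_inst, len(columns))),
                  index=range(n_inst), columns=columns)

# View parameters as (instances, classes, features); for every
# instance/feature pair pick the row given by the observed class in df.
cube = parameters.to_numpy().reshape(n_inst, n_cls, len(columns))
picked = np.take_along_axis(cube, df.to_numpy()[:, None, :], axis=1)[:, 0, :]
new_df = pd.DataFrame(picked, index=df.index, columns=columns)
```

This avoids materializing the long-format intermediate entirely, which can matter at millions of rows.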
I would like to have a dataframe created by combining only the total-row values of two pivot tables, keeping the same column names, including the All column.
testA:
sum
ALL_APPS
MONTH 2019/08 2019/09 2019/10 All
DESCRIPTION
A1 111 112 113 336
A2 121 122 123 366
A3 131 132 133 396
All 363 366 369 1098
testB:
sum
ALL_APPS
MONTH 2019/08 2019/09 2019/10 All
DESCRIPTION
A1 211 212 213 636
A2 221 222 223 666
A3 231 232 233 696
All 663 666 669 1998
As a result, I would like to have a data frame that looks like:
2019/08 2019/09 2019/10 All
363 366 369 1098
663 666 669 1998
I tried:
A=testA.iloc[3]
B=testB.iloc[3]
my_series = pd.concat([A,B],axis=1)
But it does not do what I expected :(
All All
MONTH
sum ALL_APPS 2019/08 363.0 NaN
2019/09 366.0 NaN
2019/10 369.0 NaN
All 1098.0 NaN
CUR_VER 2019/08 NaN 663.0
2019/09 NaN 666.0
2019/10 NaN 669.0
All NaN 1998.0
Try:
my_series=pd.concat([testA.iloc[-1], testB.iloc[-1]], axis=1, ignore_index=True).T
my_series.columns = [c[-1] for c in testA.columns]
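A variant that avoids the transpose is to slice the total rows with a list, so they stay one-row DataFrames, and concatenate along the row axis. Here is a runnable sketch on simplified stand-ins for the pivots (single-level columns; the real tables carry extra 'sum'/'ALL_APPS' levels):

```python
import pandas as pd

cols = ['2019/08', '2019/09', '2019/10', 'All']
testA = pd.DataFrame([[111, 112, 113, 336],
                      [121, 122, 123, 366],
                      [131, 132, 133, 396],
                      [363, 366, 369, 1098]],
                     index=['A1', 'A2', 'A3', 'All'], columns=cols)
testB = pd.DataFrame([[211, 212, 213, 636],
                      [221, 222, 223, 666],
                      [231, 232, 233, 696],
                      [663, 666, 669, 1998]],
                     index=['A1', 'A2', 'A3', 'All'], columns=cols)

# iloc[[-1]] (note the list) keeps the 'All' row as a one-row DataFrame,
# so concat stacks the two total rows with columns already aligned
result = pd.concat([testA.iloc[[-1]], testB.iloc[[-1]]], ignore_index=True)
print(result)
```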
I am trying to merge a pandas dataframe with a pivot table and it changes the column names. Can I retain the original column names from pivot without having them merged into a single column?
df1:
pn number_of_records date
A103310 0 2017-09-01
B103309 0 2017-06-01
C103308 0 2017-03-01
D103307 2 2016-12-01
E103306 2 2016-09-01
df2 which is a pivot table:
pn t1
a1 b1 c1
A103310 3432 324 124
B103309 342 123 984
C103308 435 487 245
D103307 879 358 234
E103306 988 432 235
doing a merge on this dataframe gives me:
df1_df2 = pd.merge(df1,df2,how="left",on="pn")
gives me the column names as:
pn number_of_records date (t1,a1) (t1,b1) (t1,c1)
How can I instead have them as:
pn number_of_records date t1
a1 b1 c1
in the dataframe after the merge?
Add a level to the columns of df1
pd.concat([df1], axis=1, keys=['']).swaplevel(0, 1, 1).merge(df2, on='pn')
pn number_of_records date t1
a1 b1 c1
0 A103310 0 2017-09-01 3432 324 124
1 B103309 0 2017-06-01 342 123 984
2 C103308 0 2017-03-01 435 487 245
3 D103307 2 2016-12-01 879 358 234
4 E103306 2 2016-09-01 988 432 235
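An equivalent spelling, shown here as a sketch, is to lift df1's flat columns into a two-level MultiIndex with an empty second level and merge on the full tuple key (the two-row frames below are trimmed stand-ins for the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'pn': ['A103310', 'B103309'],
                    'number_of_records': [0, 0],
                    'date': ['2017-09-01', '2017-06-01']})
# df2 mimics the pivot's MultiIndex columns: ('pn', '') plus ('t1', ...)
df2 = pd.DataFrame([['A103310', 3432, 324, 124],
                    ['B103309', 342, 123, 984]],
                   columns=pd.MultiIndex.from_tuples(
                       [('pn', ''), ('t1', 'a1'), ('t1', 'b1'), ('t1', 'c1')]))

# Give df1 a matching two-level header with an empty second level,
# so the merge key lines up as ('pn', '') on both sides
df1.columns = pd.MultiIndex.from_product([df1.columns, ['']])
merged = df1.merge(df2, on=[('pn', '')], how='left')
print(merged)
```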
I have two Pandas dataframes, namely: habitat_family and habitat_species. I want to populate habitat_species based on the taxonomical lookupMap and the values in habitat_family:
import pandas as pd
import numpy as np
species = ['tiger', 'lion', 'mosquito', 'ladybug', 'locust', 'seal', 'seabass', 'shark', 'dolphin']
families = ['mammal','fish','insect']
lookupMap = {'tiger':'mammal', 'lion':'mammal', 'mosquito':'insect', 'ladybug':'insect', 'locust':'insect',
'seal':'mammal', 'seabass':'fish', 'shark':'fish', 'dolphin':'mammal' }
habitat_family = pd.DataFrame({'id': range(1, 11),
                               'mammal': [101, 123, 523, 562, 546, 213, 562, 234, 987, 901],
                               'fish':   [625, 254, 929, 827, 102, 295, 174, 777, 123, 763],
                               'insect': [345, 928, 183, 645, 113, 942, 689, 539, 789, 814]
                               }, index=range(1, 11), columns=['id', 'mammal', 'fish', 'insect'])
habitat_species = pd.DataFrame(0.0, index=range(1,11), columns=species)
# My highly inefficient solution:
for id in habitat_family.index:          # loop through habitat ids
    for spec in species:                 # loop through species
        corresp_family = lookupMap[spec]
        habitat_species.loc[id, spec] = habitat_family.loc[id, corresp_family]
The nested for loops above do the job. But in reality the sizes of my dataframes are massive, and using for loops is not feasible.
Is there a more efficient method to achieve this using maybe dataframe.apply() or a similar function?
EDIT: The desired output habitat_species is:
habitat_species
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
You don't need any loops at all. Check it out:
In [12]: habitat_species = habitat_family[pd.Series(species).replace(lookupMap)]
In [13]: habitat_species.columns = species
In [14]: habitat_species
Out[14]:
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
[10 rows x 9 columns]
First of all, fantastically written question. Thanks.
I would suggest making a DataFrame for each family, and concatenating at the end:
You'll need to reverse your lookupMap:
In [80]: d = {'mammal': ['dolphin', 'lion', 'seal', 'tiger'], 'insect': ['ladybug', 'locust', 'mosquito'], 'fish':
['seabass', 'shark']}
So as an example:
In [83]: k, v = 'mammal', d['mammal']
In [86]: pd.DataFrame([habitat_family[k] for _ in v], index=v).T
Out[86]:
dolphin lion seal tiger
1 101 101 101 101
2 123 123 123 123
3 523 523 523 523
4 562 562 562 562
5 546 546 546 546
6 213 213 213 213
7 562 562 562 562
8 234 234 234 234
9 987 987 987 987
10 901 901 901 901
[10 rows x 4 columns]
Now do that for each family (initializing results first; on Python 3, dict.iteritems is dict.items):
In [88]: results = []
   ....: for k, v in d.items():
   ....:     results.append(pd.DataFrame([habitat_family[k] for _ in v], index=v).T)
And concat:
In [89]: habitat_species = pd.concat(results, axis=1)
In [90]: habitat_species
Out[90]:
dolphin lion seal tiger ladybug locust mosquito seabass shark
1 101 101 101 101 345 345 345 625 625
2 123 123 123 123 928 928 928 254 254
3 523 523 523 523 183 183 183 929 929
4 562 562 562 562 645 645 645 827 827
5 546 546 546 546 113 113 113 102 102
6 213 213 213 213 942 942 942 295 295
7 562 562 562 562 689 689 689 174 174
8 234 234 234 234 539 539 539 777 777
9 987 987 987 987 789 789 789 123 123
10 901 901 901 901 814 814 814 763 763
[10 rows x 9 columns]
You might consider passing the families as the key parameter to concat if you want a hierarchical index for the columns with (family, species) pairs.
Some profiling, since you said performance matters:
# Mine
In [97]: %%timeit
   ....: results = []
   ....: for k, v in d.items():
   ....:     results.append(pd.DataFrame([habitat_family[k] for _ in v], index=v).T)
   ....: habitat_species = pd.concat(results, axis=1)
   ....:
1 loops, best of 3: 296 ms per loop
# Yours
In [98]: %%timeit
   ....: for id in habitat_family.index:          # loop through habitat ids
   ....:     for spec in species:                 # loop through species
   ....:         corresp_family = lookupMap[spec]
   ....:         habitat_species.loc[id, spec] = habitat_family.loc[id, corresp_family]
10 loops, best of 3: 21.5 ms per loop
# Dan's
In [102]: %%timeit
.....: habitat_species = habitat_family[Series(species).replace(lookupMap)]
.....: habitat_species.columns = species
.....:
100 loops, best of 3: 2.55 ms per loop
Looks like Dan wins by a longshot!
This might be the most pandonic:
In [1]: habitat_species.apply(lambda x: habitat_family[lookupMap[x.name]])
Out[1]:
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
%timeit habitat_species.apply(lambda x: habitat_family[lookupMap[x.name]])
1000 loops, best of 3: 1.57 ms per loop
As far as I can tell, the data in the columns doesn't change; the columns are merely repeated for each corresponding animal.
I.e., if you just had a tiger and a lion, you would want a resulting dataframe with the mammal column repeated twice and the headers changed?
In that case, you can do:
habitat_species = pd.DataFrame(index=range(1, 11))
for key, value in lookupMap.items():
    habitat_species[key] = habitat_family[value]
This creates a new column in the habitat_species dataframe named by key and fills it with the values of the corresponding column in habitat_family, whose name is given by value.
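On Python 3 / recent pandas, the same idea can also be written without any loop: select the family column for each species (list selection happily repeats columns) and relabel. A minimal sketch on a trimmed-down version of the question's data:

```python
import pandas as pd

species = ['tiger', 'lion', 'mosquito']
lookupMap = {'tiger': 'mammal', 'lion': 'mammal', 'mosquito': 'insect'}
habitat_family = pd.DataFrame({'mammal': [101, 123], 'insect': [345, 928]},
                              index=[1, 2])

# Select the family column for each species (duplicates are allowed),
# then relabel the resulting columns with the species names
habitat_species = (habitat_family[[lookupMap[s] for s in species]]
                   .set_axis(species, axis=1))
print(habitat_species)
```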