Delete duplicate values in a row within a Pandas DataFrame (Python)

What is the expression to remove any row whose values are all duplicates of each other in a pandas DataFrame? For example (note: the first column is the index (date), followed by four columns of data):
1983-02-16 512 517 510 514,
1983-02-17 513 520 513 517,
1983-02-18 500 500 500 500 <-- duplicate values,
1983-02-21 505 505 496 496
Delete the row of duplicate values and end up with this:
1983-02-16 512 517 510 514,
1983-02-17 513 520 513 517,
1983-02-21 505 505 496 496
I could only find how to do this by columns, not rows. Many thanks in advance,
Peter

A slightly more elegant/dynamic (but perhaps less performant) version:
In [11]: msk = df1.apply(lambda col: df1[1] != col).any(axis=1)
Out[11]:
0 True
1 True
2 False
3 True
dtype: bool
In [12]: msk.index = df1.index # iloc doesn't support masking
In [13]: df1.loc[msk]
Out[13]:
1 2 3 4
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-21 505 505 496 496

import pandas as pd
import io
content = '''\
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-18 500 500 500 500
1983-02-21 505 505 496 496'''
df = pd.read_table(io.StringIO(content), parse_dates=[0], header=None, sep=r'\s+',
                   index_col=0)
index = (df[1] == df[2]) & (df[1] == df[3]) & (df[1] == df[4])
df = df.loc[~index]
print(df)
yields
1 2 3 4
0
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-21 505 505 496 496
df.loc can be used to select rows. df = df.loc[~index] selects all rows where index is False.
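If the number of columns is not fixed, the column-by-column comparison can be replaced by counting distinct values per row. The following is only a sketch under the assumption that "duplicate row" means every column in the row holds the same value; the variable name result is made up for illustration.

```python
import io
import pandas as pd

content = """\
1983-02-16 512 517 510 514
1983-02-17 513 520 513 517
1983-02-18 500 500 500 500
1983-02-21 505 505 496 496"""

df = pd.read_csv(io.StringIO(content), sep=r"\s+", header=None,
                 index_col=0, parse_dates=[0])

# nunique(axis=1) counts the distinct values in each row; a row whose
# columns all hold the same value has exactly one distinct value.
result = df[df.nunique(axis=1) > 1]
print(result)
```

The row for 1983-02-21 (505 505 496 496) is kept because it still contains two distinct values; only rows that are constant across all columns are dropped.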

Related

sort pivot/dataframe without All row pandas/python

I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by column "All" without taking the "Total" row into account. I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!

If the Total row is the last one, you can sort other rows and then concat the last row:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833

You can try the following, but note that it relies on DataFrame.append, which raises a FutureWarning from pandas 1.4 and was removed in pandas 2.0, so it only works on older versions:
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833

You can get the sorted order without Total (assuming here the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB. As we are sorting only an indexer here, this should be very fast.
output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833

You can simply exclude the last row before sorting (note that the result then no longer contains the Total row):
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
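If the Total row is not guaranteed to be last, it can be picked out by label instead of by position. This is a minimal sketch, assuming name is an ordinary column (as the integer index in the outputs above suggests) and that the row is labelled exactly "Total"; is_total and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "C", "B", "Total"],
    "x": [155, 206, 368, 729],
    "y": [202, 149, 215, 566],
    "z": [218, 45, 275, 538],
    "All": [575, 400, 858, 1833],
})

# Boolean mask marking the summary row, wherever it sits in the frame.
is_total = df["name"] == "Total"

# Sort only the data rows, then put the summary row back at the bottom.
result = pd.concat([
    df[~is_total].sort_values("All", ascending=False),
    df[is_total],
])
print(result)
```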

math operation in list of list

I have created 1000 lists containing some values. I want to apply a math operation to the elements of each list and save the results in another list that I want to plot. Each list has the shape shown below, and I want to subtract the (i-1)th element from the ith element (the distance between two consecutive elements).
[[ 4 29 73 111 130 140 167 231 248 267 284 298 320 333
379 404 421 433 475 510 523 534 544 558 575 602 617 630
661 672 685 698 711 731 742 764 780 828 842 854 874 885
903 916 944 961 985 996 1013 1032 1054 1064 1077 1109 1122 1138
1205 1233 1249 1282 1299 1311 1326 1337 1372 1409 1426 1437 1511 1549
1578 1591 1604 1646]]
I have written the code below, but it does not work: I get an index out of range error.
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import find_peaks
CASES = [f for f in sorted(os.listdir('.')) if f.startswith('config')]
plt.rcParams.update({'font.size': 14})
maxnum = np.max([int(os.path.splitext(f)[0].split('_')[1]) for f in CASES])
CASES = ['configuration_%d.out' % i for i in range(maxnum)]
gg = []
my_l_h = []
for i, d in enumerate(CASES):
    a = np.loadtxt(d).T
    x = a[3]
    peaks, _ = find_peaks(x, distance=10)
    gg = [peaks]
    L_h = np.array(gg)
    for numbers in gg:
        jp = L_h[:,i]-L_h[:,i-1]
        my_l_h.append(jp)
print(my_l_h)
t = np.arange(0,len(my_l_h))
plt.plot(t,my_l_h)
plt.show()
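One way to get the distance between consecutive elements without indexing by hand is np.diff, which returns a[i] - a[i-1] for every i. The sketch below is not the asker's pipeline: consecutive_gaps is a made-up helper, and the sample arrays are just the first few values from the list shown in the question, standing in for the peaks found in each file.

```python
import numpy as np

def consecutive_gaps(peak_lists):
    """Return, for each array of peak positions, the gaps between neighbours."""
    # np.diff yields an array one element shorter than its input.
    return [np.diff(np.asarray(peaks)) for peaks in peak_lists]

# Stand-in data: two short "peak" arrays of different lengths.
peak_lists = [
    np.array([4, 29, 73, 111, 130, 140]),
    np.array([4, 29, 73]),
]

gaps = consecutive_gaps(peak_lists)
print(gaps[0])  # [25 44 38 19 10]
print(gaps[1])  # [25 44]
```

Because each file can yield a different number of peaks, the results are kept as a list of arrays rather than forced into one 2-D array; each entry can then be plotted on its own, for example with plt.plot(gaps[0]).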

ValueError: Axes instance argument was not found in a figure, Question with same name has no answer

I am trying to create a seaborn FacetGrid to plot the distribution of every column in my DataFrame decathlon, to check for normality. The data looks like this:
P100m Plj Psp Phj P400m P110h Ppv Pdt Pjt P1500
0 938 1061 773 859 896 911 880 732 757 752
1 839 975 870 749 887 878 880 823 863 741
2 814 866 841 887 921 939 819 778 884 691
3 872 898 789 878 848 879 790 790 861 804
4 892 913 742 803 816 869 1004 789 854 699
... ... ... ... ... ... ... ... ... ...
7963 755 760 604 714 812 794 482 571 539 780
7964 830 845 524 767 786 783 601 573 562 535
7965 819 804 653 840 791 699 659 461 448 632
7966 804 720 539 758 830 782 731 487 425 729
7967 687 809 692 714 565 741 804 527 738 523
I am relatively new to python and I can't understand my error. My attempt to format the data and create the grid is as such:
import seaborn as sns
df_stacked = decathlon.stack().reset_index(1).rename({'level_1': 'column', 0: 'values'}, axis=1)
g = sns.FacetGrid(df_stacked, row = 'column')
g = g.map(plt.hist, "values")
However, I receive the following error:
ValueError: Axes instance argument was not found in a figure
Can anyone explain what exactly this error means and how I would go about fixing it?
EDIT
df_stacked looks as such:
column values
0 P100m 938
0 Plj 1061
0 Psp 773
0 Phj 859
0 P400m 896
... ...
7967 P110h 741
7967 Ppv 804
7967 Pdt 527
7967 Pjt 738
7967 P1500 523

I encountered a similar issue when running a Jupyter Notebook.
My solution involved:
1. Restarting the notebook.
2. Re-running the imports: %matplotlib inline; import matplotlib.pyplot as plt

As you did not post a full working example, this is a bit of a guess.
What might go wrong is the line g = g.map(plt.hist, "values"), because the error is raised deep within matplotlib. A similar case appears in another SO question, where a different call outside matplotlib, pylab.sca(axes[i]), triggered the same error inside matplotlib.
Likely you installed or updated something in your (conda?) environment (changes in environment paths?), and after the next reboot it was found.
I also wonder where plt.hist comes from; fully typed it should resemble matplotlib.pyplot.hist. But I am guessing (waiting for your updated example code).

How can I get product of last 12 months from the current row using pandas

I want the product of the last 12 months of data, counted back from the current row.
Date Open
21/06/11 839.9
22/06/11 853.35
23/06/11 846.55
24/06/11 874.15
27/06/11 866.7
28/06/11 878.9
29/06/11 875.7
30/06/11 888.7
01/07/11 907
04/07/11 874.4
05/07/11 869.3
06/07/11 848.85
07/07/11 858
08/07/11 873
11/07/11 854
12/07/11 847.5
13/07/11 853.05
14/07/11 863.3
15/07/11 867.7
18/07/11 871.9
19/07/11 867.5
20/07/11 886
21/07/11 875.95
22/07/11 866
25/07/11 892
26/07/11 888.25
27/07/11 875
28/07/11 855
29/07/11 840
01/08/11 838
02/08/11 827.55
03/08/11 826.75
04/08/11 828
05/08/11 799.5
08/08/11 776.7
09/08/11 753
10/08/11 785.35
11/08/11 768.35
12/08/11 783
16/08/11 760
17/08/11 760.5
18/08/11 757.7
19/08/11 731.05
22/08/11 731
23/08/11 760.35
24/08/11 764
25/08/11 761.6
26/08/11 751
29/08/11 731.1
30/08/11 765
02/09/11 796.7
05/09/11 794.5
06/09/11 783.2
07/09/11 824
08/09/11 833.5
09/09/11 852.15
12/09/11 810.35
13/09/11 813.2
14/09/11 813.9
15/09/11 833
16/09/11 850
19/09/11 825
20/09/11 823
21/09/11 850.9
22/09/11 823.95
23/09/11 773.9
26/09/11 769.2
27/09/11 774
28/09/11 799.75
29/09/11 790.5
30/09/11 803.5
03/10/11 791.2
04/10/11 784
05/10/11 772.55
07/10/11 786.7
10/10/11 804.25
11/10/11 835
12/10/11 829.4
13/10/11 850
14/10/11 842
17/10/11 867
18/10/11 825
19/10/11 825.5
20/10/11 834.85
21/10/11 840
24/10/11 848
25/10/11 855
26/10/11 879
28/10/11 899.7
31/10/11 898
01/11/11 870.5
02/11/11 855
03/11/11 867.75
04/11/11 905
08/11/11 879
09/11/11 890.05
11/11/11 859
14/11/11 891.4
15/11/11 871
16/11/11 859.1
17/11/11 845.05
18/11/11 800.3
21/11/11 800
22/11/11 788.1
23/11/11 789.9
24/11/11 775
25/11/11 769.7
28/11/11 765
29/11/11 782
30/11/11 756.7
01/12/11 799
02/12/11 797
05/12/11 808.35
07/12/11 807
08/12/11 802
09/12/11 769.9
12/12/11 760.55
13/12/11 723.9
14/12/11 738
15/12/11 731.9
16/12/11 749
19/12/11 719.2
20/12/11 741.7
21/12/11 727
22/12/11 741.35
23/12/11 760
26/12/11 747.05
27/12/11 766
28/12/11 757.7
29/12/11 733.65
30/12/11 713
02/01/12 696.8
03/01/12 712.25
04/01/12 727.4
05/01/12 715
06/01/12 697.05
07/01/12 716.7
09/01/12 714.45
10/01/12 712
11/01/12 737.9
12/01/12 747.5
13/01/12 742
16/01/12 729.95
17/01/12 716
18/01/12 762
19/01/12 789
20/01/12 790
23/01/12 755.3
24/01/12 774.6
25/01/12 788.7
27/01/12 800
30/01/12 813.9
31/01/12 804.5
01/02/12 818.9
02/02/12 835
03/02/12 830
06/02/12 845.9
07/02/12 842
08/02/12 847
09/02/12 856.75
10/02/12 850.35
13/02/12 841.1
14/02/12 846.9
15/02/12 854.2
16/02/12 831
17/02/12 822.05
21/02/12 817.5
22/02/12 848
23/02/12 832
24/02/12 833.5
27/02/12 821.8
28/02/12 789.05
29/02/12 805.05
01/03/12 811.8
02/03/12 816.25
03/03/12 811
05/03/12 812.05
06/03/12 797
07/03/12 776.55
09/03/12 775.3
12/03/12 790
13/03/12 803.45
14/03/12 828
15/03/12 818
16/03/12 780
19/03/12 781
20/03/12 756.1
21/03/12 760
22/03/12 765.9
23/03/12 743.8
26/03/12 743.9
27/03/12 738
28/03/12 730
29/03/12 718
30/03/12 729.5
02/04/12 749.35
03/04/12 744.25
04/04/12 745
09/04/12 740.05
10/04/12 746
11/04/12 739
12/04/12 733.3
13/04/12 746.05
16/04/12 747.1
17/04/12 754.8
18/04/12 750
19/04/12 753.9
20/04/12 740.05
23/04/12 725.85
24/04/12 739
25/04/12 734.1
26/04/12 737.1
27/04/12 741.3
28/04/12 739.8
30/04/12 737.5
02/05/12 747.9
03/05/12 738.5
04/05/12 733.4
07/05/12 715
08/05/12 718
09/05/12 702
10/05/12 697.25
11/05/12 693
14/05/12 698
15/05/12 679
16/05/12 675
17/05/12 680.25
18/05/12 676.9
21/05/12 686.5
22/05/12 704.6
23/05/12 685.2
24/05/12 694
25/05/12 695
28/05/12 692
29/05/12 702.2
30/05/12 699.65
31/05/12 697
01/06/12 707.35
04/06/12 677
05/06/12 696
06/06/12 704.45
07/06/12 721.05
08/06/12 718
11/06/12 732.7
12/06/12 715
13/06/12 722.25
14/06/12 716
15/06/12 718.5
18/06/12 730.35
19/06/12 717
20/06/12 738
21/06/12 734
22/06/12 713.55
25/06/12 714.2
26/06/12 717.5
27/06/12 726.4
28/06/12 724.4
29/06/12 725.1
02/07/12 735.5
03/07/12 739.95
04/07/12 740
05/07/12 734.95
06/07/12 738
09/07/12 729
10/07/12 731.45
11/07/12 733.45
12/07/12 721.9
13/07/12 720
16/07/12 720
17/07/12 724.8
18/07/12 718
19/07/12 720.2
20/07/12 722.3
23/07/12 715
24/07/12 721
25/07/12 720.4
26/07/12 720.9
27/07/12 719
30/07/12 723
31/07/12 731.6
01/08/12 740.25
02/08/12 742.1
03/08/12 735
06/08/12 748.05
07/08/12 786.05
08/08/12 785.05
09/08/12 788.9
10/08/12 777.65
13/08/12 779.5
14/08/12 787.9
16/08/12 802.05
17/08/12 817.9
21/08/12 816
22/08/12 809.2
23/08/12 810.55
24/08/12 791.75
27/08/12 786
28/08/12 786.85
29/08/12 791
30/08/12 779.75
31/08/12 780
03/09/12 768
04/09/12 763.95
05/09/12 775.25
06/09/12 766.3
07/09/12 778.7
08/09/12 793.5
10/09/12 800
11/09/12 789.5
12/09/12 793.5
13/09/12 798.1
14/09/12 813
17/09/12 848.1
18/09/12 870.2
I tried something along these lines but did not find a solution:
df['val']= df['Open'].last('12M').transform('prod')
How can I get the result?

Note that df.tail(12) takes the last 12 rows, not the last 12 calendar months. If the product of the last 12 values of df['Open'] is what you need, you could do something like this:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Date'], inplace=True)
df.sort_index(inplace=True)
df.tail(12).prod()
which gives you
Open 2.843636e+34
dtype: float64

I think you can adapt the following example to get what you need:
# example with 7 days
import pandas as pd
dates = pd.date_range('1/1/2018', periods=7, freq='d')
values = [4,3,7,5,3,2,3]
df = pd.DataFrame({'col1':values}, index=dates)
# get product of last 2 days
df['col1'].last('2d').prod()
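To get a product for every row rather than a single number, a time-based rolling window may be closer to what the question asks. This is only a sketch: rolling_product is a hypothetical helper, and because rolling() needs a fixed-length offset, a window such as "365D" would have to stand in for "12 months" (an approximation, not an exact calendar year). The demo uses a 2-day window so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

def rolling_product(series, window):
    """Product of all values in a trailing time window ending at each row.

    `series` must have a sorted DatetimeIndex; `window` is a fixed-length
    offset string such as "2D" or "365D".
    """
    # raw=True passes plain NumPy arrays to np.prod, which is faster.
    return series.rolling(window).apply(np.prod, raw=True)

dates = pd.date_range("2018-01-01", periods=5, freq="D")
df = pd.DataFrame({"Open": [2.0, 3.0, 4.0, 5.0, 6.0]}, index=dates)

# A 2-day window covers the current row and the row one day earlier.
df["val"] = rolling_product(df["Open"], "2D")
print(df)
```

With the data from the question, the Date column would first need to be parsed as day/month/year, set as the index and sorted, after which something like rolling_product(df['Open'], '365D') could be tried.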

How to display a sequence of numbers in column-major order?

Program description:
Find all the prime numbers between 1 and 4,027 and print them in a table which
"reads down", using as few rows as possible, and using as few sheets of paper
as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height
of the columns should all be the same, except for perhaps the last column,
which might have a few blank entries towards its bottom row.
The plan for my first function is to find all prime numbers in the range above and put them in a list. Then I want my second function to display the list in a table that reads from top to bottom.
2 23 59
3 29 61
5 31 67
7 37 71
11 41 73
13 43 79
17 47 83
19 53 89
etc.
This is all I've been able to come up with myself:
def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list
def displayPrimes():
    pass
print(findPrimes(4027))
I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop I believe. Do I have to do something similar to that?

Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours—mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.
I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating them.
For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated as shown in the output at the end.
from itertools import zip_longest
import locale
import math
locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting
def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
            for iterable in zip_longest(*iterables, fillvalue=_NULL)]
def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]
def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap
    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table
    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                           for num in row)))
def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            prime_list.append(number)
    return prime_list
primes = find_primes(4027)
tabularize(80, 15, primes)
Output:
---- Page 1 ----
1 47 113 197 281 379 463 571 659 761 863
2 53 127 199 283 383 467 577 661 769 877
3 59 131 211 293 389 479 587 673 773 881
5 61 137 223 307 397 487 593 677 787 883
7 67 139 227 311 401 491 599 683 797 887
11 71 149 229 313 409 499 601 691 809 907
13 73 151 233 317 419 503 607 701 811 911
17 79 157 239 331 421 509 613 709 821 919
19 83 163 241 337 431 521 617 719 823 929
23 89 167 251 347 433 523 619 727 827 937
29 97 173 257 349 439 541 631 733 829 941
31 101 179 263 353 443 547 641 739 839 947
37 103 181 269 359 449 557 643 743 853 953
41 107 191 271 367 457 563 647 751 857 967
43 109 193 277 373 461 569 653 757 859 971
---- Page 2 ----
977 1,069 1,187 1,291 1,427 1,511 1,613 1,733 1,867 1,987 2,087
983 1,087 1,193 1,297 1,429 1,523 1,619 1,741 1,871 1,993 2,089
991 1,091 1,201 1,301 1,433 1,531 1,621 1,747 1,873 1,997 2,099
997 1,093 1,213 1,303 1,439 1,543 1,627 1,753 1,877 1,999 2,111
1,009 1,097 1,217 1,307 1,447 1,549 1,637 1,759 1,879 2,003 2,113
1,013 1,103 1,223 1,319 1,451 1,553 1,657 1,777 1,889 2,011 2,129
1,019 1,109 1,229 1,321 1,453 1,559 1,663 1,783 1,901 2,017 2,131
1,021 1,117 1,231 1,327 1,459 1,567 1,667 1,787 1,907 2,027 2,137
1,031 1,123 1,237 1,361 1,471 1,571 1,669 1,789 1,913 2,029 2,141
1,033 1,129 1,249 1,367 1,481 1,579 1,693 1,801 1,931 2,039 2,143
1,039 1,151 1,259 1,373 1,483 1,583 1,697 1,811 1,933 2,053 2,153
1,049 1,153 1,277 1,381 1,487 1,597 1,699 1,823 1,949 2,063 2,161
1,051 1,163 1,279 1,399 1,489 1,601 1,709 1,831 1,951 2,069 2,179
1,061 1,171 1,283 1,409 1,493 1,607 1,721 1,847 1,973 2,081 2,203
1,063 1,181 1,289 1,423 1,499 1,609 1,723 1,861 1,979 2,083 2,207
---- Page 3 ----
2,213 2,333 2,423 2,557 2,687 2,789 2,903 3,037 3,181 3,307 3,413
2,221 2,339 2,437 2,579 2,689 2,791 2,909 3,041 3,187 3,313 3,433
2,237 2,341 2,441 2,591 2,693 2,797 2,917 3,049 3,191 3,319 3,449
2,239 2,347 2,447 2,593 2,699 2,801 2,927 3,061 3,203 3,323 3,457
2,243 2,351 2,459 2,609 2,707 2,803 2,939 3,067 3,209 3,329 3,461
2,251 2,357 2,467 2,617 2,711 2,819 2,953 3,079 3,217 3,331 3,463
2,267 2,371 2,473 2,621 2,713 2,833 2,957 3,083 3,221 3,343 3,467
2,269 2,377 2,477 2,633 2,719 2,837 2,963 3,089 3,229 3,347 3,469
2,273 2,381 2,503 2,647 2,729 2,843 2,969 3,109 3,251 3,359 3,491
2,281 2,383 2,521 2,657 2,731 2,851 2,971 3,119 3,253 3,361 3,499
2,287 2,389 2,531 2,659 2,741 2,857 2,999 3,121 3,257 3,371 3,511
2,293 2,393 2,539 2,663 2,749 2,861 3,001 3,137 3,259 3,373 3,517
2,297 2,399 2,543 2,671 2,753 2,879 3,011 3,163 3,271 3,389 3,527
2,309 2,411 2,549 2,677 2,767 2,887 3,019 3,167 3,299 3,391 3,529
2,311 2,417 2,551 2,683 2,777 2,897 3,023 3,169 3,301 3,407 3,533
---- Page 4 ----
3,539 3,581 3,623 3,673 3,719 3,769 3,823 3,877 3,919 3,967 4,019
3,541 3,583 3,631 3,677 3,727 3,779 3,833 3,881 3,923 3,989 4,021
3,547 3,593 3,637 3,691 3,733 3,793 3,847 3,889 3,929 4,001 4,027
3,557 3,607 3,643 3,697 3,739 3,797 3,851 3,907 3,931 4,003
3,559 3,613 3,659 3,701 3,761 3,803 3,853 3,911 3,943 4,007
3,571 3,617 3,671 3,709 3,767 3,821 3,863 3,917 3,947 4,013
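For a single table without paging, the column-major layout itself can be sketched with list slicing alone. This is a simplified illustration, not a replacement for the code above: column_major_rows and format_table are made-up names, there is no locale formatting, and the column count is passed in directly rather than derived from a page width.

```python
import math

def column_major_rows(items, num_cols):
    """Return the rows of a table that reads down each column first."""
    num_rows = math.ceil(len(items) / num_cols)
    # Row r holds items r, r + num_rows, r + 2*num_rows, ...
    return [items[r::num_rows] for r in range(num_rows)]

def format_table(items, num_cols):
    """Right-justify every entry to the width of the largest number."""
    width = len(str(max(items)))
    return "\n".join(
        " ".join(f"{n:>{width}}" for n in row)
        for row in column_major_rows(items, num_cols)
    )

primes = [2, 3, 5, 7, 11, 13, 17]
print(format_table(primes, 3))
```

Slicing with a step of num_rows naturally leaves the last column short when the items do not fill the table evenly, which matches the requirement that only the last column may have blank entries.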
