I need help comparing data within a table in Python

I have the following table:
a1b1  a1Eb1  a1b2  a1Eb2  a2b1  a2Eb1  a2b2  a2Eb2  a3b1  a3Eb1  a3b2  a3Eb2
   2     20     8     54     3     56     3     67     2     78     7     75
   8     30     6     67     6     35     4     56     3     85     6     74
   5     54     4     64     7     23     6     48     4     67     4     82
   6     65     7     53     8     27     7     35     5     25     3     64
   4     34     2     52     4     28     8     27     6     94     2     29
I want to compare the following data:
a1b1 vs a1b2;
then generate arrays containing
a1b1  a1b2  minor a1b1
   2     8          20
a1b2  a1b1  minor a1b2
   6     8          30
and so on for each row of the table,
and for each of the following comparisons:
a2b1 vs a2b2;
a3b1 vs a3b2;
I have tried to do it with pandas in Python:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a1b1': [2, 8, 5, 6, 4],
                   'a1Eb1': [20, 30, 54, 65, 34],
                   'a1b2': [8, 6, 4, 7, 2],
                   'a1Eb2': [54, 67, 64, 53, 52],
                   'a2b1': [3, 6, 7, 8, 4],
                   'a2Eb1': [56, 35, 23, 27, 28],
                   'a2b2': [3, 4, 6, 7, 8],
                   'a2Eb2': [67, 56, 48, 35, 27],
                   'a3b1': [2, 3, 4, 5, 6],
                   'a3Eb1': [78, 85, 67, 25, 94],
                   'a3b2': [7, 6, 4, 3, 2],
                   'a3Eb2': [75, 74, 82, 64, 29],
                   })
but I don't know how to go on.
Expected output:
For the first row, a1b1 < a1b2, so print the following:
df1 = pd.DataFrame({'a1b1': [2],
                    'a1b2': [8],
                    'a1Eb1': [20]})
This can be a DataFrame, a list, or any other data structure.

If you want to display only specific columns of your DataFrame, put a list of column names inside the indexing brackets, which gives the [[ and ]] syntax after the name of the DataFrame (df). You can list 2, 3, or even all of the columns, as long as each name is quoted and the names are separated by commas.
df[['a1b1','a1b2']] # to display two columns
df[['a2b1','a2b2']]
df[['a3b1','a3b2']]
To display 3 columns, it could for example be:
df[['a3b1','a3b2','a3Eb1']]
and so on.
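Selecting columns only displays each pair; it does not do the row-by-row comparison the question asks for. Here is a minimal sketch of one way to do that, under some assumptions that the question leaves open: the helper name compare_pair and the output column names are made up for illustration, ties are treated as "the b1 column is the minor one", and the third value is taken from the E column that belongs to the smaller value (the question's second example shows 30 there, so the intended rule may differ).

```python
import numpy as np
import pandas as pd

# Only the a1 columns are included here to keep the sketch short.
df = pd.DataFrame({
    'a1b1': [2, 8, 5, 6, 4], 'a1Eb1': [20, 30, 54, 65, 34],
    'a1b2': [8, 6, 4, 7, 2], 'a1Eb2': [54, 67, 64, 53, 52],
})

def compare_pair(frame, prefix):
    """Hypothetical helper: compare <prefix>b1 with <prefix>b2 row by row."""
    b1, b2 = frame[f'{prefix}b1'], frame[f'{prefix}b2']
    e1, e2 = frame[f'{prefix}Eb1'], frame[f'{prefix}Eb2']
    first_is_minor = b1 <= b2  # assumption: ties count as "b1 is minor"
    return pd.DataFrame({
        'minor_col': np.where(first_is_minor, f'{prefix}b1', f'{prefix}b2'),
        'minor': np.where(first_is_minor, b1, b2),
        'major': np.where(first_is_minor, b2, b1),
        # assumption: report the E value that belongs to the smaller column
        'E_of_minor': np.where(first_is_minor, e1, e2),
    }, index=frame.index)

result = compare_pair(df, 'a1')
print(result)
```

With the full table, the same call can be repeated for each prefix, for example {p: compare_pair(df, p) for p in ['a1', 'a2', 'a3']}.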

Related

Converting time format to seconds in a pandas dataframe

I have a df with time data and I would like to convert these values to seconds (see example below).
   Compression_level  Size (M)  Real time (s)  User time (s)  Sys time (s)
0                  0       265       0:19.938       0:24.649       0:3.062
1                  1        76       0:17.910       0:25.929       0:3.098
2                  2        74       1:02.619       0:27.724       0:3.014
3                  3        73       0:20.607       0:27.937       0:3.193
4                  4        67       0:19.598       0:28.853       0:2.925
5                  5        67       0:21.032       0:30.119       0:3.206
6                  6        66       0:27.013       0:31.462       0:3.106
7                  7        65       0:27.337       0:36.226       0:3.060
8                  8        64       0:37.651       0:47.246       0:2.933
9                  9        64       0:59.241        1:8.333       0:3.027
This is the output I would like to obtain.
df["Real time (s)"]
0 19.938
1 17.910
2 62.619
...
I have some useful code, but I do not know how to apply it across a data frame:
import time
import datetime
x = time.strptime("00:01:00", "%H:%M:%S")
datetime.timedelta(hours=x.tm_hour, minutes=x.tm_min, seconds=x.tm_sec).total_seconds()
Prepend 00: (for 0 hours) to each string with Series.radd, pass the result to to_timedelta, and then call Series.dt.total_seconds:
df["Real time (s)"] = pd.to_timedelta(df["Real time (s)"].radd('00:')).dt.total_seconds()
print (df)
   Compression_level  Size (M)  Real time (s)  User time (s)  Sys time (s)
0                  0       265         19.938       0:24.649       0:3.062
1                  1        76         17.910       0:25.929       0:3.098
2                  2        74         62.619       0:27.724       0:3.014
3                  3        73         20.607       0:27.937       0:3.193
4                  4        67         19.598       0:28.853       0:2.925
5                  5        67         21.032       0:30.119       0:3.206
6                  6        66         27.013       0:31.462       0:3.106
7                  7        65         27.337       0:36.226       0:3.060
8                  8        64         37.651       0:47.246       0:2.933
9                  9        64         59.241        1:8.333       0:3.027
Solution for processing multiple columns:
def to_td(x):
    return pd.to_timedelta(x.radd('00:')).dt.total_seconds()
cols = ["Real time (s)", "User time (s)", "Sys time (s)"]
df[cols] = df[cols].apply(to_td)
print (df)
   Compression_level  Size (M)  Real time (s)  User time (s)  Sys time (s)
0                  0       265         19.938         24.649         3.062
1                  1        76         17.910         25.929         3.098
2                  2        74         62.619         27.724         3.014
3                  3        73         20.607         27.937         3.193
4                  4        67         19.598         28.853         2.925
5                  5        67         21.032         30.119         3.206
6                  6        66         27.013         31.462         3.106
7                  7        65         27.337         36.226         3.060
8                  8        64         37.651         47.246         2.933
9                  9        64         59.241         68.333         3.027
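If the strings always have the form minutes:seconds, the same conversion can also be sketched by hand, which makes the arithmetic explicit. This is only an illustration under that assumption (no hours field); mmss_to_seconds is a hypothetical helper name and the two-row frame is a cut-down sample of the question's data.

```python
import pandas as pd

# Small sample in the same "M:SS.fff" layout as the question's columns.
df = pd.DataFrame({
    "Real time (s)": ["0:19.938", "1:02.619"],
    "User time (s)": ["0:24.649", "1:8.333"],
})

def mmss_to_seconds(col):
    """Convert a Series of 'minutes:seconds' strings to float seconds."""
    # Split on ':' into two columns (0 = minutes, 1 = seconds) and make them numeric.
    parts = col.str.split(":", expand=True).astype(float)
    return parts[0] * 60 + parts[1]

for name in ["Real time (s)", "User time (s)"]:
    df[name] = mmss_to_seconds(df[name])

print(df)
```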

Python Pandas: Shape of passed values is (126, 5), indices imply (84, 5) [duplicate]

This question already has answers here:
Pandas concat: ValueError: Shape of passed values is blah, indices imply blah2
I have 2 dataframes with 84 rows each, clearly the same length, but when I try to concat them into one df (by column, so that name, Edge and Offset sit to the right of Latitude and Longitude), I get this error.
What is going on?
Latitude Longitude
0 45.403538 -75.735729
1 45.403506 -75.735699
2 45.409095 -75.722588
3 45.409069 -75.722552
4 45.413496 -75.714184
.. ... ...
79 45.415609 -75.644769
80 45.416073 -75.645726
81 45.416193 -75.638802
82 45.416172 -75.638223
[84 rows x 2 columns]
name Edge Offset
0 TUN-W 1 3000
1 TUN-E 2 3000
2 BAY-W 5 102510
3 BAY-E 6 102579
4 PIM-W 5 186035
.. ... ... ...
37 PTSTTW 33 52710
38 PTSTTE 34 18997
39 PAG11 40 24362
40 PAG14 50 9927
41 PHND15 177 11662
[84 rows x 3 columns]
This is my line of code:
output_df = pd.concat( [output_df, input_df], axis=1)
I got it:
name Edge Offset
0 TUN-W 1 3000
1 TUN-E 2 3000
2 BAY-W 5 102510
3 BAY-E 6 102579
4 PIM-W 5 186035
.. ... ... ...
37 PTSTTW 33 52710
38 PTSTTE 34 18997
39 PAG11 40 24362
40 PAG14 50 9927
41 PHND15 177 11662
The indexing was messed up. Somewhere in the middle, the index resets to 0 instead of continuing to count up through all 84 rows.
So I used segments_df = segments_df_start.append(segments_df_end).reset_index() (in an earlier part of my code) to fix the index of that dataframe before passing it on.
So ALWAYS remember to check your indexes, and try .reset_index(), when troubleshooting!!
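A minimal sketch of that fix on toy data. The frame names and the four sample rows are made up for illustration. Since DataFrame.append was removed in pandas 2.0, the sketch stacks the two halves with pd.concat, and it passes drop=True so the old index is discarded rather than kept as an extra column.

```python
import pandas as pd

# Two halves that each carry their own 0..n index, as in the question.
start = pd.DataFrame({"name": ["TUN-W", "TUN-E"], "Edge": [1, 2]})
end = pd.DataFrame({"name": ["BAY-W", "BAY-E"], "Edge": [5, 6]})

stacked = pd.concat([start, end])          # index is 0, 1, 0, 1
print(stacked.index.is_unique)             # duplicated labels are the warning sign

segments = stacked.reset_index(drop=True)  # index is now 0, 1, 2, 3

coords = pd.DataFrame({
    "Latitude": [45.403538, 45.403506, 45.409095, 45.409069],
    "Longitude": [-75.735729, -75.735699, -75.722588, -75.722552],
})

# With matching unique indexes, the column-wise concat lines rows up one to one.
combined = pd.concat([coords, segments], axis=1)
print(combined)
```

Passing ignore_index=True to the first pd.concat call would have the same effect as the separate reset_index step.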

How can I multiply a numpy array with pandas series?

I have a NumPy array of shape (50,):
array([1.01255569e+00, 1.04166667e+00, 1.07158165e+00, 1.10229277e+00,
1.13430127e+00, 1.16387337e+00, 1.20365912e+00, 1.24007937e+00,
1.27877238e+00, 1.31856540e+00, 1.35281385e+00, 1.40291807e+00,
1.45180023e+00, 1.49700599e+00, 1.55183116e+00, 1.60051216e+00,
1.66002656e+00, 1.73370319e+00, 1.80115274e+00, 1.87687688e+00,
1.95312500e+00, 2.04750205e+00, 2.14961307e+00, 2.23613596e+00,
2.34082397e+00, 2.48015873e+00, 2.61780105e+00, 2.75027503e+00,
2.91715286e+00, 3.07881773e+00, 3.31564987e+00, 3.57142857e+00,
3.81679389e+00, 4.17362270e+00, 4.51263538e+00, 4.95049505e+00,
5.59284116e+00, 6.17283951e+00, 7.02247191e+00, 8.03858521e+00,
9.72762646e+00, 1.17370892e+01, 1.47928994e+01, 2.10084034e+01,
3.12500000e+01, 4.90196078e+01, 9.25925926e+01, 2.08333333e+02,
5.00000000e+02, 1.25000000e+03])
And I have a pandas dataframe of length 50 as well.
x
0 9.999740e-01
1 9.981870e-01
2 9.804506e-01
3 9.187764e-01
4 8.031568e-01
5 6.544660e-01
6 5.032716e-01
7 3.707446e-01
8 2.650768e-01
9 1.857835e-01
10 1.285488e-01
11 8.824506e-02
12 6.030141e-02
13 4.111080e-02
14 2.800453e-02
15 1.907999e-02
16 1.301045e-02
17 8.882996e-03
18 6.074386e-03
19 4.161024e-03
20 2.855636e-03
21 1.963543e-03
22 1.352791e-03
23 9.338596e-04
24 6.459459e-04
25 4.476854e-04
26 3.108912e-04
27 2.163201e-04
28 1.508106e-04
29 1.053430e-04
30 7.372442e-05
31 5.169401e-05
32 3.631486e-05
33 2.555852e-05
34 1.802129e-05
35 1.272995e-05
36 9.008454e-06
37 6.386289e-06
38 4.535381e-06
39 3.226546e-06
40 2.299394e-06
41 1.641469e-06
42 1.173785e-06
43 8.407618e-07
44 6.032249e-07
45 4.335110e-07
46 3.120531e-07
47 2.249870e-07
48 1.624726e-07
49 1.175140e-07
And I want to multiply each element of the NumPy array by the corresponding value in the pandas column.
Example:
1.01255569e+00*9.999740e-01
1.04166667e+00*9.981870e-01
Desired output:
a NumPy array of the same size.
You can just use the .values property of the 'x' column in your pandas DataFrame:
df['x'].values * arr
where df is your DataFrame and arr is your array.
The above expression returns the result as a NumPy array. If you want a pandas Series instead, you can omit .values:
df['x'] * arr
Or np.multiply, where n is the array and p is the DataFrame:
print(np.multiply(n,p['x'].values))
Or pd.Series.multiply:
print(np.array(p['x'].multiply(n)))
Or pd.Series.mul:
print(np.array(p['x'].mul(n)))
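To make the difference between the array and Series results concrete, here is a small self-contained sketch. The three-element arr and df are made-up stand-ins for the 50-element data in the question, and to_numpy() is used as the current equivalent of .values.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 4.0])
df = pd.DataFrame({"x": [0.5, 0.25, 0.125]})

# Element-wise product as a plain NumPy array.
as_array = df["x"].to_numpy() * arr

# Element-wise product that keeps the pandas index (a Series).
as_series = df["x"] * arr

print(as_array)
print(as_series)
```

Both forms multiply position by position, so the array and the column must have the same length.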

Create a pandas dataframe from dictionary whilst maintaining order of columns

When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact? (i.e. Day, Visitors, Bounce Rate)
One approach is to pass the columns argument.
Ex:
import pandas as pd
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1,2,3,4,5,6]),
('Visitors', [43,34,65,56,29,76]),
('Bounce Rate', [65,67,78,65,45,52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names, which becomes really inconvenient if you have many keys, you may use
df = pd.DataFrame(web_stats, columns=list(web_stats.keys()))
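On current versions the reordering should no longer occur: plain dicts keep insertion order from Python 3.7 on, and pandas has followed that order when building a DataFrame from a dict since version 0.23. A short check, assuming such an environment (the data is a shortened copy of the question's):

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3],
             'Visitors': [43, 34, 65],
             'Bounce Rate': [65, 67, 78]}

# No columns argument: the column order follows the dict's insertion order.
df = pd.DataFrame(web_stats)
print(list(df.columns))
```

Passing columns explicitly is still a reasonable choice when the code also has to run on older installations.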

Plot histogram using two columns (values, counts) in python dataframe

I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values in ten-year intervals between the maximum and minimum values and get the cumulative frequencies for each interval from the Counts column and then plot a histogram. Is there a way to do this using matplotlib ?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If you need bins, one possible solution is with pd.cut:
#helper df with min and max ages
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
                         '35-39','40-44','45-49','50-54','55-59','60-64','65+'],
                    'Min':[0, 15,20,25,30,35,40,45,50,55,60,65],
                    'Max':[14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
G Max Min
0 14 yo and younger 14 0
1 15-19 19 15
2 20-24 24 20
3 25-29 29 25
4 30-34 34 30
5 35-39 39 35
6 40-44 44 40
7 45-49 49 45
8 50-54 54 50
9 55-59 59 55
10 60-64 64 60
11 65+ 120 65
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()
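You can use pd.cut() to bin your data, and then plot the binned sums with plot(kind='bar'):
import numpy as np
import pandas as pd
nBins = 10
# nBins + 1 edges give nBins equal-width intervals between the minimum and maximum age
my_bins = np.linspace(patient_dets.Age.min(), patient_dets.Age.max(), nBins + 1)
patient_dets.groupby(pd.cut(patient_dets.Age, bins=my_bins, include_lowest=True))['Counts'].sum().plot(kind='bar')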
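For the ten-year intervals the question asks about, the bin edges can be built from the data instead of being fixed to a number of bins. A minimal sketch, assuming the column names Age and Counts and using only the five rows shown in the question; the edges are rounded out to multiples of ten, and intervals are closed on the left so that an age of 60 falls in [60, 70).

```python
import numpy as np
import pandas as pd

patient_dets = pd.DataFrame({"Age": [60, 45, 21, 34, 10],
                             "Counts": [1204, 700, 400, 56, 150]})

# Round the range out to multiples of 10, e.g. ages 10..60 give edges 10, 20, ..., 70.
low = patient_dets["Age"].min() // 10 * 10
high = (patient_dets["Age"].max() // 10 + 1) * 10
edges = np.arange(low, high + 1, 10)

# Left-closed intervals [10, 20), [20, 30), ...; observed=False keeps empty bins.
groups = pd.cut(patient_dets["Age"], bins=edges, right=False)
binned = patient_dets.groupby(groups, observed=False)["Counts"].sum()
print(binned)

# binned.plot.bar() would then draw the chart (requires matplotlib).
```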
