I was trying to compute subtotals and grand totals for my data, but I got stuck and couldn't produce the output I want. Could you please assist?
data.groupby(['Column4', 'Column5'])['Column1'].count()
Current Output:
Column4     Column5
2018-05-19  Duplicate         220
            Informative         3
2018-05-20  Actionable          5
            Duplicate         270
            Informative       859
            Non-actionable      2
2018-05-21  Actionable          8
            Duplicate         295
            Informative        17
2018-05-22  Actionable         10
            Duplicate         424
            Informative        36
2018-05-23  Actionable          8
            Duplicate         157
            Informative         3
2018-05-24  Actionable          5
            Duplicate          78
            Informative         3
2018-05-25  Actionable          3
            Duplicate          80
Expected Output:
Row Labels   Actionable  Duplicate  Informative  Non-actionable  Grand Total
5/19/2018                219        3                            222
5/20/2018    5           270        859          2               1136
5/21/2018    8           295        17                           320
5/22/2018    10          424        36                           470
5/23/2018    8           157        3                            168
5/24/2018    5           78         3                            86
5/25/2018    3           80                                      83
Grand Total  39          1523       921          2               2485
This is sample data. Could you please take a look at it alongside my question above? I was getting minor errors; maybe I didn't provide the right data earlier. Please check.
Column1     Column2   Column3  Column4    Column5         Column6
BI Account  Subject1  2:12 PM  5/19/2018  Duplicate       Name1
PI Account  Subject2  1:58 PM  5/19/2018  Actionable      Name2
AI Account  Subject3  5:01 PM  5/19/2018  Non-Actionable  Name3
BI Account  Subject4  5:57 PM  5/19/2018  Informative     Name4
PI Account  Subject5  6:59 PM  5/19/2018  Duplicate       Name5
AI Account  Subject6  8:07 PM  5/19/2018  Actionable      Name1
You can use pivot to get from your current output to your desired output and then sum to calculate the totals you want.
import pandas as pd
# df here is the groupby result from the question (a Series indexed by Column4 and Column5)
df = df.reset_index().pivot(index='Column4', columns='Column5', values='Column1')
# Add a grand total column, summing across all other columns
df['Grand Total'] = df.sum(axis=1)
df.columns.name = None
df.index.name = None
# Add the grand total row, summing all values in a column
df.loc['Grand Total', :] = df.sum()
df is now:
Actionable Duplicate Informative Non-actionable Grand Total
2018-05-19 NaN 220.0 3.0 NaN 223.0
2018-05-20 5.0 270.0 859.0 2.0 1136.0
2018-05-21 8.0 295.0 17.0 NaN 320.0
2018-05-22 10.0 424.0 36.0 NaN 470.0
2018-05-23 8.0 157.0 3.0 NaN 168.0
2018-05-24 5.0 78.0 3.0 NaN 86.0
2018-05-25 3.0 80.0 NaN NaN 83.0
Grand Total 39.0 1524.0 921.0 2.0 2486.0
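If you would rather show zeros and integer counts instead of NaN (an optional follow-up, not part of the answer above), you can convert the pivoted frame afterwards:

df = df.fillna(0).astype(int)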
Just using crosstab:
pd.crosstab(df['Column4'], df['Column5'], margins=True, margins_name='Grand Total')
Take a look at this:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html
You need to pivot your table:
df.reset_index().pivot(index='Column4', columns='Column5', values='Column1')
Related
I have a traffic data that looks like this. Here, each column have data in format meters:seconds. Like in row 1 column 2, 57:9 represents 57 meters and 9 seconds.
0       1       2       3       4       5       6       7        8        9
0:0     57:9    166:34  178:37  203:44  328:63  344:65  436:77   737:108  None
0:0     166:34  178:37  203:43  328:61  436:74  596:51  737:106  None     None
0:0     57:6    166:30  178:33  203:40  328:62  344:64  436:74   596:91   None
0:0     203:43  328:61  None    None    None    None    None     None     None
0:0     57:7    166:20  178:43  203:10  328:61  None    None     None     None
I want to extract the meters values from the dataframe and store them in a list in ascending order, then create a new dataframe whose column headers are the meters values from that list. For each row, the seconds value should be placed beneath the matching meters column of the parent dataframe. A missing meters:seconds pair should be replaced by NaN, and the pair currently at that position should move to the next column within the same row.
The desired outcome is:
list = [0,57,166,178,203,328,344,436,596,737]
dataframe:
0    57   166  178  203  328  344   436   596   737
0    9    34   37   44   63   65    77    NaN   108
0    NaN  34   37   43   61   NaN   74    51    106
0    6    30   33   40   62   64    74    91    None
0    NaN  NaN  NaN  43   61   None  None  None  None
0    7    20   43   10   61   None  None  None  None
I know I must use a loop to iterate over the whole dataframe. I am new to Python, so I am unable to solve this. I tried using str.split(), but it works only on one column. I have 98 columns and 290 rows, and this is just one month of data; I will have 12 months of data. So I need suggestions and help.
Try:
tmp = df1.apply(
    lambda x: dict(
        map(int, val.split(":"))
        for val in x
        if isinstance(val, str) and ":" in val
    ),
    axis=1,
).to_list()
out = pd.DataFrame(tmp)
print(out[sorted(out.columns)])
Prints:
0 57 166 178 203 328 344 436 596 737
0 0 9.0 34.0 37.0 44 63 65.0 77.0 NaN 108.0
1 0 NaN 34.0 37.0 43 61 NaN 74.0 51.0 106.0
2 0 6.0 30.0 33.0 40 62 64.0 74.0 91.0 NaN
3 0 NaN NaN NaN 43 61 NaN NaN NaN NaN
4 0 7.0 20.0 43.0 10 61 NaN NaN NaN NaN
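For context, a minimal illustration (not part of the answer above) of why this works: pd.DataFrame aligns the dict keys, i.e. the meters values, into columns and fills any missing keys with NaN, which is what produces the NaN cells in the output.

import pandas as pd

# Two toy row-dicts with different meters keys (illustrative values only)
print(pd.DataFrame([{0: 0, 57: 9}, {0: 0, 203: 43}]))
#    0   57   203
# 0  0  9.0   NaN
# 1  0  NaN  43.0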
I have a dataframe like this:
Alliance_name    Company_name  TOAD   MBA   Class  EVE   TBD    Sur
Shinva group     HVC corp      8845   1135  0      12    12128  1
Shinva group     LDN corp      11     1243  133    121   113    1
Telegraph group  Freename LLC  5487   223   928    0     0      21
Telegraph group  Grt           0      7543  24     3213  15     21
Zero group       PetZoo crp    5574   0     2      0     6478   1
Zero group       Elephant      48324  0     32     118   4      1
I need to subtract the values of rows that share the same Alliance_name, column by column.
(It would be ideal not to subtract the last column, Sur, but that is not the main goal.)
I know that for addition we can make something like this:
df = df.groupby('Alliance_name').sum()
But I don't know how to do this with subtraction.
The result should be like this (if we don't subtract the last column):
Alliance_name    Company_name         TOAD    MBA    Class  EVE    TBD    Sur
Shinva group     HVC corp LDN corp    8834    -108   -133   -109   12015  1
Telegraph group  Freename LLC Grt     5487    -7320  904    -3212  -15    21
Zero group       PetZoo crp Elephant  -42750  0      -30    -118   6474   1
Thanks for your help!
You could invert the values to subtract, and then sum them.
df.loc[df.Alliance_name.duplicated(keep="first"), ["TOAD", "MBA", "Class", "EVE", "TBD", "Sur"]] *= -1
df.groupby("Alliance_name").sum()
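If you also want the concatenated company names and an untouched Sur column, as in the expected output, one possible extension of the same idea (a sketch only, reusing the column names from the question) is:

num_cols = ["TOAD", "MBA", "Class", "EVE", "TBD"]
signed = df.copy()
# Flip the sign of the repeated rows in each alliance so that summing subtracts them
signed.loc[signed["Alliance_name"].duplicated(), num_cols] *= -1
result = signed.groupby("Alliance_name").agg(
    {"Company_name": " ".join, **{c: "sum" for c in num_cols}, "Sur": "first"}
)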
The .first() and .last() groupby methods can be useful for such tasks.
You can organize the columns you want to skip/compute
>>> df.columns
Index(['Alliance_name', 'Company_name', 'TOAD', 'MBA', 'Class', 'EVE', 'TBD',
'Sur'],
dtype='object')
>>> alliance, company, *cols, sur = df.columns
>>> groups = df.groupby(alliance)
>>> company = groups.first()[[company]]
>>> sur = groups.first()[sur]
>>> groups = groups[cols]
And use .first() - .last() directly:
>>> groups.first() - groups.last()
TOAD MBA Class EVE TBD
Alliance_name
Shinva group 8834 -108 -133 -109 12015
Telegraph group 5487 -7320 904 -3213 -15
Zero group -42750 0 -30 -118 6474
Then .join() the other columns back in
>>> company.join(groups.first() - groups.last()).join(sur).reset_index()
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
1 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
2 Zero group PetZoo crp -42750 0 -30 -118 6474 1
Another approach:
>>> df - df.drop(columns=['Company_name', 'Sur']).groupby('Alliance_name').shift(-1)
Alliance_name Class Company_name EVE MBA Sur TBD TOAD
0 NaN -133.0 NaN -109.0 -108.0 NaN 12015.0 8834.0
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 904.0 NaN -3213.0 -7320.0 NaN -15.0 5487.0
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN -30.0 NaN -118.0 0.0 NaN 6474.0 -42750.0
5 NaN NaN NaN NaN NaN NaN NaN NaN
You can then drop the all-NaN rows and fill the remaining values from the original df.
>>> ((df - df.drop(columns=['Company_name', 'Sur'])
            .groupby('Alliance_name').shift(-1)).dropna(how='all')[df.columns].fillna(df))
Alliance_name Company_name TOAD MBA Class EVE TBD Sur
0 Shinva group HVC corp 8834 -108 -133 -109 12015 1
2 Telegraph group Freename LLC 5487 -7320 904 -3213 -15 21
4 Zero group PetZoo crp -42750 0 -30 -118 6474 1
I have multiple dataframes that I need to merge into a single dataset based on a unique identifier (uid), and on the timedelta between dates in each dataframe.
Here's a simplified example of the dataframes:
df1
uid tx_date last_name first_name meas_1
0 60 2004-01-11 John Smith 1.3
1 60 2016-12-24 John Smith 2.4
2 61 1994-05-05 Betty Jones 1.2
3 63 2006-07-19 James Wood NaN
4 63 2008-01-03 James Wood 2.9
5 65 1998-10-08 Tom Plant 4.2
6 66 2000-02-01 Helen Kerr 1.1
df2
uid rx_date last_name first_name meas_2
0 60 2004-01-14 John Smith A
1 60 2017-01-05 John Smith AB
2 60 2017-03-31 John Smith NaN
3 63 2006-07-21 James Wood A
4 64 2002-04-18 Bill Jackson B
5 65 1998-10-08 Tom Plant AA
6 65 2005-12-01 Tom Plant B
7 66 2013-12-14 Helen Kerr C
Basically I am trying to merge records for the same person from two separate sources, where the link between records for unique individuals is the 'uid', and the link between rows (where it exists) for each individual is a fuzzy relationship between 'tx_date' and 'rx_date' that can (usually) be accommodated by a specific time delta. There won't always be an exact or fuzzy match between dates, data could be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid's.
I need to be able to concatenate rows where the 'uid' columns match, and where the absolute time delta between 'tx_date' and 'rx_date' is within a given range (e.g. max delta of 14 days). Where the time delta is outside that range, or one of either 'tx_date' or 'rx_date' is missing, or where the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should be something like:
uid tx_date rx_date first_name last_name meas_1 meas_2
0 60 2004-01-11 2004-01-14 John Smith 1.3 A
1 60 2016-12-24 2017-01-05 John Smith 2.4 AB
2 60 NaT 2017-03-31 John Smith NaN NaN
3 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood NaN NaN
6 64 2002-04-18 NaT Bill Jackson NaN B
7 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
8 65 NaT 2005-12-01 Tom Plant NaN B
9 66 2000-02-01 NaT Helen Kerr 1.1 NaN
10 66 NaT 2013-12-14 Helen Kerr NaN C
Seems like pandas.merge_asof should be useful here, but I've not been able to get it to do quite what I need.
Trying merge_asof on two of the real dataframes I have gave an error: ValueError: left keys must be sorted.
As per this question, the problem there was actually due to NaT values in the 'date' column for some rows. I dropped the rows with NaT values and sorted the 'date' columns in each dataframe, but the result still isn't quite what I need.
The code below shows the steps taken.
import pandas as pd
df1['date'] = df1['tx_date']
df1['date'] = pd.to_datetime(df1['date'])
df1['date'] = df1['date'].dropna()
df1 = df1.sort_values('date')
df2['date'] = df2['rx_date']
df2['date'] = pd.to_datetime(df2['date'])
df2['date'] = df2['date'].dropna()
df2 = df2.sort_values('date')
df_merged = (pd.merge_asof(df1, df2, on='date', by='uid', tolerance=pd.Timedelta('14 days'))).sort_values('uid')
Result:
uid tx_date rx_date last_name_x first_name_x meas_1 meas_2
3 60 2004-01-11 2004-01-14 John Smith 1.3 A
6 60 2016-12-24 2017-01-05 John Smith 2.4 AB
0 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
1 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
2 66 2000-02-01 NaT Helen Kerr 1.1 NaN
It looks like a left join rather than a full outer join, so any row in df2 without a match on 'uid' and 'date' in df1 is lost (and, although it's not really clear from this simplified example, I also need to add back the rows where the date was NaT).
Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof, or using some other approach?
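For illustration, one rough sketch of a possible direction (reusing the column names from the example frames; the tolerance and direction are assumptions) would be to keep merge_asof for the fuzzy matching and then concatenate back the df2 rows that never matched, which approximates an outer join:

import pandas as pd

df1['date'] = pd.to_datetime(df1['tx_date'])
df2['date'] = pd.to_datetime(df2['rx_date'])

left = pd.merge_asof(
    df1.dropna(subset=['date']).sort_values('date'),
    df2.dropna(subset=['date']).sort_values('date'),
    on='date', by='uid',
    tolerance=pd.Timedelta('14 days'),
    direction='nearest',
)

# df2 rows whose rx_date never appeared in the asof result
unmatched = df2[~df2['rx_date'].isin(left['rx_date'].dropna())]

# df1 rows with a missing tx_date (dropped before the asof merge) would need the
# same treatment before concatenating.
result = (pd.concat([left, unmatched], sort=False)
          .sort_values(['uid', 'date'])
          .reset_index(drop=True))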
I have two data frames with different sizes
df1
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
and
df2
YearDeci Year Month Day Hour ... Seconds Mb Lat Lon
0 1669.510753 1669 6 4 0 ... 0 NaN 33.400 73.200
1 1720.535519 1720 7 15 0 ... 0 NaN 28.700 77.200
2 1780.000000 1780 0 0 0 ... 0 NaN 35.000 77.000
3 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600
4 1803.665753 1803 9 1 0 ... 0 NaN 30.300 78.800
5 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600
1. I want to compare df1 and df2 based on the column 'YearDeci' and find the common entries and the unique entries (rows other than the common rows).
2. Output the rows of df1 that are common to df2, based on 'YearDeci'.
3. Output the rows of df1 that are unique with respect to df2, based on 'YearDeci'.
NB: a difference in decimal values of up to +/-0.0001 in 'YearDeci' is tolerable.
The expected output is like
row_common=
YearDeci Year Month Day ... Mb Lat Lon
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
row_unique=
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
First compare df1.YearDeci with df2.YearDeci pairwise (each with each).
To perform the comparison, use the np.isclose function with the assumed absolute tolerance.
The result is a boolean array:
first index - index in df1,
second index - index in df2.
Then, using np.argwhere, find the indices of the True values, i.e. the indices of "correlated" rows from df1 and df2, and create a DataFrame from them.
The code to perform the above operations is:
ind = pd.DataFrame(np.argwhere(np.isclose(df1.YearDeci.to_numpy()[:, np.newaxis],
                                          df2.YearDeci.to_numpy()[np.newaxis, :],
                                          atol=0.0001, rtol=0)),
                   columns=['ind1', 'ind2'])
Then, having pairs of indices pointing to "correlated" rows in both DataFrames,
perform the following merge:
result = ind.merge(df1, left_on='ind1', right_index=True)\
            .merge(df2, left_on='ind2', right_index=True, suffixes=['_1', '_2'])
The final step is to drop both "auxiliary index columns" (ind1 and ind2):
result.drop(columns=['ind1', 'ind2'], inplace=True)
The result (divided into 2 parts) is:
YearDeci_1 Year_1 Month_1 Day_1 Magnitude Lat_1 Lon_1 YearDeci_2 \
0 1720.535519 1720 7 15 6.5 28.37 77.09 1720.535519
Year_2 Month_2 Day_2 Hour Seconds Mb Lat_2 Lon_2
0 1720 7 15 0 0 NaN 28.7 77.2
The indices of the common rows are already in the variable ind, so to find the unique entries all we need to do is drop the common rows from df1 according to the indices in ind.
It is convenient to write the common entries to a separate CSV file and read it back into a variable:
df1_common = pd.read_csv("df1_common.csv")
df1_uniq = df1.drop(df1.index[ind.ind1])
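If you would rather avoid the CSV round-trip, both frames can also be derived directly from ind (a small sketch reusing the ind built above; the values in ind.ind1 are positions, hence iloc):

common_pos = ind['ind1'].unique()           # positions of df1 rows that matched df2
df1_common = df1.iloc[common_pos]
df1_uniq = df1.drop(df1.index[common_pos])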
I have the following quarterly data, but there are some dates for which there is no data. I want to create a for loop that iterates through the index and checks whether the value in the assets column is NaN. If it is, it should create a new data frame containing the part of the data frame with no NaN, and the loop should break.
So, for example, the loop starts; between 9/30/2018 and 9/30/2016 everything is OK, then in the next iteration there is a NaN (6/30/2016), so I want to create a data frame containing the rows between 9/30/2018 and 9/30/2016, and the loop breaks.
Note: it has to use some kind of iteration because I want to do this for many Excel files, and the exact date at which the NaN appears can differ between files.
date assets debt
9/30/2018 4193 1824
6/30/2018 4281 1929
3/31/2018 4149 1460
12/31/2017 4238 1404
9/30/2017 3804 1401
6/30/2017 3583 1437
3/31/2017 3404 1451
12/31/2016 3181 1445
9/30/2016 3622 1478
6/30/2016 NaN NaN
3/31/2016 NaN NaN
12/31/2015 2566 225
9/30/2015 NaN NaN
6/30/2015 NaN NaN
3/31/2015 NaN NaN
12/31/2014 2917 342
Here is what I have tried so far:
for date in df.index:
    if df['assets'][df.index == date].empty == True:
        newdf = df[df.index > date]
        break
You can use the numpy function isnan (or the pandas isna method) to find the index of the first NaN, and then use indexing to grab the rest.
idx = np.isnan(df.assets).idxmax() # this is one way
idx = df.assets.isna().idxmax() # this is another way
newdf = df.iloc[:idx]
date assets debt
0 9/30/2018 4193.0 1824.0
1 6/30/2018 4281.0 1929.0
2 3/31/2018 4149.0 1460.0
3 12/31/2017 4238.0 1404.0
4 9/30/2017 3804.0 1401.0
5 6/30/2017 3583.0 1437.0
6 3/31/2017 3404.0 1451.0
7 12/31/2016 3181.0 1445.0
8 9/30/2016 3622.0 1478.0
Putting this in a loop when reading your files should be trivial.
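As a rough sketch of such a loop (the file pattern and the 'assets' column name are assumptions based on the question, and a default RangeIndex is assumed as in the output above):

import glob
import pandas as pd

clean = {}
for path in glob.glob('quarterly_reports/*.xlsx'):   # hypothetical location
    df = pd.read_excel(path)
    if df['assets'].isna().any():
        cut = df['assets'].isna().idxmax()           # index of the first NaN row
        clean[path] = df.iloc[:cut]
    else:
        clean[path] = df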