I have created a pandas dataframe and then converted it into a pivot table.
My pivot table looks like this:
Operators        TotalCB  Qd(cb)  Autopass(cb)
Aircel India          55      11            44
Airtel Ghana          20      17             3
Airtel India          41       9             9
Airtel Kenya           9       4             5
Airtel Nigeria        24      17             7
AT&T USA              18      10             8
I was wondering how to add calculated columns so that my pivot table also includes Autopass% (Autopass(cb)/TotalCB*100), just like we can with the calculated field option in Excel.
I want my pivot table output to be something like below:
Operators        TotalCB  Qd(cb)  Autopass(cb)  Qd(cb)%  Autopass(cb)%
Aircel India          55      11            44      20%            80%
Airtel Ghana          20      17             3      85%            15%
Airtel India          41      29             9      71%            22%
Airtel Kenya           9       4             5      44%            56%
AT&T USA              18      10             8      56%            44%
How do I define a function that calculates these percentage columns, and how do I apply it to my two columns, Qd(cb) and Autopass(cb), to produce the additional calculated columns?
This should do it, assuming data is your pivoted dataframe:
data['Autopass(cb)%'] = data['Autopass(cb)'] / data['TotalCB'] * 100
data['Qd(cb)%'] = data['Qd(cb)'] / data['TotalCB'] * 100
Adding a new column to a dataframe is as simple as df['colname'] = new_series. Here we compute your requested formula as a vectorized operation, which produces a new Series that is assigned to the new column.
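If you also want the percentages rendered as whole-number strings like "20%", as in your expected output, a small sketch (assuming the same data dataframe) is to round and format after the division:

data['Qd(cb)%'] = (data['Qd(cb)'] / data['TotalCB'] * 100).round().astype(int).astype(str) + '%'
data['Autopass(cb)%'] = (data['Autopass(cb)'] / data['TotalCB'] * 100).round().astype(int).astype(str) + '%'

Note that this stores strings, which is fine for display but rules out further arithmetic on those columns; keep the numeric version if you need to compute with them.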
I have an Excel file to analyze, but it contains a lot of data that I don't want to analyze. Can I delete a column if the string SpaceX is not found in its first row? The data looks like the following:
SL#  State  District  10/01/2021  10/01/2021  10/01/2021  11/01/2021  11/01/2021  11/01/2021
                      SpaceX in   Star in     StarX out   SpaceX out  Star out    StarX in
1    wb     al        10          11          12          13          14          15
2    wb     not       23          22          20          24          25          25
Here I want to delete the columns whose first row does not contain SpaceX, and then delete that SpaceX row as well so the remaining rows shift up. The final output will look like this:
SL#  State  District  10/01/2021  11/01/2021
1    wb     al        10          13
2    wb     not       23          24
I tried the loc and iloc functions but have no clue at the moment.
I also checked this answer: Drop columns if rows contain a specific value in Pandas, but my case is different: I'm checking for a substring, not an exact value match.
First, create a boolean mask with the startswith() and fillna() methods:

mask = df.loc[0].str.startswith('SpaceX').fillna(True)

Then use the transpose attribute (T), the loc accessor, and the drop() method:

df = df.T.loc[mask].T.drop(0)
Output of df:

   SL# State District 2021-01-10 00:00:00 2021-01-11 00:00:00
1  1.0    wb       al                  10                  13
2  2.0    wb      not                  23                  24
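An equivalent sketch that avoids the double transpose, assuming the same df and mask as above: because mask was built from row 0, it is indexed by the columns, so it can be used directly as a column selector in loc, with the header row dropped afterwards.

df = df.loc[:, mask].drop(index=0)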
I have a sample dataset here. The real case has a train and a test dataset, both with around 300 columns and 800 rows. I want to filter rows based on a certain value in one column and then set all values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
        'Age': [27, 24, 22, 32, 66, 43],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment': [15, 20, 40, 50, 3, 23],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, some rows have the value "Princi" in the first column. For the rows where the Name column == "Princi", I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:

train.loc[:, 'got':'tod']  # selects the columns I want

and

train.loc[df['column_wanted'] == "that value"]  # selects the rows I want

But how can I combine them? Thanks for your help!
Use the loc accessor: df.loc[boolean_selection, columns]

df.loc[df['Name'].eq('Princi'), 'Address':'Payment'] = 0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
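Combining the two attempts from the question is the same pattern: the row condition goes in the first slot of loc and the column slice in the second, so the assignment touches only that intersection. With the question's own placeholder names (train, column_wanted, and the 'got':'tod' slice), a sketch would be:

train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0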
I'm pulling in the data frame using tabula. Unfortunately, the data is arranged in rows, as shown below. I need to take the first 23 rows and use them as column headers for the remainder of the data, giving one row under those 23 headers for each of about 60 clinics.
Col \
0 Date
1 Clinic
2 Location
3 Clinic Manager
4 Lease Cost
5 Square Footage
6 Lease Expiration
8 Care Provided
9 # of Providers (Full Time)
10 # FTE's Providing Care
11 # Providers (Part-Time)
12 Patients seen per week
13 Number of patients in rooms per provider
14 Number of patients in waiting room
15 # Exam Rooms
16 Procedure rooms
17 Other rooms
18 Specify other
20 Other data:
21 TI Needs:
23 Conclusion & Recommendation
24 Date
25 Clinic
26 Location
27 Clinic Manager
28 Lease Cost
29 Square Footage
30 Lease Expiration
32 Care Provided
33 # of Providers (Full Time)
34 # FTE's Providing Care
35 # Providers (Part-Time)
36 Patients seen per week
37 Number of patients in rooms per provider
38 Number of patients in waiting room
39 # Exam Rooms
40 Procedure rooms
41 Other rooms
42 Specify other
44 Other data:
45 TI Needs:
47 Conclusion & Recommendation
Val
0 9/13/2017
1 Gray Medical Center
2 1234 E. 164th Ave Thornton CA 12345
3 Jane Doe
4 $23,074.80 Rent, $5,392.88 CAM
5 9,840
6 7/31/2023
8 Family Medicine
9 12
10 14
11 1
12 750
13 4
14 2
15 31
16 1
17 X-Ray, Phlebotomist/blood draw
18 NaN
20 Facilities assistance needed. 50% of business...
21 Paint and Carpet (flooring is in good conditio...
23 Lay out and occupancy flow are good for this p...
24 9/13/2017
25 Main Cardiology
26 12000 Wall St Suite 13 Main CA 12345
27 John Doe
28 $9610.42 Rent, $2,937.33 CAM
29 4,406
30 5/31/2024
32 Cardiology
33 2
34 11, 2 - P.T.
35 2
36 188
37 0
38 2
39 6
40 0
41 1 - Pacemaker, 1 - Treadmill, 1- Echo, 1 - Ech...
42 Nurse Office, MA station, Reading Room, 2 Phys...
44 Occupied in Emerus building. Needs facilities ...
45 New build out, great condition.
47 Practice recently relocated from 84th and Alco...
I was able to get my data frame in a better place by fixing the headers. I'm re-posting the first two "groups" of data to better illustrate the structure of the data frame. Everything repeats (headers and values) for each clinic.
Try this:

df2 = pd.DataFrame(df[23:].values.reshape(-1, 23),
                   columns=df[:23][0])
print(df2)

Here 23 is the number of header rows, i.e. the number of columns each row of the result df2 will have; replace it with the desired number of columns.
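As a minimal sketch of why the reshape works, here is the same idea with 3 headers instead of 23, on a small hypothetical single-column frame shaped like the question's data (header block first, then values repeating per clinic):

import pandas as pd

df = pd.DataFrame({0: ['Date', 'Clinic', 'Location',
                       '9/13/2017', 'Gray Medical Center', '1234 E. 164th Ave',
                       '9/13/2017', 'Main Cardiology', '12000 Wall St']})

n = 3  # 23 in the question
df2 = pd.DataFrame(df[n:].values.reshape(-1, n), columns=df[:n][0])
print(df2)

The values after the header block are laid out end to end, so reshaping them into rows of n columns recovers one row per clinic, labeled by the first n entries.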
I am working with an 80 Gb data set in Python. The data has 30 columns and ~180,000,000 rows.
I am using the chunk size parameter in pd.read_csv to read the data in chunks where I then iterate through the data to create a dictionary of the counties with their associated frequency.
This is where I am stuck. Once I have the list of counties, I want to iterate through the chunks row by row again, summing the values of 2-3 other columns associated with each county, and place the results in a new DataFrame. That would be roughly 4 columns and 3,000 rows, which is much more manageable for my computer.
I really don't know how to do this, this is my first time working with a large data set in python.
import pandas as pd
from collections import defaultdict

df_chunk = pd.read_csv('file.tsv', sep='\t', chunksize=8000000)

county_dict = defaultdict(int)
for chunk in df_chunk:
    for county in chunk['COUNTY']:
        county_dict[county] += 1

for chunk in df_chunk:
    for row in chunk:
        # I don't know where to go from here
I expect to be able to make a DataFrame with a column of all the counties, a column for total sales of product "1" per county, another column for sales of product "2" per county, and then more columns of the same as needed.
The idea
I was not sure whether you have data for different counties (e.g. in the UK or the USA) or countries (around the world), so I decided to use data for countries.

The idea is to:

1. Group the data from each chunk by country.
2. Generate a partial result for this chunk, as a DataFrame with the sums of each column of interest (per country) and the number of rows per country.
3. To allow concatenation of the partial results (in a moment), give each partial result the chunk number as an additional index level.
4. Concatenate the partial results vertically (thanks to the additional index level, each row has a distinct index).
5. Compute the final result (total sums and row counts) as a sum of the above result, grouped by country (discarding the chunk number).
Test data
The source CSV file contains country names and 2 columns to sum (Tab separated):
Country Amount_1 Amount_2
Austria 41 46
Belgium 30 50
Austria 45 44
Denmark 31 42
Finland 42 32
Austria 10 12
France 74 54
Germany 81 65
France 40 20
Italy 54 42
France 51 16
Norway 14 33
Italy 12 33
France 21 30
For test purposes I assumed a chunk size of just 5 rows:
chunksize = 5
Solution
The main processing loop (with its preparatory steps) is as follows:

df_chunk = pd.read_csv('Input.csv', sep='\t', chunksize=chunksize)

chunkPartRes = []  # Partial results from each chunk
chunkNo = 0

for chunk in df_chunk:
    chunkNo += 1
    gr = chunk.groupby('Country')
    # Sum the desired columns and get the size of each group
    res = gr.agg(Amount_1=('Amount_1', sum), Amount_2=('Amount_2', sum))\
        .join(gr.size().rename('Count'))
    # Add a top index level (chunk No), then append
    chunkPartRes.append(pd.concat([res], keys=[chunkNo], names=['ChunkNo']))
To concatenate the above partial results into a single DataFrame,
but still with separate results from each chunk, run:
chunkRes = pd.concat(chunkPartRes)
For my test data, the result is:
                 Amount_1  Amount_2  Count
ChunkNo Country
1       Austria        86        90      2
        Belgium        30        50      1
        Denmark        31        42      1
        Finland        42        32      1
2       Austria        10        12      1
        France        114        74      2
        Germany        81        65      1
        Italy          54        42      1
3       France         72        46      2
        Italy          12        33      1
        Norway         14        33      1
And to generate the final result, summing data from all chunks,
but keeping separation by countries, run:
res = chunkRes.groupby(level=1).sum()
The result is:
         Amount_1  Amount_2  Count
Country
Austria        96       102      3
Belgium        30        50      1
Denmark        31        42      1
Finland        42        32      1
France        186       120      4
Germany        81        65      1
Italy          66        75      2
Norway         14        33      1
To sum up

Even looking only at how the row counts per country are computed, this solution is more "pandasonic" and elegant than using a defaultdict and incrementing it in a loop over every row. Grouping and counting rows per group is significantly faster than a loop operating on individual rows.
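Mapped back onto the question's own file and COUNTY column (the sales column names below are hypothetical placeholders, since the question doesn't name them), the same pattern would be a sketch like:

import pandas as pd

parts = []
for chunk in pd.read_csv('file.tsv', sep='\t', chunksize=8000000):
    gr = chunk.groupby('COUNTY')
    # Per-chunk sums of the sales columns, plus a row count per county
    parts.append(gr[['SALES_1', 'SALES_2']].sum().join(gr.size().rename('Count')))

res = pd.concat(parts).groupby(level=0).sum()

Each county appears in many chunks, so the final groupby(level=0).sum() collapses the per-chunk partial sums into a single row per county.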
I'm trying to sum data from multiple columns in my dataframe by pivoting the table and using aggfunc. My dataframe holds emission data for various regions. I don't want to sum some of the rows, so I make a selection of the rows that I do want to sum. The output, however, has two columns for each year:

- one named True, which gives the sum of the rows I selected (this is the column I want),
- one named False, which gives the sum of the remaining rows that I did not select (this one I would like to drop/omit).

The data is numeric regional data for multiple years, so what I want to do is add up the data from some regions to get data for larger regions. The years are listed as columns.
The data looks something like this:
inp = [{'Scenario':'Baseline', 'Region':'CHINA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':5,'1995':10,'2000':15},
{'Scenario':'Baseline', 'Region':'INDIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':6,'1995':11,'2000':16},
{'Scenario':'Baseline', 'Region':'INDONESIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':7,'1995':12,'2000':17},
{'Scenario':'Baseline', 'Region':'KOREA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':8,'1995':13,'2000':18},
{'Scenario':'Baseline', 'Region':'JAPAN', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':9,'1995':14,'2000':19},
{'Scenario':'Baseline', 'Region':'THAILAND', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':10,'1995':15,'2000':20},
{'Scenario':'Baseline', 'Region':'RUSSIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':11,'1995':16,'2000':21}]
dt = pd.DataFrame(inp)
dt
1990 1995 2000 Region Scenario Unit Variable
0 5 10 15 CHINA Baseline MtCO2eq Methane
1 6 11 16 INDIA Baseline MtCO2eq Methane
2 7 12 17 INDONESIA Baseline MtCO2eq Methane
3 8 13 18 KOREA Baseline MtCO2eq Methane
4 9 14 19 JAPAN Baseline MtCO2eq Methane
5 10 15 20 THAILAND Baseline MtCO2eq Methane
6 11 16 21 RUSSIA Baseline MtCO2eq Methane
I run this piece of code:

dt_test = dt.pivot_table(index=['Scenario', 'Variable', 'Unit'],
                         columns=[(dt['Region'] == 'CHINA') |
                                  (dt['Region'] == 'INDIA') |
                                  (dt['Region'] == 'INDONESIA') |
                                  (dt['Region'] == 'KOREA')],
                         aggfunc=np.sum)
And I get this as output:

                             1990       1995       2000
Region                  False True False True False True
Scenario Variable Unit
Baseline Methane MtCO2eq   30   26    45   46    60   66

If someone could help me out with either a way to drop this False column for all the years, or another nifty way to get the totals that I want, that would be amazing.
Use xs:
print(dt_test.xs(True, axis=1, level=1))

                          1990  1995  2000
Scenario Variable Unit
Baseline Methane  MtCO2eq   26    46    66
But it is better to filter first with isin and boolean indexing:

dt = dt[dt['Region'].isin(['CHINA', 'INDIA', 'INDONESIA', 'KOREA'])]
print(dt)

   1990  1995  2000     Region  Scenario     Unit Variable
0     5    10    15      CHINA  Baseline  MtCO2eq  Methane
1     6    11    16      INDIA  Baseline  MtCO2eq  Methane
2     7    12    17  INDONESIA  Baseline  MtCO2eq  Methane
3     8    13    18      KOREA  Baseline  MtCO2eq  Methane

And then aggregate the sum per group:

dt_test = dt.groupby(['Scenario', 'Variable', 'Unit']).sum()
print(dt_test)
                          1990  1995  2000
Scenario Variable Unit
Baseline Methane  MtCO2eq   26    46    66
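For completeness, the original pivot_table approach also works once the frame is filtered first, so the unwanted False bucket never appears. A sketch, assuming the dt defined above:

regions = ['CHINA', 'INDIA', 'INDONESIA', 'KOREA']
dt_test = (dt[dt['Region'].isin(regions)]
           .pivot_table(values=['1990', '1995', '2000'],
                        index=['Scenario', 'Variable', 'Unit'],
                        aggfunc='sum'))
print(dt_test)

This gives the same single row of totals (26, 46, 66) without any boolean column grouping.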