The array below is the output of my model. How do I append it to my dataframe as the last column?
In[] logireg.predict(X.head(5))
Out[] array([0, 0, 0, 1, 0], dtype=int64)
dataframe data:
age job month
33 blue apr
56 admin jun
37 tech aug
76 retired jun
56 service may
expected output
age job month predict
33 blue apr 0
56 admin jun 0
37 tech aug 0
76 retired jun 1
56 service may 0
Should I use a for loop or the zip function?
You can just assign it directly to a dataframe column. Assuming your dataframe is called dataframe here:
predictions = logireg.predict(X.head(5))
dataframe['predict'] = predictions
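A self-contained sketch of this assignment, with the dataframe rebuilt from the question's data and the prediction array hard-coded in place of the `logireg.predict` call:

```python
import pandas as pd
import numpy as np

# Sample dataframe matching the question's data
dataframe = pd.DataFrame({
    "age": [33, 56, 37, 76, 56],
    "job": ["blue", "admin", "tech", "retired", "service"],
    "month": ["apr", "jun", "aug", "jun", "may"],
})

# Stand-in for logireg.predict(X.head(5)) -- the array from the question
predictions = np.array([0, 0, 0, 1, 0], dtype=np.int64)

# Assigning an array whose length matches the row count creates the new last column
dataframe["predict"] = predictions
print(dataframe)
```

Note that the array length must equal the number of rows. If you only predicted for a subset of rows, assign through `dataframe.loc[subset_index, 'predict'] = ...` instead so the values land on the right rows.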
I have 3 tables/dataframes, all with the same column names. Basically they hold data for different months.
October (df1 name)
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
November (df2 name)
Sales_value Sales_units Unique_Customer_id Countries Month
2000 1000 40 14 Nov
112 200 30 10 Nov
December (df3 name)
Sales_value Sales_units Unique_Customer_id Countries Month
20009090 4809509 4500 30 Dec
etc. This is dummy data; each table has thousands of rows in reality. How can I combine these 3 tables so that the columns appear only once and all rows are kept, with the October rows first, followed by the November rows, followed by the December rows? When I use joins, the column names get repeated.
Expected output:
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
2000 1000 40 14 Nov
112 200 30 10 Nov
20009090 4809509 4500 30 Dec
pd.concat stacks the dataframes vertically, aligning on the shared column names so each column appears only once:
pd.concat([df1, df2, df3])
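A runnable sketch on dummy tables like the question's (the exact values are taken from the sample above). One variation worth knowing: `ignore_index=True` renumbers the combined rows 0..n-1 instead of repeating each table's own index:

```python
import pandas as pd

# Dummy monthly tables with identical columns, as in the question
cols = ["Sales_value", "Sales_units", "Unique_Customer_id", "Countries", "Month"]
df1 = pd.DataFrame([[1000, 10, 4, 1, "Oct"], [20, 2, 4, 3, "Oct"]], columns=cols)
df2 = pd.DataFrame([[2000, 1000, 40, 14, "Nov"], [112, 200, 30, 10, "Nov"]], columns=cols)
df3 = pd.DataFrame([[20009090, 4809509, 4500, 30, "Dec"]], columns=cols)

# Stack the rows in order: October first, then November, then December
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(combined)
```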
I have a sample dataset here. In the real case there are a train and a test dataset, both with around 300 columns and 800 rows. I want to select the rows that have a certain value in one column, and then set all values in those rows from, say, column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav','Princi','Anuj','Nancy'],
'Age':[27, 24, 22, 32,66,43],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
'Payment':[15,20,40,50,3,23],
'Qualification':['Msc', 'MA', 'MCA', 'Phd','MA','MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, the Name column contains the value "Princi" more than once. For the rows where the Name column == "Princi", I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # I could select all those columns
and train.loc[df['column_wanted'] == "that value"]  # I got all those rows
But how can I combine them? Thanks for your help!
Use the .loc accessor: df.loc[boolean row selection, column selection]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
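The same pattern transfers directly to the real dataset in the question: combine the boolean row mask with a label slice of columns (like the 'got':'tod' slice the asker already found) in one .loc call. A self-contained sketch on the sample data:

```python
import pandas as pd

data = {"Name": ["Jai", "Princi", "Gaurav", "Princi", "Anuj", "Nancy"],
        "Age": [27, 24, 22, 32, 66, 43],
        "Address": ["Delhi", "Kanpur", "Allahabad", "Kannauj", "Katauj", "vbinauj"],
        "Payment": [15, 20, 40, 50, 3, 23],
        "Qualification": ["Msc", "MA", "MCA", "Phd", "MA", "MS"]}
df = pd.DataFrame(data)

# Boolean row mask and column label-slice combined in one .loc call;
# unlike positional slicing, 'Address':'Payment' includes both endpoints
df.loc[df["Name"] == "Princi", "Address":"Payment"] = 0
print(df)
```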
I am trying to clean up some data, where I need to keep only the most recent records, but all of them if several appear in the most recent year. What confuses me is that the data are actually organised in "groups". Here is an example dataframe, along with comments that might make it clearer:
method year proteins values
0 John 2017 A 10
1 John 2017 B 20
2 John 2018 A 30 # John's method in 2018 is most recent, keep this line and drop index 0 and1
3 Kate 2018 B 11
4 Kate 2018 C 22 # Kate's method appears only in 2018 so keep both lines (index 3 and 4)
5 Patrick 2017 A 90
6 Patrick 2018 A 80
7 Patrick 2018 B 85
8 Patrick 2018 C 70
9 Patrick 2019 A 60
10 Patrick 2019 C 50 # Patrick's method in 2019 is the most recent of Patrick's so keep index 9 and 10 only
The desired output dataframe does not depend on which proteins are measured, but all the measured proteins should be included:
method year proteins values
0 John 2018 A 30
1 Kate 2018 B 11
2 Kate 2018 C 22
3 Patrick 2019 A 60
4 Patrick 2019 C 50
Hope this is clear. I have tried something like my_df.sort_values('year').drop_duplicates('method', keep='last'), but it gives the wrong output (it keeps only a single row per method). Any ideas? Thank you!
PS: To replicate my initial df, you can copy the below lines:
import pandas as pd
import numpy as np
methodology=["John", "John", "John", "Kate", "Kate", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick"]
year_pract=[2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018, 2018, 2019, 2019]
proteins=['A', 'B', 'A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C']
values=[10, 20, 30, 11, 22, 90, 80, 85, 70, 60, 50]
my_df=pd.DataFrame(zip(methodology,year_pract,proteins,values), columns=['method','year','proteins','values'])
my_df['year']=my_df['year'].astype(str)
my_df['year']=pd.to_datetime(my_df['year'], format='%Y') # the format never works for me and this is why I add the line below
my_df['year']=my_df['year'].dt.year
Because duplicate rows within the most recent year must be kept, use GroupBy.transform with 'max' to broadcast each method's latest year to every row, compare it with the original year column using Series.eq, and filter with boolean indexing:
df = my_df[my_df['year'].eq(my_df.groupby('method')['year'].transform('max'))]
print (df)
method year proteins values
2 John 2018 A 30
3 Kate 2018 B 11
4 Kate 2018 C 22
9 Patrick 2019 A 60
10 Patrick 2019 C 50
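A self-contained version of this approach, rebuilding the sample dataframe and applying the filter:

```python
import pandas as pd

my_df = pd.DataFrame({
    "method": ["John"] * 3 + ["Kate"] * 2 + ["Patrick"] * 6,
    "year": [2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018, 2018, 2019, 2019],
    "proteins": ["A", "B", "A", "B", "C", "A", "A", "B", "C", "A", "C"],
    "values": [10, 20, 30, 11, 22, 90, 80, 85, 70, 60, 50],
})

# Per-method maximum year, broadcast back onto every row of that method
latest = my_df.groupby("method")["year"].transform("max")

# Keep only the rows whose year equals their method's latest year
result = my_df[my_df["year"].eq(latest)]
print(result)
```

Unlike drop_duplicates, which keeps exactly one row per method, this keeps every row from each method's most recent year.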
I'm pulling in the data frame using tabula. Unfortunately, the data is arranged in rows as below. I need to take the first 23 rows and use them as column headers for the remainder of the data. I need each row to contain these 23 headers for each of about 60 clinics.
Col \
0 Date
1 Clinic
2 Location
3 Clinic Manager
4 Lease Cost
5 Square Footage
6 Lease Expiration
8 Care Provided
9 # of Providers (Full Time)
10 # FTE's Providing Care
11 # Providers (Part-Time)
12 Patients seen per week
13 Number of patients in rooms per provider
14 Number of patients in waiting room
15 # Exam Rooms
16 Procedure rooms
17 Other rooms
18 Specify other
20 Other data:
21 TI Needs:
23 Conclusion & Recommendation
24 Date
25 Clinic
26 Location
27 Clinic Manager
28 Lease Cost
29 Square Footage
30 Lease Expiration
32 Care Provided
33 # of Providers (Full Time)
34 # FTE's Providing Care
35 # Providers (Part-Time)
36 Patients seen per week
37 Number of patients in rooms per provider
38 Number of patients in waiting room
39 # Exam Rooms
40 Procedure rooms
41 Other rooms
42 Specify other
44 Other data:
45 TI Needs:
47 Conclusion & Recommendation
Val
0 9/13/2017
1 Gray Medical Center
2 1234 E. 164th Ave Thornton CA 12345
3 Jane Doe
4 $23,074.80 Rent, $5,392.88 CAM
5 9,840
6 7/31/2023
8 Family Medicine
9 12
10 14
11 1
12 750
13 4
14 2
15 31
16 1
17 X-Ray, Phlebotomist/blood draw
18 NaN
20 Facilities assistance needed. 50% of business...
21 Paint and Carpet (flooring is in good conditio...
23 Lay out and occupancy flow are good for this p...
24 9/13/2017
25 Main Cardiology
26 12000 Wall St Suite 13 Main CA 12345
27 John Doe
28 $9610.42 Rent, $2,937.33 CAM
29 4,406
30 5/31/2024
32 Cardiology
33 2
34 11, 2 - P.T.
35 2
36 188
37 0
38 2
39 6
40 0
41 1 - Pacemaker, 1 - Treadmill, 1- Echo, 1 - Ech...
42 Nurse Office, MA station, Reading Room, 2 Phys...
44 Occupied in Emerus building. Needs facilities ...
45 New build out, great condition.
47 Practice recently relocated from 84th and Alco...
I was able to get my data frame in a better place by fixing the headers. I'm re-posting the first 3 "groups" of data to better illustrate the structure of the data frame. Everything repeats (headers and values) for each clinic.
Try this:
df2 = pd.DataFrame(df[23:].values.reshape(-1, 23),
columns=df[:23][0])
print(df2)
Here the number 23 is the number of rows in each group, which becomes the number of columns in the result df; replace it with the number of fields per clinic in your data.
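A toy reproduction of the reshape trick, assuming a single-column dataframe where a 3-row header group repeats for each clinic (3 stands in for the 23 fields in the question; the clinic values are made up from the sample above):

```python
import pandas as pd

# One column; the first 3 rows are the headers, then 3 values per clinic
rows = ["Date", "Clinic", "Location",
        "9/13/2017", "Gray Medical Center", "Thornton CA",
        "9/13/2017", "Main Cardiology", "Main CA"]
df = pd.DataFrame(rows)

n = 3  # number of fields per clinic group
# Reshape the value rows into (n_clinics, n) and use the header rows as columns
df2 = pd.DataFrame(df[n:].values.reshape(-1, n), columns=df[:n][0])
print(df2)
```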
I have a dataset that contains multi-index columns with the first level consisting of a year divided into four quarters. How do I structure the index so as to have 4 sets of months under each quarter?
I found the following piece of code on stack overflow:
index = pd.MultiIndex.from_product([['S1', 'S2'], ['Start', 'Stop']])
print(pd.DataFrame([pd.DataFrame(dic).unstack().values], columns=index))
that gave the following output:
S1 S2
Start Stop Start Stop
0 2013-11-12 2013-11-13 2013-11-15 2013-11-17
However, it couldn't solve my requirement of having different sets of months under each quarter of the year.
My data looks like this:
2015
Q1 Q2 Q3 Q4
Country Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
India 45 54 34 34 45 45 43 45 67 45 56 56
Canada 44 34 12 32 35 45 43 41 60 43 55 21
I wish to input the same structure of the dataset into pandas with the specific set of months under each quarter. How should I go about this?
You can also create a MultiIndex in a few other ways. One of these, which is useful if you have a complicated structure, is to construct it from an explicit set of tuples where each tuple is one hierarchical column. Below I first create all of the tuples that you need of the form (year, quarter, month), make a MultiIndex from these, then assign that as the columns of the dataframe.
import pandas as pd
year = 2015
months = [
("Jan", "Feb", "Mar"),
("Apr", "May", "Jun"),
("Jul", "Aug", "Sep"),
("Oct", "Nov", "Dec"),
]
tuples = [(year, f"Q{i + 1}", month) for i in range(4) for month in months[i]]
multi_index = pd.MultiIndex.from_tuples(tuples)
data = [
[45, 54, 34, 34, 45, 45, 43, 45, 67, 45, 56, 56],
[44, 34, 12, 32, 35, 45, 43, 41, 60, 43, 55, 21],
]
df = pd.DataFrame(data, index=["India", "Canada"], columns=multi_index)
df
# 2015
# Q1 Q2 Q3 Q4
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# India 45 54 34 34 45 45 43 45 67 45 56 56
# Canada 44 34 12 32 35 45 43 41 60 43 55 21
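Once the columns are a MultiIndex, selection works level by level. A short sketch (the frame is rebuilt here with just two quarters so it runs standalone):

```python
import pandas as pd

# Two quarters of the 2015 structure, as (year, quarter, month) column tuples
tuples = [(2015, "Q1", m) for m in ("Jan", "Feb", "Mar")] + \
         [(2015, "Q2", m) for m in ("Apr", "May", "Jun")]
cols = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame([[45, 54, 34, 34, 45, 45],
                   [44, 34, 12, 32, 35, 45]],
                  index=["India", "Canada"], columns=cols)

# Select a whole quarter (all its months) by drilling down the first two levels
q1 = df[2015]["Q1"]

# Or a single month with a full column tuple
feb = df[(2015, "Q1", "Feb")]
print(q1)
print(feb)
```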