Given a DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'AgeAtMedStart': {1: -46.47, 2: 46.47, 3: 46.8, 4: 51.5, 5: 51.5},
     'AgeAtMedStop': {1: 46.8, 2: 46.8, 3: np.nan, 4: -51.9, 5: 51.81},
     'MedContinuing': {1: 'No', 2: 'No', 3: 'Yes', 4: 'No', 5: 'No'},
     'Medication': {1: 'Med1', 2: 'Med2', 3: 'Med3', 4: 'Med4', 5: 'Med4'},
     'YearOfMedStart': {1: 2016.0, 2: 2016.0, 3: 2016.0, 4: 2016.0, 5: 2016.0}}
)
df
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.80 No Med1 2016.0
2 46.47 46.80 No Med2 2016.0
3 46.80 NaN Yes Med3 2016.0
4 51.50 -51.90 No Med4 2016.0
5 51.50 51.81 No Med4 2016.0
I want to filter the DataFrame to retain the rows where any of the numeric values in the "AgeAt*" columns is negative.
My expected output is the row with index 1, since "AgeAtMedStart" is -46.47, and the row with index 4, since "AgeAtMedStop" is -51.9, so the output would be
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.8 No Med1 2016.0
4 51.50 -51.9 No Med4 2016.0
EDIT1:
So I've tried the different answers provided thus far, but all return an empty dataframe. I believe part of the problem is that I also have columns called AgeAtMedStartFlag and AgeAtMedStopFlag, which contain strings. So for this sample csv:
RecordKey Medication CancerSiteForTreatment CancerSiteForTreatmentCode TreatmentLineCodeKey AgeAtMedStart AgeAtMedStartFlag YearOfMedStart MedContinuing AgeAtMedStop AgeAtMedStopFlag ChangeOfTreatment
1 Drug1 Site1 C1.0 First -46.47 Year And Month Are Known But Day Is Missing And Coded To 15 2016 No 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1 Drug2 Site2 C1.1 First 46.47 Year And Month Are Known But Day Is Missing And Coded To 15 2016 No 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1 Drug3 Site3 C1.2 First 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 2016 Yes Yes
2 Drug4 Site4 C1.3 First 51.5 2016 No 51.9 Yes
2 Drug5 Site5 C1.4 First 51.5 2016 No -51.81 Yes
3 Drug6 Site6 C1.5 First 73.93 2016 No 74.42 Yes
3 Drug7 Site7 C1.6 First 73.93 2016 No 74.42 Yes
4 Drug8 Site8 C1.7 First 36.66 2015 No 37.24 Yes
4 Drug9 Site9 C1.8 First 36.66 2015 No 37.24 Yes
4 Drug10 Site10 C1.9 First 36.66 2015 No 37.24 Yes
9 Drug11 Site11 C1.10 First 43.55 2016 No 43.68 Yes
9 Drug12 Site12 C1.11 First 43.22 2016 No 43.49 Yes
9 Drug13 Site13 C1.12 First 43.55 2016 No 43.68 Yes
9 Drug14 Site14 C1.13 First 43.22 2016 No 43.49 Yes
10 Drug15 Site15 C1.14 First 74.42 2016 No 74.84 Yes
10 Drug16 Site16 C1.15 First 73.56 2015 No 73.98 Yes
10 Drug17 Site17 C1.16 First 73.56 2015 No 73.98 No
10 Drug18 Site18 C1.17 First 74.42 2016 No 74.84 No
10 Drug19 Site19 C1.18 First 73.56 2015 No 73.98 No
10 Drug20 Site20 C1.19 First 74.42 2016 No 74.84 No
11 Drug21 Site21 C1.20 First 70.72 2013 No 72.76 No
11 Drug22 Site22 C1.21 First 68.76 2011 No 70.62 No
11 Drug23 Site23 C1.22 First 73.43 2016 No 73.96 No
11 Drug24 Site24 C1.23 First 72.76 2015 No 73.43 No
with this change to my script:
age_df = df.columns[(df.columns.str.startswith('AgeAt')) & (~df.columns.str.endswith('Flag'))]
df[df[age_df] < 0].to_excel('invalid.xlsx', 'Benjamin_Button')
It returns:
RecordKey Medication CancerSiteForTreatment CancerSiteForTreatmentCode TreatmentLineCodeKey AgeAtMedStart AgeAtMedStartFlag YearOfMedStart MedContinuing AgeAtMedStop AgeAtMedStopFlag ChangeOfTreatment
1 -46.47
1
1
2
2 -51.81
3
3
4
4
4
9
9
9
9
10
10
10
10
10
10
11
11
11
11
Can I modify this implementation so that it returns only the rows that contain the negatives and, if possible, the rest of the values for those rows? Or, even better, just the negative ages and the RecordKey for each such row?
Here's a simple one-liner for you. If you need to programmatically determine whether a column is numeric, refer to coldspeed's answer; but if you are OK with explicit column references, a simple method like this will work.
Note that I'm also filling NaNs with 0; this meets your requirement even though the data is missing. NaNs can be handled in other ways, but this will suffice here. If you have missing values in other columns that you'd like to preserve, that can also be done (I didn't include it here for simplicity).
myData = df.fillna(0).query('AgeAtMedStart < 0 or AgeAtMedStop < 0')
Returns:
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.8 No Med1 2016.0
4 51.50 -51.9 No Med4 2016.0
Pandas' native query method is very handy for simple filter expressions.
Refer to the docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
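As an aside, the fillna step is most likely optional here: comparisons involving NaN evaluate to False, so a sketch without it (assuming df is the frame from the question) should select the same rows:
# NaN < 0 is False, so rows with missing ages are simply not selected
df.query('AgeAtMedStart < 0 or AgeAtMedStop < 0')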
First get the columns of interest:
cols = [col for col in df if col.startswith('AgeAt')]
Then get the DF with those columns:
df_wanted = df[cols]
Then get the rows:
x = df_wanted[df_wanted < 0]
Of course, if you are looking at multiple columns, some of the cells will contain NaN (every cell whose value was not negative is masked out).
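If what you want is the rows rather than a same-shaped frame full of NaNs, you can collapse that cell-level mask into a row mask with any(axis=1). A minimal sketch, assuming df holds the CSV from the edit (so a RecordKey column exists and the string "...Flag" columns must be excluded):
age_cols = df.columns[df.columns.str.startswith('AgeAt')
                      & ~df.columns.str.endswith('Flag')]
neg = df[age_cols].lt(0).any(axis=1)   # True where any age column is negative
df[neg]                                # the full rows containing a negative
df.loc[neg, ['RecordKey', *age_cols]]  # or just RecordKey plus the age columns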
I'm working on an equivalent of R's spread in pandas. My dataframe looks like this:
Name age Language year Period
Nik 18 English 2018 Beginner
John 19 French 2019 Intermediate
Kane 33 Russian 2017 Advanced
xi 44 Thai 2015 Beginner
and I'm looking for output like this:
Name age Language Beginner Intermediate Advanced
Nik 18 English 2018
John 19 French 2019
Kane 33 Russian 2017
xi 44 Thai 2015
My code:
pd.pivot(x1,values='year', columns=['Period'])
I'm getting only the Beginner, Intermediate and Advanced columns, not the entire dataframe.
While reshaping I tried using an index, but it says there can be no duplicates in the index.
So I created a new index column, but I still don't get the entire dataframe.
If I understood correctly you could do something like this:
import numpy as np
import pandas as pd

# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
# put each row's year into that row's dummy column
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Intermediate', 'Advanced'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Beginner Intermediate Advanced
0 Nik 18 English 2018 0 0
1 John 19 French 0 2019 0
2 Kane 33 Russian 0 0 2017
3 xi 44 Thai 2015 0 0
Finally if you want to replace the 0, with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# replace the 0 fill with missing values, then assemble as before
res = res.replace(0, np.nan)
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
Output (with missing values)
Name age Language Beginner Intermediate Advanced
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN 2019.0 NaN
2 Kane 33 Russian NaN NaN 2017.0
3 xi 44 Thai 2015.0 NaN NaN
One way to get the equivalent of R's spread function is pd.pivot_table:
If you don't mind about the index, you can use reset_index() on the newly created df:
new_df = (pd.pivot_table(df, index=['Name','age','Language'],columns='Period',values='year',aggfunc='sum')).reset_index()
which will get you:
Period Name age Language Advanced Beginner Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
Grab the two columns used by the pivot table (i.e. Period and year) in a list.
Grab all the other columns of the dataframe in a second list (using not in).
Use index_cols as the index in the pd.pivot_table() command:
non_index_cols = ['Period','year'] # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols] # GET ALL THE REST IN A LIST
new_df = (pd.pivot_table(df, index=index_cols,columns='Period',values='year',aggfunc='sum')).reset_index()
The resulting new_df will include all the columns of your initial dataframe.
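For reference, the "no duplicates in index" error mentioned in the question is specific to pivot, which raises on duplicate (index, columns) pairs, while pivot_table aggregates them. On pandas 1.1+, where pivot accepts a list of index columns, a sketch of the same reshape would be:
# safe here because each (Name, age, Language) combination is unique
new_df = (df.pivot(index=['Name', 'age', 'Language'],
                   columns='Period', values='year')
            .reset_index())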
Datatable:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2017 NaN
1 NISSAN 2017 NaN
2 HYUNDAI 2017 1.0
3 DODGE 2017 NaN
I want to update more than one row and more than one column at once with the loc function, but when I use loc, it writes the same pair of values into every selected row:
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
data.loc[indexlister , listcolumns] = listnewvalue
As you can see in the output below, only row 0's 'VEHICLE_YEAR' should have become 16000 and only row 1's 'NUM_PASSENGERS' should have become 28000; instead, both rows changed in both columns.
How can I control this and change only the rows and columns I want? Or do you have a different method? Thank you very much.
output:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 16000 28000.0
1 NISSAN 16000 28000.0
In the desired printout below I left the other fields empty so that the new entries stand out. For example, I want to assign the value 2005 to row 0 of the column 'VEHICLE_YEAR' and the value 2005 to row 1 of the column 'NUM_PASSENGERS'.
The output I want is as follows:
ARAÇ VEHICLE_YEAR NUM_PASSENGERS
0 CHEVROLET 2005 NaN
1 NISSAN NaN 2005
2 HYUNDAI NaN NaN
The shape of the values you assign needs to match the number of rows and columns you've selected with loc. If loc receives a single flat list, it broadcasts that list across every selected row, which is what happened in your example.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ARAC' : ['CHEVROLET', 'NISSAN', 'HYUNDAI', 'DODGE'],
'VEHICLE_YEAR' : [2017, 2017, 2017, 2017],
'NUM_PASSENGERS' : [np.nan, np.nan, 1.0, np.nan]
})
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET NaN 2017
1 NISSAN NaN 2017
2 HYUNDAI 1.0 2017
3 DODGE NaN 2017
df.loc[[0, 2], ['NUM_PASSENGERS', 'VEHICLE_YEAR']] = [[1000, 2014], [3000, 2015]]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 1000.0 2014
1 NISSAN NaN 2017
2 HYUNDAI 3000.0 2015
3 DODGE NaN 2017
If you only want to change the values in the NUM_PASSENGERS column, select only that and give it a single list/array, the same length as your row indices.
df.loc[[0,1,3], ['NUM_PASSENGERS']] = [10, 20, 30]
ARAC NUM_PASSENGERS VEHICLE_YEAR
0 CHEVROLET 10.0 2014
1 NISSAN 20.0 2017
2 HYUNDAI 3000.0 2015
3 DODGE 30.0 2017
The docs might be helpful too. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc
If this didn't answer your question, please provide your expected output.
I solved the problem as follows.
I could not describe the problem exactly (I am still working on that), but when I changed the code as below it worked, and now I can set the row and column I want to the value I want.
listcolumns = ['VEHICLE_YEAR', 'NUM_PASSENGERS']
listnewvalue = [16000, 28000]
indexlister = [0, 1]
for i in range(len(indexlister)):
    df.loc[indexlister[i], listcolumns[i]] = listnewvalue[i]
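The same pairwise update can be written a little more idiomatically with zip, using the same lists as above:
# assign each (row, column, value) triple individually
for idx, col, val in zip(indexlister, listcolumns, listnewvalue):
    df.loc[idx, col] = val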
I have the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'clif_cod' : [1,2,3,3,4,4,4],
'peds_val_fat' : [10.2, 15.2, 30.9, 14.8, 10.99, 39.9, 54.9],
'mes' : [1,2,4,5,5,6,12],
'ano' : [2016, 2016, 2016, 2016, 2016, 2016, 2016]})
vetor_valores = df.groupby(['mes','clif_cod']).sum()
which yields this output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.20
2 2 2016 15.20
4 3 2016 30.90
5 3 2016 14.80
4 2016 10.99
6 4 2016 39.90
12 4 2016 54.90
How do I select rows based on mes and clif_cod?
When I do list(vetor_valores) I only get ano and peds_val_fat.
IIUC, you can just pass the argument as_index=False to your groupby. You can then access it as you would any other dataframe
vetor_valores = df.groupby(['mes','clif_cod'], as_index=False).sum()
>>> vetor_valores
mes clif_cod ano peds_val_fat
0 1 1 2016 10.20
1 2 2 2016 15.20
2 4 3 2016 30.90
3 5 3 2016 14.80
4 5 4 2016 10.99
5 6 4 2016 39.90
6 12 4 2016 54.90
To access values, you can now use iloc or loc as you would any dataframe:
# Select first row:
vetor_valores.iloc[0]
...
Alternatively, if you've already created your groupby and don't want to go back and re-make it, you can reset the index; the result is identical.
vetor_valores.reset_index()
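After the reset, the selection from the question is a plain boolean filter. A small sketch (assuming vetor_valores still carries the MultiIndex, and picking mes=5, clif_cod=4 as an example):
flat = vetor_valores.reset_index()
flat[(flat['mes'] == 5) & (flat['clif_cod'] == 4)]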
By using pd.IndexSlice
vetor_valores.loc[[pd.IndexSlice[1,1]],:]
Out[272]:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
You've got a dataframe with a two-level MultiIndex. Use both values to access rows, e.g., vetor_valores.loc[(4,3)].
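A couple of equivalent spellings, sketched against the frame from the question:
vetor_valores.loc[(4, 3)]          # tuple lookup: mes=4, clif_cod=3
vetor_valores.xs(4, level='mes')   # cross-section of a single index level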
Use axis parameter in .loc:
vetor_valores.loc(axis=0)[1,:]
Output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
I am trying to make a dataframe so that I can send it to a CSV easily; otherwise I have to do this process manually.
I'd like this to be my final output. Each person has a month and year combo that starts at 1/1/2014 and goes to 12/1/2016:
Name date
0 ben 1/1/2014
1 ben 2/1/2014
2 ben 3/1/2014
3 ben 4/1/2014
....
12 dan 1/1/2014
13 dan 2/1/2014
14 dan 3/1/2014
code so far:
import pandas as pd
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df = pd.DataFrame({"Name": listof_people})
for month in months:
df.append({'date': month}, ignore_index=True)
print(df)
When I try looping to create the dataframe, it either does not work or I get index errors (because the lists have different lengths), and I'm at a loss.
I've done a good bit of searching and found the following similar links, but I can't reverse engineer the work to fit my case.
Filling empty python dataframe using loops
How to build and fill pandas dataframe from for loop?
I don't want anyone to feel like they are "doing my homework", so if I'm derping on something simple please let me know.
I think you can use product for all combinations, together with to_datetime for the date column:
from itertools import product
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df1 = pd.DataFrame(list(product(listof_people, months, days, years)))
df1.columns = ['Name', 'month','day','year']
print (df1)
Name month day year
0 ben 1 1 2014
1 ben 1 1 2015
2 ben 1 1 2016
3 ben 2 1 2014
4 ben 2 1 2015
5 ben 2 1 2016
6 ben 3 1 2014
7 ben 3 1 2015
8 ben 3 1 2016
9 ben 4 1 2014
10 ben 4 1 2015
...
...
df1['date'] = pd.to_datetime(df1[['month','day','year']])
df1 = df1[['Name','date']]
print (df1)
Name date
0 ben 2014-01-01
1 ben 2015-01-01
2 ben 2016-01-01
3 ben 2014-02-01
4 ben 2015-02-01
5 ben 2016-02-01
6 ben 2014-03-01
7 ben 2015-03-01
...
...
Alternatively, you can build all the combinations with a MultiIndex; because the product is taken as Name x Year x Month, the rows come out grouped by person and in chronological order, matching the desired layout:
mux = pd.MultiIndex.from_product(
    [listof_people, years, months],
    names=['Name', 'Year', 'Month'])

pd.Series(
    1, mux, name='Day'
).reset_index().assign(
    date=lambda d: pd.to_datetime(d[['Year', 'Month', 'Day']])
)[['Name', 'date']]
I want to simply reverse the column order of a given DataFrame.
My DataFrame:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
'wins': [11, 8, 10, 15, 11, 6, 10, 4],
'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
Actual output:
year team wins losses
0 2010 Bears 11 5
1 2011 Bears 8 8
2 2012 Bears 10 6
3 2011 Packers 15 1
4 2012 Packers 11 5
5 2010 Lions 6 10
6 2011 Lions 10 6
7 2012 Lions 4 12
I thought this would work, but it reverses the row order, not the column order:
football[::-1]
I also tried:
football.columns = football.columns[::-1]
but that reversed only the column labels, not the columns themselves.
A solution close to what you have already tried is to use:
>>> football[football.columns[::-1]]
losses wins team year
0 5 11 Bears 2010
1 8 8 Bears 2011
2 6 10 Bears 2012
3 1 15 Packers 2011
4 5 11 Packers 2012
5 10 6 Lions 2010
6 6 10 Lions 2011
7 12 4 Lions 2012
football.columns[::-1] reverses the order of the DataFrame's sequence of columns, and football[...] reindexes the DataFrame using this new sequence.
A more succinct way to achieve the same thing is with the iloc indexer:
football.iloc[:, ::-1]
The first : means "take all rows", the ::-1 means step backwards through the columns.
The loc indexer mentioned in PietroBattiston's answer works in the same way.
Note: As of Pandas v0.20, .ix indexer is deprecated in favour of .iloc / .loc.
Close to EdChum's answer... but faster:
In [3]: %timeit football.ix[::,::-1]
1000 loops, best of 3: 255 µs per loop
In [4]: %timeit football.ix[::,football.columns[::-1]]
1000 loops, best of 3: 491 µs per loop
Also notice one colon is redundant:
In [5]: all(football.ix[:,::-1] == football.ix[::,::-1])
Out[5]: True
EDIT: a further (minimal) improvement is brought by using .loc rather than .ix, as in football.loc[:,::-1].
Note: As of Pandas v0.20, .ix indexer is deprecated in favour of .iloc / .loc.
You can use fancy indexing .ix, pass the columns and then reverse the list to change the order:
In [27]:
football.ix[::,football.columns[::-1]]
Out[27]:
losses wins team year
0 5 11 Bears 2010
1 8 8 Bears 2011
2 6 10 Bears 2012
3 1 15 Packers 2011
4 5 11 Packers 2012
5 10 6 Lions 2010
6 6 10 Lions 2011
7 12 4 Lions 2012
timings
In [32]:
%timeit football[football.columns[::-1]]
1000 loops, best of 3: 421 µs per loop
In [33]:
%timeit football.ix[::,football.columns[::-1]]
1000 loops, best of 3: 403 µs per loop
Fancy indexing is marginally faster in this case.
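Since .ix was removed entirely in pandas 1.0, the non-deprecated equivalents of the snippets above are (behaviour unchanged):
football.iloc[:, ::-1]                    # positional: reverse the column order
football.loc[:, football.columns[::-1]]   # label-based: reverse the column order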