I want to simply reverse the column order of a given DataFrame.
My DataFrame:
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
Actual output:
year team wins losses
0 2010 Bears 11 5
1 2011 Bears 8 8
2 2012 Bears 10 6
3 2011 Packers 15 1
4 2012 Packers 11 5
5 2010 Lions 6 10
6 2011 Lions 10 6
7 2012 Lions 4 12
I thought this would work but it reverses the row order not column order:
football[::-1]
I also tried:
football.columns = football.columns[::-1]
but that reversed the column labels and not the entire column itself.
A solution close to what you have already tried is to use:
>>> football[football.columns[::-1]]
losses wins team year
0 5 11 Bears 2010
1 8 8 Bears 2011
2 6 10 Bears 2012
3 1 15 Packers 2011
4 5 11 Packers 2012
5 10 6 Lions 2010
6 6 10 Lions 2011
7 12 4 Lions 2012
football.columns[::-1] reverses the order of the DataFrame's sequence of columns, and football[...] reindexes the DataFrame using this new sequence.
A more succinct way to achieve the same thing is with the iloc indexer:
football.iloc[:, ::-1]
The first : means "take all rows", the ::-1 means step backwards through the columns.
The loc indexer mentioned in @PietroBattiston's answer works in the same way.
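As a quick check with the football frame from the question, all three spellings produce the same result:

a = football[football.columns[::-1]]
b = football.iloc[:, ::-1]
c = football.loc[:, ::-1]   # the loc form from @PietroBattiston's answer
assert a.equals(b) and b.equals(c)   # same data, columns reversed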
Note: As of Pandas v0.20, .ix indexer is deprecated in favour of .iloc / .loc.
Close to EdChum's answer... but faster:
In [3]: %timeit football.ix[::,::-1]
1000 loops, best of 3: 255 µs per loop
In [4]: %timeit football.ix[::,football.columns[::-1]]
1000 loops, best of 3: 491 µs per loop
Also notice one colon is redundant:
In [5]: all(football.ix[:,::-1] == football.ix[::,::-1])
Out[5]: True
EDIT: a further (minimal) improvement comes from using .loc rather than .ix, as in football.loc[:,::-1].
Note: As of Pandas v0.20, .ix indexer is deprecated in favour of .iloc / .loc.
You can use fancy indexing with .ix, passing the column list reversed to change the order:
In [27]:
football.ix[::,football.columns[::-1]]
Out[27]:
losses wins team year
0 5 11 Bears 2010
1 8 8 Bears 2011
2 6 10 Bears 2012
3 1 15 Packers 2011
4 5 11 Packers 2012
5 10 6 Lions 2010
6 6 10 Lions 2011
7 12 4 Lions 2012
timings
In [32]:
%timeit football[football.columns[::-1]]
1000 loops, best of 3: 421 µs per loop
In [33]:
%timeit football.ix[::,football.columns[::-1]]
1000 loops, best of 3: 403 µs per loop
fancy indexing is marginally faster in this case
I have a df with country-year data from 2000-2020, with various columns containing the sum total of given events in each country-year unit. In some countries the event only happened in some of the years, so there are no rows for the remaining years; I would like those missing rows to exist with a "0" in all columns.
country  iyear  nwound  Med  claimed
Nigeria  2000   2       5    7
Nigeria  2001   3       15   9
Nigeria  2005   4       6    14
Nigeria  2017   9       41   20
Benin    2004   2       5    7
Benin    2008   3       15   9
Benin    2010   4       6    14
Benin    2019   9       41   20
In short, I'm looking for a way to add the missing rows for all the years 2000-2020 for Nigeria and Benin (and all the other countries not listed), with each value in the row (for nwound, Med and claimed) being 0. Keep in mind, this data set has 18 countries in it, so I would want the code to be reproducible.
Use the reindex method from pandas:
import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Nigeria', 'Benin', 'Benin', 'Benin', 'Benin'],
                   'iyear': [2000, 2001, 2005, 2017, 2004, 2008, 2010, 2019],
                   'nwound': [2, 3, 4, 9, 2, 3, 4, 9],
                   'Med': [5, 15, 6, 41, 5, 15, 6, 41],
                   'claimed': [7, 9, 14, 20, 7, 9, 14, 20]})

# Index by (country, iyear) so reindex can align on those two levels
df = df.set_index(['country', 'iyear'])

# Build the complete grid: every country crossed with every year 2000-2020
countries = df.index.levels[0].tolist()
index = pd.MultiIndex.from_product([countries, range(2000, 2021)], names=['country', 'iyear'])

# Missing (country, year) combinations become rows of zeros
df = df.reindex(index, fill_value=0)
df = df.reset_index()
print(df)
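As a quick sanity check (the question mentions 18 countries; this toy frame has 2), each country should now have exactly 21 rows, one per year:

# Every country now covers all 21 years (2000-2020 inclusive)
assert df.groupby('country').size().eq(21).all()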
I am trying to clean up some data out of which I need to keep only the most recent but all of them, if they appear more than once. What confuses me is that the data are actually organised in "groups". I have a dataframe example below along with the comments that might make it clearer:
method year proteins values
0 John 2017 A 10
1 John 2017 B 20
2 John 2018 A 30 # John's method in 2018 is most recent, keep this line and drop index 0 and 1
3 Kate 2018 B 11
4 Kate 2018 C 22 # Kate's method appears only in 2018 so keep both lines (index 3 and 4)
5 Patrick 2017 A 90
6 Patrick 2018 A 80
7 Patrick 2018 B 85
8 Patrick 2018 C 70
9 Patrick 2019 A 60
10 Patrick 2019 C 50 # Patrick's method in 2019 is the most recent of Patrick's so keep index 9 and 10 only
So the desired output dataframe does not depend on which proteins are measured, but all the measured proteins should be included:
method year proteins values
0 John 2018 A 30
1 Kate 2018 B 11
2 Kate 2018 C 22
3 Patrick 2019 A 60
4 Patrick 2019 C 50
Hope this is clear. I have tried my_df.sort_values('year').drop_duplicates('method', keep='last'), but it gives the wrong output, since drop_duplicates keeps only a single row per method. Any ideas? Thank you!
PS: To replicate my initial df, you can copy the below lines:
import pandas as pd

methodology=["John", "John", "John", "Kate", "Kate", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick"]
year_pract=[2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018, 2018, 2019, 2019]
proteins=['A', 'B', 'A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C']
values=[10, 20, 30, 11, 22, 90, 80, 85, 70, 60, 50]
my_df=pd.DataFrame(zip(methodology, year_pract, proteins, values), columns=['method','year','proteins','values'])
my_df['year']=my_df['year'].astype(str)
my_df['year']=pd.to_datetime(my_df['year'], format='%Y') # the format never works for me and this is why I add the line below
my_df['year']=my_df['year'].dt.year
Because the duplicate rows must be kept, use GroupBy.transform with 'max' to broadcast each method's maximum year to all of its rows, compare it with the original year column using Series.eq, and filter with boolean indexing:
df = my_df[my_df['year'].eq(my_df.groupby('method')['year'].transform('max'))]
print (df)
method year proteins values
2 John 2018 A 30
3 Kate 2018 B 11
4 Kate 2018 C 22
9 Patrick 2019 A 60
10 Patrick 2019 C 50
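If it helps to see why this works, the intermediate transform can be inspected on its own; a small sketch using my_df from the question:

my_df.groupby('method')['year'].transform('max')
# index 0-2  (John)    -> 2018, 2018, 2018
# index 3-4  (Kate)    -> 2018, 2018
# index 5-10 (Patrick) -> 2019, 2019, 2019, 2019, 2019, 2019
# Rows whose own year equals this broadcast maximum pass the boolean mask.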
Given a DataFrame:
import pandas as pd
from numpy import nan

df = pd.DataFrame(
    {'AgeAtMedStart': {1: -46.47, 2: 46.47, 3: 46.8, 4: 51.5, 5: 51.5},
     'AgeAtMedStop': {1: 46.8, 2: 46.8, 3: nan, 4: -51.9, 5: 51.81},
     'MedContinuing': {1: 'No', 2: 'No', 3: 'Yes', 4: 'No', 5: 'No'},
     'Medication': {1: 'Med1', 2: 'Med2', 3: 'Med3', 4: 'Med4', 5: 'Med4'},
     'YearOfMedStart': {1: 2016.0, 2: 2016.0, 3: 2016.0, 4: 2016.0, 5: 2016.0}}
)
df
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.80 No Med1 2016.0
2 46.47 46.80 No Med2 2016.0
3 46.80 NaN Yes Med3 2016.0
4 51.50 -51.90 No Med4 2016.0
5 51.50 51.81 No Med4 2016.0
I want to filter to retain rows where any of the numeric values in the "AgeAt*" columns is negative.
My expected output is to keep the row with index 1, since "AgeAtMedStart" is -46.47, and the row with index 4, since "AgeAtMedStop" is -51.9, so the output would be
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.8 No Med1 2016.0
4 51.50 -51.9 No Med4 2016.0
EDIT1:
So I've tried the different answers provided thus far, but all return an empty dataframe. And I believe part of the problem is that I have another column called AgeAtMedStartFlag (and AgeAtMedStopFlag) which contain strings. So for this sample csv:
RecordKey Medication CancerSiteForTreatment CancerSiteForTreatmentCode TreatmentLineCodeKey AgeAtMedStart AgeAtMedStartFlag YearOfMedStart MedContinuing AgeAtMedStop AgeAtMedStopFlag ChangeOfTreatment
1 Drug1 Site1 C1.0 First -46.47 Year And Month Are Known But Day Is Missing And Coded To 15 2016 No 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1 Drug2 Site2 C1.1 First 46.47 Year And Month Are Known But Day Is Missing And Coded To 15 2016 No 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1 Drug3 Site3 C1.2 First 46.8 Year And Month Are Known But Day Is Missing And Coded To 15 2016 Yes Yes
2 Drug4 Site4 C1.3 First 51.5 2016 No 51.9 Yes
2 Drug5 Site5 C1.4 First 51.5 2016 No -51.81 Yes
3 Drug6 Site6 C1.5 First 73.93 2016 No 74.42 Yes
3 Drug7 Site7 C1.6 First 73.93 2016 No 74.42 Yes
4 Drug8 Site8 C1.7 First 36.66 2015 No 37.24 Yes
4 Drug9 Site9 C1.8 First 36.66 2015 No 37.24 Yes
4 Drug10 Site10 C1.9 First 36.66 2015 No 37.24 Yes
9 Drug11 Site11 C1.10 First 43.55 2016 No 43.68 Yes
9 Drug12 Site12 C1.11 First 43.22 2016 No 43.49 Yes
9 Drug13 Site13 C1.12 First 43.55 2016 No 43.68 Yes
9 Drug14 Site14 C1.13 First 43.22 2016 No 43.49 Yes
10 Drug15 Site15 C1.14 First 74.42 2016 No 74.84 Yes
10 Drug16 Site16 C1.15 First 73.56 2015 No 73.98 Yes
10 Drug17 Site17 C1.16 First 73.56 2015 No 73.98 No
10 Drug18 Site18 C1.17 First 74.42 2016 No 74.84 No
10 Drug19 Site19 C1.18 First 73.56 2015 No 73.98 No
10 Drug20 Site20 C1.19 First 74.42 2016 No 74.84 No
11 Drug21 Site21 C1.20 First 70.72 2013 No 72.76 No
11 Drug22 Site22 C1.21 First 68.76 2011 No 70.62 No
11 Drug23 Site23 C1.22 First 73.43 2016 No 73.96 No
11 Drug24 Site24 C1.23 First 72.76 2015 No 73.43 No
with this change to my script:
age_df = df.columns[(df.columns.str.startswith('AgeAt')) & (~df.columns.str.endswith('Flag'))]
df[df[age_df] < 0].to_excel('invalid.xlsx', 'Benjamin_Button')
It returns:
RecordKey Medication CancerSiteForTreatment CancerSiteForTreatmentCode TreatmentLineCodeKey AgeAtMedStart AgeAtMedStartFlag YearOfMedStart MedContinuing AgeAtMedStop AgeAtMedStopFlag ChangeOfTreatment
1 -46.47
1
1
2
2 -51.81
3
3
4
4
4
9
9
9
9
10
10
10
10
10
10
11
11
11
11
Can I modify this implementation to return only the rows where the negatives are and, if possible, the rest of the values for those rows? Or even better, just the negative ages and the RecordKey for each such row.
Here's a simple one-liner for you. If you need to determine programmatically whether a column is numeric, refer to coldspeed's answer. But if you are OK with explicit column references, a simple method like this will work.
Note that I'm also filling NaNs with 0; this meets your requirement even where data is missing. NaNs can be handled in other ways, but this suffices here. If you have missing values in other columns that you'd like to preserve, that can also be done (I didn't include it here for simplicity).
myData = df.fillna(0).query('AgeAtMedStart < 0 or AgeAtMedStop < 0')
Returns:
AgeAtMedStart AgeAtMedStop MedContinuing Medication YearOfMedStart
1 -46.47 46.8 No Med1 2016.0
4 51.50 -51.9 No Med4 2016.0
Pandas native query method is very handy for simple filter expressions.
Refer to the docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
First get the columns of interest:
cols = [col for col in df if col.startswith('AgeAt')]
Then get the DF with those columns:
df_wanted = df[cols]
Then get the rows:
x = df_wanted[df_wanted < 0]
Of course, if you are looking at multiple columns, some of the cells will contain nan.
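If you want the complete rows back rather than a frame full of NaNs (per the EDIT), a sketch that builds a boolean mask over just the age columns, excluding the *Flag columns as in the question's own age_df line; RecordKey is the key column from the CSV sample:

age_cols = [col for col in df if col.startswith('AgeAt') and not col.endswith('Flag')]
mask = (df[age_cols] < 0).any(axis=1)     # True where any age value is negative
df[mask]                                  # full rows, all columns
df.loc[mask, ['RecordKey'] + age_cols]    # or just the key and the age values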
I have the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'clif_cod': [1, 2, 3, 3, 4, 4, 4],
                   'peds_val_fat': [10.2, 15.2, 30.9, 14.8, 10.99, 39.9, 54.9],
                   'mes': [1, 2, 4, 5, 5, 6, 12],
                   'ano': [2016, 2016, 2016, 2016, 2016, 2016, 2016]})
vetor_valores = df.groupby(['mes','clif_cod']).sum()
which yields me this output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.20
2 2 2016 15.20
4 3 2016 30.90
5 3 2016 14.80
4 2016 10.99
6 4 2016 39.90
12 4 2016 54.90
How do I select rows based on mes and clif_cod?
When I do list(vetor_valores) I only get ano and peds_val_fat.
IIUC, you can just pass the argument as_index=False to your groupby. You can then access the result as you would any other DataFrame:
vetor_valores = df.groupby(['mes','clif_cod'], as_index=False).sum()
>>> vetor_valores
mes clif_cod ano peds_val_fat
0 1 1 2016 10.20
1 2 2 2016 15.20
2 4 3 2016 30.90
3 5 3 2016 14.80
4 5 4 2016 10.99
5 6 4 2016 39.90
6 12 4 2016 54.90
To access values, you can now use iloc or loc as you would any dataframe:
# Select first row:
vetor_valores.iloc[0]
...
Alternatively, if you've already created your groupby and don't want to go back and re-make it, you can reset the index, the result is identical.
vetor_valores.reset_index()
By using pd.IndexSlice
vetor_valores.loc[[pd.IndexSlice[1,1]],:]
Out[272]:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
You've got a dataframe with a two-level MultiIndex. Use both values to access rows, e.g., vetor_valores.loc[(4,3)].
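For example, with vetor_valores from the question:

>>> vetor_valores.loc[(4, 3)]
ano             2016.0
peds_val_fat      30.9
Name: (4, 3), dtype: float64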
Use axis parameter in .loc:
vetor_valores.loc(axis=0)[1,:]
Output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2
I'm now processing tweet data using the python pandas module, and I'm stuck on a problem.
I want to make a frequency table(pandas dataframe) from this dictionary:
d = {"Nigeria": 9, "India": 18, "Saudi Arabia": 9, "Japan": 60, "Brazil": 3, "United States": 38, "Spain": 5, "Russia": 3, "Ukraine": 3, "Azerbaijan": 5, "China": 1, "Germany": 3, "France": 12, "Philippines": 8, "Thailand": 5, "Argentina": 9, "Indonesia": 3, "Netherlands": 8, "Turkey": 2, "Mexico": 9, "Italy": 2}
desired output is:
>>> import pandas as pd
>>> df = pd.DataFrame(?????)
>>> df
Country Count
Nigeria 9
India 18
Saudi Arabia 9
.
.
.
(it doesn't matter if there's an index from 0 to n in the leftmost column)
Can anyone help me to deal with this problem?
Thank you in advance!
You have only a single series (a column of data with index values), really, so this works:
pd.Series(d, name='Count')
You can then construct a DataFrame if you want:
df = pd.DataFrame(pd.Series(d, name='Count'))
df.index.name = 'Country'
Now you have:
Count
Country
Argentina 9
Azerbaijan 5
Brazil 3
...
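If you'd rather end up with Country and Count as ordinary columns, matching the desired output, a small variant is to name the index and reset it:

# 'Country' names the index; reset_index turns it into a regular column
df = pd.Series(d, name='Count').rename_axis('Country').reset_index()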
Use the DataFrame constructor and pass the keys and values as separate columns:
df = pd.DataFrame({'Country': list(d.keys()),
                   'Count': list(d.values())}, columns=['Country','Count'])
print (df)
Country Count
0 Azerbaijan 5
1 Indonesia 3
2 Germany 3
3 France 12
4 Mexico 9
5 Italy 2
6 Spain 5
7 Brazil 3
8 Thailand 5
9 Argentina 9
10 Ukraine 3
11 United States 38
12 Turkey 2
13 Nigeria 9
14 Saudi Arabia 9
15 Philippines 8
16 China 1
17 Japan 60
18 Russia 3
19 India 18
20 Netherlands 8
Pass it inside a list:
pd.DataFrame([d]).T.rename(columns={0:'count'})
That gets the job done, but it kills performance, since we first treat the keys as columns and then transpose. Since d.items() gives us the (key, value) tuples, we can do this instead:
df = pd.DataFrame(list(d.items()),columns=['country','count'])
df.head()
country count
0 Germany 3
1 Philippines 8
2 Mexico 9
3 Nigeria 9
4 Saudi Arabia 9
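Whichever construction you use, sorting gives a more typical frequency table, largest counts first:

df.sort_values('count', ascending=False).head()   # Japan (60) and United States (38) at the top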