I have a DataFrame with a pandas MultiIndex:
In [1]: import pandas as pd
In [2]: multi_index = pd.MultiIndex.from_product([['CAN','USA'],['total']],names=['country','sex'])
In [3]: df = pd.DataFrame({'pop':[35,318]},index=multi_index)
In [4]: df
Out[4]:
pop
country sex
CAN total 35
USA total 318
Then I remove some rows from that DataFrame:
In [5]: df = df.query('pop > 100')
In [6]: df
Out[6]:
pop
country sex
USA total 318
But when I consult the MultiIndex, it still has both countries in its levels.
In [7]: df.index.levels[0]
Out[7]: Index([u'CAN', u'USA'], dtype='object')
I can fix this myself in a rather strange way:
In [8]: idx_names = df.index.names
In [9]: df = df.reset_index(drop=False)
In [10]: df = df.set_index(idx_names)
In [11]: df
Out[11]:
pop
country sex
USA total 318
In [12]: df.index.levels[0]
Out[12]: Index([u'USA'], dtype='object')
But this seems rather messy. Is there a better way I'm missing?
From pandas version 0.20.0 onward, use MultiIndex.remove_unused_levels:
print (df.index)
MultiIndex(levels=[['CAN', 'USA'], ['total']],
labels=[[1], [0]],
names=['country', 'sex'])
df.index = df.index.remove_unused_levels()
print (df.index)
MultiIndex(levels=[['USA'], ['total']],
labels=[[0], [0]],
names=['country', 'sex'])
This is something that has bitten me before. Dropping columns or rows does NOT change the underlying MultiIndex, for performance and philosophical reasons, and this is officially not considered a bug (read more here). The short answer is that the developers say "that's not what the MultiIndex is for". If you need a list of the contents of a MultiIndex level after modification, for example for iteration or to check whether something is included, you can use:
df.index.get_level_values(<levelname>)
This returns the current active values within that index level.
So I guess the "trick" here is that the API-native way to do it is to use get_level_values instead of just .index or .columns.
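For example, a minimal sketch of the difference, rebuilt from the question's setup:

import pandas as pd

multi_index = pd.MultiIndex.from_product([['CAN', 'USA'], ['total']], names=['country', 'sex'])
df = pd.DataFrame({'pop': [35, 318]}, index=multi_index)
df = df.query('pop > 100')

# get_level_values reflects only the rows actually present...
print('CAN' in df.index.get_level_values('country'))  # False
# ...while .levels still carries the stale entry
print('CAN' in df.index.levels[0])                    # True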
I will be surprised if there is a more "built-in" way to eliminate the unused country than to re-create the index in the way you're doing (or some similar way). If you look at your index before and after the slice:
In [165]: df.index
Out[165]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[0, 1], [0, 0]],
names=[u'country', u'sex'])
In [166]: df = df.query('pop > 100')
In [167]: df.index
Out[167]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[1], [0]],
names=[u'country', u'sex'])
you can see that the labels - which are indexes into the level values - have updated but not the level values. This may be an imperfect analogy, but it strikes me that the level values are analogous to an enumerated column in a database table, while the labels are analogous to the actual values of rows in the table. If you delete all the rows in a table with a value of "CAN", it doesn't change the fact that "CAN" is still a valid choice based on the column definition. To remove "CAN" from the enumeration, you have to alter the column definition; this is the equivalent of reindexing the dataframe in pandas.
I have come across this question many times on the internet, but there are not many answers apart from a few like the following:
Cannot rename the first column in pandas DataFrame
I approached it using the following:
df = df.rename(columns={df.columns[0]: 'Column1'})
Is there a better or cleaner way to rename the first column of a pandas DataFrame? Or one for any specific column number?
You're already using a cleaner way in pandas.
It is sad that:
df.columns[0] = 'Column1'
is impossible, because Index objects do not support mutable assignment; it would raise a TypeError.
You could still use iterable unpacking:
df.columns = ['Column1', *df.columns[1:]]
Or:
df = df.set_axis(['Column1', *df.columns[1:]], axis=1)
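As a quick sketch of how the set_axis version behaves (the DataFrame here is made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
df = df.set_axis(['Column1', *df.columns[1:]], axis=1)
print(df.columns.tolist())  # ['Column1', 'b', 'c']

The same unpacking pattern generalizes to any position, e.g. [*df.columns[:n], 'NewName', *df.columns[n+1:]] for column number n.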
Not sure if it is cleaner, but a possible idea is to convert the columns to a list and set the new value by indexing:
df = pd.DataFrame(columns=[4,7,0,2])
arr = df.columns.tolist()
arr[0] = 'Column1'
df.columns = arr
print (df)
Empty DataFrame
Columns: [Column1, 7, 0, 2]
Index: []
Through the loc and iloc methods, pandas allows us to slice DataFrames. Still, I am having trouble doing this when the columns are datetime objects.
For instance, consider the DataFrame generated by the following code:
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
Let us try to slice the first two columns of the dataframe through df.loc:
df.loc[0,'01-01-2001':'02-02-2002']
We get the following TypeError: '<' not supported between instances of 'datetime.date' and 'str'
How could this be solved?
df.iloc[0,[0,1]]
Use iloc or loc, but in the second parameter pass the columns by integer position (for iloc) or by their actual labels (for loc); you are passing strings, which do not match the datetime.date column labels, so just give the positions.
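For example, since the column labels are datetime.date objects, slicing with date objects works where the string slice fails (a minimal sketch built on the question's setup):

import datetime
import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [2], 'col3': [3]})
df.columns = pd.to_datetime(['01-01-2001', '02-02-2002', '03-03-2003']).date

# loc accepts a slice of date objects matching the column labels
print(df.loc[0, datetime.date(2001, 1, 1):datetime.date(2002, 2, 2)])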
To piggyback off of @Ch3steR's comment above, that line should work:
dates = pd.to_datetime(dates)
At that point the conversion to a DatetimeIndex allows you to index the columns that fall within a date range, as shown below. Just make sure the end of the slice is a little beyond the last date you're trying to capture.
# Return all rows in columns between date range 1/1/2001 and 2/3/2002
df.loc[:, '1/1/2001':'2/3/2002']
2001-01-01 2002-02-02
0 1 2
You can use the dates from the list you created earlier, and it doesn't give an error.
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
df.loc[0,dates[0]:dates[1]]
The two different formats are shown here. It's just important that you stick to one format. Calling from the list works because it guarantees that the format is the same. But as you said, you need to be able to use any dates, so the second one is better for you.
>>> dates = pd.to_datetime(dates).date
>>> print("With .date")
With .date
>>> print(dates)
[datetime.date(2001, 1, 1) datetime.date(2002, 2, 2)
 datetime.date(2003, 3, 3)]
>>> dates = pd.to_datetime(dates)
>>> print("Without .date")
Without .date
>>> print(dates)
DatetimeIndex(['2001-01-01', '2002-02-02', '2003-03-03'], dtype='datetime64[ns]', freq=None)
In a pandas DataFrame, a function can be used to group its index. I'm looking to define a function that is instead applied to a column.
I want to group by two columns, except I need the second column to be grouped by an arbitrary function, foo:
group_sum = df.groupby(['name', foo])['tickets'].sum()
How would foo be defined to group the second column into two groups, demarcated by whether values are > 0, for example? Or, is an entirely different approach or syntax used?
Groupby can accept any combination of labels and series/arrays (as long as any array has the same length as your DataFrame), so you can map the function onto your column and pass the result into the groupby, like:
df.groupby(['name', df[1].map(foo)])
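For instance, foo could simply bucket the second column by sign; a minimal sketch (the column names and data here are made up, and the column is referenced by name rather than by df[1]):

import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'value': [-1, 2, 3],
                   'tickets': [10, 20, 30]})

def foo(v):
    return v > 0  # True for positive values

group_sum = df.groupby(['name', df['value'].map(foo)])['tickets'].sum()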
Alternatively you might want to add the condition as a new column to your DataFrame before you perform the groupby; this has the advantage of giving it a name in the index:
df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
Something like this will work:
x.groupby(['name', x['value']>0])['tickets'].sum()
As mentioned above, groupby can accept labels and series. This should give you the answer you are looking for. Here is an example:
import numpy as np
import pandas as pd

data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns = ['name', 'value', 'value2'])
x.groupby(['name', x['value']>0])['value2'].sum()
name value
1 False 20
True 100
2 False 100
Name: value2, dtype: int64
I am using pandas in Python 2.7 and read a csv file like this:
import pandas as pd
df = pd.read_csv("test_file.csv")
df has a column titled 'rating' and a column titled 'review'. I do some manipulations on df, for example:
df3 = df[df['rating'] != 3]
Now if I look in a debugger at df['review'] and df3['review'] I see this information:
df['review'] = {Series}0
df3['review'] = {Series}1
Also if I want to see the first element of df['review'] I use:
df['review'][0]
which is fine, but if I do the same for df3, I get this error:
df3['review'][0]
{KeyError}0L
However, it looks like I can do this:
df3['review'][1]
Can someone please explain the difference?
Indexing with an integer on a Series doesn't work like a list. In particular, df['review'][0] doesn't get the first element of the "review" column, it gets the element with index 0:
In [4]: s = pd.Series(['a', 'b', 'c', 'd'], index=[1, 0, 2, 3])
In [5]: s
Out[5]:
1 a
0 b
2 c
3 d
dtype: object
In [6]: s[0]
Out[6]: 'b'
Presumably, in generating df3 you dropped the row with index 0. If you actually want to get the first element regardless of the index, use iloc:
In [7]: s.iloc[0]
Out[7]: 'a'
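Applied back to the question, a short sketch of the two access styles (assuming df3 is the filtered frame from above):

first_kept = df3['review'].iloc[0]  # positional: first remaining row
by_label = df3['review'][1]         # label-based: the row whose index is 1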
How can I add a header to a DF without replacing the current one? In other words I just want to shift the current header down and just add it to the dataframe as another record.
*Secondary question: how do I add tables (example DataFrame) to a Stack Overflow question?
I have this (note the header and how it is just added as a row):
0.213231 0.314544
0 -0.952928 -0.624646
1 -1.020950 -0.883333
I need this (all other records are shifted down and a new record is added)
(also: I couldn't read the csv properly because I'm using s3_text_adapter for the import, and I couldn't figure out how to pass an argument that ignores the header, similar to pandas read_csv):
A B
0 0.213231 0.314544
1 -1.020950 -0.883333
Another option is to add it as an additional level of the column index, to make it a MultiIndex:
In [10]: from numpy.random import randn
In [11]: df = pd.DataFrame(randn(2, 2), columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 -0.952928 -0.624646
1 -1.020950 -0.883333
In [13]: df.columns = pd.MultiIndex.from_tuples(zip(['AA', 'BB'], df.columns))
In [14]: df
Out[14]:
AA BB
A B
0 -0.952928 -0.624646
1 -1.020950 -0.883333
This has the benefit of keeping the correct dtypes for the DataFrame, so you can still do fast and correct calculations on your DataFrame, and allows you to access by both the old and new column names.
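For example, with the MultiIndex columns in place you can select by either name (a quick sketch continuing from the frame above):

df['AA']                      # sub-frame selected by the new top-level name
df['AA', 'A']                 # the original column, now as a Series
df.xs('A', axis=1, level=1)   # select by the old name across the top level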
For completeness, here's DSM's (deleted) answer, making the columns a row, which, as mentioned already, is usually not a good idea:
In [21]: df_bad_idea = df.T.reset_index().T
In [22]: df_bad_idea
Out[22]:
0 1
index A B
0 -0.952928 -0.624646
1 -1.02095 -0.883333
Note that the dtype may change (since these are column names rather than proper values), as in this case... so be careful if you actually plan to do any work on the result, as it will likely be slower and may even fail:
In [23]: df.sum()
Out[23]:
A -1.973878
B -1.507979
dtype: float64
In [24]: df_bad_idea.sum() # doh!
Out[24]: Series([], dtype: float64)
If the column names are actually a row that was mistaken for a header row, then you should correct this when reading in the data (e.g. with read_csv, use header=None).
The key is to specify header=None and then assign the header via columns:
df = pd.read_csv('file.csv', skiprows=2, header=None)  # skip blank rows if applicable
df = df.iloc[:, [0, 1]]   # keep the first two columns
df.columns = ['A', 'B']   # assign the header