How do I rename an index row in Python Pandas? [duplicate] - python

This question already has answers here:
How can I change a specific row label in a Pandas dataframe?
(2 answers)
Closed 5 years ago.
I see how to rename columns, but I want to rename an index (row name) that I have in a data frame.
I had a table with 350 rows in it; I then added a total row at the bottom and removed every row except that last one.
|       |    A |   B |  C |
|------:|-----:|----:|---:|
| TOTAL | 1243 | 423 | 23 |
So I have a row labelled 'TOTAL', and then several columns. I want to rename the label 'TOTAL' to something else.
Is this even possible?
Many thanks

You could pass a dictionary to rename(), for example:
In [1]: import pandas as pd
        df = pd.Series([1, 2, 3])
        df
Out[1]:
0    1
1    2
2    3
dtype: int64
In [2]: df.rename({1: 3, 2: 'total'})
Out[2]:
0        1
3        2
total    3
dtype: int64
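Applied to the frame in the question, a minimal sketch (the replacement label 'Grand total' is just an assumption for illustration):
import pandas as pd

df = pd.DataFrame({'A': [1243], 'B': [423], 'C': [23]}, index=['TOTAL'])
df = df.rename(index={'TOTAL': 'Grand total'})
print(df.index.tolist())
# ['Grand total']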

Easy as this (note that this sets the name of the index itself, not an individual row label):
df.index.name = 'Name'

Related

How to automatically set an index to a Pandas DataFrame when reading a CSV with or without an index column

Say I have two CSV files. The first one, input_1.csv, has an index column, so when I run:
import pandas as pd
df_1 = pd.read_csv("input_1.csv")
df_1
I get a DataFrame with an index column, as well as a column called Unnamed: 0, which is the same as the index column. I can prevent this duplication by adding the argument index_col=0 and everything is fine.
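For reference, the call with that argument looks like this (using the question's file name):
df_1 = pd.read_csv("input_1.csv", index_col=0)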
The second file, input_2.csv, has no index column, i.e., it looks like this:
| stuff | things |
|--------:|---------:|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
Running pd.read_csv("input_2.csv") gives me a DataFrame with an index column. In this case, adding the index_col=0 argument will set the index to the stuff column, as in the CSV file itself.
My problem is that I have a function that contains the read_csv part, and I want it to return a DataFrame with an index column in either case. Is there a way to detect whether the input file has an index column or not, set one if it doesn't, and do nothing if it does?
CSV has no built-in notion of an "index" column, so I think the answer is that this isn't possible in general.
It would be nice if you could say "use 0 as index only if unnamed", but Pandas does not give us that option.
Therefore you will probably need to just check if an Unnamed: column appears, and set those columns to be the index.
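A minimal sketch of that check (the helper name and the decision to clear the recovered index's name are assumptions, not part of the answer):
import pandas as pd

def read_csv_with_index(path):
    # Read without assuming an index, then look for the tell-tale
    # "Unnamed: 0" column that read_csv creates for a blank header.
    df = pd.read_csv(path)
    if str(df.columns[0]).startswith('Unnamed:'):
        df = df.set_index(df.columns[0])
        df.index.name = None  # drop the meaningless 'Unnamed: 0' label
    return df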
By index, I hope you mean a column of serial numbers starting at 0 or 1.
You can add some post-import logic to decide whether the first column qualifies as an index column.
The logic is: if the difference between the default index and the first column is the same for all rows, then the first column contains an increasing sequence (starting at any number). A pre-condition is that the column must be numeric.
For example:
idx value
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
import numpy as np
import pandas as pd

pd.api.types.is_numeric_dtype(df[df.columns[0]])
>> True
np.array(df.index) - df.iloc[:, 0].values
>> array([-1, -1, -1, -1, -1, -1])
# The first column is a serial column if all the differences are equal
len(pd.Series(np.array(df.index) - df.iloc[:, 0].values).unique()) == 1
>> True
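Wrapped into a small helper (a sketch; the function name and its use inside a read routine are my own assumptions):
import numpy as np
import pandas as pd

def first_column_is_serial(df):
    # True if the first column is numeric and increases in lockstep with the default index
    first = df.iloc[:, 0]
    if not pd.api.types.is_numeric_dtype(first):
        return False
    diffs = np.array(df.index) - first.values
    return len(pd.Series(diffs).unique()) == 1

df = pd.read_csv('input_2.csv')
if first_column_is_serial(df):
    df = df.set_index(df.columns[0])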

Header for sub-headers in Pandas

In Pandas I have a table with the following columns:
Number of words | 1 | 2 | 4 |
...and I want it to look like the following, with a "worker/node" header spanning the numeric sub-columns:
                | worker/node |
Number of words | 1 | 2 | 4   |
So how do I "create" this header for the sub-features?
And how do I merge the empty cell (from row 1, where the feature header sits) with the "Index" cell in row 2? In other words, I want to build a two-level table header.
Use MultiIndex.from_product to add the first level of the MultiIndex from your string:
#if necessary convert some columns to index first
df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
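For reference, a minimal sketch of what from_product builds here (the sub-column values 1, 2, 4 are taken from the question):
import pandas as pd

cols = pd.MultiIndex.from_product([['Worker/node'], [1, 2, 4]])
print(cols.tolist())
# [('Worker/node', 1), ('Worker/node', 2), ('Worker/node', 4)]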

What is the pandas version of tidyr::separate? [duplicate]

This question already has answers here:
Pandas split column into multiple columns by comma
(8 answers)
Closed 3 years ago.
The R package tidyr has a nice separate function to "Separate one column into multiple columns."
What is the pandas version?
For example here is a dataset:
import pandas
from six import StringIO
df = """ i | j | A
AR | 5 | Paris,Green
For | 3 | Moscow,Yellow
For | 4 | New York,Black"""
df = StringIO(df.replace(' ',''))
df = pandas.read_csv(df, sep="|", header=0)
I'd like to separate the A column into two columns containing its two comma-separated parts.
This question is related: Accessing every 1st element of Pandas DataFrame column containing lists
The equivalent of tidyr::separate is str.split with expand=True, assigning to two new columns:
df[['Town', 'Color']] = df['A'].str.split(',', n=1, expand=True)
print(df)
# i j A Town Color
# 0 AR 5 Paris,Green Paris Green
# 1 For 3 Moscow,Yellow Moscow Yellow
# 2 For 4 NewYork,Black NewYork Black
The equivalent of tidyr::unite is a simple concatenation of the character vectors:
df["B"] = df["i"] + df["A"]
df
# i j A B
# 0 AR 5 Paris,Green ARParis,Green
# 1 For 3 Moscow,Yellow ForMoscow,Yellow
# 2 For 4 NewYork,Black ForNewYork,Black
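Note that tidyr::unite inserts a separator ("_" by default); to mirror that you could use str.cat, e.g. (a sketch, the separator choice is an assumption):
df["B"] = df["i"].str.cat(df["A"], sep="_")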

Python Pandas differing value_counts() in two columns of same len()

I have a pandas data frame that contains two columns, with trace numbers [col_1] and ID numbers [col_2]. Trace numbers can be duplicates, as can ID numbers - however, each trace and each ID should correspond to only one specific value in the adjacent column.
My two columns are the same length, but they have different unique value counts, which should match, as shown below:
in[1]: Trace | ID
1 | 5054
2 | 8291
3 | 9323
4 | 9323
... |
100 | 8928
in[2]: print('unique traces: ', df['Trace'].nunique())
       print('unique IDs: ', df['ID'].nunique())
out[2]: unique traces:  100
        unique IDs:  99
In the code above, the same ID number (9323) is represented by two Trace numbers (3 and 4) - how can I isolate these instances? Thanks for looking!
By using the duplicated() function (docs), you can do the following:
df[df['ID'].duplicated(keep=False)]
By setting keep to False, we get all the duplicates (instead of excluding the first or the last one).
Which returns:
Trace ID
2 3 9323
3 4 9323
You can use groupby and filter:
df.groupby('ID').filter(lambda x: x.Trace.nunique() > 1)
Output:
Trace ID
2 3 9323.0
3 4 9323.0
# This should show you the rows with non-unique Trace or ID values.
df.groupby('ID').filter(lambda x: len(x)>1)
Out[85]:
Trace ID
2 3 9323
3 4 9323
df.groupby('Trace').filter(lambda x: len(x)>1)
Out[86]:
Empty DataFrame
Columns: [Trace, ID]
Index: []
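To check both directions in one pass (a sketch of my own that combines the two duplicated() filters; it is not taken from the answers above):
import pandas as pd

# Hypothetical frame mirroring the question's data
df = pd.DataFrame({'Trace': [1, 2, 3, 4], 'ID': [5054, 8291, 9323, 9323]})

# Rows where an ID maps to more than one Trace, or a Trace to more than one ID
bad = df[df.duplicated('ID', keep=False) | df.duplicated('Trace', keep=False)]
print(bad)
#    Trace    ID
# 2      3  9323
# 3      4  9323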

In pandas, How to select the rows that contains NaN? [duplicate]

This question already has answers here:
How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?
(6 answers)
Closed 6 years ago.
Suppose I have the following dataframe in df:
a | b | c
------+-------+-------
5 | 2 | 4
NaN | 6 | 8
5 | 9 | 0
3 | 7 | 1
If I do df.loc[df['a'] == 5] it will correctly return the first and third row, but then if I do a df.loc[df['a'] == np.NaN] it returns nothing.
I think this is more a python thing than a pandas one. If I compare np.nan against anything, even np.nan == np.nan will evaluate as False, so the question is, how should I test for np.nan?
Try using isnull like so:
import pandas as pd
import numpy as np
a=[1,2,3,np.nan,5,6,7]
df = pd.DataFrame(a)
df[df[0].isnull()]
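To select rows where any column contains NaN (as in the linked duplicate question), a sketch along the same lines using the question's frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [5, np.nan, 5, 3],
                   'b': [2, 6, 9, 7],
                   'c': [4, 8, 0, 1]})
df[df.isnull().any(axis=1)]
#      a  b  c
# 1  NaN  6  8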
