Retrieve the index of a cross section of a dataframe in pandas - python

Let's say we have a multi-index dataframe df like this:
| A | B | one | two |
|---|---|-----|-----|
| x | p | 1   | 2   |
| x | q | 3   | 4   |
| y | p | 5   | 6   |
| y | q | 7   | 8   |
Now, I can get the index of the cross section matching the criterion one > 4 like this:
idx = df[df['one'] > 4].index
And then, using it with .loc (.ix is deprecated and removed in modern pandas):
df.loc[idx]
yields a slice of the frame:
| A | B | one | two |
|---|---|-----|-----|
| y | p | 5   | 6   |
| y | q | 7   | 8   |
Now, I want to do the same, but with a cross section on one level of the multi-index. The .xs method is useful here:
df.xs('p', level='B')
returns:
| A | one | two |
|---|-----|-----|
| x | 1   | 2   |
| y | 5   | 6   |
But this dataframe has a different index structure, and its index is not a slice of the df index.
So, my question is: what should go in place of idx so that the following expression
df.loc[idx]
yields
| A | B | one | two |
|---|---|-----|-----|
| x | p | 1   | 2   |
| y | p | 5   | 6   |

You need to use the argument drop_level and set it to False to keep the index:
In [9]: df.xs('p', level='B', drop_level=False)
Out[9]:
     one  two
A B
x p    1    2
y p    5    6
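Because drop_level=False preserves the full MultiIndex, the result's index is a valid idx for the original frame. A minimal sketch (the frame construction is an assumption matching the tables above):
import pandas as pd

# Rebuild the sample frame (hypothetical, matching the question's tables).
mi = pd.MultiIndex.from_product([['x', 'y'], ['p', 'q']], names=['A', 'B'])
df = pd.DataFrame({'one': [1, 3, 5, 7], 'two': [2, 4, 6, 8]}, index=mi)

# The cross section keeps the (A, B) MultiIndex, so its index
# is a slice of df's index and works directly with .loc.
idx = df.xs('p', level='B', drop_level=False).index
df.loc[idx]  # rows (x, p) and (y, p)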

Related

Use Python Pandas For Loop to create pivot tables for each column in dataframe

df1:
| A | B | C |ID |
|---|---|---|---|
| 1 | 5 | 2 | Y |
| 4 | 6 | 4 | Z |
df2:
| A | B | C |ID |
|---|---|---|---|
| 2 | 1 | 2 | Y |
| 4 | 6 | 4 | Z |
Merged:
| case | A | B | C |ID |
|------|---|---|---|---|
|before| 1 | 5 | 2 | Y |
|before| 4 | 6 | 4 | Z |
|after | 2 | 1 | 2 | Y |
|after | 4 | 6 | 4 | Z |
desired pivot for column A:
|ID |before|after|
|- |------|-----|
| Y | 1 | 2|
| Z | 4 | 4|
I want to use a for loop to create a pivot table for each column in dfs 1 and 2. The rows of these pivots will be the ID, the columns will be 'case'.
I would like to create a new df for each column's pivot table using a for loop.
Later, I will drop the rows in each pivot table where the before and after values are the same (only keeping the rows where they are different).
I hope I've understood you correctly:
import pandas as pd

df1["case"] = "before"
df2["case"] = "after"

df = pd.concat([df1, df2]).pivot(index="ID", columns="case")

for c in df.columns.get_level_values(0).unique():
    print(df.xs(c, axis=1))
    print("-" * 80)
Prints:
case  after  before
ID
Y         2       1
Z         4       4
--------------------------------------------------------------------------------
case  after  before
ID
Y         1       5
Z         6       6
--------------------------------------------------------------------------------
case  after  before
ID
Y         2       2
Z         4       4
--------------------------------------------------------------------------------
EDIT: To add the dataframes into a dictionary:
df1["case"] = "before"
df2["case"] = "after"

df = pd.concat([df1, df2]).pivot(index="ID", columns="case")

all_dataframes = {}
for c in df.columns.get_level_values(0).unique():
    all_dataframes[c] = df.xs(c, axis=1).copy()

for key, dataframe in all_dataframes.items():
    print("Key:", key)
    print("DataFrame:", dataframe)
    print("-" * 80)

Pandas use replace with a dict only if target column is empty

I create a df:
import pandas as pd
i = pd.DataFrame({1:['a','r','g','a'],2:[7,6,5,""]})
That is:
| 1 | 2 |
|-----|-----|
| a | 7 |
| r | 6 |
| g | 5 |
| a | |
The last entry for a contains an empty string in column 2.
With the following mapping dictionary:
mapping_dict = {"a": 'X'}
Current result:
i[2] = i[1].replace(mapping_dict)
| 1 | 2 |
|-----|-----|
| a | X |
| r | r |
| g | g |
| a | X |
Expected result, where only the empty column-2 cells are mapped:
| 1 | 2 |
|-----|-----|
| a | 7 |
| r | 6 |
| g | 5 |
| a | X |
Try this (with numpy imported as np):
i[2] = i[2].replace('', np.nan).fillna(i[1].map(mapping_dict))
Alternatively, you can replace the i[2] on the left of the assignment with i.loc[i[2] == "", 2] to add a filtering condition.
Here, .loc filters the rows that have an empty string in column 2, and its second argument selects column 2 as the column to update. You can then keep your current .replace() code to get the desired result:
i.loc[i[2] == "", 2] = i[1].replace(mapping_dict)
Result:
print(i)
   1  2
0  a  7
1  r  6
2  g  5
3  a  X
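For reference, Series.mask collapses the same conditional fill into one step (an equivalent alternative, not taken from the answers above):
# Where column 2 is empty, take the mapped value from column 1;
# otherwise keep column 2 as-is.
i[2] = i[2].mask(i[2].eq(''), i[1].map(mapping_dict))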

How to group by certain column then take the count of multiple columns where it is not NA and add them in Pandas Python?

I would like to group by ID, count the non-NA values in A and B, and then add the A and B counts together. On top of that, what if I want to count only the y values in A?
+----+---+---+
| ID | A | B |
+----+---+---+
| 1 | x | x |
| 1 | x | x |
| 1 | y | |
| 2 | y | x |
| 2 | y | |
| 2 | y | x |
| 2 | x | x |
| 3 | x | x |
| 3 | | x |
| 3 | y | x |
+----+---+---+
+----+--------+
| ID | Output |
+----+--------+
| 1 | 3 |
| 2 | 6 |
| 3 | 4 |
+----+--------+
Here's a way to do it (this answers the first part, counting all non-NA values in A and B; the blank cells must be actual NaN):
df = df.groupby('ID').agg(lambda x: sum(pd.notna(x))).sum(axis=1).reset_index(name='Output')
print(df)
   ID  Output
0   1     5.0
1   2     7.0
2   3     5.0
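The question's expected table (3, 6, 4) matches the second variant: count only the y values in A plus the non-NA values in B. A sketch, assuming the blanks are NaN and df is the original (ungrouped) frame:
# Per ID: (# of 'y' values in A) + (# of non-NA values in B).
out = (df['A'].eq('y').groupby(df['ID']).sum()
       + df['B'].notna().groupby(df['ID']).sum()).reset_index(name='Output')
print(out)
#    ID  Output
# 0   1       3
# 1   2       6
# 2   3       4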

Reading data from a text file with a variable number of columns

I am reading data from a text file in Python using pandas. There are no header values (column names) assigned to the data in the text file. I want to reshape the data into a readable form. The problem I am facing is a variable number of columns per row.
For example, in my text file I have:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,7,8,
1,2,3,4,5,7,8,
1,2,3,4,5,Hello,7,8,
Now, when I create a data frame, I want rows that are missing Hello to get "NA" in that column instead, since the value is not present. In the end, after assigning column names and rearranging, the data frame will look like:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,"NA,"7,8,
1,2,3,4,5,"NA",7,8,
1,2,3,4,5,Hello,7,8,
Answer to the updated question, and also a generalized solution for such cases (with numpy imported as np):
focus_col_idx = 5  # the column where the NaN should land in the expected output
last_idx = df.shape[1] - 1

# Fetch the index of rows which have NaN in the last column
idx = df[df[last_idx].isnull()].index

# Shift the column values one to the right for those rows
df.iloc[idx, focus_col_idx+1:] = df.iloc[idx, focus_col_idx:last_idx].values

# Put NaN in the focus column for those rows
df.iloc[idx, focus_col_idx] = np.nan
df
+---+----+---+---+---+---+-------+---+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+---+---+---+---+-------+---+-----+
| 0 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
| 1 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 2 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 3 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
+---+----+---+---+---+---+-------+---+-----+
Answer to the previous data
Assuming only one column has a missing value (say the 2nd column, as per your previous data), here's a quick solution (a plain sep=',' needs no regex escaping):
df = pd.read_csv('SO.txt', sep=',', header=None)
df
+---+---+---+---+---+------+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+------+
| 0 | A | B | C | D | E |
| 1 | A | C | D | E | None |
+---+---+---+---+---+------+
# Fetching the index of rows which have None in last column
idx = df[df[4].isnull()].index
idx
# Int64Index([1], dtype='int64')
# Shifting the column values for those rows with index idx
df.iloc[idx,2:] = df.iloc[idx,1:4].values
df
+---+---+---+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | C | C | D | E | # <- Notice the shifting.
+---+---+---+---+---+---+
# Putting NaN for second column where row index is idx
df.iloc[idx, 1] = np.nan
# Final output
df
+---+---+-----+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+-----+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | NaN | C | D | E |
+---+---+-----+---+---+---+
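For reference, a self-contained rerun of the generalized shift logic above, with the question's sample inlined (the stray trailing comma on the last sample line is dropped here so every row parses to the same eight fields; an assumption for illustration):
import io

import numpy as np
import pandas as pd

# Inline stand-in for the text file (hypothetical). A row missing
# "Hello" parses with an empty last field, i.e. NaN.
raw = "1,2,3,4,5,Hello,7,8\n1,2,3,4,5,7,8,\n1,2,3,4,5,7,8,\n1,2,3,4,5,Hello,7,8\n"
df = pd.read_csv(io.StringIO(raw), header=None)

focus_col_idx = 5            # column that should hold "Hello"
last_idx = df.shape[1] - 1   # 7

# Rows whose last column is NaN are one field short.
short = df[last_idx].isna()

# Shift the tail of those rows one column to the right,
# then mark the missing field as NaN.
df.loc[short, df.columns[focus_col_idx + 1:]] = (
    df.loc[short, df.columns[focus_col_idx:last_idx]].values
)
df.loc[short, focus_col_idx] = np.nan
print(df)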

Aggregate sets of Pandas DataFrames columns

I have a pandas DataFrame with some independent columns, and I'm looking for an efficient way to unwind / aggregate them.
So, let's say I have the table:
+-----+-----+-------+------+-------+
| One | Two | Three | Four | Count |
+-----+-----+-------+------+-------+
| a | x | y | y | 3 |
+-----+-----+-------+------+-------+
| b | z | x | x | 5 |
+-----+-----+-------+------+-------+
| c | y | x | y | 1 |
+-----+-----+-------+------+-------+
Here, columns Two, Three and Four are independent.
I would like to end up with the table:
+-----+-------+-------+
| One | Other | Count |
+-----+-------+-------+
| a | x | 3 |
+-----+-------+-------+
| a | y | 6 |
+-----+-------+-------+
| b | x | 10 |
+-----+-------+-------+
| b | z | 5 |
+-----+-------+-------+
| c | x | 1 |
+-----+-------+-------+
| c | y | 2 |
+-----+-------+-------+
What would be the best way to achieve this?
You can use the melt function from pandas to reshape your data frame from wide to long format, then group by the One and Other columns and sum the Count column:
import pandas as pd
pd.melt(df, id_vars=['One', 'Count'], value_name='Other').groupby(['One', 'Other'])['Count'].sum().reset_index()
  One Other  Count
0   a     x      3
1   a     y      6
2   b     x     10
3   b     z      5
4   c     x      1
5   c     y      2
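For reference, a self-contained run rebuilding the sample frame from the question (the construction is an assumption matching the first table above):
import pandas as pd

# Rebuild the sample frame (hypothetical, matching the question's table).
df = pd.DataFrame({
    'One': ['a', 'b', 'c'],
    'Two': ['x', 'z', 'y'],
    'Three': ['y', 'x', 'x'],
    'Four': ['y', 'x', 'y'],
    'Count': [3, 5, 1],
})

# Melt Two/Three/Four into one 'Other' column, then sum Count
# over each (One, Other) pair.
out = (pd.melt(df, id_vars=['One', 'Count'], value_name='Other')
         .groupby(['One', 'Other'])['Count'].sum().reset_index())
print(out)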
