df1:
| A | B | C |ID |
|---|---|---|---|
| 1 | 5 | 2 | Y |
| 4 | 6 | 4 | Z |
df2:
| A | B | C |ID |
|---|---|---|---|
| 2 | 1 | 2 | Y |
| 4 | 6 | 4 | Z |
Merged:
| case | A | B | C |ID |
|------|---|---|---|---|
|before| 1 | 5 | 2 | Y |
|before| 4 | 6 | 4 | Z |
|after | 2 | 1 | 2 | Y |
|after | 4 | 6 | 4 | Z |
desired pivot for column A:
|ID |before|after|
|- |------|-----|
| Y | 1 | 2|
| Z | 4 | 4|
I want to use a for loop to create a pivot table for each column in df1 and df2. The rows of these pivots will be the ID, and the columns will be 'case'.
I would like to create a new DataFrame for each column's pivot table using a for loop.
Later, I will drop the rows in each pivot table where the before and after values are the same (keeping only the rows where they differ).
I hope I've understood you correctly:
df1["case"] = "before"
df2["case"] = "after"
df = pd.concat([df1, df2]).pivot(index="ID", columns="case")
for c in df.columns.get_level_values(0).unique():
print(df.xs(c, axis=1))
print("-" * 80)
Prints:
case after before
ID
Y 2 1
Z 4 4
--------------------------------------------------------------------------------
case after before
ID
Y 1 5
Z 6 6
--------------------------------------------------------------------------------
case after before
ID
Y 2 2
Z 4 4
--------------------------------------------------------------------------------
EDIT: To add dataframes into a dictionary:
df1["case"] = "before"
df2["case"] = "after"
df = pd.concat([df1, df2]).pivot(index="ID", columns="case")
all_dataframes = {}
for c in df.columns.get_level_values(0).unique():
all_dataframes[c] = df.xs(c, axis=1).copy()
for key, dataframe in all_dataframes.items():
print("Key:", key)
print("DataFrame:", dataframe)
print("-" * 80)
I create a df:
import pandas as pd
i = pd.DataFrame({1:['a','r','g','a'],2:[7,6,5,""]})
That is:
| 1 | 2 |
|-----|-----|
| a | 7 |
| r | 6 |
| g | 5 |
| a | |
The last entry for a contains an empty string on column 2.
With the following mapping dictionary:
mapping_dict= {"a":'X'}
Current result:
i[2]=i[1].replace(mapping_dict)
| 1 | 2 |
|-----|-----|
| a | X |
| r | 6 |
| g | 5 |
| a | X |
Expected result, where only rows with an empty column 2 are mapped:
| 1 | 2 |
|-----|-----|
| a | 7 |
| r | 6 |
| g | 5 |
| a | X |
Try this:
import numpy as np
i[2] = i[2].replace('', np.nan).fillna(i[1].map(mapping_dict))
Alternatively, you can replace the i[2] on the left-hand side of your assignment with i.loc[i[2] == "", 2] to add a filtering condition.
Here, .loc filters the rows with an empty string in column 2, and its second parameter selects column 2 as the column to update. Your current .replace() code then produces the desired result:
i.loc[i[2] == "", 2] = i[1].replace(mapping_dict)
Result:
print(i)
1 2
0 a 7
1 r 6
2 g 5
3 a X
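If you prefer to avoid the intermediate NaN step, an equivalent one-liner is Series.mask, which replaces values only where the condition holds (a sketch using the same example data):

```python
import pandas as pd

i = pd.DataFrame({1: ["a", "r", "g", "a"], 2: [7, 6, 5, ""]})
mapping_dict = {"a": "X"}

# Replace column 2 only where it holds the empty string,
# taking the replacement from column 1 mapped through the dict
i[2] = i[2].mask(i[2].eq(""), i[1].map(mapping_dict))
```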
I would like to group by the ID, count the non-NA values in A and B, and then add the two counts together. On top of that, what if I want to count only the y values in A?
+----+---+---+
| ID | A | B |
+----+---+---+
| 1 | x | x |
| 1 | x | x |
| 1 | y | |
| 2 | y | x |
| 2 | y | |
| 2 | y | x |
| 2 | x | x |
| 3 | x | x |
| 3 | | x |
| 3 | y | x |
+----+---+---+
+----+--------+
| ID | Output |
+----+--------+
| 1 | 3 |
| 2 | 6 |
| 3 | 4 |
+----+--------+
Here's a way to do the combined non-NA count:
df = df.groupby('ID').agg(lambda x: x.notna().sum()).sum(axis=1).reset_index(name='Output')
print(df)
ID Output
0 1 5.0
1 2 7.0
2 3 5.0
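For the second part of the question (count only the 'y' values in A, plus the non-NA values in B), one sketch is to turn each condition into a boolean column and sum per group; the `A_y` and `B_notna` names are mine:

```python
import numpy as np
import pandas as pd

# Rebuild the example data; blank cells are NaN
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "A":  ["x", "x", "y", "y", "y", "y", "x", "x", np.nan, "y"],
    "B":  ["x", "x", np.nan, "x", np.nan, "x", "x", "x", "x", "x"],
})

# Count 'y' values in A and non-NA values in B, per ID
out = (df.assign(A_y=df["A"].eq("y"), B_notna=df["B"].notna())
         .groupby("ID")[["A_y", "B_notna"]].sum()
         .sum(axis=1).astype(int)
         .reset_index(name="Output"))
```

This reproduces the 3, 6, 4 from the expected output table.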
I am reading data from a text file in Python using pandas. There are no header values (column names) assigned to the data in the text file. I want to reshape the data into a readable form, but the problem I am facing is variable column lengths.
For example, in my text file I have:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,7,8,
1,2,3,4,5,7,8,
1,2,3,4,5,Hello,7,8,
Now, when I create a DataFrame, I want to make sure that in the second row a "NaN" is written instead of Hello, as the value for that column is not present. In the end, after assigning column names and rearranging, the DataFrame will look like:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,"NA,"7,8,
1,2,3,4,5,"NA",7,8,
1,2,3,4,5,Hello,7,8,
Answer to the updated question, and also a generalized solution for such a case:
import numpy as np
import pandas as pd

# assumes df has already been read with header=None
focus_col_idx = 5  # the column where NaN should appear in the expected output
last_idx = df.shape[1] - 1

# Fetching the index of rows which have NaN in the last column
idx = df[df[last_idx].isnull()].index

# Shifting the column values to the right for those rows
df.iloc[idx, focus_col_idx + 1:] = df.iloc[idx, focus_col_idx:last_idx].values

# Putting NaN in the focus column for those rows
df.iloc[idx, focus_col_idx] = np.nan
df
+---+----+---+---+---+---+-------+---+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+---+---+---+---+-------+---+-----+
| 0 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
| 1 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 2 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 3 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
+---+----+---+---+---+---+-------+---+-----+
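A self-contained version of the same shift, building the frame directly instead of reading a file (the literal rows below mirror how pandas would load the example with header=None):

```python
import numpy as np
import pandas as pd

# Short rows end up with NaN in the last column when read
df = pd.DataFrame([
    [1, 2, 3, 4, 5, "Hello", 7, 8],
    [1, 2, 3, 4, 5, 7, 8, np.nan],
    [1, 2, 3, 4, 5, 7, 8, np.nan],
    [1, 2, 3, 4, 5, "Hello", 7, 8],
])

focus_col_idx = 5
last_idx = df.shape[1] - 1

# Rows with NaN in the last column are the ones to shift
idx = df[df[last_idx].isnull()].index
df.iloc[idx, focus_col_idx + 1:] = df.iloc[idx, focus_col_idx:last_idx].values
df.iloc[idx, focus_col_idx] = np.nan
```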
Answer to the previous data:
Assuming only one column has missing values (say the 2nd column, per your previous data), here's a quick solution:
df = pd.read_table('SO.txt', sep=',', header=None)
df
+---+---+---+---+---+------+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+------+
| 0 | A | B | C | D | E |
| 1 | A | C | D | E | None |
+---+---+---+---+---+------+
# Fetching the index of rows which have None in last column
idx = df[df[4].isnull()].index
idx
# Int64Index([1], dtype='int64')
# Shifting the column values for those rows with index idx
df.iloc[idx,2:] = df.iloc[idx,1:4].values
df
+---+---+---+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | C | C | D | E | # <- Notice the shifting.
+---+---+---+---+---+---+
# Putting NaN for second column where row index is idx
df.iloc[idx,1] = np.nan
# Final output
df
+---+---+-----+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+-----+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | NaN | C | D | E |
+---+---+-----+---+---+---+
I have a pandas DataFrame with some independent columns, and I'm looking for an efficient way to unwind / aggregate them.
So, let's say I have the table:
+-----+-----+-------+------+-------+
| One | Two | Three | Four | Count |
+-----+-----+-------+------+-------+
| a | x | y | y | 3 |
+-----+-----+-------+------+-------+
| b | z | x | x | 5 |
+-----+-----+-------+------+-------+
| c | y | x | y | 1 |
+-----+-----+-------+------+-------+
Where the columns Two, Three and Four are independent.
I would like to end up with the table:
+-----+-------+-------+
| One | Other | Count |
+-----+-------+-------+
| a | x | 3 |
+-----+-------+-------+
| a | y | 6 |
+-----+-------+-------+
| b | x | 10 |
+-----+-------+-------+
| b | z | 5 |
+-----+-------+-------+
| c | x | 1 |
+-----+-------+-------+
| c | y | 2 |
+-----+-------+-------+
What would be the best way to achieve this?
You can use the melt function from pandas to reshape your DataFrame from wide to long format, then group by the One and Other columns and sum the Count column:
import pandas as pd
pd.melt(df, id_vars=['One', 'Count'], value_name='Other').groupby(['One', 'Other'])['Count'].sum().reset_index()
One Other Count
0 a x 3
1 a y 6
2 b x 10
3 b z 5
4 c x 1
5 c y 2
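For reference, a self-contained version using the sample table from the question (as_index=False is just a shortcut for the reset_index above):

```python
import pandas as pd

df = pd.DataFrame({
    "One":   ["a", "b", "c"],
    "Two":   ["x", "z", "y"],
    "Three": ["y", "x", "x"],
    "Four":  ["y", "x", "y"],
    "Count": [3, 5, 1],
})

# Melt the independent columns into one 'Other' column, then aggregate
out = (pd.melt(df, id_vars=["One", "Count"], value_name="Other")
         .groupby(["One", "Other"], as_index=False)["Count"].sum())
```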