Pandas - move values of a column under another column - python

So here is my problem.
I'm using pandas to parse csv file.
So my csv file looks like this :
A B C D
1 x 5 e
2 y 6 f
3 z 7 g
What I want to get is :
get all the values of column C
Place them under column A
Same with columns D and B
So it would get me this :
A B C D
1 x
2 y
3 z
5 e
6 f
7 g
However, all i've been able to get is to create a new column that "sums" column A with column C and column B with column D:
A B C D E F
1 x 5 e 15 xe
2 y 6 f 26 yf
3 z 7 g 37 zg
Any idea would be appreciated.
Thanks

Rename column C and D and append them to the bottom of columns A and B`:
result = df[['A', 'B']].append(df[['C','D']].set_axis(['A', 'B'], axis=1)).reset_index(drop=True)

Related

python-CSV Multiple Columns with the same header into one column

I have a CSV file with company data with 22 rows and 6500 columns. The columns have the same names and I should get the columns with the same names stacked into individual columns according to their headers.
I have now the data in one df like this:
Y C Y C Y C
1. a 1. b. 1. c.
2. a. 2. b. 2. c.
and I need to get it like this:
Y C
1. a.
2. a.
1. b.
2. b.
1. c.
2. c.
I would try an attempt where you slice the df in chunks by iteration and concat them back together, since the column names can't be identified distinctly.
EDIT
Changed answer to new input:
chunksize = 2
df = (
pd.concat(
[
df.iloc[:, i:i+chunksize] for i in range(0, len(df.columns), chunksize)
]
)
.reset_index(drop=True))
print(df)
Y C
0 1 a
1 2 a
2 1 b
3 2 b
4 1 c
5 2 c
I couldn't resist looking for a solution.
The best I found so far accounts for the fact that pd.read_csv addresses repeated column names by appending '.N' to the duplicates.
In [2]: df = pd.read_csv('duplicate_columns.csv')
In [3]: df
Out[3]:
1 2 3 4 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2
0 a q j e w e r t y u d s
1 b w w f c e f g d c s a
2 d q e h c f b f a w q r
To put your data into the same column...
Group the columns by their original names.
Apply a flattener to convert to a series of arrays.
Create a new data frame from the series viewed as a dict.
In [3]: grouper = lambda l: l.split('.')[0] # peels off added suffix
In [4]: flattener = lambda v: v.stack().values # reshape groups
In [4]: pd.DataFrame(df.groupby(by=grouper, axis='columns')
...: .apply(flattener)
...: .to_dict())
Out[4]:
1 2 3 4
0 a q j e
1 w e r t
2 y u d s
3 b w w f
4 c e f g
5 d c s a
6 d q e h
7 c f b f
8 a w q r
I'd love to see a cleaner, less obtuse, general solution.

Subtracting multiple columns between dataframes based on key

I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to deduct columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However what the answer does not explain is if there are other columns in the primary df (such as column B) that should not be involved as an index or with the operation.
Is somebody please able to advise?
I was originally performing a loop which find the value in the other df and deducts it however this takes too long for my code to run with the size of data I am working with.
Idea is specify column(s) for maching and column(s) for subtract, convert all not cols columnsnames to MultiIndex, subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or replace not matched values to original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3

Pairwise matrix counts of two columns using pandas [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
I am trying to obtain pairwise counts of two column variables using pandas. I have a dataframe of two columns in the following format:
col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h
What I would like to get as output would be the following matrix of counts, for e.g.:
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
I am getting totally confused with pandas iterating over columns, rows, indexes and such. Appreciate some guidance here.
Pandas often has simple functions built in - in this case, you want crosstab:
pd.crosstab(dat['col1'], dat['col2'])
full code:
import pandas as pd
from io import StringIO
x = '''col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h'''
dat = pd.read_csv(StringIO(x), sep = '\s+')
pd.crosstab(dat['col1'], dat['col2'])
You're looking for a crosstab:
count_matrix = pd.crosstab(index=df["col1"], columns=df["col2"])
print(count_matrix)
col2 e f g h
col1
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you don't like the column/index names in (e.g. still seeing "col1" and "col2"), then you can remove them with rename_axis:
count_matrix = count_matrix.rename_axis(index=None, columns=None)
print(count_matrix)
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you want that all together in one snippet:
count_matrix = (pd.crosstab(index=df["col1"], columns=df["col2"])
.rename_axis(index=None, columns=None))

how to groupby in complicated condition in pandas

I have dataframe like this
A B C
0 1 7 a
1 2 8 b
2 3 9 c
3 4 10 a
4 5 11 b
5 6 12 c
I would like to get groupby result (key=column C) below;
A B
d 12 36
"d" means a or b ,
so I would like to groupby only with "a" and "b".
and then put together as "d".
when I sum up with all the key elements then drop, it consume much time....
One option is to use pandas where to transform the C column so that where it was a or b becomes d and then you can groupby the transformed column and do the normal summary on it, and if rows with c is not desired, you can simply drop it after the summary:
df_sum = df.groupby(df.C.where(~df.C.isin(['a', 'b']), "d")).sum().reset_index()
df_sum
# C A B
#0 c 9 21
#1 d 12 36
df_sum.loc[df_sum.C == "d"]
# C A B
#1 d 12 36
To see more clearly how the where clause works:
df.C.where(~df.C.isin(['a','b']), 'd')
# 0 d
# 1 d
# 2 c
# 3 d
# 4 d
# 5 c
# Name: C, dtype: object
It acts like a replace method and replace a and b with d which will be grouped together when passed to groupby function.

Convert N by N Dataframe to 3 Column Dataframe

I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n Dataframe where:
1) The index represents peoples names
2) The column headers are the same peoples names in the same order
3) Each cell of the Dataframeis the average number of times they email each other each day.
How would I transform that Dataframeinto a Dataframewith 3 columns, where:
1) Column 1 would be the index of the n by n Dataframe
2) Column 2 would be the row headers of the n by n Dataframe
3) Column 3 would be the cell value corresponding to those two names from the index, column header combination from the n by n Dataframe
Edit
Appologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, from the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
##df of all relationships to build
flds = pd.Series(SO_df.fld1.unique())
flds = pd.Series(flds.append(pd.Series(SO_df.fld2.unique())).unique())
combos = []
for L in range(0, len(flds)+1):
for subset in permutations(flds, L):
if len(subset) == 2:
combos.append(subset)
if len(subset) > 2:
break
rel_df = pd.DataFrame.from_records(data = combos, columns = ['fld1','fld2'])
rel_df['value'] = [1,4,5,7,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
print df1
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove duplicates (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4

Categories

Resources