I need to encode the following two columns. When I use cat.codes, the problem is that the two columns do not receive the same codes; I want values that are equal to get the same code.
Example:
The input is a dataframe
col1 col2
0 A E
1 B F
2 C A
3 D B
4 A B
5 E A
Assuming this input as df:
col1 col2
0 A E
1 B F
2 C A
3 D B
4 A B
5 E A
You can compute the unique values and use them to factorize:
vals = df[['col1', 'col2']].stack().unique()
d = {k:v for v,k in enumerate(vals)}
df['col1_codes'] = df['col1'].map(d)
df['col2_codes'] = df['col2'].map(d)
output:
col1 col2 col1_codes col2_codes
0 A E 0 1
1 B F 2 3
2 C A 4 0
3 D B 5 2
4 A B 0 2
5 E A 1 0
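If you prefer pd.factorize (mentioned above), the same shared coding can be built directly from the stacked columns; a minimal sketch, assuming the same df:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'A', 'E'],
                   'col2': ['E', 'F', 'A', 'B', 'B', 'A']})

# Factorize the stacked columns once so identical labels share a code,
# then unstack the codes back into one column per original column
stacked = df[['col1', 'col2']].stack()
codes, uniques = pd.factorize(stacked)
unstacked = pd.Series(codes, index=stacked.index).unstack()
df['col1_codes'] = unstacked['col1']
df['col2_codes'] = unstacked['col2']
print(df)
```

This gives the same codes as the dict-based mapping, since both enumerate unique values in order of first appearance.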
You can try the following as well, using scikit-learn's LabelEncoder:
a b
0 apple nokia
1 xiomi samsung
2 samsung apple
3 moto oneplus
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'a': ['apple', 'xiomi', 'samsung', 'moto'],
                   'b': ['nokia', 'samsung', 'apple', 'oneplus']})

# fit the encoder on the combined values of both columns
cat_var = list(df.a.values) + list(df.b.values)
le = preprocessing.LabelEncoder()
le.fit(cat_var)
df['a'] = le.transform(df.a)
df['b'] = le.transform(df.b)
which will give you the output below:
a b
0 0 2
1 5 4
2 4 0
3 1 3
I have a dataframe df:
(A,B) (B,C) (D,B) (E,F)
0 3 0 1
1 1 3 0
2 2 4 2
I want to split it into different columns for all columns in df as shown below:
A B B C D B E F
0 0 3 3 0 0 1 1
1 1 1 1 3 3 0 0
2 2 2 2 4 4 2 2
and add similar columns together:
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
how to achieve this using pandas?
With pandas, you can use this :
out = (
    df
    .T
    .reset_index()
    .assign(col=lambda x: x.pop("index").str.strip("()").str.split(","))
    .explode("col")
    .groupby("col", as_index=False).sum()
    .set_index("col")
    .T
    .rename_axis(None, axis=1)
)
# Output :
print(out)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
If the column labels are actual tuples like (A, B), you can repeat each column once per element of its tuple and sum the duplicated labels:
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).sum(level=0).T
result:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
If a FutureWarning occurs (sum(level=0) is deprecated), use the following code instead:
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).groupby(level=0).sum().T
same result
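A self-contained sketch of the same approach, assuming the headers really are Python tuples (constructing the example df inline):

```python
import pandas as pd

# Tuple keys in the dict make the columns a MultiIndex of tuple labels
df = pd.DataFrame({('A', 'B'): [0, 1, 2],
                   ('B', 'C'): [3, 1, 2],
                   ('D', 'B'): [0, 3, 4],
                   ('E', 'F'): [1, 0, 2]})

# Repeat each column once per element of its tuple label,
# then sum rows that share the same label and transpose back
out = (pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i))
                  for i in df.columns])
         .groupby(level=0).sum()
         .T)
print(out)
```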
Use concat after converting the column labels to a MultiIndex with Series.str.findall, dropping each level in turn:
df.columns = df.columns.str.findall(r'(\w+)').map(tuple)
df = (pd.concat([df.droplevel(x, axis=1) for x in range(df.columns.nlevels)], axis=1)
.groupby(level=0, axis=1)
.sum())
print (df)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
To write the output to a file without the index, use:
df.to_csv('file.csv', index=False)
You can use findall to extract the variables in the header, then melt and explode, and finally pivot_table:
out = (df
   .reset_index().melt('index')
   .assign(variable=lambda d: d['variable'].str.findall(r'(\w+)'))
   .explode('variable')
   .pivot_table(index='index', columns='variable', values='value', aggfunc='sum')
   .rename_axis(index=None, columns=None)
)
Output:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
Reproducible input:
df = pd.DataFrame({'(A,B)': [0, 1, 2],
'(B,C)': [3, 1, 2],
'(D,B)': [0, 3, 4],
'(E,F)': [1, 0, 2]})
printing/saving without index:
print(out.to_string(index=False))
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
# as file
out.to_csv('yourfile.csv', index=False)
I'm trying to add another line of headers above the headers in a pandas dataframe.
Turning this :
import pandas as pd
df = pd.DataFrame(data={'A': range(5), 'B': range(5), 'C':range(5)})
print(df)
A B C
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
into this (for instance):
D E
A B C
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
Having both A and B under D, and C under E. This doesn't seem like something that would be hard with pandas, yet I can't seem to find the answer. How do you do this?
With some help from @Nk03, you can create a mapping of existing column labels to D and E, then build a MultiIndex:
map_dict = {'A': 'D', 'B': 'D' ,'C':'E'}
df.columns = pd.MultiIndex.from_tuples(zip(df.columns.map(map_dict), df.columns))
D E
A B C
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
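An equivalent sketch using pd.concat with dict keys, which builds the same upper level without spelling out a mapping per column (assuming the same D/E grouping):

```python
import pandas as pd

df = pd.DataFrame(data={'A': range(5), 'B': range(5), 'C': range(5)})

# The dict keys become the new top level of the column MultiIndex
out = pd.concat({'D': df[['A', 'B']], 'E': df[['C']]}, axis=1)
print(out)
```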
I am trying to find the inverse pairs and assign a pair number to each, but am stuck on how to move forward from the attempt below.
df1:
col1 col2 no. of records
A B 2
B A 5
C D 4
D C 6
E F 4
G H 6
I am trying to get this result:
col1 col2 pair no. of records totalcount
A B 1 2 7
B A 1 5 7
C D 2 4 10
D C 2 6 10
E F 3 4 4
G H 4 6 6
I tried making a duplicate dataframe df2 and using the isin function, but it only returned True/False, and I was stuck for a long time on how to group the pairs together:
df1['row_matched'] = np.where((df1.col1+df1.col2).isin(df2.col2+ df2.col1), df2['row'], "")
will appreciate any help available!
Use rank with method='dense' on an order-insensitive key for each (col1, col2) pair, which you can build with set:
In [37]: df['pair'] = (df.apply(lambda x: '-'.join(set(x[['col1', 'col2']])), 1)
.rank(method='dense').astype(int))
In [38]: df['totalcount'] = df.groupby('pair')['no.ofrecords'].transform('sum')
In [39]: df
Out[39]:
col1 col2 no.ofrecords pair totalcount
0 A B 2 1 7
1 B A 5 1 7
2 C D 4 2 10
3 D C 6 2 10
4 E F 4 3 4
5 G H 6 4 6
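An alternative sketch that avoids string joining altogether, using frozenset as an order-insensitive key (assuming the example column name no.ofrecords):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E', 'G'],
                   'col2': ['B', 'A', 'D', 'C', 'F', 'H'],
                   'no.ofrecords': [2, 5, 4, 6, 4, 6]})

# frozenset makes the key order-insensitive, so (A, B) and (B, A) match;
# number the distinct keys in order of first appearance
key = df[['col1', 'col2']].apply(frozenset, axis=1)
df['pair'] = key.map({k: i + 1 for i, k in enumerate(key.unique())})
df['totalcount'] = df.groupby('pair')['no.ofrecords'].transform('sum')
print(df)
```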
I have a text file that includes information in the form of:
A 0
B 1
C 4
D 0
E 1
A 0
B 0
C 2
D 1
E 1
A 1
B 0
C 2
D 0
E 0
...
Note that the total number of ABCDE cycles (here only 3 shown) is not known without counting them.
I would like, using Python, to transform this into a matrix that has the form:
A 0 0 1 ...
B 1 0 0 ...
C 4 2 2 ...
D 0 1 0 ...
E 1 1 0 ...
I am not sure what the best way to do this kind of transformation is. Does anyone have a Python script that does this? Are there any functions in NumPy or Pandas that would make this easy? Or should I do it without NumPy or Pandas instead?
Many thanks in advance for your help!
Pandas solution:
import pandas as pd
from io import StringIO
temp=u"""
A 0
B 1
C 4
D 0
E 1
A 0
B 0
C 2
D 1
E 1
A 1
B 0
C 2
D 0
E 0"""
# after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), sep=r"\s+", header=None)
print (df)
0 1
0 A 0
1 B 1
2 C 4
3 D 0
4 E 1
5 A 0
6 B 0
7 C 2
8 D 1
9 E 1
10 A 1
11 B 0
12 C 2
13 D 0
14 E 0
df = df.set_index([df[0], df.groupby(0).cumcount()])[1].unstack()
print (df)
0 1 2
0
A 0 0 1
B 1 0 0
C 4 2 2
D 0 1 0
E 1 1 0
option 1
add an index level and unstack
s.index = [s.index, np.arange(len(s)) // 5]
s.unstack()
option 2
reconstruct
pd.DataFrame(s.values.reshape(5, -1), s.index[:5])
setup
I assumed a series with an index as the first column.
import numpy as np
import pandas as pd
from io import StringIO
txt = """A 0
B 1
C 4
D 0
E 1
A 0
B 0
C 2
D 1
E 1
A 1
B 0
C 2
D 0
E 0"""
s = pd.read_csv(StringIO(txt), sep=r"\s+", header=None, index_col=0).squeeze("columns")
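If you'd rather skip NumPy and Pandas entirely, a plain standard-library sketch for the same reshaping (assuming whitespace-separated input as shown):

```python
from collections import defaultdict

text = """A 0
B 1
C 4
D 0
E 1
A 0
B 0
C 2
D 1
E 1
A 1
B 0
C 2
D 0
E 0"""

# Collect the values for each label across cycles, in order of appearance
rows = defaultdict(list)
for line in text.strip().splitlines():
    label, value = line.split()
    rows[label].append(value)

# Print one line per label: the label followed by its values
for label, values in rows.items():
    print(label, ' '.join(values))
```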
I have a dataset D with columns A through Z, 26 columns in total. I have run some tests and identified which columns are useful to me, stored in a series S.
D #Dataset with columns from A - Z
S
B 0.78
C 1.04
H 2.38
S has the columns and a value associated with each, so I now know their importance and would like to keep only those columns in the dataset, e.g. (B, C, D). How can I do it?
IIUC you can use:
cols = ['B','C','D']
df = df[cols]
Or if column names are in Series as values:
S = pd.Series(['B','C','D'])
df = df[S]
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
S = pd.Series(['B','C','D'])
print (S)
0 B
1 C
2 D
dtype: object
print (df[S])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
Or index values:
S = pd.Series([1,2,3], index=['B','C','D'])
print (S)
B 1
C 2
D 3
dtype: int64
print (df[S.index])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
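One caveat worth noting: df[S.index] raises a KeyError if any label in S is absent from df. A sketch that intersects the labels first (using a hypothetical df that is missing the label H):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
S = pd.Series([0.78, 1.04, 2.38], index=['B', 'C', 'H'])

# 'H' is not in df, so df[S.index] would raise a KeyError;
# intersection keeps only the labels that actually exist
keep = df.columns.intersection(S.index)
print(df[keep])
```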