SUMMARY: The output of my code gives me a dataframe in the following format. The column headers of the dataframe are the labels for the text in the Content column; the labels will be used as training data for a multilabel classifier in the next step. This is a snippet of the actual data, which is much larger.
Since they are column titles, they cannot be used directly as labels mapped to the text they describe.
Content  A  B  C  D  E
zxy      1  2     1
wvu      1     2  1
tsr      1  2        2
qpo         1  1  1
nml            2  2
kji      1     1     2
hgf            1     2
edc      1  2     1
Where Content is the column containing the text, and A, B, C, D, and E are the column headers that need to be turned into labels. Only cells containing 1s or 2s are relevant; the empty cells are not relevant and thus don't need to be converted to labels.
UPDATE: Converting the df to CSV shows the empty cells are truly blank ('' rather than ' ').
UPDATE: After some digging, the numbers might not be ints but strings.
I know that when feeding the text + labels into a classifier for processing, the lengths of both arrays need to be equal, or the input is rejected as invalid.
Is there a way I can convert the column titles to labels for the text in Content in the DF?
EXPECTED OUTPUT:
   Content  A  B  C  D  E  Labels
0  zxy      1  2     1     A, B, D
1  wvu      1     2  1     A, C, D
2  tsr      1  2        2  A, B, E
3  qpo         1  1  1     B, C, D
4  nml            2  2     C, D
5  kji      1     1     2  A, C, E
6  hgf            1     2  C, E
7  edc      1  2     1     A, B, D
Full Solution:
# first: strip leading/trailing whitespace from every value; fine for all
# columns here because they are assumed to be read in as strings
for col in df.columns:
    df[col] = df[col].str.strip()

# fill NaN with 0
df.fillna(0, inplace=True)

# replace '' with 0
df.replace('', 0, inplace=True)

# convert to int; this must only be done on the specific columns with the numeric data.
# this list is the column names as you've presented them; if they are different
# in the real data, replace them
for col in ['A', 'B', 'C', 'D', 'E']:
    df = df.astype({col: 'int16'})

print(df.info())
# you should end up with something like this.
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
Content 8 non-null object
A 8 non-null int16
B 8 non-null int16
C 8 non-null int16
D 8 non-null int16
E 8 non-null int16
dtypes: int16(5), object(1)
memory usage: 272.0+ bytes
"""
We can then use dot. Notice that here I treat the blanks as np.nan; if your data contains real blank strings instead, use the commented-out variant of the last line.
# make certain the label names match the appropriate columns
s = df.loc[:, ['A', 'B', 'C', 'D', 'E']]
# or
s = df.loc[:, 'A':]
df['Labels'] = (s > 0).dot(s.columns + ',').str[:-1]  # columns A:E need to be numeric, not str
# df['Labels'] = (~s.isin([''])).dot(s.columns + ',').str[:-1]  # if the blanks are '' strings
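To see why the dot trick works: dotting a boolean frame with a string array multiplies each string by True/False (keeping or emptying it) and then concatenates across the row. A tiny self-contained sketch:

import pandas as pd

mask = pd.DataFrame({'A': [True, False], 'B': [True, True]})
print(mask.dot(mask.columns + ',').str[:-1])
# 0    A,B
# 1      B
# dtype: object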
Here's another way using np.where and groupby. np.where returns the row and column positions of every nonzero cell; indexing the columns with those positions yields one label per hit, which groupby then joins back per row (the label columns must already be numeric):
import numpy as np

s = df.loc[:, 'A':'E']
r, c = np.where(s > 0)
df['Labels'] = pd.Series(s.columns[c], index=s.index[r]).groupby(level=0).agg(', '.join)
Output:
  Content  A  B  C  D  E   Labels
0     zxy  1  2  0  1  0  A, B, D
1     wvu  1  0  2  1  0  A, C, D
2     tsr  1  2  0  0  2  A, B, E
3     qpo  0  1  1  1  0  B, C, D
4     nml  0  0  2  2  0     C, D
5     kji  1  0  1  0  2  A, C, E
6     hgf  0  0  1  0  2     C, E
7     edc  1  2  0  1  0  A, B, D
You can also do that as follows:
# melt the two-dimensional representation into
# a more or less one-dimensional representation
df_flat = df.melt(id_vars=['Content'])

# filter out all rows which belong to empty cells;
# the following is a fail-safe method that should
# work for all datatypes you might encounter in your
# columns
df_flat = df_flat[~df_flat['value'].isna() & (df_flat['value'] != 0)]
df_flat = df_flat[~df_flat['value'].astype('str').str.strip().isin(['', 'nan'])]

# join the variables used per original row
# (sort=False keeps the rows in their original order)
df_flat.groupby(['Content'], sort=False).agg({'variable': lambda ser: ', '.join(ser)})
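For orientation, df.melt(id_vars=['Content']) produces one row per (Content, column) pair; assuming the blanks were read in as NaN (exact dtypes may differ in your data), its first rows look like this:

print(df.melt(id_vars=['Content']).head(4))
#   Content variable  value
# 0     zxy        A    1.0
# 1     wvu        A    1.0
# 2     tsr        A    1.0
# 3     qpo        A    NaN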
The output looks like this:
         variable
Content
zxy       A, B, D
wvu       A, C, D
tsr       A, B, E
qpo       B, C, D
nml          C, D
kji       A, C, E
hgf          C, E
edc       A, B, D
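If you need the labels as a column on the original frame rather than a grouped result, you can map the aggregation back onto Content; a small sketch building on df_flat from above:

labels = df_flat.groupby('Content', sort=False)['variable'].agg(', '.join)
df['Labels'] = df['Content'].map(labels)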
Given the following input data:
import pandas as pd
import io
raw = """idx Content A B C D E
0   zxy     1 2   1
1   wvu     1   2 1
2   tsr     1 2     2
3   qpo       1 1 1
4   nml         2 2
5   kji     1   1   2
6   hgf         1   2
7   edc     1 2   1"""
df = pd.read_fwf(io.StringIO(raw))
df.drop(['idx'], axis='columns', inplace=True)
Edit: I just removed 'idx' right after reading to create a structure like the original dataframe, and added some fail-safe code that works with different datatypes (the two lines below the melt call). If more is known about how the missing values are actually represented, the code can be simplified.
Related
I have two columns.
The first column has the values A, B, C, D, and the second column has values corresponding to A, B, C, and D.
I'd like to convert/transpose A, B, C and D into 4 columns named A, B, C, D and have whatever values had previously corresponded to A, B, C, D (the original 2nd column) ordered beneath the respective column--A, B, C, or D. The original order must be preserved.
Here's an example.
Input:
A|1
B|2
C|3
D|4
A|3
B|6
C|3
D|6
Desired output:
A|B|C|D
1|2|3|4
3|6|3|6
Any ideas on how I can accomplish this using Pandas/Python?
Thanks a lot!
Very similar to pivoting with two columns (Q/A 10 here):
# cumcount numbers the repeated keys 0, 1, ... so each (idx, col1) pair is unique for pivot
(df.assign(idx=df.groupby('col1').cumcount())
   .pivot(index='idx', columns='col1', values='col2')
)
Output:
col1 A B C D
idx
0 1 2 3 4
1 3 6 3 6
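Putting it together as a runnable sketch (the input is reconstructed here with the assumed column names col1/col2):

import pandas as pd

df = pd.DataFrame({'col1': list('ABCDABCD'),
                   'col2': [1, 2, 3, 4, 3, 6, 3, 6]})
out = (df.assign(idx=df.groupby('col1').cumcount())
         .pivot(index='idx', columns='col1', values='col2'))
print(out)  # reproduces the output above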
To ensure the original column order, you need to "capture" the order first; I am going to use the unique method for this situation.
Given df,
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [*'ZCYBWA']*2, 'Col2': np.arange(12)})
Col1 Col2
0 Z 0
1 C 1
2 Y 2
3 B 3
4 W 4
5 A 5
6 Z 6
7 C 7
8 Y 8
9 B 9
10 W 10
11 A 11
Let's get order using unique:
order = df['Col1'].unique()
Then we can reshape using:
df.set_index([df.groupby('Col1').cumcount(), 'Col1'])['Col2'].unstack()
Col1 A B C W Y Z
0 5 3 1 4 2 0
1 11 9 7 10 8 6
But, adding reindex we can get original order:
df.set_index([df.groupby('Col1').cumcount(), 'Col1'])['Col2'].unstack().reindex(order, axis=1)
Col1 Z C Y B W A
0 0 1 2 3 4 5
1 6 7 8 9 10 11
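The same order-capturing trick combines with the pivot approach from the first answer as well; a minimal sketch reusing order and the df defined above:

out = (df.assign(idx=df.groupby('Col1').cumcount())
         .pivot(index='idx', columns='Col1', values='Col2')
         .reindex(order, axis=1))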
I'm new to Python and dataframes, so I am wondering if someone knows how I could accomplish the following. I have a dataframe with many columns, some of which share a beginning and have an underscore followed by a number (bird_1, bird_2, bird_3). I want to merge all of the columns that share a beginning into single columns containing all the values from the constituent columns. Then I'd like to run df[column].value_counts() for each.
Initial dataframe
Final dataframe
For df['bird'].value_counts(), I would get a count of 1 for A-L.
For df['cat'].value_counts(), I would get a count of 3 for A, 4 for B, and 1 for C.
The ultimate goal is to get a count of unique values for each column type (bird, cat, dog, etc.)
You can do:
# strip the _N suffix so columns sharing a prefix collapse to one name
df.columns = [col.split("_")[0] for col in df.columns]
# stack all same-named columns on top of each other
df = df.unstack().reset_index(1, drop=True).reset_index()
# build a positional index within each group, then pivot back to wide
df["id"] = df.groupby("index").cumcount()
df = df.pivot(index="id", values=0, columns="index")
Outputs:
index bird cat
id
0 A A
1 B A
2 C A
3 D B
4 E B
5 F B
6 G B
7 H C
8 I NaN
9 J NaN
10 K NaN
11 L NaN
From there to get counts of all possible values:
df.T.stack().reset_index(1, drop=True).reset_index().groupby(["index", 0]).size()
Outputs:
index 0
bird A 1
B 1
C 1
D 1
E 1
F 1
G 1
H 1
I 1
J 1
K 1
L 1
cat A 3
B 4
C 1
dtype: int64
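With the reshaped frame from the first step, the per-column counts asked for in the question are now one call away:

print(df['bird'].value_counts())  # each of A-L appears once
print(df['cat'].value_counts())   # A: 3, B: 4, C: 1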
I want to update multiple rows and columns in a CSV file, using pandas
I've tried using the iterrows() method, but it only works on a single column.
Here is the logic I want to apply to multiple rows and columns:
if value < mean:
    value += std_dev
else:
    value -= std_dev
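For reference, that logic vectorizes cleanly with np.where, avoiding iterrows() entirely. A minimal sketch; 'data.csv' and the column list are placeholders for your real file and numeric columns:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')   # placeholder path
for col in ['col1', 'col2']:   # placeholder numeric column names
    mean, std = df[col].mean(), df[col].std()
    # add the std dev below the mean, subtract it at or above the mean
    df[col] = np.where(df[col] < mean, df[col] + std, df[col] - std)
df.to_csv('data.csv', index=False)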
Here is another way of doing it,
Consider your data is like this:
price strings value
0 1 A a
1 2 B b
2 3 C c
3 4 D d
4 5 E f
Now let's make the strings column the index:
df.set_index('strings', inplace=True)
#Result
price value
strings
A 1 a
B 2 b
C 3 c
D 4 d
E 5 f
Now set the values of rows C, D, and E to 0:
df.loc[['C', 'D','E']] = 0
#Result
price value
strings
A 1 a
B 2 b
C 0 0
D 0 0
E 0 0
Or you can do it more precisely, without changing the index: df.strings.isin([...]) builds the row mask, and df.columns.difference(["strings"]) selects every column except strings, so only the intended cells are touched.
df.loc[df.strings.isin(["C", "D", "E"]), df.columns.difference(["strings"])] = 0
df
Out[82]:
price strings value
0 1 A a
1 2 B b
2 0 C 0
3 0 D 0
4 0 E 0
I am trying to select single rows from a bunch of dataframes and make a new dataframe by concatenating them together.
Here is a simple example
x=pd.DataFrame([[1,2,3],[1,2,3]],columns=["A","B","C"])
A B C
0 1 2 3
1 1 2 3
a=x.loc[0,:]
A 1
B 2
C 3
Name: 0, dtype: int64
b=x.loc[1,:]
A 1
B 2
C 3
Name: 1, dtype: int64
c=pd.concat([a,b])
I end up with this:
A 1
B 2
C 3
A 1
B 2
C 3
Name: 0, dtype: int64
Whereas I would expect the original dataframe:
A B C
0 1 2 3
1 1 2 3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to stack two Series vertically as rows, one option is a concat followed by a transpose.
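For example, with a and b being the Series sliced above:

# concatenate the Series side by side as columns, then flip back to rows
pd.concat([a, b], axis=1).T
#    A  B  C
# 0  1  2  3
# 1  1  2  3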
Another is using np.vstack:
import numpy as np

pd.DataFrame(np.vstack([a, b]), columns=a.index)
A B C
0 1 2 3
1 1 2 3
Since you are slicing by index, I'd use .iloc, and notice the difference between [[]] and [], which return a DataFrame and a Series respectively.*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
# A B C
#0 1 2 3
#1 1 2 3
To still use .loc, you'd do something like
a = x.loc[[0,]]
b = x.loc[[1,]]
*There's a small caveat that if index 0 is duplicated in x then x.loc[0,:] will return a DataFrame and not a Series.
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1,2,3],[1,2,3]], columns=["A","B","C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
# A B C
# 0 1 2 3
# 1 1 2 3
I have a dataframe that has dtype=object, i.e. categorical variables, for which I'd like to have the counts of each level. I'd like the result to be a pretty summary of all categorical variables.
To achieve the aforementioned goals, I tried the following:
(line 1) grab the names of all object-type variables
(line 2) count the number of observations for each level (a, b of v1)
(line 3) rename the column so it reads "count"
stringCol = list(df.select_dtypes(include=['object']))  # list of categorical (object) column names
a = df.groupby(stringCol[0]).agg({stringCol[0]: 'count'})
a = a.rename(index=str, columns={stringCol[0]: 'count'}); a
count
v1
a 1279
b 2382
I'm not sure how to elegantly get the following result, where all string column counts are printed, like so (only v1 and v4 shown, but it should work for a variable number of columns):
    count        count
v1           v4
a    1279    l      32
b    2382    u    3055
             y     549
The way I can think of doing it is:
select one element of stringCol
calculate the count for each group of the column
store the result in a Pandas dataframe
store the Pandas dataframe in an object (list?)
repeat
if the last element of stringCol is done, break
but there must be a better way than that; I'm just not sure how to do it.
I think the simplest approach is to use a loop:
df = pd.DataFrame({'A': list('abaaee'),
                   'B': list('abbccf'),
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aacbbb')})
print (df)
A B C D E F
0 a a 7 1 5 a
1 b b 8 3 3 a
2 a b 9 5 6 c
3 a c 4 7 9 b
4 e c 2 1 2 b
5 e f 3 0 4 b
stringCol = list(df.select_dtypes(include=['object']))
for c in stringCol:
    a = df[c].value_counts().rename_axis(c).to_frame('count')
    # alternative:
    # a = df.groupby(c)[c].count().to_frame('count')
    print(a)
count
A
a 3
e 2
b 1
count
B
b 2
c 2
a 1
f 1
count
F
b 3
a 2
c 1
For list of DataFrames use list comprehension:
dfs = [df[c].value_counts().rename_axis(c).to_frame('count') for c in stringCol]
print (dfs)
[ count
A
a 3
e 2
b 1, count
B
b 2
c 2
a 1
f 1, count
F
b 3
a 2
c 1]
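If you prefer a single combined summary instead of a list, the pieces concatenate cleanly with keys (a small sketch building on dfs and stringCol from above):

summary = pd.concat(dfs, keys=stringCol)
print(summary)
#      count
# A a      3
#   e      2
#   b      1
# B b      2
# ...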