I have a dataframe with dtype=object columns, i.e. categorical variables, and I'd like the counts of each level for every one of them. The result should be a pretty summary of all the categorical variables.
To achieve the aforementioned goals, I tried the following:
(line 1) grab the names of all object-type variables
(line 2) count the number of observations for each level (a, b of v1)
(line 3) rename the column so it reads "count"
stringCol = list(df.select_dtypes(include=['object'])) # names of the categorical (object-dtype) columns
a = df.groupby(stringCol[0]).agg({stringCol[0]: 'count'})
a = a.rename(index=str, columns={stringCol[0]: 'count'}); a
count
v1
a 1279
b 2382
I'm not sure how to elegantly get the following result where all string column counts are printed. Like so (only v1 and v4 shown, but should be able to print such results for a variable number of columns):
count count
v1 v4
a 1279 l 32
b 2382 u 3055
y 549
The way I can think of doing it is:
select one element of stringCol
calculate the count for each group of the column.
store the result in a Pandas dataframe.
store the Pandas dataframe in an object (list?)
repeat
if last element of stringCol is done, break.
but there must be a better way than that; I'm just not sure what it is.
I think the simplest approach is to use a loop:
df = pd.DataFrame({'A':list('abaaee'),
'B':list('abbccf'),
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aacbbb')})
print (df)
A B C D E F
0 a a 7 1 5 a
1 b b 8 3 3 a
2 a b 9 5 6 c
3 a c 4 7 9 b
4 e c 2 1 2 b
5 e f 3 0 4 b
stringCol = list(df.select_dtypes(include=['object']))
for c in stringCol:
    a = df[c].value_counts().rename_axis(c).to_frame('count')
    # alternative:
    # a = df.groupby(c)[c].count().to_frame('count')
    print (a)
count
A
a 3
e 2
b 1
count
B
b 2
c 2
a 1
f 1
count
F
b 3
a 2
c 1
For a list of DataFrames, use a list comprehension:
dfs = [df[c].value_counts().rename_axis(c).to_frame('count') for c in stringCol]
print (dfs)
[ count
A
a 3
e 2
b 1, count
B
b 2
c 2
a 1
f 1, count
F
b 3
a 2
c 1]
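If you also want the side-by-side layout the question asks for (the v1/v4 example), one option is to concatenate the per-column frames along the columns. A minimal sketch, assuming the dfs list built above; the levels are aligned on a shared index, so a column shows NaN for levels it does not contain:
pd.concat(dfs, axis=1, keys=[d.index.name for d in dfs])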
Related
I'm new to Python and dataframes, so I am wondering if someone knows how I could accomplish the following. I have a dataframe with many columns, some of which share a common prefix followed by an underscore and a number (bird_1, bird_2, bird_3). I want to essentially merge all of the columns that share a prefix into a single column holding all the values that were contained in the constituent columns. Then I'd like to run df[column].value_counts() for each.
Initial dataframe
Final dataframe
For df['bird'].value_counts(), I would get a count of 1 for A-L.
For df['cat'].value_counts(), I would get a count of 3 for A, 4 for B, and 1 for C.
The ultimate goal is to get a count of unique values for each column type (bird, cat, dog, etc.)
You can do:
# strip the "_<number>" suffix so columns from the same group share a name
df.columns = [col.split("_")[0] for col in df.columns]
# stack every column into one long series keyed by the (now shared) column name
df = df.unstack().reset_index(1, drop=True).reset_index()
# number the values within each group so they can be pivoted back into columns
df["id"] = df.groupby("index").cumcount()
# one column per group, padded with NaN where a group has fewer values
df = df.pivot(index="id", values=0, columns="index")
Outputs:
index bird cat
id
0 A A
1 B A
2 C A
3 D B
4 E B
5 F B
6 G B
7 H C
8 I NaN
9 J NaN
10 K NaN
11 L NaN
From there to get counts of all possible values:
df.T.stack().reset_index(1, drop=True).reset_index().groupby(["index", 0]).size()
Outputs:
index 0
bird A 1
B 1
C 1
D 1
E 1
F 1
G 1
H 1
I 1
J 1
K 1
L 1
cat A 3
B 4
C 1
dtype: int64
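Alternatively, once the frame is in the wide bird/cat shape, the counts can be read off column by column. A sketch, assuming df is the pivoted result above:
# value_counts per column; NaN where a value does not occur in that column
df.apply(pd.Series.value_counts)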
I want to update multiple rows and columns in a CSV file using pandas.
I've tried using the iterrows() method, but it only works on a single column.
Here is the logic I want to apply across multiple rows and columns:
if value < mean:
    value += std_dev
else:
    value -= std_dev
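A vectorized sketch of that rule applied to every numeric cell at once; the file name and the per-column mean/std_dev are assumptions for illustration, not from the question:
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')                     # hypothetical file name
num_cols = df.select_dtypes('number').columns
mean = df[num_cols].mean()                       # per-column mean
std_dev = df[num_cols].std()                     # per-column standard deviation

# add std_dev where the value is below the column mean, subtract it otherwise
df[num_cols] = np.where(df[num_cols] < mean, df[num_cols] + std_dev, df[num_cols] - std_dev)

df.to_csv('data.csv', index=False)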
Here is another way of doing it.
Consider your data looks like this:
price strings value
0 1 A a
1 2 B b
2 3 C c
3 4 D d
4 5 E f
Now let's make the strings column the index:
df.set_index('strings', inplace=True)
#Result
price value
strings
A 1 a
B 2 b
C 3 c
D 4 d
E 5 f
Now set the values of rows C, D, and E to 0:
df.loc[['C', 'D','E']] = 0
#Result
price value
strings
A 1 a
B 2 b
C 0 0
D 0 0
E 0 0
Or you can do it more precisely, without setting the index:
df.loc[df.strings.isin(["C", "D", "E"]), df.columns.difference(["strings"])] = 0
df
Out[82]:
price strings value
0 1 A a
1 2 B b
2 0 C 0
3 0 D 0
4 0 E 0
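To apply the question's mean/std_dev rule with .loc, here is a sketch assuming a single numeric column named price (just for illustration):
mean = df['price'].mean()
std_dev = df['price'].std()

below = df['price'] < mean
df.loc[below, 'price'] += std_dev      # below the mean: add one standard deviation
df.loc[~below, 'price'] -= std_dev     # at or above the mean: subtract it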
In the following dataset, what's the best way to duplicate rows so that every Type whose groupby(['Type']) count is less than 3 is brought up to 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated 2 times at the end. This is only an example deck; the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
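For reference, a sketch of one way such a func could look (this is only an illustration, not the approach in the answer below; groupby().apply() returns the rows grouped by Type rather than with the duplicates appended at the end, and it is likely slow on 20 million rows):
def pad_to_three(g):
    # repeat the group's last row until the group has at least 3 rows
    if len(g) >= 3:
        return g
    extra = pd.concat([g.iloc[[-1]]] * (3 - len(g)))
    return pd.concat([g, extra])

df1 = df.groupby('Type', group_keys=False).apply(pad_to_three).reset_index(drop=True)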
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append is available in pandas>=0.23.0; remove it if using a lower version.
EDIT: If the data contains multiple value columns, then set all columns except one as the index, repeat, and then reset_index, as in:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
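On pandas versions where DataFrame.append has been removed (2.0+), the same idea can be written with pd.concat; a sketch of the equivalent call, not part of the original answer:
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type','Val']]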
This is similar to LabelEncoder from scikit-learn, but with the requirement that the numeric assignments occur in order of the frequency of the category, i.e. the most frequently occurring category is assigned the highest/lowest (depending on the use case) number.
E.g. If the variable can take values [a, b, c] with frequencies such as
Category
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
1 b
1 b
1 b
1 b
1 b
2 c
2 c
a occurs 5 times, b occurs 10 times and c occurs 2 times.
Then I want the replacements to be done as b=1, a=2 and c=3.
See argsort:
df['Order'] = df['Frequency'].argsort() + 1
df
returns
Category Frequency Order
0 a 5 3
1 b 10 1
2 c 2 2
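Note that argsort returns the sorting permutation, not ranks, so the Order above does not match the requested b=1, a=2, c=3. If the goal is strictly "most frequent gets 1", a rank-based sketch (my suggestion, assuming the frequencies are distinct):
df['Order'] = df['Frequency'].rank(ascending=False).astype(int)
# Frequency 10 -> 1, 5 -> 2, 2 -> 3, i.e. b=1, a=2, c=3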
If you are using pandas, you can use its map() method:
import pandas as pd
data = pd.DataFrame([['a'], ['b'], ['c']], columns=['category'])
print(data)
category
0 a
1 b
2 c
mapping_dict = {'b':1, 'a':2, 'c':3}
print(data['category'].map(mapping_dict))
0 2
1 1
2 3
LabelEncoder uses np.unique to find the unique values present in a column which returns values in alphabetically sorted order, so you cannot use the custom ordering in it.
As suggested by @Vivek Kumar, I used the map functionality with a dict that has the sorted column values as keys and their positions as values:
data.Category = data.Category.map(dict(zip(data.Category.value_counts().index, range(1, len(data.Category.value_counts().index)+1))))
It looks a bit dirty; it would be much better to split it into a couple of lines, like this:
sorted_indices = data.Category.value_counts().index
data.Category = data.Category.map(dict(zip(sorted_indices, range(1, len(sorted_indices)+1))))
This is the closest I have to my requirement. The output looks like this:
Category
0 2
1 2
2 2
3 2
4 2
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 3
16 3
I am trying to select a bunch of single rows from a bunch of dataframes and make a new dataframe by concatenating them together.
Here is a simple example
x=pd.DataFrame([[1,2,3],[1,2,3]],columns=["A","B","C"])
A B C
0 1 2 3
1 1 2 3
a=x.loc[0,:]
A 1
B 2
C 3
Name: 0, dtype: int64
b=x.loc[1,:]
A 1
B 2
C 3
Name: 1, dtype: int64
c=pd.concat([a,b])
I end up with this:
A 1
B 2
C 3
A 1
B 2
C 3
Name: 0, dtype: int64
Whereas I would expect the original dataframe:
A B C
0 1 2 3
1 1 2 3
I can get the values and create a new dataframe, but this doesn't seem like the way to do it.
If you want to stack two Series as rows (vertical stacking), one option is a concat along the columns followed by a transpose.
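A sketch of that option, using the a and b Series from the question:
pd.concat([a, b], axis=1).T
#    A  B  C
# 0  1  2  3
# 1  1  2  3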
Another is using np.vstack:
import numpy as np

pd.DataFrame(np.vstack([a, b]), columns=a.index)
A B C
0 1 2 3
1 1 2 3
Since you are slicing by index, I'd use .iloc, and note the difference between [[]] and [], which return a DataFrame and a Series respectively*
a = x.iloc[[0]]
b = x.iloc[[1]]
pd.concat([a, b])
# A B C
#0 1 2 3
#1 1 2 3
To still use .loc, you'd do something like
a = x.loc[[0,]]
b = x.loc[[1,]]
*There's a small caveat that if index 0 is duplicated in x then x.loc[0,:] will return a DataFrame and not a Series.
It looks like you want to make a new dataframe from a collection of records. There's a method for that:
import pandas as pd
x = pd.DataFrame([[1,2,3],[1,2,3]], columns=["A","B","C"])
a = x.loc[0,:]
b = x.loc[1,:]
c = pd.DataFrame.from_records([a, b])
print(c)
# A B C
# 0 1 2 3
# 1 1 2 3