How to recode and count efficiently - python

I have a large csv with three strings per row in this form:
a,c,d
c,a,e
f,g,f
a,c,b
c,a,d
b,f,s
c,a,c
I read in the first two columns, recode the strings to integers, and then remove duplicates, counting how many copies of each row there were, as follows:
import pandas as pd

# Read only the first two columns; header=None with prefix="ID_" names them ID_0 and ID_1.
df = pd.read_csv("test.csv", usecols=[0, 1], prefix="ID_", header=None)
# Map each distinct letter to an integer (the ordering of the set is arbitrary but consistent).
letters = list(set(df.values.flat))
df.replace(to_replace=letters, value=list(range(len(letters))), inplace=True)
# Collapse duplicate rows, counting how often each pair occurred.
df1 = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
print(df1)
This gives:
   ID_0  ID_1  count
0     0     1      2
1     1     0      3
2     2     4      1
3     4     3      1
which is exactly what I need.
However, as my data is large, I would like to make two improvements.
How can I do the groupby first and then recode, instead of the other way round? The problem is that I can't do df1[['ID_0', 'ID_1']].replace(to_replace=letters, value=list(range(len(letters))), inplace=True). This gives the error
"A value is trying to be set on a copy of a slice from a DataFrame"
How can I avoid creating df1? That is, do the whole thing in place.

I like to use sklearn.preprocessing.LabelEncoder to do the letter to digit conversion:
from sklearn.preprocessing import LabelEncoder
# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)
# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
The resulting output:
   ID_0  ID_1  count
0     0     2      2
1     1     3      1
2     2     0      3
3     3     4      1
If you want to convert back to letters at a later point in time, you can use le.inverse_transform:
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.inverse_transform)
Which maps back as expected:
  ID_0 ID_1  count
0    a    c      2
1    b    f      1
2    c    a      3
3    f    g      1
If you just want to know which digit corresponds to which letter, you can look at the le.classes_ attribute. This will give you an array of letters, which is indexed by the digit it encodes to:
le.classes_
['a' 'b' 'c' 'f' 'g']
For a more visual representation, you can cast as a Series:
pd.Series(le.classes_)
0 a
1 b
2 c
3 f
4 g
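If an explicit lookup table is more convenient than positional indexing, here is a small sketch (assuming the fitted le from above):
# digit -> letter, straight from the fitted encoder's classes_
digit_to_letter = dict(enumerate(le.classes_))
# letter -> digit, the reverse mapping
letter_to_digit = {letter: digit for digit, letter in digit_to_letter.items()}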
Timings
Using a larger version of the sample data and the following setup:
import numpy as np

df2 = pd.concat([df] * 10**5, ignore_index=True)

def root(df):
    df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
    le = LabelEncoder()
    le.fit(df[['ID_0', 'ID_1']].values.flat)
    df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
    return df

def pir2(df):
    unq = np.unique(df)
    mapping = pd.Series(np.arange(unq.size), unq)
    return df.stack().map(mapping).unstack() \
             .groupby(df.columns.tolist()).size().reset_index(name='count')
I get the following timings:
%timeit root(df2)
10 loops, best of 3: 101 ms per loop
%timeit pir2(df2)
1 loops, best of 3: 1.69 s per loop
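As a pandas-only aside on question 1 (grouping first and then recoding without tripping the SettingWithCopy warning), here is a minimal sketch, assuming df holds the two raw letter columns from the CSV; pd.factorize does the letter-to-digit mapping, and assigning the result back avoids replacing a slice in place:
import pandas as pd

# Illustrative frame standing in for the first two CSV columns from the question.
df = pd.DataFrame({'ID_0': list('acfacbc'), 'ID_1': list('cagcafa')})

# Group first, while the values are still letters.
df1 = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()

# Recode by assigning back rather than calling replace(..., inplace=True) on a slice,
# which is what raises the SettingWithCopy warning.
codes, uniques = pd.factorize(df1[['ID_0', 'ID_1']].values.ravel())
df1[['ID_0', 'ID_1']] = codes.reshape(-1, 2)
print(df1)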

New Answer
import numpy as np

unq = np.unique(df)
mapping = pd.Series(np.arange(unq.size), unq)
df.stack().map(mapping).unstack() \
  .groupby(df.columns.tolist()).size().reset_index(name='count')
Old Answer
df.stack().rank(method='dense').astype(int).unstack() \
  .groupby(df.columns.tolist()).size().reset_index(name='count')
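For context on the Old Answer, rank(method='dense') gives equal values the same rank and uses consecutive integers, so applied to the stacked letters it behaves like a 1-based label encoder; a tiny illustrative sketch:
import pandas as pd

df = pd.DataFrame({'ID_0': ['a', 'c'], 'ID_1': ['c', 'a']})

# Stacking puts both columns into one Series, so equal letters share a dense rank.
print(df.stack().rank(method='dense').astype(int).unstack())
#    ID_0  ID_1
# 0     1     2
# 1     2     1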

Related

Iterating Conditions through Pandas .loc

I just wanted to ask the community and see if there is a more efficient way to do this.
I have several rows in a data frame and I am using .loc to filter values in column A so that I can perform calculations on column B.
I can easily do something like...
filter_1 = df.loc[df['Condition'] == 1]
And then perform the mathematical calculation on column B that I need.
But there are many conditions I must go through, so I was wondering if I could make a list of the conditions and then iterate them through the .loc function in fewer lines of code?
Would something like this work where I create a list, then iterate the conditions through a loop?
Thank you!
This example gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I change the iteration so that it shows the results for the unique values in column 'a'?
import pandas as pd

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a, b):
    list_1.append([i, j])
df1 = pd.DataFrame(list_1, columns=col)

for i in a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using set
set_a = set(a)
for i in set_a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using pandas mean function
Is this what you are looking for?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
df = pd.DataFrame({'a':a,'b':b})
print (df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
a b
0 1 5
1 2 1
2 1 3
3 2 5
4 1 7
5 2 20
6 1 9
7 2 5
8 1 8
9 2 4
The mean of column 'b' for each value of 'a' is:
     b
a
1  6.4
2  7.0
Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
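As a hedged sketch of the "list of conditions" idea from the question, using the small df1 frame built above (the lambdas and the conditions list are illustrative, not from the original post):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
                    'b': [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]})

# Each entry is a callable that builds a boolean mask for .loc.
conditions = [
    lambda d: d['a'] == 1,
    lambda d: d['a'] == 2,
]

for cond in conditions:
    # .loc with a boolean mask selects the matching rows; then compute on 'b'.
    print(df1.loc[cond(df1), 'b'].mean())   # prints 6.4, then 7.0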

Removing commented rows in place in pandas

I have a dataframe that may have commented characters at the bottom of it. Due to some other reasons, I cannot pass the comment character to initialize the dataframe itself. Here is an example of what I would have:
df = pd.read_csv(file,header=None)
df
0 1
0 132605 1
1 132750 2
2 # total: 100000
Is there a way to remove all rows that start with a comment character in-place -- that is, without having to re-load the data frame?
Using startswith
newdf = df[df.iloc[:, 0].str.startswith('#').ne(True)]
Dataframe:
>>> df
0 1
0 132605 1
1 132750 2
2 # total: 100000
3 foo bar
Dropping in-place:
>>> to_drop = df[0].str.startswith('#').where(lambda s: s).dropna().index
>>> df.drop(to_drop, inplace=True)
>>> df
0 1
0 132605 1
1 132750 2
3 foo bar
Assumptions: you want to find rows where the column labeled 0 starts with '#'. Otherwise, adjust accordingly.
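For completeness, a minimal sketch of the same idea as a single in-place drop, assuming as above that the strings live in the column labeled 0 (the sample frame here is illustrative):
import pandas as pd

# Illustrative frame; the real one comes from read_csv as in the question.
df = pd.DataFrame({0: ['132605', '132750', '# total: 100000'],
                   1: ['1', '2', None]})

# Build the mask once (na=False treats non-string or missing cells as "no match")
# and drop the matching index labels in place.
mask = df[0].str.startswith('#', na=False)
df.drop(df.index[mask], inplace=True)
print(df)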

How to concatenate all (string) values in a given pandas dataframe row to one string?

I have a pandas dataframe that looks like this:
     0     1   2     3        4
0    I  want  to  join  strings
1  But  only  in   row        1
The desired output should look like this:
     0     1   2    3  4                        5
1  But  only  in  row  1  I want to join strings
How to concatenate those strings to a joint string?
IIUC, by using apply + join:
df.apply(lambda x: ' '.join(x.astype(str)), 1)
Out[348]:
0 I want to join strings
1 But only in row 1
dtype: object
Then you can assign them
df1 = df.iloc[1:]
df1['5'] = df.apply(lambda x: ' '.join(x.astype(str)), 1)[0]
df1
Out[361]:
0 1 2 3 4 5
1 But only in row 1 I want to join strings
For Timing :
%timeit df.apply(lambda x : x.str.cat(),1)
1 loop, best of 3: 759 ms per loop
%timeit df.apply(lambda x : ''.join(x),1)
1 loop, best of 3: 376 ms per loop
df.shape
Out[381]: (3000, 2000)
Use str.cat to join the first row, and assign to the second.
i = df.iloc[1:].copy() # the copy is needed to prevent chained assignment
i[df.shape[1]] = df.iloc[0].str.cat(sep=' ')
i
0 1 2 3 4 5
1 But only in row 1 I want to join strings
Another alternative is to add a space to each cell and then sum along the rows:
df[5] = df.add(' ').sum(axis=1).shift(1)
Result:
     0     1   2     3        4                        5
0    I  want  to  join  strings                      NaN
1  But  only  in   row        1  I want to join strings
If your dataset is less than perfect and you want to exclude 'nan' values, you can use this:
df.apply(lambda x: ' '.join(v for v in x.astype(str) if v != "nan"), 1)
I found this particularly helpful in joining columns containing parts of addresses together where some parts like SubLocation (e.g. apartment #) aren't relevant for all addresses.
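On recent pandas versions, a hedged alternative sketch for skipping missing values is a row-wise join that drops NaN cells first (the sample frame below is illustrative):
import pandas as pd

# Illustrative frame with a missing cell.
df = pd.DataFrame([['I', 'want', 'to', 'join', 'strings'],
                   ['But', 'only', 'in', 'row', None]])

# Row-wise join; dropna() skips missing cells, so no literal 'nan' ends up in the result.
joined = df.apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1)
print(joined)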

Pandas Dataframe Reshaping

I have a dataframe as show below
>> df
A 1
B 2
A 5
B 6
A 7
B 8
How do I reformat it to make it
A 1 5 7
B 2 6 8
Thanks
Given a data frame like this
df = pd.DataFrame(dict(one=list('ABABAB'), two=range(6)))
you can do
df.groupby('one').two.apply(lambda s: s.reset_index(drop=True)).unstack()
# 0 1 2
# one
# A 0 2 4
# B 1 3 5
or (slightly slower, and giving a slightly different result)
df.groupby('one').apply(lambda d: d.two.reset_index(drop=True))
# two 0 1 2
# one
# A 0 2 4
# B 1 3 5
The first approach works with a DataFrameGroupBy, the second uses a SeriesGroupBy.
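As an aside (not from the original answers), another common idiom for this reshape numbers the rows within each group with cumcount and then unstacks; a minimal sketch:
import pandas as pd

df = pd.DataFrame(dict(one=list('ABABAB'), two=range(6)))

# cumcount gives each row its position within its 'one' group (0, 1, 2, ...),
# which becomes the new column label after unstacking.
out = df.set_index(['one', df.groupby('one').cumcount()])['two'].unstack()
print(out)
#      0  1  2
# one
# A    0  2  4
# B    1  3  5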
You can grab the series and use np.reshape to keep the correct dimensions.
order='F' fills column by column (as in Fortran), while order='C' fills row by row (as in C).
Then put the result back into a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(10), columns=['a'])
data = df['a'].values.reshape((2, 5), order='F')
df = pd.DataFrame(data=data, index=['a', 'b'])
How did you generate this data frame? I think it should have been generated using a dictionary, with the dataframe then built from that dict:
import pandas as pd

d = {'A': [1, 5, 7], 'B': [2, 6, 8]}
df = pd.DataFrame(data=d, index=['p1', 'p2', 'p3'])
and then you can use df.T to transpose your dataframe if you need to.
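For illustration (the p1/p2/p3 labels are just the placeholder index from above), the transpose step then yields the desired row-per-letter layout:
print(df.T)
#    p1  p2  p3
# A   1   5   7
# B   2   6   8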

Update Pandas Cells based on Column Values and Other Columns

I am looking to update many columns based on the values in one column; this is easy with a loop but takes far too long for my application when there are many columns and many rows. What is the most elegant way to get the desired counts for each letter?
Desired Output:
Things           count_A  count_B  count_C  count_D
['A','B','C']          1        1        1        0
['A','A','A']          3        0        0        0
['B','A']              1        1        0        0
['D','D']              0        0        0        2
The most elegant is definitely the CountVectorizer from sklearn.
I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.
First, we'll do it step by step:
let's create some data
raw = ['ABC', 'AAA', 'BA', 'DD']
things = [list(s) for s in raw]
Then read in some packages and initialize count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
Next we generate a matrix of counts
matrix = cv.fit_transform(things)
names = ["count_"+n for n in cv.get_feature_names()]
And save as a data frame
df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)
Generating a data frame like this:
count_A count_B count_C count_D
ABC 1 1 1 0
AAA 3 0 0 0
BA 1 1 0 0
DD 0 0 0 2
Elegant version:
Everything above in one line
df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)
Timing:
You mentioned that you're working with a rather large dataset, so I used the %%timeit function to give a time estimate.
Previous response by @piRSquared (which otherwise looks very good!)
pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)
100 loops, best of 3: 3.27 ms per loop
My answer:
pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)
1000 loops, best of 3: 1.08 ms per loop
According to my testing, CountVectorizer is about 3x faster.
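One compatibility note: on newer scikit-learn releases (1.0 and later) get_feature_names was replaced by get_feature_names_out, so there the column names would be built as:
names = ["count_" + n for n in cv.get_feature_names_out()]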
option 1
apply + value_counts
s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')
pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)
option 2
use pd.DataFrame(s.tolist()) + stack / groupby / unstack
pd.concat([s,
           pd.DataFrame(s.tolist()).stack()
             .groupby(level=0).value_counts()
             .unstack(fill_value=0)],
          axis=1)
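A third, dependency-light sketch (my own addition, not part of the original answers) builds the counts with collections.Counter and lets the DataFrame constructor align the letters into columns:
import pandas as pd
from collections import Counter

s = pd.Series([list('ABC'), list('AAA'), list('BA'), list('DD')], name='Things')

# One Counter per row; the DataFrame constructor unions the keys into columns.
counts = pd.DataFrame([Counter(x) for x in s]).fillna(0).astype(int).add_prefix('count_')
print(pd.concat([s, counts], axis=1))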
