Value counts for specific items in a DataFrame - python

I have a dataframe (df) of messages that looks similar to the following:
From               To
person1@gmail.com  stranger1@gmail.com
person2@gmail.com  stranger1@gmail.com, stranger2@gmail.com
person3@gmail.com  person1@gmail.com, stranger2@gmail.com
I want to count the number of times each email from a specific list appears. My list is:
lst = ['person1@gmail.com', 'stranger2@gmail.com', 'person3@gmail.com']
I'm hoping to receive a dataframe/series/dictionary with a result like this:
list_item            Total_Count
person1@gmail.com    2
stranger2@gmail.com  2
person3@gmail.com    1
I've tried several different things, but haven't succeeded. I thought I could try something like the for loop below (it raises a SyntaxError), but I cannot figure out the right way to write it.
for To, From in zip(df.To, df.From):
    for item in lst:
        if To,From contains item in emails:
            Count(item)
Should this type of task be accomplished with a for loop, or are there out-of-the-box pandas methods that could solve it more easily?

stack-based
Split your To column, stack everything and then do a value_counts:
v = pd.concat([df.From, df.To.str.split(', ', expand=True)], axis=1).stack()
v[v.isin(lst)].value_counts()
stranger2@gmail.com    2
person1@gmail.com      2
person3@gmail.com      1
dtype: int64
melt
Another option is to use melt:
v = (df.set_index('From')
       .To.str.split(', ', expand=True)
       .reset_index()
       .melt()['value']
    )
v[v.isin(lst)].value_counts()
stranger2@gmail.com    2
person1@gmail.com      2
person3@gmail.com      1
Name: value, dtype: int64
Note that set_index + str.split + reset_index is equivalent to the pd.concat([...]) approach above.
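For reference, a minimal end-to-end sketch of the stack-based approach, assuming the sample data from the question:
import pandas as pd

df = pd.DataFrame({
    'From': ['person1@gmail.com', 'person2@gmail.com', 'person3@gmail.com'],
    'To': ['stranger1@gmail.com',
           'stranger1@gmail.com, stranger2@gmail.com',
           'person1@gmail.com, stranger2@gmail.com'],
})
lst = ['person1@gmail.com', 'stranger2@gmail.com', 'person3@gmail.com']

# flatten From plus every split-out To address into one Series, then filter and count
v = pd.concat([df.From, df.To.str.split(', ', expand=True)], axis=1).stack()
print(v[v.isin(lst)].value_counts())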


Count the number of elements in a list where the list contains the empty string

I'm having difficulty counting the number of elements in the lists within a DataFrame column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
   ID          NETWORK
0   1               []
1   2  [OPE, GSR, REP]
2   3            [MER]
Even though one might think that the list in the row where ID = 1 is empty, it's not: it actually contains the empty string [""], which took me a long time to figure out.
So whatever standard method I use to calculate the number of elements in each list, I get a wrong value of 1 for the rows that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
   ID          NETWORK  COUNT
0   1               []      1
1   2  [OPE, GSR, REP]      3
2   3            [MER]      1
I searched and tried a lot of things before posting here, but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file or the way I'm importing it.
You just need to write a custom apply function that ignores the '':
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively: explode, filter, then group by and count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
   ID          NETWORK  count
0   1               []    NaN
1   2  [OPE, GSR, REP]    3.0
2   3            [MER]    1.0
Fill NA with zeroes if needed.
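For example, a small sketch continuing from the assignment above:
df['count'] = df['count'].fillna(0).astype(int)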
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row that returns the length of the list (note this still counts the empty string, like .str.len(); combine it with a filter as in the answers above if you need to exclude it).

Python Return all Columns [duplicate]

Using Python Pandas I am trying to find the Country & Place with the maximum value.
This returns the maximum value:
data.groupby(['Country','Place'])['Value'].max()
But how do I get the corresponding Country and Place name?
Assuming df has a unique index, this gives the row with the maximum value:
In [34]: df.loc[df['Value'].idxmax()]
Out[34]:
Country US
Place Kansas
Value 894
Name: 7
Note that idxmax returns index labels. So if the DataFrame has duplicates in the index, the label may not uniquely identify the row, so df.loc may return more than one row.
Therefore, if df does not have a unique index, you must make the index unique before proceeding as above. Depending on the DataFrame, sometimes you can use stack or set_index to make the index unique. Or, you can simply reset the index (so the rows become renumbered, starting at 0):
df = df.reset_index()
df[df['Value']==df['Value'].max()]
This will return the entire row with the max value.
I think the easiest way to return a row with the maximum value is by getting its positional index. argmax() can be used to return the position of the row with the largest value.
index = df.Value.argmax()
Now the index could be used to get the features for that particular row:
df.iloc[df.Value.argmax(), 0:2]
The country and place are the index of the resulting series; if you don't need them as the index, you can set as_index=False:
df.groupby(['Country','Place'], as_index=False)['Value'].max()
Edit:
It seems that you want the place with the max value for every country; the following code will do what you want (the long-removed irow is replaced here with iloc, using the positional index argmax returns):
df.groupby("Country").apply(lambda g: g.iloc[g.Value.argmax()])
Use the index attribute of DataFrame. Note that I don't show all the rows in the example.
In [14]: df = data.groupby(['Country','Place'])['Value'].max()
In [15]: df.index
Out[15]:
MultiIndex([('Spain', 'Manchester'),
            (   'UK',     'London'),
            (   'US',   'Michigan'),
            (   'US',    'NewYork')],
           names=['Country', 'Place'])
In [16]: df.index[0]
Out[16]: ('Spain', 'Manchester')
In [17]: df.index[1]
Out[17]: ('UK', 'London')
You can also get the value by that index:
In [21]: for index in df.index:
   ....:     print(index, df[index])
   ....:
('Spain', 'Manchester') 512
('UK', 'London') 778
('US', 'Michigan') 854
('US', 'NewYork') 562
Edit
Sorry for misunderstanding what you want; try the following:
In [52]: s = data.max()
In [53]: print('%s, %s, %s' % (s['Country'], s['Place'], s['Value']))
US, NewYork, 854
In order to print the Country and Place with the maximum value, use the following line of code.
print(df[['Country', 'Place']][df.Value == df.Value.max()])
You can use:
print(df[df['Value']==df['Value'].max()])
Using DataFrame.nlargest.
The dedicated method for this is nlargest, which uses algorithms.SelectNFrame under the hood - a performant way of doing sort_values().head(n).
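For reference, the sample frame printed below could be built like this (hypothetical data chosen to match the output):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [2, 4, 6, 1, 2, 3],
                   'a': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'b': ['x', 'x', 'y', 'z', 'z', 'z']})
print(df)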
   x  y  a  b
0  1  2  a  x
1  2  4  b  x
2  3  6  c  y
3  4  1  a  z
4  5  2  b  z
5  6  3  c  z
df.nlargest(1, 'y')
   x  y  a  b
2  3  6  c  y
import pandas
Assuming df is the data frame you created, use the command:
df1 = df[['Country','Place']][df.Value == df['Value'].max()]
This will display the country and place whose value is maximum.
My solution for finding maximum values in columns (.ix was removed in recent pandas; use .loc instead):
df.loc[df.idxmax()]
and likewise for minimums:
df.loc[df.idxmin()]
I'd recommend using nlargest for better performance and shorter code:
import pandas
df[col_name].value_counts().nlargest(n=1)
I encountered a similar issue while trying to import data using pandas. The first column of my dataset had spaces before the start of the words. I removed the spaces and it worked like a charm!

How do I combine these two columns? Pandas

I have two columns, one a Buyer ID and one a Seller ID. I'm simply trying to find out which combination of the two appears the most.
def twoCptyFreq(df, col1, col2):
    cols = [col1, col2]
    df['TwoCptys'] = df[cols].astype(str).apply('+'.join, axis=1)
    return df
newdf=twoCptyFreq(tradedf,'BuyerID','SellerID')
I get the results I want; however, sometimes I get 1234+7651 and 7651+1234 - the same two parties - but I need to aggregate these together. How do I write this into my function to allow for cases where the buyer and seller may be switched?
You can sort the values - in the lambda function, with sorted:
df['TwoCptys'] = df[cols].astype(str).apply(lambda x: '+'.join(sorted(x)), axis=1)
Or convert the columns to a 2D array and sort with np.sort:
import numpy as np
df['TwoCptys'] = (pd.DataFrame(np.sort(df[cols].values, axis=1))
                    .astype(str).apply('+'.join, axis=1))
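A quick usage sketch combining the sorted join with the count, assuming the tradedf from the question:
cols = ['BuyerID', 'SellerID']
tradedf['TwoCptys'] = tradedf[cols].astype(str).apply(lambda x: '+'.join(sorted(x)), axis=1)
print(tradedf['TwoCptys'].value_counts().head(1))  # most frequent pair, order-insensitive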
df=pd.DataFrame({'A':[1,1,1],'B':[2,3,2],'C':[9,9,9]})
df['combination']=df['A'].astype(str) + '+' + df['B'].astype(str)
df['combination'].value_counts()
Out[]:
1+2 2
1+3 1
Name: combination, dtype: int64
# This shows the combination df[A] == 1 and df[B] == 2 has the most occurrences

pandas add item to a series of list data type

How do I properly add a single item to every list in a series of list dtype? I tried to make a copy and add an item to each list, but this method affects the original dataframe as well.
This is my code:
df = pd.DataFrame({'num':[['one'],['three'],['five']]})
# make copy of original df
copy_df = df.copy()
# add 'thing' to every single list
copy_df.num.apply(lambda x: x.append('thing'))
# show results of copy_df
print(copy_df) # this will show [['one', 'thing'], ['three', 'thing'], ...]
print(df) # this will also show [['one', 'thing'], ['three', 'thing'], ...]
# WHY?
My question is:
Why does the method above add the element to the original dataframe too?
Is there any better way to add element to a Series of list?
Because you are copying the dataframe but not the lists inside it, so the inner Series still holds references to the lists from the original dataframe.
A better way to achieve it:
copy_df.num = copy_df.num.apply(lambda x: x + ['thing'])
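A quick sanity check (sketch) that this leaves the original untouched:
copy_df = df.copy()
copy_df.num = copy_df.num.apply(lambda x: x + ['thing'])
print(df)       # original lists unchanged
print(copy_df)  # each list now ends with 'thing'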
1- The dataframe holds pointers to the lists, not the lists themselves. So when you modify a list through one dataframe, you implicitly modify it everywhere (because it's a single object). You can check it - look at the lists' ids:
copy_df = df.copy()
copy_df['num'].apply(id)
0 140262813220744
1 140262813299528
2 140262813298888
Name: num, dtype: int64
df['num'].apply(id)
0 140262813220744
1 140262813299528
2 140262813298888
Name: num, dtype: int64
2- Better not to store lists in a dataframe; instead, use a 'long' table, like this:
   list_index    num
0           0    "one"
0           1    "thing"
1           0    "three"
1           1    "thing"
2           0    "five"
2           1    "thing"
You store the same data, but it's easier to deal with it via pandas methods.
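A sketch of building that long format from the original column (explode requires pandas >= 0.25; the column names here are illustrative):
long_df = df['num'].explode().reset_index().rename(columns={'index': 'row'})
long_df['list_index'] = long_df.groupby('row').cumcount()
print(long_df[['row', 'list_index', 'num']])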
Edit
If you use a
copy_df.num = copy_df.num.apply(lambda x: x + ['thing'])
it'll return a new dataframe with brand-new lists:
copy_df.num
Out:
0 [one, thing]
1 [three, thing]
2 [five, thing]
copy_df.num.apply(id)
Out:
0 140262813289352
1 140262794045256
2 140262794050504
The ids have changed!
copy.deepcopy doesn't work either:
import copy
deepcopy_df = copy.deepcopy(df)
deepcopy_df.num.apply(id)
Out:
0 140262813220744
1 140262813299528
2 140262813298888
deepcopy_df.num.apply(lambda x: x.append('things'))
df.num # original DataFrame
Out:
0 [one, things]
1 [three, things]
2 [five, things]
Or a no-lambda version of Sunil's answer:
copy_df.num = copy_df.num.apply(['thing'].__add__)
Note that this prepends 'thing'; if you'd rather not have 'thing' first, you can reverse each list (which also reverses the original items):
copy_df.num = copy_df.num.apply(['thing'].__add__).str[::-1]

Pandas: How to access the value of the index

I have a dataframe and would like to use the values in the index to create another column.
For instance:
import pandas as pd
import numpy as np

df = pd.DataFrame({'idx1': range(0, 5), 'idx2': range(10000, 10005), 'value': np.random.randn(5)})
df.set_index(keys=['idx1', 'idx2'], inplace=True)
print(df)
               value
idx1 idx2
0    10000 -1.470367
1    10001  0.260693
2    10002 -0.732319
3    10003 -0.116977
4    10004  1.106644
I'd like to do something like this:
df['idx1_mod']= df['idx1'] + 100
(Actually, I want to do more complicated things, but basically I need the value of the index.)
Right now I'm resorting to resetting the index (to get the index fields as columns), doing my calcs with access to the columns, and then re-creating the index. I'm sure I'm missing something obvious, but I've looked a ton and keep missing it!
Note - I also tried df.iterrows(), but it seems that gives a copy of the row and doesn't let me update the original dataframe.
df["idx1_mod"] = df.index.get_level_values(0).values + 100
Try this (using .loc to avoid chained assignment):
for idx in range(len(df)):
    df.loc[df.index[idx], 'idx1_mod'] = df.index[idx][0] + 100
You can use drop=False when setting the index to preserve your keys as columns. This should work:
df.set_index(keys=['idx1','idx2'], inplace=True, drop=False)
df['idx1_mod'] = df['idx1'] + 100
