I have a dataset like
data = {'ID': ['first_value', 'second_value', 'third_value',
'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'],
['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
And I want to create two columns:
A column that counts the number of elements in 'list_id' that start with "a"
A column that counts the number of elements in 'list_id' that DO NOT start with "a"
I was thinking of doing something like:
data['list_id'].apply(lambda x: for entity in x if x.startswith("a")
I thought of counting first the ones starting with "a" and then the ones not starting with "a", so I did this:
sum(1 for w in data["list_id"] if w.startswith('a'))
However, this does not really work and I cannot make it work.
Any ideas? :)
Assuming this input:
ID list_id
0 first_value [001, ab0, 44A]
1 second_value []
2 third_value [005, 006]
3 fourth_value [a22]
4 fifth_value [azz]
5 sixth_value [aaa, abd]
you can use:
sum(1 for l in data['list_id'] for x in l if x.startswith('a'))
output: 5
If you'd rather have a count per row:
df['starts_with_a'] = [sum(x.startswith('a') for x in l) for l in df['list_id']]
df['starts_with_other'] = df['list_id'].str.len()-df['starts_with_a']
NB. using a list comprehension is faster than apply (a rough timing check follows the output below)
output:
ID list_id starts_with_a starts_with_other
0 first_value [001, ab0, 44A] 1 2
1 second_value [] 0 0
2 third_value [005, 006] 0 2
3 fourth_value [a22] 1 0
4 fifth_value [azz] 1 0
5 sixth_value [aaa, abd] 2 0
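Here is that timing sketch, a sketch only, assuming the df built above; exact numbers depend on your machine and data:
from timeit import timeit
import pandas as pd

# build a larger frame by simply repeating the sample rows
big = pd.concat([df] * 10_000, ignore_index=True)

t_listcomp = timeit(lambda: [sum(x.startswith('a') for x in l) for l in big['list_id']], number=10)
t_apply = timeit(lambda: big['list_id'].apply(lambda l: sum(x.startswith('a') for x in l)), number=10)
print(f"list comprehension: {t_listcomp:.3f}s, apply: {t_apply:.3f}s")
On typical data the list comprehension comes out ahead, mainly because apply adds per-row function-call overhead.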
Using pandas, something quite similar to your proposal works:
data = {'ID': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'], ['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
df["len"] = df.list_id.apply(len)
df["num_a"] = df.list_id.apply(lambda s: sum(map(lambda x: x[0] == "a", s)))
df["num_not_a"] = df["len"] - df["num_a"]
I have a sample dataset like this:
"Author", "Normal_Tokenized"
x , ["I","go","to","war","I",..]
y , ["me","you","and","us",..]
z , ["let","us","do","our","best",..]
I want a dataframe reporting the 10 most frequent words and the counts (frequencies) for each author:
"x_text", "x_count", "y_text", "y_count", "z_text", "z_count"
go , 1000 , come , 120 , let , 12
and so on ...
I attempted the following snippet, but it only returns the last author's values instead of all authors' values.
This code does return the 10 most common words an author has used in their novel:
df_words = pd.concat([pd.DataFrame(
    data={'Author': [row['Author'] for _ in row['Normal_Tokenized']], 'Normal_Tokenized': row['Normal_Tokenized']})
    for idx, row in df.iterrows()], ignore_index=True)
df_words = df_words[~df_words['Normal_Tokenized'].isin(stop_words)]

def authorCommonWords(numWords):
    for author in authors:
        authorWords = df_words[df_words['Author'] == author].groupby('Normal_Tokenized').size().reset_index().rename(
            columns={0: 'Count'})
        authorWords.sort_values('Count', inplace=True)
        df = pd.DataFrame(authorWords[-numWords:])
        df.to_csv("common_word.csv", header=False, mode='a', encoding='utf-8',
                  index=False)
    return authorWords[-numWords:]

authorCommonWords(10)
There are about 130,000 samples for each author. The snippet gets the 10 most repeated words across those 130,000 samples, but I want these 10 words in separate columns for each author.
np.unique(return_counts=True) seems to be what you're looking for.
Data
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I", "go", "to", "war", "I"],
                         ["me", "you", "and", "us"],
                         ["let", "us", "do", "our", "best"]]
})
Code
n_top = 6  # number of top words to keep per author

df_want = pd.DataFrame(index=range(n_top))
for au, ls in df.itertuples(index=False, name=None):
    words, freqs = np.unique(ls, return_counts=True)
    # np.unique returns the words in sorted order, so reorder by frequency;
    # a stable sort keeps ties in alphabetical order
    order = np.argsort(-freqs, kind="stable")
    words, freqs = words[order], freqs[order]
    len_words = len(words)
    if len_words >= n_top:
        df_want[f"{au}_text"] = words[:n_top]
        df_want[f"{au}_count"] = freqs[:n_top]
    else:  # too few distinct words: pad with empty strings / zeros
        df_want[f"{au}_text"] = [words[i] if i < len_words else "" for i in range(n_top)]
        df_want[f"{au}_count"] = [freqs[i] if i < len_words else 0 for i in range(n_top)]
Result
print(df_want)
x_text x_count y_text y_count z_text z_count
0 I 2 and 1 best 1
1 go 1 me 1 do 1
2 to 1 us 1 let 1
3 war 1 you 1 our 1
4 0 0 us 1
5 0 0 0
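If you also want to write the table out like in your original snippet (same file name and encoding as in your code), a one-liner is enough:
df_want.to_csv("common_word.csv", index=False, encoding="utf-8")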
I have unique values in a column, but they all have strange codes, and I want to use a numeric counter to identify these values instead. Is there a better way to do this than the following?
class umm:
    inc = 0
    last_val = ''

    @classmethod
    def create_new_index(cls, new_val):
        if new_val != cls.last_val:
            cls.inc += 1
            cls.last_val = new_val
        return cls.inc

df['Doc_ID_index'] = df['Doc_ID'].apply(lambda x: umm.create_new_index(x))
Here is the dataframe:
Doc_ID Sent_ID Doc_ID_index
0 PMC2774701 S1.1 1
1 PMC2774701 S1.2 1
2 PMC2774701 S1.3 1
3 PMC2774701 S1.4 1
4 PMC2774701 S1.5 1
... ... ... ...
46019 3469-0 3469-51 6279
46020 3528-0 3528-10 6280
46021 3942-0 3942-39 6281
46022 4384-0 4384-25 6282
46023 4622-0 4622-45 6283
Method 1
# take the unique Doc_IDs in the column
new_df = pd.DataFrame({'Doc_ID': df['Doc_ID'].unique()})
# assign a unique id
new_df['Doc_ID_index'] = new_df.index + 1
# combine with the original df to get the whole df
df = pd.merge(df, new_df, on='Doc_ID')
Method 2
df['Doc_ID_index'] = df.groupby(['Doc_ID']).ngroup()
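Note that ngroup() numbers the groups starting at 0 and, by default, in the sorted order of the group keys. If you want 1-based ids in order of first appearance (matching Method 1), you can do:
df['Doc_ID_index'] = df.groupby('Doc_ID', sort=False).ngroup() + 1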
I hope this helps!
I have a column "Employees" that contains the following data:
122.12 (Mark/Jen)
32.11 (John/Albert)
29.1 (Jo/Lian)
I need to count how many values match a specific condition (like x>31).
base = list()
count = 0
count2 = 0
for element in data['Employees']:
    base.append(element.split(' ')[0])
    if base > 31:
        count = count + 1
    else:
        count2 = count2 + 1
print(count)
print(count2)
The output should tell me that count is 2 and count2 is 1. The problem is that I cannot compare a float to a list. How can I make that if statement work?
You have a df with an Employees column that you need to split into a number and text, keep the number and convert it to a float, then filter based on a value:
import pandas as pd
df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
"29.1(Jo/Lian)"]})
print(df)
# split at (
df["value"] = df["Employees"].str.split("(")
# convert to float
df["value"] = pd.to_numeric(df["value"].str[0])
print(df)
# filter it into 2 series
smaller = df["value"] < 31
remainder = df["value"] > 30
print(smaller)
print(remainder)
# counts
smaller31 = sum(smaller) # True == 1 -> sum([True,False,False]) == 1
bigger30 = sum(remainder)
print(f"Smaller: {smaller31} bigger30: {bigger30}")
Output:
# df
Employees
0 122.12 (Mark/Jen)
1 32.11(John/Albert)
2 29.1(Jo/Lian)
# after split/to_numeric
Employees value
0 122.12 (Mark/Jen) 122.12
1 32.11(John/Albert) 32.11
2 29.1(Jo/Lian) 29.10
# smaller
0 False
1 False
2 True
Name: value, dtype: bool
# remainder
0 True
1 True
2 False
Name: value, dtype: bool
# counted
Smaller: 1 bigger30: 2
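The same idea condensed into a couple of lines, if all you need are the two counts for the original x > 31 condition (same split/to_numeric approach as above):
values = pd.to_numeric(df["Employees"].str.split("(").str[0])
count = (values > 31).sum()    # employees above 31
count2 = (values <= 31).sum()  # the rest
print(count, count2)           # 2 1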
This is a weird one: I have 3 dataframes. "prov_data" contains a provider id and counts on regions and categories (i.e. how many times that provider interacted with those regions and categories).
prov_data = DataFrame({'aprov_id':[1122,3344,5566,7788],'prov_region_1':[0,0,4,0],'prov_region_2':[2,0,0,0],
'prov_region_3':[0,1,0,1],'prov_cat_1':[0,2,0,0],'prov_cat_2':[1,0,3,0],'prov_cat_3':[0,0,0,4],
'prov_cat_4':[0,3,0,0]})
"tender_data" which contains the same but for tenders.
tender_data = DataFrame({'atender_id':['AA12','BB33','CC45'],
'ten_region_1':[0,0,1,],'ten_region_2':[0,1,0],
'ten_region_3':[1,1,0],'ten_cat_1':[1,0,0],
'ten_cat_2':[0,1,0],'ten_cat_3':[0,1,0],
'ten_cat_4':[0,0,1]})
And finally a "no_match" DF wich contains forbidden matches between provider and tender.
no_match = DataFrame({ 'prov_id':[1122,3344,5566],
'tender_id':['AA12','BB33','CC45']})
I need to do the following: create a new df that appends the rows of the prov_data & tender_data DataFrames if they (1) match one or more categories (i.e. the same category is > 0) AND (2) match one or more regions AND (3) are not on the no_match list.
So that would give me this DF:
df = DataFrame({'aprov_id':[1122,3344,7788],'prov_region_1':[0,0,0],'prov_region_2':[2,0,0],
'prov_region_3':[0,1,1],'prov_cat_1':[0,2,0],'prov_cat_2':[1,0,0],'prov_cat_3':[0,0,4],
'prov_cat_4':[0,3,0], 'atender_id':['BB33','AA12','BB33'],
'ten_region_1':[0,0,0],'ten_region_2':[1,0,1],
'ten_region_3':[1,1,1],'ten_cat_1':[0,1,0],
'ten_cat_2':[1,0,1],'ten_cat_3':[1,0,1],
'ten_cat_4':[0,0,0]})
code
# the first column of each dataframe holds the ids;
# I'm going to use them several times
tid = tender_data.values[:, 0]
pid = prov_data.values[:, 0]
# columns [1, 2, 3, 4] are the cat columns (this assumes the alphabetical
# column order shown in the output below: id, cat_1..4, region_1..3);
# selecting by name, e.g. prov_data.filter(like='cat'), would be more robust
pc = prov_data.values[:, 1:5]
tc = tender_data.values[:, 1:5]
# columns [5, 6, 7] are the rgn columns
pr = prov_data.values[:, 5:]
tr = tender_data.values[:, 5:]
# I want to make this an m x n array, where
# m = number of rows in the prov df and n = number of rows in the tender df
nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
# align rows to provider ids and columns to tender ids
nm = nm.reindex(index=pid, columns=tid)
nm = ~nm.fillna(0).astype(bool).values * 1
# the dot product of the cat arrays gives an array whose (i, j) entry is
# positive when provider i and tender j share at least one category;
# combine it with the same product for the regions and the no_match mask
a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, index=pid, columns=tid)
a = a.mask(~a).stack().index
fp = a.get_level_values(0)
ft = a.get_level_values(1)
pd.concat([
    prov_data.set_index('aprov_id').loc[fp].reset_index(),
    tender_data.set_index('atender_id').loc[ft].reset_index()
], axis=1)
index prov_cat_1 prov_cat_2 prov_cat_3 prov_cat_4 prov_region_1 \
0 1122 0 1 0 0 0
1 3344 2 0 0 3 0
2 7788 0 0 4 0 0
prov_region_2 prov_region_3 atender_id ten_cat_1 ten_cat_2 ten_cat_3 \
0 2 0 BB33 0 1 1
1 0 1 AA12 1 0 0
2 0 1 BB33 0 1 1
ten_cat_4 ten_region_1 ten_region_2 ten_region_3
0 0 0 1 1
1 0 0 0 1
2 0 0 1 1
explanation
Use dot products to determine the matches; there are many other details here that I'll try to explain more fully later.
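To make the dot-product trick concrete, here is a tiny self-contained sketch with toy arrays (not the actual data): entry (i, j) of pc.dot(tc.T) sums the category-by-category products, so it is positive exactly when provider i and tender j are both active in at least one common category.
import numpy as np

pc = np.array([[0, 2, 0],   # provider 0: active in category 2 only
               [1, 0, 3]])  # provider 1: active in categories 1 and 3
tc = np.array([[1, 0, 0],   # tender 0: category 1
               [0, 1, 1]])  # tender 1: categories 2 and 3

shared = pc.dot(tc.T)
print(shared)       # [[0 2]
                    #  [1 3]]
print(shared > 0)   # provider 0 matches only tender 1,
                    # provider 1 matches both tenders
The same logic applied to the region arrays, multiplied element-wise with the no_match mask, is what produces the boolean frame a above.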
Straightforward solution that uses only "standard" pandas techniques.
prov_data['tkey'] = 1
tender_data['tkey'] = 1
df1 = pd.merge(prov_data,tender_data,how='outer',on='tkey')
df1 = pd.merge(df1,no_match,how='outer',left_on = 'aprov_id', right_on = 'prov_id')
df1['dropData'] = df1.apply(lambda x: True if x['tender_id'] == x['atender_id'] else False, axis=1)
df1['dropData'] = df1.apply(lambda x: (x['dropData'] == True) or not(
((x['prov_cat_1'] > 0 and x['ten_cat_1'] > 0) or
(x['prov_cat_2'] > 0 and x['ten_cat_2'] > 0) or
(x['prov_cat_3'] > 0 and x['ten_cat_3'] > 0) or
(x['prov_cat_4'] > 0 and x['ten_cat_4'] > 0)) and(
(x['prov_region_1'] > 0 and x['ten_region_1'] > 0) or
(x['prov_region_2'] > 0 and x['ten_region_2'] > 0) or
(x['prov_region_3'] > 0 and x['ten_region_3'] > 0))),axis=1)
df1 = df1[~df1.dropData]
df1 = df1[[u'aprov_id', u'atender_id', u'prov_cat_1', u'prov_cat_2', u'prov_cat_3',
u'prov_cat_4', u'prov_region_1', u'prov_region_2', u'prov_region_3',
u'ten_cat_1', u'ten_cat_2', u'ten_cat_3', u'ten_cat_4', u'ten_region_1',
u'ten_region_2', u'ten_region_3']].reset_index(drop=True)
print(df1.equals(df))
First we build the full cross product of both dataframes and merge that with the no_match dataframe, then add a boolean column marking every row that should be dropped.
The boolean column is assigned by two boolean lambda functions containing all the necessary conditions; we then keep only the rows where that column is False.
This solution isn't very resource-friendly because of the merge operations, so it may be a poor fit if your data is very large.
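As a side note, on pandas 1.2 or later the dummy tkey column is not needed for the cross product; merge supports it directly:
df1 = pd.merge(prov_data, tender_data, how='cross')
The rest of the filtering logic stays the same.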