I want to extract the first two symbols whenever the first three symbols match a certain pattern: the first two symbols should each be one of those inside the brackets [ptkbdgG_fvsSxzZhmnNJlrwj], and the third symbol should be one of those inside the brackets [IEAOYye|aouKLM#)3*<!(#0~q^LMOEK].
The first two lines work correctly.
The last lines do not work, and I do not understand why. The code doesn't give any errors; it just does nothing for those.
# extract the first three symbols and save them in a new column
df['first_three_symbols'] = df['ITEM'].str[0:3]
# create a boolean column: do the first three symbols match the pattern?
df["ccv"] = df["first_three_symbols"].str.contains('[ptkbdgG_fvsSxzZhmnNJlrwj][ptkbdgG_fvsSxzZhmnNJlrwj][IEAOYye|aouKLM#)3*<!(#0~q^LMOEK]')
# create another column for the True values in the previous column
if df["ccv"].item == True:
    df['first_two_symbols'] = df["ITEM"].str[0:2]
Here is my output:
ID ITEM FREQ first_three_symbols ccv
0 0 a 563 a False
1 1 OlrMndmEn 1 Olr False
2 2 OlrMndSpOrtl#r 0 Olr False
3 3 AG#l 74 AG# False
4 4 AG#lbMm 24 AG# False
... ... ... ... ... ...
51723 51723 zytzWt# 8 zyt False
51724 51724 zytzytOst 0 zyt False
51725 51725 zYxtIx 5 zYx False
51726 51726 zYxtIxkWt 0 zYx False
51727 51727 zyZe 4 zyZ False
[51728 rows x 5 columns]
Your if statement does nothing because df["ccv"].item is a reference to the method, not a call, so comparing it to True is always False and the body never runs (and even .item() would fail on a Series with more than one element). Instead, you can either create a function and use the apply method:
def f(row):
    if row["ccv"] == True:
        return row["ITEM"][0:2]  # row["ITEM"] is a plain string here, so no .str accessor
    else:
        return None

df['first_two_symbols'] = df.apply(f, axis=1)
or you can use the np.where function from the numpy package.
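For example, a minimal vectorized sketch with np.where, assuming None is an acceptable filler for rows where ccv is False:

import numpy as np

# take the first two characters where ccv is True, otherwise None
df['first_two_symbols'] = np.where(df['ccv'], df['ITEM'].str[0:2], None)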
I need to locate the first row where the word 'their' appears in the Words table. I'm trying to write code to consolidate all strings in the 'text' column from that location up to the first text containing the substring '666' or '999' (in this case the combination of their, stoma22, fe156, sligh334, pain666; the desired subtrings_output = 'theirstoma22fe156sligh334pain666').
I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
As you can see, I'm not sure how to extend the condition for 666_999_loc to cover the substring '666' or '999'; also, slicing the index between the two variables raises an error. Many thanks.
Words table:
page no  text      font
1        they      0
1        ate       0
1        apples    0
2        and       0
2        then      1
2        their     0
2        stoma22   0
2        fe156     1
2        sligh334  0
2        pain666   1
2        given     0
2        the       1
3        fruit     0
You just need to add one to the end of the slice, and add an OR condition to the np.where for contains_666_or_999_loc using the | operator.
text_col = words['text']
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
text_col.str.contains('999', na=True))[0][0]
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])
print(subtrings_output)
Output:
theirstoma22fe156sligh334pain666
IIUC, use pandas.Series.idxmax with "".join().
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that
value is returned.
So, assuming Words is your dataframe, try this:
their_loc = Words["text"].str.contains("their").idxmax()
_666_999_loc = Words["text"].str.contains("666|999").idxmax()
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])
Output :
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
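Note that idxmax on an all-False mask silently returns the first label, so if the substring might be absent it's worth guarding the lookup, e.g.:

mask = Words["text"].str.contains("666|999")
_666_999_loc = mask.idxmax() if mask.any() else None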
I have a data frame that has these values in one of its columns:
In:
df.line.unique()
Out:
array(['Line71A', 'Line71B', 'Line75B', 'Line79A', 'Line79B', 'Line75A', 'Line74A', 'Line74B',
'Line70A', 'Line70B', 'Line58B', 'Line70', 'Line71', 'Line74', 'Line75', 'Line79', 'Line58'],
dtype=object)
And I would like to create a new column with 2 values based on whether the value string contains LineXX, like so:
if df.line.str.contains("Line70") or df.line.str.contains("Line71") or df.line.str.contains("Line79"):
    return 1
else:
    return 0
So the value in the new column, box_type, should be 1 if the value in df.line contains "Line70", "Line71", or "Line79", and 0 otherwise.
I tried doing this with this code:
df['box_type'] = df.line.apply(lambda x: 1 if x.contains('Line70') or x.contains('Line71') or x.contains('Line79') else 0)
But I get this error:
AttributeError: 'str' object has no attribute 'contains'
And I tried adding .str in between x and contains, like x.str.contains(), but that also gave an error.
How can I do this?
Thanks!
How about:
df['box_type'] = df.line.str.contains('70|71|79')
Sample data:
a = df.line.unique()  # the array shown in the question
np.random.seed(1)
df = pd.DataFrame({'line': np.random.choice(a, 10)})
Output:
line box_type
0 Line75A False
1 Line70 True
2 Line71 True
3 Line70A True
4 Line70B True
5 Line70 True
6 Line75A False
7 Line79 True
8 Line71A True
9 Line58 False
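Note that str.contains returns booleans, as in the output above; if you need the 0/1 values asked for in the question, cast the mask to integers:

df['box_type'] = df.line.str.contains('70|71|79').astype(int)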
I have a pandas dataframe with approximately 3 million rows.
I want to partially aggregate the last column in separate spots based on another variable.
My solution was to separate the dataframe rows into a list of new dataframes based on that variable, aggregate the dataframes, and then join them again into a single dataframe. The problem is that after a few tens of thousands of rows, I get a memory error. What methods can I use to improve the efficiency of my function and prevent these memory errors?
An example of my code is below
test = pd.DataFrame({"unneeded_var": [6,6,6,4,2,6,9,2,3,3,1,4,1,5,9],
"year": [0,0,0,0,1,1,1,2,2,2,2,3,3,3,3],
"month" : [0,0,0,0,1,1,1,2,2,2,3,3,3,4,4],
"day" : [0,0,0,1,1,1,2,2,2,2,3,3,4,4,5],
"day_count" : [7,4,3,2,1,5,4,2,3,2,5,3,2,1,3]})
test = test[["year", "month", "day", "day_count"]]
def agg_multiple(df, labels, aggvar, repl=None):
    if repl is None:
        repl = aggvar
    conds = df.duplicated(labels).tolist()  # boolean list: False for a unique (year, month), then True until the next unique pair
    groups = []
    start = 0
    for i in range(len(conds)):  # when False, split the previous slice into a new df and aggregate the count
        bul = conds[i]
        if i == len(conds) - 1:
            i += 1  # no False marks the end of the last group, special case
        if not bul and i > 0 or bul and i == len(conds):
            sample = df.iloc[start:i, :]
            start = i
            sample = sample.groupby(labels, as_index=False).agg({aggvar: sum}).rename(columns={aggvar: repl})
            groups.append(sample)
    df = pd.concat(groups).reset_index(drop=True)  # combine the aggregated dfs into a new df
    return df
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
I suppose that I could potentially apply the function to small samples of the dataframe, to prevent a memory error and then combine those, but I'd rather improve the computation time of my function.
This function does the same, and is 10 times faster.
test.groupby(["year", "month"], as_index=False).agg({"day_count":sum}).rename(columns={"day_count":"month_count"})
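If you want to keep the original function's signature, a minimal drop-in sketch along the same lines (the name agg_multiple_fast is just illustrative):

def agg_multiple_fast(df, labels, aggvar, repl=None):
    if repl is None:
        repl = aggvar
    # one vectorized groupby instead of slicing the frame by hand
    return df.groupby(labels, as_index=False).agg({aggvar: sum}).rename(columns={aggvar: repl})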
There are almost always pandas methods that are pretty optimized for tasks that will vastly outperform iteration through the dataframe. If I understand correctly, in your case, the following will return the same exact output as your function:
test2 = (test.groupby(['year', 'month'])
.day_count.sum()
.to_frame('month_count')
.reset_index())
>>> test2
year month month_count
0 0 0 16
1 1 1 10
2 2 2 7
3 2 3 5
4 3 3 5
5 3 4 4
To check that it's the same:
# Your original function:
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
>>> test == test2
year month month_count
0 True True True
1 True True True
2 True True True
3 True True True
4 True True True
5 True True True
I'm new to pandas and enchant. I want to check the spelling of short sentences using Python.
I have a pandas data frame:
id_num word
1 live haapy
2 know more
3 ssweam good
4 eeat little
5 dream alot
And I want to obtain the following table with a “check” column:
id_num word check
1 live haapy True, False
2 know more True, True
3 ssweam good False, True
4 eeat little False, True
5 dream alot True, False
What is the best way to do this?
I tried this code:
import enchant

dic = enchant.Dict("ru_Eng")

df['list_word'] = df['word'].str.split()  # get the list of all words in each sentence using split()

rows = list()
for row in df[['id_num', 'list_word']].iterrows():
    r = row[1]
    for word in r.list_word:
        rows.append((r.id_num, word))

df2 = pd.DataFrame(rows, columns=['id_num', 'word'])  # make the table with an id_num column and a column of separate words
Then I got a new data frame (df2):
id_num word
1 live
1 haapy
2 know
2 more
3 ssweam
3 good
4 eeat
4 little
5 dream
5 alot
After that I check words using:
column = df2['word']
for i in column:
    n = dic.check(i)
    print(n)
The result is:
True
False
True
True
False
True
False
True
True
False
The check is carried out correctly, but when I tried to put the results into a new pandas dataframe column, I got False for all words.
for i in column:
    df2['res'] = dic.check(i)
Resulted data frame:
id_num word res
1 live False
1 haapy False
2 know False
2 more False
3 ssweam False
3 good False
4 eeat False
4 little False
5 dream False
5 alot False
I will be grateful for any help!
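The reason every value ends up False is that df2['res'] = dic.check(i) assigns a single scalar to the whole column on each pass of the loop, so after the loop the column only holds the result for the last word ('alot', which is False). A minimal sketch of a per-row check instead, assuming pyenchant's Dict.check returns a bool:

df2['res'] = df2['word'].apply(dic.check)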
I'm trying to convert a column in my DataFrame to numbers. The input is email domains extracted from email addresses. Sample:
>>> data['emailDomain']
0 [gmail]
1 [gmail]
2 [gmail]
3 [aol]
4 [yahoo]
5 [yahoo]
I want to create a new column whose entry is 1 if the domain is gmail or aol, and 0 otherwise.
I created a method which goes like this:
def convertToNumber(row):
    try:
        if row['emailDomain'] == '[gmail]':
            return 1
        elif row['emailDomain'] == '[aol]':
            return 1
        elif row['emailDomain'] == '[outlook]':
            return 1
        elif row['emailDomain'] == '[hotmail]':
            return 1
        elif row['emailDomain'] == '[yahoo]':
            return 1
        else:
            return 0
    except TypeError:
        print("TypeError")
and used it like:
data['validEmailDomain'] = data.apply(convertToNumber, axis=1)
However, my output column is all 0, even though I know there are gmail and aol emails present in the input column.
Any idea what could be going wrong?
Also, I think this usage of conditional statements might not be the most efficient way to tackle this problem. Is there any other approach to getting this done?
You can use Series.isin:
providers = {'gmail', 'aol', 'yahoo','hotmail', 'outlook'}
data['emailDomain'].isin(providers)
Searching for the provider: instead of applying a regex to each email in each row, you can use the Series.str methods to work on a whole column at a time.
pattern2 = r'(?<=#)([^.]+)(?=\.)'
df['email'].str.extract(pattern2, expand=False)
so this becomes something like this:
pattern2 = r'(?<=#)([^.]+)(?=\.)'
providers = {'gmail', 'aol', 'yahoo','hotmail', 'outlook'}
df = pd.DataFrame(data={'email': ['test.1#gmail.com', 'test.2#aol.com', 'test3#something.eu']})
provider_serie = df['email'].str.extract(pattern2, expand=False)
>>> provider_serie
0        gmail
1          aol
2    something
Name: email, dtype: object
interested_providers = df['email'].str.extract(pattern2, expand=False).isin(providers)
>>> interested_providers
0     True
1     True
2    False
Name: email, dtype: bool
If you really want 0s and 1s, you can add a .astype(int)
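For example, reusing the names from the snippets above (a small sketch):

df['validEmailDomain'] = df['email'].str.extract(pattern2, expand=False).isin(providers).astype(int)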
Your code would work if your series contained strings. As it stands, it likely contains lists, in which case you need to extract the first element.
I would also utilise pd.Series.map instead of using any row-wise logic. Below is a complete example:
df = pd.DataFrame({'emailDomain': [['gmail'], ['gmail'], ['gmail'], ['aol'],
['yahoo'], ['yahoo'], ['else']]})
domains = {'gmail', 'aol', 'outlook', 'hotmail', 'yahoo'}
df['validEmailDomain'] = df['emailDomain'].map(lambda x: x[0]).isin(domains)\
.astype(int)
print(df)
# emailDomain validEmailDomain
# 0 [gmail] 1
# 1 [gmail] 1
# 2 [gmail] 1
# 3 [aol] 1
# 4 [yahoo] 1
# 5 [yahoo] 1
# 6 [else] 0
You could sum up the occurrence checks for every provider via list comprehensions and write the resulting list into data['validEmailDomain']:
import numpy as np

providers = ['gmail', 'aol', 'outlook', 'hotmail', 'yahoo']
data['validEmailDomain'] = [np.sum([p in e for p in providers]) for e in data['emailDomain'].values]
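Since each row holds a single domain, the sum works out to a 0/1 flag; an equivalent sketch that makes that intent explicit:

data['validEmailDomain'] = [int(any(p in e for p in providers)) for e in data['emailDomain']]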