How can I count by a custom condition in a pandas dataframe (python3)

I have a dataframe called DF which contains two types of information: a datetime and a sentence (string).
0 2019-02-01 point say give choice invest motor today money...
1 2019-02-01 get inside car drive drunk excuse bad driving ...
2 2019-02-01 look car snow know buy car snow
3 2019-02-01 drive home car day terrify experience stay least
4 2019-02-01 quid way ferry nice trip enjoy land list celeb...
... ... ...
35818 2021-09-30 choice life drive type car holiday type carava...
35819 2021-09-30 scarlet carson bloody marvellous big car lover...
35820 2021-09-30 podcast adriano great episode dude weird car d...
35821 2021-09-30 scarlet carson smugly cruise traffic know driv...
35822 2021-09-30 hornet know fuel shortage brexit destroy suppl...
Now I generate a word list to check whether each sentence contains any of these strings:
word_list=['drive','car','buy','fuel','electric','panic','tax','second hand','petrol','auto']
I only need to count a sentence once if any word from the word list appears in it. Here is my solution:
set_list = []
for word in word_list:
    for sentence in DF['new_processed_text']:
        if word in sentence:
            set_list.append(sentence)
count = len(set(set_list))
However, this works on the whole dataset, and I want to do the count by day.
I have no idea about dataframe.groupby; do I need it?

You can use a regular expression, the pd.DataFrame.loc indexer, pd.Series.value_counts() and the pandas string methods for your purposes:
In aggregate:
len(df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True)].index)
Processing per day:
df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True), 'date_column'].value_counts()
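A minimal sketch of both approaches on toy data (the column name date_column is an assumption; adjust it to your dataframe):

import pandas as pd

df = pd.DataFrame({
    'date_column': ['2019-02-01', '2019-02-01', '2019-02-02'],
    'new_processed_text': ['buy car snow', 'nice trip enjoy', 'drive home car'],
})
word_list = ['drive', 'car', 'buy']

mask = df['new_processed_text'].str.contains('|'.join(word_list), regex=True)
print(mask.sum())  # 2 matching rows in aggregate
print(df.loc[mask, 'date_column'].value_counts())
# 2019-02-01    1
# 2019-02-02    1

value_counts() counts matching rows per day; to also deduplicate sentences, drop duplicates first (see the answer below).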

You can remove duplicates first and then use the string methods of pandas Series objects.
import pandas as pd
s = pd.Series(['abc def', 'def xyz ijk', 'xyz ijk', 'abc def', 'abc def', 'ijk mn', 'def xyz'])
words = ['abc', 'xyz']
s_prime = s.drop_duplicates()
contains_word = s_prime.str.contains("|".join(words))
print(contains_word.sum())
In your case, s = DF['new_processed_text'] and words = word_list.
I forgot the per-day part, which @Vladimir Vilimaitis provides in his answer with his last line.
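To combine the two ideas per day, a groupby-based sketch (the date column and the values here are hypothetical):

import pandas as pd

words = ['abc', 'xyz']
df = pd.DataFrame({
    'date': ['2021-09-29', '2021-09-29', '2021-09-30', '2021-09-30'],
    'text': ['abc def', 'def xyz ijk', 'abc def', 'ijk mn'],
})

# drop duplicate sentences within each day, then count matching sentences per day
per_day = (df.drop_duplicates(subset=['date', 'text'])
             .groupby('date')['text']
             .apply(lambda s: s.str.contains('|'.join(words)).sum()))
print(per_day)
# date
# 2021-09-29    2
# 2021-09-30    1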

Related

Compare pandas column values with words in textfile using regular expressions

I have a df dataframe like this:
product name description0 description1 description2 description3
A plane flies air passengers wings
B car rolls road NaN NaN
C boat floats sea passengers NaN
What I want to do is check, for each value in the description columns, whether it appears in a txt file.
Let's say my test.txt file is:
He flies to London then crosses the sea to reach New-York.
The result would look like this:
product name description0 description1 description2 description3 Match
A plane flies air passengers wings Match
B car rolls road NaN NaN No match
C boat floats sea passengers NaN Match
I know the main structure, but I'm a bit lost for the rest:
with open ("test.txt", 'r') as searchfile:
for line in searchfile:
print line
if re.search() in line:
print(match)
You can search the input text using str.find() since you are searching for string literals; re.search() seems to be overkill.
A quick-and-dirty solution using .apply(axis=1):
Data
# df as given
input_text = "He flies to London then crosses the sea to reach New-York."
Code
input_text_lower = input_text.lower()

def search(row):
    for el in row:  # description0, description1, description2, description3
        # skip non-string cells; stop at the first successful search
        if isinstance(el, str) and (input_text_lower.find(el.lower()) >= 0):
            return True
    return False

df["Match"] = df[[f"description{i}" for i in range(4)]].apply(search, axis=1)
Result
print(df)
product name description0 description1 description2 description3 Match
0 A plane flies air passengers wings True
1 B car rolls road NaN NaN False
2 C boat floats sea passengers NaN True
Note
Word boundaries, punctuation and hyphens are not considered in the original problem. In real cases, additional preprocessing steps are likely to be required. This is out of the scope of the original question.
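If word boundaries do matter, a regex-based variant of the same row check could look like this (a sketch building on df and input_text from above; not part of the original answer):

import re

def search_words(row):
    # match whole words only, ignoring case
    for el in row:
        if isinstance(el, str) and re.search(rf'\b{re.escape(el)}\b', input_text, re.IGNORECASE):
            return True
    return False

df["Match"] = df[[f"description{i}" for i in range(4)]].apply(search_words, axis=1)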

Look for n-grams into texts

Would it be possible to look at a specific n-grams from the whole list of them and look for it in a list of sentences?
For example:
I have the following sentences (from a dataframe column):
example = ['Mary had a little lamb. Jack went up the hill',
           'Jack went to the beach',
           'i woke up suddenly',
           'it was a really bad dream...']
and n-grams (bigrams) got from
word_v = CountVectorizer(ngram_range=(2,2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
frequencies = sum(mat).toarray()[0]
which generates the n-gram frequencies.
I would like to select, within the example list above:
the most frequent bi-grams
a bi-gram selected manually
So, let's say that the most frequent bi-gram is Jack went; how could I look for it in the example above?
Also, if I want to look not at the most frequent bi-grams but at the hill / the beach bi-grams in the example, how could I do it?
To select the rows that have the most frequent ngrams in it, you can do:
df.loc[mat.toarray()[:, frequencies==frequencies.max()].astype(bool)]
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
but if two ngrams have the max frequency, you would get all the rows where both are present.
If you want the 3 most frequent n-grams and all the rows that contain any of them:
top = 3
print(df.loc[mat.toarray()[:, np.argsort(frequencies)][:, -top:].any(axis=1)])
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
2 i woke up suddenly
3 it was a really bad dream...
# the top-3 selection above happens to return every row of this example
If instead you want the rows containing any of the 3 least frequent n-grams:
least = 3
print(df.loc[mat.toarray()[:, np.argsort(frequencies)][:, :least].any(axis=1)])
Example
1 Jack went to the beach
3 it was a really bad dream...
Finally, if you want a specific n-gram:
ng = 'it was'
df.loc[mat.toarray()[:, np.array(word_v.get_feature_names())==ng].astype(bool)]
Example
3 it was a really bad dream...
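For completeness, a minimal setup under which the snippets above run end-to-end (wrapping the example list in a dataframe is an assumption):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Example': example})
word_v = CountVectorizer(ngram_range=(2, 2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
frequencies = sum(mat).toarray()[0]  # total count of each n-gram across all rows

Note that on scikit-learn 1.0+, get_feature_names() is deprecated in favour of get_feature_names_out().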

Regular Expressions How to append matches from raw string text into a list

Hello, I have some messy text that I am unable to process in any good way. I want to match all ZIP codes (5-digit numbers) in the raw string and append them to a list. My string looks something like this:
string = '''
January 2020
Zip Code
Current Month
Sales Breakdown
(by type)
Last Month Last Year Year-to-Date
95608
Carmichael
95610
Citrus Heights
95621
Citrus Heights
95624
Elk Grove
95626
Elverta
95628
Fair Oaks
95630
Folsom
95632
Galt
95638
Herald
95641
Isleton
95655
Mather
95660
North Highlands
95662
Orangevale
Total Sales
43 REO Sales 0 45
40 43
Median Sales Price $417,000
$0 $410,000 $400,000
$417,000
'''
It can be done with re.findall and the regular expression \b\d{5}\b or even just \d{5}. Let's see an example:
import re
string = '''...'''  # the same raw text shown above
regex = r'\b\d{5}\b'
zip_codes = re.findall(regex, string)
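For the sample text above, this yields the thirteen ZIP codes:

print(zip_codes)
# ['95608', '95610', '95621', '95624', '95626', '95628', '95630',
#  '95632', '95638', '95641', '95655', '95660', '95662']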
Then you can get each code from zip_codes. I recommend reading the re documentation and the Regular Expression HOWTO. There are also useful tools for writing and testing regexes, such as Regex101.
I also recommend that, next time, you first investigate a bit by yourself and try to do what you want; then, if you have an issue, ask about that specific issue. The help pages How do I ask a good question? and How to create a Minimal, Reproducible Example might help you write a good question.

matching content creating new column

Hello, I have a dataset where I want to match my keywords with the location. The problem I am having is that the locations "Afghanistan", "Kabul" or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, different capitalization and city or town names attached. What I want to do is create a separate column that returns the value 1 if any of the substrings "afg", "Afg", "kab" or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code, and it is good if it matches the phrase exactly, but there is too much variation to write every exception down:
keywords = ['Afghanistan', 'Kabul', 'Herat', 'Jalalabad', 'Kandahar',
            'Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar',
            'afghanistan', 'kabul', 'herat', 'jalalabad', 'kandahar']

# how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)

# below will return the rows with value 1
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that the column "keyword_solution" gets a 1 whenever the location matches "Afg", "afg", "kab", "Kab", "kund" or "Kund".
Given sentences from the New York Times, the approach is to:
Remove all non-alphanumeric characters.
Change everything to lowercase, thereby removing the need to list different word variations.
Split each sentence into a list or set (I used a set because the sentences are long).
Add to the keywords list as needed.
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords = ['jalalabad', 'kunduz', 'lashkargah', 'mazar',
            'herat', 'afgh', 'kab', 'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
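A more concise variant (a sketch, not part of the original answer) does the substring matching directly with the pandas string methods, skipping the split/set step:

df['location'] = df['sentences'].str.lower().str.contains('|'.join(keywords))

Here str.lower() covers the mixed-case variants, so the keyword list only needs lowercase entries.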

Pandas - merge many rows into one

with this:
dataset = pd.read_csv('lyrics.csv', delimiter = '\t', quoting = 3)
I print my dataset in this fashion:
lyrics,classification
0 "I should have known better with a girl like you
1 That I would love everything that you do
2 And I do, hey hey hey, and I do
3 Whoa, whoa, I
4 Never realized what I kiss could be
5 This could only happen to me
6 Can't you see, can't you see
7 That when I tell you that I love you, oh
8 You're gonna say you love me too, hoo, hoo, ho...
9 And when I ask you to be mine
10 You're gonna say you love me too
11 So, oh I never realized what I kiss could be
12 Whoa whoa I never realized what I kiss could be
13 You love me too
14 You love me too",0
but what I really need is to have everything between the quotes in one row. How do I make this conversion in pandas?
Solution that worked for OP (from comments):
Fixing the problem at its source (in read_csv):
@nbeuchat is probably right; just try
dataset = pd.read_csv('lyrics.csv', quoting = 2)
That should give you a dataframe with one row and two columns: lyrics (with embedded line returns in the string) and classification (0).
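For readability, quoting=2 can also be spelled with the csv module constant it corresponds to:

import csv
import pandas as pd

# QUOTE_NONNUMERIC (== 2): quoted fields are read as strings,
# unquoted fields are converted to floats
dataset = pd.read_csv('lyrics.csv', quoting=csv.QUOTE_NONNUMERIC)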
General solution for collapsing series of strings:
You want to use pd.Series.str.cat:
import pandas as pd
dataset = pd.DataFrame({'lyrics': pd.Series(['happy birthday to you',
                                             'happy birthday to you',
                                             'happy birthday dear outkast',
                                             'happy birthday to you'])})
dataset['lyrics'].str.cat(sep=' / ')
# 'happy birthday to you / happy birthday to you / happy birthday dear outkast / happy birthday to you'
The default sep is None, which would give you 'happy birthday to youhappy birthday to youhappy ...' so pick the sep value that works for you. Above I used slashes (padded with spaces) since that's what you typically see in quotations of songs and poems.
You can also try print(dataset['lyrics'].str.cat(sep='\n')) which maintains the line breaks but stores them all in one string instead of one string per line.
