How do I print out entries in a df using a keyword search? I have a legislative database I'm running a list of climate keywords against:
climate_key_words = ['climate', 'gas', 'coal', 'greenhouse', 'carbon monoxide',
                     'carbon', 'carbon dioxide', 'education', 'gas tax', 'regulation']
Here's my for loop:
for bill in df.title:
    for word in climate_key_words:
        if word in bill:
            print(bill)
            print(word)
            print(df.state)
            print('------------')
When it prints, df.state dumps the entire state column on every match, which garbles the output:
24313 AK
24314 AK
24315 AK
24316 AK
24317 AK
Name: state, Length: 24318, dtype: object
------------
Relating to limitations on food regulations at farms, farmers' markets, and cottage food production operations.
regulation
But when print(df.state) is absent, it looks much nicer:
------------
Higher education; providing for the protection of certain expressive activities.
education
------------
Schools; allowing a school district board of education to amend certain policy to stock inhalers. Effective date. Emergency.
education
------------
How can I include df.state (and other values) and have them printed only once?
Ideally, my output should look like this:
###bill
###corresponding title
###corresponding state
print(df.state) prints the whole 'state' column/field. You presumably want the state associated with that particular row of the dataframe?
So I would suggest tweaking your approach slightly and doing something like:
for row in range(dataframe.shape[0]):  # for each row in the dataframe
    for word in keywords:
        if word in dataframe.iloc[row]['title']:
            # .iloc[row][column] lets you access a single value by row and column
            print(dataframe.iloc[row]['title'])
            print(word)
            print(dataframe.iloc[row]['state'])
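For reference, here is a self-contained sketch of that idea using itertuples (the toy rows below are invented; only the 'title' and 'state' column names come from the question):

```python
import pandas as pd

# Toy frame standing in for the legislative data; the rows are invented
df = pd.DataFrame({
    'title': [
        "Relating to limitations on food regulations at farms.",
        "Higher education; protecting certain expressive activities.",
    ],
    'state': ['TX', 'OK'],
})
climate_key_words = ['climate', 'regulation', 'education']

# Collect (title, keyword, state) per matching row, then print each once
matches = []
for row in df.itertuples(index=False):
    for word in climate_key_words:
        if word in row.title:
            matches.append((row.title, word, row.state))

for title, word, state in matches:
    print(title)
    print(word)
    print(state)
    print('------------')
```

This way each row's state is looked up once, alongside its title, instead of reprinting the whole column.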
I have a dataset as follows
Name       Surname  Username  Tweet                          Tags
Matthew    Fields   m.fields  I love summertime              summer summertime sun holiday
Fion       Stewart  fion      It is time to enjoy ourselves  time
Christine  Bold     chris89   Enjoy your summer              summer
Vera       Lovable  v.lov2    It's sunny outside             sun summer holiday
I would like to search the following list of strings within three columns (Username, Tweet and Tags):
list_strings=['summer','summertime','sun','holiday']
to see whether at least one of the terms above appears in at least one of those columns. The result should be saved in a new column, Terms from list, storing the terms found across all the columns (without duplicates: if the same term appears in more than one column, it should be mentioned only once).
The expected output would be:
Name       Surname  Username  Tweet               Tags                           Terms from list
Matthew    Fields   m.fields  I love summertime   summer summertime sun holiday  summer, summertime, sun, holiday
Christine  Bold     chris89   Enjoy your summer   summer                         summer
Vera       Lovable  v.lov2    It's sunny outside  sun summer holiday             sun, summer, holiday
Could you please give me advice on how to do this and point me in the right direction? Thank you!
You can try str.contains:
df = df[df['Tweet'].str.contains('|'.join(list_strings))]
For multiple columns:
df = df[df[['Tweet', 'Tags']].apply(lambda x: x.str.contains('|'.join(list_strings))).any(axis=1)]
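If you also want the new Terms from list column the question asks for, one hedged sketch (splitting on whitespace so only whole-word matches count, unlike the substring matching of str.contains) could look like this:

```python
import pandas as pd

list_strings = ['summer', 'summertime', 'sun', 'holiday']

df = pd.DataFrame({
    'Username': ['m.fields', 'fion', 'chris89', 'v.lov2'],
    'Tweet': ['I love summertime', 'It is time to enjoy ourselves',
              'Enjoy your summer', "It's sunny outside"],
    'Tags': ['summer summertime sun holiday', 'time', 'summer',
             'sun summer holiday'],
})

def terms_found(row):
    # Join the three searched columns, split into words, and keep each
    # term from list_strings at most once (in list_strings order)
    words = set(' '.join(row[['Username', 'Tweet', 'Tags']]).split())
    return ', '.join(t for t in list_strings if t in words)

df['Terms from list'] = df.apply(terms_found, axis=1)
df = df[df['Terms from list'] != '']   # drop rows with no matches
```

Note that word-splitting means "sunny" does not match "sun", which matches the exact-term intent of the question.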
Try the steps below.
step 1: for each cell in the three columns, split the string on spaces and keep the words that exactly match an entry of list_strings, so every cell becomes a (possibly empty) list of matched terms:
list_strings = ['summer', 'summertime', 'sun', 'holiday']
step1 = df[['Username', 'Tweet', 'Tags']].applymap(
    lambda x: [s for s in list_strings if s in x.split(' ')])
step 2: assign to the column "Terms in list" the unique elements of the combined lists from the three columns:
df['Terms in list'] = step1.apply(lambda x: set(x[0] + x[1] + x[2]), axis=1)
I have a DataFrame containing Trump's tweets. The column polarity contains a sentiment value for each tweet, and I am trying to sort the DataFrame trump based on these values by making a call to sort_values().
If I write trump.sort_values('polarity') I get a ValueError saying:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
However, if I write trump.head().sort_values('polarity') it takes the first five rows of my DataFrame and sorts them based on their polarity value.
My question is: Why can't I sort my entire table despite being able to sort the "head" of my table?
EDIT2: (Removed unnecessary info, consolidated code/data for clarity)
>>> trump.head() # This is the table after adding the 'polarity' column
time source text no_punc polarity
786204978629185536 <time> iPhone <unformatted str> <formatted> 1
786201435486781440 <time> iPhone <unformatted str> <formatted> -6.9
786189446274248704 <time> Android <unformatted str> <formatted> 1.8
786054986534969344 <time> iPhone <unformatted str> <formatted> 1.5
786007502639038464 <time> iPhone <unformatted str> <formatted> 1.2
Here is how I created the polarity column:
Created DataFrame tidy_format with columns num, word containing the index of a word in each tweet as well as the word itself (indexed by the id of each tweet).
Created DataFrame tidy which grouped each index/word by its id number
Created a list of each unique id from tidy_format
Used nested list comprehensions to create a list with elements as the sum of each tweet's polarity
>>> tidy_format.head()
num word
786204978629185536 0 pay
786204978629185536 1 to
786204978629185536 2 play
786204978629185536 3 politics
786204978629185536 4 crookedhillary
>>> tidy = trump['no_punc'].str.split(expand = True).stack()
>>> tidy.head()
786204978629185536 0 pay
1 to
2 play
3 politics
4 crookedhillary
dtype: object
>>> ids = list(tidy_format.index.unique())
>>> scores = [sum([sent['polarity'][word] if word in sent['polarity'] else 0 for word in tidy[_id]]) for _id in ids]
>>> trump['polarity'] = scores
>>> trump['polarity'].head()
786204978629185536 1
786201435486781440 -6.9
786189446274248704 1.8
786054986534969344 1.5
786007502639038464 1.2
Name: polarity, dtype: object
I found a solution to my problem. Rather than creating the 'polarity' column manually by assigning trump['polarity'] to the result of nested list comprehensions, I merged the tidy_format and sent DataFrames (sent has column polarity containing polarity score of every word in the VADER lexicon, indexed by each individual word) and performed operations on the resulting table:
>>> tidy_sent = tidy_format.merge(sent, left_on = 'word', right_index = True)
>>> tidy_sent.fillna(0, inplace = True)
>>> tidy_sent.index = tidy_sent.index.set_names('id')
>>> tidy_sent.head()
num word polarity
id
786204978629185536 0 pay -0.4
783477966906925056 5 pay -0.4
771294347501461504 2 pay -0.4
771210555822477313 2 pay -0.4
764552764177481728 20 pay -0.4
>>> ts_grouped = tidy_sent.groupby('id').sum()
>>> ts_grouped.head()
num polarity
id
690171403388104704 10 -2.6
690173226341691392 27 -6.0
690176882055114758 39 4.3
690180284189310976 38 -2.6
690271688127213568 18 -5.2
>>> trump['polarity'] = ts_grouped['polarity']
>>> trump.fillna(0, inplace = True)
>>> trump['polarity'].head()
786204978629185536 1.0
786201435486781440 -6.9
786189446274248704 1.8
786054986534969344 1.5
786007502639038464 1.2
Name: polarity, dtype: float64
Since my original mistake was in the calculation of trump['polarity'], merging the tables gives me the correct values for this Series (note the dtype is now float64 rather than object), which allows me to call sort_values() properly.
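A condensed sketch of that merge-and-group pipeline, with toy tweets and a toy lexicon standing in for the real trump and sent frames (all ids, words, and scores below are invented):

```python
import pandas as pd

# Toy stand-ins for the real data
trump = pd.DataFrame(
    {'no_punc': ['pay to play politics', 'great win tonight']},
    index=[100, 101],
)
sent = pd.DataFrame({'polarity': {'pay': -0.4, 'great': 3.1, 'win': 2.8}})

# One row per (tweet id, word), as in tidy_format
tidy = (trump['no_punc'].str.split(expand=True)
        .stack().reset_index(level=1, drop=True)
        .rename('word').to_frame())

# Merge word scores (left index, the tweet id, is kept), sum per tweet
tidy_sent = tidy.merge(sent, left_on='word', right_index=True)
trump['polarity'] = tidy_sent.groupby(level=0)['polarity'].sum()
trump['polarity'] = trump['polarity'].fillna(0)

print(trump.sort_values('polarity'))
```

Because the resulting column is numeric, sort_values works without the ambiguous-truth-value error.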
>>> print('Most negative tweets:')
>>> for t in trump.sort_values(by='polarity').head()['text']:
...     print('\n ', t)
Most negative tweets:
the trump portrait of an unsustainable border crisis is dead on. “in the last two years, ice officers made 266,000 arrests of aliens with criminal records, including those charged or convicted of 100,000 assaults, 30,000 sex crimes & 4000 violent killings.” america’s southern....
it is outrageous that poisonous synthetic heroin fentanyl comes pouring into the u.s. postal system from china. we can, and must, end this now! the senate should pass the stop act – and firmly stop this poison from killing our children and destroying our country. no more delay!
the rigged russian witch hunt goes on and on as the “originators and founders” of this scam continue to be fired and demoted for their corrupt and illegal activity. all credibility is gone from this terrible hoax, and much more will be lost as it proceeds. no collusion!
...this evil anti-semitic attack is an assault on humanity. it will take all of us working together to extract the poison of anti-semitism from our world. we must unite to conquer hate.
james comey is a proven leaker & liar. virtually everyone in washington thought he should be fired for the terrible job he did-until he was, in fact, fired. he leaked classified information, for which he should be prosecuted. he lied to congress under oath. he is a weak and.....
Note that trump.head().sort_values(by="polarity") sorts only the head, while trump.sort_values(by="polarity").head() sorts everything and shows the head (the lowest-polarity rows).
Here I want to search the values of the paper_title column within the reference column. If a title is found as whole text, I want to get the _id of the reference row where it matched (not the _id of the paper_title row) and save that _id in the paper_title_in column.
In[1]:
d = {
    "_id": ["Y100"] * 3 + ["Y101"] * 3 + ["Y102"] * 3,
    "paper_title": (
        ["translation using information on dialogue participants"] * 3
        + ["#emotional tweets"] * 3
        + ["#supportthecause: identifying motivations to participate in online health campaigns"] * 3
    ),
    "reference": [
        "beattie, gs (2005, november) #supportthecause: identifying motivations to participate in online health campaigns may 31, 2017, from",
        "burton, n (2012, june 5) depressive realism retrieved may 31, 2017, from",
        "gotlib, i h, 27 hammen, c l (1992) #supportthecause: identifying motivations to participate in online health campaigns new york: wiley",
        "paul ekman 1992 an argument for basic emotions cognition and emotion, 6(3):169200",
        "saif m mohammad 2012a #tagspace: semantic embeddings from hashtags in mail and books to appear in decision support systems",
        "robert plutchik 1985 on emotion: the chickenand-egg problem revisited motivation and emotion, 9(2):197200",
        "alastair iain johnston, rawi abdelal, yoshiko herrera, and rose mcdermott, editors 2009 translation using information on dialogue participants cambridge university press",
        "j richard landis and gary g koch 1977 the measurement of observer agreement for categorical data biometrics, 33(1):159174",
        "tomas mikolov, kai chen, greg corrado, and jeffrey dean 2013 #emotional tweets arxiv:13013781",
    ],
}

import pandas as pd
df = pd.DataFrame(d)
df
Out:
Expected Results:
And finally the final result dataframe with unique values as:
Note here paper_title_in column has all the _id of title present in reference column as list.
I tried this, but it returns the _id of the paper_title row in paper_present_in (the row being searched) rather than the _id of the reference row where the match occurs. The expected result dataframe gives a clearer idea; have a look there.
def return_id(paper_title, reference, _id):
    if (paper_title is None) or (reference is None):
        return None
    if paper_title in reference:
        return _id
    return None

df1['paper_present_in'] = df1.apply(
    lambda row: return_id(row['paper_title'], row['reference'], row['_id']), axis=1)
So to solve your problem you'll need two dictionaries, plus the list of unique titles, to store some values temporarily.
# A dict mapping each unique paper title to its _id
mapping_dict_paper_to_id = dict()
# A dict mapping each matched row index to the cited paper's _id
mapping_id_to_idx = dict()

# This gives us the list of unique paper titles
unique_paper_title = df["paper_title"].unique()

# Storing values in the dict mapping_dict_paper_to_id
for value in unique_paper_title:
    mapping_dict_paper_to_id[value] = df["_id"][df["paper_title"] == value].unique()[0]

# Storing values in the dict mapping_id_to_idx
for value in unique_paper_title:
    # indexes of the rows whose reference contains the paper_title
    # (regex=False so the title is matched literally)
    idx_list = df[df['reference'].str.contains(value, regex=False)].index
    for idx in idx_list:
        mapping_id_to_idx[idx] = mapping_dict_paper_to_id[value]

# Check whether each index has a matching reference _id and update
# paper_present_in accordingly (.loc avoids chained-assignment issues)
for i in df.index:
    if i in mapping_id_to_idx:
        df.loc[i, 'paper_present_in'] = mapping_id_to_idx[i]
    else:
        df.loc[i, 'paper_present_in'] = "None"
The code above checks for matches and updates the searched values in the dataframe.
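For comparison, a shorter sketch that builds a list-valued paper_title_in column directly with str.contains (the two-row frame is a toy stand-in for the question's data; regex=False keeps the titles literal):

```python
import pandas as pd

# Small stand-in frame; see the question for the full data
df = pd.DataFrame({
    '_id': ['Y100', 'Y101'],
    'paper_title': ['#emotional tweets', 'translation using dialogue'],
    'reference': ['cites translation using dialogue here',
                  'mentions #emotional tweets in passing'],
})

def ids_citing(title):
    # _ids of the rows whose reference text contains this title verbatim
    mask = df['reference'].str.contains(title, regex=False)
    return df.loc[mask, '_id'].unique().tolist()

df['paper_title_in'] = df['paper_title'].map(ids_citing)
```

Each cell of paper_title_in is then a (possibly empty) list of reference-row _ids, as in the expected output.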
TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenges we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurrence across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
code  name
001   Australia
002   London
...
001   <blank>
The approach I used is as follows:
Loop through the entire dataframe and identify entries with blank names (""). Replace each blank by copying in the name associated with its code.
code_names = ["",
              'Economic management',
              'Public sector governance',
              'Rule of law',
              'Financial and private sector development',
              'Trade and integration',
              'Social protection and risk management',
              'Social dev/gender/inclusion',
              'Human development',
              'Urban development',
              'Rural development',
              'Environment and natural resources management']

df_copy = df_.copy()

# Looks through each code name, and if it is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if df_copy.mjtheme_namecode[x][y]['name'] == "":
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]

limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
        if counter >= limit:
            break
While the above approach works - is there a better, more pythonic way of achieving what I'm after? I feel the approach I have used is very clunky due to my skills not being very well developed.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
code name
0 1 Australia
1 2 London
2 1
You can apply the following:
new_df = (df.replace({'name':{'':np.nan}})
.sort_values(['code', 'name'])
.fillna(method='ffill')
.sort_index())
>>> new_df
code name
0 1 Australia
1 2 London
2 1 Australia
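Here is a self-contained version of Method 1 that can be run as-is; it uses .ffill(), the modern spelling of fillna(method='ffill'):

```python
import numpy as np
import pandas as pd

# The same three-row example from above
df = pd.DataFrame({'code': [1, 2, 1], 'name': ['Australia', 'London', '']})

# Replace blanks with NaN, sort so each NaN follows a real name for its
# code, forward-fill, then restore the original row order
new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .ffill()
            .sort_index())
```

The sort ensures a real name for each code precedes its NaNs, so the forward fill copies the right value.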
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series that maps the codes to non-blank names, and use .map to map that series onto your code column:
df['name'] = (df['code']
.map(
df.replace({'name':{'':np.nan}})
.sort_values(['code', 'name'])
.groupby('code')
.first()
.squeeze()
))
>>> df
code name
0 1 Australia
1 2 London
2 1 Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it takes the first instance for every code (via the groupby), sorted so that the NaNs come last. So as long as each code is associated with at least one non-blank name, this method will work.
I wrote this regex:
re.search(r'^SECTION.*?:', text, re.I | re.M)
re.match(r'^SECTION.*?:', text, re.I | re.M)
to run on this string:
text = 'SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:\n (a) within 95 days after the end of each fiscal year of the Parent,\n its audited consolidated balance sheet and related statements of income,\n cash flows and stockholders\' equity as of the end of and for such year,\n setting forth in each case in comparative form the figures for the previous\n fiscal year, all reported on by Arthur Andersen LLP or other independent\n public accountants of recognized national standing (without a "going\n concern" or like qualification or exception and without any qualification\n or exception as to the scope of such audit) to the effect that such\n consolidated financial statements present fairly in all material respects\n the financial condition and results of operations of the Parent and its\n consolidated Subsidiaries on a consolidated basis in accordance with GAAP\n consistently applied;\n (b) within 50 days after the end of each of the first three fiscal\n quarters of each fiscal year of the Parent, its consolidated balance sheet\n and related statements of income, cash flows and stockholders\' equity as of\n the end of and for such fiscal quarter and the then elapsed portion of the\n fiscal year, setting forth in each case in comparative form the figures for\n the corresponding period or periods of (or, in the case of the balance\n sheet, as of the end of) the previous fiscal year, all certified by one of\n its Financial Officers as presenting fairly in all material respects the\n financial condition and results of operations of the Parent and its\n consolidated Subsidiaries on a consolidated basis in accordance with GAAP\n consistently applied, subject to normal year-end audit adjustments and the\n absence of footnotes;\n '
and I was expecting the following output:
SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:
but I am getting None as the output.
Can anyone tell me what I am doing wrong here?
Without the re.S (DOTALL) flag, . does not match a newline, so .*? can never cross the line break to reach the : on the second line, and the search returns None. You can use a negated character class instead, since [^:] does match newlines, to get the expected result:
In [32]: m = re.search(r'^SECTION[^:]*?:', text, re.I | re.M)
In [33]: m.group(0)
Out[33]: 'SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:'
In [34]:
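An alternative is to keep .*? and add re.S (DOTALL) so the dot can cross the newline; the lazy quantifier still stops at the first colon:

```python
import re

# The opening of the string from the question (truncated here)
text = ('SECTION 5.01. Financial Statements and Other Information. The Parent\n'
        'will furnish to the Administrative Agent:\n'
        '    (a) within 95 days after the end of each fiscal year ...')

# re.S lets '.' match '\n'; '.*?' is lazy, so it stops at the first ':'
m = re.search(r'^SECTION.*?:', text, re.I | re.M | re.S)
print(m.group(0))
```

Either fix (the negated class or re.S) works; the character class is slightly more precise since it cannot overshoot past a colon.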