I need to read lines from a text file and extract the
quoted person name and quoted text from each line.
The lines look similar to this:
"Am I ever!", Homer Simpson responded.
Remarks:
Hint: Use the object returned by the 'open' method as the file handle. Each line you read is expected to end with a newline character; remove the newline as follows: line_cln = line.strip()
Each line falls into one of three options (assume exactly one of these): the first set of patterns, in which the person name appears before the quoted text; the second set of patterns, in which the quoted text appears before the person name; or an empty line.
Complete the transfer_raw_text_to_dataframe function to return a
dataframe with the extracted person name and text as explained
above. The information is expected to be extracted from the lines of
the given 'filename' file.
The returned dataframe should include two columns:
person_name - containing the extracted person name for each line.
extracted_text - containing the extracted quoted text for each line.
The returned values:
dataframe - The dataframe with the extracted information as described above.
Important Note: if a line does not contain any quotation pattern, no information should be saved in the
corresponding row in the dataframe.
what I got so far: [edited]
def transfer_raw_text_to_dataframe(filename):
    data = open(filename)
    quote_pattern = '"(.*)"'
    name_pattern = "\w+\s\w+"
    df = open(filename, encoding='utf8')
    lines = df.readlines()
    df.close()
    dataframe = pd.DataFrame(columns=('person_name', 'extracted_text'))
    i = 0
    for line in lines:
        quote = re.search(quote_pattern, line)
        extracted_quotation = quote.group(1)
        name = re.search(name_pattern, line)
        extracted_person_name = name.group(0)
        df2 = {'person_name': extracted_person_name, 'extracted_text': extracted_quotation}
        dataframe = dataframe.append(df2, ignore_index=True)
        dataframe.loc[i] = [person_name, extracted_text]
        i = i + 1
    return dataframe
The dataframe is created with the correct shape; the problem is that the person name in every row is 'Oh man' and the quote is 'Oh man, that guy's tough to love.' (the same in all of them),
which is weird because that text isn't even in the txt file...
Can anyone help me fix this?
Edit: I need to extract from a simple txt file that contains these lines only:
"Am I ever!", Homer Simpson responded.
"Hmmm. So... is it okay if I go to the women's conference with Chloe?", Lisa Simpson answered.
"Really? Uh, sure.", Bart Simpson answered.
"Sounds great.", Bart Simpson replied.
Homer Simpson responded: "Danica Patrick in my thoughts!"
C. Montgomery Burns: "Trust me, he'll say it, or I'll bust him down to Thursday night vespers."
"Gimme that torch." Lisa Simpson said.
"No! No, I've got a lot more mothering left in me!", Marge Simpson said.
"Oh, Homie, I don't care if you're a billionaire. I love you just because you're..." Marge Simpson said.
"Damn you, e-Bay!" Homer Simpson answered.
possibly in such a way:
import pandas as pd
import re
# do smth
with open("12.txt", "r") as f:
    data = f.read()
# print(data)
# ########## findall text in quotes
m = re.findall(r'\"(.+)\"', data)
print("RESULT: \n", m)
df = pd.DataFrame({'rep': m})
print(df)
# ########## retrieve and replace text in quotes for nothing
m = re.sub(r'\"(.+)\"', r'', data)
# ########## get First Name & Last Name from the rest text in each line
regex = re.compile("([A-Z]{1}[a-z]+ [A-Z]{1}[a-z]+)")
mm = regex.findall(m)
df1 = pd.DataFrame({'author': mm})
print(df1)
# ########## join 2 dataframes
fin = pd.concat([df, df1], axis=1)
print(fin)
All the print calls are just for checking (remove them for cleaner code).
Just "C. Montgomery Burns" is losing his first letter...
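One way to keep that initial would be to loosen the name pattern so it also accepts a single capital letter followed by a period; the pattern below is only a guess that happens to fit the sample lines:
# drop-in replacement for the regex above; allows initials such as "C."
regex = re.compile(r"([A-Z][a-z]*\.?(?: [A-Z][a-z]*\.?)+)")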
For a loop over a folder:
# All files according to the mask, ending with .txt
print(glob.glob("C:\\MyFolder\\*.txt"))
mylist = [ff for ff in glob.glob("C:\\MyFolder\\*.txt")]
print("file_list:\n", mylist)
for filepath in mylist:
    # do smth with each filepath
To collect all the dataframes you're getting from the files, do something like this (e.g. reading csv files by mask):
import glob
import pandas as pd
def dfs_collect():
    mylist = [ff for ff in glob.glob("C:\\MyFolder\\*.txt")]  # all files by-mask
    print("file_list:\n", mylist)
    dfa = pd.concat((pd.read_csv(file, sep=';', encoding='windows-1250', index_col=False) for file in mylist), ignore_index=True)
    return dfa  # return the combined dataframe so the caller can use it
But to get at the content of your files, an example of the content is needed... Without an example of your txt file (with dummy info but its real structure kept), I doubt that anybody will try to imagine how it should look.
I think the following does what you need. Please verify whether the output is accurate; I'll explain any line that is unclear.
import pandas as pd
import numpy as np
import nltk
from nltk.tree import ParentedTree
import typing as t # This is optional
# Using `read_csv` to read in the text because I find it easier
data = pd.read_csv("dialog.txt", header = None, sep = "~", quoting=3)
dialouges = data.squeeze() # Getting a series from the above DF with one column
def tag_sentence(tokenized: t.List[str]) -> t.List[t.Tuple[str, str]]:
    tagged = nltk.pos_tag(tokenized)
    tagged = [(token, tag) if tag not in {"``", "''"} else (token, "Q") for token, tag in tagged]
    keep = {"Q", "NNP"}
    renamed = [(token, "TEXT") if tag not in keep else (token, tag) for token, tag in tagged]
    return renamed

def get_parse_tree(tagged_sent):
    grammar = """
    NAME: {<NNP>+}
    WORDS: {<TEXT>+}
    DIALOUGE: {<Q><WORDS|NAME>+<Q>}
    """
    cp = nltk.RegexpParser(grammar)
    parse_tree = cp.parse(tagged_sent)
    return parse_tree

def extract_info(parse_tree):
    ptree = ParentedTree.convert(parse_tree)
    trees = list(ptree.subtrees())
    root = ptree.root()
    for subtree in trees[1:]:
        if subtree.parent() == root:
            if subtree.label() == "DIALOUGE":
                dialouge = ' '.join(word for word, _ in subtree.leaves()[1:-1])  # Skipping quotation marks
            if subtree.label() == "NAME":
                person = ' '.join(word for word, _ in subtree.leaves())
    return dialouge, person

def process_sentence(sentence):
    return extract_info(get_parse_tree(tag_sentence(nltk.word_tokenize(sentence))))

processed = [process_sentence(line) for line in dialouges]
result = pd.DataFrame(processed, columns=["extracted_text", "person_name"])
The resulting DataFrame has one row per dialogue line, with the extracted_text and person_name columns.
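For comparison, here is a lighter-weight sketch that uses only re and pandas. The two pattern strings are assumptions about the line formats shown above (quoted text first vs. name first), and lines without a quotation pattern are kept as empty rows, as the exercise requires:
import re
import pandas as pd

def transfer_raw_text_to_dataframe(filename):
    # hypothetical patterns: quoted text first ("..." , Name verb.) or name first (Name ...: "...")
    quote_first = re.compile(r'"(?P<text>[^"]+)"[,.]?\s*(?P<name>[A-Z][\w.]*(?: [A-Z][\w.]*)+)')
    name_first = re.compile(r'(?P<name>[A-Z][\w.]*(?: [A-Z][\w.]*)+)[^"]*"(?P<text>[^"]+)"')
    rows = []
    with open(filename, encoding='utf8') as f:
        for line in f:
            line_cln = line.strip()
            match = quote_first.search(line_cln) or name_first.search(line_cln)
            if match:
                rows.append({'person_name': match.group('name'),
                             'extracted_text': match.group('text')})
            else:
                # no quotation pattern: leave the row empty, as required
                rows.append({'person_name': None, 'extracted_text': None})
    return pd.DataFrame(rows, columns=['person_name', 'extracted_text'])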
I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', -1)
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words from an RSS feed using a central "replacement" file. The source file should have two columns: old word, new word, like the replace function replace('old','new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete words that are unnecessary for me, e.g. replace('to',''). But I also want to be able to change special names, e.g. replace("D'Amelio","DAmelio"). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I can't find any help by Googling, but it could well be that I don't know the right terms or can't formulate it.
with open('<filepath>','r') as r:
    # if you remove the ' marks from around your words, you can remove the [1:-1] part of the below code
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]

def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text
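The snippet above only deletes words. If the central file instead holds old/new pairs, one per line as described in the question, a sketch along these lines could be applied to the feed text (the file name replacements.txt and its exact format are assumptions):
# replacements.txt is assumed to hold one "old,new" pair per line, e.g.
#   to,
#   D'Amelio,DAmelio
replacements = []
with open('replacements.txt', 'r') as r:
    for raw_line in r:
        pair = raw_line.rstrip('\n')
        if not pair:
            continue
        old, _, new = pair.partition(',')
        replacements.append((old, new))

def apply_replacements(text, replacements):
    # apply every old -> new pair; an empty "new" simply deletes the word
    for old, new in replacements:
        text = text.replace(old, new)
    return text

# entries is the newline-separated titles string built in the question's code
entries = apply_replacements(entries, replacements)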
I was unable to understand your question properly, but as far as I understand it, you have strings like cat, dog, etc., and you have a file containing the data you want to replace those strings with. If that is your requirement, I have given a solution below, so try running it and see whether it satisfies your requirement.
If that's not what you meant, please comment below.
TXT file (don't use '' around the strings in the text file):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ") #string you want to replace
with open('textfile.txt',"r") as file1: #open your file
    lines = file1.readlines()
for line in lines: #taking the lines of the file in one by one using a loop
    string1 = f'{line}'
    string1 = string1.split() #split the line of the file into a list like ['cat,', 'kitten']
    if your_string == string1[0][:-1]: #comparing your string with the file
        your_string = your_string.replace(your_string, string1[1]) #If the string matches (e.g. the user typed cat), it will be replaced with kitten.
        print(your_string)
    else:
        pass
If you got the correct answer please upvote my answer as it took my time to make and test the python file.
I need to create a database using Python and a .txt file.
Creating new items is no problem. The inside of the Database.txt looks like this:
Index Objektname Objektplace Username
i.e:
1 Pen Office Daniel
2 Saw Shed Nic
6 Shovel Shed Evelyn
4 Knife Room6 Evelyn
I get the index from a QR scanner (OpenCV) and the other information is gained via Tkinter Entry widgets. If an object is already saved in the database, you should be able to rewrite its Objektplace and Username.
My problems now are the following:
If I scan the code with the index 6, how do I navigate to that entry, even if it's not in line 6, without causing a problem with the Room6?
How do I, for example, only replace the "Shed" from index 4 when that object is moved to e.g. Room6?
The same goes for the usernames.
Up until now I've tried different methods, but nothing has worked so far.
The last try looked something like this
def DBChange():
    #Removes unwanted bits from the scanned code
    data2 = data.replace("'", "")
    Index = data2.replace("b","")
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    #Adds a whitespace at the end of the Entrys to separate them
    Userlen = len(User)
    User2 = User.ljust(Userlen)
    Einlagerungsortlen = len(Einlagerungsort)+1
    Einlagerungsort2 = Einlagerungsort.ljust(Einlagerungsortlen)
    #Navigate to the exact line of the scanned Index and replace the words
    #for the place and the user ONLY in this line
    file = open("Datenbank.txt","r+")
    lines = file.readlines()
    for word in lines[Index].split():
        List.append(word)
    checkWords = (List[2], List[3])
    repWords = (Einlagerungsort2, User2)
    for line in file:
        for check, rep in zip(checkWords, repWords):
            line = line.replace(check, rep)
        file.write(line)
    file.close()
    Return()
Thanks in advance
I'd suggest using Pandas to read and write your text file. That way you can just use the index to select the appropriate line. And if there is no specific reason to use your text format, I would switch to csv for ease of use.
import pandas as pd
import pandas as pd

def DBChange():
    #Removes unwanted bits from the scanned code
    # I haven't changed this part, since I guess you need this for some input data
    data2 = data.replace("'", "")
    Indexnr = int(data2.replace("b",""))  # convert to int so it matches the Index column
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # I removed the lines here. This isn't necessary when using csv and Pandas
    # read in the csv file, using the Index column as the row label
    df = pd.read_csv("Datenbank.csv", index_col='Index')
    # Select line with index and replace value
    df.loc[Indexnr, 'Username'] = User
    df.loc[Indexnr, 'Objektplace'] = Einlagerungsort
    # Write back to csv
    df.to_csv("Datenbank.csv")
    Return()
Since I can't reproduce your specific problem, I haven't tested it. But something like this should work.
Edit
To read and write the text file, use ' ' as the separator. (I assume no value contains spaces, and your text file currently uses one space between values.)
reading:
df = pd.read_csv('Datenbank.txt', sep=' ')
Writing:
df.to_csv('Datenbank.txt', sep=' ')
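One detail worth checking: to_csv also writes the DataFrame index by default, so a round trip could look like this sketch (assuming the Index column from the question is used as the row label):
import pandas as pd

# read Datenbank.txt, using the Index column as the row label
df = pd.read_csv('Datenbank.txt', sep=' ', index_col='Index')
# ... change df.loc[...] values here ...
# write it back; the Index column is rebuilt from the row label
df.to_csv('Datenbank.txt', sep=' ')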
First of all, this is a terrible way to store data. My suggestion is not particularly good code; don't do this in production!
newlines = []
for line in lines:
    entry = line.split()
    if entry[0] == Index:
        #line now is the correct line
        #Index 2 is the place, index 0 the ID, etc
        entry[2] = Einlagerungsort2
    newlines.append(" ".join(entry))
# Now write newlines back to the file
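A minimal sketch of that last step, assuming newlines holds one rebuilt entry per original line, in the original order:
# overwrite Datenbank.txt with the rebuilt lines
with open("Datenbank.txt", "w") as f:
    f.write("\n".join(newlines) + "\n")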
So I'm new to Python, besides some experience with tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living in Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it into a .csv row with the columns
"Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use "print" (IDE = KATE/KDE + terminal... :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string
fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"
if __name__ == "__main__":
    with open(export_file_name,"w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0] # only want text/plain.. Ill split #right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose +"\t"+pmt
                            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv Thank you, I am still trying to understand every step, but I just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The number of keywords in the emails is ~20. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
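A literal rendering of that idea could look like the sketch below; the keyword strings come from the emails, while the helper name and the content variable are made up for illustration:
def text_between(text, keyword_a, keyword_b):
    # copy the text after keyword A up to keyword B (or to the end if B is absent)
    start = text.find(keyword_a)
    if start == -1:
        return None
    start += len(keyword_a)
    end = text.find(keyword_b, start)
    return text[start:end if end != -1 else len(text)].strip()

person = text_between(content, "Name_OF_Person:", "Adress_HOME:")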
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" signs are still there.
Should I replace ["=", " = "] with ""?
I would go for a "routine" parsing loop over the input lines, maintaining current_key and current_value variables, since a value for a certain key in your data might be "annoying" and spread across multiple lines.
I've demonstrated such a parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with whitespace, I assume it must be one of those "annoying" values spread across multiple lines; such lines are concatenated into a single value using a configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespace from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in sharedKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process string data spread across multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done.
content = content.replace('\n', ' ')
Then you can split on each of the boundries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] #take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress Home"
content = content.split("Adress_HOME:")[1] #take second element of the list
address = content.split("Company_NAME:")[0] # take content before
company = content.split("Company_NAME:")[1] #take second element of the list (the remainder) which is company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make a regex match "cut" across multiple lines, you would use the re.DOTALL option, so that . also matches newlines. So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)
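For illustration, a small sketch of that regex route, assuming the three labels always appear in this order and that content is the flattened string built with content.replace('\n', ' ') above:
import re

pattern = re.compile(
    r"Name_OF_Person:(?P<name>.*?)"
    r"Adress_HOME:(?P<address>.*?)"
    r"Company_NAME:(?P<company>.*)",
    re.DOTALL,  # let . match newlines so values may span several lines
)
match = pattern.search(content)
if match:
    person = " ".join(match.group("name").split())
    address = " ".join(match.group("address").split())
    company = " ".join(match.group("company").split())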
I am trying to append all the bodies of the Enron emails into one file so that I can then process the text of these emails by eliminating stop words and splitting it into sentences with NLTK.
My problem is with forwarded and replied messages, I am not sure how to clean them.
This is my code so far:
import os, email, sys, re,nltk, pprint
from email.parser import Parser
rootdir = '/Users/art/Desktop/maildir/lay-k/elizabeth'
#function that appends all the body parts of Emails
def email_analyse(inputfile, email_body):
    with open(inputfile, "r") as f:
        data = f.read()
        email = Parser().parsestr(data)
        email_body.append(email.get_payload())
#end of function

#defining a list that will contain bodies
email_body = []

#call the function email_analyse for every file in the directory
for directory, subdirectory, filenames in os.walk(rootdir):
    for filename in filenames:
        email_analyse(os.path.join(directory, filename), email_body)

#the stage where I clean the emails
with open("email_body.txt", "w") as f:
    for val in email_body:
        if(val):
            val = val.replace("\n", "")
            val = val.replace("=01", "")
            #for some reason I had many of ==20 and =01 in my text
            val = val.replace("==20", "")
            f.write(val)
            f.write("\n")
This is the partial output:
Well, with the photographer and the band, I would say we've pretty much outdone our budget! Here's the information on the photographer. I have a feeling for some of the major packages we could negotiate at least a couple of hours at the rehearsal dinner. I have no idea how much this normally costs, but he isn't cheap!---------------------- Forwarded by Elizabeth Lay/HOU/AZURIX on 09/13/99 07:34 PM ---------------------------acollins#reggienet.com on 09/13/99 05:37:37 PMPlease respond to acollins#reggienet.com To: Elizabeth Lay/HOU/AZURIX#AZURIXcc: Subject: Denis Reggie Wedding PhotographyHello Elizabeth:Congratulations on your upcoming marriage! I am Ashley Collins, Mr.Reggie's Coordinator. Linda Kessler forwarded your e.mail address to me sothat I may provide you with information on photography coverage for Mr.Reggie's wedding photography.
So the result is not a pure text at all. Any ideas on how to do it right?
You might want to look at regular expressions to parse the forwarded and reply text because the format should be consistent throughout the corpus.
For deleting the forwarded text, you could use a regex like this:
-{4,}(.*)(\d{2}:\d{2}:\d{2})\s*(PM|AM)
Which will match all the content between four or more hyphens and the time in the format XX:XX:XX PM. Matching 3 dashes would probably work fine, too. We just want to avoid matching hyphens and em-dashes in the email body. You can play around with this regex and write your own for matching To and Subject headers at this link: https://regex101.com/r/VGG4bu/1/
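For example, the pattern could be applied with re.sub like this (email_body_text is a made-up variable standing in for one message body):
import re

# strip the forwarded block: a run of 4+ hyphens up to the trailing time stamp and AM/PM
forward_pattern = re.compile(r'-{4,}(.*)(\d{2}:\d{2}:\d{2})\s*(PM|AM)')
cleaned_body = forward_pattern.sub('', email_body_text)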
You can also look at section 3.4 of the NLTK book, which talks about regular expressions in Python: http://www.nltk.org/book/ch03.html
Good luck! This sounds like an interesting project.
If you're still interested in this problem, I've created a pre-processing script specifically for the Enron dataset. You'll notice that a new email always starts with the tag 'Subject:'; I split on the last 'Subject:' tag and keep only the text to its right, which removes all forwarded messages. The specific code:
# Cleaning content column
df['content'] = df['content'].str.rsplit('Subject: ').str[-1]
df['content'] = df['content'].str.rsplit(' --------------------------- ').str[-1]
Overall script, if interested:
# Importing the dataset, and defining columns
import pandas as pd
df = pd.read_csv('enron_05_17_2015_with_labels_v2.csv', usecols=[2,3,4,13], dtype={13:str})
# Building a count of how many people are included in an email
df['Included_In_Email'] = df.To.str.count(',')
df['Included_In_Email'] = df['Included_In_Email'].apply(lambda x: x+1)
# Dropping any NaN's, and emails with >15 recipients
df = df.dropna()
df = df[~(df['Included_In_Email'] >=15)]
# Separating remaining emails into a line-per-line format
df['To'] = df.To.str.split(',')
df2 = df.set_index(['From', 'Date', 'content', 'Included_In_Email'])['To'].apply(pd.Series).stack()
df2 = df2.reset_index()
df2.columns = ['From','To','Date','content', 'Included_In_Email']
# Renaming the new column, dropping unneeded column, and changing indices
del df2['level_4']
df2 = df2.rename(columns = {0: 'To'})
df2 = df2[['Date','From','To','content','Included_In_Email']]
del df
# Cleaning email addresses
df2['From'] = df2['From'].map(lambda x: x.lstrip("frozenset"))
df2['To'] = df2['To'].map(lambda x: x.lstrip("frozenset"))
df2['From'] = df2['From'].str.strip("<\>(/){?}[:]*, ")
df2['To'] = df2['To'].str.strip("<\>(/){?}[:]*, ")
df2['From'] = df2['From'].str.replace("'", "")
df2['To'] = df2['To'].str.replace("'", "")
df2['From'] = df2['From'].str.replace('"', "")
df2['To'] = df2['To'].str.replace('"', "")
# Accounting for users having different emails
email_dict = pd.read_csv('dict_email.csv')
df2['From'] = df2.From.replace(email_dict.set_index('Old')['New'])
df2['To'] = df2.To.replace(email_dict.set_index('Old')['New'])
del email_dict
# Removing emails not containing #enron
df2['Enron'] = df2.From.str.count('#enron')
df2['Enron'] = df2['Enron']+df2.To.str.count('#enron')
df2 = df2[df2.Enron != 0]
df2 = df2[df2.Enron != 1]
del df2['Enron']
# Adding job roles which correspond to staff
import csv
with open('dict_role.csv') as f:
    role_dict = dict(filter(None, csv.reader(f)))
df2['Sender_Role'] = df2['From'].map(role_dict)
df2['Receiver_Role'] = df2['To'].map(role_dict)
df2 = df2[['Date','From','To','Sender_Role','Receiver_Role','content','Included_In_Email']]
del role_dict
# Cleaning content column
df2['content'] = df2['content'].str.rsplit('Subject: ').str[-1]
df2['content'] = df2['content'].str.rsplit(' --------------------------- ').str[-1]
# Condensing records into one line per email exchange, adding weights
Weighted = df2.groupby(['From', 'To']).count()
# Adding weight column, removing redundant columns, splitting indexed column
Weighted['Weight'] = Weighted['Date']
Weighted = Weighted.drop(['Date','Sender_Role','Receiver_Role','content','Included_In_Email'], 1)
Weighted.reset_index(inplace=True)
# Re-adding job-roles to staff
with open('dict_role.csv') as f:
    role_dict = dict(filter(None, csv.reader(f)))
Weighted['Sender_Role'] = Weighted['From'].map(role_dict)
del role_dict
# Dropping exchanges with a weight of <= x, or no identifiable role
Weighted2 = Weighted[~(Weighted['Weight'] <=3)]
Weighted2 = Weighted2.dropna()  # drop exchanges with no identifiable role
Two dictionaries are used in the script (for matching job-roles and changing multiple emails for the same person), and can be found here.
I've made this CSV file up to play with.. From what I've been told before, I'm pretty sure this CSV file is valid and can be used in this example.
Basically I have this CSV file 'book_list.csv':
name,author,year
Lord of the Rings: The Fellowship of the Ring,J. R. R. Tolkien,1954
Nineteen Eighty-Four,George Orwell,1984
Lord of the Rings: The Return of the King,J. R. R. Tolkien,1954
Animal Farm,George Orwell,1945
Lord of the Rings: The Two Towers, J. R. R. Tolkien, 1954
And I also have this text file 'search_query.txt', whereby I put in keywords or search terms I want to search for in the CSV file:
Lord
Rings
Animal
I've currently come up with some code (with the help of stuff I've read) that allows me to count the number of matching entries. I then have the program write a separate CSV file 'results.csv' which just returns either 'Matching' or ' '.
The program then takes this 'results.csv' file and counts how many 'Matching' results I have and it prints the count.
import csv
import collections
f1 = file('book_list.csv', 'r')
f2 = file('search_query.txt', 'r')
f3 = file('results.csv', 'w')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)
input = [row for row in c2]
for booklist_row in c1:
    row = 1
    found = False
    for input_row in input:
        results_row = []
        if input_row[0] in booklist_row[0]:
            results_row.append('Matching')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('')
    c3.writerow(results_row)
f1.close()
f2.close()
f3.close()
d = collections.defaultdict(int)
with open("results.csv", "rb") as info:
    reader = csv.reader(info)
    for row in reader:
        for matches in row:
            matches = matches.strip()
            if matches:
                d[matches] += 1
results = [(matches, count) for matches, count in d.iteritems() if count >= 1]
results.sort(key=lambda x: x[1], reverse=True)
for matches, count in results:
    print 'There are', count, 'matching results'+'.'
In this case, my output returns:
There are 4 matching results.
I'm sure there is a better way of doing this and avoiding writing a completely separate CSV file.. but this was easier for me to get my head around.
My question is, this code that I've put together only returns how many matching results there are.. how do I modify it in order to return the ACTUAL results as well?
i.e. I want my output to return:
There are 4 matching results.
Lord of the Rings: The Fellowship of the Ring
Lord of the Rings: The Return of the King
Animal Farm
Lord of the Rings: The Two Towers
As I said, I'm sure there's a much easier way to do what I already have.. so some insight would be helpful. :)
Cheers!
EDIT: I just realized that if my keywords were in lower case, it wouldn't work... Is there a way to avoid case sensitivity?
Throw away the query file and get your search terms from sys.argv[1:] instead.
Throw away your output file and use sys.stdout instead.
Append matched booklist titles to a result_list. The result_row that you currently have has a rather misleading name. The count that you want is len(result_list). Print that. Then print the contents of result_list.
Convert your query words to lowercase once (before you start reading the input file). As you read each book_list row, convert its title to lowercase. Do your matching with the lowercase query words and the lowercase title.
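Putting those suggestions together, a rough sketch could look like this (written for Python 3; the header row is skipped, which your current code does not do):
import csv
import sys

# search terms come from the command line instead of a query file
query_words = [word.lower() for word in sys.argv[1:]]

result_list = []
with open('book_list.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        title = row[0]
        if any(word in title.lower() for word in query_words):
            result_list.append(title)

print('There are', len(result_list), 'matching results.')
for title in result_list:
    print(title)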
Overall plan:
Read in the entire book list csv into a dictionary of {title: info}.
Read in the search query file. For each keyword, filter the dictionary:
[key for key, value in books.items() if "Lord" in key]
say. Do what you will with the results.
If you want, put the results in another csv.
If you want to deal with casing issues, try turning all the titles to lowercase ("FOO".lower()) when you store them in the dictionary.
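A compact sketch of that plan, assuming the same book_list.csv and search_query.txt as above (the lowercasing follows the suggestion just made):
import csv

# build {title: (author, year)} from the book list
with open('book_list.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    books = {row[0]: (row[1], row[2]) for row in reader}

# read the keywords and filter the titles, ignoring case
with open('search_query.txt') as f:
    keywords = [line.strip().lower() for line in f if line.strip()]

matches = [title for title in books
           if any(keyword in title.lower() for keyword in keywords)]

print('There are', len(matches), 'matching results.')
for title in matches:
    print(title)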