Python read text file into dictionary, list of strings


I am trying to read a text file into a dictionary.
The text file contains each person's name, networks, and friends' names.
The dictionary key is the person's name, and the value is a list of that person's networks.
Here is the text file:
Pritchett, Mitchell\n
Law Association\n
Dunphy, Claire\n
Tucker, Cameron\n
Dunphy, Luke\n
\n\n
Tucker, Cameron\n
Clown School\n
Wizard of Oz Fan Club\n
Pritchett, Mitchell\n
Pritchett, Gloria\n
\n\n
Dunphy, Alex\n
Orchestra\n
Chess Club\n
Dunphy, Luke\n
Here is what I did
def person_to_networks(file):
I get an error for the line 'if "\n" and "," in lst[0]'. It says list index out of range.
Please help me. I can't figure out what is wrong with this code.

You get that error because you initialize lst as an empty list ([]) and then check its first element, which does not exist.
Since you want to turn your file into a dictionary, I suggest this simpler code:
import re  # import regex library

# open the file and read your data
with open('data', 'r') as f:
    data = f.read()

# initialize the result dictionary and prepare the data
networks = {}
data = data.replace('\\n', '')  # remove literal \n characters
data = data.split('\n\n')  # split it into blocks
for block in data:
    block = block.split('\n')  # split each block into lines
    nets = []
    for line in block:
        if ',' not in line and line != '':  # lines without a comma are networks
            nets.append(line)
    block[0] = re.sub(r'(\w+),\s(\w+)', r'\2, \1', block[0])  # ADDED to switch first name and last name
    networks[block[0]] = nets  # update the result dictionary
print(networks)
This gives the following result for your example file (before the name-switching line was added):
{'Pritchett, Mitchell': ['Law Association'], 'Tucker, Cameron': ['Clown School', 'Wizard of Oz Fan Club'], 'Dunphy, Alex': ['Orchestra', 'Chess Club']}
If this is not what you want, please describe what you need in more detail.
Edit: to switch the first name and last name, add a single line that makes the switch before you update the dictionary. I added that line to the code above; it uses a regex (don't forget to add "import re" as at the beginning of my code):
'(\w+),\s(\w+)' # used to find the first name and last name and store them in \1 and \2 match groups.
'\2, \1' # to replace the place of the match groups as required.
OR '\2 \1' # if you don't want the comma
You can adapt the replacement however you like, e.g. by removing the comma.
After the switch, the output becomes:
{'Alex, Dunphy': ['Orchestra', 'Chess Club'], 'Cameron, Tucker': ['Clown School', 'Wizard of Oz Fan Club'], 'Mitchell, Pritchett': ['Law Association']}
Edit: another way to switch between the first and last names (remove the "import re" and the previously added line and replace it with these three lines with the same indent):
s = block[0].split(', ')
s.reverse()
block[0] = ', '.join(s) # or use ' '.join(s) if you don't want the comma
hope this helps.

Because the first time through the loop, you're trying to access lst[0], when lst is still [].

On the first pass through the loop, lst is still an empty list ([]).
You need to append some values to lst before indexing it.
You probably meant to change
if "\n" and "," in lst[0]:
to
if "\n" and "," in line[0]:
and
elif "," not in lst[1:]:
to
elif "," not in line[1:]:
new_person_friends in the last line is not defined. You need to replace it with the correct name.
When line is "\n", lst is cleared after networks is updated.
Your data also contains "\n\n", which means two consecutive empty lines.
At the second "\n", lst is an empty list because the first "\n" has already been processed.
You need to guard against that, for example: if line == '\n' and lst != []:
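One more detail about the condition in the failing line: "\n" and "," in x does not test for both characters. The in operator binds tighter than and, so Python reads it as "\n" and ("," in x), and a non-empty string such as "\n" is always truthy, which means only the comma is checked. A minimal sketch, using a made-up sample string:

```python
# Sample line with a comma but NO newline (made up for illustration).
line = "Pritchett, Mitchell"

# Parsed as: "\n" and ("," in line) -> only the comma test matters.
misleading = bool("\n" and "," in line)

# Testing both substrings explicitly.
explicit = "\n" in line and "," in line

print(misleading, explicit)
```

If both characters really must be present, each one needs its own in test, as in the second expression.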

Related

"Replace" from central file?

I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', -1)
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words from an RSS feed using a central "replacement" file. That file should have two columns: old word and new word, like the arguments of the replace function, replace('old','new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete the words that are unnecessary for me, e.g. replace('to',''). But I also want to be able to change special names, e.g. replace('D’Amelio','DAmelio'). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I can't find any help by Googling, but it could well be that I do not know the right terms or cannot formulate the question properly.

with open('<filepath>','r') as r:
    # if you remove the ' marks from around your words, you can remove the [1:-1] part of the below code
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]

def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text

I was unable to understand your question properly but as far as I understand you have strings like cat, dog, etc. and you have a file in which you have data with which you want to replace the string. If this was your requirement, I have given the solution below, so try running it if it satisfies your requirement.
If that's not what you meant, please comment below.
TXT File(Don't use '' around the strings in Text File):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ")  # string you want to replace
with open('textfile.txt', "r") as file1:  # open your file
    lines = file1.readlines()
for line in lines:  # take the lines of the file one by one
    string1 = line.split()  # split the line of the file into a list like ['cat,', 'kitten']
    if your_string == string1[0][:-1]:  # compare your string with the first word (minus its trailing comma)
        your_string = your_string.replace(your_string, string1[1])  # if the input was cat, it is replaced with kitten
        print(your_string)
If you got the correct answer please upvote my answer as it took my time to make and test the python file.
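For the keyword use case in the question, a whole-word lookup may be safer than chained str.replace calls, because replace('to', '') would also eat the "to" inside longer words. The sketch below assumes a comma-separated replacement file with one "old,new" pair per line, where an empty second column means "delete the word". The helper names load_replacements and apply_replacements, and the io.StringIO stand-in for the real file, are hypothetical.

```python
import csv
import io


def load_replacements(handle):
    """Read 'old,new' rows into a dict; an empty 'new' means delete the word."""
    mapping = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        old = row[0].strip()
        new = row[1].strip() if len(row) > 1 else ""
        mapping[old] = new
    return mapping


def apply_replacements(text, mapping):
    """Replace whole words only and drop words that map to an empty string."""
    words = (mapping.get(word, word) for word in text.split())
    return "\n".join(word for word in words if word)


# Stand-in for: open('replacements.csv', newline='', encoding='utf-8')
replacement_file = io.StringIO("to,\nup,\nD’Amelio,DAmelio\n")
mapping = load_replacements(replacement_file)

entries = "truck rental marketplace D’Amelio family launches to invest up to $25M"
result = apply_replacements(entries, mapping)
print(result)
```

Because the text is split on whitespace first, each word is looked up exactly once, so the order of rows in the replacement file does not matter.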

loop through Pandas DF and append values to a list which is a value of a dictionary where conditional value is the key

It is very hard to make a short but descriptive title for this, but I have a dataframe where each row is a character's line, with the entire corpus being the entire show. I want to create a dictionary where the keys are the top characters, loop through the DF, and append each dialogue line to that key's value, which I want as a list.
I have a column called 'Character' and a column called 'dialogue':
Character dialogue
PICARD 'You will agree Data that Starfleets
order are...'
DATA 'Difficult? Simply solve the mystery of
Farpoint Station.'
PICARD 'As simple as that.'
TROI 'Farpoint Station. Even the name sounds
mysterious.'
And so on and so on... There are many minor characters, so I just want the top 10 characters by dialogue count; I have a list of them called main_chars. I want a final dictionary where each character is the key and the value is a huge list of all their lines.
I don't know how to append to an empty list set up as the value for each key. My code thus far is:
char_corpuses = {}
for label, row in df.iterrows():
    for char in main_chars:
        if row['Character'] == char:
            char_corpuses[char] = [row['dialogue']]
But the end result is only the last line each Character says in the corpus:
{'PICARD': [' so five card stud nothing wild and the skys the limit'],
'DATA': [' would you care to deal sir'],
'TROI': [' you were always welcome'],
'WORF': [' agreed'],
'Q': [' youll find out in any case ill be watching and if youre very lucky ill drop by to say hello from time to time see you out there'],
'RIKER': [' of course have a seat'],
'WESLEY': [' i will bye mom'],
'CRUSHER': [' you know i was thinking about what the captain told us about the future about how we all changed and drifted apart why would he want to tell us whats to come'],
'LAFORGE': [' sure goes against everything weve heard about not polluting the time line doesnt it'],
'GUINAN': [' thank you doctor this looks like a great racquet but er i dont play tennis never have']}
How do I stop it from overwriting the earlier lines and keeping only the last line for each character?

Try something like this:
char_corpuses = {}
for char in main_chars:
    # select the rows for this character and turn their dialogue into a plain list
    char_corpuses[char] = df[df['Character'] == char]['dialogue'].tolist()

This line char_corpuses[char] = [row['dialogue']] overwrites the contents of the list with current dialogue line each time the loop runs. It writes a single element rather than appending.
For a 'vanilla' dictionary try:
import pandas

d = {'Character': ['PICARD', 'DATA', 'PICARD'], 'dialogue': ['You will agree Data that Starfleets order are...', 'Difficult? Simply solve the mystery of Farpoint Station.', 'As simple as that.']}
df = pandas.DataFrame(data=d)
main_chars = ['PICARD', 'DATA']
char_corpuses = {}
for label, row in df.iterrows():
    for char in main_chars:
        if row['Character'] == char:
            try:
                # Try to append the current dialogue line to the list
                char_corpuses[char].append(row['dialogue'])
            except KeyError:
                # The key doesn't exist yet, create an empty list for the key [char]
                char_corpuses[char] = []
                char_corpuses[char].append(row['dialogue'])
Output
{'PICARD': ['You will agree Data that Starfleets order are...', 'As simple as that.'], 'DATA': ['Difficult? Simply solve the mystery of Farpoint Station.']}
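The try/except above can also be avoided with collections.defaultdict, which creates the empty list the first time a key is seen. This is only a sketch of the append pattern: it uses a plain list of (character, dialogue) tuples as a stand-in for the DataFrame rows, which with pandas could come from something like zip(df['Character'], df['dialogue']).

```python
from collections import defaultdict

# Stand-in rows; in the question these come from the DataFrame.
rows = [
    ("PICARD", "You will agree Data that Starfleets order are..."),
    ("DATA", "Difficult? Simply solve the mystery of Farpoint Station."),
    ("PICARD", "As simple as that."),
    ("TROI", "Farpoint Station. Even the name sounds mysterious."),
]
main_chars = {"PICARD", "DATA"}

# Missing keys start out as an empty list, so append always works.
char_corpuses = defaultdict(list)
for character, dialogue in rows:
    if character in main_chars:
        char_corpuses[character].append(dialogue)

print(dict(char_corpuses))
```

Testing membership with character in main_chars also removes the inner loop over main_chars.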

TopHowmany = 10  # This you can change as you want.
subDF = df[df.Character.isin(df.Character.value_counts()[0:TopHowmany].index)]
char_corpuses = {}
for x in subDF.index:
    char = subDF.loc[x, 'Character']
    dialogue = subDF.loc[x, 'dialogue']
    if char in char_corpuses:
        char_corpuses[char].append(dialogue)  # append the line itself, not the string 'dialogue'
    else:
        char_corpuses[char] = [dialogue]

String Cutting with multiple lines

I'm new to Python, apart from some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the plain/text in a string. This text contains a registering form. So a Stefan, living in Maple Street, London working for the Company "MultiVendor XXVideos" has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put in a .csv row with column
"Name", "Adress", "Company",...
Now i tried to cut and slice everything. For debugging i use "print"(IDE = KATE/KDE + terminal... :-D ).
Problem is, that the data contains multiple lines after keywords but i only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name,"w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel',fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0] # only want text/plain.. Ill split #right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose +"\t"+pmt
                            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv
Thank you, I am still trying to understand every step, but just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The emails contain about 20 keywords. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?

I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            # continuation line: it belongs to the value of the current key
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]

You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) Your process is already complex enough, and 2) your question really boils down to how to process the string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one off job done
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] # take the second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1] # take the second element of the list
address = content.split("Company_NAME:")[0] # take content before "Company_NAME"
company = content.split("Company_NAME:")[1] # take the second element of the list (the remainder), which is the company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make "." match across multiple lines, you would use the re.DOTALL option (re.MULTILINE only changes how ^ and $ behave). So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)
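The stray "=" at line ends and the "=20" in the question look like quoted-printable encoding, so decoding the payload first is probably cleaner than replacing "=" by hand. The sketch below assumes that is the encoding, uses a made-up byte string as a stand-in for the message payload, and then applies a regex with re.DOTALL to capture the multi-line values. With the mailbox module, part.get_payload(decode=True) should perform the same decoding.

```python
import quopri
import re

# Made-up stand-in for a quoted-printable text/plain payload.
raw = (
    b"Name_OF_=\n"
    b"Person: Stefan\n"
    b"Adress_HOME: London, Maple\n"
    b"Street\n"
    b"45\n"
    b"Company_NAME: MultiVendor\n"
    b"XXVideos\n"
)

# Decoding removes the soft line breaks ("=" at end of line) and =XX escapes.
text = quopri.decodestring(raw).decode("utf-8")

# re.DOTALL lets "." match newlines, so each value may span several lines.
pattern = re.compile(
    r"Name_OF_Person:(.*?)Adress_HOME:(.*?)Company_NAME:(.*)", re.DOTALL
)
match = pattern.search(text)

# Collapse the line breaks and extra spaces inside each captured value.
name, address, company = (" ".join(group.split()) for group in match.groups())
print(name, address, company, sep=" | ")
```

The three values can then go straight into writer.writerow as a dictionary keyed by the CSV field names.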

python - How to extract strings from each line in text file?

I have a text file that detects the amount of monitors that are active.
I want to extract specific data from each line and include it in a list.
The text file looks like this:
[EnumerateDevices]: Enumerating Devices.
DISPLAY\LGD03D7\4&ACE0355&1&UID68092928 : Generic PnP Monitor
DISPLAY\ABCF206\4&ACE0355&1&UID51249920 : Generic PnP Monitor
//
// here can be more monitors...
//
2 matching device(s) found.
I need to get the number after the UID in the middle of the text: 68092928, 51249920, ...
I thought of doing the following:
a. read each line of the text
b. see if the "UID" string exists
c. if it exists: split (here I don't know how to do it: split by " " or by "&"?)
Is there a good approach you can advise? I don't understand how I can get the numbers after the UID (for example, if the next number is longer than the previous ones).
How can I write a command that does: "if you see the UID string, get all the data until you see the first blank"?
any idea?
Thanks

I would use a regular expression to extract the UID, e.g.
import re
regexp = re.compile(r'UID(\d+)')
# raw string, so the backslashes in the device paths are kept literally
file = r"""[EnumerateDevices]: Enumerating Devices.
DISPLAY\LGD03D7\4&ACE0355&1&UID68092928 : Generic PnP Monitor
DISPLAY\ABCF206\4&ACE0355&1&UID51249920 : Generic PnP Monitor
//
// here can be more monitors...
//
2 matching device(s) found."""
print(re.findall(regexp, file))

Use regular expressions:
import re
p = re.compile(r'.*UID(\d+)')
with open('infile') as infile:
    for line in infile:
        m = p.match(line)
        if m:
            print(m.group(1))  # the digits captured after UID

You can use the split() method.
s = "hello this is a test"
words = s.split(" ")
print(words)
The output of the above snippet is a list containing: ['hello', 'this', 'is', 'a', 'test']
In your case, you can split on the substring "UID" and grab the second element in the list to get the number that you're looking for.
See docs here: https://docs.python.org/2/library/string.html#string.split

This is a bit esoteric but does the trick with some list comprehension:
[this.split("UID")[1].split()[0] for this in txt.split("\n") if "UID" in this]
the output is the list you are looking for I presume: ['68092928', '51249920']
Explanations:
split the text into rows (split("\n"))
select only rows with UID inside (for this in ... if "UID" in this)
in the remaining rows, split using "UID".
You want to keep only one element after UID hence the [1]
The resulting string contains the id and some text separated by a space so, we use a second split(), defaulting to spaces.

>>> for line in s.splitlines():
...     line = line.strip()
...     if "UID" in line:
...         tmp = line.split("UID")
...         uid = tmp[1].split(':')[0]
...         print("UID " + uid)
...
UID 68092928
UID 51249920

You can use the find() method:
start = line.find('UID')
if start != -1:
    # skip the 3 characters of 'UID' and keep everything up to the first blank
    print(line[start + 3:].split()[0])
Docs https://docs.python.org/2/library/string.html#string.find

If you read the whole file at once, use the code below; if you read it line by line, just change the first line to line.split().
for elem in file.split():
    if 'UID' in elem:
        print(elem.split('UID')[1])
The split will already have stripped the "junk", so each elem that contains the 'UID' string is ready to be passed to int() or printed as a string.
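Pulling the approaches above together, here is a small Python 3 sketch that scans lines one at a time and collects every UID number. The function name extract_uids is hypothetical, and the sample text is the one from the question with the comment lines left out.

```python
import re

# One or more digits directly after the literal text "UID".
UID_PATTERN = re.compile(r"UID(\d+)")


def extract_uids(lines):
    """Return the digits following 'UID' for every line that contains one."""
    uids = []
    for line in lines:
        match = UID_PATTERN.search(line)
        if match:
            uids.append(match.group(1))
    return uids


# Raw string so the backslashes in the device paths are kept literally.
sample = r"""[EnumerateDevices]: Enumerating Devices.
DISPLAY\LGD03D7\4&ACE0355&1&UID68092928 : Generic PnP Monitor
DISPLAY\ABCF206\4&ACE0355&1&UID51249920 : Generic PnP Monitor
2 matching device(s) found."""

uids = extract_uids(sample.splitlines())
print(uids)
```

Because \d+ matches as many digits as are present, a longer or shorter UID needs no change to the pattern. An open file object can be passed instead of sample.splitlines().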

retrieving name from number ID

I have a code that takes data from online where items are referred to by a number ID, compared data about those items, and builds a list of item ID numbers based on some criteria. What I'm struggling with is taking this list of numbers and turning it into a list of names. I have a text file with the numbers and corresponding names but am having trouble using it because it contains multi-word names and retains the \n at the end of each line when i try to parse the file in any way with python. the text file looks like this:
number name\n
14 apple\n
27 anjou pear\n
36 asian pear\n
7645 langsat\n
I have tried split(), as well as replacing the white space between with several difference things to no avail. I asked a question earlier which yielded a lot of progress but still didn't quite work. The two methods that were suggested were:
d = dict()
f=open('file.txt', 'r')
for line in f:
    number, name = line.split(None,1)
    d[number] = name
this almost worked but still left me with the \n so if I call d['14'] i get 'apple\n'. The other method was:
import re
f=open('file.txt', 'r')
fr=f.read()
r=re.findall("(\w+)\s+(.+)", fr)
this seemed to have gotten rid of the \n at the end of every name but leaves me with the problem of having a tuple with each number-name combo being a single entry so if i were to say r[1] i would get ('14', 'apple'). I really don't want to delete each new line command by hand on all ~8400 entries...
Any recommendations on how to get the corresponding name given a number from a file like this?

In your first method change the line d[number] = name to d[number] = name[:-1]. This simply strips off the last character, and should remove your \n. Using name.rstrip('\n') instead is safer if the last line of the file has no trailing newline.

names = {}
with open("id_file.txt") as inf:
    header = next(inf, '')  # skip header row
    for line in inf:
        id, name = line.split(None, 1)
        names[int(id)] = name.strip()
names[27] # => 'anjou pear'
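To go from the list of ID numbers mentioned in the question to a list of names, the same dictionary can be used in a list comprehension. This is a sketch only: load_names, selected_ids, and the "<unknown>" fallback are hypothetical, and io.StringIO stands in for the real text file.

```python
import io


def load_names(handle):
    """Build {id: name} from lines like '27 anjou pear', skipping the header row."""
    names = {}
    next(handle, None)  # skip the "number name" header row
    for line in handle:
        parts = line.split(None, 1)  # split on the first run of whitespace only
        if len(parts) == 2:  # ignore blank or malformed lines
            names[int(parts[0])] = parts[1].strip()
    return names


# Stand-in for: open('id_file.txt')
id_file = io.StringIO(
    "number name\n14 apple\n27 anjou pear\n36 asian pear\n7645 langsat\n"
)
names = load_names(id_file)

selected_ids = [27, 7645, 99]  # 99 is deliberately missing from the file
selected_names = [names.get(item_id, "<unknown>") for item_id in selected_ids]
print(selected_names)
```

Using names.get with a default avoids a KeyError when an ID from the online data is not in the text file.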

Use this to modify your first approach:
raw_dict = dict()
cleaned_dict = dict()
Assuming you've imported the file into a dictionary:
raw_dict = {14:"apple\n",27:"anjou pear\n",36 :"asian pear\n" ,7645:"langsat\n"}
for keys in raw_dict:
    cleaned_dict[keys] = raw_dict[keys][:len(raw_dict[keys])-1]
So now, cleaned_dict is equal to:
{27: 'anjou pear', 36: 'asian pear', 7645: 'langsat', 14: 'apple'}
