How to save regular expression objects in a dictionary? - python

For a first-semester task I'm supposed to write a script that finds first and last names in a file and displays them in the following order (last name, first name) next to the original entry (first name, last name).
The file has one entry per line which looks as follows: "Srđa Slobodan ĐINIC POPOVIC".
My questions are probably basic but I'm stuck:
How can I save all the entries of the file in a hash (multi-part first names/multi-part last names)? With re.compile() and re.search() I only manage to get one result. With re.findall() I get all of them, but can't call .group() on the results, and I get encoding errors.
How can I connect the original name entry (last name/first name) to the new entry (first name/last name)?
import re, codecs

file = codecs.open('FILE.tsv', encoding='utf-8')
test = file.read()
list0 = test.rstrip()
for word in list0:
    p = re.compile('(([A-Z]+\s\-?)+)')
    u = re.compile('((\(?[A-Z][a-z]+\)?\s?-?\.?)+)')
    hash1 = {}
    hash1[p.search(test).group()] = u.search(test).group()
    hash2 = {}
    hash2[u.search(test).group()] = p.search(test).group()
    print hash1, '\t', hash2
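This question has no answer in the thread, but here is a rough sketch of one possible approach (my addition; only the file name comes from the post, everything else is an assumption): since the last name is the trailing run of all-upper-case words, you can split each line without a regex at all, which also sidesteps the encoding problem that [A-Z] never matches letters like Đ:
# -*- coding: utf-8 -*-
import codecs

first_to_last = {}   # 'Srđa Slobodan' -> 'ĐINIC POPOVIC'
last_to_first = {}   # 'ĐINIC POPOVIC' -> 'Srđa Slobodan'

with codecs.open('FILE.tsv', encoding='utf-8') as f:
    for line in f:
        words = line.strip().split()
        if not words:
            continue
        # walk back over the trailing all-caps words: that's the last name
        i = len(words)
        while i > 0 and words[i - 1].isupper():
            i -= 1
        first, last = u' '.join(words[:i]), u' '.join(words[i:])
        first_to_last[first] = last
        last_to_first[last] = first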

Related

Problem skipping line whilst iterating using previous line and current line comparison

I have a list of sorted data arranged so that each item in the list is a csv line to be written to file.
The final step of the script checks the contents of each field and if all but the last field match then it will copy the current line's last field onto the previous line's last field.
Once I've found and processed one of these matches, I would like to skip the current line (the one the field was copied from), leaving only one of the lines.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
            if part_duplicate_flag is False:
                f.write(previous_line)
        previous_line = line
At the moment the script adds the joined line but doesn't skip the next line. I've tried various renditions of continue statements after part_duplicate_line is written to file, but to no avail.
Looks like you want one entry for each combination of the first 4 fields.
You can use a dict to aggregate the data:
# First we extract the keys and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))

# Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)

# Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
# Output is ['field1,field2,field3,field4,something else']
# Each entry of this list can be output to the csv file
I propose to encapsulate what you want to do in a function whose important part obeys this logic:
either join the new info onto the old record
or output the old record and forget it
and of course, at the end of the loop there is in any case a dangling old record to output.
def join(inp_fname, out_fname):
    '''Input file contains sorted records; when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' ' + new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter because we use only -1 to index our records).
The 1st one has a "regular" last record
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets, as I have done before posting, in the environment where you have defined join you should get what you want.
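For reference (worked through by hand; this expected output is my addition, not part of the original answer), a1.csv should come out as:
1,1,2 3
1,2,0
1,3,1 2
3,3,0
and b1.csv as:
1,1,2 3
1,2,0
1,3,1 2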

Apply a function to a specific expression for every line in a file

I am currently writing the contents of a file to a new file for every instance where a line fits specific criteria. See the code below:
from string import punctuation
fpath = open('Redshift_twb_1.txt', 'r')
lines = fpath.readlines()
fpath_write = open('Redshift_1_new.txt', 'w+')
# filter the list; with the string 'apple'
# replace 'apple' with whatever string you want to find
temp_out_lines = [line for line in lines if '<column caption' in line]
out_lines = [line for line in temp_out_lines if 'param-domain-type' not in line]
# Lambda function that maps .lower() function to every element of the list out_lines
lower_lines = map(lambda x:x.lower(), out_lines)
# Join the lines into a single string
output = '\n'.join(lower_lines)
# write it
fpath_write.write(output)
fpath.close()
fpath_write.close()
My goal is to implement functionality that can take a line and lowercase a specific parameter before that line is written to the new file.
Currently, the process takes in a line, checks if it matches <column caption, then checks that it does not contain param-domain-type, and if both of those pass, the line is added to the new txt file.
An example line is below:
<column caption='Section' datatype='string' name='[SECTION]' role='dimension' type='nominal'>
The goal is to check every line before it is added to the new txt file, and for every instance of name='[****]', make the value within the [] lowercase. Currently they are upper case.
Note: only the value within the []'s for the name= param can be lowercased. There are other params in the line that must stay capitalized.
Thanks!
Edit: Another option would be to do a makeshift find-and-replace that would find all instances of name='[ABC]' and replace them with name='[abc]'. But still, I do not know how to go about this on my own.
Edit2: Upon implementing the regex, I have also used a for loop to loop through every line of the txt file... see the code below.
for x in range(len(out_lines)):
    print(out_lines[x])
    test = str(out_lines[x])
    out_lines[x] = re.sub(r"(name='([.*?])')", lambda m: m.group(1).lower(), test)
    print(out_lines[x])
However when I do so I still get the same output:
<column caption='Location' datatype='string' name='[MANAGEMENT_LOCATION]' role='dimension' type='nominal' />
<column caption='Location' datatype='string' name='[MANAGEMENT_LOCATION]' role='dimension' type='nominal' />
You can use the re module to replace the necessary substring:
import re
re.sub(r"(name='(\[.*?\])')", lambda m: m.group(1).lower(), <YOUR TEXT>)

extract data at specific columns in a line if there is any data at them

I have a file with lines of data like below. I need to pull out the characters at columns 74-79 and 122-124; some lines will not have any characters at 74-79, and I want to skip those lines.
import re

def main():
    file = open("CCDATA.TXT", "r")
    lines = file.readlines()
    file.close()
    for line in lines:
        lines = re.sub(r" +", " ", line)
        print(lines)

main()
CF214L214L1671310491084111159 Customer Name 46081 171638440 0000320800000000HCCCIUAW 0612170609170609170300000000003135
CF214L214L1671310491107111509 Customer Name 46144 171639547 0000421200000000DRNRIUAW 0612170613170613170300000000003135
CF214L214L1671380999999900002000007420
CF214L214L1671310491084111159 Customer Name 46081 171638440 0000320800000000DRCSIU 0612170609170609170300000000003135
CF214L214L1671380999999900001000003208
CF214L214L1671510446646410055 Customer Name 46436 171677320 0000027200000272AA 0616170623170623170300000050003001
CF214L214L1671510126566110169 Customer Name 46450 171677321 0000117900001179AA 0616170623170623170300000250003001
CF214L214L1671510063942910172 Customer Name 46413 171677322 0000159300001593AA 0616170623170623170300000150003001
CF214L214L1671510808861010253 Customer Name 46448 171677323 0000298600002986AA 0616170623170623170300000350003001
CF214L214L1671510077309510502 Customer Name 46434 171677324 0000294300002943AA 0616170622170622170300000150003001
CF214L214L1671580999999900029000077728
CF214L214L1671610049631611165 Customer Name 46221 171677648 0000178700000000 0616170619170619170300000000003000
CF214L214L1671610895609911978 Customer Name 46433 171677348 0000011800000118AC 0616170622170622170300000150003041
CF214L214L1671680999999900002000001905
Short answer:
Just take line[74:79] and such, as Roelant suggested. Since the lines in your input are always 230 chars long though, there'll never be an IndexError, so you rather need to check whether the result is all whitespace with str.isspace():
field = line[74:79]
<...>
if field.isspace(): continue
A more robust approach that would also validate the input (check if you're required to do so) is to parse the entire line and use a specific element from the result.
One way is a regex, as per "Parse a text file and extract a specific column", "Tips for reading in a complex file - Python", and the example at "get the path in a file inside {} by python".
But for your specific format, which appears to be an archaic punchcard-derived one where the column number defines the datum's meaning, the format can probably be more conveniently expressed as a sequence of column ranges associated with field names (you never told us what they mean, so I'm using generic names):
fields = [
    ("id1", (0, 39)),
    ("cname_text", (40, 73)),
    ("num2", (74, 79)),
    ("num3", (96, 105)),
    # whether to introduce a separate field at [122:125]
    # or parse "id4" further after getting it is up to you.
    # I'd suggest you follow the official format spec.
    ("id4", (106, 130)),
    ("num5", (134, 168)),
]
line_end = 230
And parsed like this:
def parse_line(line, fields, end):
    result = {}
    # for whitespace validation
    # prev_ecol = 0
    for fname, (scol, ecol) in fields:
        # optionally validate delimiting whitespace
        # assert prev_ecol == scol or line[prev_ecol:scol].isspace()
        # lines in the input are always `end' symbols wide, so IndexError
        # will never happen for a valid input
        field = line[scol:ecol]
        # optionally do conversion and such, this is completely up to you
        field = field.rstrip(' ')
        if not field: field = None
        result[fname] = field
        # for whitespace validation
        # prev_ecol = ecol
    # optionally validate line end
    # assert ecol == end or line[ecol:end].isspace()
    return result
All that's left is to skip lines where a required field is empty:
for line in lines:
    data = parse_line(line, fields, line_end)
    if any(data[fname] is None for fname in ('num2', 'id4')): continue
    # handle the data
def read_all_lines(filename='CCDATA.TXT'):
    with open(filename, "r") as file:
        for line in file:
            try:
                first = line[74:79]
                second = line[122:124]
            except IndexError:
                continue # skip line
            else:
                do_something_with(first, second)
Edit: Thanks for commenting, apparently it should have been:
for line in file:
    first = line[74:79]
    second = line[122:124]
    if set(first) != set(' ') and set(second) != set(' '):
        do_something_with(first, second)
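A side note on that check (my addition, not part of the original answer): set(first) != set(' ') lets an empty string through, which is what the slices return when a line is too short to reach those columns. A str.strip() test covers both cases:
for line in file:
    first = line[74:79]
    second = line[122:124]
    # .strip() rejects both all-blank fields and empty slices from short lines
    if first.strip() and second.strip():
        do_something_with(first, second)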

Google search from python app

I'm trying to take an input file, read each line, search Google with that line, and print the search results from the query. I get the first search result, which is from Wikipedia, which is great, but then I get the error:
File "test.py", line 24, in <module>
    dictionary[str(lineToRead)].append(str(i))
KeyError: 'mouse'
The input file pets.txt looks like this:
cat
dog
bird
mouse
inputFile = open("pets.txt", 'r') # Makes File object
outputFile = open("results.csv", "w")
dictionary = {} # Our "hash table"
compare = "https://en.wikipedia.org/wiki/" # urls will compare against this string

for line in inputFile.read().splitlines():
    # ---- testing ---
    print line
    lineToRead = line

inputFile.close()

from googlesearch import GoogleSearch
gs = GoogleSearch(lineToRead)
#gs.results_per_page = 5
#results = gs.get_results()

for i in gs.top_urls():
    print i # check to make sure this is printing out url's
    compare2 = i
    if compare in compare2: # compare the two url's
        dictionary[str(lineToRead)].append(str(i)) #write out query string to dictionary key & append the urls

for i in dictionary:
    print i
    outputFile.write(str(i))
    for j in dictionary[i]:
        print j
        outputFile.write(str(j))

#outputFile.write(str(i)) #write results for the query string to the results file.
#to check if hash works print key /n print values /n print : /n print /n
#-----------------------------------------------------------------------------
#-----------------------------------------------------------------------------
Jeremy Banks is right. If you write dictionary[str(lineToRead)].append(str(i)) without first initializing a value for dictionary[str(lineToRead)], you will get an error.
It looks like you have an additional bug: the value of lineToRead will always be mouse, since you have already looped through and closed your input file before searching for anything. Likely, you want to loop through every word in inputFile (i.e. cat, dog, bird, mouse).
To fix this, we can write the following (assuming you want to keep a list of matching urls as values in the dictionary for each search term; note the search itself must also move inside the loop):
from googlesearch import GoogleSearch

for line in inputFile.read().splitlines(): # loop through each line in input file
    lineToRead = line
    dictionary[str(lineToRead)] = [] # initialize to empty list
    gs = GoogleSearch(lineToRead) # search for the current word inside the loop
    for i in gs.top_urls():
        print i # check to make sure this is printing out url's
        compare2 = i
        if compare in compare2: # compare the two url's
            dictionary[str(lineToRead)].append(str(i)) # append matching urls under the query string key
inputFile.close()
You can delete the for loop you wrote for 'testing' the inputFile.
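As an aside (my addition, not part of the original answer): collections.defaultdict removes the need to initialize the empty list yourself, since missing keys are created on first access:
from collections import defaultdict

dictionary = defaultdict(list)  # missing keys start out as empty lists
# ... then inside the loop, appending just works:
# dictionary[str(lineToRead)].append(str(i))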

retrieving name from number ID

I have code that takes data from online, where items are referred to by a number ID, compares data about those items, and builds a list of item ID numbers based on some criteria. What I'm struggling with is taking this list of numbers and turning it into a list of names. I have a text file with the numbers and corresponding names, but am having trouble using it because it contains multi-word names and retains the \n at the end of each line when I try to parse the file in any way with Python. The text file looks like this:
number name\n
14 apple\n
27 anjou pear\n
36 asian pear\n
7645 langsat\n
I have tried split(), as well as replacing the whitespace in between with several different things, to no avail. I asked a question earlier which yielded a lot of progress but still didn't quite work. The two methods that were suggested were:
d = dict()
f = open('file.txt', 'r')
for line in f:
    number, name = line.split(None, 1)
    d[number] = name
This almost worked but still left me with the \n, so if I call d['14'] I get 'apple\n'. The other method was:
import re
f=open('file.txt', 'r')
fr=f.read()
r=re.findall("(\w+)\s+(.+)", fr)
This seemed to have gotten rid of the \n at the end of every name, but leaves me with the problem of having a tuple with each number-name combo as a single entry, so if I were to say r[1] I would get ('14', 'apple'). I really don't want to delete each newline by hand on all ~8400 entries...
Any recommendations on how to get the corresponding name given a number from a file like this?
In your first method, change the line d[number] = name to d[number] = name[:-1]. This simply strips off the last character, and should remove your \n.
names = {}
with open("id_file.txt") as inf:
    header = next(inf, '') # skip header row
    for line in inf:
        id, name = line.split(None, 1)
        names[int(id)] = name.strip()

names[27] # => 'anjou pear'
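As a side note (my addition, not part of the original answers): the re.findall approach from the question already yields (number, name) pairs, so it can feed dict() directly; slicing off the first pair drops the header row:
import re

f = open('file.txt', 'r')
pairs = re.findall(r"(\w+)\s+(.+)", f.read())  # (.+) stops at the newline, so names come out clean
d = dict(pairs[1:])  # skip the ('number', 'name') header pair
d['14']  # => 'apple'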
Use this to modify your first approach:
raw_dict = dict()
cleaned_dict = dict()
Assuming you've imported the file into a dictionary:
raw_dict = {14: "apple\n", 27: "anjou pear\n", 36: "asian pear\n", 7645: "langsat\n"}
for keys in raw_dict:
    cleaned_dict[keys] = raw_dict[keys][:len(raw_dict[keys])-1]
So now, cleaned_dict is equal to:
{27: 'anjou pear', 36: 'asian pear', 7645: 'langsat', 14: 'apple'}
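One caveat (my addition): slicing off the last character assumes every value really ends in '\n'; if the final line of the file has no trailing newline, it chops off a real letter. rstrip('\n') (or .strip(), as the second answer uses) is safe either way:
for keys in raw_dict:
    cleaned_dict[keys] = raw_dict[keys].rstrip('\n')  # no-op when there's nothing to strip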
