Searching in a text file in a nested loop in Python - python

I have a 4-column tab-separated text file. I also have a list of values which need to be iterated through and searched for in the text file, in order to get the value of one of the columns.
Here's my code (Python 2.7):
def populate_data():
    file = open('file.txt', 'r')
    values = ['value1', 'value2', 'value3']
    secondary_values = ['second_value1', 'second_value2', 'second_value3']
    os = 'iOS'
    i = 0
    outputs = []
    while i < len(values):
        value = values[i]
        secondary_value = secondary_values[i]
        output = lookup(file, os, value, secondary_value)
        if output != None:
            outputs.append(output)
        i += 1
def lookup(file, input_os, input_value, input_secondary_value):
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) != 4:
            continue
        else:
            value = columns[0]
            secondary_value = columns[1]
            os = columns[2]
            output = columns[3]
            if input_os == os and input_value == value and input_secondary_value == secondary_value:
                return output
The search basically should work as this SQL statement:
SELECT output FROM data_set WHERE os='os' AND value='value' and secondary_value='secondary_value'
The problem I'm experiencing is that the lookup method is called inside the while loop and itself contains a for loop, and the parent while loop apparently doesn't wait for the inner loop to finish and return the value before continuing. As a result, even when there is a match, the data is not returned. If this were JavaScript I would handle it with Promises, but I'm not sure how to achieve that in Python.
Any clues how this could be solved?

The correct thing to do here was to read the file and insert all of the rows into a dict like so:
dc = dict()
dc[value+secondary_value+os] = output
Then accessing the values in the main while loop.
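A minimal sketch of that idea, assuming the same file.txt layout as above (four tab-separated columns) and using a tuple key instead of the concatenated string key (either works, as long as lookups use the same form); build_index is just an illustrative helper name:

def build_index(path):
    # Read the file once and index each row by (value, secondary_value, os).
    index = {}
    with open(path) as f:
        for line in f:
            columns = line.strip().split('\t')
            if len(columns) != 4:
                continue
            value, secondary_value, os_name, output = columns
            index[(value, secondary_value, os_name)] = output
    return index

def populate_data():
    index = build_index('file.txt')
    values = ['value1', 'value2', 'value3']
    secondary_values = ['second_value1', 'second_value2', 'second_value3']
    os_name = 'iOS'
    outputs = []
    for value, secondary_value in zip(values, secondary_values):
        output = index.get((value, secondary_value, os_name))
        if output is not None:
            outputs.append(output)
    return outputs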

Related

How to count the changes done in new csv file compared to the previous

We have two csv files - new.csv and old.csv.
old.csv contains four rows:
abc done
xyz done
pqr done
rst pending
The new.csv contains four new rows:
abc pending
xyz not_done
pqr pending
rst done
I need to count two things, without using pandas:
count1 = number of entries changed from done to pending = 2 (abc, pqr)
count2 = number of entries changed from done to not_done = 1 (xyz)
CASE 1: CSV Files are in the same order
Firstly import the two files into python lists:
oldcsv = []
with open("old.csv") as f:
    for line in f:
        oldcsv.append(line.strip().split(","))

newcsv = []
with open("new.csv") as f:
    for line in f:
        newcsv.append(line.strip().split(","))
Now you would simply iterate through both lists simultaneously, using zip(). I am assuming that both CSV files list the entries in the same order.
count1 = 0
count2 = 0

for oldentry, newentry in zip(oldcsv, newcsv):
    assert(oldentry[0] == newentry[0])  # Throw error if entry names do not match
    if oldentry[1] == "done":
        if newentry[1] == "pending":
            count1 += 1
        elif newentry[1] == "not_done":
            count2 += 1
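After the loop, count1 and count2 hold the totals; a quick way to inspect them (assuming the comma-separated file format discussed below):

print(count1)  # expected 2 (abc, pqr) for the sample data
print(count2)  # expected 1 (xyz) for the sample data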
CASE 2: CSV Files are in arbitrary order
Here, since you are going to need to look up entries by their names, I would use a dictionary rather than a list to store the old.csv data, mapping the entry names to their values:
# Load old.csv data into a dictionary mapping entry_name: entry_value
old_values = {}
with open("old.csv") as f:
    for line in f:
        old_entry = line.strip().split(",")
        entry_name, old_entry_value = old_entry[0], old_entry[1]
        old_values[entry_name] = old_entry_value

count1 = 0
count2 = 0

with open("new.csv") as f:
    for line in f:
        # For each entry in new.csv, look up the corresponding old entry in old_values and compare their values.
        new_entry = line.strip().split(",")
        entry_name, new_entry_value = new_entry[0], new_entry[1]
        old_entry_value = old_values.get(entry_name)  # Get the old value for this entry (None if there is no old entry)
        # Essentially the same comparison as before:
        print(f"{entry_name}: old entry status is {old_entry_value} and new entry status is {new_entry_value}")
        if old_entry_value == "done":
            if new_entry_value == "pending":
                print("Incrementing count1")
                count1 += 1
            elif new_entry_value == "not_done":
                print("Incrementing count2")
                count2 += 1

print(count1)
print(count2)
This should work, as long as the input data is properly formatted. I am assuming each .csv file has one entry per line, and each line begins with the entry name (e.g. "abc"), then a comma, then the entry value (e.g. "done","not_done").
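Under that assumption, the sample data from the question would be stored as comma-separated lines, i.e. old.csv would contain:

abc,done
xyz,done
pqr,done
rst,pending

and new.csv:

abc,pending
xyz,not_done
pqr,pending
rst,done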
Here is a straightforward pure-Python implementation:
import csv

with open("old.csv") as old_fl:
    with open("new.csv") as new_fl:
        old = csv.reader(old_fl)
        new = csv.reader(new_fl)
        old_rows = [row for row in old]
        new_rows = [row for row in new]

# see if this is really needed
assert len(old_rows) == len(new_rows)
n = len(old_rows)

# assume that left key is identical,
# and in the same order in both files
assert all(old_rows[i][0] == new_rows[i][0] for i in range(n))

# once the data is guaranteed to align,
# just count what you want
done_to_pending = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "pending"
]
done_to_notdone = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "not_done"
]
It uses the python native csv reader so you don't need to parse the csv yourself. Note that there are various assumptions (assert statements) throughout the code - you might need to adjust the code to handle more cases.
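If only the two counts are needed rather than the row descriptions, the lengths of those lists give them directly:

count1 = len(done_to_pending)   # done -> pending
count2 = len(done_to_notdone)   # done -> not_done
print(count1, count2)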

Problem skipping line whilst iterating using previous line and current line comparison

I have a list of sorted data arranged so that each item in the list is a csv line to be written to file.
The final step of the script checks the contents of each field and if all but the last field match then it will copy the current line's last field onto the previous line's last field.
Once I've found and processed one of these matches, I would like to skip the current line (the one the field was copied from), thus leaving only one of the lines.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
        if part_duplicate_flag is False:
            f.write(previous_line)
        previous_line = line
At the moment the script adds the joined line but doesn't skip the next line. I've tried various renditions of continue statements after part_duplicate_line is written to the file, but to no avail.
Looks like you want one entry for each combination of the first 4 fields.
You can use a dict to aggregate the data:
# First we extract the keys and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))

# Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)

# Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
# Output is ['field1,field2,field3,field4,something else']
# Each entry of this list can be output to the csv file
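To actually write those joined rows out, a minimal follow-up (assuming the output_table.csv name from the question) could be:

with open('output_table.csv', 'w') as f:
    for row in for_printing:
        f.write(row + '\n')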
I propose to encapsulate what you want to do in a function where the important part obeys this logic:
either join the new info to the old record
or output the old record and forget it
of course at the end of the loop we have in any case a dangling old record to output
def join(inp_fname, out_fname):
    '''Input file contains sorted records; when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' ' + new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter because we use only -1 to index our records).
The 1st one has a "regular" last record
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets (as I did before posting) in the environment where you have defined join, you should get what you want.

Looping through a dictionary in python

I am creating a main function which loops through a dictionary that has one key for all the values associated with it. I am having trouble because I cannot get the dictionary to be all lowercase; I have tried using .lower but to no avail. Also, the program should look at the words of the sentence, determine whether it has seen more of those words in sentences that the user has previously called "happy", "sad", or "neutral" (based on the three dictionaries), and make a guess as to which label to apply to the sentence.
An example output would look like:
Sentence: i started screaming incoherently about 15 mins ago, this is B's attempt to calm me down.
0 appear in happy
0 appear in neutral
0 appear in sad
I think this is sad.
You think this is: sad
Okay! Updating.
CODE:
import csv

def read_csv(filename, col_list):
    """This function expects the name of a CSV file and a list of strings
    representing a subset of the headers of the columns in the file, and
    returns a dictionary of the data in those columns, as described below."""
    with open(filename, 'r') as f:
        # Better convert the reader to a list (items represent every row)
        reader = list(csv.DictReader(f))
        dict1 = {}
        for col in col_list:
            dict1[col] = []
            # Go through every row of the file
            for row in reader:
                # Append this row's value for the current column key
                dict1[col].append(row[col])
    return dict1

def main():
    dictx = read_csv('words.csv', ['happy'])
    dicty = read_csv('words.csv', ['sad'])
    dictz = read_csv('words.csv', ['neutral'])
    dictxcounter = 0
    dictycounter = 0
    dictzcounter = 0
    a = str(raw_input("Sentence: ")).split(' ')
    for word in a:
        for keys in dictx['happy']:
            if word == keys:
                dictxcounter = dictxcounter + 1
        for values in dicty['sad']:
            if word == values:
                dictycounter = dictycounter + 1
        for words in dictz['neutral']:
            if word == words:
                dictzcounter = dictzcounter + 1
    print dictxcounter
    print dictycounter
    print dictzcounter
Remove this line from your code:
dict1 = dict((k, v.lower()) for k,v in col_list)
It overwrites the dictionary that you built in the loop.
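If the goal is still to have everything lowercase, one possible approach (a sketch; lowercase_values is just an illustrative helper name, and it assumes every cell in words.csv is a plain string) is to lowercase the word lists after loading them, and to lowercase the input sentence as well:

def lowercase_values(word_dict):
    # Return a copy of the dict with every value in every list lowercased.
    return dict((key, [v.lower() for v in values]) for key, values in word_dict.items())

dictx = lowercase_values(read_csv('words.csv', ['happy']))
dicty = lowercase_values(read_csv('words.csv', ['sad']))
dictz = lowercase_values(read_csv('words.csv', ['neutral']))

# Lowercase the sentence before splitting so comparisons are case-insensitive:
a = str(raw_input("Sentence: ")).lower().split(' ')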

Splitting or stripping a variable number of characters from a line of text in Python?

I have a large amount of data of this type:
array(14) {
["ap_id"]=>
string(5) "22755"
["user_id"]=>
string(4) "8872"
["exam_type"]=>
string(32) "PV Technical Sales Certification"
["cert_no"]=>
string(12) "PVTS081112-2"
["explevel"]=>
string(1) "0"
["public_state"]=>
string(2) "NY"
["public_zip"]=>
string(5) "11790"
["email"]=>
string(19) "ivorabey#zeroeh.com"
["full_name"]=>
string(15) "Ivor Abeysekera"
["org_name"]=>
string(21) "Zero Energy Homes LLC"
["org_website"]=>
string(14) "www.zeroeh.com"
["city"]=>
string(11) "Stony Brook"
["state"]=>
string(2) "NY"
["zip"]=>
string(5) "11790"
}
I wrote a for loop in Python which reads through the file, creating a dictionary for each array and storing elements like so:
a = 0
data = [{}]

with open("mess.txt") as messy:
    lines = messy.readlines()
    for i in range(1, len(lines)):
        line = lines[i]
        if "public_state" in line:
            data[a]['state'] = lines[i + 1]
        elif "public_zip" in line:
            data[a]['zip'] = lines[i + 1]
        elif "email" in line:
            data[a]['email'] = lines[i + 1]
        elif "full_name" in line:
            data[a]['contact'] = lines[i + 1]
        elif "org_name" in line:
            data[a]['name'] = lines[i + 1]
        elif "org_website" in line:
            data[a]['website'] = lines[i + 1]
        elif "city" in line:
            data[a]['city'] = lines[i + 1]
        elif "}" in line:
            a += 1
            data.append({})
I know my code is terrible, but I am fairly new to Python. As you can see, the bulk of my project is complete. What's left is to strip away the code tags from the actual data. For example, I need string(15) "Ivor Abeysekera" to become Ivor Abeysekera.
After some research, I considered .lstrip(), but since the preceding text is always different, I got stuck.
Does anyone have a clever way of solving this problem? Cheers!
Edit: I am using Python 2.7 on Windows 7.
Depending on how the code tags are formatted, you could split the line on " then pick out the second element.
s = 'string(15) "Ivor Abeysekera"'
temp = s.split('"')[1]
# temp is 'Ivor Abeysekera'
Note that this will get rid of the trailing ", if you need it you can always just add it back on. In your example this would look like:
data[a]['state'] = lines[i + 1].split('"')[1]
# etc. for each call of lines[i + 1]
Because you are calling it so much (regardless of what answer you use) you should probably turn it into a function:
def prepare_data(line_to_fix):
    return line_to_fix.split('"')[1]

# later on...
data[a]['state'] = prepare_data(lines[i + 1])
This will give you some more flexibility.
BAD SOLUTION (based on the current question)
But to answer your question as asked, just use:
info_string = lines[i + 1]
value_str = info_string.split(" ",1)[-1].strip(" \"")
BETTER SOLUTION
Do you have access to the PHP generating that? If you do, just do echo json_encode($data); instead of using var_dump.
If you instead have it output JSON, the output will look like
{"variable": "value", "variable2": "value2"}
You can then read it in like this:
import json
import requests  # assumes the JSON dump is fetched over HTTP

json_str = requests.get("http://url.com/json_dump").text  # or however you get the original text
data = json.loads(json_str)
print data
You should use regular expressions (regex) for this:
http://docs.python.org/2/library/re.html
What you intend to do can be easily done with the following code:
# Import the library
import re
# This is a string just to demonstrate
a = 'string(32) "PV Technical Sales Certification"'
# Create the regex
p = re.compile('[^"]+"(.*)"$')
# Find a match
m = p.match(a)
# Your result will now be in s
s = m.group(1)
Hope this helps!
You can do this statefully by looping across all the lines and keeping track of where you are in a block:
# Map field names to dict keys
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

data = []
current = {}
key = None

with open("mess.txt") as messy:
    for line in messy:  # file objects iterate line by line
        line = line.lstrip()
        if line.startswith('}'):
            data.append(current)
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value
This avoids having to keep track of your position in the file, and also means that you could work across enormous data files (if you process the dictionary after each record) without having to load the whole thing into memory at once. In fact, let's restructure that as a generator that processes blocks of data at a time and yields dicts for you to work with:
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

def dict_maker(fileobj):
    current = {}
    key = None
    for line in fileobj:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value

with open("mess.txt") as messy:
    for d in dict_maker(messy):
        print d
That makes your main loop tiny and understandable: you loop across the potentially enormous set of dicts, one at a time, and do something with them. It totally separates the act of making the dictionaries from the act of consuming them. And since the generator is stateful and only processes one line at a time, you could pass in anything that looks like a file: a list of strings, the output of a web request, input from another program writing to sys.stdin, or whatever.
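For instance, a quick check with an in-memory list of strings instead of a real file (dict_maker only needs something it can iterate line by line; the sample lines here are taken from the data above):

sample_lines = [
    '  ["full_name"]=>',
    '  string(15) "Ivor Abeysekera"',
    '  ["city"]=>',
    '  string(11) "Stony Brook"',
    '}',
]
for d in dict_maker(sample_lines):
    print d  # {'contact': 'Ivor Abeysekera', 'city': 'Stony Brook'} (key order may vary)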

Embedded for loop with regex

import re

def find_string(header, file_1, counter):
    ab = re.compile(str(header))
    for line in file_1:
        if re.search(ab, line) != None:
            print line
            counter += 1
    return counter

file_1 = open("text_file_with_headers.txt", 'r')
header_array = []
header_array.append("header1")
header_array.append("header2")
# ...
counter = 0
for header in header_array:
    counter = find_string(header, file_1, counter)
Every time I run this it searches for only one of the headers and I cannot figure out why.
Because when the loop for line in file_1: has ended for the first header, the file's pointer is at the end of the file. You must move the pointer back to the beginning of the file, which is done with the seek() method. You must add file_1.seek(0, 0) like this:
counter = 0
for header in header_array:
    counter = find_string(header, file_1, counter)
    file_1.seek(0, 0)
EDIT
1) ab is a compiled regex, so you can write ab.search(line)
2) bool(None) is False, so you can write if ab.search(line): with no need for != None
3)
def find_string(header, file_1, counter):
    lwh = re.compile('^.*?' + header + '.*$', re.MULTILINE)
    lines_with_header = lwh.findall(file_1.read())
    print ''.join(lines_with_header)
    return counter + 1
and even
def find_string(header, file_1, counter):
    lwh = re.compile('^.*?' + header + '.*$', re.MULTILINE)
    print ''.join(matline.group() for matline in lwh.finditer(file_1.read()))
    return counter + 1
4)
def find_string(header, file_1):
    lwh = re.compile('^.*?' + header + '.*$', re.MULTILINE)
    lines_with_header = lwh.findall(file_1.read())
    print ''.join(lines_with_header)

file_1 = open("text_file_with_headers.txt", 'r')
header_list = ["header1", "header2"]  # ...
for counter, header in enumerate(header_list):
    find_string(header, file_1)
    file_1.seek(0, 0)
counter += 1  # because counter began at 0
5) You run through file_1 as many times as there are headers in header_list.
You should run through it only once, recording each line that contains one of the headers in a list stored as the value of a dictionary whose keys are the headers. That would be faster.
6) What you call an array is, in Python, a list.
The file object keeps track of your position in the file, and after you've gone through the outer loop once, you're at the end of the file and there are no more lines to read.
If I were you, I would reverse the order in which your loops are nested: I would iterate through the file line by line, and for each line, iterate through the list of strings you want to find. That way, I would only have to read each line from the file once.
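A sketch of that reversed structure, assuming the same text_file_with_headers.txt and a known list of headers; it reads the file once and keeps a per-header count of matching lines:

import re

header_list = ["header1", "header2"]  # ...
patterns = dict((header, re.compile(header)) for header in header_list)
counts = dict((header, 0) for header in header_list)

with open("text_file_with_headers.txt") as file_1:
    for line in file_1:                       # single pass over the file
        for header, pattern in patterns.items():
            if pattern.search(line):          # check every header against this line
                print line.rstrip()
                counts[header] += 1

print counts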
