while iterating, if statement won't evaluate - python

This little snippet of code is my attempt to pull multiple unique values out of rows in a CSV. The CSV header looks something like this:
descr1, fee part1, fee part2, descr2, fee part1, fee part2,
with the descr columns holding many unique names in a single column. I want to take these unique fee names and make a new header out of them. To do this I decided to start by collecting all the different descr column names, so that when I start pulling data from the actual rows I can check whether a row has a fee amount or one of the fee names I need. There are probably a lot of things wrong with this code, but I am a beginner. I really just want to know why my first if statement is never triggered when the l in fin does equal a comma; I know it must at some point, as it writes a comma to my row string. Thanks!
row = ''
header = ''
columnames = ''
cc = ''
#fout = open(","w")
fin = open("raw data.csv","rb")
for l in fin:
    if ',' == l:
        if 'start of cust data' not in row:
            if 'descr' in row:
                columnames = columnames + ' ' + row
                row = ''
            else:
                pass
        else:
            pass
    else:
        row = row + l
    print(columnames)
print(columnames)

When you iterate over a file, you get lines, not characters -- and they have the newline character, \n, at the end. Your if ',' == l: statement will never succeed because even if you had a line with only a single comma in it, the value of l would be ",\n".
I suggest using the csv module: you'll get much better results than trying to do this by hand like you're doing.
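A minimal sketch of that suggestion (assuming the "raw data.csv" name from your code, and that the fee descriptions live in the header row): csv.reader splits each row into a list of fields, so there's no need to scan for commas yourself.
import csv

columnames = []
with open("raw data.csv", newline="") as fin:  # text mode; the csv module handles line endings
    reader = csv.reader(fin)
    header = next(reader)  # the first row of the file
    for name in header:
        if 'descr' in name:  # collect every descr-style column name
            columnames.append(name)
print(columnames)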

Related

Nested lists: Append string in list to list before

As an exercise I want to try analyzing a WhatsApp chat of mine. I opened the .txt file, used reader() and list() on it, and removed the blank lines/lists. The remaining lists have the following format: chat = [[01.01.2019, 12:00 - name1: message1][message2] … ]
I would like to take the lists that only contain messages (not date, time, and name) and merge each of them with the list that comes just before it.
This is how it should look in the end:
chat = [[01.01.2019, 12:00 - name1: message1 message2] … ]
I tried the following loop: if the list does not begin with a number, its content is stored in a variable. But nothing is ever appended, and when the loop is done the variable only holds the last message-only list.
for row in chat:  # add to row before if no date in line
    row = list(row)
    without = ""
    if row[0].isalpha():
        without = row[0]
    else:
        row.append(without)
Thanks in advance :)
Take a complicated task, and break it up into different easy tasks.
This is an example of a generator that reads from a multi-line source, and outputs the actual lines you want, with some formatting to handle newlines.
# this is the condition from your code
def is_new_line(line):
    tokens = list(line)
    if tokens and not tokens[0].isalpha():
        return True
    return False

# this is a generator that takes multiline chats and outputs full rows without newlines
def line_generator(chat):
    row = []
    for line in chat:
        if is_new_line(line):
            if row:
                yield ' '.join(row)
            row = [line.rstrip()]
        else:
            row.append(line.rstrip())
    if row:
        yield ' '.join(row)

# sample data
chat = ['1 one\n', 'two\n', 'three\n', '2 one\n', 'two\n', 'three\n']

# the generator just outputs the rows as you want them defined
for row in line_generator(chat):
    print(row)
Output:
1 one two three
2 one two three

String Cutting with multiple lines

So I'm new to Python, besides some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living in Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with columns
"Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use "print" (IDE = Kate/KDE + terminal... :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain; I'll split right before the HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow({'ID': idea, ......})  # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose + "\t" + pmt
            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing.
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv
Thank you, I am still trying to understand every step, but I just wanted to leave a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The number of keywords in the emails is ~20. Additionally, my values are sometimes line-broken by "=".
I was thinking of something like:
search text for keyword A,
if true:
    search text from keyword A until keyword B
    if true:
        copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
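Those "=" artifacts look like quoted-printable encoding: =20 is an encoded space, =3D an encoded "=", and a bare "=" at the end of a line is a soft line break. Rather than replacing them by hand, you can decode the payload first. A minimal sketch, assuming the payload really is quoted-printable, using the standard quopri module:
import quopri

raw = "Name_OF_=\nPerson: Stefan\nAdress_=\nHOME: London, Maple"
decoded = quopri.decodestring(raw.encode()).decode()
print(decoded)
# Name_OF_Person: Stefan
# Adress_HOME: London, Maple
Alternatively, message.get_payload(decode=True) asks the email library to undo the Content-Transfer-Encoding for you (it returns bytes).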
I would go for a "routine" parsing loop over the input lines, maintaining current_key and current_value variables, since the value for a certain key in your data might be "annoying" and spread across multiple lines.
I've demonstrated such a parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with whitespace, I assumed it must be the case of such an "annoying" value (spread across multiple lines). Such lines are concatenated into a single value, using a configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespace from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespace. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process string data that spans multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done.
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1]  # take second element of the list
person = content.split("Adress_HOME:")[0]      # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1]     # take second element of the list
address = content.split("Company_NAME:")[0]    # take content before "Company_NAME"
company = content.split("Company_NAME:")[1]    # take the remainder, which is the company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make a regex "cut" across multiple lines, you would use the re.DOTALL option, which lets . match newlines (re.MULTILINE only changes how ^ and $ behave). So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL).
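A sketch of what that might look like here (assuming the three field labels always appear in this order): non-greedy groups capture each value, and re.DOTALL lets . run across the line breaks if you skip the replace('\n', ' ') step.
import re

pattern = re.compile(
    r"Name_OF_Person:(.*?)Adress_HOME:(.*?)Company_NAME:(.*)",
    re.DOTALL)
match = pattern.search(content)
if match:
    # strip whitespace (including leftover newlines) from each captured value
    person, address, company = (group.strip() for group in match.groups())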

"\r\n" is ignored at csv file end

import csv

impFileName = []
impFileName.append("file_1.csv")
impFileName.append("file_2.csv")
expFileName = "MasterFile.csv"
l = []
overWrite = False
comma = ","

for f in range(len(impFileName)):
    with open(impFileName[f], "r") as impFile:
        table = csv.reader(impFile, delimiter=comma)
        for row in table:
            data_1 = row[0]
            data_2 = row[1]
            data_3 = row[2]
            data_4 = row[3]
            data_5 = row[4]
            data_6 = row[5]
            dic = {"one": data_1, "two": data_2, "three": data_3, "four": data_4, "five": data_5, "six": data_6}
            for i in range(len(l)):
                if l[i]["one"] == data_1:
                    print("Data, where one = " + data_1 + " has been updated using the data from " + impFileName[f])
                    l[i] = dic
                    overWrite = True
                    break
            if overWrite == False:
                l.append(dic)
            else:
                overWrite = False
    print(impFileName[f] + " has been added to the list 'l'")

with open(expFileName, "a") as expFile:
    print("Master file now being created...")
    for i in range(len(l)):
        expFile.write(l[i]["one"] + comma + l[i]["two"] + comma + l[i]["three"] + comma + l[i]["four"] + comma + l[i]["five"] + comma + l[i]["six"] + "\r\n")
print("Process Complete")
This program takes 2 (or more) .csv files and compares the uniqueID (data_1) of each row to all others. If they match, it assumes the current row is an updated version and overwrites the stored one. If there is no match, then it's a new entry.
I store each row's data in a dictionary, which is then stored in the list "l".
Once all the files have been processed, I output the list "l" to the "MasterFile.csv" in the specified format.
---THE PROBLEM---
The last row of "File_1.csv" and the first row of "File_2.csv" end up on the same line in the output file. I would like it to continue on a new line.
--Visual
...
data_1,data_2,data_3,data_4,data_5,data_6
data_1,data_2,data_3,data_4,data_5,data_6DATA_1,DATA_2,DATA_3,DATA_4,DATA_5,DATA_6
DATA_1,DATA_2,DATA_3,DATA_4,DATA_5,DATA_6
...
NOTE: There are no header rows in any of the .csv files.
I've also tried this using only "\n" at the end of the "expFile.write" - Same result
Just a little suggestion: comparing two files your way looks too expensive. Try using pandas in the following way.
import pandas

data1 = pandas.read_csv("file_1.csv")
data2 = pandas.read_csv("file_2.csv")
# Merging two DataFrames
combinedData = data1.append(data2, ignore_index=True)
# Dropping duplicates
# give the name of the column on which you are comparing the uniqueness
uniqueData = combinedData.drop_duplicates(["columnName"])
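Two details from the question are worth folding into that sketch: the files have no header row (so read with header=None), and a row from the later file should win (so keep="last" when dropping duplicates). A possible adjusted version:
import pandas

data1 = pandas.read_csv("file_1.csv", header=None)  # no header row: columns are numbered 0..5
data2 = pandas.read_csv("file_2.csv", header=None)
combinedData = pandas.concat([data1, data2], ignore_index=True)
# column 0 holds the uniqueID; keep="last" so the updated row survives
uniqueData = combinedData.drop_duplicates([0], keep="last")
uniqueData.to_csv("MasterFile.csv", index=False, header=False)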
I tried running your program and it is OK. Your only problem is in the line
with open(expFileName, "a") as expFile:
where you use "a" (append), so if you run your program again and again it will keep appending to this file.
Use "w" instead of "a".
A'ight guys. I think I made a booboo.
1) Because I was using "a" (append), not "w" (write), at the end, and in my last 2 or 3 tests I'd forgotten to clear the file, I was always looking at the same (top 50 or so) rows. Which meant I'd fixed my bug ages ago but was still looking at the old data...
2) Carriage returns were being read into the last value of the dictionary (data_6), so when those rows were appended to the master file I ended up with "\r\r\n" at the end.
Thanks Vivek Srinivasan for expanding my python knowledge. I will look at pandas and have a play.
Thanks to MarianD for pointing out the "a"/"w" error.
Thanks to Moses Koledoye for pointing out the "\r" error.
Sorry for wasting your time.
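For anyone landing here later, a minimal sketch of a writing step that sidesteps both issues (assuming the list l of dictionaries built above): open with "w" so reruns start fresh, pass newline='' so csv.writer controls the line endings, and strip stray "\r" from the values.
import csv

keys = ["one", "two", "three", "four", "five", "six"]
with open("MasterFile.csv", "w", newline="") as expFile:
    writer = csv.writer(expFile)
    for dic in l:
        writer.writerow([dic[k].strip() for k in keys])  # .strip() drops the trailing \r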

How to Remove Random Text Breaks in HTML Text

I'm looking to scrape some of the text out of a couple HTML documents, but I can't get rid of some of the line breaks. Currently I have Beautiful Soup parsing the web pages, then I read in all the lines and attempt to strip all the newline characters out of the text, but I can't get rid of the ones in the middle of strings. For example,
<font face="ARIAL" size="2">Thomas
H. Lenagh </font>
I'm looking to get the name of this person out on one line, but there is a newline character of some sort in the middle. Here's what I've tried so far:
line=line.replace("\n"," ")
line=line.replace("\\n"," ")
line=line.replace("\r\n", " ")
line=line.replace("\t", " ")
line=line.replace("\\r\\n"," ")
I've also tried the following regular expressions:
line=re.sub("\n"," ",line)
line=re.sub("\\n", " ",line)
line=re.sub("\s\s+", " ",line)
None have worked so far and I'm not sure what character I'm missing. Any ideas?
EDIT: Here's the full code that I'm using (minus error checking):
soup=BeautifulSoup(threePage) #make the soup
paragraph=soup.stripped_strings
if paragraph is not None:
for i in range (len(data)): #for all rows...
lineCounter=lineCounter+1
row =data[i]
row=row.replace("\n"," ") #remove newline (<enter>) characters
row = re.sub("---+"," ",row) #remove dashed lines
row =re.sub(","," ",row) #replace commas with spaces
row=re.sub("\s\s+", " ",row) #remove
if ("/s/" in row): #if /s/ is in the row, remove it
row=re.sub(".*/s/"," ",row)
if ("/S/" in row): #upper case of the last removal
row=re.sub(".*/S/"," ",row)
row = row.replace("\n"," ")
row=row.strip()#remove any weird characters
You haven't shared what the rest of your code looks like after the for loop, but I'm guessing a very simplified version is something like:
data = ["a\nb", "c\nd", "e\nf"]
for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")

# let's see if that fixed it...
print(data)
# output: ['a\nb', 'c\nd', 'e\nf']
# hey, the newlines are still there! What gives?
This occurs because calling replace on a string doesn't mutate it in-place, and assigning new values to row doesn't change what values are being stored in data. If you want data to be changed too, you've got to assign the values back.
data = ["a\nb", "c\nd", "e\nf"]
for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")
    data[i] = row

# let's see if that fixed it...
print(data)
# output: ['ab', 'cd', 'ef']
# looking good!
Bonus style tip: if your replacement logic is simple enough to express in one expression, you can get it all on one line and avoid messing around with range and indices etc:
data = [row.replace("\n", "") for row in data]
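For the original HTML case there is also a shortcut worth knowing: str.split() with no arguments splits on any run of whitespace, newlines included, so splitting and rejoining normalizes everything in one pass. A sketch, assuming soup is the BeautifulSoup object from the question:
# stripped_strings yields each text fragment; split()/join collapses
# internal newlines, tabs and repeated spaces into single spaces
rows = [" ".join(s.split()) for s in soup.stripped_strings]
# "Thomas\nH. Lenagh " becomes "Thomas H. Lenagh"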

python - matching string and replacing

I have a file in which I am trying to replace parts of a line with another word.
It looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I need to match 23rh32o3hro2rh2 against 23rh32o3hro2rh2:poniacvibe from a different text file and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time trying to work out how to do this, but I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"). However, some of the lines have a ":" in a spot where I don't want the line to be split, if that makes any sense...
If anyone could help I would really appreciate it.
OK, it looks to me like you are using a colon : to separate your strings.
In this case you can use .split(":") to break your strings into their component substrings.
For example:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, and each main string contains the same number of substrings, you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")
if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1, seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
What this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string, using the colon as the separator
The reason your strings need to be the same length is that the index is being used to find the relevant strings, rather than some form of string matching (which gets much more complex). If you can keep your data in this form, it gets much simpler.
To protect against malformed data (lists that are too short) you can explicitly test for them before you start, using len(list) to see how many elements are in it. Or you could let it run and catch the exception; however, in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
Hope this helps,
James
EDIT:
OK, so if you are trying to match up a long list of strings from files, you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode="r")
secondfile = open("secondfile.txt", mode="r")
first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()

first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n", "").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n", "").split(":"))

output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1, entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break

for entry in output_strings:
    print(entry)
This should achieve what you're after, and as a proof of concept it will print the resulting list of strings for you.
If you have any questions, feel free to ask.
James
Second edit:
To make this output the results into a file, change the last two lines to:
outputfile = open("outputfile.txt", mode="w")
for entry in output_strings:
    outputfile.write(entry + "\n")
outputfile.close()
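If the files get long, the nested loop above rescans second_data once for every line of the first file. A sketch of the same matching done with a dictionary (same hypothetical file names), so each token lookup is constant time; note that ":".join(fields) also replaces the manual stitching loop:
# build a lookup table from the second file: token -> word
lookup = {}
with open("secondfile.txt") as secondfile:
    for line in secondfile:
        token, word = line.rstrip("\n").split(":", 1)
        lookup[token] = word

output_strings = []
with open("firstfile.txt") as firstfile:
    for line in firstfile:
        fields = line.rstrip("\n").split(":")
        if fields[3] in lookup:  # fields[3] is the token, as in the code above
            fields.insert(1, lookup[fields[3]])
            output_strings.append(":".join(fields))

with open("outputfile.txt", mode="w") as outputfile:
    for entry in output_strings:
        outputfile.write(entry + "\n")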
