New to Python; I've read a bunch and watched a lot of videos. I can't get this to work and I'm getting frustrated.
I have a list of links like the one below:
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
I'm trying to get Python to go to each "URL" and save the file in a folder named after "Location", with "API".las as the filename.
ex) ......"location"/Section/"API".las
C://.../T32S R29W/Sec.27/15-119-00164.las
The file has hundreds of rows and links to download. I also wanted to implement a sleep function so I don't bombard the servers.
What are some of the different ways to do this? I've tried pandas and a few other methods... any ideas?
You will have to do something like this:
import urllib  # Python 2; on Python 3 use urllib.request instead

for link, file_name in zip(links, file_names):
    u = urllib.urlopen(link)
    udata = u.read()
    f = open(file_name + ".las", "wb")  # write in binary mode, since these are zip downloads
    f.write(udata)
    f.close()
    u.close()
If the contents of your file are not what you wanted, you might want to look at a scraping library like BeautifulSoup for parsing.
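If it helps, here is a rough standard-library sketch of the full workflow in the question (read the CSV, build the Location/Section folder, save as API.las, and sleep between requests). It's written for Python 3, and the input file name data.csv and the output root folder are assumptions:
import csv
import os
import time
import urllib.request

CSV_FILE = "data.csv"   # assumed name of the CSV shown in the question
ROOT = "downloads"      # assumed output root folder

with open(CSV_FILE, newline="") as f:
    for row in csv.DictReader(f):
        # "T32S R29W, Sec. 27, SW SW NE" -> folder "T32S R29W/Sec. 27"
        township, section = [p.strip() for p in row["Location"].split(",")[:2]]
        folder = os.path.join(ROOT, township, section)
        os.makedirs(folder, exist_ok=True)
        target = os.path.join(folder, row["API"] + ".las")
        urllib.request.urlretrieve(row["URL"], target)   # download the linked file
        time.sleep(1)   # pause between downloads so the server isn't bombarded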
This might be a little dirty, but it's a first pass at solving the problem. It's all contingent on each value in the CSV being enclosed in double quotes; if that's not true, this solution will need heavy tweaking.
Code:
import os
csv = """
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
""".strip() # trim excess space at top and bottom
root_dir = '/tmp/so_test'
lines = csv.split('\n') # break CSV on newlines
header = lines[0].strip('"').split('","') # grab first line and consider it the header
lines_d = [] # we're about to perform the core actions, and we're going to store it in this variable
for l in lines[1:]: # we want all lines except the top line, which is a header
    line_broken = l.strip('"').split('","') # strip off leading and trailing double-quote
    line_assoc = zip(header, line_broken) # creates a tuple of tuples out of the line with the header at matching position as key
    line_dict = dict(line_assoc) # turn this into a dict
    lines_d.append(line_dict)
    section_parts = [s.strip() for s in line_dict['Location'].split(',')] # break the Location value to get the pieces we need
    file_out = os.path.join(root_dir, '%s%s%s%s%s.las' % (section_parts[0], os.path.sep, section_parts[1], os.path.sep, line_dict['API'])) # format the output filename as <Location>/<Section>/<API>.las
    # stuff to show what's actually put in the files
    print file_out, ':'
    print ' ', '"%s"'%('","'.join(header),)
    print ' ', '"%s"'%('","'.join(line_dict[h] for h in header))
output:
~/so_test $ python so_test.py
/tmp/so_test/T32S R29W/Sec. 27/15-119-00164.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
/tmp/so_test/T34S R26W/Sec. 2/15-119-00181.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
~/so_test $
Approach 1:
Suppose your file has 1000 rows.
Create a master list which stores the data in this form:
[row1, row2, row3, and so on]
Once that's done, loop through the master list. In every iteration you get one row as a string.
Split it into a list, slice out the last column (the URL, i.e. row[-1]),
and append it to an empty list named result_url. Once that has run for all rows, save the result to a file; you can easily create a directory with the os module and move your file there. A rough sketch follows.
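For example (Python 3; the input name data.csv and the output location are assumptions, and the naive split assumes every value is double-quoted as in your sample):
import os

# Approach 1 sketch: read everything into a master list, then pull out the URLs
with open('data.csv') as f:
    masterlist = f.readlines()          # [row1, row2, row3, ...]

result_url = []
for row in masterlist[1:]:              # skip the header row
    columns = row.strip().split('","')  # crude split; the csv module is more robust
    result_url.append(columns[-1].strip('"'))

os.makedirs('Locations', exist_ok=True)
with open(os.path.join('Locations', 'urls.txt'), 'w') as out:
    out.write('\n'.join(result_url))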
Approach 2:
If the file is too big, read it line by line inside a try block and process the data as you go (using the csv module you can get each row as a list, slice out the URL, and write it to the file API.las each time).
Once your program moves past the last line, the reader raises StopIteration and you land in the except block, where you can just pass or print something to get notified.
With approach 2 you are not holding all the data in a data structure; you only keep a single row while processing it, so it is faster.
import csv, os

os.mkdir('Locations')
fme = open('./Locations/API.las', 'w+')
with open('data.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    print spamreader.next()  # show (and skip) the header row
    while True:
        try:
            row = spamreader.next()
            get_url = row[-1]        # the URL is the last column
            fme.write(get_url + '\n')
        except StopIteration:        # no more rows left
            print "Program has run. Check output."
            break
fme.close()
This code should do everything you mentioned efficiently and quickly.
I built a simple graphical user interface (GUI) with basketball info to make finding information about players easier. The GUI uses data that has been scraped from various sources with the 'requests' library. It works well, but there is a problem: my code contains a list of players which must be compared against this scraped data for everything to work properly. That means that if I want to add or remove any names from this list, I have to go into my IDE or directly into my code, and I need to change this. Having an external text file where all these player names can be stored would provide much-needed flexibility when managing them.
#This is how the players list looks in the code.
basketball = ['Adebayo, Bam', 'Allen, Jarrett', 'Antetokounmpo, Giannis' ... #and many others]
#This is what the info in the scraped file looks like:
Charlotte Hornets,"Ball, LaMelo",Out,"Injury/Illness - Bilateral Ankle, Wrist; Soreness (L Ankle, R Wrist)"
"Hayward, Gordon",Available,Injury/Illness - Left Hamstring; Soreness
"Martin, Cody",Out,Injury/Illness - Left Knee; Soreness
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,
"Okogie, Josh",Questionable,Injury/Illness - Nasal; Fracture,
#The rest of the code works well; this is the final part, where it uses the list to write the players that were found in both files.
with open("freeze.csv",'r') as freeze:
for word in basketball:
if word in freeze:
freeze.write(word)
# Up to this point I get the correct output, but now I need the list 'basketball' in a text file so I can iterate over it the same way
# I tried different solutions but none of them worked for me
with open('final_G_league.csv') as text, open('freeze1.csv') as filter_words:
st = set(map(str.rstrip,filter_words))
txt = next(text).split()
out = [word for word in txt if word not in st]
# This one gives me the first line of the scraped text
import csv
file1 = open("final_G_league.csv",'r')
file2 = open("freeze1.csv",'r')
data_read1= csv.reader(file1)
data_read2 = csv.reader(file2)
# convert the data to a list
data1 = [data for data in data_read1]
data2 = [data for data in data_read2]
for i in range(len(data1)):
if data1[i] != data2[i]:
print("Line " + str(i) + " is a mismatch.")
print(f"{data1[i]} doesn't match {data2[i]}")
file1.close()
file2.close()
#This one returns a list with a bunch of names and a list index error.
file1 = open('final_G_league.csv','r')
file2 = open('freeze_list.txt','r')
list1 = file1.readlines()
list2 = file2.readlines()
for i in list1:
for j in list2:
if j in i:
# I also tried the answers in this post:
#https://stackoverflow.com/questions/31343457/filter-words-from-one-text-file-in-another-text-file
Let's assume we have following input files:
freeze_list.txt - comma separated list of filter words (players) enclosed in quotes:
'Adebayo, Bam', 'Allen, Jarrett', 'Antetokounmpo, Giannis', 'Anthony, Cole', 'Anunoby, O.G.', 'Ayton, Deandre',
'Banchero, Paolo', 'Bane, Desmond', 'Barnes, Scottie', 'Barrett, RJ', 'Beal, Bradley', 'Booker, Devin', 'Bridges, Mikal',
'Brown, Jaylen', 'Brunson, Jalen', 'Butler, Jimmy', 'Forbes, Bryn'
final_G_league.csv - scraped lines that we want to filter, using words from the freeze_list.txt file:
Charlotte Hornets,"Ball, LaMelo",Out,"Injury/Illness - Bilateral Ankle, Wrist; Soreness (L Ankle, R Wrist)"
"Hayward, Gordon",Available,Injury/Illness - Left Hamstring; Soreness
"Martin, Cody",Out,Injury/Illness - Left Knee; Soreness
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,
"Okogie, Josh",Questionable,Injury/Illness - Nasal; Fracture,
I would split the responsibilities of the script in code segments to make it more readable and manageable:
Define constants (later you could make them parameters)
Read filter words from a file
Filter scrapped lines
Dump output to a file
The constants:
FILTER_WORDS_FILE_NAME = "freeze_list.txt"
SCRAPPED_FILE_NAME = "final_G_league.csv"
FILTERED_FILE_NAME = "freeze.csv"
Read filter words from a file:
with open(FILTER_WORDS_FILE_NAME) as filter_words_file:
    filter_words = eval('(' + filter_words_file.read() + ')')
Filter the lines from the scraped file:
matched_lines = []
with open(SCRAPPED_FILE_NAME) as scrapped_file:
    for line in scrapped_file:
        # Check if any of the keywords is found in the line
        for filter_word in filter_words:
            if filter_word in line:
                matched_lines.append(line)
                # stop checking other words for performance and
                # to avoid sending the same line multiple times to the output
                break
Dump filtered lines into a file:
with open(FILTERED_FILE_NAME, "w") as filtered_file:
    for line in matched_lines:
        filtered_file.write(line)
The output freeze.csv after running the above segments in sequence is:
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,
Suggestion
Not sure why you have chosen to store the filter words in a comma separated list. I would prefer using a plain list of words - one word per line.
freeze_list.txt:
Adebayo, Bam
Allen, Jarrett
Antetokounmpo, Giannis
Butler, Jimmy
Forbes, Bryn
The reading becomes straightforward:
with open(FILTER_WORDS_FILE_NAME) as filter_words_file:
    filter_words = [word.strip() for word in filter_words_file]
The output freeze.csv is the same:
"Forbes, Bryn",Questionable,Injury/Illness - N/A; Illness,
If file2 is just a list of names and you want to extract the rows in the first file whose name column matches a name in that list:
I suggest you make the "freeze" file a text file with one name per line and remove the single quotes from the names; then it can be parsed more easily.
You can then do something like this to match the names from one file against the other.
import csv
# convert the names data to a list
with open("freeze1.txt",'r') as file2:
names = [s.strip() for s in file2]
print("names:", names)
# next open league data and extract rows with matching names
with open("final_G_league.csv",'r') as file1:
reader = csv.reader(file1)
next(reader) # skip header
for row in reader:
if row[0] in names:
# print matching name that matches
print(row[0])
If the names don't match exactly as they appear in the final_G_league file, then you may need to adjust accordingly, such as doing a case-insensitive match or normalizing the names (last, first vs. first last), etc.
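For instance, a case-insensitive variant of the matching loop above could look like this (a sketch reusing the same assumed file names, not tested against your data):
import csv

with open("freeze1.txt") as file2:
    names = {s.strip().lower() for s in file2}   # normalize the filter names to lower case

with open("final_G_league.csv", newline="") as file1:
    reader = csv.reader(file1)
    next(reader)  # skip header, as above
    for row in reader:
        if row[0].strip().lower() in names:      # compare both sides in lower case
            print(row[0])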
I need to create a database using Python and a .txt file.
Creating new items is no problem; the inside of the Database.txt looks like this:
Index Objektname Objektplace Username
i.e:
1 Pen Office Daniel
2 Saw Shed Nic
6 Shovel Shed Evelyn
4 Knife Room6 Evelyn
I get the index from a QR scanner (OpenCV) and the other information comes from Tkinter Entry widgets. If an object is already saved in the database, you should be able to rewrite its Objektplace and Username.
My problems now are the following:
If I scan the code with index 6, how do I navigate to that entry, even if it's not in line 6, without clashing with the "Room6" value?
How do I, for example, replace only the "Shed" of index 4 when that object is moved to, e.g., Room6?
Same goes for the Usernames.
Up until now I've tried different methods, but nothing has worked so far.
My last attempt looked something like this:
def DBChange():
#Removes unwanted bits from the scanned code
data2 = data.replace("'", "")
Index = data2.replace("b","")
#Gets the Data from the Entry-Widgets
User = Nutzer.get()
Einlagerungsort = Ort.get()
#Adds a whitespace at the end of the Entrys to separate them
Userlen = len(User)
User2 = User.ljust(Userlen)
Einlagerungsortlen = len(Einlagerungsort)+1
Einlagerungsort2 = Einlagerungsort.ljust(Einlagerungsortlen)
#Navigate to the exact line of the scanned Index and replace the words
#for the place and the user ONLY in this line
file = open("Datenbank.txt","r+")
lines=file.readlines()
for word in lines[Index].split():
List.append(word)
checkWords = (List[2],List[3])
repWords = (Einlagerungsort2, User2)
for line in file:
for check, rep in zip(checkWords, repWords):
line = line.replace(check, rep)
file.write(line)
file.close()
Return()
Thanks in advance
I'd suggest using Pandas to read and write your text file. That way you can just use the index to select the appropriate line. And if there is no specific reason to use your text format, I would switch to CSV for ease of use.
import pandas as pd

def DBChange():
    #Removes unwanted bits from the scanned code
    # I haven't changed this part, since I guess you need this for some input data
    data2 = data.replace("'", "")
    Indexnr = data2.replace("b", "")
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # I removed the padding lines here. This isn't necessary when using csv and Pandas
    # read in the csv file, using the Index column as the DataFrame index
    df = pd.read_csv("Datenbank.csv", index_col='Index')
    # Select the line with the scanned index and replace the values
    df.loc[int(Indexnr), 'Username'] = User
    df.loc[int(Indexnr), 'Objektplace'] = Einlagerungsort
    # Write back to csv
    df.to_csv("Datenbank.csv")
    Return()
Since I can't reproduce your specific problem, I haven't tested it. But something like this should work.
Edit
To read and write the text file, use ' ' as the separator. (I assume none of the values contain spaces, and your text file currently uses one space between values.)
reading:
df = pd.read_csv('Datenbank.txt', sep=' ')
Writing:
df.to_csv('Datenbank.txt', sep=' ')
First of all, this is a terrible way to store data. My suggestion is not particularly well-written code either; don't do this in production!
newlines = []
for line in lines:
    entry = line.split()
    if entry[0] == Index:
        #line now is the correct line
        #Index 2 is the place, index 0 the ID, etc
        entry[2] = Einlagerungsort2
    newlines.append(" ".join(entry))
# Now write newlines back to the file
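For completeness, a sketch of the surrounding read and write-back steps that the comment above refers to, reusing the Datenbank.txt file from the question:
# read the current contents
with open("Datenbank.txt") as f:
    lines = f.readlines()

# ... build newlines from lines as in the loop above ...

# write the updated lines back, overwriting the old file
with open("Datenbank.txt", "w") as f:
    f.write("\n".join(newlines) + "\n")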
So I'm new to Python, aside from some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living on Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with the columns
"Name", "Adress", "Company", ...
Now I've tried to cut and slice everything. For debugging I use "print" (IDE = Kate/KDE + terminal... :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string
fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"
if __name__ == "__main__":
with open(export_file_name,"w") as csvfile:
writer = csv.DictWriter(csvfile, dialect='excel',fieldnames=fieldnames)
writer.writeheader()
for message in mailbox.mbox(mbox_file):
if message.is_multipart():
content = '\n'.join(part.get_payload() for part in message.get_payload())
content = content.split('<')[0] # only want text/plain.. Ill split #right before HTML starts
#print content
else:
content = message.get_payload()
idea = message['message-id']
sub = message['subject']
fr = message['from']
date = message['date']
writer.writerow ('ID':idea,......) # CSV writing will work fine
for line in content.splitlines():
line = line.strip()
for pose in searchKeys:
if pose in line:
tmp = line.split(pose)
pmt = tmp[1].split(":")[1]
if next in line !=:
print pose +"\t"+pmt
sleep(1)
csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the continuation lines are missing.
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
#Yaniv
Thank you, I am still trying to understand every step, but I just wanted to leave a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The number of keywords in the emails is ~20. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" signs are still there.
Should I replace ["="," = "] with ""?
I would go for a "routine" parsing loop over the input lines, maintaining current_key and current_value variables, since a value for a certain key in your data might be "annoying" and spread across multiple lines.
I've demonstrated such a parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be one of those "annoying" values spread across multiple lines. Such lines are concatenated into a single value, using a configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespace from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in sharedKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process the string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done.
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] #take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress Home"
content = content.split("Adress_HOME:")[1] #take second element of the list
address = content.split("Company_NAME:")[0] # take content before
company = content.split("Company_NAME:")[1] #take the remainder after "Company_NAME:", which is the company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make the . in a regex match across multiple lines, you would use the re.DOTALL option. So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL).
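A rough sketch of that regex approach, assuming the three keywords always appear in this order (it reuses the sample text from above):
import re

text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""

# DOTALL lets '.' match newlines, so each captured value may span multiple lines
pattern = re.compile(r"Name_OF_Person:(.*)Adress_HOME:(.*)Company_NAME:(.*)", re.DOTALL)
match = pattern.search(text)
if match:
    name, address, company = (" ".join(part.split()) for part in match.groups())
    print(name, "|", address, "|", company)
    # Stefan | London, Maple Street 45 | MultiVendor XXVideos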
Very new to Python, and I'm now learning by trying to write a program to process data from the first few lines of multiple text files. So far so good - getting the data in and reformatting it for output.
Now I'd like to change the format of one output field based on which row it sits in in the CSV file. The file has 15 rows with a variable number of columns.
The idea is that:
I preload the CSV file - I'd like to hardcode it into a list or dictionary; I'm not sure which works better for the next step.
Go through the 15 rows in the list/dictionary and, if a match is found, set the output to column 1 of the same row.
Example Data:
BIT, BITSIZE, BITM, BS11, BIT, BS4, BIT1, BIT_STM
CAL, ID27, CALP, HCALI, IECY, CLLO, RD2, RAD3QI, ID4
DEN, RHO8[1], RHOZ1, RHOZ2, RHOB_HR, RHOB_ME, LDENX
DENC, CRHO, DRHO1, ZCOR2, HDRH2, ZCORQK
DEPT, DEPTH, DEPT,MD
DPL, PDL, PORZLS1, PORDLSH_Y, DPRL, HDPH_LIM, PZLS
DPS, HDPH_SAN1, DPHI_SAN2, DPUS, DPOR, PZSS1
DTC, DTCO_MFM[1], DT4PT2, DTCO_MUM[1], DTC
DTS, DT1R[1], DTSH, DT22, DTSM[1], DT24S
GR, GCGR, GR_R3, HGR3, GR5, GR6, GR_R1, MGSGR
NPL, NEU, NPOR_LIM, HTNP_LIM, NPOR, HNPO_LIM1
NPS, NPRS, CNC, NPHILS, NPOR_SS, NPRS1, CNCS, PORS
PE, PEFZ_2, HPEF, PEQK, PEF81, PEF83, PEDN, PEF8MBT
RD, AST90, ASF60, RD, RLLD, RTCH, LLDC, M2R9, LLHD
RS, IESN, FOC, ASO10, MSFR, AO20, RS, SFE, LL8, MLL
For example:
BIT, BITSIZE, BITM, BS11, BIT, BS4, BIT1, BIT_STM
returns BIT
Questions:
Is a list or a dictionary better for search speed?
If I use the csv module to load the data, does it matter that the number of columns isn't the same for every row?
Is there a way to search either the list or the dictionary without using a loop?
My attempt to load into a list and search:
import csv
with open('lookup.csv', 'rb') as f:
    reader = csv.reader(f)
    codelist = list(reader)
Would this work for searching for a matching code, searchcode?
for subcodes in codelist:
    if searchcode in subcodes:
        print "Found it!", subcodes[0]
        break
I think you should try a two-dimensional structure: a dictionary mapping each row number to the set of codes in that row.
new_dic = {}
new_dic[0] = {"BIT", "BITSIZE", "BITM", "BS11", "BS4", "BIT1", "BIT_STM"}
new_dic[1] = {"CAL", "ID27", "CALP", "HCALI", "IECY", "CLLO", "RD2", "RAD3QI", "ID4"}
Then you can search for the element and print it.
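A small sketch of that lookup (searchcode here is just an example value):
searchcode = "CALP"   # example code to look up
for row_number, codes in new_dic.items():
    if searchcode in codes:
        print("Found it in row %d" % row_number)
        break
If you need to report the first column of the matching row (BIT, CAL, ...), it may be more convenient to use that first code as the dictionary key instead of a row number.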
You can use "index" to search for an item in a list. If that item is in the list, it returns the location of its first occurrence.
my_list = ['a','b','c','d','e','c'] # defines the list
copy_at = my_list.index('b') # checks if 'b' is in the list
copy_at # prints the location in the list where 'b' was at
1
copy_at = my_list.index('c')
copy_at
2
copy_at = my_list.index('f')
Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
my_list.index('f')
ValueError: 'f' is not in list
You can catch the error with a "try"/"except" block and keep searching.
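Applied to the codelist loaded above, that could look roughly like this (searchcode is an example value; the extra strip() is there because the sample data has spaces after the commas):
searchcode = "RHOZ1"   # example code to look up
for subcodes in codelist:
    row = [code.strip() for code in subcodes]   # remove the padding spaces around each value
    try:
        row.index(searchcode)            # raises ValueError if the code is not in this row
        print("Found it! %s" % row[0])   # first column of the matching row
        break
    except ValueError:
        continue                         # not in this row, keep searching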
If we have the following input, we would like to keep the rows whose "APPID" column (the 4th column) is the same and whose "Category" column (the 18th column) is one "Cell" and one "Biochemical", or one "Cell" and one "Enzyme".
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-2 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
The ideal output will be
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
"APP-1" is kept because its APP_ID values are the same and its Category values are one "Cell" and one "Enzyme". The same goes for "APP-3", which has one "Cell" and one "Biochemical" in its "Category" column.
The following attempt could do the trick:
import os
App=["1"]
for a in App:
outname="App_"+a+"_target_overlap.csv"
out=open(outname,'w')
ticker=0
cell_comp_id=[]
final_comp_id=[]
# make compound with cell activity (to a target) list first
filename="App_"+a+"_target_Detail_average.csv"
if os.path.exists(filename):
file = open (filename)
line=file.readlines()
if(ticker==0): # Deal with the title
out.write(line[0])
ticker=ticker+1
for c in line[1:]:
c=c.split(',')
if(c[17]==" Cell "):
cell_comp_id.append(c[3])
else:
cell_comp_id=list(set(cell_comp_id))
# while we have list of compounds with cell activity, now we search the Bio and Enz and make one final compound list
if os.path.exists(filename):
for c in line[1:]:
temporary_line=c #for output_temp
c=c.split(',')
for comp in cell_comp_id:
if (c[3]==comp and c[17]==" Biochemical "):
final_comp_id.append(comp)
out.write(str(temporary_line))
elif (c[3]==comp and c[17]==" Enzyme "):
final_comp_id.append(comp)
out.write(str(temporary_line))
else:
final_comp_id=list(set(final_comp_id))
# After we obatin a final compound list in target a , we go through all the csv again for output the cell data
filename="App_"+a+"_target_Detail_average.csv"
if os.path.exists(filename):
for c in line[1:]:
temporary_line=c #for output_temp
c=c.split(',')
for final in final_comp_id:
if (c[3]==final and c[17]==" Cell "):
out.write(str(temporary_line))
out.close()
When the input file is small (tens of thousands of lines), this script can finish its job in a reasonable time. However, when the input files grow to millions or billions of lines, this script will take forever to finish (days...). I think the issue is that we first create a list of APPIDs with "Cell" in the 18th column. Then we go back and compare this "Cell" list (maybe half a million entries) against the whole file (1 million lines, for example): if an APPID in the Cell list matches a row in the whole file and the 18th column of that row is "Enzyme" or "Biochemical", we keep the information. This step seems to be very time consuming.
I am thinking that maybe preparing "Cell", "Enzyme" and "Biochemical" dictionaries and comparing them would be faster? Does any guru have a better way to process this? Any example/comment would be helpful. Thanks.
We use Python 2.7.6.
reading the file(s) efficiently
One big problem is that you're reading the file all in one go using readlines. This will require loading it ALL into memory at one go. I doubt if you have that much memory available.
Try:
with open(filename) as fh:
    out.write(fh.readline()) # ticker
    for line in fh: #iterate through lines 'lazily', reading as you go.
        c = line.split(',')
That style of reading should help a lot. Here it is in context:
# make compound with cell activity (to a target) list first
if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline()) # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])
The with open(...) as syntax is a very common Python idiom which automatically handles closing the file when you finish the with block, or if there is an error. Very useful.
sets
Next thing is, as you suggest, using sets a little better.
You don't need to recreate the set each time, you can just update it to add items. Here's some example set code (written in the Python interpreter style; >>> at the beginning means it's a line of stuff to type - don't actually type the >>> bit!):
>>> my_set = set()
>>> my_set
set()
>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])
>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])
>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])
>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])
So you can add items to and remove them from a set without creating a new set from scratch (which you are doing each time with the cell_comp_id=list(set(cell_comp_id)) bit).
You can also get differences, intersections, etc:
>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])
>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])
See the docs for more info.
So let's try something like:
cells = set()
enzymes = set()
biochemicals = set()
with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)
Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:
cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals
You can then go through all the files again and simply check if row_id (or c[3]) is in either of those sets, and if so, print the line.
You can actually combine those two sets even further:
cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals
which would be the cells which have enzymes or biochemicals.
So then when you run through the files the second time, you can do:
if row_id in cells_with_enz_or_bio:
    out.write(line)
after all that?
Just using those suggestions might be enough to get you by. You still are storing in memory the entire sets of cells, biochemicals and enzymes, though. And you're still running through the files twice.
So there are two ways we could potentially speed it up while still staying with a single Python process. I don't know how much memory you have available; if you run out of memory, then it might possibly slow things down slightly.
reducing sets as we go.
If you do have a million records, and 800000 of them are pairs (have a cell record and a biochemical record) then by the time you get to the end of the list, you're storing 800000 IDs in sets. To reduce memory usage, once we've established that we do want to output a record, we could save that information (that we want to print the record) to a file on disk, and stop storing it in memory. Then we could read that list back later to figure out which records to print.
Since this does increase disk IO, it could be slower. But if you are running out of memory, it could reduce swapping, and thus end up faster. It's hard to tell.
with open('to_output.tmp','a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...
        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            cells.discard(row_id)          # discard() won't raise if the id isn't in the set
            biochemicals.discard(row_id)
            enzymes.discard(row_id)
Once you've read through all the files, you have a file (to_output.tmp) which contains all the ids that you want to keep. So you can read that back into Python:
with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))
which means you can then on your second run through the files simply say:
if row_id in ids_to_keep:
    out.write(line)
using dict instead of sets:
If you have plenty of memory, you could bypass all of that and use dicts for storing the data, which would let you run through the files only once, rather than using sets at all.
cells = {}
enzymes = {}
biochemicals = {}
with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line
        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])
The problem with this method is that if any rows are duplicated, it will get confused.
If you are sure that the input records are unique, and that they either have enzyme or biochemical records, but not both, then you could easily add del cells[row_id] and the appropriate others to remove rows from the dicts once you've printed them, which would reduce memory usage.
I hope this helps :-)
A technique I have used to deal with massive files quickly in Python is to use the multiprocessing library to split the file into large chunks, and process those chunks in parallel in worker subprocesses.
Here's the general algorithm:
Based on the amount of memory you have available on the system that will run this script, decide how much of the file you can afford to read into memory at once. The goal is to make the chunks as large as you can without causing thrashing.
Pass the file name and chunk beginning/end positions to subprocesses, which will each open the file, read in and process their sections of the file, and return their results.
Specifically, I like to use a multiprocessing pool, then create a list of chunk start/stop positions, then use the pool.map() function. This will block until everyone has completed, and the results from each subprocess will be available if you catch the return value from the map call.
For example, you could do something like this in your subprocesses:
import operator

# assume we have passed in a byte position to start and end at, and a file name:
with open("fi_name", 'r') as fi:
    fi.seek(chunk_start)
    chunk = fi.readlines(chunk_end - chunk_start)

retriever = operator.itemgetter(3, 17) # extracts only the elements we want
APPIDs = {}
for line in chunk:
    ID, category = retriever([col.strip() for col in line.split(',')]) # split on commas and strip the padding spaces
    try:
        APPIDs[ID].append(category) # we've seen this ID before, add category to its list
    except KeyError:
        APPIDs[ID] = [category] # we haven't seen this ID before - make an entry
# APPIDs entries will look like this:
#
# <APPID> : [list of categories]
return APPIDs
In your main process, you would retrieve all the returned dictionaries and resolve duplicates or overlaps, then output something like this:
for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
        # print or whatever
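Tying it together, a rough sketch of the main-process side might look like this; the chunk_boundaries helper and the worker stub are illustrative assumptions, with "fi_name" as the placeholder file name used above:
import multiprocessing
import os

def chunk_boundaries(file_name, n_chunks):
    # compute (file_name, start, end) byte ranges, snapping each split to a newline
    size = os.path.getsize(file_name)
    cuts = [0]
    with open(file_name, 'rb') as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()           # advance to the end of the current line
            cuts.append(f.tell())
    cuts.append(size)
    return [(file_name, start, end) for start, end in zip(cuts, cuts[1:])]

def worker(args):
    file_name, chunk_start, chunk_end = args
    # ... the per-chunk parsing shown above goes here, returning its APPIDs dict ...
    return {}

if __name__ == '__main__':
    n_chunks = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(n_chunks)
    results = pool.map(worker, chunk_boundaries("fi_name", n_chunks))
    # merge the per-chunk dictionaries before the Cell/Biochemical/Enzyme check above
    APPIDs = {}
    for partial in results:
        for ID, categories in partial.items():
            APPIDs.setdefault(ID, []).extend(categories)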
A couple of notes/caveats:
Pay attention to the load on your hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvements from this. You can try implementing the same algorithm with threading instead.
If you do get a heavy hard disk load that's NOT due to memory thrashing, you can also reduce the number of simultaneous subprocesses you're allowing in the pool. This will result in fewer read requests to the drive, while still taking advantage of truly parallel processing.
Look for patterns in your input data that you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can actually do all of your comparisons in the subprocesses and let your main process hang out until it's time to combine the subprocess data structures.
TL;DR
Break your file up into chunks and process them in parallel with the multiprocessing library.